Parsing Untrusted Input with Elixir

I’ve spent a lot of time in the past year learning PureScript and it has drastically changed the way I think about programming in general. The biggest change in my thinking is described by the excellent blog post Parse, don’t validate.

The most important passage in the post, I think, is this:

Consider: what is a parser? Really, a parser is just a function that consumes less-structured input and produces more-structured output.

I think a lot about trying to produce more-structured data at the edges of the systems I’m building. In a strongly-typed language like PureScript, this often means building a parser to take some JSON input and turn it into a custom data type via a parser combinator library. I’m now building systems with Elixir at work, and here the concept of “parsing” is fuzzier. Still, we end up with some untrusted input from the outside world that we need to process. In this post we’ll assume this untrusted input is JSON that Plug.Parsers has already decoded into an Elixir map.

Even though we say that JSON has been “parsed” into a map, the structure of this map is still completely untrusted. We have no idea if the keys we require in the map’s structure are present nor if the values are semantically correct. We’ll cover a simple technique to parse this untrusted map into a trusted struct.
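
To make the quoted definition concrete before bringing in any libraries, here is a minimal hand-rolled sketch in plain Elixir. The Sketch.Name module, its parse/1 function, and the error atoms are hypothetical names invented for illustration; they are not part of the technique described in the rest of the post.

```elixir
defmodule Sketch.Name do
  # Hypothetical struct standing in for the "more-structured output".
  @enforce_keys [:value]
  defstruct [:value]

  # Accepts a loosely structured map and returns either a trusted struct
  # or a tagged error. Keys we do not know about are simply ignored.
  def parse(%{"name" => name}) when is_binary(name) and name != "" do
    {:ok, %__MODULE__{value: name}}
  end

  # The key is present, but the value is not a non-empty string.
  def parse(%{"name" => _other}), do: {:error, :invalid_name}

  # The key is missing entirely (or the input is not a map).
  def parse(_input), do: {:error, :missing_name}
end

ok = Sketch.Name.parse(%{"name" => "Drew", "unknown" => 1})
wrong_type = Sketch.Name.parse(%{"name" => 1})
missing = Sketch.Name.parse(%{"foo" => "bar"})

IO.inspect({ok, wrong_type, missing})
```

Writing these clauses by hand gets tedious as the number of fields grows, which is the motivation for leaning on Ecto below.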

Tools of the Trade

We’ll use two libraries to help us build our parser:

  1. Ecto
  2. TypedEctoSchema

The TypedEctoSchema library isn’t strictly necessary but I’ve found the benefits of generating a typespec for my structs very compelling.

I’m assuming the reader is familiar with Ecto, embedded_schema and Changesets. This post does not aim to teach these building-blocks but rather presents a simple technique for composing them to create parsers.

Let’s build a simple parser that expects to parse a map with a single key called name and a string value.

defmodule Parsers.SimplePerson do
  use TypedEctoSchema

  import Ecto.Changeset

  alias Parsers.SimplePerson

  @primary_key false

  typed_embedded_schema do
    field :name, :string, enforce: true
  end

  def changeset(person, attrs \\ %{}) do
    person
    |> cast(attrs, [:name])
    |> validate_required([:name])
  end

  def build(attrs) do
    struct(SimplePerson)
    |> changeset(attrs)
    |> apply_action(:build)
  end
end

Let’s look at a few examples of using this parser:

iex(1)> Parsers.SimplePerson.build(%{"name" => "Drew"})
{:ok, %Parsers.SimplePerson{name: "Drew"}}

iex(2)> Parsers.SimplePerson.build(%{"name" => "Drew", "unknown" => 1})
{:ok, %Parsers.SimplePerson{name: "Drew"}}

iex(3)> Parsers.SimplePerson.build(%{"foo" => "bar"})
{:error,
 #Ecto.Changeset<
   action: :build,
   changes: %{},
   errors: [name: {"can't be blank", [validation: :required]}],
   data: #Parsers.SimplePerson<>,
   valid?: false
 >}

iex(4)> Parsers.SimplePerson.build(%{"name" => 1})
{:error,
 #Ecto.Changeset<
   action: :build,
   changes: %{},
   errors: [name: {"is invalid", [type: :string, validation: :cast]}],
   data: #Parsers.SimplePerson<>,
   valid?: false
 >}

Even this very simple example demonstrates the power of this technique. We are able to ensure our key is present and that its value is of the appropriate type. We also discard any keys that are not specified, and we ultimately end up with either a struct or an error. We could imagine using this parser in a with expression where our system processes external requests.

with(
  {:ok, person} <- Parsers.SimplePerson.build(input),
  {:ok, result} <- process_person(person)
) do
  {:ok, build_response(result)}
else
  {:error, error} -> handle_error(error)
end
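
The handle_error function above is left undefined. One possible shape for it, sketched here under assumptions, is to flatten the invalid changeset into a map of messages with Ecto.Changeset.traverse_errors/2. The Sketch.Errors module and to_messages/1 are hypothetical names, and a schemaless changeset is used only so the script can run on its own.

```elixir
# Standalone script setup; in a Mix project, list Ecto in deps instead.
Mix.install([{:ecto, "~> 3.10"}])

defmodule Sketch.Errors do
  # Hypothetical helper: turns an invalid changeset into %{field => [messages]}.
  # The opts are ignored here, so placeholders such as %{count} in some
  # validation messages would be left uninterpolated.
  def to_messages(%Ecto.Changeset{} = changeset) do
    Ecto.Changeset.traverse_errors(changeset, fn {message, _opts} -> message end)
  end
end

# A schemaless changeset that mirrors the "name must be a string" rule.
types = %{name: :string}

{:error, changeset} =
  {%{}, types}
  |> Ecto.Changeset.cast(%{"name" => 1}, Map.keys(types))
  |> Ecto.Changeset.validate_required([:name])
  |> Ecto.Changeset.apply_action(:build)

IO.inspect(Sketch.Errors.to_messages(changeset))
```

A map like this is easy to encode as a JSON error response, which is one reason to convert the changeset at the boundary rather than passing it deeper into the system.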

Note that we build an empty SimplePerson struct using struct(SimplePerson) rather than %SimplePerson{}. This lets the parser bypass the protection we added with the enforce: true option on our schema, since the required fields are only filled in once the changeset is applied. If a user of our system attempts to create a SimplePerson directly without providing the name attribute, they will be greeted with a compile-time error.

iex(1)> %Parsers.SimplePerson{}
** (ArgumentError) the following keys must also be given when building
      struct Parsers.SimplePerson: [:name]
    (parsers 0.1.0) expanding struct: Parsers.SimplePerson.__struct__/1
    iex:1: (file)
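
The same behaviour can be reproduced without Ecto. The sketch below assumes that enforce: true ends up as an ordinary @enforce_keys entry, which is consistent with the error message above; Sketch.EnforcedPerson is a hypothetical module used only for illustration.

```elixir
defmodule Sketch.EnforcedPerson do
  # Plain-Elixir stand-in for a schema field declared with enforce: true.
  @enforce_keys [:name]
  defstruct [:name]
end

# struct/2 does not check @enforce_keys, so the required key is simply nil.
empty = struct(Sketch.EnforcedPerson)

# struct!/2 does check them and raises when a required key is missing.
result =
  try do
    struct!(Sketch.EnforcedPerson, [])
  rescue
    error in ArgumentError -> {:raised, Exception.message(error)}
  end

IO.inspect(empty)
IO.inspect(result)
```

This is why the build function can start from an empty struct: the enforcement guards code that constructs the struct directly, while the changeset validations guard the parser.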

Composing Parsers

We can easily compose parsers using embeds_one and embeds_many.

defmodule Parsers.Address do
  use TypedEctoSchema

  import Ecto.Changeset

  alias Parsers.Address

  @primary_key false

  typed_embedded_schema do
    field :city, :string, enforce: true
    field :zip, :string, enforce: true
  end

  def changeset(address, attrs \\ %{}) do
    address
    |> cast(attrs, [:city, :zip])
    |> validate_required([:city, :zip])
  end

  def build(attrs) do
    struct(Address)
    |> changeset(attrs)
    |> apply_action(:build)
  end
end

defmodule Parsers.Person do
  use TypedEctoSchema

  import Ecto.Changeset

  alias Parsers.Address
  alias Parsers.Person

  @primary_key false

  typed_embedded_schema do
    field :name, :string, enforce: true
    embeds_one :address, Address, enforce: true
  end

  def changeset(person, attrs \\ %{}) do
    person
    |> cast(attrs, [:name])
    |> cast_embed(:address)
    |> validate_required([:name, :address])
  end

  def build(attrs) do
    struct(Person)
    |> changeset(attrs)
    |> apply_action(:build)
  end
end

iex(1)> Parsers.Person.build(%{
          "name" => "Drew",
          "address" => %{
            "city" => "Chicago",
            "zip" => "60606"
          }
        })
{:ok,
 %Parsers.Person{
   address: %Parsers.Address{city: "Chicago", zip: "60606"},
   name: "Drew"
 }}

iex(2)> Parsers.Person.build(%{
          "name" => "Drew",
          "address" => %{
            "city" => "Chicago",
            "zip" => 60606
          }
        })
{:error,
 #Ecto.Changeset<
   action: :build,
   changes: %{
     address: #Ecto.Changeset<
       action: :insert,
       changes: %{city: "Chicago"},
       errors: [zip: {"is invalid", [type: :string, validation: :cast]}],
       data: #Parsers.Address<>,
       valid?: false
     >,
     name: "Drew"
   },
   errors: [],
   data: #Parsers.Person<>,
   valid?: false
 >}
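
The examples above only exercise embeds_one. For completeness, here is a sketch of the same idea with embeds_many. It uses plain Ecto.Schema instead of TypedEctoSchema to keep the script's dependencies short, and the Sketch.Contact and Sketch.PhoneNumber modules are hypothetical.

```elixir
# Standalone script setup; in a Mix project, list Ecto in deps instead.
Mix.install([{:ecto, "~> 3.10"}])

defmodule Sketch.PhoneNumber do
  use Ecto.Schema
  import Ecto.Changeset

  @primary_key false
  embedded_schema do
    field :label, :string
    field :number, :string
  end

  def changeset(phone, attrs \\ %{}) do
    phone
    |> cast(attrs, [:label, :number])
    |> validate_required([:label, :number])
  end
end

defmodule Sketch.Contact do
  use Ecto.Schema
  import Ecto.Changeset

  @primary_key false
  embedded_schema do
    field :name, :string
    embeds_many :phones, Sketch.PhoneNumber
  end

  def changeset(contact, attrs \\ %{}) do
    contact
    |> cast(attrs, [:name])
    # By default cast_embed/3 calls Sketch.PhoneNumber.changeset/2 per element.
    |> cast_embed(:phones)
    |> validate_required([:name])
  end

  def build(attrs) do
    struct(__MODULE__)
    |> changeset(attrs)
    |> apply_action(:build)
  end
end

good =
  Sketch.Contact.build(%{
    "name" => "Drew",
    "phones" => [%{"label" => "home", "number" => "555-0100"}]
  })

# The single phone entry is missing its number, so the whole parse fails.
bad = Sketch.Contact.build(%{"name" => "Drew", "phones" => [%{"label" => "work"}]})

IO.inspect(good)
IO.inspect(bad)
```

One invalid element is enough to make the parent changeset invalid, so callers never receive a partially parsed list.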

Removing Duplication

You’ll note in the last example we have a lot of duplicated boilerplate. The last step of our work is creating a module that extracts the duplication from our parser definitions.

defmodule Parsers.Schema do
  @callback changeset(struct(), map()) :: Ecto.Changeset.t()

  defmacro __using__(opts) do
    quote do
      @behaviour Parsers.Schema

      use TypedEctoSchema

      import Ecto.Changeset

      @primary_key false

      unquote(add_builder(opts))
    end
  end

  defmacro __before_compile__(_) do
    quote do
      def build(attrs) do
        struct(__MODULE__)
        |> changeset(attrs)
        |> apply_action(:build)
      end
    end
  end

  defp add_builder(opts) do
    if Keyword.get(opts, :builder, true) do
      quote do
        @before_compile Parsers.Schema
      end
    end
  end
end

Using Parsers.Schema we can rewrite our previous Person and Address parsers.

defmodule Parsers.Address do
  use Parsers.Schema, builder: false

  typed_embedded_schema do
    field :city, :string, enforce: true
    field :zip, :string, enforce: true
  end

  @impl Parsers.Schema
  def changeset(address, attrs \\ %{}) do
    address
    |> cast(attrs, [:city, :zip])
    |> validate_required([:city, :zip])
  end
end

defmodule Parsers.Person do
  use Parsers.Schema

  alias Parsers.Address

  typed_embedded_schema do
    field :name, :string, enforce: true
    embeds_one :address, Address, enforce: true
  end

  @impl Parsers.Schema
  def changeset(person, attrs \\ %{}) do
    person
    |> cast(attrs, [:name])
    |> cast_embed(:address)
    |> validate_required([:name, :address])
  end
end

Our Parsers.Schema module does several important things.

  1. Ensures we have defined the changeset callback
  2. Tells Ecto not to include a primary_key for our embedded schema
  3. Adds a builder function unless we explicitly exclude its generation
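
To see the optional builder in isolation, the following sketch strips the pattern down to plain Elixir with no Ecto involved. Sketch.Builder, its normalize/1 callback, and the two modules that use it are hypothetical stand-ins for Parsers.Schema, changeset/2, and the parsers above.

```elixir
defmodule Sketch.Builder do
  @callback normalize(map()) :: map()

  defmacro __using__(opts) do
    quote do
      @behaviour Sketch.Builder

      # Expands to an @before_compile registration, or to nil when
      # the caller passes builder: false.
      unquote(add_builder(opts))
    end
  end

  # Runs just before the using module is compiled and injects build/1.
  defmacro __before_compile__(_env) do
    quote do
      def build(attrs), do: {:ok, normalize(attrs)}
    end
  end

  defp add_builder(opts) do
    if Keyword.get(opts, :builder, true) do
      quote do
        @before_compile Sketch.Builder
      end
    end
  end
end

defmodule Sketch.WithBuilder do
  use Sketch.Builder

  @impl Sketch.Builder
  def normalize(attrs), do: Map.take(attrs, ["name"])
end

defmodule Sketch.WithoutBuilder do
  use Sketch.Builder, builder: false

  @impl Sketch.Builder
  def normalize(attrs), do: Map.take(attrs, ["city"])
end

IO.inspect(function_exported?(Sketch.WithBuilder, :build, 1))
IO.inspect(function_exported?(Sketch.WithoutBuilder, :build, 1))
IO.inspect(Sketch.WithBuilder.build(%{"name" => "Drew", "unknown" => 1}))
```

Passing builder: false suits modules like Address that are only ever parsed as part of a larger structure and so do not need an entry point of their own.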

Parse Your Elixir Input

Using the above simple techniques, we can force our untrusted input into a known-good struct as early as possible in our application. We’ve centralized the logic for parsing our external input and all functions called later can trust that they will receive well-structured input – at least to the extent possible in a dynamically typed programming language.