Parser Combinators in Elixir

Delving into the world of pure functional programming caused me to learn about parser combinators. Upon returning to Elixir, I was excited to see that nimble_parsec is a great example of a parser combinator library for the Elixir ecosystem.

Parser combinators can be notoriously confusing when first learned. In this post I’ll provide a gentle introduction to parser combinators via nimble_parsec.

What is a Parser Combinator?

Have you ever found yourself writing a regular expression to parse input? I know I have. You finally have the syntax correct and then new requirements get added. Suddenly you need to support optional tokens, lists of values and other complicated types of input. When regular expressions start to break down because of complexity, it’s time to reach for a more powerful abstraction. Enter parser combinators.

A parser combinator library allows you define parsers for specific types of input and then combine (see what I did there) these parsers together to create larger, more complicated parsers. This lets us test each part of our parser in isolation and then combine these parts just like other functions.

Our First Parser

First, create a new Elixir project via mix new parser. Next, we’ll add nimble_parsec as a dependency in mix.exs.

defp deps do
  [
    {:nimble_parsec, "~> 0.5"}
  ]
end

We’ll only be working with two modules – the top-level Parser module and a Parser.Helpers module. First things first, we’re going to add a parser that parses a specific string, "hello".

defmodule Parser do
  import NimbleParsec

  alias Parser.Helpers

  @doc ~S"""
      iex> {:ok, result, _, _, _, _} = Parser.constant("hello")
      iex> result
      ["hello"]

      iex> {:error, message, _, _, _, _,} = Parser.constant("other")
      iex> message
      "expected string \"hello\""
  """
  defparsec(:constant, Helpers.constant())
end

We’ve created a module Parser, imported NimbleParsec and aliased our Parser.Helpers module. Next, we’re using the macro defparsec provided by NimbleParsec to create a parser called constant. Finally, we delegate the implementation of the constant parser to a function called constant in our Parser.Helpers module.

We’re also using Elixir’s doctest feature to test our parser. Our tests show that we expect our parser to parse exactly the string "hello", otherwise we’ll return an error.

All of our parsers will follow the same pattern:

Let’s look at the implementation for Parser.Helpers.constant.

defmodule Parser.Helpers do
  import NimbleParsec

  def constant do
    string("hello")
  end
end

Our constant function only uses one other function, string. The string function takes one argument, a literal string to match. It returns a combinator that only parses the literal string it was passed. All our tests now pass.

Note that the return value of our parser above is a list with the value "hello" as the only element. Any parser defined with defparsec always returns a list of parsed tokens. We can see above in the doctest that the successful result is ["hello"].

Natural Numbers

Next, let’s create a parser that parses the natural numbers (positive integers). Here’s our parser definition along with the tests (from now on we’ll exclude the module definitions).

@doc ~S"""
    iex> {:ok, result, _, _, _, _} = Parser.natural("1234")
    iex> result
    [1234]

    iex> {:error, message, _, _, _, _} = Parser.natural("-12")
    iex> message
    "expected byte in the range ?0..?9"
"""
defparsec(:natural, Helpers.natural())

Conveniently, nimble_parsec includes a built-in combinator that can help us. Here’s our definition.

def natural do
  integer(min: 1)
end

We use the built-in integer combinator and provide it the min: 1 option, saying we want to parse the integer 1 or larger.

Composition

So far, the two parsers we’ve defined could have easily been implemented using regular expressions. But now, we can start to combine these helper functions (called combinators) to produce more complicated parsers. Let’s parse the name from a line like this: Name: Drew. In this case, we want the result of our parser to be ["Drew"]. Here’s the parser definition.

@doc ~S"""
    iex> {:ok, result, _, _, _, _} = Parser.name("Name: Drew")
    iex> result
    ["Drew"]

    iex> {:error, message, _, _, _, _} = Parser.name("other stuff")
    iex> message
    "expected string \"Name: \""
"""
defparsec(:name, Helpers.name())

Let’s look at the definition. It’ll be more complicated than any combinator we’ve seen so far.

def name do
  ignore(string("Name: "))
  |> ascii_string([?a..?z, ?A..?Z], min: 1)
end

There’s a bunch of things to notice here.

First, we’re using a new built-in combinator called ignore. It takes another combinator as its argument and returns a new combinator that parses and then throws away the result that matches the given combinator. In this case, we want to parse and then ignore the string "Name: ".

After that, we want to parse some input. This means we need to combine the ignoring with the next combinator. To do this, we use Elixir’s |> operator.

Finally, we use the built-in ascii_string combinator to match any string of lower or uppercase letters with length of at least 1.

Let’s look at one more example of this style of parsing. This time, we’ll parse a line like Age: 37 and return the result [37].

@doc ~S"""
    iex> {:ok, result, _, _, _, _} = Parser.age("Age: 37")
    iex> result
    [37]

    iex> {:error, message, _, _, _, _} = Parser.age("other stuff")
    iex> message
    "expected string \"Age: \""
"""
defparsec(:age, Helpers.age())
def age do
  ignore(string("Age: "))
  |> concat(natural())
end

This looks identical to our previous parser save for two key differences: we’re ignoring a different static string and we’re using concat(natural()) as our second combinator.

What does concat(natural()) do? We want to reuse our previously-defined natural combinator to parse a positive integer. But why do we have to use concat? Most built-in combinators work either as a stand-alone combinator or allow composition via the |> operator. Our natural combinator, however, doesn’t take any arguments, so we can’t |> into it. The built-in concat combinator takes an existing combinator as an argument and returns a new combinator that expects to be composed with |>. That’s exactly what we want here!

More Composition

Now, let’s parse a line that combines our name and age combinators. Here’s the parser.

@doc ~S"""
    iex> {:ok, result, _, _, _, _} = Parser.person("Name: Drew, Age: 37")
    iex> result
    [{:person, ["Drew", 37]}]

    iex> {:error, message, _, _, _, _} = Parser.person("something else")
    iex> message
    "expected string \"Name: \""
"""
defparsec(:person, Helpers.person())

This time, our result looks different in that it is “tagged”. That is, rather than being just a plain list, it is a tuple with a name and a list of values. Let’s look at the combinator definition.

def person do
  name()
  |> ignore(string(", "))
  |> concat(age())
  |> tag(:person)
end

Here’s we’re using our existing name and age combinators, we’re ignoring the ", " between them and finally tagging the result with :person. If we exclude the tag line, the result of our parser would be ["Drew", 37]. With the tag, the result is what we want {:person, ["Drew, 37]}.

Tagging can help us organize the output from our parsers that return more complex data.

A More Complicated Parser

Let’s look at a parser that is a bit more complicated – parsing a list of numbers.

@doc ~S"""
    iex> {:ok, result, _, _, _, _} =
      Parser.list_of_numbers("1, 2, 456, 79")
    iex> result
    [1, 2, 456, 79]

    iex> {:error, message, _, _, _, _} =
      Parser.list_of_numbers("1, 2, hi, 79")
    iex> message
    "expected end of string"
"""
defparsec(:list_of_numbers, Helpers.list_of_numbers())

We’re going to use the times combinator provided by nimble_parsec to help here. But this parser is more tricky to implement than we might think. Here’s our first try at the combinator.

def list_of_numbers do
  natural()
  |> ignore(string(", "))
  |> times(min: 1)
end

At first glace, this combinator seems reasonable. We want to parse a natural number, followed by the string ", " which we ignore. And we want to do this process at least one time. Uh oh, we have a doctest failure.

  1) doctest Parser.list_of_numbers/2 (5) (ParserTest)
     test/parser_test.exs:3
     Doctest failed
     doctest:
       iex> {:ok, result, _, _, _, _} =
         Parser.list_of_numbers("1, 2, 456, 79")
       iex> result
       [1, 2, 456, 79]
     code:  result === [1, 2, 456, 79]
     left:  [1, 2, 456]
     right: [1, 2, 456, 79]
     stacktrace:
       lib/parser.ex:51: Parser (module)

What happened? Well, we’re only parsing things that look like "<natural>, " and the last number in our list doesn’t have a comma.

Ok, let’s take the following approach: we’ll always parse one natural number first, then any number of ", <natural>" patterns (zero or more).

def list_of_numbers do
  comma_natural =
    ignore(string(", "))
    |> concat(natural())

  natural()
  |> times(comma_natural, min: 0)
end

To help with readability, we created a named combinator inside of our function. Then, we parse a natural followed by zero or more ", <natural>" strings. Awesome, our test passes! Wait, we have another failure now.

  1) doctest Parser.list_of_numbers/2 (6) (ParserTest)
     test/parser_test.exs:3
     match (=) failed
     code:  {:error, message, _, _, _, _} =
       Parser.list_of_numbers("1, 2, hi, 79")
     left:  {:error, message, _, _, _, _}
     right: {:ok, [1, 2], ", hi, 79", %{}, {1, 0}, 4}
     stacktrace:
       (for doctest at) lib/parser.ex:55: (test)

What happened? We expected the parser to fail if it reached a non-number, but instead it just parsed as far as it could successfully. We need some way to tell the parser that all of the string we provide must match the combinator. To do this, we can use a built-in combinator called eos. The eos combinator matches the end of the string and ignores its result.

def list_of_numbers do
  comma_natural =
    ignore(string(", "))
    |> concat(natural())

  natural()
  |> times(comma_natural, min: 0)
  |> eos()
end

Great, now all of our tests pass.

Code Reuse in Parsers

Finally, let’s parse a list of people, using our previously defined person combinator.

@doc ~S"""
    iex> {:ok, result, _, _, _, _} =
      Parser.list_of_people("Name: Drew, Age: 37, Name: Jane, Age: 34")
    iex> result
    [{:person, ["Drew", 37]}, {:person, ["Jane", 34]}]

    iex> {:error, message, _, _, _, _} =
      Parser.list_of_people("Name: Drew, Age: 37, bad")
    iex> message
    "expected end of string"
"""
defparsec(:list_of_people, Helpers.list_of_people())

Much of the logic here is going to be the same as our list_of_numbers combinator. Let’s pull out a helper called separated_by that will allow use to implement both combinators without duplication.

def separated_by(comb, sep) do
  sep_comb =
    ignore(sep)
    |> concat(comb)

  comb
  |> times(sep_comb, min: 0)
end

Our separated_by helper takes two arguments: a combinator called comb that parses the tokens we want to keep and a second combinator called sep that separates the tokens. We want to ignore this one.

Now, let’s re-implement list_of_numbers using this helper.

def list_of_numbers do
  natural()
  |> separated_by(string(", "))
  |> eos()
end

Note that eos isn’t inside of separated_by because we want to be able to reuse separated_by anywhere in our parsers rather than being forced to consume the entire input.

Finally, let’s implemented list_of_people.

def list_of_people do
  person()
  |> separated_by(string(", "))
  |> eos()
end

That was easy!

Wrap Up

I hope this post helps you understand some of the power provided by parser combinators. The next time you need to parse some complicated input in your Elixir application, consider using nimble_parsec.