Delving into the world of pure functional programming led me to parser combinators. Upon returning to Elixir, I was excited to see that nimble_parsec is a great example of a parser combinator library for the Elixir ecosystem.
Parser combinators can be notoriously confusing when you first encounter them. In this post I’ll provide a gentle introduction to parser combinators via nimble_parsec.
What is a Parser Combinator?
Have you ever found yourself writing a regular expression to parse input? I know I have. You finally have the syntax correct and then new requirements get added. Suddenly you need to support optional tokens, lists of values and other complicated types of input. When regular expressions start to break down because of complexity, it’s time to reach for a more powerful abstraction. Enter parser combinators.
A parser combinator library allows you to define parsers for specific types of input and then combine (see what I did there) these parsers together to create larger, more complicated parsers. This lets us test each part of our parser in isolation and then combine these parts just like other functions.
Our First Parser
First, create a new Elixir project via mix new parser. Next, we’ll add nimble_parsec as a dependency in mix.exs.
defp deps do
  [
    {:nimble_parsec, "~> 0.5"}
  ]
end
We’ll only be working with two modules – the top-level Parser module and a Parser.Helpers module. First things first, we’re going to add a parser that parses a specific string, "hello".
defmodule Parser do
  import NimbleParsec

  alias Parser.Helpers

  @doc ~S"""
      iex> {:ok, result, _, _, _, _} = Parser.constant("hello")
      iex> result
      ["hello"]

      iex> {:error, message, _, _, _, _} = Parser.constant("other")
      iex> message
      "expected string \"hello\""
  """
  defparsec(:constant, Helpers.constant())
end
We’ve created a module Parser, imported NimbleParsec and aliased our Parser.Helpers module. Next, we’re using the macro defparsec provided by NimbleParsec to create a parser called constant. Finally, we delegate the implementation of the constant parser to a function called constant in our Parser.Helpers module.
We’re also using Elixir’s doctest feature to test our parser. Our tests show that we expect our parser to parse exactly the string "hello" and to return an error otherwise.
All of our parsers will follow the same pattern:
- defparsec defines a parser and delegates to a function in Parser.Helpers
- A doctest comment demonstrates the parser’s behavior
- The parser implementation lives in Parser.Helpers
Let’s look at the implementation for Parser.Helpers.constant.
defmodule Parser.Helpers do
  import NimbleParsec

  def constant do
    string("hello")
  end
end
Our constant function only uses one other function, string. The string function takes one argument, a literal string to match. It returns a combinator that only parses the literal string it was passed. All our tests now pass.
Note that the return value of our parser above is a list with the value "hello" as the only element. Any parser defined with defparsec always returns a list of parsed tokens. We can see above in the doctest that the successful result is ["hello"].
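If you run the parser directly in iex, you can see all six elements of the returned tuple: the parsed tokens, the unconsumed rest of the input, a context map, the line position and the byte offset. Here’s a quick sketch of what I’d expect (the exact context and position values may vary by nimble_parsec version):
iex> Parser.constant("hello")
{:ok, ["hello"], "", %{}, {1, 0}, 5}
iex> Parser.constant("hello world")
{:ok, ["hello"], " world", %{}, {1, 0}, 5}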
Natural Numbers
Next, let’s create a parser that parses the natural numbers (positive integers). Here’s our parser definition along with the tests (from now on we’ll exclude the module definitions).
@doc ~S"""
iex> {:ok, result, _, _, _, _} = Parser.natural("1234")
iex> result
[1234]
iex> {:error, message, _, _, _, _} = Parser.natural("-12")
iex> message
"expected byte in the range ?0..?9"
"""
defparsec(:natural, Helpers.natural())
Conveniently, nimble_parsec includes a built-in combinator that can help us. Here’s our definition.
def natural do
  integer(min: 1)
end
We use the built-in integer combinator and provide it the min: 1 option, saying we want to match at least one digit.
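Because the minimum refers to digits rather than the value, the parser happily stops at the first non-digit character. A sketch of what I’d expect in iex (return values hedged as before):
iex> Parser.natural("7")
{:ok, [7], "", %{}, {1, 0}, 1}
iex> Parser.natural("42abc")
{:ok, [42], "abc", %{}, {1, 0}, 2}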
Composition
So far, the two parsers we’ve defined could have easily been implemented using regular expressions. But now, we can start to combine these helper functions (called combinators) to produce more complicated parsers. Let’s parse the name from a line like this: Name: Drew. In this case, we want the result of our parser to be ["Drew"]. Here’s the parser definition.
@doc ~S"""
iex> {:ok, result, _, _, _, _} = Parser.name("Name: Drew")
iex> result
["Drew"]
iex> {:error, message, _, _, _, _} = Parser.name("other stuff")
iex> message
"expected string \"Name: \""
"""
defparsec(:name, Helpers.name())
Let’s look at the definition. It’ll be more complicated than any combinator we’ve seen so far.
def name do
  ignore(string("Name: "))
  |> ascii_string([?a..?z, ?A..?Z], min: 1)
end
There are a few things to notice here.
First, we’re using a new built-in combinator called ignore. It takes another combinator as its argument and returns a new combinator that parses and then throws away the result that matches the given combinator. In this case, we want to parse and then ignore the string "Name: ".
After that, we want to parse some input. This means we need to combine the ignoring with the next combinator. To do this, we use Elixir’s |> operator.
Finally, we use the built-in ascii_string combinator to match any string of lower or uppercase letters with a length of at least 1.
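To see why ignore matters, here’s a sketch of the same combinator without it; name_with_label is a hypothetical helper, not part of our parser:
# Hypothetical: same as name/0 but without ignore.
# Parsing "Name: Drew" with this would return ["Name: ", "Drew"],
# keeping the label token we don't actually want.
def name_with_label do
  string("Name: ")
  |> ascii_string([?a..?z, ?A..?Z], min: 1)
end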
Let’s look at one more example of this style of parsing. This time, we’ll parse a line like Age: 37 and return the result [37].
@doc ~S"""
iex> {:ok, result, _, _, _, _} = Parser.age("Age: 37")
iex> result
[37]
iex> {:error, message, _, _, _, _} = Parser.age("other stuff")
iex> message
"expected string \"Age: \""
"""
defparsec(:age, Helpers.age())
def age do
  ignore(string("Age: "))
  |> concat(natural())
end
This looks identical to our previous parser save for two key differences: we’re ignoring a different static string and we’re using concat(natural()) as our second combinator.
What does concat(natural()) do? We want to reuse our previously-defined natural combinator to parse a positive integer. But why do we have to use concat? Most built-in combinators either work as a stand-alone combinator or allow composition via the |> operator. Our natural combinator, however, doesn’t take any arguments, so we can’t |> into it. The built-in concat combinator takes an existing combinator as an argument and returns a new combinator that expects to be composed with |>. That’s exactly what we want here!
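As a sketch, here’s a hypothetical age_alternative/0 that skips concat by calling the built-in integer directly; it should behave the same as age/0 above, because integer (unlike our zero-argument natural) accepts the previous combinator as its first argument:
# Hypothetical alternative to age/0: integer/2 can be piped into
# directly, so no concat is needed here.
def age_alternative do
  ignore(string("Age: "))
  |> integer(min: 1)
end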
More Composition
Now, let’s parse a line that combines our name and age combinators. Here’s the parser.
@doc ~S"""
iex> {:ok, result, _, _, _, _} = Parser.person("Name: Drew, Age: 37")
iex> result
[{:person, ["Drew", 37]}]
iex> {:error, message, _, _, _, _} = Parser.person("something else")
iex> message
"expected string \"Name: \""
"""
defparsec(:person, Helpers.person())
This time, our result looks different in that it is “tagged”. That is, rather than being just a plain list, it is a tuple with a name and a list of values. Let’s look at the combinator definition.
def person do
  name()
  |> ignore(string(", "))
  |> concat(age())
  |> tag(:person)
end
Here we’re using our existing name and age combinators, we’re ignoring the ", " between them and finally tagging the result with :person. If we excluded the tag line, the result of our parser would be ["Drew", 37]. With the tag, the result is what we want: {:person, ["Drew", 37]}.
Tagging can help us organize the output from our parsers that return more complex data.
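For example, tagged output is easy to pattern match on once parsing is done. Here’s a hypothetical to_map/1 helper (plain Elixir, not part of the parser itself) that turns the tagged tuple into a map:
# Hypothetical post-processing of a tagged result.
def to_map({:person, [name, age]}) do
  %{name: name, age: age}
end

# to_map({:person, ["Drew", 37]}) #=> %{name: "Drew", age: 37}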
A More Complicated Parser
Let’s look at a parser that is a bit more complicated – parsing a list of numbers.
@doc ~S"""
iex> {:ok, result, _, _, _, _} =
Parser.list_of_numbers("1, 2, 456, 79")
iex> result
[1, 2, 456, 79]
iex> {:error, message, _, _, _, _} =
Parser.list_of_numbers("1, 2, hi, 79")
iex> message
"expected end of string"
"""
defparsec(:list_of_numbers, Helpers.list_of_numbers())
We’re going to use the times combinator provided by nimble_parsec to help here. But this parser is trickier to implement than we might think. Here’s our first try at the combinator.
def list_of_numbers do
  natural()
  |> ignore(string(", "))
  |> times(min: 1)
end
At first glance, this combinator seems reasonable. We want to parse a natural number, followed by the string ", ", which we ignore. And we want to repeat this process at least one time. Uh oh, we have a doctest failure.
  1) doctest Parser.list_of_numbers/2 (5) (ParserTest)
     test/parser_test.exs:3
     Doctest failed
     doctest:
       iex> {:ok, result, _, _, _, _} = Parser.list_of_numbers("1, 2, 456, 79")
       iex> result
       [1, 2, 456, 79]
     code:  result === [1, 2, 456, 79]
     left:  [1, 2, 456]
     right: [1, 2, 456, 79]
     stacktrace:
       lib/parser.ex:51: Parser (module)
What happened? Well, we’re only parsing things that look like "<natural>, " and the last number in our list doesn’t have a trailing comma.
Ok, let’s take the following approach: we’ll always parse one natural number first, then any number of ", <natural>" patterns (zero or more).
def list_of_numbers do
  comma_natural =
    ignore(string(", "))
    |> concat(natural())

  natural()
  |> times(comma_natural, min: 0)
end
To help with readability, we created a named combinator inside of our function. Then, we parse a natural followed by zero or more ", <natural>" strings. Awesome, our test passes! Wait, we have another failure now.
  1) doctest Parser.list_of_numbers/2 (6) (ParserTest)
     test/parser_test.exs:3
     match (=) failed
     code:  {:error, message, _, _, _, _} = Parser.list_of_numbers("1, 2, hi, 79")
     left:  {:error, message, _, _, _, _}
     right: {:ok, [1, 2], ", hi, 79", %{}, {1, 0}, 4}
     stacktrace:
       (for doctest at) lib/parser.ex:55: (test)
What happened? We expected the parser to fail if it reached a non-number, but instead it just parsed as far as it could successfully. We need some way to tell the parser that the entire string we provide must match the combinator. To do this, we can use a built-in combinator called eos. The eos combinator matches the end of the string and produces no output of its own.
def list_of_numbers do
  comma_natural =
    ignore(string(", "))
    |> concat(natural())

  natural()
  |> times(comma_natural, min: 0)
  |> eos()
end
Great, now all of our tests pass.
Code Reuse in Parsers
Finally, let’s parse a list of people, using our previously defined person combinator.
@doc ~S"""
iex> {:ok, result, _, _, _, _} =
Parser.list_of_people("Name: Drew, Age: 37, Name: Jane, Age: 34")
iex> result
[{:person, ["Drew", 37]}, {:person, ["Jane", 34]}]
iex> {:error, message, _, _, _, _} =
Parser.list_of_people("Name: Drew, Age: 37, bad")
iex> message
"expected end of string"
"""
defparsec(:list_of_people, Helpers.list_of_people())
Much of the logic here is going to be the same as our list_of_numbers combinator. Let’s pull out a helper called separated_by that will allow us to implement both combinators without duplication.
def separated_by(comb, sep) do
  sep_comb =
    ignore(sep)
    |> concat(comb)

  comb
  |> times(sep_comb, min: 0)
end
Our separated_by helper takes two arguments: a combinator called comb that parses the tokens we want to keep, and a second combinator called sep that separates the tokens. The separator is parsed but its output is ignored.
Now, let’s re-implement list_of_numbers using this helper.
def list_of_numbers do
  natural()
  |> separated_by(string(", "))
  |> eos()
end
Note that eos isn’t inside of separated_by because we want to be able to reuse separated_by anywhere in our parsers rather than being forced to consume the entire input.
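For instance, because separated_by doesn’t consume the rest of the input on its own, we could embed a separated list inside a larger combinator. Here’s a hypothetical sketch (labelled_numbers isn’t part of our parser):
# Hypothetical: parse a line like "Numbers: 1, 2, 3"; eos stays at the
# top level of the parser, not inside separated_by.
def labelled_numbers do
  ignore(string("Numbers: "))
  |> concat(natural() |> separated_by(string(", ")))
  |> eos()
end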
Finally, let’s implement list_of_people.
def list_of_people do
  person()
  |> separated_by(string(", "))
  |> eos()
end
That was easy!
Wrap Up
I hope this post helps you understand some of the power provided by parser combinators. The next time you need to parse some complicated input in your Elixir application, consider using nimble_parsec.