Build a query parser

Why and how, with a tutorial using Ruby, Parslet, and Elasticsearch

By Luke Francl (look@recursion.org), June 2017

More than a few times in my career, I've been part of a project that needed search, and a Lucene-based search engine fit the bill. Somebody[1] finds the search engine's built-in query parser, wires it up, and that is that. It seems like a good idea, and it's easy.

However, it's better to write your own query parser, for two reasons. First, built-in parsers are too powerful. They are confusing and allow users to trigger expensive queries that kill performance. Second, built-in parsers are too generic. There is a tension between queries that are safe to execute and giving users a powerful query language—which they expect. However, built-in query parsers tend to be all-or-nothing: either they are safe, or they provide extraordinary power that can be too dangerous to expose. You can't select only the features you need. When you control your own parser, you can add features to it and customize your application's search behavior for your users.

This might sound daunting, but thanks to easy-to-use parser libraries, it's straightforward. There is no magic in the built-in query parser: it constructs low-level queries the same way you would by using those query objects directly.

In this tutorial, we'll create a query parser that can generate queries for the Elasticsearch query DSL. Building up from a simple term parser, we'll add boolean operators, phrases, and close by adding an application-specific heuristic that would never make sense as part of a generic query parser.

Want to skip to the code? Each step of the tutorial is available as a self-contained file so it's easy to follow along.

Problems with generic query parsers

Most search engines have a very powerful query parser built in, which can take a string and convert it to the underlying query objects. I'm most familiar with Lucene's query parser which is exposed by Solr and Elasticsearch, but other search APIs provide similar functionality (for example, Google Cloud Platform's search API).

There is no magic: The safety-power gradient

There is no magic in the Lucene query parser. It accepts a string as input, parses it, and constructs lower-level queries. You can see this for yourself if you look at the code (the linked-to class adds should clauses to a boolean query).

The following Elasticsearch query looks simple:

{
  "query": {
    "simple_query_string": {
      "query": "cat in the hat",
      "fields": ["title"]
    }
  }
}

But it is essentially the same as this longer query built from bool and match:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "cat"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "in"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "the"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "hat"
            }
          }
        }
      ]
    }
  }
}

And match is implemented using term, which is a query that operates on the exact terms that are stored in Lucene's inverted index.
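
For example, a term query looks like this (sketched here as a Ruby hash in the style of the query DSL output we'll generate later; it is not part of the tutorial's code):

# A term query matches a single exact term from the inverted index.
{
  :query => {
    :term => {
      :title => "cat"
    }
  }
}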

The higher-level queries are all implemented in terms of the lower-level queries.

match is very safe to expose to users, but also limited in what it can do. The jump up to simple_query_string adds a lot of features you may not need, and query_string is explicitly not recommended for end users.

However, since there is no magic, there is no downside to generating lower-level queries in your application, rather than having Elasticsearch do it.

User input may contain special characters

The built-in query parser has its own syntax, which users may not understand. For example, Elasticsearch's query_string syntax reserves +, -, =, &&, ||, >, <, !, (, ), {, }, [, ], ^, ", ~, *, ?, :, \, and /.

Using the syntax incorrectly will either trigger an error or lead to unexpected results. For example, to prevent the query string alpha:cat "cat:hat" from generating a query limiting the search for cat to the field alpha (which might not exist), it needs to be escaped as alpha\:cat "cat:hat".

Escaping characters in a query string with regular expressions ranges from difficult to impossible. And some characters can't be escaped, period:

< and > can't be escaped at all. The only way to prevent them from attempting to create a range query is to remove them from the query string entirely.
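
If you do pass user input to query_string, you need something like the following sketch (my illustration, not part of the tutorial): strip the characters that can't be escaped and backslash-escape the rest. Note that blanket escaping also destroys intentional syntax such as quoted phrases, which is exactly why this approach ranges from difficult to impossible.

# A sketch of sanitizing input for query_string. The reserved character
# list comes from the Elasticsearch documentation; the method name and
# approach are illustrative only.
def sanitize_for_query_string(input)
  # < and > can't be escaped; remove them entirely.
  sanitized = input.delete('<>')
  # Backslash-escape the remaining reserved characters and operators.
  sanitized.gsub(%r{(&&|\|\||[-+=!(){}\[\]^"~*?:\\/])}) { |match| "\\#{match}" }
end

puts sanitize_for_query_string('alpha:cat "cat:hat"')
# Prints: alpha\:cat \"cat\:hat\"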

Exposing query_string to end users is now explicitly discouraged. Considering its complexity and lack of composability, it's not a great interface for programmers, either.

Users can trigger expensive query features

Related to the above point, users can intentionally or unintentionally trigger advanced query features, such as limiting a search term to a single field with field_name:term or boosting a term with term^10. The consequences range from confusing results to maliciously expensive queries.

Some of these advanced operators can cause very expensive queries. In Lucene-based tools, certain queries are very expensive because they require enumerating terms from the term dictionary in order to create the low-level query objects. A query with a wildcard (especially a leading wildcard!) or regular expression may require reading many terms. Regular expressions are particularly dangerous:

A query string such as the following would force Elasticsearch to visit every term in the index:

/.*n/

Use with caution!

Range queries may seem harmless, but they also have this problem. Beware queries for wide ranges on high resolution data!

Users can submit a huge number of terms

Passing user input directly to the query parser can be dangerous for more prosaic reasons, too. For example, the user may simply enter a large number of terms. This will generate a query with many clauses, which will take longer to execute. Truncating the query string is a simple workaround, but if you truncate in the middle of an expression (for example, by breaking a quoted phrase or parenthetical expression), it could lead to an invalid query.

Avoiding the foot gun

Lucene 4.7 added a new SimpleQueryParser that improved things quite a bit. In Elasticsearch, this is available as simple_query_string in the search DSL. Unlike query_string, simple_query_string is designed to be exposed to end users and reduces the complexity of queries that can be created.

But SimpleQueryParser is still powerful. Users can specify queries with the following operators:

+ signifies AND operation
| signifies OR operation
- negates a single token
" wraps a number of tokens to signify a phrase for searching
* at the end of a term signifies a prefix query
( and ) signify precedence
~N after a word signifies edit distance (fuzziness)
~N after a phrase signifies slop amount

This syntax is complicated, and it still lets users generate expensive queries.

If you go down the built-in query parser route, you'll get something working quickly. But you'll probably run into problems sooner or later. Users (or your exception monitoring software) will complain that queries don't work; or expensive queries will slow your service down for everyone and you'll have to figure out why.

Aside: Search box versus search interface

This tutorial focuses on parsing what the user types into a search box: full-text search in its most basic sense.

Your application's search interface may require more advanced search features, such as faceting or filtering. However, almost all applications with search allow full-text search, so the code presented here is widely applicable. Additional search features can be layered into the query generation process by sending additional input along with the query string as part of your search API.

That's why it's worth the time to build your own query parser: you control the syntax, you decide which features to expose, and you can tailor the generated queries to your application.

Fortunately, building a simple query parser is not difficult. In this tutorial, we will create a query parser using the Ruby library Parslet that can generate queries for the Elasticsearch query DSL. Our query parser will support a more limited syntax than SimpleQueryParser, but the syntax will be under our control, so we can add new features when we need them.

Building a term-based query parser

The first version of the query parser will be extremely limited. Given input like cat in the hat it will be able to generate this Elasticsearch query:

{
  "query": {
    "match": {
      "title": {
        "query": "cat in the hat",
        "operator": "or"
      }
    }
  }
}

This is a match query, which does not interpret its input. You may notice that we could simply take the user input and drop it into that JSON structure. But we need to start somewhere!

Defining a query language grammar with BNF

First, let's define a grammar for our simple query language using Backus–Naur form (BNF).

<query> ::= <term> | <term> <space> <query>
<term> ::= <alphanumeric> | <alphanumeric> <term>

Where <alphanumeric> is one of the characters [a-zA-Z0-9] and <space> is whitespace. The definitions are recursive to allow repetition. A railroad diagram helps visualize the syntax.

[Railroad diagram for Query: a term, optionally followed by the rest of the query]

[Railroad diagram for Term: an alphanumeric character, optionally followed by the rest of the term]

Defining a grammar with Parslet

Aside: PEG parsing in other languages

This tutorial uses Ruby, but there are great, easy-to-use PEG parsing libraries available for most languages. JavaScript has PEG.js, Python has Arpeggio and Parsimonious, and Java has Parboiled.

BNF describes a context-free grammar (CFG), a set of rules that can be used to generate any legal expression of the language. Because of their origins in natural human languages, CFGs can be ambiguous, and parsing them in the general case is slow (the best known general CFG parsing algorithms run in O(n³) time). This makes unrestricted CFGs inappropriate for machine-oriented languages, so various limitations must be observed to avoid ambiguity and poor performance (despite this, ambiguous parse trees still happen).

Building a parser with traditional tools means defining a generative grammar with BNF (or, more likely, extended BNF), writing a lexer to convert input into tokens, and then feeding the tokens into a parser generator.

The Parsing Expression Grammar (PEG) model makes parsers easier to write by eliminating the separate lexing step. In practice, a PEG parser looks like an executable BNF grammar, but PEGs are recognition-based rather than generative like CFGs.

Parsing expression grammars have two important characteristics: choice between alternatives is ordered, so the first alternative that matches always wins and a PEG cannot be ambiguous; and there is no separate lexing step, because the grammar rules operate directly on characters.

Parslet is a Ruby library for generating PEG-style parsers. You define the rules of your grammar, and Parslet creates a parser that returns a parse tree for input. Then you define a transformer that takes the parse tree and converts it to an abstract syntax tree (AST). Finally, your code evaluates the AST to produce a result.

To define a parser with Parslet, subclass Parslet::Parser and define rules, which are built from atoms, the building blocks of your grammar:

class MyParser < Parslet::Parser
  # match matches one character; repeat allows 1 or more repetitions
  rule(:term) { match('[a-zA-Z0-9]').repeat(1) } 

  rule(:space) { match('\s').repeat(1) }

  # >> means "followed by"; maybe is equivalent to repeat(0, 1)
  rule(:query) { (term >> space.maybe).repeat }

  # The root tells Parslet where to start parsing the input
  root(:query)
end

Aside: Using Parslet from the console

When you want to quickly test out a Parslet parser, you can include Parslet in irb to add its API:

require 'parslet'
include Parslet

match('\d').repeat(1).parse("1234")
 => "1234"@0

Notice how rules can be used by other rules. They're plain old Ruby objects.

Now that we have a parser, we can instantiate it and parse a string:

MyParser.new.parse("hello parslet")
# => "hello parslet"@0

It doesn't look like much! But notice the @0. The result is a Parslet::Slice, and @0 indicates where in the input string the match occurred. This is really useful for more complicated parsers.

This humble parser is also capable of rejecting invalid input:

MyParser.new.parse("hello, parslet")
# => Parslet::ParseFailed: Extra input after last repetition at line 1 char 6.

The error message pinpoints exactly which character violated the grammar. With the #ascii_tree method, you can get even more detail.

begin
  MyParser.new.parse("hello, parslet")
rescue Parslet::ParseFailed => e
  puts e.parse_failure_cause.ascii_tree
end

Prints:

Extra input after last repetition at line 1 char 6.
`- Failed to match sequence (TERM SPACE?) at line 1 char 6.
   `- Expected at least 1 of [a-zA-Z0-9] at line 1 char 6.
      `- Failed to match [a-zA-Z0-9] at line 1 char 6.

Building a parse tree

The simple parser above can recognize strings that match the grammar, but it can't do anything with them. Using #as, we can capture the parts of the input that we want to keep and save them as a parse tree. Anything not named with #as is discarded.

We need to capture the terms and the overall query. To be more flexible with user input, the term rule has been updated to match any non-whitespace character.

class QueryParser < Parslet::Parser
  rule(:term) { match('[^\s]').repeat(1).as(:term) }
  rule(:space) { match('\s').repeat(1) }
  rule(:query) { (term >> space.maybe).repeat.as(:query) }
  root(:query)
end

This produces a parse tree rooted at :query that contains a list of :term objects. The value of the :term is a Parslet::Slice, as we saw above.

QueryParser.new.parse("cat in the hat")
# =>
{
  :query => [
    {:term => "cat"@0},
    {:term => "in"@4},
    {:term => "the"@7},
    {:term => "hat"@11}
  ]
}

Once you have defined your parse tree, you can create a Parslet::Transform to convert the parse tree into an abstract syntax tree; or in our case, an object that knows how to convert itself to the Elasticsearch query DSL.

A Parslet::Transform defines rules for matching part of the parse tree and converting it to something else. The transform walks up from the leaf nodes until it has consumed the entire tree. For this parse tree, the transformation is simple: match a :term Hash and convert it to a String, then match an array of terms and instantiate a Query object:

class QueryTransformer < Parslet::Transform
  rule(:term => simple(:term)) { term.to_s }
  rule(:query => sequence(:terms)) { Query.new(terms) }
end

Query is also simple. It stores its list of term Strings, and defines a #to_elasticsearch method that joins them back together again in the query DSL:

class Query
  attr_accessor :terms

  def initialize(terms)
    self.terms = terms
  end

  def to_elasticsearch
    {
      :query => {
        :match => {
          :title => {
            :query => terms.join(" "),
            :operator => "or"
          }
        }
      }
    }
  end
end

All together, we can now generate an Elasticsearch match query:

parse_tree = QueryParser.new.parse("cat in the hat")
query = QueryTransformer.new.apply(parse_tree)
query.to_elasticsearch
# =>
{
  :query => {
    :match => {
      :title => {
        :query => "cat in the hat",
        :operator => "or"
      }
    }
  }
}

Aside: Fields

You may have noticed the Elasticsearch queries are only searching one field, title. This keeps the example code simple. A real query generator would need to know the mapping's schema so it could, for example, search all text fields.
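
For example, Elasticsearch's multi_match query searches several fields at once. A sketch (only title exists in this tutorial; the other field names are made up):

# Searching several fields at once with multi_match.
{
  :query => {
    :multi_match => {
      :query => "cat in the hat",
      :fields => ["title", "author", "description"]
    }
  }
}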

OK, that was fun, but this is a roundabout way of generating a simple Elasticsearch query. So far, the code could be replaced with a simple match query to Elasticsearch. But we can build up from here.
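
Even this simple version can be wired up end to end. A sketch using the official elasticsearch gem, assuming a local cluster and a hypothetical index called books:

require 'elasticsearch'

# Send the generated query to Elasticsearch and print matching titles.
client = Elasticsearch::Client.new

parse_tree = QueryParser.new.parse("cat in the hat")
query = QueryTransformer.new.apply(parse_tree)

response = client.search(:index => "books", :body => query.to_elasticsearch)
response["hits"]["hits"].each do |hit|
  puts hit["_source"]["title"]
end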

Boolean queries: should, must, and must not

Unlike the boolean logic you may be familiar with, Lucene-based systems define three types of boolean clauses: should, which means that a clause ought to match, but does not reject documents that don't; must, which requires documents match the clause; and must not, which requires documents do not match the clause. These correspond to "or", "and", and "not", respectively.

In a query language, that might look like cat -hat +cradle. I like using + and - for this rather than AND and OR because I think it looks better and users are less likely to accidentally trigger dangling operators such as foo AND.

To support boolean logic, we'll add a new entity to our parse tree: a clause. A clause has an optional operator (+ or -) and a term.

[Railroad diagram for Clause: an optional + or - operator, followed by a term]

In Parslet, the new clause node can be defined like this:

class QueryParser < Parslet::Parser
  rule(:term) { match('[^\s]').repeat(1).as(:term) }
  rule(:operator) { (str('+') | str('-')).as(:operator) }
  rule(:clause) { (operator.maybe >> term).as(:clause) }
  rule(:space) { match('\s').repeat(1) }
  rule(:query) { (clause >> space.maybe).repeat.as(:query) }
  root(:query)
end

Parsing a query yields a parse tree like this:

QueryParser.new.parse("the +cat in the -hat")
# =>
{
  :query => [
    {:clause => {:term => "the"@0}},
    {:clause => {:operator => "+"@4, :term => "cat"@5}},
    {:clause => {:term => "in"@9}},
    {:clause => {:term => "the"@12}},
    {:clause => {:operator => "-"@16, :term => "hat"@17}}
  ]
}

Transforming this parse tree into the Elasticsearch query DSL will be a little more complicated than the previous iteration, where we could use match directly. Elasticsearch supports several compound queries that allow you to combine simpler queries in complicated ways. For our parser, we'll use the bool query. It looks like this:

{
  "query": {
    "bool": {
      "should": [ list of queries that SHOULD match (OR) ],
      "must": [ list of queries that MUST match (AND)],
      "must_not": [ list of queries that MUST NOT match (NOT)]
    }
  }
}

Yes, the input to a bool query is more queries. This is why you can be overwhelmed by query, query, query when trying to understand the JSON of a complicated Elasticsearch query!

In order to transform the parse tree into an Elasticsearch bool query, let's define a few classes.

First, a helper to convert +, -, or nil into :must, :must_not, or :should:

class Operator
  def self.symbol(str)
    case str
    when '+'
      :must
    when '-'
      :must_not
    when nil
      :should
    else
      raise "Unknown operator: #{str}"
    end
  end
end

Next, Clause holds an operator and a term (which is a String):

class Clause
  attr_accessor :operator, :term

  def initialize(operator, term)
    self.operator = Operator.symbol(operator)
    self.term = term
  end
end

Finally, Query changes to take an Array of clauses and buckets them into should, must, and must_not for conversion to the Elasticsearch query DSL:

class Query
  attr_accessor :should_terms, :must_not_terms, :must_terms

  def initialize(clauses)
    # group_by collects every clause for each operator, even when the
    # clauses are not adjacent in the query
    grouped = clauses.group_by(&:operator)
    self.should_terms = grouped.fetch(:should, []).map(&:term)
    self.must_not_terms = grouped.fetch(:must_not, []).map(&:term)
    self.must_terms = grouped.fetch(:must, []).map(&:term)
  end

  def to_elasticsearch
    query = {
      :query => {
        :bool => {
        }
      }
    }

    if should_terms.any?
      query[:query][:bool][:should] = should_terms.map do |term|
        match(term)
      end
    end

    if must_terms.any?
      query[:query][:bool][:must] = must_terms.map do |term|
        match(term)
      end
    end

    if must_not_terms.any?
      query[:query][:bool][:must_not] = must_not_terms.map do |term|
        match(term)
      end
    end

    query
  end

  def match(term)
    {
      :match => {
        :title => {
          :query => term
        }
      }
    }
  end
end

Using these classes, we can write a Parslet::Transform to convert the parse tree to a Query:

class QueryTransformer < Parslet::Transform
  rule(:clause => subtree(:clause)) do
    Clause.new(clause[:operator]&.to_s, clause[:term].to_s)
  end
  rule(:query => sequence(:clauses)) { Query.new(clauses) }
end

Since a clause only has a single term, we can use Parslet's subtree to consume each :clause hash from the tree and convert them to Clause instances. Then sequence will consume the array of Clause objects to create the Query.

parse_tree = QueryParser.new.parse('the +cat in the -hat')
query = QueryTransformer.new.apply(parse_tree)
query.to_elasticsearch
# =>
{
  :query => {
    :bool => {
      :should => [
        {
          :match => {
            :title => {
              :query => "the"
            }
          }
        },
        {
          :match => {
            :title => {
              :query => "in"
            }
          }
        },
        {
          :match => {
            :title => {
              :query => "the"
            }
          }
        }
      ],
      :must => [
        {
          :match => {
            :title => {
              :query => "cat"
            }
          }
        }
      ],
      :must_not => [
        {
          :match => {
            :title => {
              :query => "hat"
            }
          }
        }
      ]
    }
  }
}

Now we have a bool query ready to send to Elasticsearch!

The companion source code to this tutorial includes a script that lets you input search queries and see the parse tree and generated Elasticsearch query DSL JSON.

$ bundle exec bin/parse BooleanTermParser

Welcome to the parser test console. Using BooleanTermParser.
Input a query string to see the generated Elasticsearch query DSL.
To exit, use Control-C or Control-D
Input query string:

Give it a try!

Phrase queries

Another important feature for a query parser is to be able to match phrases. In Elasticsearch, this is done with a match_phrase query. A match_phrase query can be used as input for a bool query, just like we previously used the match query. In our query language, an example query might look like "cat in the hat" -green +ham.

[Railroad diagram for Clause: an optional + or - operator, followed by either a term or a phrase of terms wrapped in quotes]

Building on the previous example, let's add rules for matching a phrase, defined as a sequence of one or more terms surrounded by quotation marks.

class QueryParser < Parslet::Parser
  rule(:term) { match('[^\s"]').repeat(1).as(:term) }
  rule(:quote) { str('"') }
  rule(:operator) { (str('+') | str('-')).as(:operator) }
  rule(:phrase) do
    (quote >> (term >> space.maybe).repeat >> quote).as(:phrase)
  end
  rule(:clause) { (operator.maybe >> (phrase | term)).as(:clause) }
  rule(:space)  { match('\s').repeat(1) }
  rule(:query) { (clause >> space.maybe).repeat.as(:query) }
  root(:query)
end

The parse tree for the example query looks like this:

QueryParser.new.parse('"cat in the hat" -green +ham')
# =>
{
  :query => [
    {
      :clause => {
        :phrase => [
          {:term => "cat"@1},
          {:term => "in"@5},
          {:term => "the"@8},
          {:term => "hat"@12}
        ]
      }
    },
    {
      :clause => {
        :operator => "-"@17,
        :term => "green"@18
      }
    },
    {
      :clause => {
        :operator => "+"@24,
        :term => "ham"@25
      }
    }
  ]
}

To support phrases, the Clause object needs to know whether it is a term clause or a phrase clause. To do this, let's introduce separate classes for TermClause and PhraseClause:

class TermClause
  attr_accessor :operator, :term

  def initialize(operator, term)
    self.operator = Operator.symbol(operator)
    self.term = term
  end
end

class PhraseClause
  attr_accessor :operator, :phrase

  def initialize(operator, phrase)
    self.operator = Operator.symbol(operator)
    self.phrase = phrase
  end
end

Other than this change, the code stays quite similar to the boolean term query parser. Query now operates at the clause level rather than the term level and, when generating the bool query, chooses match or match_phrase depending on the clause type.

class Query
  attr_accessor :should_clauses, :must_not_clauses, :must_clauses

  def initialize(clauses)
    # group_by collects every clause for each operator, even when the
    # clauses are not adjacent in the query
    grouped = clauses.group_by(&:operator)
    self.should_clauses = grouped.fetch(:should, [])
    self.must_not_clauses = grouped.fetch(:must_not, [])
    self.must_clauses = grouped.fetch(:must, [])
  end

  def to_elasticsearch
    query = {
      :query => {
        :bool => {
        }
      }
    }

    if should_clauses.any?
      query[:query][:bool][:should] = should_clauses.map do |clause|
        clause_to_query(clause)
      end
    end

    if must_clauses.any?
      query[:query][:bool][:must] = must_clauses.map do |clause|
        clause_to_query(clause)
      end
    end

    if must_not_clauses.any?
      query[:query][:bool][:must_not] = must_not_clauses.map do |clause|
        clause_to_query(clause)
      end
    end

    query
  end

  def clause_to_query(clause)
    case clause
    when TermClause
      match(clause.term)
    when PhraseClause
      match_phrase(clause.phrase)
    else
      raise "Unknown clause type: #{clause}"
    end
  end

  def match(term)
    {
      :match => {
        :title => {
          :query => term
        }
      }
    }
  end

  def match_phrase(phrase)
    {
      :match_phrase => {
        :title => {
          :query => phrase
        }
      }
    }
  end
end

With these classes defined, QueryTransformer can take the parse tree and transform it into a Query:

class QueryTransformer < Parslet::Transform
  rule(:clause => subtree(:clause)) do
    if clause[:term]
      TermClause.new(clause[:operator]&.to_s, clause[:term].to_s)
    elsif clause[:phrase]
      phrase = clause[:phrase].map { |p| p[:term].to_s }.join(" ")
      PhraseClause.new(clause[:operator]&.to_s, phrase)
    else
      raise "Unexpected clause type: '#{clause}'"
    end
  end
  rule(:query => sequence(:clauses)) { Query.new(clauses) }
end

Here is the Elasticsearch query it generates:

parse_tree = QueryParser.new.parse('"cat in the hat" -green +ham')
query = QueryTransformer.new.apply(parse_tree)
query.to_elasticsearch
# =>
{
  :query => {
    :bool => {
      :should => [
        {
          :match_phrase => {
            :title => {
              :query => "cat in the hat"
            }
          }
        }
      ],
      :must => [
        {
          :match => {
            :title => {
              :query => "ham"
            }
          }
        }
      ],
      :must_not => [
        {
          :match => {
            :title => {
              :query => "green"
            }
          }
        }
      ]
    }
  }
}

You can try out the PhraseParser by running bundle exec bin/parse PhraseParser.

With these features, this is a respectable query parser. It supports a simple syntax that's easy to understand and hard to mess up, and more importantly, hard to abuse.
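
At this point it's worth pinning the behavior down with tests. Here's a minimal sketch using minitest (my example; the companion code may organize its tests differently):

require 'minitest/autorun'

# Sanity checks for the phrase parser defined above.
class PhraseParserTest < Minitest::Test
  def build_query(input)
    QueryTransformer.new.apply(QueryParser.new.parse(input))
  end

  def test_phrase_becomes_match_phrase
    es = build_query('"cat in the hat"').to_elasticsearch
    should = es[:query][:bool][:should]
    assert_equal "cat in the hat", should.first[:match_phrase][:title][:query]
  end

  def test_operators_bucket_clauses
    es = build_query('+ham -green').to_elasticsearch
    assert_equal "ham", es[:query][:bool][:must].first[:match][:title][:query]
    assert_equal "green", es[:query][:bool][:must_not].first[:match][:title][:query]
  end
end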

Going beyond generic query parsers: Adding heuristics

So far, what we've built has been aimed at providing a simple user experience—and preventing harmful queries. However, another benefit of building your own query parser is that it is specific to your application, so you can tailor it to your domain.

For example, let's say we are building search for a database of books. We know a lot about the data, and can develop heuristics for users' search input. If we know all publication dates for books in the catalog are from the twentieth and early twenty-first century, then we can turn a search term like 1970 or 1970s into a date range query for the dates 1970 to 1979.

For the search cats 1970s the Elasticsearch query DSL we want to generate is:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "cats"
            }
          }
        },
        {
          "range": {
            "publication_year": {
              "gte": 1970,
              "lte": 1979
            }
          }
        }
      ]
    }
  }
}

To represent this in our grammar, we'll add a new clause type called decade.

[Railroad diagram for Clause: an optional + or - operator, followed by a decade, a term, or a quoted phrase]

Where decade is defined as:

[Railroad diagram for Decade: "19" or "20", then any digit, then "0", then an optional "s"]

To implement this, we add the new decade rule to the parser and use it in the clause rule.

class QueryParser < Parslet::Parser
  rule(:eof) { any.absent? }
  rule(:decade) do
    ((str('1') >> str('9') |
      str('2') >> str('0')) >>
     match('\d') >> str('0')).as(:decade) >>
      str('s').maybe >> (eof | space).present?
  end
  rule(:term) { match('[^\s"]').repeat(1).as(:term) }
  rule(:quote) { str('"') }
  rule(:operator) { (str('+') | str('-')).as(:operator) }
  rule(:phrase) do
    (quote >> (term >> space.maybe).repeat >> quote).as(:phrase)
  end
  rule(:clause) { (operator.maybe >> (phrase | decade | term)).as(:clause) }
  rule(:space) { match('\s').repeat(1) }
  rule(:query) { (clause >> space.maybe).repeat.as(:query) }
  root(:query)
end

A PEG parser always takes the first alternative, so we need to make decade match before term, because a decade is always a valid term. If we didn't do this, the decade rule would never match. We also need to add a lookahead for space or end of input to the decade rule. Without this, input like 1990th would be parsed as a decade 1990 and a term th.
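
You can see ordered choice at work from the console:

include Parslet

# Ordered choice: the first matching alternative wins, even if a later
# alternative could consume more input.
(str('ab') | str('a')).parse('ab')
# => "ab"@0

(str('a') | str('ab')).parse('ab')
# Raises Parslet::ParseFailed: the first alternative matches "a" and
# the trailing "b" is left unconsumed.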

For the transformer, we define a DateRangeClause class that takes a decade's starting year and converts it into a start and end year:

class DateRangeClause
  attr_accessor :operator, :start_year, :end_year

  def initialize(operator, decade)
    self.operator = Operator.symbol(operator)
    self.start_year = decade
    self.end_year = decade + 9
  end
end
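
The transformer needs a corresponding branch for decade clauses. A sketch of how that might look, following the same pattern as the phrase parser's transformer (the companion code may differ in detail):

class QueryTransformer < Parslet::Transform
  rule(:clause => subtree(:clause)) do
    if clause[:term]
      TermClause.new(clause[:operator]&.to_s, clause[:term].to_s)
    elsif clause[:phrase]
      phrase = clause[:phrase].map { |p| p[:term].to_s }.join(" ")
      PhraseClause.new(clause[:operator]&.to_s, phrase)
    elsif clause[:decade]
      # The :decade slice captures the four digits, e.g. "1970"@18.
      DateRangeClause.new(clause[:operator]&.to_s, clause[:decade].to_s.to_i)
    else
      raise "Unexpected clause type: '#{clause}'"
    end
  end
  rule(:query => sequence(:clauses)) { Query.new(clauses) }
end

Query#clause_to_query also needs a DateRangeClause branch that calls date_range(clause.start_year, clause.end_year).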

Finally, we add a date_range method to the Query class that converts a DateRangeClause into the Elasticsearch query DSL.

def date_range(start_year, end_year)
  {
    :range => {
      :publication_year => {
        :gte => start_year,
        :lte => end_year
      }
    }
  }
end

Here is the Elasticsearch query DSL it generates:

parse_tree = QueryParser.new.parse('cats "in the hat" 1970s')
query = QueryTransformer.new.apply(parse_tree)
query.to_elasticsearch
#=>
{
  :query => {
    :bool => {
      :should => [
        {
          :match => {
            :title => {
              :query => "cats"
            }
          }
        },
        {
          :match_phrase => {
            :title => {
              :query => "in the hat"
            }
          }
        },
        {
          :range => {
            :publication_year => {
              :gte => 1970,
              :lte => 1979
            }
          }
        }
      ]
    }
  }
}

You can try out the HeuristicParser by running bundle exec bin/parse HeuristicParser.

Now, thanks to Parslet, we have created a query parser that's purpose-built for our application. We fully control the syntax and Elasticsearch queries it makes, and we can add more heuristics that make sense for our application, but would never be part of a general-purpose query parser.

Next steps

You can take the code in many directions from here. Here are some ideas.

Improving search relevance

Our query parser does not need to be limited to a one-to-one correspondence with Elasticsearch. You may be able to improve search relevance by issuing the same query in different ways. For example, the query cat in the hat should rank documents containing that text as a phrase higher than documents containing the terms individually. You can implement this by creating a bool query with a match_phrase clause alongside the match clauses for the individual terms.
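
For example, the combined query could look like this (a sketch, not generated by the tutorial's code as written):

# Rank phrase matches above individual term matches by combining a
# match_phrase clause with per-term match clauses in one bool query.
{
  :query => {
    :bool => {
      :should => [
        {:match_phrase => {:title => {:query => "cat in the hat"}}},
        {:match => {:title => {:query => "cat"}}},
        {:match => {:title => {:query => "in"}}},
        {:match => {:title => {:query => "the"}}},
        {:match => {:title => {:query => "hat"}}}
      ]
    }
  }
}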

Error handling, reporting, and fallback

The current query parser raises an exception if the query can't be parsed. You can catch this in your application before sending the query to Elasticsearch and report the error to the user. For a more user-friendly solution, you could try to correct the error in the input and re-parse, or fall back to a simple match query when the input cannot be parsed.
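
A sketch of the fallback approach (the build_query helper is my own illustration):

# Fall back to a plain match query when the input can't be parsed.
def build_query(input)
  parse_tree = QueryParser.new.parse(input)
  QueryTransformer.new.apply(parse_tree).to_elasticsearch
rescue Parslet::ParseFailed
  # Punt: treat the whole input as ordinary terms.
  {:query => {:match => {:title => {:query => input, :operator => "or"}}}}
end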

Limiting query complexity

We talked about how users can trigger expensive queries, but the current parser doesn't guard against that yet. One advantage of controlling the parser is that you can limit complexity after parsing: the query is just a list of clauses, so you can keep only the first N clauses without the risk of breaking a quoted phrase the way naive string truncation can.
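
A sketch of capping the clause count in the transformer (MAX_CLAUSES is a made-up limit; tune it for your application):

MAX_CLAUSES = 32

class QueryTransformer < Parslet::Transform
  # ... the clause rules stay the same ...

  # Keep only the first MAX_CLAUSES clauses; unlike truncating the raw
  # query string, this can never split a phrase in half.
  rule(:query => sequence(:clauses)) do
    Query.new(clauses.first(MAX_CLAUSES))
  end
end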

Field configuration for query generation

To be truly useful, the query generator needs to know the schema of the mapping it is generating Elasticsearch query DSL for. In the example code, we hard-coded a text field called title and an integer field called publication_year. A real schema will probably have many more fields.

Additional field types open up new search opportunities, too. Imagine your mapping has a keyword field called sku for storing the exact text of SKUs. By adding SKU detection to the query parser, you could generate a term query clause for the sku field when a user types in an SKU.
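
A sketch of what that detection could look like (the SKU pattern and the term_to_query helper are hypothetical):

# Route exact-looking SKUs to a term query on the keyword field and
# everything else to the usual match query. The pattern is made up.
SKU_PATTERN = /\A[A-Z]{3}-\d{4}\z/

def term_to_query(term)
  if term =~ SKU_PATTERN
    {:term => {:sku => term}}
  else
    {:match => {:title => {:query => term}}}
  end
end

term_to_query("ABC-1234") # => {:term => {:sku => "ABC-1234"}}
term_to_query("cat")      # => {:match => {:title => {:query => "cat"}}}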


Footnotes

[1] OK, it was me.

Thanks to Marshall Scorcio, Natthu Bharambe, and Quin Hoxie for reviewing drafts of this tutorial. All errors are my own.

Additional thanks to Kaspar Schiess for creating Parslet and Tab Atkins for his terrific railroad diagram generator.
