Building a search query parser

tl;dr: I built and open-sourced a Twitter search query parser, written in JavaScript, specified using PEG and generated by James Coglan’s Canopy library.

While building an interface for Twitter search queries the TweetDeck team found we needed to be able to convert from a search query string into a data structure to generate the UI, and back to a string for sending queries to the API.

So, I built a thing…

The problem

There were two basic tasks:

Convert a search query string — like "hello, world" javascript OR python -java — into a data structure
Stringify the data structure back into a search query for storage and use in the API

It seemed useful to have a way to avoid complex queries that the UI doesn’t support, too.

The solution

While totally not my area of expertise (having never taken CS properly, I missed out on the “compiler internals” classes) I’ve tinkered and read just enough to recognise the solution space: grammars, ASTs and parsers.

In summary, the twitter-search-query-parser is:

a Parsing Expression Grammar — a set of rules for recognising strings — describing the grammar of a Twitter search query
a parser that can turn a search query string into an AST
AST node types that define reduction, stringification and simplification for the rules of the grammar
methods for parsing, stringifying and simplifying a search query

Typical usage of the library to generate a UI would look like this:

Load search query from storage
Pass to parser for parsing into the interchange format
Simplify to remove unsupported types (eg, Group and Or)
Render query UI using interchange format

To get back to a search query string, usage might be:

Convert the UI representation back into the interchange format
Pass to the stringifier
Save the resulting string back to storage

The grammar

The search query grammar is defined PEG format.

Grammars are structural rules describing the syntax of a language, and PEGs use recursive string-matching rules.

The following grammar describes lists (of lists) of numbers, like [1,2] or [[1,2],3]:

grammar Lists
  value   <-  list / number
  list    <-  "[" value ("," value)* "]"
  number  <-  [0-9]

Each line is a rule. For example, value <- list / number describes that a value node is either a list or a number. The line belows describe a list, using the value rule recursively.

Two terms next to each other concatenate: the rule "a" "b" matches the string ab, not a b.

The other important component relating to the grammar is a parser that implements it.

I used James’s parser-generator library Canopy. Passing a grammar file to Canopy generates code for a dependency-free parser to which you can pass strings and receive a parsed tree, or a syntax error. This is what the library uses internally to parse the query into an AST.

Canopy supports multiple output languages (JavaScript, Java, Python and Ruby) which makes the grammar of this library potentially even more useful in the future!

I’ve barely touched on parser generation here, but there’s more documentation on the Canopy site.

I learnt a whole load about PEGs. They’re simple, but combine in complex ways to create complicated grammars. I made a few mistakes as I went along, so here’s a couple of tips:

Be specific first and generalise later. It’s better to have a specific rule (like filter <- "filter:" [\S]+) first, as generalisation can cause some matched-the-wrong-thing errors. You can see me generalising in this commit.
If you’re using types, a rule of thumb is: annotate rules that doesn’t contain a forward-slash (/, aka “or”). This is because rules that only contain cross-references and slashes are not in the parse tree and so require extra work.

Interchange format

After parsing, a “reduction” step takes place. There’s probably a computer-sciencey term for this but I don’t know it!

We do this because parsed tree that Canopy’s parser returns is verbose even for small queries, and not representative of the actual query.

Here’s snippet just for the query “a b”:

{ text: 'a b',
  offset: 0,
  elements:
   [ { text: '', offset: 0, elements: [] },
     { text: 'a b',
       offset: 0,
       elements:
        [ { text: 'a ',
            offset: 0,
            elements:
             [ ... ],
          }, ... ] }, ... ] }

When reduced into the interchange format, it looks like this:

[ 'And',
  [ [ 'Including', [ 'Text', 'a' ] ],
    [ 'Including', [ 'Text', 'b' ] ] ] ]

Much simpler! But why is it structured this way?

A basic query is a space-separated list of terms. In the query a b, the terms are a and b. By default each term is “included” in the query. To represent this, the interchange format is a tree:

          And
         /   \
 Including   Including
     |           |
  Text "a"    Text "b"

If the query negated the b term – a -b – the tree would be:

          And
         /   \
 Including   Excluding
     |           |
  Text "a"    Text "b"

Each node in the tree has a particular “type”, indicated by the first item of the tuple-ish array. In the node ['And', [ ... ]], the type is And. For ['Text', 'b'] it’s Text.

Types specify how to reduce, stringify and simplify the parsed tree and the interchange format, and you can find them over here.

Reducing the full parse tree is a depth-first compression of the tree. Each node knows how to reduce itself by combining a type, like And, with the reduction of its child nodes.

This strategy is nice as adding support for a new term or operator only requires annotating a PEG rule with a type and adding an implementation to the types file.

To parse a query, you pass a query string to the parse method:

parse('@jack from:twitter')

Stringification

We’ve seen how the parse and reduce steps work, but what about stringification?

One of my goals with this project was to make the stringification as close to the reverse of parsing as possible: a query should parse and stringify back to the same query. The goal is to avoid surprising users writing search queries by changing what they typed.

Right now, the interchange format doesn’t store information about superfluous whitespace. This is a possible area for improvement.

The general stringification strategy is similar to reduction: a depth-first walk of the tree, stringifying each node according to its type.

The technique isn’t quite the same as reduction because the a particular type’s stringify implementation is chosen according to its type annotation and called statically, whereas the reduce method is called on an instance of a parse tree node.

Despite their differences, both are in the types file. Hopefully that will make implementing new types simple, or at least more obvious.

To stringify a query, you pass a data structure in the interchange format to the stringify method:

stringify(
  [ 'And',
    [ [ 'Including', [ 'Text', '@jack' ] ],
      [ 'Including', [ 'Pair', 'from', 'twitter' ] ] ] ]
) // ==> '@jack from:twitter'

Keep it simple, stupid

To recap, this library is going form part of a UI over a search query.

While we intend to have pretty good coverage of the available operators, we need to allow users to type search operators that the UI doesn’t support. In particular, we’re never likely to support some operators like OR and groups (a b) OR (c d) as they’re kinda complex.

To enable full query support without a UI representation of every operator, I built simplification into the library. This allows a developer to elide unsupported operators from the parsed interchange format.

For example, these two trees represent the same query (@jack OR @twitter -(@support OR @twitterdev)) and stringify to the same result, but the latter has had OR and Group types removed:

[ 'And',
  [ [ 'Including',
      [ 'Or',
        [ [ 'Including', [ 'Text', '@jack' ] ],
          [ 'Including', [ 'Text', '@twitter' ] ] ] ] ],
    [ 'Excluding',
      [ 'Group',
        [ 'And',
          [ [ 'Including',
              [ 'Or',
                [ [ 'Including', [ 'Text', '@support' ] ],
                  [ 'Including', [ 'Text', '@twitterdev' ] ] ] ] ] ] ] ] ] ] ]

[ 'And',
  [ [ 'Including', [ 'Text', '@jack OR @twitter' ] ],
    [ 'Excluding', [ 'Text', '(@support OR @twitterdev)' ] ] ] ]

As you can see, it radically simplifies the data structure while still maintaining a reasonable representation of the query.

You can go even further and reduce the query to just Text if that’s all that’s supported:

[ 'Text', '@jack OR @twitter -(@support OR @twitterdev)' ]

Or something in the middle, perhaps useful for a “pill” UI that allows the user to type individual search strings into a text box, appending each time:

[ 'And',
  [ [ 'Text', '@jack OR @twitter' ],
    [ 'Text', '-(@support OR @twitterdev)' ] ] ]

Simplification is a depth-first walk of the tree and internally uses the stringify feature to collapse a sub-tree into a Text node.

To simplify a tree you pass it to simplify, and specify which node types you don’t support:

simplify(parse('@jack OR @twitter -(@support OR @twitterdev)'), {
  disallow: ['Group', 'Or']
})

It’s possible that “disallowing” certain nodes is not the correct approach: if a new type were added to the library, the developer would immediately have to disallow it. It seems the better approach would be to explicitly allow supported node types. Another avenue for improvement.

Testing strategy

I also want to make a quick note about the way this library is tested as the technique it might be useful for others.

As the goal was to have reversible parsing (query -> parse -> (simplify ->) stringify -> query), I tested it by checking that each test case against itself as well as the expected values:

testCases.forEach(([name, rawQuery, expected]) => {
  // Clean up the query a bit
  const query = rawQuery.split('\n').map(v => v.trim()).join(' ');
  test(name, t => {
    // query -> parsed
    t.deepEqual(
      parse(query),
      expected
    );
    // stringify -> query
    t.deepEqual(
      stringify(expected),
      query
    );
    // query -> parsed -> query
    t.deepEqual(
      stringify(parse(query)),
      query
    );
    // parsed -> stringify -> parsed
    t.deepEqual(
      parse(stringify(expected)),
      expected
    );
  });
});

You can see this in action over here.

I’m also looking into QuickCheck-style property-based testing. This kind of test generates realistic-but-random data and operations, and asserts that certain properties hold. I haven’t found a really great tool like this for JavaScript, but there are a couple that look promising.

Summing up

This was a fun project to work on. It was a great learning experience, though I’m still no expert, and who knows? Maybe other people will find this post, Canopy or the library useful!