How to write a parser with Nearley

How to write a parser with Nearley

·

7 min read

Introduction

Parsers turn strings of characters into meaningful data structures (like a JSON object!). nearley is a fast, feature-rich, and modern parser toolkit for JavaScript.

Nearley is an easy to understand parser with its own (also easy to understand) language for writing context-free grammars.

In this tutorial we will go over how to make a parser that spans multiple Nearley files and turning it into a JavaScript file.

We wont just use Nearley, we will also use the moo lexer/tokenizer with it.

Hang on... what is a lexer?

Dalip S Mahal on quora writes:

A lexer is a software program that performs lexical analysis. Lexical analysis is the process of separating a stream of characters into different words, which in computer science we call 'tokens' . When you read my answer you are performing the lexical operation of breaking the string of text at the space characters into multiple words.

Why do I need to make my own language?

You don't.

However, sometimes it is just nice to have your own way of writing things that you how it works. If you are just starting out with programming, please click away. You need to have some experience with NodeJS and understand the theory behind how it and languages in general work.

Getting set up

Before we get started, there are a few things you need.

  1. A code editor (we will use Visual Studio Code)
  2. Node.JS (preferably v12.x or above. I will use v14.5.0)

Once you have Visual Studio Code installed, you need to install the following extension (which you can do so from the extensions tab)

I prefer using Visual Studio Code for this because WebStorm doesn't have an extension for Nearley grammar and I am too lazy to write it without highlighting and auto completion.

Getting the project ready

First, to get the project ready you must create a folder for this and run npm init -y && npm i moo nearley

After you have done that, add a script to your package.json that is uses the npx nearleyc -o parse/grammar.js lang/lang.ne && node src/parse test/test.example

What this does is turn your grammar files into a JavaScript file Nearley can read during the parsing stage. Then, it runs a script which parses the file test.example and turns it into a JSON file.

Now, so we can test our grammar quickly and easily, we will write the the file that will parse our language.

Make a file called src/parse.js and write this in it. I will explain what this is doing after.

const nearley = require("nearley");
const grammar = require("../parse/grammar.js");
const fs = require('fs');
const path = require('path');

if (!fs.existsSync("./compile")) {
    fs.mkdirSync("./compile");
}

function parse(argf) {
    const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));

    parser.feed(fs.readFileSync(path.resolve(process.cwd(), argf), 'utf8').replace(/\r\n/g, "\n"));

    if (process.argv.includes("--dev")) {
        fs.writeFileSync(`./compile/${argf.split(/[/\\]/)[argf.split(/[/\\]/).length - 1]}` + ".json", JSON.stringify(parser.results, null, 2), 'utf8');
    }
}

parse(process.argv[2]);

I know this looks intimidating at first (depending on your skill level) but don't worry its simple.

Basically, this makes sure that the compile directory exists, and then passes the file we specified in the script (test.example) to the Nearley parser along with the generated grammar. It then takes the output of the parser and writes it to a JSON file with the same name as the file we are parsing but makes it pretty so its easier to read.

Writing the grammar

Nearley uses a very simple way of writing grammars (which can span multiple files)

Create a file called lang/lang.ne

Adding the lexer

We can embed JavaScript into the grammar which is useful because it means we can write our lexer in the grammar files without having to modify any files later on.

How we do this is by surrounding it in @{% // some js %}

Add this to the start of lang.ne

@{%
    const moo = require('moo');

    const lexer = moo.compile({
        ws:     /[ \t]+/,
        number: /[0-9]+/,
        word: /[a-z]+/,
        nl:{ match: /\n/, lineBreaks: true },
    });
%}

@lexer lexer

We can reference these inside our grammar using %name so if we wanted a new line, we could do %nl. If these are simple strings we can also reference them using the string but it is better to reference it using the %name if it uses a regular expression or something else that is complex.

Basic grammar

When writing Nearley grammars, it always starts by parsing using the first rule it finds. I wish I knew this when I started. This first rule must reference all other rules you need in some way or another or they will not be recognized.

Rules are easy to write, to recognize a word, we would write this:

process -> main:+ {%id%}

main -> %word {% (d) => {return d[0].value} %}

There are quite a lot of things going on here. We have two rules, process and main I always start my grammars with process because it makes more sense and make it only reference the main rule so that I can parse multiple things in a file without things getting complicated.

:+ is an ebnf (or something like that) expression? which basically tells the parser we want to recognize multiple if these.

We have %ws as normal which references ws in the lexer.

You probably have noticed we have some, JavaScript after them? Yes. This is known as a postprocessor. This lets us pass some JSON back to the parser which is given back to us in the parser results. It is a function which is passed a list of data based on the rule we wrote.

Here is an example

hello -> "hello" %ws "world {% (d) => { return d[0] + d[2] }

If we passed hello world to the parser, this would return "helloworld" inside an array or two. This is because we are getting the string at index 0 and 2 of the rule. The whitespace character is at index 1. If we change it to return d[1], we would get an object that moo gave us with information about it. This is why I return d[0].value. Because the word is at index 0, and the object returned from moo has the value stored inside the value property.

Nearley also has a built in function called id which is pretty much just {% (d) => return d[0] %}

Kind-of complex grammar

With Nearley, you have the freedom to do a lot. Lets say you wanted to parse something like

SET somevariable TO "hello"

I like to split my rules up so I can re-use parts of it.

This is how I would do it:

@{%
    const moo = require('moo');

    const lexer = moo.compile({
        ws:     /[ \t]+/,
        number: /[0-9]+/,
        word: /[a-z]+/,
        nl:{ match: /\n/, lineBreaks: true },

        set: "SET",
        to: "TO,
    });
%}

@lexer lexer

process -> main:+ {%id%}

main -> variable {%id%}

variable -> "SET" %ws "TO" value {% (d) => return {type: "variable", name: d[1].value, value: d[3]} %}

value -> %string %ws %string {% (d) => return { type: "string", value: d[1].value } %}

This should work although I haven't had time to test it.

To test it, simply just run the script you created at the start and it will compile your grammar, parse the file, and save the results to a json file.

Finishing up

That is how to make a basic grammar. I stopped it early because in the future I will continue it by writing a post on how to turn the parser results into a JavaScript file or assembly that will be compiled.

Thank you for reading!

Photo by Kevin Ku on Unsplash