#StackBounty: #javascript #parsing #ecmascript-6 #lexical-analysis A simple parser generator

Bounty: 50

I need to parse simple DSLs in a few projects. Since I don’t know BNF or other grammars, I figured an alternative would be to use a simple parser generator.

I’m looking for improvements to the lexer/parser in order to be able to use it to parse more complex languages in future projects while keeping a relatively simple interface to define a grammar.

Feedback to increase code quality would be highly appreciated.
I’d also like to know if I’m missing crucial features a lexer/parser would have to include.
If I’m doing anything inherently wrong or use inappropriate techniques, that would be helpful to know as well.

I’ll include a simple usage example at the beginning and post the code and the runnable snippet at the bottom. I think the code is easier to follow in that order.

Here’s an example of how to tokenize a basic arithmetic expression like 1 + 2 + 3 * 4 * 5 * 6 + 3.

const tokenDefinitions = [
    TokenFactory({type:'Whitespace', ignore: true}).while(/^\s+$/),
    TokenFactory({type:'Integer'}).start(/-|\d/).next(/^\d$/),
    TokenFactory({type:'Paren'}).start(/^[()]$/),
    TokenFactory({type:'Addition'}, true).start(/^\+|-$/),
    TokenFactory({type:'Multiplication'}, true).start(/^\*|\/$/),
];

const src = '1 + 2 + 3 * 4 * 5 * 6 + 3'
const lexer = Lexer(tokenDefinitions);
const tokens = lexer(src).filter(t => !t.ignore);
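
For the source above, the filtered token stream looks roughly like this. Note that for Integer and Paren tokens the type sits on the prototype, because TokenFactory only copies it onto the token itself when its assign flag is set:

// Roughly:
// [
//   { value: '1' },                        // type 'Integer' inherited from the prototype
//   { type: 'Addition', value: '+' },
//   { value: '2' },
//   { type: 'Addition', value: '+' },
//   { value: '3' },
//   { type: 'Multiplication', value: '*' },
//   ...
// ]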

Here’s an example to parse the tokens to an AST.

const Any = new Driver('Any').match(_ => true);
const Number = new Driver('Number').match(type('Integer')).bind(0, 0);
const RParen = new Driver('RParen').match(value(')')).bind(100, 0);
const Expression = new Driver('Expression').match(value('(')).consumeRight().end(value(')')).bind(0, 99)
const MulOperator = new Driver('Operator').match(type('Multiplication')).consumeLeft(Any).consumeRight().bind(60,60)
const AddOperator = new Driver('Operator').match(type('Addition')).consumeLeft(Any).consumeRight().bind(50,50)

const nodeDefinitions = [
    MulOperator,
    AddOperator,
    Number,
    Expression,
    RParen,
];

const parse = Parser(nodeDefinitions);
const ast = parse(tokens);

This example uses left and right binding powers to define the precedence of multiplication over addition. You can get the same result by using .until, but that feels kind of wrong.

const Any = new Driver('Any').match(_ => true);
const Number = new Driver('Number').match(type('Integer'));
const RParen = new Driver('RParen').match(value(')'));
const Expression = new Driver('Expression').match(value('(')).consumeRight().until(value(')')).end(value(')'))
const MulOperator = new Driver('Operator').match(type('Multiplication')).consumeLeft(Any).consumeRight().until(parentOr(type('Addition')))
const AddOperator = new Driver('Operator').match(type('Addition')).consumeLeft(Any).consumeRight().until(parent)

In this example, the multiplication operator consumes tokens until it encounters an addition token or, if it is inside an expression, a right parenthesis.
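
Following the same pattern, another operator could presumably be slotted in at its own precedence level. This is only a sketch: the Exponent token type, its lexer rule and the binding power of 70 are assumptions of mine, not part of the grammar above.

// assumed lexer rule: TokenFactory({type:'Exponent'}, true).start(/^\^$/)

// bind style: binds tighter than multiplication (60) and addition (50)
const PowOperator = new Driver('Operator').match(type('Exponent')).consumeLeft(Any).consumeRight().bind(70, 70)

// until style: stop consuming when a lower-precedence operator or the parent's terminator shows up
const PowOperatorU = new Driver('Operator').match(type('Exponent')).consumeLeft(Any).consumeRight().until(or(parent, type('Multiplication'), type('Addition')))

The new driver would also have to be added to nodeDefinitions, and the interpreter’s operators map would need a '^' entry such as (a, b) => a ** b.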

Both the bind and the until examples produce the following AST.

[
  {
    children: [
      { children: [], token: { value: '1' }, id: 'Number' },
      {
        children: [
          { children: [], token: { value: '2' }, id: 'Number' },
          {
            children: [
              {
                children: [
                  { children: [], token: { value: '3' }, id: 'Number' },
                  {
                    children: [
                      {
                        children: [],
                        token: { value: '4' },
                        id: 'Number'
                      },
                      {
                        children: [
                          {
                            children: [],
                            token: { value: '5' },
                            id: 'Number'
                          },
                          {
                            children: [],
                            token: { value: '6' },
                            id: 'Number'
                          }
                        ],
                        token: { type: 'Multiplication', value: '*' },
                        id: 'Operator'
                      }
                    ],
                    token: { type: 'Multiplication', value: '*' },
                    id: 'Operator'
                  }
                ],
                token: { type: 'Multiplication', value: '*' },
                id: 'Operator'
              },
              { children: [], token: { value: '3' }, id: 'Number' }
            ],
            token: { type: 'Addition', value: '+' },
            id: 'Operator'
          }
        ],
        token: { type: 'Addition', value: '+' },
        id: 'Operator'
      }
    ],
    token: { type: 'Addition', value: '+' },
    id: 'Operator'
  }
]

You can flatten the recursive structure of the AST by changing the grammar of the addition and multiplication tokens: .repeat makes a node repeatedly parse its right-hand side for as long as its condition matches, while .unfold recurses first and flattens the structure after the node has been parsed. This can reduce the size of the AST a lot.

[
  {
    children: [
      { children: [], token: { value: '1' }, id: 'Number' },
      { children: [], token: { value: '2' }, id: 'Number' },
      {
        children: [
          { children: [], token: { value: '3' }, id: 'Number' },
          { children: [], token: { value: '4' }, id: 'Number' },
          { children: [], token: { value: '5' }, id: 'Number' },
          { children: [], token: { value: '6' }, id: 'Number' }
        ],
        token: { type: 'Multiplication', value: '*' },
        id: 'Operator'
      },
      { children: [], token: { value: '3' }, id: 'Number' }
    ],
    token: { type: 'Addition', value: '+' },
    id: 'Operator'
  }
]
For example, here is the addition operator with .repeat:

const AddOperator = new Driver('Operator').match(type('Addition')).consumeLeft(Any).consumeRight().until(parent).repeat()
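
The .unfold variant would presumably look like this (a sketch based on the Driver API posted below; I have not listed its output separately):

const AddOperator = new Driver('Operator').match(type('Addition')).consumeLeft(Any).consumeRight().until(parent).unfold()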

Here’s an example of how to interpret the AST.
It doesn’t matter whether the AST is flattened or not; all versions (bind/until, repeat/unfold) are interpreted correctly because the semantics don’t change.

const operators = {
    '+': (a,b) => a+b,
    '-': (a,b) => a-b,
    '*': (a,b) => a*b,
    '/': (a,b) => a/b,
};

const hasId = id => token => token.id === id;
const tokenValue = node => node.token.value;

const NrBh = new Behaviour(hasId('Number'), n => +tokenValue(n))
const OpBh = new Behaviour(hasId('Operator'), (node, _eval) =>  node.children.map(c => _eval(c)).reduce(operators[tokenValue(node)]));
const ExprBh = new Behaviour(hasId('Expression'), (node, _eval) => _eval(node.children[0])); // the inner expression is the node's first child

const behaviours = [NrBh, OpBh, ExprBh];
const res = Behaviour.eval(ast[0], behaviours); // 366
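
Behaviours don’t have to evaluate anything; as a sketch (the names are mine), the same AST could presumably be rendered back to an infix string with a second behaviour set:

const PrintNr = new Behaviour(hasId('Number'), n => tokenValue(n));
const PrintOp = new Behaviour(hasId('Operator'), (node, _print) => '(' + node.children.map(c => _print(c)).join(' ' + tokenValue(node) + ' ') + ')');
const printed = Behaviour.eval(ast[0], [PrintNr, PrintOp]); // e.g. '(1 + 2 + (3 * 4 * 5 * 6) + 3)' for the flattened AST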

Here’s the code for the lexer.

//Matcher.js

const setInstanceProp = (instance, key, value) => (instance[key] = value, instance);

/**
 * The Matcher defines multiple regular expressions or functions that are matched against a single character at different positions.
 */
class Matcher {
    constructor (transform) {
        /** Can be given a transform function that transforms the token value */
        if (typeof transform === 'function')
            this._transform = transform
    }

    /** Consumes a character once at the beginning.*/
    start (regExp) {return setInstanceProp(this, '_start', regExp)}
    /** Consumes a character each step*/
    next (regExp) {return setInstanceProp(this, '_next', regExp)}
    /** Consumes a character and terminates the current token*/
    end (regExp) {return setInstanceProp(this, '_end', regExp)}
    /** Consumes characters as long as the regExp matches */
    while (regExp) {return setInstanceProp(this, '_while', regExp)}

    /** Tests a regex or function against a character */
    _test (obj, char)  {
        if (typeof obj === 'function')
            return obj(char);
        if (obj instanceof RegExp)
            return obj.test(char);
        return false;
    }

    /** Tests a character and token against the defined regexes/functions. Can be given a hint to test a specific regex/fn */
    test (char, token = '', hint)  {
        if (hint === null) return false;
        if (hint) return this._test(hint, char)
        if (this._start && !token) return this._test(this._start, char);
        if (this._next)  return this._test(this._next, char);
        if (this._while) return this._test(this._while, token + char);
        
        return false;
    }

    /** Default transform behaviour. Returns the primitive token value */
    _transform (token) {
        return token;
    }

    /** Called by the tokenizer to transform the primitive token value to an object*/
    transform (token) {
        return this._transform(token);
    }
}

/** Creates a matcher that transforms the matched token into an object with a prototype that shares common information*/
const TokenFactory = (proto, assign) => new Matcher((value) => {
    if (typeof value === 'object') return value
    if (assign)
        return Object.assign({}, proto, {value})
    return Object.assign(Object.create(proto), {value})
});

module.exports = {Matcher, TokenFactory};
//Lexer.js

const {Matcher} = require('./Matcher');

const Lexer = (def) =>  (src) => {
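    // The accumulator is a triple: [textOfTheTokenBeingBuilt, matcherThatStartedIt, tokensEmittedSoFar].
    // Each character either closes the current token (when the matcher's `end` rule matches it),
    // extends it, or flushes it and starts a new token with the first matcher that accepts the character.
    // Whatever is left when the input runs out is flushed as the final token.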
    return src.split('').reduce((acc, char, i, arr) => {
        let [token, lastMatcher, tokens] = acc;
        const {_end = null} = lastMatcher; let ret; 
        if (lastMatcher.test(char, token, _end)) {
            ret = [lastMatcher.transform(token+char), new Matcher, tokens];
        } else if (lastMatcher.test(char, token)) {
            ret = [token+char, lastMatcher,tokens];
        } else {
            const matcher = def.find(matcher => matcher.test(char));
            if (!matcher) throw new Error(`No matcher found for character '${char}'.`);
            token && tokens.push(lastMatcher.transform(token));
            ret = [char, matcher, tokens];
            lastMatcher = matcher;
        }

        if (i === arr.length - 1) {
            tokens.push(lastMatcher.transform(ret[0]));
            ret = tokens;
        }

        return ret;
    }, ['', new Matcher, []]);
}

module.exports = {Lexer};

Here’s the code for the parser.

//Driver.js
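
// A Driver is a small builder describing how one kind of node is recognised and assembled:
//   match        - predicate (or a type/value string) identifying the starting token
//   consumeLeft  - take the previously parsed node as the left child
//   consumeRight - recursively parse the following tokens as right children (at most n of them)
//   until / end  - predicates that stop the right-hand parse / consume the closing token
//   bind         - left/right binding powers used for operator precedence
//   repeat/unfold - flatten chains of the same operator instead of nesting them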

class Driver {
    constructor (id, transform) {
        this.id = id;
        this._transform = transform;
        this.bind();
    };

    match (token) {
        this._match = token;
        return this;
    }
    consumeLeft (token) {
        this._consumeLeft = token;
        return this;
    }

    consumeRight (token = true, n = Infinity) {
        this._consumeRight = token;
        this.n = n;
        return this;
    }

    end (token) {
        this._end = token;
        return this;
    }

    unfold () {
        this._unfold = true;
        return this;
    }

    until (token, lookAhead = 0) {
        this._until = token;
        this._lookAhead = lookAhead;
        return this;
    }

    repeat (token) {
        this._repeat = true;
        return this;
    }

    test (token, nodes = []) {
        let ret;
        if (typeof this._match === 'function')
            ret = this._match(token);
        else if (this._match) {
            ret = token.type === this._match || token.value === this._match;
        }

        if (this._consumeLeft) {
            const lhs = nodes.slice().pop();
            ret = ret && lhs && (lhs.id === this._consumeLeft.id || this._consumeLeft.test(lhs.token));
        }

        return ret;
    }

    transform (node) {
        if (typeof this._transform === 'function')
            return {...this._transform(node), id: this.id};
        return {...node, id: this.id};
    }
    
    bind (l = 0, r = 0) {
        this.lbp = l;
        this.rbp = r;
        return this;
    }
}

module.exports = {Driver};
//Parser.js 

const Parser = nodeDefinitions => {
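    // `nodes` is a single stack shared by every recursive `parse` call: finished nodes are
    // pushed onto it, and drivers with consumeLeft pop their left-hand side off it.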
    const nodes = [];
    return function parse (tokens, parents = []) {
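        // `tokens` is consumed destructively (shift, with unshift to back up by one token);
        // `parents` is the chain of drivers whose consumeRight is currently being filled, innermost first.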
        if (tokens.length === 0) return [];

        const [parent, ...rest] = parents;
        let i=0;

        do {
            const token = tokens.shift();

            const node = {children:[]};
            const cur = nodeDefinitions.find (d => d.test(token, nodes));

            if (!cur) {
                throw new Error(`Unexpected token ${JSON.stringify(token)}`);
            }

            let next = tokens[0]
            const nextDriver = next && nodeDefinitions.find (d => d.test(next, nodes));
            
            if (parent && nextDriver && parent.rbp < nextDriver.lbp) {
                tokens.unshift(token);
                break;
            }
            
            next = parent && (parent._lookAhead==0?token:tokens[parent._lookAhead - 1]);
            if (parent && parent._until && next && parent._until(next, parents, nodes)) {
                tokens.unshift(token);
                break;
            }       

            if (cur._consumeLeft) {
                const lhs = nodes.pop();
                if (!cur.test(token, [lhs]))
                    throw new Error(`Expected token ${cur._consumeLeft._match} but found ${lhs.token.type} instead. ${cur.id}`)
                node.children.push(lhs);
            }

            if (cur._consumeRight) {
                let repeat = false;
                do {
                    parse(tokens, [cur, ...parents]);
                    const rhs = nodes.shift();
                    node.children.push(rhs);
                    if (tokens[0] && cur.test(tokens[0], [node.children[0]])) {
                        tokens.shift();
                        repeat = true;
                    } else {
                        repeat = false;
                    }
                } while (repeat);
            }
            
            node.token = token;

            if (cur._unfold) {
                const rhs = node.children.slice(-1)[0];
                const un = rhs.children;
                if (node.token.value === rhs.token.value) {
                    node.children = [node.children[0], ...un];
                }
            } 

            if (cur._end && cur._end(tokens[0] || {}, cur, nodes)) {
                node.end = tokens.shift();
            }

            nodes.push(cur.transform(node));

            if (parent && ++i === parent.n) break;
        } while (tokens.length);

        return nodes;
    }
}

module.exports = {Parser};

Here’s the code for the interpreter.

//Behaviour.js

class Behaviour {
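    // eval walks the AST: it picks the first behaviour whose testFn matches the node and calls its
    // evalFn with the node and a recursive evaluator (which may be handed a different behaviour set).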
    static eval (ast, behaviours) {
        const node = ast;
        const beh  = behaviours.find(b => b.testFn(ast)); 
        if (!beh)
            throw new Error(`No behaviour found for node ${JSON.stringify(node)}`)
        return beh.evalFn(node, (node, _behaviours = behaviours) => {
            const val = Behaviour.eval(node, _behaviours)
            return val;
        });
    }
    constructor (testFn, evalFn) {
        this.testFn = testFn;
        this.evalFn = evalFn;
    }
}

module.exports = {Behaviour};

Here’s the complete example as a runnable snippet (it relies on the Matcher, Lexer, Driver, Parser and Behaviour code listed above).

const tokenDefinitions = [
    TokenFactory({type:'Whitespace', ignore: true}).while(/^\s+$/),
    TokenFactory({type:'Integer'}).start(/-|\d/).next(/^\d$/),
    TokenFactory({type:'Paren'}).start(/^[()]$/),
    TokenFactory({type:'Addition'}, true).start(/^\+|-$/),
    TokenFactory({type:'Multiplication'}, true).start(/^\*|\/$/),
];

const src = '1 + 2 + 3 * 4 * 5 * 6 + 3'
console.log ('Source', src);

const lexer = Lexer(tokenDefinitions);
const tokens = lexer(src).filter(t => !t.ignore);

console.log("Tokens", tokens);

const type = type => token => token.type === type;
const value = value => token => token.value === value;
const parent =  (token, parents, nodes) => parents[1] && parents[1]._until(token, parents.slice(1), nodes) ;
const or = (...fns) => (token, parents, nodes) => fns.reduce((a, fn) => a || fn(token, parents, nodes), false);
const and = (...fns) => (token, parents, nodes) => fns.reduce((a, fn) => a && fn(token, parents, nodes), true);
const parentOr = fn => or(parent, fn);
const keyword = token => type('Identifier')(token) && keywords.some(k => value(k)(token));
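// The predicates above are used as match/until/end callbacks; until callbacks also receive the
// parent chain and the node stack, which is what parent, or, and and parentOr build on.
// (keyword expects a `keywords` array to be defined elsewhere and is unused in this example.)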

// const Any = new Driver('Any').match(_ => true);
// const Number = new Driver('Number').match(type('Integer')).bind(0, 0);
// const RParen = new Driver('RParen').match(value(')')).bind(100, 0);
// const Expression = new Driver('Expression').match(value('(')).consumeRight().end(value(')')).bind(0, 99)
// const MulOperator = new Driver('Operator').match(type('Multiplication')).consumeLeft(Any).consumeRight().bind(60,60)
// const AddOperator = new Driver('Operator').match(type('Addition')).consumeLeft(Any).consumeRight().bind(50,50)

const Any = new Driver('Any').match(_ => true);
const Number = new Driver('Number').match(type('Integer'));
const RParen = new Driver('RParen').match(value(')'));
const Expression = new Driver('Expression').match(value('(')).consumeRight().until(value(')')).end(value(')'))
const MulOperator = new Driver('Operator').match(type('Multiplication')).consumeLeft(Any).consumeRight().until(or(parent,type('Multiplication'),type('Addition'))).repeat()
const AddOperator = new Driver('Operator').match(type('Addition')).consumeLeft(Any).consumeRight().until(parentOr(type('Addition'))).repeat();

const nodeDefinitions = [
    MulOperator,
    AddOperator,
    Number,
    Expression,
    RParen,
];

const parse = Parser(nodeDefinitions);
const ast = parse(tokens);

console.log("AST", ast);

const operators = {
    '+': (a,b) => a+b,
    '-': (a,b) => a-b,
    '*': (a,b) => a*b,
    '/': (a,b) => a/b,
};

const hasId = id => token => token.id === id;
const tokenValue = node => node.token.value;

const NrBh = new Behaviour(hasId('Number'), n => +tokenValue(n))
const OpBh = new Behaviour(hasId('Operator'), (node, _eval) =>  node.children.map(c => _eval(c)).reduce(operators[tokenValue(node)]));
const ExprBh = new Behaviour(hasId('Expression'), (node, _eval) => _eval(node.children[0])); // the inner expression is the node's first child

const behaviours = [NrBh, OpBh, ExprBh];
const res = Behaviour.eval(ast[0], behaviours);

console.log ("Result", res)

