    from jupyLR import Scanner, Automaton

    # Each keyword argument names a token type; its value is the matching regex.
    my_scanner = Scanner(zero='0', one='1', star='[*]', plus='[+]')

    # One rule per line: a rule name, the "=" sign, then the production.
    grammar = """
    E = E plus B
    E = E star B
    E = B
    B = zero
    B = one
    """

    a = Automaton('E', grammar, my_scanner)
    a('1+1')
The scanner converts a text into a stream of tokens. A token is a pair (token_type, token_value). jupyLR provides a convenient way to define a scanner with the Scanner class. As the __init__ docstring states:
Each named keyword is a token type and its value is the corresponding regular expression. Returns a function that iterates tokens in the form (type, value) over a string.
Special keywords are discard_names and discard_values, which specify lists of token names or values that must be discarded from the scanner output.
As an undocumented feature, the scanner holds the list of token types in its attribute 'tokens'.
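The scanner can also be used on its own to tokenize a string. The following is a minimal sketch: the token names word, num and space, the sample text, and the exact shape of the emitted tuples are illustrative assumptions, not part of the documented API.

    from jupyLR import Scanner

    # Hypothetical token types for illustration; 'space' tokens are discarded
    # by name so that whitespace never reaches the parser.
    scan = Scanner(word='[a-z]+', num='[0-9]+', space=' +',
                   discard_names=['space'])

    print scan.tokens        # the list of token types (undocumented attribute)
    for token in scan("foo 42 bar"):
        print token          # one (type, value, ...) tuple per token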
The grammar you provide to configure a parser is extremely simple: a symbol followed by the equal sign "=" defines a rule name, and all the words that follow, up to the next rule name (that is, up to the word preceding the next equal sign, or to the end of the grammar), form the production of this rule.
One can use a pipe "|" to delimit alternative productions, instead of repeating the rule name followed by the equal sign.
A minus sign in front of a rule name tells the parser to not produce an AST node for this production (see below).
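Combining these features, a grammar sketch like the one below uses both the pipe and the dash; the terminal names num, plus and star are assumed here to be token types defined by the scanner.

    expr = expr plus term | term
    term = term star num
    -term = num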
The parsing algorithm is Generalized LR (GLR). The LR(0) item sets and the SLR action/goto tables are computed on the fly when the Automaton instance is created. Support for serialization is planned very soon, to avoid recomputing the automaton for big grammars. For the user's convenience, the parser is callable: it takes a string and outputs the stream of productions, which can be used to build an AST.
As the GLR parse progresses, the parser maintains an up-to-date AST and validates each new AST node before continuing.
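Calling the automaton built in the first example returns the parse results directly. As a sketch (the exact return shape is an assumption based on the examples below), the result is a list with one validated AST per successful parse:

    results = a('1+1')   # 'a' is the Automaton built in the first example
    print results        # expected: a list with one AST per valid parse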
By default each rule produces an AST node in the form (rule_name, contents...). A dash ("-") can be prepended to a rule name in the grammar to make this rule transient in the AST.
For instance:

    E = E plus B
    E = B
    B = one
with the text "1+1" will produce (E (E (B (one, 1))) (plus, +) (B (one, 1))).
The variant:

    E = E plus B
    -E = B
    B = one
with the same text will produce (E (B (one, 1)) (plus, +) (B (one, 1))).
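The effect can be checked by building one automaton per grammar and comparing the results. This is a sketch that assumes a scanner like the one in the first example and that calling an automaton returns the list of root ASTs:

    from jupyLR import Scanner, Automaton

    scanner = Scanner(one='1', plus='[+]')

    plain = Automaton('E', """
    E = E plus B
    E = B
    B = one
    """, scanner)

    transient = Automaton('E', """
    E = E plus B
    -E = B
    B = one
    """, scanner)

    print plain('1+1')      # left operand wrapped as (E (B (one ...)))
    print transient('1+1')  # the intermediate E node is omitted: (B (one ...))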
You can subclass Automaton and overload validate_ast to perform disambiguation or just transform the AST.
For instance, consider the following code that produces a simple arithmetic calculator:
    from jupyLR import Automaton, Scanner
    import operator

    class A(Automaton):
        def validate_ast(self, ast):
            # Demonstrate what we are currently doing
            print "evaluating", ast, "...",
            if ast[0] == 'D':
                # on a digit, evaluate to int
                ret = int(ast[1][1])
            elif ast[0] == 'E':
                # All done, evaluate to computed result
                ret = ast[1]
            else:
                # Perform an operation using the operator module
                op = getattr(operator, ast[0])
                ret = op(ast[1], ast[3])
            print ret
            return ret

    grammar = """
    E = e1
    -e1 = e0|add|sub
    -e0 = D|mul|div
    add = e1 plus e0
    sub = e1 minus e0
    mul = e0 star D
    div = e0 slash D
    D=d
    """

    a = A('E', grammar, Scanner(d='[0-9]', plus='[+]', minus='[-]', slash='/', star='[*]'))
With this, the call

    a('1+2*3+5-8')

will produce the following output:

    evaluating ('D', ('d', '1', 0)) ... 1
    evaluating ('D', ('d', '2', 2)) ... 2
    evaluating ('D', ('d', '3', 4)) ... 3
    evaluating ('mul', 2, ('star', '*', 3), 3) ... 6
    evaluating ('add', 1, ('plus', '+', 1), 6) ... 7
    evaluating ('D', ('d', '5', 6)) ... 5
    evaluating ('add', 7, ('plus', '+', 5), 5) ... 12
    evaluating ('D', ('d', '8', 8)) ... 8
    evaluating ('sub', 12, ('minus', '-', 7), 8) ... 4
    evaluating ('E', 4) ... 4

and return [4].
Get the source with:

    git clone git://github.com/bl0b/jupyLR.git