httk.core.miniparser module

LR(1) miniparser

Introduction

A relatively bare-bones LR(1) parser that parses strings into abstract syntax trees (ast) for generic languages. Python 2 and 3 compatible. Language grammars can be given in textual EBNF.

A simple usage example:

from httk.core.miniparser import parser

ls = {
  'ebnf_grammar': """
     S = E ;
     E = T, '+', E ;
     E = T ;
     T = id ;
  """,
  'tokens': {'id': '[a-zA-Z][a-zA-Z0-9_]*'}
}

input_string = "Test + Test"

result = parser(ls, input_string)
print(result)

Usage example of a simple grammar for balanced parentheses. This also shows using inline regex via an EBNF special sequence:

from httk.core.miniparser import parser
ls = {
  'ebnf_grammar': """
     Expr = Group
          | Expr , Expr
          | id ;
     Group = '(', Expr, ')' ;
     id = ? [a-zA-Z0-9 _]+ ? ;
  """,
  'remove': ['(',')'],
  'simplify': ['Expr']
}

input_string = "Outer ( Inner ( Inside ) Further outside )"

result = parser(ls, input_string)
print(result)

Note: in the above examples, the parse tables are generated on the first call to parser, and then cached inside the ‘ls’ dict. However, if one wants to pre-generate the parse tables (e.g., for looking at them), that can be done by calling build_ls(ls=ls) before parser. You can, if you want, save the ‘ls’ variable to disk (e.g., using pickle). However, since a modern computer builds the parse tables in a time comparable with starting up the Python interpreter, this may not be so useful.
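For example, a minimal sketch of pre-generating and saving the tables, reusing the ‘ls’ dict from the first example (the file name is of course arbitrary):

from httk.core.miniparser import build_ls
import pickle

build_ls(ls=ls)  # generates the parse tables and caches them inside ls

with open('ls.pickle', 'wb') as f:
    pickle.dump(ls, f)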

For documentation on the parameters in the ls dict, see help(build_ls).

Detailed description

This is roughly how the parser operates:

  1. It takes as input:

    1.1. An EBNF grammar in text format for the language it is supposed to parse: ebnf_grammar.

    1.2. Some other meta-info about the language that defines, e.g., terminals (elements that are not further simplified), etc.

    1.3. A string to parse.

  2. The first time this language is parsed, the parser builds up the necessary data structures for the language using the function build_ls. The steps are:

    2.1. The parser uses itself to parse ebnf_grammar into an ast representation of the grammar: ebnf_grammar_ast. To do this, it uses an already provided ast of the EBNF language itself (which can also be recreated by the parser, as shown in the examples at the end of the file under __name__ == "__main__").

    2.2. The ebnf_grammar_ast is translated to a more BNF-like abstract form that expands alternations, optionals, groupings, and repetitions into separate rules: bnf_grammar_ast.

    2.3. The bnf_grammar_ast is processed into a rule_table. This is a dictionary that maps every symbol to a list of possible right-hand sides in the production rules.

    2.4. The rule_table is used to build a table of the FIRST(symbol) function in LR parsing. It maps every symbol to a list of terminals that may be the very first thing seen in the input when matching that production rule: first_table. (A minimal sketch of this computation follows this list.)

    2.5. The rule_table and the first_table are used to build the ACTION and GOTO tables in LR parsing. These encode a state machine that for every state tells the machine to either shift or reduce, and, when doing so, the state the machine progresses to: action_table and goto_table.

  3. The parse string is processed by the Python generator lexer, which splits the input into lexical tokens.

  4. The LR state machine is initialized in its starting state. Tokens are read from the lexer, and shift/reduce actions and state changes are made according to action_table and goto_table. The results of the parsing are collected on the symbol stack in the form of an ast.

  5. When all input has been reduced into the starting symbol, the ast connected to that symbol is returned.
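To make step 2.4 concrete, here is a minimal sketch of the fixpoint computation behind a FIRST table. This is illustrative only, not the module's actual code; the rule_table layout follows the description in step 2.3, and epsilon rules are ignored for brevity:

def compute_first_table(rule_table):
    # Nonterminals are the keys of rule_table; anything else is a terminal.
    first = {nt: set() for nt in rule_table}
    changed = True
    while changed:
        changed = False
        for nt, rhss in rule_table.items():
            for rhs in rhss:
                sym = rhs[0]  # only the leading symbol matters without epsilon rules
                if sym not in rule_table:
                    # terminal: it is directly in FIRST(nt)
                    if sym not in first[nt]:
                        first[nt].add(sym)
                        changed = True
                else:
                    # nonterminal: FIRST(nt) includes FIRST(sym)
                    new = first[sym] - first[nt]
                    if new:
                        first[nt] |= new
                        changed = True
    return first

# The grammar from the first usage example: S = E ; E = T '+' E | T ; T = id
rules = {'S': [['E']], 'E': [['T', '+', 'E'], ['T']], 'T': [['id']]}
print(compute_first_table(rules))  # {'S': {'id'}, 'E': {'id'}, 'T': {'id'}}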

Diagnostic output

  • You can add verbosity=<int> as an argument to both the parser and the build_ls function to get that level of diagnostic output.

  • For more fine-tuned output, set verbosity = LogVerbosity(verbosity, [<flags>]), where the flags can be various flags found in the source code.

    Known flags at the time of writing:

    • print_all_tokens=True makes the parser have the lexer process all input first and print all tokens before the parsing starts.

    • <function name>_verbosity = <verbosity level> adjusts the verbosity level for just that one function. For example:

      parser(ls, source, verbosity=LogVerbosity(0,parser_verbosity=3))

    prints out diagnostic output on level 3 for the parser function, but skips any other diagnostic output.

  • If you do not want the default behavior of printing diagnostic output on stdout, both parser and build_ls take the argument logger=<function>, which redirects all diagnostic output to that function. The function should have the signature:

    logger(*args,**kargs):
    

    where args is the diagnostic info being printed and the keyword arguments communicate flags. In particular, pretty=True indicates that complex objects are passed which would benefit from using, e.g., pprint.pprint to typeset the output.
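    For example, a minimal sketch of a custom logger that redirects diagnostics to stderr and honors the pretty flag (only the signature and the pretty flag are given above; the rest is illustrative):

    import sys
    import pprint

    def my_logger(*args, **kargs):
        # pretty=True signals complex objects that benefit from pprint
        if kargs.get('pretty', False):
            for arg in args:
                pprint.pprint(arg, stream=sys.stderr)
        else:
            print(*args, file=sys.stderr)

    result = parser(ls, input_string, logger=my_logger)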

class httk.core.miniparser.LogVerbosity(verbosity, **flags)[source]

Bases: object

Class to send in as keyword argument for verbosity to fine-tune diagnostic output from certain functions.

Set the keyword argument as follows:

verbosity = LogVerbosity(verbosity, [<flags>])

flags can be various flags that can be found in the source code, e.g., print_all_tokens=True makes the parser have the lexer process all input first and print all tokens before the parsing starts.

Specifically, set <function name>_verbosity = <verbosity level> to adjust the verbosity level for just that one function. For example:

parser(ls, source, verbosity=LogVerbosity(0,parser_verbosity=3))

prints out diagnostic output on level 3 for the parser function, but skips any other diagnostic output.

exception httk.core.miniparser.ParserError[source]

Bases: exceptions.Exception

exception httk.core.miniparser.ParserGrammarError[source]

Bases: httk.core.miniparser.ParserError

exception httk.core.miniparser.ParserInternalError[source]

Bases: httk.core.miniparser.ParserError

exception httk.core.miniparser.ParserSyntaxError(*args)[source]

Bases: httk.core.miniparser.ParserError

httk.core.miniparser.build_ls(ebnf_grammar=None, tokens={}, partial_tokens={}, literals=None, precedence=[], ignore=' \t\n', simplify=[], aggregate=[], start=None, skip=[], remove=[], comment_markers=[], ls=None, verbosity=0, logger=<function logger>)[source]

Build a language specification from an ebnf grammar and some meta-info of the language.

Args:
ebnf_grammar (str):
a string containing the ebnf describing the language.
tokens (dict,optional):
a dict of token names and the regexes that define them; they are considered terminals in the parsing. (They may also be defined as production rules in the ebnf, but if so, those definitions are ignored.)
partial_tokens (dict):
a dictionary that maps token names to regular expressions for partial token matches. This is used to allow finding longer matches if there is intermediate-length input that does not match. E.g., to match 5.32e6 as a number instead of as Number(5.32) + Identifier(e) + Number(6).
literals (list of str):
a list of strings of one or more characters that define literal symbols of the language (i.e., the tokenizer names these tokens the same as the string itself). If not given, an attempt is made to auto-extract them from the grammar.
precedence (list,optional):
a list of tuples of the format (associativity, symbol, …); the order of this list defines the precedence of those symbols, with later entries in the list having higher precedence. The associativity can be ‘left’, ‘right’, or ‘noassoc’.
ignore (str,optional):
a string of characters, or a list of strings for symbols, which are withheld by the tokenizer. (This is commonly used to skip emitting whitespace tokens, while still supporting whitespace inside tokens, e.g., quoted strings.)
simplify (list,optional):
a list of symbol identifiers that are simplified away when the parse tree is generated.
aggregate (list,optional):
a list of symbol identifiers that, when constituting consecutive nodes, are ‘flattened’, removing the ambiguity of left or right associativity.
start (str,optional):
the start (topmost) symbol of the grammar. A successful parsing means reducing all input into this symbol.
remove (list):
list of symbols to just skip in the output parse tree (useful to, e.g., skip uninteresting literals).
skip (list):
list of rules to completely ignore in the grammar. (Useful to skip rules in a complete EBNF that reduce the tokens into single-character entities, when one would rather handle those tokens by regexes via the tokens argument.)
ls (dict):
As an alternative to giving the above parameters, a dict can be given with the same attributes as the arguments defined above.
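For illustration, a sketch of building a specification by passing individual arguments. This assumes build_ls returns the completed ls dict (the alternative build_ls(ls=ls) form is shown in the introduction); the grammar and token regex are from the first usage example, and remove is used here purely as an example:

from httk.core.miniparser import build_ls, parser

ls = build_ls(
    ebnf_grammar="""
       S = E ;
       E = T, '+', E ;
       E = T ;
       T = id ;
    """,
    tokens={'id': '[a-zA-Z][a-zA-Z0-9_]*'},
    remove=['+'],  # illustrative: drop the '+' literals from the parse tree
)
print(parser(ls, "a + b"))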
httk.core.miniparser.ebnf_unqote(s)[source]
httk.core.miniparser.lexer(source, tokens, partial_tokens, literals, ignore, comment_markers=[], verbosity=0, logger=<function logger>)[source]

A generator that turns source into tokens.

Args:
source (str):
input string
tokens (dict):
a dictionary that maps all tokens of the language to regular expressions that match them.
partial_tokens (dict):
a dictionary that maps token names to regular expressions for partial token matches. This is used to allow finding longer matches if there is intermediate-length input that does not match. E.g., to match 5.32e6 as a number instead of as Number(5.32) + Identifier(e) + Number(6).
literals (list):
a list of single character strings that are to be treated as literals.
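A sketch of driving the lexer directly, with the token regex from the introduction, '+' passed as a literal, and whitespace ignored. The exact shape of the yielded token objects is not specified here, so the loop just prints them:

from httk.core.miniparser import lexer

tokens = {'id': '[a-zA-Z][a-zA-Z0-9_]*'}
for token in lexer("a + b", tokens, {}, ['+'], ' \t\n'):
    print(token)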
httk.core.miniparser.logger(*args, **kargs)[source]

This is the default logging function for diagnostic output. It prints the output in args on stdout.

Args:
loglevel:
the level designated to the diagnostic output
args:
list of arguments to print out
kargs:
keyword flags. These are: pretty=True: formats the output using pprint.pprint(arg).
httk.core.miniparser.parser(ls, source, verbosity=0, logger=<function logger>)[source]

This is a fairly straightforward implementation of an LR(1) parser. It should do well for parsing somewhat simple grammars.

The parser takes a language specification (ls), and a string to parse (source). The string is then parsed according to that ls into a syntax tree, which is returned.

An ls is produced by calling the function build_ls (see help(build_ls)).

Args:
ls:
language specification produced by build_ls.
source (str):
source string to parse.
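A sketch of handling malformed input with the exception classes documented above (ls and the grammar are from the first usage example; the malformed input string is illustrative):

from httk.core.miniparser import parser, ParserSyntaxError

try:
    result = parser(ls, "Test + + Test")
except ParserSyntaxError as e:
    print("Syntax error:", e)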
httk.core.miniparser.split_chars_strip_comments(source, comment_markers)[source]

Helper function for the lexer that reads input and strips comments, while keeping track of absolute position in the file.

Args:
source (str):
input string
comment_markers (list of tuples):
a list of entries (start_marker, end_marker) that designate comments. A marker can be the end-of-line or end with an end-of-line, but multiline comment markers are not allowed, i.e., no characters may follow the end-of-line.
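For illustration, a comment_markers list that strips '#'-to-end-of-line comments might look as follows (the exact tuple convention is an assumption based on the description above):

# Each entry is (start_marker, end_marker); an end marker of '\n' means
# the comment runs to the end of the line.
comment_markers = [('#', '\n')]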