
Chapter 3   Grammars

The Camlp4 grammar system is a way to parse streams with more power and more convenience than simple stream parsers. A grammar handles operator precedences, associativity and left factorization, and its entries are extensible.

3.1   Grammars and entries

The Camlp4 library provides a module ``Grammar'' and a syntax extension file ``pa_extend.cmo'' to program grammars. All details on modules, types and functions are described in chapter 7.

A grammar can be created using the function ``Grammar.create''. It takes a lexer as parameter. A good candidate is the function ``Plexer.make'', but users can create their own lexer, provided that it is of type ``Token.lexer''.
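For example, one could write (assuming, as for the predefined lexer, that ``Plexer.make'' has type unit -> Token.lexer):

     let gram = Grammar.create (Plexer.make ());;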

A grammar holds entries. An entry can be created using the function ``Grammar.Entry.create''. The first parameter is the grammar, the second one a string used to name the entry in error messages and in entry dumps. Entries are created empty, i.e. raising ``Stream.Error'' when called. An entry is composed of precedence levels, the first one having the lowest precedence and the last one the highest.
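Continuing the sketch above (the entry name ``expr'' is only illustrative):

     (* an empty entry: parsing with it fails until it is extended *)
     let expr = Grammar.Entry.create gram "expr";;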

Lastly, a stream can be parsed by an entry using the function ``Grammar.Entry.parse''. In case of syntax error, the exception ``Stream.Error'' is raised, encapsulated by the exception ``Stdpp.Exc_located'', giving the location of the error.
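For example, a possible sketch reporting the error location before re-raising:

     let parse_string s =
       try Grammar.Entry.parse expr (Stream.of_string s) with
         Stdpp.Exc_located ((bp, ep), exn) ->
           Printf.eprintf "Parse error at characters %d-%d\n" bp ep;
           raise exn;;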

3.2   Extension

An entry can be extended using the function ``Grammar.extend''. But since its interface is quite complicated and it must be used with appropriate type constraints, the Camlp4 library provides a file, named ``pa_extend.cmo'', compatible with ``pa_o.cmo'' and ``pa_r.cmo'', which creates a new instruction doing this work.

This instruction is ``EXTEND'' which has the following format:

EXTEND
{ GLOBAL : global-list ; }
entry : { position } extension ;
...
entry : { position } extension ;
END

EXTEND, GLOBAL and END are keywords. There are some other keywords in this instruction, all in uppercase.

The entry is the name of an entry variable (a simple identifier, possibly prefixed by a module name). It is extended according to the rules given by extension. The optional position tells how to insert the extension among the entry precedence levels: by default, the extension is done at the first precedence level (creating this level if the entry is empty).

The optional global-list restricts the list of global entries (separated by spaces) extended in the ``EXTEND'' instruction. The other entries are created locally. By default, all entries are global and must correspond to entry variables visible at the point of the ``EXTEND'' instruction.

Syntax of a position

A position can be:
FIRST
the extension is inserted as a new first precedence level;
LAST
the extension is inserted as a new last precedence level;
BEFORE "label"
a new level is inserted before the level labelled "label";
AFTER "label"
a new level is inserted after the level labelled "label";
LEVEL "label"
the extension is inserted in the level labelled "label" itself.

Only the case LEVEL extends already existing levels: the other cases create new levels.
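For instance, the following sketch (assuming an entry ``expr'' having a level labelled "simple", as in the complete example at the end of this section) extends that existing level with a rule for unary minus:

     EXTEND
       expr: LEVEL "simple"
         [ [ "-"; x = SELF -> - x ] ];
     END;;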

Syntax of an extension

The syntax of an entry extension is:
[ { label } { associativity } level-rules  
| ...  
| { label } { associativity } level-rules ]

The optional label identifies the current level for possible future extensions. The optional associativity can be LEFTA, RIGHTA or NONA, for left, right and no associativity respectively; the default is left associativity.

Syntax of a level-rules

The level-rules have the format:
[ { pattern = } symbol ; ... { pattern = } symbol { -> action }  
| ...  
| { pattern = } symbol ; ... { pattern = } symbol { -> action } ]

The pattern is a simple pattern of the language, limited to identifiers, the wildcard (the underscore character) and tuples of patterns. If absent, it is equivalent to the wildcard.

The action is an expression of the language, in which the patterns are bound to the results of their corresponding symbols; in addition, the variable ``loc'' is bound to the source location of the rule. The action part is optional; by default, its value is ``()''.
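For instance, in the following rule fragment (illustrative; INT is a token pattern, described below), ``n'' is bound to the matched integer string and ``loc'' to a pair of character positions:

     [ [ n = INT ->
           Printf.eprintf "integer at %d-%d\n" (fst loc) (snd loc);
           int_of_string n ] ]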

A symbol is either:
A string
meaning: a keyword. If this keyword does not exist in the current grammar, it is added. The syntax of the keyword must respect the rules of the lexer. Its type is string.
An entry name
(simple identifier or prefixed by a module name) meaning a call to this entry. Its type is the entry result type.
The keyword SELF
meaning the entry being extended (first level).
The keyword NEXT
meaning the entry being extended at the next level.
A token pattern.
Token patterns are identifiers starting with an uppercase letter, optionally followed by a string; they recognize tokens generated by the lexer. See, for example, the constructors generated by the predefined lexer (section 7.6), lexing OCaml normal and revised syntaxes.
A metasymbol
among:
     LIST0 followed by a symbol, meaning a possibly empty list of this symbol; its type is the list of the type of this symbol;
     LIST0 followed by a symbol, then SEP and another symbol, meaning the same with the second symbol used as separator;
     LIST1 followed by a symbol, meaning a non-empty list of this symbol;
     LIST1 followed by a symbol, then SEP and another symbol, meaning the same with the second symbol used as separator;
     OPT followed by a symbol, meaning this symbol or nothing; its type is the option of the type of this symbol.
A level-rules
meaning an anonymous entry with only one level.
A symbol between parentheses
meaning this symbol itself; the parentheses are only used for grouping.
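Putting everything together, here is a possible complete sketch (normal syntax, to be preprocessed with ``camlp4o pa_extend.cmo''; the level labels are only illustrative) of a small arithmetic parser, using the grammar and entry created as in section 3.1:

     let gram = Grammar.create (Plexer.make ());;
     let expr = Grammar.Entry.create gram "expr";;
     EXTEND
       expr:
         [ "add" LEFTA
           [ x = SELF; "+"; y = SELF -> x + y
           | x = SELF; "-"; y = SELF -> x - y ]
         | "mult" LEFTA
           [ x = SELF; "*"; y = SELF -> x * y ]
         | "simple"
           [ i = INT -> int_of_string i
           | "("; e = SELF; ")" -> e ] ];
     END;;

With these definitions, ``Grammar.Entry.parse expr (Stream.of_string "1 + 2 * 3")'' evaluates to 7, the "mult" level binding tighter than the "add" level.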

3.3   Parsed language

Entries are improved stream parsers. Streams use recursive descent parsing (something close to LL(1)). Each precedence level can be seen as an independent extensible stream parser.

In a precedence level, the rules are factorized, i.e. several rules can start with the same symbols, and the parser is represented by a tree of symbols where leaves are either semantic actions, or dead ends.

When an extension is done, the new rules are inserted in the precedence level tree. The system does not check whether the entry precedence level will ``work'', nor whether it would work together with the other levels and the other entries.

This insertion is generally done at the end of the rules list. The special cases concern symbols of type token, which are always inserted before the other symbols (non-terminals, lists, options, etc.); among them, tokens with parameters are inserted before tokens without parameters (whose parameter is the empty string).

There is no factorization of rules between different levels nor, a fortiori, with other entries.

While parsing an entry, the system tries the first precedence level. If it fails, it tries the second one, and so on. If the last one fails, the entry fails and the caller of this entry resumes, either by raising a parse error (parsing then stops) or by trying other rules.

3.4   Deletion

Entry rules can be deleted using the function ``Grammar.delete_rule''. But, as for ``Grammar.extend'', it is not documented. One must use the instruction ``DELETE_RULE'', which generates a call to this function. This instruction is a syntax extension, loaded together with the instruction ``EXTEND'' by the file ``pa_extend.cmo''.

The syntax of ``DELETE_RULE'' is:

DELETE_RULE
entry : symbol ; ... symbol
END

The rule defined by the list of symbols (the semantic action omitted), if found, is deleted.
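For example, to delete the addition rule of the sketch given at the end of section 3.2:

     DELETE_RULE
       expr: SELF; "+"; SELF
     END;;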

3.5   Writing a lexer

Lexers have to create tokens of type Token.t, which is actually (string * string): the first string is a constructor name (an identifier starting with an uppercase letter) and the second string is the value.

For example, if your lexer wants to define identifiers, you must reserve a constructor name, e.g. IDENT, and put the identifier value as the second string: reading "foo", it returns ("IDENT", "foo"). Similarly, if your lexer reads integers, you can use INT as the constructor and the string representation of the integer as the value, e.g. ("INT", "32").

In the corresponding EXTEND statement, you can use as symbol a constructor with a specific value, e.g:

     IDENT "bar"
     INT "32"
which recognizes only the identifier "bar" or only the integer 32. Another possible symbol is the constructor alone, which recognizes any value of this constructor. It is useful to bind it to a pattern identifier, in order to use it in the action part of the rule:

     p = IDENT
     i = INT
Notice that you can use any name for your constructors, provided they are identifiers starting with an uppercase letter and do not conflict with the predefined symbols of the EXTEND statement, which are: SELF, NEXT, LIST0, LIST1, OPT.

There is a special case, useful for keywords: if you give "" (the empty string) as the constructor, you can use the second string directly in the rules. This is the case in our examples for the operators "+", "*" and so on.

Each call to your main lexer function must create one value of type (Token.t * Token.location) from a character stream. The type Token.t is defined above, and Token.location is a pair of integers giving the input location (the first character of the input stream having position 0).

In a first version, you can always return (0, 0) as the location; it does not prevent the system from working, since locations are used only in error messages.
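As an illustration, here is a toy sketch of such a function, written with ``Stream.peek'' and ``Stream.junk'' rather than with a stream parser; it recognizes only integers and single-character keywords, and uses ``Stream.count'' for locations:

     let rec next_token strm =
       let bp = Stream.count strm in
       match Stream.peek strm with
         None -> (("EOI", ""), (bp, bp + 1))
       | Some (' ' | '\t' | '\n') -> Stream.junk strm; next_token strm
       | Some ('0'..'9') ->
           let buf = Buffer.create 8 in
           let rec digits () =
             match Stream.peek strm with
               Some ('0'..'9' as c) ->
                 Stream.junk strm; Buffer.add_char buf c; digits ()
             | _ -> ()
           in
           digits ();
           (("INT", Buffer.contents buf), (bp, Stream.count strm))
       | Some c ->
           (* every other character becomes a keyword token *)
           Stream.junk strm;
           (("", String.make 1 c), (bp, Stream.count strm));;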

The input is a character stream, but you are not obliged to use a stream parser to write your lexer. For example, the module Token provides a function to create a lexer function from an ocamllex lexing function (see the interface of module Token). Moreover, this function takes care of the location stuff.

A lexer appropriate for the Grammar.create function is a value of type Token.lexer, a record with the following fields:
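As a reminder (recalled here from the ``Token'' interface; check your distribution, since the exact types vary between versions): ``func'', the lexing function itself; ``using'' and ``removing'', called when a keyword or token pattern is added to or removed from the grammar (typically to maintain a keyword table); ``tparse'', an optional specific parser for a token pattern; and ``text'', the way a token pattern is displayed in error messages.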

