Add documentation for AST generation mode - close #22

Josh Holtrop 2024-04-23 00:15:19 -04:00
parent c7a18ef821
commit f0bd8d8663
2 changed files with 169 additions and 1 deletions


@ -6,7 +6,8 @@ Propane is a LALR Parser Generator (LPG) which:
* generates a built-in lexer to tokenize input
* supports UTF-8 lexer inputs
* generates a table-driven shift/reduce parser to parse input in linear time
* target C or D language outputs * targets C or D language outputs
* optionally supports automatic full AST generation
* is MIT-licensed
* is distributable as a standalone Ruby script


@ -14,6 +14,7 @@ Propane is a LALR Parser Generator (LPG) which:
* supports UTF-8 lexer inputs
* generates a table-driven shift/reduce parser to parse input in linear time
* targets C or D language outputs
* optionally supports automatic full AST generation
* is MIT-licensed
* is distributable as a standalone Ruby script
@ -182,6 +183,99 @@ rule.
Parser values for the rules or tokens in the rule pattern can be accessed
positionally with tokens `$1`, `$2`, `$3`, etc...
Parser rule code blocks are not available in AST generation mode.
In AST generation mode, a full parse tree is automatically constructed in
memory for user code to traverse after parsing is complete.
##> AST generation mode - the `ast` statement
To activate AST generation mode, place the `ast` statement in your grammar file:
```
ast;
```
It is recommended to place this statement early in the grammar.
In AST generation mode, several aspects of Propane's behavior change:
* Only one `ptype` is allowed.
* Parser user code blocks are not supported.
* Structure types are generated to represent the parsed tokens and rules as
defined in the grammar.
* The parse result from `p_result()` points to a `Start` structure containing
the entire parse tree for the input.
Example AST generation grammar:
```
ast;
ptype int;
token a << $$ = 11; >>
token b << $$ = 22; >>
token one /1/;
token two /2/;
token comma /,/ <<
$$ = 42;
>>
token lparen /\(/;
token rparen /\)/;
drop /\s+/;
Start -> Items;
Items -> Item ItemsMore;
Items -> ;
ItemsMore -> comma Item ItemsMore;
ItemsMore -> ;
Item -> a;
Item -> b;
Item -> lparen Item rparen;
Item -> Dual;
Dual -> One Two;
Dual -> Two One;
One -> one;
Two -> two;
```
The following unit test describes the fields that will be present for an
example parse:
```
string input = "a, ((b)), b";
p_context_t context;
p_context_init(&context, input);
assert_eq(P_SUCCESS, p_parse(&context));
Start * start = p_result(&context);
assert(start.pItems1 !is null);
assert(start.pItems !is null);
Items * items = start.pItems;
assert(items.pItem !is null);
assert(items.pItem.pToken1 !is null);
assert_eq(TOKEN_a, items.pItem.pToken1.token);
assert_eq(11, items.pItem.pToken1.pvalue);
assert(items.pItemsMore !is null);
ItemsMore * itemsmore = items.pItemsMore;
assert(itemsmore.pItem !is null);
assert(itemsmore.pItem.pItem !is null);
assert(itemsmore.pItem.pItem.pItem !is null);
assert(itemsmore.pItem.pItem.pItem.pToken1 !is null);
assert_eq(TOKEN_b, itemsmore.pItem.pItem.pItem.pToken1.token);
assert_eq(22, itemsmore.pItem.pItem.pItem.pToken1.pvalue);
assert(itemsmore.pItemsMore !is null);
itemsmore = itemsmore.pItemsMore;
assert(itemsmore.pItem !is null);
assert(itemsmore.pItem.pToken1 !is null);
assert_eq(TOKEN_b, itemsmore.pItem.pToken1.token);
assert_eq(22, itemsmore.pItem.pToken1.pvalue);
assert(itemsmore.pItemsMore is null);
```
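Because the grammar encodes the list as a chain of `ItemsMore` nodes, user code typically walks that chain until it reaches a null pointer. The following is a minimal C sketch of such a walk. The structure definitions are simplified, hand-written stand-ins that contain only the non-positional fields described in this guide; they are assumptions for illustration and not the code Propane actually generates. The `count_items` helper is likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the generated node types (assumed layout,
 * not actual Propane output). */
typedef struct Item { int placeholder; } Item;

typedef struct ItemsMore
{
    Item * pItem;
    struct ItemsMore * pItemsMore; /* NULL when `ItemsMore -> ;` matched */
} ItemsMore;

typedef struct Items
{
    Item * pItem;
    ItemsMore * pItemsMore;
} Items;

/* Hypothetical helper: count the Item nodes reachable from an Items node.
 * A NULL Items pointer means the empty `Items -> ;` rule matched. */
size_t count_items(const Items * items)
{
    if (items == NULL)
    {
        return 0u;
    }
    /* A non-empty Items node always begins with an Item. */
    assert(items->pItem != NULL);
    size_t count = 1u;
    for (const ItemsMore * more = items->pItemsMore;
         more != NULL;
         more = more->pItemsMore)
    {
        count++;
    }
    return count;
}
```

Under these assumptions, a tree shaped like the one checked in the unit test above (one `Items` node followed by two chained `ItemsMore` nodes) would yield a count of three.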
##> Specifying tokens - the `token` statement
The `token` statement allows defining a lexer token and a pattern to match that
@ -442,6 +536,12 @@ In this example:
* a reduced `Values`'s parser value has a type of `Value[]`.
* a reduced `KeyValue`'s parser value has a type of `Value[string]`.
When AST generation mode is active, the `ptype` functionality works differently.
In this mode, only one `ptype` is used by the parser.
A lexer user code block may set the parser value of the generated `Token` node
by assigning to `$$`.
The type of `$$` is given by the single global `ptype`.
##> Specifying a parser rule - the rule statement
Rule statements create parser rules which define the grammar that will be
@ -490,6 +590,9 @@ The `$$` symbol accesses the output parser value for this rule.
The above examples demonstrate how the parser values for the rule components
can be used to produce the parser value for the accepted rule.
Parser rule code blocks are not allowed and not used when AST generation mode
is active.
##> Specifying the parser module name - the `module` statement
The `module` statement can be used to specify the module name for a generated
@ -586,6 +689,67 @@ A pointer to this instance is passed to the generated functions.
The `p_position_t` structure contains two fields `row` and `col`.
These fields contain the 0-based row and column describing a parser position.
### AST Node Types
If AST generation mode is enabled, a structure type for each rule will be
generated.
The name of the structure type is given by the name of the rule.
Additionally a structure type called `Token` is generated to represent an
AST node which refers to a raw parser token rather than a composite rule.
#### AST Node Fields
A `Token` node has two fields:
* `token` which specifies which token was parsed (one of `TOKEN_*`)
* `pvalue` which specifies the parser value for the token. If a lexer user
code block assigned to `$$`, the assigned value will be stored here.
The other generated AST node structures have fields generated based on the
right hand side components specified for all rules of a given name.
In this example:
```
Start -> Items;
Items -> Item ItemsMore;
Items -> ;
```
The `Start` structure will have a field called `pItems` and another field of
the same name but with a positional suffix (`pItems1`) which both point to the
parsed `Items` node.
Their value will be null if the parsed `Items` rule was empty.
The `Items` structure will have fields:
* `pItem` and `pItem1` which point to the parsed `Item` structure.
* `pItemsMore` and `pItemsMore2` which point to the parsed `ItemsMore` structure.
If a rule can be empty (for example in the second `Items` rule above), then
any pointer to that rule's generated AST node will be null when the parser
matches the empty rule definition.
The non-positional AST node field pointer will not be generated if there are
multiple positions in which an instance of the node it points to could be
present.
For example, in the below rules:
```
Dual -> One Two;
Dual -> Two One;
```
The generated `Dual` structure will contain `pOne1`, `pTwo2`, `pTwo1`, and
`pOne2` fields.
However, a `pOne` field and `pTwo` field will not be generated since it would
be ambiguous which one was matched.
If the first rule is matched, then `pOne1` and `pTwo2` will be non-null while
`pTwo1` and `pOne2` will be null.
If the second rule is matched instead, then the opposite would be the case.
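User code can therefore tell which alternative was matched by testing the positional fields. The C sketch below illustrates the idea. The `One`, `Two`, and `Dual` definitions are hand-written stand-ins that carry only the four positional fields named above; they are assumptions for illustration, not Propane's generated output, and `dual_alternative` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written stand-ins for the generated node types (assumed layout,
 * not actual Propane output). */
typedef struct One { int placeholder; } One;
typedef struct Two { int placeholder; } Two;

typedef struct Dual
{
    One * pOne1; /* non-null when `Dual -> One Two` matched */
    Two * pTwo2; /* non-null when `Dual -> One Two` matched */
    Two * pTwo1; /* non-null when `Dual -> Two One` matched */
    One * pOne2; /* non-null when `Dual -> Two One` matched */
} Dual;

/* Hypothetical helper: returns 1 if `Dual -> One Two` was matched,
 * or 2 if `Dual -> Two One` was matched. */
int dual_alternative(const Dual * dual)
{
    assert(dual != NULL);
    if (dual->pOne1 != NULL)
    {
        return 1;
    }
    /* Otherwise the second alternative must have been matched. */
    assert(dual->pTwo1 != NULL);
    return 2;
}
```

Checking a single positional field is enough here because each alternative populates a disjoint pair of fields.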
##> Functions
### `p_context_init`
@ -639,6 +803,9 @@ if (p_parse(&context) == P_SUCCESS)
}
```
If AST generation mode is active, then the `p_result()` function returns a
`Start *` pointing to the `Start` AST structure.
### `p_position`
The `p_position()` function can be used to retrieve the parser position where