Add documentation for AST generation mode - close #22

2024-04-23 00:15:19 -04:00 · 2024-04-23 00:15:19 -04:00 · f0bd8d8663
commit f0bd8d8663
parent c7a18ef821
2 changed files with 169 additions and 1 deletions
--- a/README.md
+++ b/README.md
@ -6,7 +6,8 @@ Propane is a LALR Parser Generator (LPG) which:
  * generates a built-in lexer to tokenize input
  * supports UTF-8 lexer inputs
  * generates a table-driven shift/reduce parser to parse input in linear time
-  * target C or D language outputs
+  * targets C or D language outputs
+  * optionally supports automatic full AST generation
  * is MIT-licensed
  * is distributable as a standalone Ruby script

--- a/doc/user_guide.md
+++ b/doc/user_guide.md
@ -14,6 +14,7 @@ Propane is a LALR Parser Generator (LPG) which:
  * supports UTF-8 lexer inputs
  * generates a table-driven shift/reduce parser to parse input in linear time
  * targets C or D language outputs
+  * optionally supports automatic full AST generation
  * is MIT-licensed
  * is distributable as a standalone Ruby script

@ -182,6 +183,99 @@ rule.
 Parser values for the rules or tokens in the rule pattern can be accessed
 positionally with tokens `$1`, `$2`, `$3`, etc...

+Parser rule code blocks are not available in AST generation mode.
+In AST generation mode, a full parse tree is automatically constructed in
+memory for user code to traverse after parsing is complete.
+
+##> AST generation mode - the `ast` statement
+
+To activate AST generation mode, place the `ast` statement in your grammar file:
+
+```
+ast;
+```
+
+It is recommended to place this statement early in the grammar.
+
+In AST generation mode various aspects of propane's behavior are changed:
+
+  * Only one `ptype` is allowed.
+  * Parser user code blocks are not supported.
+  * Structure types are generated to represent the parsed tokens and rules as
+  defined in the grammar.
+  * The parse result from `p_result()` points to a `Start` structure containing
+  the entire parse tree for the input.
+
+Example AST generation grammar:
+
+```
+ast;
+
+ptype int;
+
+token a << $$ = 11; >>
+token b << $$ = 22; >>
+token one /1/;
+token two /2/;
+token comma /,/ <<
+  $$ = 42;
+>>
+token lparen /\\(/;
+token rparen /\\)/;
+drop /\\s+/;
+
+Start -> Items;
+
+Items -> Item ItemsMore;
+Items -> ;
+
+ItemsMore -> comma Item ItemsMore;
+ItemsMore -> ;
+
+Item -> a;
+Item -> b;
+Item -> lparen Item rparen;
+Item -> Dual;
+
+Dual -> One Two;
+Dual -> Two One;
+One -> one;
+Two -> two;
+```
+
+The following unit test describes the fields that will be present for an
+example parse:
+
+```
+string input = "a, ((b)), b";
+p_context_t context;
+p_context_init(&context, input);
+assert_eq(P_SUCCESS, p_parse(&context));
+Start * start = p_result(&context);
+assert(start.pItems1 !is null);
+assert(start.pItems !is null);
+Items * items = start.pItems;
+assert(items.pItem !is null);
+assert(items.pItem.pToken1 !is null);
+assert_eq(TOKEN_a, items.pItem.pToken1.token);
+assert_eq(11, items.pItem.pToken1.pvalue);
+assert(items.pItemsMore !is null);
+ItemsMore * itemsmore = items.pItemsMore;
+assert(itemsmore.pItem !is null);
+assert(itemsmore.pItem.pItem !is null);
+assert(itemsmore.pItem.pItem.pItem !is null);
+assert(itemsmore.pItem.pItem.pItem.pToken1 !is null);
+assert_eq(TOKEN_b, itemsmore.pItem.pItem.pItem.pToken1.token);
+assert_eq(22, itemsmore.pItem.pItem.pItem.pToken1.pvalue);
+assert(itemsmore.pItemsMore !is null);
+itemsmore = itemsmore.pItemsMore;
+assert(itemsmore.pItem !is null);
+assert(itemsmore.pItem.pToken1 !is null);
+assert_eq(TOKEN_b, itemsmore.pItem.pToken1.token);
+assert_eq(22, itemsmore.pItem.pToken1.pvalue);
+assert(itemsmore.pItemsMore is null);
+```
+
 ##> Specifying tokens - the `token` statement

 The `token` statement allows defining a lexer token and a pattern to match that
@ -442,6 +536,12 @@ In this example:
  * a reduced `Values`'s parser value has a type of `Value[]`.
  * a reduced `KeyValue`'s parser value has a type of `Value[string]`.

+When AST generation mode is active, the `ptype` functionality works differently.
+In this mode, only one `ptype` is used by the parser.
+Lexer user code blocks may assign a parse value to the generated `Token` node
+by assigning to `$$` within a lexer code block.
+The type of the parse value `$$` is given by the global `ptype` type.
+
 ##> Specifying a parser rule - the rule statement

 Rule statements create parser rules which define the grammar that will be
@ -490,6 +590,9 @@ The `$$` symbol accesses the output parser value for this rule.
 The above examples demonstrate how the parser values for the rule components
 can be used to produce the parser value for the accepted rule.

+Parser rule code blocks are not allowed and not used when AST generation mode
+is active.
+
 ##> Specifying the parser module name - the `module` statement

 The `module` statement can be used to specify the module name for a generated
@ -586,6 +689,67 @@ A pointer to this instance is passed to the generated functions.
 The `p_position_t` structure contains two fields `row` and `col`.
 These fields contain the 0-based row and column describing a parser position.

+### AST Node Types
+
+If AST generation mode is enabled, a structure type for each rule will be
+generated.
+The name of the structure type is given by the name of the rule.
+Additionally a structure type called `Token` is generated to represent an
+AST node which refers to a raw parser token rather than a composite rule.
+
+#### AST Node Fields
+
+A `Token` node has two fields:
+
+  * `token` which specifies which token was parsed (one of `TOKEN_*`)
+  * `pvalue` which specifies the parser value for the token. If a lexer user
+  code block assigned to `$$`, the assigned value will be stored here.
+
+The other generated AST node structures have fields generated based on the
+right hand side components specified for all rules of a given name.
+
+In this example:
+
+```
+Start -> Items;
+
+Items -> Item ItemsMore;
+Items -> ;
+```
+
+The `Start` structure will have a field called `pItems` and another field of
+the same name but with a positional suffix (`pItems1`) which both point to the
+parsed `Items` node.
+Their value will be null if the parsed `Items` rule was empty.
+
+The `Items` structure will have fields:
+
+  * `pItem` and `pItem1` which point to the parsed `Item` structure.
+  * `pItemsMore` and `pItemsMore2` which point to the parsed `ItemsMore` structure.
+
+If a rule can be empty (for example in the second `Items` rule above), then
+an instance of a pointer to that rule's generated AST node will be null if the
+parser matches the empty rule definition.
+
+The non-positional AST node field pointer will not be generated if there are
+multiple positions in which an instance of the node it points to could be
+present.
+For example, in the below rules:
+
+```
+Dual -> One Two;
+Dual -> Two One;
+```
+
+The generated `Dual` structure will contain `pOne1`, `pTwo2`, `pTwo1`, and
+`pOne2` fields.
+However, a `pOne` field and `pTwo` field will not be generated since it would
+be ambiguous which one was matched.
+
+If the first rule is matched, then `pOne1` and `pTwo2` will be non-null while
+`pTwo1` and `pOne2` will be null.
+If the second rule is matched instead, then the opposite would be the case.
+
 ##> Functions

 ### `p_context_init`
@ -639,6 +803,9 @@ if (p_parse(&context) == P_SUCCESS)
 }
 ```

+If AST generation mode is active, then the `p_result()` function returns a
+`Start *` pointing to the `Start` AST structure.
+
 ### `p_position`

 The `p_position()` function can be used to retrieve the parser position where