Add user guide content for lexer

Josh Holtrop 2023-09-24 16:07:43 -04:00
parent 562c24ce9e
commit 1328a718ac


@@ -7,7 +7,7 @@ ${/remove}
#> Overview
Propane is a LALR Parser Generator (LPG) which:
* accepts LR(0), SLR, and LALR grammars
* generates a built-in lexer to tokenize input
@@ -56,8 +56,10 @@ Example grammar file:
import std.math;
>>
# Parser values are unsigned integers.
ptype ulong;
# A few basic arithmetic operators.
token plus /\\+/;
token times /\\*/;
token power /\\*\\*/;
@@ -72,6 +74,7 @@ token integer /\\d+/ <<
>>
token lparen /\\(/;
token rparen /\\)/;
# Drop whitespace.
drop /\\s+/;
Start -> E1 <<
@@ -103,6 +106,12 @@ E4 -> lparen E1 rparen <<
>>
```
Grammar files can contain comment lines beginning with `#`, which are ignored.
White space in the grammar file is also ignored.
It is conventional to use the extension `.propane` for a Propane grammar file;
however, any file name is accepted by Propane.
##> User Code Blocks
User code blocks begin with the line following a "<<" token and end with the
@@ -165,7 +174,7 @@ token integer /\\d+/ <<
>>
```
Lexer code blocks appear following a `token` or pattern statement.
User code in a lexer code block will be executed when the lexer matches the
given pattern.
Assignment to the `$$` symbol will associate a parser value with the lexed
@@ -190,6 +199,215 @@ rule.
Parser values for the rules or tokens in the rule pattern can be accessed
positionally with tokens `$1`, `$2`, `$3`, etc...
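For example, assuming `$$` also sets the reduced rule's parser value (as it
does for a lexed token), a rule from the arithmetic grammar above might sum
its operands like this (an illustrative sketch, not necessarily the exact
rule bodies used in that example):
```
E1 -> E1 plus E2 <<
  // $1 is the left E1's parser value and $3 is E2's parser value.
  $$ = $1 + $3;
>>
```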
##> Specifying tokens - the `token` statement
The `token` statement allows defining a lexer token and a pattern to match that
token.
The name of the token must be specified immediately following the `token`
keyword.
A regular expression pattern may optionally follow the token name.
If a regular expression pattern is not specified, the name of the token is
taken to be the pattern.
See also: ${#Regular expression syntax}.
Example:
```
token for;
```
In this example, the token name is `for` and the pattern to match it is
`/for/`.
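Because the token name doubles as the pattern here, the statement above is
equivalent to writing the pattern out explicitly:
```
token for /for/;
```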
Example:
```
token lbrace /\{/;
```
In this example, the token name is `lbrace` and a single left curly brace will
match it.
The `token` statement can also include a user code block.
The user code block will be executed whenever the token is matched by the
lexer.
Example:
```
token if <<
  writeln("'if' keyword lexed");
>>
```
The `token` statement is in fact shorthand for the combination of a
`tokenid` statement and a pattern statement.
To define a lexer token without an associated pattern to match it, use a
`tokenid` statement.
To define a lexer pattern that may or may not result in a matched token, use
a pattern statement.
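As a rough sketch of that equivalence (not necessarily the exact expansion
Propane performs internally), a keyword token could also be written as a
`tokenid` statement plus a pattern statement that returns the token using the
`$token()` call shown in the ${#Lexer modes} chapter:
```
# Roughly equivalent to `token if;`.
tokenid if;
/if/ <<
  return $token(if);
>>
```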
##> Defining tokens without a matching pattern - the `tokenid` statement
The `tokenid` statement can be used to define a token without associating it
with a lexer pattern that matches it.
Example:
```
tokenid string;
```
The `tokenid` statement can be useful when defining a token that may optionally
be returned by user code associated with a pattern.
It is also useful when lexer modes and multiple lexer patterns are required to
build up a full token.
A common example is parsing a string.
See the ${#Lexer modes} chapter for more information.
##> Specifying a lexer pattern - the pattern statement
A pattern statement is used to define a lexer pattern that can execute user
code but may not result in a matched token.
Example:
```
/foo+/ <<
  writeln("saw a foo pattern");
>>
```
This can be especially useful with ${#Lexer modes}.
See also ${#Regular expression syntax}.
##> Ignoring input sections - the `drop` statement
A `drop` statement specifies a lexer pattern that, when matched, causes the
matched input to be dropped; lexing then continues after the dropped input.
A common use for a `drop` statement would be to ignore whitespace sequences in
the user input.
Example:
```
drop /\s+/;
```
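Another common use is discarding line comments in the lexed input.
Since `.` does not match a newline, the following sketch (assuming the input
language uses `#` line comments) drops everything from a `#` character to the
end of the line:
```
drop /#.*/;
```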
See also ${#Regular expression syntax}.
##> Regular expression syntax
A regular expression ("regex") is used to define lexer patterns in `token`,
pattern, and `drop` statements.
A regular expression begins and ends with a `/` character.
Example:
```
/#.*$/
```
Regular expressions can include many special characters:
* The `.` character matches any input character other than a newline.
* The `*` character matches zero or more of the previous regex element.
* The `+` character matches one or more of the previous regex element.
* The `?` character matches zero or one of the previous regex element.
* The `[` character begins a character class.
* The `(` character begins a matching group.
* The `{` character begins a count qualifier.
* The `\` character escapes the following character and changes its meaning:
  * The `\d` sequence matches any character `0` through `9`.
  * The `\s` sequence matches a space, a horizontal tab (`\t`), a carriage
    return (`\r`), a form feed (`\f`), or a vertical tab (`\v`) character.
  * Any other escaped character matches itself.
* The `|` character creates an alternate match.
Any other character just matches itself in the input stream.
A character class consists of a list of character alternates or character
ranges, any of which the character class will match.
For example, `[a-zA-Z_]` matches any lowercase letter `a` through `z`, any
uppercase letter `A` through `Z`, or the underscore (`_`) character.
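Character classes combine naturally with the multiplicity characters above;
for example, a typical identifier token (an illustrative sketch, not part of
the example grammar above) could be declared as:
```
token identifier /[a-zA-Z_][a-zA-Z_0-9]*/;
```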
Character classes can also be negative character classes if the first character
after the `[` is a `^` character.
In this case, the set of characters matched by the character class is the
inverse of what it otherwise would have been.
For example, `[^0-9]` matches any character other than `0` through `9`.
A matching group can be used to control which pattern sequence a following
multiplicity specifier (`*`, `+`, `?`, or a count qualifier) applies to.
For example, the pattern `/foo+/` matches "foo" or "foooo", while the pattern
`/(foo)+/` matches "foo" or "foofoofoo", but not "foooo".
A count qualifier in curly braces can be used to restrict the number of matches
of the preceding atom to an explicit minimum and maximum range.
For example, the pattern `/\d{3}/` matches exactly three digits (`0` through `9`).
Both a minimum and a maximum count can be specified, separated by a comma.
For example, `/a{1,5}/` matches between one and five `a` characters.
Either the minimum or the maximum count can be omitted to remove the
corresponding restriction on the number of matches allowed.
An alternate match is created with the `|` character.
For example, the pattern `/foo|bar/` matches either the sequence "foo" or the
sequence "bar".
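Alternation works within `token` statements as well; for example, a
hypothetical boolean-literal token:
```
token boolean /true|false/;
```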
##> Lexer modes
Lexer modes can be used to change the set of patterns that are matched by the
lexer.
A common use for lexer modes is to match strings.
Example:
```
<<
  string mystringvalue;
>>
tokenid str;
# String processing
/"/ <<
  mystringvalue = "";
  $mode(string);
>>
string: /[^"]+/ <<
  mystringvalue += match;
>>
string: /"/ <<
  $mode(default);
  return $token(str);
>>
```
A lexer mode is defined by placing the mode name, followed by a colon (`:`)
character, before a token or pattern statement.
That token or pattern statement then applies only when the named mode is
active.
By default, the active lexer mode is named `default`.
A `$mode()` call within a lexer code block can be used to change lexer modes.
In the above example, when the lexer is in the `default` mode and sees a
double quote (`"`) character, the lexer code block clears the `mystringvalue`
variable and sets the lexer mode to `string`.
When the lexer next looks for patterns to match against the input, it will
only consider patterns tagged for the `string` lexer mode.
Any non-`"` character will be appended to the `mystringvalue` string.
A `"` character will end the `string` lexer mode and return to the `default`
lexer mode.
It also returns the `str` token now that the token is complete.
Note that the token name `str` above could have been `string` instead - the
namespace for token names is distinct from the namespace for lexer modes.
##> Specifying parser value types - the `ptype` statement
The `ptype` statement is used to define parser value type(s).
@@ -248,6 +466,18 @@ In this example:
* a reduced `Values`'s parser value has a type of `Value[]`.
* a reduced `KeyValue`'s parser value has a type of `Value[string]`.
##> Specifying the parser module name - the `module` statement
The `module` statement can be used to specify the module name for a generated
D module.
```
module proj.parser;
```
If a `module` statement is not present, then the generated D module will not
contain a `module` declaration and the default D module name will be used.
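Code using the generated parser can then import it by that name like any
other D module (assuming the generated source file is made available to the
D compiler):
```
// In application code; assumes the generated module is on the import path.
import proj.parser;
```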
#> License
Propane is licensed under the terms of the MIT License: