diff --git a/doc/user_guide.md b/doc/user_guide.md
index 2821865..491fde3 100644
--- a/doc/user_guide.md
+++ b/doc/user_guide.md
@@ -7,7 +7,7 @@ ${/remove}
 
 #> Overview
 
-Propane is an LR Parser Generator (LPG) which:
+Propane is a LALR Parser Generator (LPG) which:
 
   * accepts LR(0), SLR, and LALR grammars
   * generates a built-in lexer to tokenize input
@@ -56,8 +56,10 @@ Example grammar file:
 import std.math;
 >>
 
+# Parser values are unsigned integers.
 ptype ulong;
 
+# A few basic arithmetic operators.
 token plus /\\+/;
 token times /\\*/;
 token power /\\*\\*/;
@@ -72,6 +74,7 @@ token integer /\\d+/ <<
 >>
 token lparen /\\(/;
 token rparen /\\)/;
+# Drop whitespace.
 drop /\\s+/;
 
 Start -> E1 <<
@@ -103,6 +106,12 @@ E4 -> lparen E1 rparen <<
 >>
 ```
 
+Grammar files can contain comment lines beginning with `#`, which are ignored.
+Whitespace in the grammar file is also ignored.
+
+It is conventional to use the extension `.propane` for Propane grammar files;
+however, any file name is accepted by Propane.
+
 ##> User Code Blocks
 
 User code blocks begin with the line following a "<<" token and end with the
@@ -165,7 +174,7 @@ token integer /\\d+/ <<
 >>
 ```
 
-Lexer code blocks appear following a `token` or `pattern` expression.
+Lexer code blocks appear following a `token` or pattern expression.
 User code in a lexer code block will be executed when the lexer matches the
 given pattern.
 Assignment to the `$$` symbol will associate a parser value with the lexed
@@ -190,6 +199,215 @@ rule.
 Parser values for the rules or tokens in the rule pattern can be accessed
 positionally with tokens `$1`, `$2`, `$3`, etc...
 
+##> Specifying tokens - the `token` statement
+
+The `token` statement allows defining a lexer token and a pattern to match that
+token.
+The name of the token must be specified immediately following the `token`
+keyword.
+A regular expression pattern may optionally follow the token name.
+If a regular expression pattern is not specified, the name of the token is
+taken to be the pattern.
+See also: ${#Regular expression syntax}.
+
+Example:
+
+```
+token for;
+```
+
+In this example, the token name is `for` and the pattern to match it is
+`/for/`.
+
+Example:
+
+```
+token lbrace /\{/;
+```
+
+In this example, the token name is `lbrace` and a single left curly brace will
+match it.
+
+The `token` statement can also include a user code block.
+The user code block will be executed whenever the token is matched by the
+lexer.
+
+Example:
+
+```
+token if <<
+  writeln("'if' keyword lexed");
+>>
+```
+
+The `token` statement is actually a shortcut for the combination of a
+`tokenid` statement and a pattern statement.
+To define a lexer token without an associated pattern to match it, use a
+`tokenid` statement.
+To define a lexer pattern that may or may not result in a matched token, use
+a pattern statement.
+
+##> Defining tokens without a matching pattern - the `tokenid` statement
+
+The `tokenid` statement can be used to define a token without associating it
+with a lexer pattern that matches it.
+
+Example:
+
+```
+tokenid string;
+```
+
+The `tokenid` statement can be useful when defining a token that may optionally
+be returned by user code associated with a pattern.
+
+It is also useful when lexer modes and multiple lexer patterns are required to
+build up a full token.
+A common example is parsing a string.
+See the ${#Lexer modes} chapter for more information.
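+
+As noted above, a `token` statement combines a `tokenid` statement with a
+pattern statement.
+For illustration, a simple statement such as `token if;` might be expanded by
+hand into the following sketch, using the `$token()` call described in the
+${#Lexer modes} chapter:
+
+```
+tokenid if;
+
+/if/ <<
+  return $token(if);
+>>
+```
+
+This is only a sketch of the relationship, not necessarily the exact expansion
+Propane performs internally.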
+
+##> Specifying a lexer pattern - the pattern statement
+
+A pattern statement is used to define a lexer pattern that can execute user
+code but may not result in a matched token.
+
+Example:
+
+```
+/foo+/ <<
+  writeln("saw a foo pattern");
+>>
+```
+
+This can be especially useful with ${#Lexer modes}.
+
+See also ${#Regular expression syntax}.
+
+##> Ignoring input sections - the `drop` statement
+
+A `drop` statement can be used to specify a lexer pattern that, when matched,
+causes the matched input to be dropped and lexing to continue after the
+matched input.
+
+A common use for a `drop` statement is to ignore whitespace sequences in
+the user input.
+
+Example:
+
+```
+drop /\s+/;
+```
+
+See also ${#Regular expression syntax}.
+
+##> Regular expression syntax
+
+A regular expression ("regex") is used to define lexer patterns in `token`,
+pattern, and `drop` statements.
+A regular expression begins and ends with a `/` character.
+
+Example:
+
+```
+/#.*$/
+```
+
+Regular expressions can include many special characters:
+
+  * The `.` character matches any input character other than a newline.
+  * The `*` character matches zero or more of the previous regex element.
+  * The `+` character matches one or more of the previous regex element.
+  * The `?` character matches zero or one of the previous regex element.
+  * The `[` character begins a character class.
+  * The `(` character begins a matching group.
+  * The `{` character begins a count qualifier.
+  * The `\` character escapes the following character and changes its meaning:
+    * The `\d` sequence matches any character `0` through `9`.
+    * The `\s` sequence matches a space, horizontal tab `\t`, carriage return
+      `\r`, a form feed `\f`, or a vertical tab `\v` character.
+    * Any other escaped character matches itself.
+  * The `|` character creates an alternate match.
+
+Any other character just matches itself in the input stream.
+
+A character class consists of a list of character alternates or character
+ranges that can be matched by the character class.
+For example, `[a-zA-Z_]` matches any lowercase character between `a` and `z`,
+any uppercase character between `A` and `Z`, or the underscore `_` character.
+Character classes can also be negative character classes if the first character
+after the `[` is a `^` character.
+In this case, the set of characters matched by the character class is the
+inverse of what it otherwise would have been.
+For example, `[^0-9]` matches any character other than `0` through `9`.
+
+A matching group can be used to override the pattern sequence that multiplicity
+specifiers apply to.
+For example, the pattern `/foo+/` matches "foo" or "foooo", while the pattern
+`/(foo)+/` matches "foo" or "foofoofoo", but not "foooo".
+
+A count qualifier in curly braces can be used to restrict the number of matches
+of the preceding atom to an explicit minimum and maximum range.
+For example, the pattern `/\d{3}/` matches exactly 3 digits 0-9.
+Both a minimum and maximum count can be specified, separated by a comma.
+For example, `/a{1,5}/` matches between 1 and 5 `a` characters.
+Either the minimum or the maximum count can be omitted to remove the
+corresponding restriction on the number of matches allowed.
+
+An alternate match is created with the `|` character.
+For example, the pattern `/foo|bar/` matches either the sequence "foo" or the
+sequence "bar".
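+
+As a sketch combining several of these elements, the following token
+definitions (the token names here are only illustrative and are not part of
+the example grammar above) use character classes, multiplicity specifiers, and
+a count qualifier:
+
+```
+# An identifier: a letter or underscore followed by letters, digits, or
+# underscores.
+token identifier /[a-zA-Z_][a-zA-Z0-9_]*/;
+
+# A hexadecimal literal: "0x" or "0X" followed by one to eight hex digits.
+token hexliteral /0[xX][0-9a-fA-F]{1,8}/;
+```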
+
+##> Lexer modes
+
+Lexer modes can be used to change the set of patterns that are matched by the
+lexer.
+A common use for lexer modes is to match strings.
+
+Example:
+
+```
+<<
+string mystringvalue;
+>>
+
+tokenid str;
+
+# String processing
+/"/ <<
+  mystringvalue = "";
+  $mode(string);
+>>
+string: /[^"]+/ <<
+  mystringvalue += match;
+>>
+string: /"/ <<
+  $mode(default);
+  return $token(str);
+>>
+```
+
+A lexer mode is defined by placing the mode name, followed by a colon (`:`)
+character, before a token or pattern statement.
+The token or pattern statement then applies only when the named mode is
+active.
+
+By default, the active lexer mode is named `default`.
+A `$mode()` call within a lexer code block can be used to change lexer modes.
+
+In the above example, when the lexer in the default mode sees a double quote
+(`"`) character, the lexer code block will clear the `mystringvalue` variable
+and will set the lexer mode to `string`.
+When the lexer begins looking for patterns to match against the input, it will
+now look only for patterns tagged for the `string` lexer mode.
+Any non-`"` character will be appended to the `mystringvalue` string.
+A `"` character will end the `string` lexer mode and return to the `default`
+lexer mode.
+It also returns the `str` token now that the token is complete.
+
+Note that the token name `str` above could have been `string` instead; the
+namespace for token names is distinct from the namespace for lexer modes.
+
 ##> Specifying parser value types - the `ptype` statement
 
 The `ptype` statement is used to define parser value type(s).
@@ -248,6 +466,18 @@ In this example:
   * a reduced `Values`'s parser value has a type of `Value[]`.
   * a reduced `KeyValue`'s parser value has a type of `Value[string]`.
 
+##> Specifying the parser module name - the `module` statement
+
+The `module` statement can be used to specify the module name for a generated
+D module.
+
+```
+module proj.parser;
+```
+
+If a `module` statement is not present, then the generated D module will not
+contain a module declaration and the default module name will be used.
+
 #> License
 
 Propane is licensed under the terms of the MIT License: