LuZc

Parser for conjunctive context-sensitive grammars

Terminal grammar

The name "(non)terminal" is best understood from the perspective of generating a string according to the formal grammar, not parsing one. Nonterminals, like the start symbol, can keep being transformed according to the rules, but terminals stay as they are. As such, terminals are meant to represent the smallest meaningful pieces of the language, which need not be the individual characters of the alphabet or character set in use. For example, in the string "10+12" the single digit 0 is not supposed to have any meaning by itself; rather, the whole number that the digit belongs to (10) is the smallest meaningful substring.
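As a rough illustration of the "10+12" example, the sketch below splits that string into its smallest meaningful pieces. It uses Python's standard re module rather than LuZc itself, and the function name split_terminals is made up for this illustration.

```python
import re

def split_terminals(text: str) -> list:
    """Split text into whole numbers and '+' signs (illustrative only)."""
    # \d+ takes a maximal run of digits, so "10" stays one piece
    # instead of being broken into "1" and "0".
    return re.findall(r"\d+|\+", text)

print(split_terminals("10+12"))
```

The digit 0 never shows up as a piece of its own; it only appears as part of the number 10.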

Although no smaller part of a terminal is recognized as meaningful, a terminal can have many subcomponents checked for existence before it is accepted. In fact, everything needed for a context-free language can fit into a single LuZc terminal expression.

Whitespace is meaningful, so unless care is taken to clean up the grammar, terminal expressions can become long and messy.

Character literals are entered as themselves, like "a" or "b", unless they are special characters used to signify other operations. Any whitespace character represents itself, contrary to what other programs might do.

Special characters: the characters []!+*?,()\<>% each have a purpose and cannot be used as literals; they can only be entered through escape sequences. A backslash followed by the character is the easiest way, e.g. "\?", "\(" etc. "\0" is a special one for the NULL character. Unicode characters can also be entered directly via "\x12", "\u1234" or "\U12345678", with a set number of hexadecimal digits following the x, u or U. An escape sequence can appear anywhere a single character literal can, and behaves the same way.
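The escape rules above can be sketched as a small decoder. This is not LuZc's own code: decode_escape is a hypothetical helper, and the digit counts (2 after x, 4 after u, 8 after U) are an assumption read off the examples.

```python
import string

# Assumed from the examples "\x12", "\u1234", "\U12345678".
HEX_DIGIT_COUNT = {"x": 2, "u": 4, "U": 8}

def decode_escape(seq: str) -> str:
    """Return the single character denoted by one escape sequence."""
    if len(seq) < 2 or seq[0] != "\\":
        raise ValueError("an escape sequence starts with a backslash")
    kind, rest = seq[1], seq[2:]
    if kind in HEX_DIGIT_COUNT:
        wrong_length = len(rest) != HEX_DIGIT_COUNT[kind]
        if wrong_length or any(c not in string.hexdigits for c in rest):
            raise ValueError("wrong number or kind of hexadecimal digits")
        code = int(rest, 16)
        if code > 0x10FFFF:
            raise ValueError("code point is outside the Unicode range")
        return chr(code)
    if rest:
        raise ValueError("unrecognised escape sequence")
    if kind == "0":
        return "\0"   # the NULL character
    return kind       # an escaped special character, e.g. \? gives ?
```

For instance, decode_escape("\\?") gives "?" and decode_escape("\\x41") gives "A" (the backslash is doubled only because these are Python string literals).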

Concatenation: the concatenation of two parts is expressed by simply writing them one after the other, e.g. "keyword" is 7 concatenated characters, and "(ab)+(cd)?(ef)*g" has several parts that must appear one after another.

Character ranges are written inside brackets, with two characters giving the start and end of the intended Unicode range. A comma followed by another pair means that either range can be used, e.g. "[AZ,az]" means any single English letter. Combining ranges is most useful for complement ranges.

Complements can only be applied to literals, ranges and unions of ranges, and are written with a ! suffix. So "[AZ,az]!" means any character besides an English letter.
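A minimal sketch of how ranges, unions of ranges and the ! suffix could be evaluated against one character. The names parse_ranges and matches_char are hypothetical. The sketch assumes that range endpoints are inclusive and that the expression contains no escape sequences.

```python
def parse_ranges(expr: str):
    """Parse "[AZ,az]" or "[AZ,az]!" into (list of (low, high) pairs, negated)."""
    negated = expr.endswith("!")
    body = expr[:-1] if negated else expr
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("a range expression is written inside brackets")
    pairs = []
    for part in body[1:-1].split(","):
        if len(part) != 2:
            raise ValueError("each range is exactly two characters")
        low, high = part
        if low > high:
            raise ValueError("range start must not come after range end")
        pairs.append((low, high))
    return pairs, negated

def matches_char(expr: str, ch: str) -> bool:
    """True if the single character ch is accepted by the range expression."""
    pairs, negated = parse_ranges(expr)
    # Single-character strings compare by Unicode code point in Python.
    inside = any(low <= ch <= high for low, high in pairs)
    return inside != negated   # the ! suffix flips the result
```

With this sketch, matches_char("[AZ,az]", "q") is true, while matches_char("[AZ,az]!", "q") is false and matches_char("[AZ,az]!", "5") is true.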

Repetition and optionality are denoted by three suffix operators, in the style of the Kleene star. The * operator means 0 or more, the + operator means 1 or more, and the ? operator means 0 or 1.
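The three suffix operators differ only in how many repetitions they allow, which the following sketch makes explicit for a literal piece. match_repeated is a hypothetical helper, and greedy matching is an assumption; the text above does not say how LuZc chooses among possible repetition counts.

```python
# (minimum, maximum) repetition counts; None means no upper limit.
BOUNDS = {"*": (0, None), "+": (1, None), "?": (0, 1)}

def match_repeated(piece: str, op: str, text: str, start: int = 0):
    """Greedily match `piece` repeated according to `op` at text[start:].

    Returns the index just past the match, or None if there are too few
    repetitions.
    """
    if not piece:
        raise ValueError("piece must not be empty")
    low, high = BOUNDS[op]
    count, pos = 0, start
    while (high is None or count < high) and text.startswith(piece, pos):
        pos += len(piece)
        count += 1
    return pos if count >= low else None
```

Against "ababx", the piece "ab" with * or + consumes four characters, while with ? it consumes only the first two.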

Blocks group a substructure much as a nonterminal would, but they are not accessible as subunits. Parentheses denote the parts that should stick together.

Disjunction (or) is written with an infix vertical bar | and is satisfied by either choice, e.g. "(ab)|(cd)". Only the immediate piece on either side is used, so "ab|cd" would match abd or acd, not ab or cd.
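The binding of | is tighter than in most regular expression dialects, so it may help to see the two examples translated by hand into Python's re syntax. The translation is an assumption based on the description above, not output from LuZc, and accepts is a hypothetical helper.

```python
import re

# Hand translation into Python regex syntax:
#   LuZc "(ab)|(cd)"  ->  (?:ab)|(?:cd)   either whole block
#   LuZc "ab|cd"      ->  a(?:b|c)d       only b and c are alternatives
GROUPED = re.compile(r"(?:ab)|(?:cd)")
UNGROUPED = re.compile(r"a(?:b|c)d")

def accepts(pattern, text: str) -> bool:
    """True if the whole of text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None

for text in ("ab", "cd", "abd", "acd"):
    print(text, accepts(GROUPED, text), accepts(UNGROUPED, text))
```

The grouped form accepts only ab and cd, and the ungrouped form accepts only abd and acd.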

Context can even be considered in terminals, and is denoted by the infix operators < and >, which can never appear inside a block. Everything left of the < mark or right of the > mark is checked for existence, but only the part between < and > is considered the matching string, e.g. "a<b>c" against "ababababcbcbcbc" would only catch the 4th b.
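The "a<b>c" example can be checked with a short sketch that handles literal pieces only. find_in_context is a hypothetical helper, and it assumes the context must sit immediately next to the matched part, which is what the example suggests.

```python
def find_in_context(left: str, middle: str, right: str, text: str) -> list:
    """Return (start, end) spans of `middle` with `left` directly before it
    and `right` directly after it. Only `middle` is part of each span."""
    spans = []
    whole = left + middle + right
    pos = text.find(whole)
    while pos != -1:
        start = pos + len(left)
        spans.append((start, start + len(middle)))
        pos = text.find(whole, pos + 1)
    return spans

text = "ababababcbcbcbc"
spans = find_in_context("a", "b", "c", text)
print(spans)
# Count the b characters up to and including the match to see which one it is.
print(text[: spans[0][1]].count("b"))
```

The only span found covers the b at index 7, which is the 4th b in the string; the a before it and the c after it are checked but not included.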