Parser-centric grammar complexity

We usually think about formal language grammars in terms of the complexity of the language, but it’s also possible — and potentially more useful, for a compiler-writer or other user — to think about them from the perspective of the parser, which is slightly different. I’m teaching grammars and parsing to a second-year class at the moment, which involves getting to grips with the ideas of formal languages, their representation, using them to recognise (“parse”) texts that conform to their rules. Simultaneously I’m also looking at programming language grammars and how they can be simplified. I’ve realised that these two activities don’t quite line up the way I expected. Formal languages are important for computer scientists, as they let us describe the texts allowed for valid programs, data files and the like. The understanding of formal languages derives from linguistics, chiefly the work of Noam Chomsky in recognising the hierarchy of languages induced by very small changes in the capabilities of the underlying grammar: regular languages, context-free and context-sensitive, and so forth. Regular and context-free languages have proven to be the most useful in practice, with everything else being a bit too complicated for most uses. The typical use of grammars in (for example) compilers is that we define a syntax for the language we’re trying to recognise (for example Java) using a (typically context-free) grammar. The grammar states precisely what constitutes a syntactically correct Java program and, conversely, rejects syntactically incorrect programs. This parsing process is typically purely syntactic, and says nothing about whether the program makes sense semantically. The result of parsing is usually a parse tree, a more computer-friendly form of the program that’s then used to generate code after some further analysis and optimisation. In this process we start from the grammar, which is generally quite complex: Java is a complicated system to learn. It’s important to realise that many computer languages aren’t like this at all: they’re a lot simpler, and consequently can have parsers that are extremely simple, compact, and don’t need much in the way of a formal grammar: you can build them from scratch, which would be pretty much impossible for a Java parser. This leads to a different view of the complexity of languages, based on the difficulty of writing a parser for them — which doesn’t quite track the Chomsky hierarchy. Instead you end up with something like this: Isolated. In languages like Forth, programs are composed of symbols separated by whitespace and the language says nothing about the ways in which they can be put together. Put another way, each symbol is recognised and responded to on its own rather than as part of a larger construction. In Java terms, that’s like giving meaning to the { symbol independent of whether it’s part of a class definition or a for loop. It’s easy to see how simple this is for the compiler-writer: we simply look-up each symbol extracted from the input and execute or compile it, without reference to anything else. Clearly this only works to some extent: you only expect to see an else branch somewhere after a corresponding if statement, for example. In Forth this isn’t a matter of grammar, but rather of compilation: there’s an extra-grammatical mechanism for testing the sanity of the control constructs as part of their execution. While this may sound strange (and is, really) it means that it’s easy to extend the syntax of the language — because it doesn’t really have one. Adding new control constructs means adding symbols to be parsed and defining the appropriate control structure to sanity-check their usage. The point is that we can simplify the parser well below the level that might be considered normal and still get a usable system. Incremental. The next step up is to allow the execution of a symbol to take control of the input stream itself, and define what it consumes itself. A good example of this is from Forth again, for parsing strings. The Forth word introducing a literal string is S", as in S" hello world". What happens with the hello world" part of the text isn’t defined by a grammar, though: it’s defined by S", which takes control of the parsing process and consumes the text it wants before returning control to the parser. (In this case it consumes all characters until the closing quotes.) Again this means we can define new language constructs, simply by having words that do their own parsing. Interestingly enough, these parser words could themselves use a grammar that’s different from that of the surrounding language — possibly using a standard parser generator. The main parser takes a back seat while the parser word does its job, allowing for arbitrary extension of the syntax of the language. The disadvantage of course is that there’s no central definition of what constitutes “a program” in the language — but that’s an advantage too, certainly for experimentation and extension. It’s the ideas of dynamic languages extended to syntax, in a way. Structured. Part of the subtlety of defining grammars is avoid ambiguity, making sure that every program can be parsed in a well-defined and unique way. The simplest way to avoid ambiguity is to make everything structured and standard. Lisp and Scheme are the best illustrations of this. Every expression in the language takes the same form: an atom, or a list whose elements may be atoms or other lists. Lists are enclosed in brackets, so it’s always possible to find the scope of any particular construction. It’s extremely easy to write a parser for such a language, and the “grammar” fits onto about three lines of description. Interestingly, this sort of language is also highly extensible, as all constructs look the same. Adding a new control construct is straightforward as long as it follows the model, and again can be done extra-grammatically without defining new elements to the parser. One is more constrained than with the isolated or incremental models, but this constraint means that there’s also more scope for controlled expansion. Scheme, for example, provides a macro facility that allows new syntactic forms, within the scope and style of the overall parser, that nevertheless behave differently to “normal” constructs: they can capture and manipulate fragments of program text and re-combine them into new structures using quasi-quoting and other mechanisms. One can provide these facilities quite uniformly without difficulty, and without explicit syntax. (This is even true for quasi-quoting, although there’s usually some syntactic sugar to make it more usable.) The results will always “look like Lisp”, even though they might behave rather differently: again, we limit the scope of what is dealt with grammatically and what is dealt with programmatically, and get results that are somewhat different to those we would get with the standard route to compiler construction. This isn’t to say that we should give up on grammars and go back to more primitive definitions of languages, but just to observe that grammars evolved to suit a particular purpose that needs to be continually checked for relevance. Most programmers find Java (for example) easier to read than Scheme, despite the latter’s more straightforward and regular syntax: simplicity is in the eye of the beholder, not the parser. But a formal grammar may not be the best solution in situations where we want variable, loose and extensible syntax for whatever reason, so it’s as well to be aware of the alternative that make the problem one of programming rather than of parsing.