Simon Dobson (Posts about forth)

Parser-centric grammar complexity

Simon Dobson — Tue, 09 Oct 2012 09:00:13 GMT

We usually think about formal language grammars in terms of the complexity of the language, but it's also possible -- and potentially more useful, for a compiler-writer or other user -- to think about them from the perspective of the parser, which is slightly different. I'm teaching grammars and parsing to a second-year class at the moment, which involves getting to grips with the ideas of formal languages, their representation, using them to recognise ("parse") texts that conform to their rules. Simultaneously I'm also looking at programming language grammars and how they can be simplified. I've realised that these two activities don't quite line up the way I expected. Formal languages are important for computer scientists, as they let us describe the texts allowed for valid programs, data files and the like. The understanding of formal languages derives from linguistics, chiefly the work of Noam Chomsky in recognising the hierarchy of languages induced by very small changes in the capabilities of the underlying grammar: regular languages, context-free and context-sensitive, and so forth. Regular and context-free languages have proven to be the most useful in practice, with everything else being a bit too complicated for most uses. The typical use of grammars in (for example) compilers is that we define a syntax for the language we're trying to recognise (for example Java) using a (typically context-free) grammar. The grammar states precisely what constitutes a syntactically correct Java program and, conversely, rejects syntactically incorrect programs. This parsing process is typically purely syntactic, and says nothing about whether the program makes sense semantically. The result of parsing is usually a parse tree, a more computer-friendly form of the program that's then used to generate code after some further analysis and optimisation. In this process we start from the grammar, which is generally quite complex: Java is a complicated system to learn. It's important to realise that many computer languages aren't like this at all: they're a lot simpler, and consequently can have parsers that are extremely simple, compact, and don't need much in the way of a formal grammar: you can build them from scratch, which would be pretty much impossible for a Java parser. This leads to a different view of the complexity of languages, based on the difficulty of writing a parser for them -- which doesn't quite track the Chomsky hierarchy. Instead you end up with something like this: Isolated. In languages like Forth, programs are composed of symbols separated by whitespace and the language says nothing about the ways in which they can be put together. Put another way, each symbol is recognised and responded to on its own rather than as part of a larger construction. In Java terms, that's like giving meaning to the { symbol independent of whether it's part of a class definition or a for loop. It's easy to see how simple this is for the compiler-writer: we simply look-up each symbol extracted from the input and execute or compile it, without reference to anything else. Clearly this only works to some extent: you only expect to see an else branch somewhere after a corresponding if statement, for example. In Forth this isn't a matter of grammar, but rather of compilation: there's an extra-grammatical mechanism for testing the sanity of the control constructs as part of their execution. While this may sound strange (and is, really) it means that it's easy to extend the syntax of the language -- because it doesn't really have one. Adding new control constructs means adding symbols to be parsed and defining the appropriate control structure to sanity-check their usage. The point is that we can simplify the parser well below the level that might be considered normal and still get a usable system. Incremental. The next step up is to allow the execution of a symbol to take control of the input stream itself, and define what it consumes itself. A good example of this is from Forth again, for parsing strings. The Forth word introducing a literal string is S", as in S" hello world". What happens with the hello world" part of the text isn't defined by a grammar, though: it's defined by S", which takes control of the parsing process and consumes the text it wants before returning control to the parser. (In this case it consumes all characters until the closing quotes.) Again this means we can define new language constructs, simply by having words that do their own parsing. Interestingly enough, these parser words could themselves use a grammar that's different from that of the surrounding language -- possibly using a standard parser generator. The main parser takes a back seat while the parser word does its job, allowing for arbitrary extension of the syntax of the language. The disadvantage of course is that there's no central definition of what constitutes "a program" in the language -- but that's an advantage too, certainly for experimentation and extension. It's the ideas of dynamic languages extended to syntax, in a way. Structured. Part of the subtlety of defining grammars is avoid ambiguity, making sure that every program can be parsed in a well-defined and unique way. The simplest way to avoid ambiguity is to make everything structured and standard. Lisp and Scheme are the best illustrations of this. Every expression in the language takes the same form: an atom, or a list whose elements may be atoms or other lists. Lists are enclosed in brackets, so it's always possible to find the scope of any particular construction. It's extremely easy to write a parser for such a language, and the "grammar" fits onto about three lines of description. Interestingly, this sort of language is also highly extensible, as all constructs look the same. Adding a new control construct is straightforward as long as it follows the model, and again can be done extra-grammatically without defining new elements to the parser. One is more constrained than with the isolated or incremental models, but this constraint means that there's also more scope for controlled expansion. Scheme, for example, provides a macro facility that allows new syntactic forms, within the scope and style of the overall parser, that nevertheless behave differently to "normal" constructs: they can capture and manipulate fragments of program text and re-combine them into new structures using quasi-quoting and other mechanisms. One can provide these facilities quite uniformly without difficulty, and without explicit syntax. (This is even true for quasi-quoting, although there's usually some syntactic sugar to make it more usable.) The results will always "look like Lisp", even though they might behave rather differently: again, we limit the scope of what is dealt with grammatically and what is dealt with programmatically, and get results that are somewhat different to those we would get with the standard route to compiler construction. This isn't to say that we should give up on grammars and go back to more primitive definitions of languages, but just to observe that grammars evolved to suit a particular purpose that needs to be continually checked for relevance. Most programmers find Java (for example) easier to read than Scheme, despite the latter's more straightforward and regular syntax: simplicity is in the eye of the beholder, not the parser. But a formal grammar may not be the best solution in situations where we want variable, loose and extensible syntax for whatever reason, so it's as well to be aware of the alternative that make the problem one of programming rather than of parsing.

Forth, 2^5 years ago

Simon Dobson — Wed, 08 Aug 2012 16:39:51 GMT

2^5 years ago this month, Byte magazine devoted an issue to the Forth language. Byte ("the small systems journal") volume 5 number 8, August 1980, was largely devoted to Forth. I only discovered this by accident, researching some background for a paper I'm writing on extensible virtual machines. What's even more remarkable is that you can download the issue -- along with a lot of others -- as a PDF. That the premier hobbyist/hacker magazine of its day would give over most of an entire issue to one language tells you something about the way people thought about programming their machines back then. There was a premium on compactness: one of the first adverts in this issue of Byte is for a Cromenco Z-2H with 64Kb of RAM and 11Mb of hard disc, and proudly claiming that it "is under $10K". One article is a history of Forth, one is a tutorial, and two are deeply technical programming pieces aimed at people comfortable with the idea of writing their own software pretty much from scratch -- and indeed, keen to get on with it. What's more, they could write software as good or better as that which they could buy (to the extent that there was any hobbyist software to buy). That's not something we've been able to say for at least the last 2^4 years: hobbyist software hasn't competed with commercial offerings in most domains for a long time. I think there were a number of things going on. The simplicity of the machines was obviously a bonus: one could understand the hardware and software of a personal computer in its entirety, and contemplate re-writing it from the ground up as an individual or small group. Expectations were lower, but that works both ways: low expectations coupled with low-performance hardware can still lead to some impressive software. But it's certainly the case that one of the main barriers to software development from-the-ground-up these days is the need to interface with so many devices and processes in order to do anything of interest: any new system would need to talk to flash drives and the web, which probably means writing device drivers and a filing system. You can get round this using hardware packages, of course: Zigbee radios have simple programmer interfaces and encapsulate the software stack inside them. Another factor, though, was a difference in ambition. A hobbyist in the 1980's only had herself and her friends to impress (and be impressed by): the horizon was closer. I'm sure that led to a lot of re-definition and duplication that the internet would allow one to avoid somewhat, but in some senses it provided a better learning environment in which a sequence of problems needed solution from a programmer's own creativity and resources. That's a significantly different skill set than what's required today, where we place a value on compliance, compatibility and re-use at least as high as we place on creativity and innovation. I'm not advocating a return to the past -- although programming in Forth for sensor networks does give me a tremendous sense of pleasure that I haven't felt in programming for a long time, at least partially derived from the old-school nature of it all. However, I would say that there's also value in this old-school style even today. The hackers who read Byte wouldn't settle for sub-standard tools: they wouldn't think twice about re-coding their compilers and operating systems (as well as their applications) if they were sub-standard. That power brings on a sense of power -- the ability to change what's perceived to be wrong in something-- that's to be celebrated and encouraged even today, amongst programmers who sometimes seem to be constrained by their toolchains rather than freed by them.

Evolving programming languages

Simon Dobson — Fri, 20 May 2011 07:00:11 GMT

Most programming languages have fixed definitions and hard boundaries. In thinking about building software for domains we don't understand very well, a case can be made for a more relaxed, evolutionary approach to language design. I've been thinking a lot about languages this week, for various reasons: mainly about the recurring theme of what are the right programming structures for systems driven by sensors, whether they're pervasive systems or sensor networks. In either case, the structures we've evolved for dealing with desktop and server systems don't feel like they're the right abstractions to effectively take things forward. A favourite example is the if statement: first decide whether a condition is true or false, and execute one piece of code or another depending on which it is. In a sensor-driven system we often can't make this determination cleanly because of noise and uncertainty -- and if we can, it's often only probably true, and only for a particular period. So are if statements (and while loops and the like) actually appropriate constructs, when we can't make the decisions definitively? Whatever you think of this example (and plenty of people hate it) there are certainly differences between what we want to do between traditional and highly sensorised systems, and consequently how we program them. The question is, how do we work out what the right structures are? Actually, the question is broader than this. It should be: how do we improve our ability to develop languages that match the needs of particular computational and conceptual domains? Domain-specific languages (DSLs) have a tangled history in computer science, pitched between those who like the idea and those who prefer their programming languages general-purpose and re-usable across a range of domains. There are strong arguments on both sides: general-purpose languages are more productive to learn and are often more mature, but can be less efficient and more cumbersome to apply; DSLs mean learning another language that may not last long and will probably have far weaker support, but can be enormously more productive and well-targeted in use. In some ways, though, the similarities between traditional languages and DSLs are very strong. As a general rule both will have syntax and semantics defined up-front: they won't be experimental in the sense of allowing experimentation within the language itself. If we don't know what we're building, does it make sense to be this definite? There are alternatives. One that I'm quite keen on is the idea of extensible virtual machines, where the primitives of a language are left "soft" to be extended as required. This style has several advantages. Firstly, it encourages experimentation by not forcing a strong division of concepts between the language we write (the target language) and the language this is implemented in (the host language): the two co-evolve. Secondly, it allows extensions to be as efficient as "base" elements, assuming we can reduce the cost of building new elements appropriately low. Thirdly, it allows multiple paradigms and approaches to co-exist within the same system, since they can share some elements while having other that differ. Another related feature is the ability to modify the compiler: that is, don't fix the syntax or the way in which its handled. So as well as making the low level soft, we also make the high level soft. The advantage here is two-fold. Firstly, we can modify the forms of expression we allow to capture concepts precisely. A good example would be the ability to add concurrency control to a language: the low-level might use semaphores, but programing might demand monitors or some other construct. Modifying the high-level form of the language allows these constructs to be added if required -- and ignored if not. This actually leads to the second advantage, that we can avoid features we don't want to be available, for example not providing general recursion for languages that need to complete all operations in a finite time. This is something that's surprisingly uncommon in language design despite being common in teaching programming: leaving stuff out can have a major simplifying effect. Some people argue that syntax modification is unnecessary in a language that's sufficiently expressive, for example Haskell. I don't agree. The counter-example is actually in Haskell itself, in the form of the do block syntactic sugar for simplifying monadic computations. This had to be in the language to make it in any way usable, which implied a change of definition, and the monad designers couldn't add it without the involvement of the language "owners", even though the construct is really just a re-formulation and generalisation of one common in other languages. There are certainly other areas in which such sugaring would be desirable to make the forms of expression simpler and more intuitive. The less well we understand a domain, the more likely this is to happen. Perhaps surprisingly, there are a couple of existing examples of systems that do pretty much what I'm suggesting. Forth is a canonical example (which explains my current work on Attila); Smalltalk is another, where the parser an run-time are almost completely exposed, although abstracted behind several layers of higher-level structure. Both the languages are quite old, have devoted followings, and weakly and dynamically typed -- and may have a lot to teach us about how to develop languages for new domains. They share a design philosophy of allowing a language to evolve to meet new applications. In Forth, you don't so much write applications as extend the language to meet the problem; in Smalltalk you develop a model of co-operating objects that provide direct-manipulation interaction through the GUI. In both cases the whole language, including the definition and control structures, is built in the language itself via bootstrapping and cross-compilation. Both languages are compiled, but in both cases the separation between run-time and compile-time are weak, in the sense that the compiler is by default available interactively. Interestingly this doesn't stop you building "normal" compiled applications: cross-compile a system without including the compiler itself, a process that can still take advantage of any extensions added into the compiler without cluttering-up the compiled code. You're unlikely to get strictly the best performance or memory footprint as you might with a mature C compiler, but you do get advantages in terms of expressiveness and experimentation which seem to outweigh these in a domain we don't understand well. In particular, it means you can evolve the language quickly, easily, and within itself, to explore the design space more effectively and find out whether your wackier ideas are actually worth pursuing further.

Languages for extensible virtual machines

Simon Dobson — Fri, 28 May 2010 05:00:30 GMT

Many languages have an underlying virtual machine (VM) to provide a more portable and convenient substrate for compilation or interpretation. For language research it's useful to be able to generate custom VMs and other language tools for different languages. Which raises the question: what's the appropriate language for writing experimental languages? What I have in mind is slightly more than just VMs, and more a platform for experimenting with language design for novel environments such as sensor-driven systems. As well as a runtime, this requires the ability to parse, to represent and evaluate type and semantic rules, and to provide a general framework for computation that can then be exposed into a target language as constructs, types and so forth. What's the right language in which to do all this? This isn't a simple question. It's well-accepted that the correct choice of language is vital to the success of a coding project. One could work purely at the language level, exploring constructs and type systems without any real constraint of the real world (such as being runnable on a sensor mote). This has to some extent been traditional in programming language research, justified by the Moore's law increases in performance of the target machines. It isn't justifiable for sensor networks, though, where we won't see the same phenomenon. If we want to prototype realistic language tools in the same framework, we need at least a run-time VM that was appropriate for these target devices; alternatively we could ignore this, focus on the language, and prototype only when we're happy with the structures, using a different framework. My gut ffeeling is that the former is preferable, if it's possible, for reasons of conceptual clarity, impact and simplicity. But even without making this decision we can consider the features of different candidate language-writing languages:

C

The most obvious approach is to use C, which is run-time-efficient and runs on any potential platform. For advanced language research, though, it's less attractive because of its poor symbolic data handling. That makes it harder to write type-checking sub-systems and the like, which are essentially symbolic mathematics.

Forth

I've wondered about Forth before. At one level it combines the same drawbacks as C -- poor symbolic and dynamic data handling -- with the additional drawback of being unfamiliar to almost everyone. Forth does have some redeeming features, though. Firstly, threaded interpretation means that additional layers of abstraction are largely cost-free: they run at the same speed as the language itself. Moreover there's a sense in which threaded interpretation blurs the distinction between host language and meta-language: you don't write Forth applications, you extend it towards the problem, so the meta-language becomes the VM and language tool. This is something that needs some further exploration.

Scheme

Scheme's advantages are its simplicity, regularity, and pretty much unrivalled flexibility in handling symbolic data. There's a long tradition of Scheme-based language tooling, and so a lot of experience and libraries to make use of. It's also easy to write purely functional code, which can aid re-use. Scheme is dynamically typed, which can be great when exploring approaches like partial evaluation (specialising an interpreter against a particular piece of code to get a compiled program, for example).

Haskell

In some ways, Haskell is the obvious language for a new language project. The strong typing, type classing and modules mean one can generate a typed meta-language. There are lots of libraries and plenty of activity in the research community. Moreover Haskell is in many ways the "mathematician's choice" of language, since one can often map mathematical concepts almost directly into code. Given thaat typing and semantics are just mathematical operations over symbols, this is a significant attraction. Where Haskell falls over, of course, is its runtime overheads -- mostly these days in terms of memory rather than performance. It essentially mandates a choice of target platform to be fairly meaty, which closes-off some opportunities. There are some "staged" Haskell strategies that might work around this, and one could potentially stage the code to another runtime virtual machine. Or play games like implement a Forth VM inside Haskell for experimentation, and then emit code for a different Forth implementation for runtime.

Java

Java remains the language du jour for most new projects. It has decent dynamic data handling, poor symbolic data handling, fairly large run-time overheads and a well-stocked library for re-use. (Actually I used Java for Vanilla, an earlier project in a similar area.) Despite the attractions, Java feels wrong. It doesn't provide a good solution to any of the constraints, and would be awkward as a platform for manipulating rules-based descriptions.

Smalltalk

Smalltalk -- and especially Squeak -- isn't a popular choice within language research, but does have a portable virtual machine, VM generation, and other nice features and libraries. The structure is also attractive, being modern and object-oriented. It's also a good platform for building interactive systems, so one could do simulation, visual programming and the like within the same framework -- something that'd be much harder with other choices. There are also some obvious connectionns between Smalltalk and pervasive systems, where one is talking about the interactions of objects in the real world. Where does that leave us? Nowhere, in a sense, other than with a list of characteristics of different candidate languages for language research. It's unfortunate there isn't a clear winner; alternatively, it's positive that there's a choice depending on the final direction. The worry has to be that a project like this is a moving target that moves away from the areas of strength for any choice made.

Forth for sensors?

Simon Dobson — Fri, 12 Mar 2010 06:00:10 GMT

Most sensor systems are programmed using C: compact and well-known, but low-level and tricky to get right when things get compact and complex. There have been several proposals for alternative languages from across the programming language research spectrum. I haven't heard anyone mention Forth, though, and it's worth considering -- even if only as a target for other languages. Many people will never have encountered Forth, a language that's enjoyed up-and-down popularity for over four decades without ever hitting the mainstream. Sometimes touted as an alternative to Basic, it even had an early-1980's home computer that used it as the introductory language. Forth has a number of characteristics that are completely different to the vast majority of modern languages:

It uses and explicit data stack and Reverse-Polish notation uniformly throughout the language
There's no type system. Everything is represented pretty much using addresses and integers. The programmer is on her own when building complex structures
It is a threaded interpreter where every construct in the language is a "word". Composing words together generates new words, but (unlike in an interpreter) the definitions are compiled efficiently, so there's an immediacy to things without crippling performance overheads
A standard system usually mixes its "shell" and "compiler" together, so one can define new words interactively which get compiled immediately
There's a small kernel of machine-code (or C) routines, but...
The compiler itself -- and indeed the vast majority of the system -- can be written in Forth, so you can extend the compiler (and hence the language) with new constructs that have first-class status alongside the built-in words. There's typically almost no overhead of programmer- versus system-defined words, since they're all written in the same (fast) language
If you're careful, you can build a cross-compiler that will generate code for a different target system: just port the kernel and the compiler should be re-usable to generate code that'll run on it. (It's not that simple, of course, as I'm finding myself...)

So Forth programs don't look like other languages. There's no real phase distinction between compilation and run-time, since everything's mixed-in together, but that has the advantage that you can write new "compiler" words to make it easier to write your "application" words, all within the same framework and set of capabilities. You don't write applications so much as extend the language itself towards your problem. That in turn means you can view Forth either as low-level -- a glorified assembler -- or very high-level in terms of its ability to define new syntax and semantics. That probably sounds like a nightmare, but suspend judgment for a while..... One of the problems that concerns a lot of sensor networks people is the programming level at which we have to deal with systems. Typically we're forced to write C on a per-node basis: program the individual nodes, and try to set it up so that the network behaves, as a whole, in an intended way. This is clearly possible in many cases, and clearly gets way more difficult as things get bigger and more complex, and especially when we want the network to adapt to the phenomena it's sensing, which often requires decisions to be made on a non-local basis. Writing a new language is a big undertaking, but a substantially smaller undertaking with Forth. It's possible to conceive of new language structures (for example a construct that generates moving averages) and implement it quickly and simply. This might just be syntactic sugar, or might be something rather more ambitious in terms of control flow. Essentially you can extend the syntax and the semantics of Forth so that it "understands", at the same level as the rest of the compiler, the new construct you want to use. Which is interesting enough, but what makes it more interesting for sensors is the structure of these compiler words. Typically they're what is known as IMMEDIATE words, which means they execute when encountered compiling a word. That may sound odd, but what it means is that the compiler word executes and generates code, rather than being compiled itself. And that means that, when used with a cross-compiler, the new language constructs don't add to the target system's footprint, because their action all happens at compile-time to generate code that's expressed in terms of lower-level words. In core Forth, constructs like IF and LOOP (conditional and counted loops respectively) do exactly this: they compile low-level jumps (the word (BRANCH) and (?BRANCH), which do non-conditional and conditional branches respectively) implementing the higher-level structured-programming abstraction. A lot of modern languages use virtual machines as targets for their compilers, and a lot of those VMs are stack machines -- Forth, in other words. If we actually use Forth as the VM for a compiler, we have an extensible VM in which we can define new constructs, so we can evolve the VM better to fit the language that targets it. (There are also some very interesting, close parallels between Forth code and the abstract syntax trees used to represent code within compilers, but that's something I need to think about a bit more before I write about it.) All this is rather speculative, and doesn't really address the core problem of programming a network rather than a device, but it does provide a device-level platform that might be more amenable to language research. I've been experimenting with Forth for this purpose, and have a prototype system -- Attila, an abstract, re-targetable threaded interpreter that's fully cross-compilable -- in the works, but not quite ready to see the light of day. It's taught me a lot about the practicalities of threaded interpreters and cross-compilers. This is a strand of language design that's been almost forgotten, and I think it deserves more of a look.