# Languages for extensible virtual machines

Many languages have an underlying virtual machine (VM) to provide a more portable and convenient substrate for compilation or interpretation. For language research it's useful to be able to generate custom VMs and other language tools for different languages. Which raises the question: what's the appropriate language for writing experimental languages?

What I have in mind is slightly more than just VMs, and more a platform for experimenting with language design for novel environments such as sensor-driven systems. As well as a runtime, this requires the ability to parse, to represent and evaluate type and semantic rules, and to provide a general framework for computation that can then be exposed into a target language as constructs, types and so forth. What's the right language in which to do all this?

This isn't a simple question. It's well-accepted that the correct choice of language is vital to the success of a coding project.  One could work purely at the language level, exploring constructs and type systems without any real constraint of the real world (such as being runnable on a sensor mote). This has to some extent been traditional in programming language research, justified by the Moore's law increases in performance of the target machines. It isn't justifiable for sensor networks, though, where we won't see the same phenomenon. If we want to prototype realistic language tools in the same framework, we need at least a run-time VM that was appropriate for these target devices; alternatively we could ignore this, focus on the language, and prototype only when we're happy with the structures, using a different framework. My gut ffeeling is that the former is preferable, if it's possible, for reasons of conceptual clarity, impact and simplicity. But even without making this decision we can consider the features of different candidate language-writing languages:

### C

The most obvious approach is to use C, which is run-time-efficient and runs on any potential platform. For advanced language research, though, it's less attractive because of its poor symbolic data handling. That makes it harder to write type-checking sub-systems and the like, which are essentially symbolic mathematics.

### Forth

I've wondered about Forth before. At one level it combines the same drawbacks as C -- poor symbolic and dynamic data handling -- with the additional drawback of being unfamiliar to almost everyone.

Forth does have some redeeming features, though. Firstly, threaded interpretation means that additional layers of abstraction are largely cost-free: they run at the same speed as the language itself. Moreover there's a sense in which threaded interpretation blurs the distinction between host language and meta-language: you don't write Forth applications, you extend it towards the problem, so the meta-language becomes the VM and language tool. This is something that needs some further exploration.

### Scheme

Scheme's advantages are its simplicity, regularity, and pretty much unrivalled flexibility in handling symbolic data. There's a long  tradition of Scheme-based language tooling, and so a lot of experience and libraries to make use of. It's also easy to write purely functional code, which can aid re-use.

Scheme is dynamically typed, which can be great when exploring approaches like partial evaluation (specialising an interpreter against a particular piece of code to get a compiled program, for example).

In some ways, Haskell is the obvious language for a new language project. The strong typing, type classing and modules mean one can generate a typed meta-language. There are lots of libraries and plenty of activity in the research community. Moreover Haskell is in many ways the "mathematician's choice" of language, since one can often map mathematical concepts almost directly into code. Given thaat typing and semantics are just mathematical operations over symbols, this is a significant attraction.

Where Haskell falls over, of course, is its runtime overheads -- mostly these days in terms of memory rather than performance. It essentially mandates a choice of target platform to be fairly meaty, which closes-off some opportunities. There are some "staged" Haskell strategies that might work around this, and one could potentially stage the code to another runtime virtual machine. Or play games like implement a Forth VM inside Haskell for experimentation, and then emit code for a different Forth implementation for runtime.

### Java

Java remains the language du jour for most new projects. It has decent dynamic data handling, poor symbolic data handling, fairly large run-time overheads and a well-stocked library for re-use. (Actually I used Java for Vanilla, an earlier project in a similar area.) Despite the attractions, Java feels wrong. It doesn't provide a good solution to any of the constraints, and would be awkward as a platform for manipulating rules-based descriptions.

### Smalltalk

Smalltalk -- and especially Squeak -- isn't a popular choice within language research, but does have a portable virtual machine, VM generation, and other nice features and libraries. The structure is also attractive, being modern and object-oriented. It's also a good platform for building interactive systems, so one could do simulation, visual programming and the like within the same framework -- something that'd be much harder with other choices. There are also some obvious connectionns between Smalltalk and pervasive systems, where one is talking about the interactions of objects in the real world.

Where does that leave us? Nowhere, in a sense, other than with a list of characteristics of different candidate languages for language research. It's unfortunate there isn't a clear winner; alternatively, it's positive that there's a choice depending on the final direction. The worry has to be that a project like this is a moving target that moves away from the areas of strength for any choice made.

# Impressions of Pervasive 2010

I've spent this week at the Pervasive 2010 conference on pervasive computing, along with the Programming Methods for Mobile and Pervasive Systems workshop I co-arranged with Dominic Duggan. Both events have been fascinating.

The PMMPS workshop is something we've wanted to run for a while, bringing together the programming language and pervasive/mobile communities to see where languages  ought to go. We received a diverse set of submissions: keynotes from Roy Campbell and Aaron Quigley, talks covering topics including debugging, software processes, temporal aspects (me), context collectionvisual programming ang a lot more. Some threads emerge quite strongly, but I think they'll have to wait for a later post after I've collected my thoughts a bit more.

The main conference included many  papers so good that it seems a shame to single any out. The following are simply those that spoke most strongly to me:

Panorama and Cascadia. The University of Washington presented work on a "complex" events system, combining lower-level raw events. Simple sensor events are noisy and often limited in their coverage. Cascadiais an event service that allows complex events to be defined over the raw event stream, using Bayesian particle filters to interpolate missing events or those from uncovered areas: so it's possible in principle to inferentially  "sense" someone's location even in places without explicit sensor coverage, using a model of the space being observed. This is something that could be generalised to other model-based sensor streams. The Panorama tool allows end-users to specify complex events by manipulating thresholds, which seems little unsatisfactory: there's no principled way to determine the thresholds, and it still begs the question of how to program with the uncertain event stream. Still, I have to say this is the first complex event system I've seen that I actually believe could work.

Eyecatcher. How do you stop people hiding from a camera, or playing-up to it? Work from Ochanomizu University in Japan places a small display on top of the camera, which can be used to present images to catch the subject's attention and to suggest poses or actions. (Another version barks like a dog, to attract your pet's attention.)I have to say this research is very Japanese, a very unusual but perceptive view of the world and the problems appropriate for research.

Emotion modeling. Jennifer Healey from Intel described how to monitor and infer emotions from physiological data. The main problem is that there is no common language for describing emotions -- "anxious" is good for some and bad for others -- so getting ground truth is hard even given extensive logging.

Indoor location tracking for non-experts. More University of Washington work, this time looking at an indoor location system simple enough to be used by non-experts such as rehabilitation therapists. They used powerline positioning, injecting different frequencies into a home's power network and detecting the radiated signal using what are essentially AM radios. Interestingly one of the most important factors was the aesthetics of the sensors: people don't want ugly boxes in their home.

Transfer learning. Tim van Kasteren of the University of Amsterdam has generated one of the most useful smart-home data sets, used across the community (including by several of my students). He reported experiences with transfering machine-learned classifiers from one sensor network to another, by mapping the data into a new, synthetic feature space. He also used the known distribution of features from the first network to condition the learning algorithhm in the second, to improve convergence.

Common Sense. Work from UC Berkeley on a platform for participative sensing: CommonSense. The idea is to place environmental sensors onto commodity mobile devices, and give them to street cleaners and others "out and about" in a community. The great thing about this is that is gives information on pollution and the like to the communities themselves, directly, rather than mediated through a (possibly indifferent or otherwise) State agency.

Energy-aware data traffic management. I should add the disclaimer that is work by my colleague, Mirco Musolesi of the University of  St Andrews. Sensor nodes need to be careful about the energy they use to transmit data back to their base station. This work compares a range of strategies that trade-off the accuracy of returned data with the amount of traffic exchanged and so the impact on the nodoe's battery. This is //really// important for environmental sensing, and makes me think about further modifying the models to account for what's being sensed to trade-off information content as well.

Tutorials AJ Brush did a wonderful tutorial on how to do user surveys. This is something we've done ourselves, and it was great to see the issues nailed-down -- along with war stories of how to plan and conduct a survey for greatest validity and impact. Equally, John Krumm did a fantastic overview of signal processing, particle filters, hidden Markov models and the like that make the maths far more accessible than it normally is. Adrian Friday heroically took the graveyard slot with experiences and ideas about system support for pervasive systems.

This is the first large conference I've attended for a while, for various reasons, and it's been a great week both scientifically and socially. The organisers at the University of Helsinki deserve an enormous vote of thanks for their efforts. Pervasive next year will be in San Francisco, an I'll definitely be there -- hopefully with a paper to present :-)

# The only computer science book worth reading twice?

I was talking to one of my students earlier, and lent him a book to read over summer. It was only after he'd left that I realised  that -- for me at any rate -- the book I'd given him is probably the most seminal work in the whole of computer science, and certainly the book that's most influenced my career and research interests.

So what's the book? Structure and interpretation of computer programs by Hal Abelson and Jerry Sussman (MIT Press. 1984. ISBN 0-262-01077-1), also known as SICP. The book's still in print, but -- even better -- is available online in its entirety.

OK, everyone has their favourite book: why's this one so special to me? The first reason is the time I first encountered it: in Newcastle upon Tyne in the second year of my first degree. I was still finding my way in computer science, and this book was a recommended text after you'd finished the first programming courses. It's the book that introduced me to programming as it could be (rather than programming as it was, in Pascal at the time). What I mean by that is that SICP starts out by introducing the elements of programming -- values, names, binding, control and so on -- and then runs with them to explore a quite dazzling breadth of issues including:

• lambda-abstraction and higher-order computation
• complex data structures, including structures with embedded computational content
• modularity and mutability
• streams
• lazy evaluation
• interpreter and compiler construction
• storage management, garbage collection and virtual memory
• machine code
• domain-specific languages
...and so forth. The list of concepts is bewildering, and only stays coherent because the authors are skilled writers devoted to their craft. But it's also a remarkable achievement to handle all these concepts within a single language framework -- Scheme -- in such a way that each builds on what's gone before.

The second reason is the way in which Hal and Jerry view everything as an exercise in language design:

We have also obtained a glimpse of another crucial idea about languages and program design. This is the approach of stratified design, the notion that a complex system should be structured as a sequence of levels that are described using a sequence of languages. Each level is constructed by combining parts that are regarded as primitive at that level, and the parts constructed at each level are used as primitives at the next level. The language used at each level of a stratified design has primitives, means of combination, and means of abstraction appropriate to that level of detail.

Layered abstraction of course is second nature to all computer scientists. What's novel in this view is that each level should be programmable: that the layers are all about computation and transformation, and not simply about hiding information. We don't see that in the mainstream of programming languages, because layering doesn't extend the language at all: Java is Java from top to bottom, with class and libraries but no new control structures. If a particular domain has concepts that would benefit from dedicated language constructs, that's just tough. Conversely (and this is something that very much interests me) if there are constructs it'd be desirable not to have in some domain, they can't be removed. (Within the language, anyway: Java-ME dumps some capabilities in the interests of running on small devices, but that's not something you can do without re-writing the compiler.)

The third influential feature is the clear-sighted view of what computer science is actually about:

The computer revolution is a revolution in the way we think and in the way we express what we think. The essence of this change is the emergence of what might best be called procedural epistemology -- the study of the structure of knowledge from an imperative point of view, as opposed to the more declarative point of view taken by classical mathematical subjects. Mathematics provides a framework for dealing precisely with notions of "what is." Computation provides a framework for dealing precisely with notions of "how to."

I've taken a view before about computers being the new microscopes, opening-up new science on their own as well as facilitating existing approaches. The "how to" aspect of computer science re-appears everywhere in this: in describing the behaviours of sensor networks that can adapt while continuing the reflect the phenomena they've been deployed to sense; in the interpretation of large-scale data mined and mashed-up across the web; in capturing scientific methods and processes for automation; and so forth. The richness of these domains mitigates against packaged software and encourages integration through programming languages like R, so that the interfaces and structures remain "soft" and open to experimentation.

When I looked at my copy, the date I'd written on the inside was September 1988. So a book I bought nearly 22 years ago is still relevant. In fact, I'd go further and say that it's the only computer science book of that age that I'd happily and usefully read again without it being just for historical interest: the content has barely aged at all. That's not all that unusual for mathematics books, but it's almost unheard of in computer science, where the ideas move so quickly and where much of what's written about is ephemeral rather than foundational. It goes to show how well SICP nailed the core concepts. In this sense, it's certainly one of the very few  books on computer science that it's worth reading twice (or more). SICP is to computer science what Feynman's Lectures on Physics are to physics: an accessible distillation of the essence of the subject that's stood the test of time.

# Forth for sensors?

Most sensor systems are programmed using C: compact and well-known, but low-level and tricky to get right when things get compact and complex. There have been several proposals for alternative languages from across the programming language research spectrum. I haven't heard anyone mention Forth, though, and it's worth considering -- even if only as a target for other languages.

Many people will never have encountered Forth, a language that's enjoyed up-and-down popularity for over four decades without ever hitting the mainstream. Sometimes touted as an alternative to Basic, it even had an early-1980's home computer that used it as the introductory language.

Forth has a number of characteristics that are completely different to the vast majority of modern languages:

• It uses and explicit data stack and Reverse-Polish notation uniformly throughout the language
• There's no type system. Everything is represented pretty much using addresses and integers. The programmer is on her own when building complex structures
• It is a threaded interpreter where every construct in the language is a "word". Composing words together generates new words, but (unlike in an interpreter) the definitions are compiled efficiently, so there's an immediacy to things without crippling performance overheads
• A standard system usually mixes its "shell" and "compiler" together, so one can define new words interactively which get compiled immediately
• There's a small kernel of machine-code (or C) routines, but...
• The compiler itself -- and indeed the vast majority of the system -- can be written in Forth, so you can extend the compiler (and hence the language) with new constructs that have first-class status alongside the built-in words. There's typically almost no overhead of programmer- versus system-defined words, since they're all written in the same (fast) language
• If you're careful, you can build a cross-compiler that will generate code for a different target system: just port the kernel and the compiler should be re-usable to generate code that'll run on it. (It's not that simple, of course, as I'm finding myself...)
So Forth programs don't look like other languages. There's no real phase distinction between compilation and run-time, since everything's mixed-in together, but that has the advantage that you can write new "compiler" words to make it easier to write your "application" words, all within the same framework and set of capabilities. You don't write applications so much as extend the language itself towards your problem. That in turn means you can view Forth either as low-level -- a glorified assembler -- or very high-level in terms of its ability to define new syntax and semantics.

That probably sounds like a nightmare, but suspend judgment for a while.....

One of the problems that concerns a lot of sensor networks people is the programming level at which we have to deal with systems. Typically we're forced to write C on a per-node basis: program the individual nodes, and try to set it up so that the network behaves, as a whole, in an intended way. This is clearly possible in many cases, and clearly gets way more difficult as things get bigger and more complex, and especially when we want the network to adapt to the phenomena it's sensing, which often requires decisions to be made on a non-local basis.

Writing a new language is a big undertaking, but a substantially smaller undertaking with Forth. It's possible to conceive of new language structures (for example a construct that generates moving averages) and implement it quickly and simply. This might just be syntactic sugar, or might be something rather more ambitious in terms of control flow. Essentially you can extend the syntax and the semantics of Forth so that it "understands", at the same level as the rest of the compiler, the new construct you want to use.

Which is interesting enough, but what makes it more interesting for sensors is the structure of these compiler words. Typically they're what is known as IMMEDIATE words, which means they execute when encountered compiling a word. That may sound odd, but what it means is that the compiler word executes and generates code, rather than being compiled itself. And that means that, when used with a cross-compiler, the new language constructs don't add to the target system's footprint, because their action all happens at compile-time to generate code that's expressed in terms of lower-level words. In core Forth, constructs like IF and LOOP (conditional and counted loops respectively) do exactly this: they compile low-level jumps (the word (BRANCH) and (?BRANCH), which do non-conditional and conditional branches respectively) implementing the higher-level structured-programming abstraction.

A lot of modern languages use virtual machines as targets for their compilers, and a lot of those VMs are stack machines -- Forth, in other words. If we actually use Forth as the VM for a compiler, we have an extensible VM in which we can define new constructs, so we can evolve the VM better to fit the language that targets it. (There are also some very interesting, close parallels between Forth code and the abstract syntax trees used to represent code within compilers, but that's something I need to think about a bit more before I write about it.)

All this is rather speculative, and doesn't really address the core problem of programming a network rather than a device, but it does provide a device-level platform that might be more amenable to language research. I've been experimenting with Forth for this purpose, and have a prototype system -- Attila, an abstract, re-targetable threaded interpreter that's fully cross-compilable -- in the works, but not quite ready to see the light of day. It's taught me a lot about the practicalities of threaded interpreters and cross-compilers. This is a strand of language design that's been almost forgotten, and I think it deserves more of a look.

# Programming with limited certainty

Sensor networks are all about uncertainty: if the sensor says it's 20°C  out there, then it might be 20°C plus-or-minus half a degree or so (limited precision); or it might be some different temperature, and the sensor's just reported a duff value for some reason (limited accuracy). By contrast, computers most definitely aren't about uncertainty, a fact enshrined in the famous maxim "garbage in, garbage out". What does this mean for our ability to build really large, robust and flexible sensor networks?

All the really foundational models of computing -- λ calculus, Turing machines and the like -- pretty much reflect this notion that input is correct in some sense, and if it's wrong then that's an error to be corrected outside the computational system. That seems to mean that the computational system can't itself either tolerate or express the notions of limited certainty -- precision and accuracy -- that lie at the core of sensor networks (and a lot of other modern systems, or course). That suggests to me that there might be a problem at the heart of computer science as we currently formulate it: it isn't a realistic model of the world we're trying to compute over.

In some ways this is nether surprising nor threatening. Any mathematical or computational model is only a simplified abstraction of the real world, for which we have to make often staggeringly bold simplifications if we're to get anywhere. We should however always be prepared to challenge the validity and necessity of these simplifications, and that's what I'd like to do here.

As far as validity is concerned, the simplification is quite patently invalid when it comes to any system that operates with real-world data: some of it is bound to be "wrong" in some sense. This isn't the same as being tolerant of mistakes, such as when someone presses the delete key by mistake: that's a action that certainly happened and to which the system responded correctly, albeit "wrongly" from the user's perspective. Interesting problem, but different: we're talking here about responding to inherently erroneous input -- the delete key seeming to press itself, if you like.

Necessity, then: is it necessary to handle computation in this way? Clearly not: we can easily conjecture a computational model that's more tolerant of input with limited certainty.

Consider precision first. If the input is only known to a limited precision, then we don't want that error margin to cause enormous errors. If we have a function $latex f$, then we want $latex f$ to exhibit a tolerance of imprecision such that $latex \delta x < tol_x \Rightarrow \left | f(x + \delta x) - f(x) \right | < s \left | \delta x \right|$ for some scaling factor $latex s < 1$. $latex f$ doesn't cause errors to blow-up in unpredictable ways. A lot of functions behave in exactly this way: for example, in a sliding-window average function $latex f_w(\overline{x}, x) = \frac{x + \overline{x}(w - 1)}{w}$ for an average $latex \overline{x}$ computed from $latex w$ recent observations, we have that $latex s = \frac{1}{w}$. Small errors therefore perturb the result significantly less than the input is perturbed. If the errors are uniformly distributed, the function should converge on the "true" value.

Conversely, a large, accurate new observation will perturb the average only slowly, so large step-changes will be detected only slowly. It's hard to distinguish such a change when it first happens from an inaccurate reading. There are various ways of dealing with this, such as using a weighted sliding window with non-linear weighting.

This is a rather topological idea. Rather than simply mapping points in an input space (such as temperature) to an output space (average temperature over the past few minutes), we're also requiring that the mapping take elements close in the input space to elements close in the result space: we require that it be a contraction mapping. Building systems from contraction mappings, perhaps combined with contraction-preserving operators, yields systems that are robust to small errors in precision from their sensors.

Of course not all systems are actually like this, and in many cases we want rapid changes to be detected quickly and/or propagated. The point, perhaps, is that this is a choice we should make rather than a consequence of choosing a particular model of computation. There might actually be a model of computation lurking about here, in which we define functions coupled with a model of how their input and output errors should behave. At the very least, this yields systems in which we can predict the consequences of errors and imprecisions, which is a major challenge to deal with at present.