Five big questions

If you try to do everything, you always end up doing nothing. Which is why Gray’s laws suggest searching for the twenty “big questions” in a field and then focusing in on the first five as the ones that’ll generate the biggest return on the effort invested. So what are the five biggest open issues in programming for sensorised systems? Of course we should start with a big fat disclaimer: these are my five biggest open issues, which probably don’t relate well to anyone else’s — but that’s what blogs are for, right? :-) So here goes: five questions, with an associated suggestion for a research programme.

1. Programming with uncertainty. This is definitely the one I feel is most important. I’ve mentioned before that there’s a mismatch between traditional computer science and what we have to deal with for sensor systems: the input is uncertain and often of very poor quality, but the output behaviour has to be a “best attempt” based on what’s available, and has to be robust against small perturbations due to noise and the like. But uncertainty is something that computers (and computer scientists) are quite bad at handling, so there’s a major change that has to happen. To deal with this we need to re-think the programming models we use, and the ways in which we express behaviour. For example we could look at how programs respond to perturbation, or design languages in which perturbations have a small impact by design. A calculus of stable functions might be a good starting-point, where perturbation tends to die out over time and space but long-term changes are propagated. We might also look at how to program more effectively with Bayesian statistics, or how to program with machine learning: turn things that are currently either libraries or applications into core constructs from which to build programs. (There’s a small sketch of what such a construct might look like at the end of this list.)

2. Modeling adaptive systems as a whole. We’ve had a huge problem getting systems to behave according to specification: now we propose that they adapt in response to changing circumstances. Clearly the space of possible stimuli and responses is too large for exhaustive testing, or for formal model-checking, so correctness becomes a major issue. What we’re really interested in, of course, isn’t so much specifying what happens as how what happens changes over time and with context. Holistic models are common in physics but uncommon in computer science, where more discrete approaches (like model checking) have been more popular. It’s easy to see why this is the case, but a small-scale, pointwise formal method doesn’t feel appropriate to the scale of the problem. Reasoning about a system as a whole means re-thinking how we express both specifications and programs. But the difference in target is important too: we don’t need to capture all the detail of a program’s behaviour, just those aspects that relate to properties like stability, response time, accuracy and the like — a macro method for reasoning about macro properties, not something that gets lost in the details. Dynamical systems might be a good model, at least at a conceptual level, with adaptation being seen as a “trajectory” through the “space” of acceptable parameter values. At the very least this makes adaptation an object of study in its own right, rather than something that happens within another, less well-targeted model.

3. Modeling complex space- and time-dependent behaviours.
Reasoning systems and classifiers generally only deal with instants: things that are decided by the state of the system now, or by what immediately follows from now. In many cases what happens is far richer than this, and one can make predictions (or at least calculate probabilities) about the future based on classifying a person or entity as being engaged in a particular process. In pervasive computing this manifests itself in the ways in which people move around a space, the services they access preferentially in some locations rather than others, and so forth. These behaviours are closely tied up with the way people move and the way their days progress, as it were: complex spatio-temporal processes giving rise to complex behaviours. The complexities come from how we divide up people’s actions, and from how the possibilities branch to give a huge combinatorial range of possibilities — not all of which are equally likely, and that fact can be leveraged. A first step at addressing this would be to look at how we represent real-world spatio-temporal processes with computers. Of course we represent such processes all the time as programs, but (linking back to point 1 above) the uncertainties involved are such that we need to think about these things in new ways. We have a probabilistic definition of the potential future evolutions, against which we need to be able to express behaviours synthesising the “best guesses” we can make, while simultaneously using the data we actually observe to validate or refute our predictions and refine our models. The link between programming and the modeling that underlies it looks surprisingly intimate.

4. Rich representations of linked data. Sensors generate a lot of data. Much of it has long-term value, if only for verification and later re-study. Keeping track of all this data is going to become a major challenge. It’s not something that the scientists for whom it’s collected are generally very good at — and why should they be, given that their interests are in the science and not in data management? But the data has to be kept, has to be retrievable, and has to be associated with enough metadata to make it properly and validly interpretable in the future. Sensor mark-up languages like SensorML are a first step, but only a first step. There’s also the issue of the methodology by which the data was collected, and especially (returning to point 2) whether the behaviours of the sensors were consistent with gaining a valid view of the phenomena of interest. That means linking data to process descriptions, or to code, so that we can track back through the provenance to ensure integrity. Then we can start applying reasoners to classify and derive information automatically from the data, secure in the knowledge that we have an audit trail for the science.

5. Making it easier to build domain-specific languages for real. A lot has been said about DSLs, much of it negative: if someone’s learned C (or Java, or Fortran, or Matlab, or Perl, or…) they won’t want to then learn something else just to work in a particular domain. This argument holds that it’s therefore more appropriate to provide advanced functions as libraries accessed from a common host language (or a range of languages). The counter-argument is that libraries only work around the edges of a language, and can’t provide the strength of optimisation, type-checking and new constructs needed.
I suspect that there’s truth on both sides, and I also suspect that many power users would gladly learn a new language if it really matched their domain and really gave them leverage. Building DSLs is too complicated, though, especially for real-world systems that need to run with reasonable performance on low-grade hardware. A good starting-point might be a system that allows libraries to be wrapped up with language-like features — rather as Tcl was intended to do, but with more generality in terms of language constructs and types. A simpler language-design framework would facilitate work on new languages (as per point 1 above), and would allow us to search for modes of expression closer to the semantic constructs we think are needed (per points 2 and 3): starting from semantics and deriving a language, rather than vice versa.
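To make point 1 slightly more concrete, here is a toy sketch in C of what an “uncertainty-aware” value might look like as a core construct rather than a library: a reading that carries an error bound, with arithmetic that propagates the bound so imprecision stays visible to the program. The type and function names (uncertain, u_add, u_scale) are my own inventions for illustration, not an existing API.

```c
/* A toy "uncertain value": a reading together with a worst-case error
   bound.  Purely illustrative -- a sketch of a first-class construct
   for programming with uncertainty, not an existing library. */
#include <stdio.h>
#include <math.h>

typedef struct {
  double value;   /* best estimate                 */
  double error;   /* half-width of the error bound */
} uncertain;

/* addition: worst-case errors add */
static uncertain u_add(uncertain a, uncertain b) {
  uncertain r = { a.value + b.value, a.error + b.error };
  return r;
}

/* scaling: the error bound scales with the value */
static uncertain u_scale(double k, uncertain a) {
  uncertain r = { k * a.value, fabs(k) * a.error };
  return r;
}

int main(void) {
  uncertain t1 = { 20.0, 0.5 };   /* 20.0 C, plus or minus 0.5 C */
  uncertain t2 = { 21.2, 0.5 };

  /* mean of two readings: the propagated bound is no wider than a
     single reading's, and the program can inspect it directly       */
  uncertain mean = u_scale(0.5, u_add(t1, t2));
  printf("%.2f +/- %.2f\n", mean.value, mean.error);   /* 20.60 +/- 0.50 */
  return 0;
}
```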

Both ends of the data-intensive spectrum

Data-intensive science (or “web science” as it is sometimes known) has received a major boost from the efforts of Google and others, with the availability of enormous data sets against which we can learn. It’s easy to see that such large-scale data affects experimental science, but there are lessons further down the scale too.

I spent part of last week at a workshop on data-intensive computing hosted by the National e-Science Centre in Edinburgh. It was a fascinating meeting, and I’m very grateful for having been invited. Most of the sessions focussed on the challenges of petabyte and exabyte data, but it struck me that many of the points were actually relative rather than absolute: problems are data-intensive because they have large amounts of data relative to the processing power you can deploy against them. This got me thinking about the extent to which the characteristics of data-intensive systems can be applied to sensor systems too.

One of the most interesting points was made early on by Alex Szalay of the Sloan Digital Sky Survey, who set out some desiderata for data-intensive computing first stated by the late Jim Gray — “Gray’s laws”:

  1. Scientific computing revolves around data — not computing
  2. Build solutions that intrinsically can scale out as required
  3. Take the analysis to the data — because the interesting data’s almost certainly too big to move, even with fast backbones
  4. Start by asking for “20 queries” — the most important questions — and recognise that the first 5 will be by far the most important
  5. Go from “working to working” — assume that the infrastructure will change every 2-5 years, and design hardware and software accordingly
This is advice hard-won from practice, and it’s easy to see how it affects the largest-scale systems. I think Gray’s laws also work at the micro-scale of sensor networks, and at points in between. Data-intensive science is perhaps better envisioned as data-driven science, in which the data drives the design and evolution of the computation. This view unifies large and small scales: a sensor network needs to respond to the observations it makes of the phenomena it’s sensing, even though the scale of the data (for an individual node) is so small.

By focusing on networking we can scale out solutions, but we also need to consider that several different networks may be needed to take in the different aspects of the systems being observed. It’s a mistake to think that we can grow the capabilities of individual nodes too much, since that starts to eat into power budgets. At the data-centre level, scaling out tends to mean virtualisation and replication: proven parallel-processing design patterns. At the network level, though, I suspect it means composition and co-design, which we understand substantially less well.

Taking the analysis to the data means pushing processing down into the network and reducing the amount of raw data that needs to be returned to a base station. This is a slightly contentious point: do I want to limit the raw data I collect, or should I grab it all in case something emerges later that I need to check against a raw stream? In other words, can I define the information I want to extract from the data stream sufficiently clearly to avoid storing it all? The same point was made by Alan Heavens at the meeting, who pointed out that one can do radical data discarding if one has a strong enough model of the phenomenon against which to evaluate the raw data. Again, the point may be the scale of the data relative to the platform on which it’s being processed rather than in any absolute sense: astronomical data is too, well, “astronomical” to retain even in a large data centre, while sensor data is large relative to node capabilities. It’s an open question whether many systems have strong enough data models to support aggressive discarding, though.

The “20 queries” goal is, I think, key to many things: identify the large questions and address them first. (Richard Hamming made a similar point with regard to research as a whole.) Coupling sensing research to the science (and public policy formation) that needs it is the only way to do this, and strikes me as at least as important as theoretical advances in network sensing science. The engineering challenges of (for example) putting a sensor network into a field are at least as valuable — and worthy of support — as the basic underpinnings.

The coupling of computer and physical science also speaks to the need for designing systems for upgrade. The techniques for doing this — component design and so forth — are well-explored by computer scientists and under-understood by many practitioners from other disciplines. Designing sensor and data systems for expansion and re-tasking should form a backbone of any research effort.

So I think Jim Gray’s pioneering insights into large-scale data may actually be broader than we think, and might also characterise the individually small-scale — but collectively widespread — instrumentation of the physical world. It also suggests that there are end-to-end issues that can usefully form part of the research agenda.
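As a purely illustrative sketch of what “taking the analysis to the data” might mean on a node, here is one way of doing model-based discarding in C: keep a very simple running model of the phenomenon (an exponentially-weighted moving average) and report a reading only when it deviates from the model’s prediction by more than a threshold. The names, constants and the stubbed-out report() call are mine, chosen just for the example.

```c
/* Sketch of model-based data discarding on a sensor node: maintain an
   exponentially-weighted moving average as a (very crude) model of the
   phenomenon, and only report readings that deviate from the prediction
   by more than a threshold.  Illustrative only; the radio call is a stub. */
#include <stdio.h>
#include <math.h>

#define ALPHA     0.1    /* smoothing factor for the model           */
#define THRESHOLD 1.0    /* report if |reading - prediction| > this  */

static double model = 20.0;   /* current model of the phenomenon */

/* stand-in for sending a value back to the base station */
static void report(double reading) {
  printf("report: %.2f\n", reading);
}

static void on_sample(double reading) {
  if (fabs(reading - model) > THRESHOLD)
    report(reading);                       /* surprising: send it */
  /* either way, fold the reading into the model */
  model = ALPHA * reading + (1.0 - ALPHA) * model;
}

int main(void) {
  double samples[] = { 20.1, 20.2, 19.9, 24.5, 24.7, 24.6 };
  for (int i = 0; i < 6; i++)
    on_sample(samples[i]);   /* steady readings are discarded; the
                                readings after the step change are sent */
  return 0;
}
```

Whether a model this weak is “strong enough” to justify the discarding is exactly the open question above.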

Forth for sensors?

Most sensor systems are programmed using C: compact and well-known, but low-level and tricky to get right when things get complex. There have been several proposals for alternative languages from across the programming language research spectrum. I haven’t heard anyone mention Forth, though, and it’s worth considering — even if only as a target for other languages.

Many people will never have encountered Forth, a language that’s enjoyed up-and-down popularity for over four decades without ever hitting the mainstream. Sometimes touted as an alternative to Basic, it even had an early-1980s home computer that used it as the introductory language. Forth has a number of characteristics that are completely different to the vast majority of modern languages:

  • It uses an explicit data stack and Reverse Polish notation uniformly throughout the language
  • There’s no type system. Everything is represented pretty much using addresses and integers. The programmer is on her own when building complex structures
  • It is a threaded interpreter where every construct in the language is a “word”. Composing words together generates new words, but (unlike in a conventional interpreter) the definitions are compiled efficiently, so there’s an immediacy to things without crippling performance overheads
  • A standard system usually mixes its “shell” and “compiler” together, so one can define new words interactively which get compiled immediately
  • There’s a small kernel of machine-code (or C) routines, but…
  • The compiler itself — and indeed the vast majority of the system — can be written in Forth, so you can extend the compiler (and hence the language) with new constructs that have first-class status alongside the built-in words. There’s typically almost no overhead of programmer- versus system-defined words, since they’re all written in the same (fast) language
  • If you’re careful, you can build a cross-compiler that will generate code for a different target system: just port the kernel and the compiler should be re-usable to generate code that’ll run on it. (It’s not that simple, of course, as I’m finding myself…)
So Forth programs don’t look like other languages. There’s no real phase distinction between compilation and run-time, since everything’s mixed in together, but that has the advantage that you can write new “compiler” words to make it easier to write your “application” words, all within the same framework and set of capabilities. You don’t write applications so much as extend the language itself towards your problem. That in turn means you can view Forth either as very low-level — a glorified assembler — or as very high-level in terms of its ability to define new syntax and semantics. That probably sounds like a nightmare, but suspend judgment for a while…

One of the problems that concerns a lot of sensor networks people is the programming level at which we have to deal with systems. Typically we’re forced to write C on a per-node basis: program the individual nodes, and try to set things up so that the network behaves, as a whole, in an intended way. This is clearly possible in many cases, but it clearly gets far more difficult as things get bigger and more complex, and especially when we want the network to adapt to the phenomena it’s sensing, which often requires decisions to be made on a non-local basis.

Writing a new language is a big undertaking, but it’s a substantially smaller undertaking with Forth. It’s possible to conceive of new language structures (for example a construct that generates moving averages) and implement them quickly and simply. This might just be syntactic sugar, or might be something rather more ambitious in terms of control flow. Essentially you can extend the syntax and the semantics of Forth so that it “understands”, at the same level as the rest of the compiler, the new construct you want to use.

Which is interesting enough, but what makes it more interesting for sensors is the structure of these compiler words. Typically they’re what are known as IMMEDIATE words, which means they execute when they’re encountered while compiling a word. That may sound odd, but what it means is that the compiler word executes and generates code, rather than being compiled itself. And that means that, when used with a cross-compiler, the new language constructs don’t add to the target system’s footprint, because their action all happens at compile-time to generate code that’s expressed in terms of lower-level words. In core Forth, constructs like IF and LOOP (conditionals and counted loops respectively) do exactly this: they compile low-level jumps (the words (BRANCH) and (?BRANCH), which perform unconditional and conditional branches respectively) implementing the higher-level structured-programming abstraction.

A lot of modern languages use virtual machines as targets for their compilers, and a lot of those VMs are stack machines — Forth, in other words. If we actually use Forth as the VM for a compiler, we have an extensible VM in which we can define new constructs, so we can evolve the VM to better fit the language that targets it. (There are also some very interesting, close parallels between Forth code and the abstract syntax trees used to represent code within compilers, but that’s something I need to think about a bit more before I write about it.)

All this is rather speculative, and doesn’t really address the core problem of programming a network rather than a device, but it does provide a device-level platform that might be more amenable to language research.
I’ve been experimenting with Forth for this purpose, and have a prototype system — Attila, an abstract, re-targetable threaded interpreter that’s fully cross-compilable — in the works, but not quite ready to see the light of day. It’s taught me a lot about the practicalities of threaded interpreters and cross-compilers. This is a strand of language design that’s been almost forgotten, and I think it deserves more of a look.
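For readers who haven’t met the threaded-interpretation model before, here is a very small sketch of the idea in C (my own illustration, nothing to do with Attila’s internals): each composite word is just a sequence of pointers to other words executed over an explicit data stack, and it’s code of exactly this shape that the compiler words described above would generate.

```c
/* A minimal sketch of threaded interpretation: primitive words are C
   functions, a composite word is an array of pointers to words, and
   everything works over a shared data stack.  Illustration only. */
#include <stdio.h>

#define STACK_DEPTH 32

typedef void (*prim)(void);       /* a primitive word is a C function */

static int stack[STACK_DEPTH];    /* the data stack    */
static int sp = 0;                /* stack pointer     */

static void push(int x) { stack[sp++] = x; }
static int  pop(void)   { return stack[--sp]; }

/* a handful of primitive words */
static void prim_add(void)   { int b = pop(), a = pop(); push(a + b); }
static void prim_dup(void)   { int a = pop(); push(a); push(a); }
static void prim_print(void) { printf("%d\n", pop()); }

/* Run a composite ("colon") word: step an instruction pointer along the
   list of primitives.  A real threaded interpreter also lets composite
   words call other composite words, and compiler words would emit
   branch primitives here for IF, LOOP and friends. */
static void run(prim *word) {
  for (prim *ip = word; *ip != NULL; ip++)
    (*ip)();
}

int main(void) {
  /* the composite word: DUP + .  -- double the number and print it */
  prim double_and_print[] = { prim_dup, prim_add, prim_print, NULL };

  push(21);
  run(double_and_print);          /* prints 42 */
  return 0;
}
```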

Things that won’t change

Technology always advances, and in most areas the rate of change is also increasing all the time. But there are some areas where technological changes either happen only slowly, or even go into reverse. That’s not something we’re used to in computer science, but it’s a feature of sensor network programming: what are the challenges that technology won’t solve for us?

Moore’s law has been a defining feature of computers for the past decades. By and large computing power has doubled every 18 months at constant price; alternatively, the cost of a unit of computing power has halved in the same period. The effects of this on user experience have been plain to see. Within computer science, Moore’s law has had an effect on research directions too. In starting on a PhD, a student can work on a problem that’s at the edge of the performance envelope of whatever class of machine she is targeting — cellphone, laptop, desktop or server — secure in the knowledge that, by the time she’s coming to an end, the power available to that machine class will have quadrupled. This doesn’t open up every problem, of course — a four-times speed-up on an NP-hard search problem might still leave it largely intractable — but in fields such as middleware, software tools, language design and the like, it’s enough to overcome many issues.

It’s therefore something of a shock to come to sensor networks and similar systems, because I suspect these systems aren’t subject to Moore’s law in the usual way. In some ways, the situation on such small systems is actually better than in desktops and enterprise computing. At the higher end, we’re already hitting at least the suspicion that the performance increases in individual cores will soon start to flatten out. Multicore processors allow us to keep increasing performance, but at the cost of vastly complicating the sorts of programming needed to keep all those cores occupied. Sensor motes are still single-core, and typically aren’t using state-of-the-art processes at that, so there’s still plenty of room for manoeuvre.

But it’s easy to forget that while the cash cost of a unit of processing power has decreased, the power cost of that unit hasn’t decreased by nearly as much (and may actually have increased). Those twice-as-powerful chips eighteen months on typically burn significantly more power than their predecessors. You only have to look at the size of heatsinks on chips to realise that there’s a lot of heat being dissipated. So for a sensor network, which is using a battery or scavenging for power, increasing the processor power will almost certainly decrease lifetime, and that’s not a trade-off designers will accept. Battery, scavenging and renewable power sources like solar cells aren’t subject to Moore’s law: their growth curves are those of physics and traditional engineering, not those of IT systems. Ten years ago my cellphone went for three days without a charge; my new HTC Hero lasts no more than two days, even if I turn off the data services and wifi. The extra compute capability has a severe power cost.

In many sensor applications the trade-off will actually run in reverse. Given the choice, a designer might opt for two older, less capable but less power-hungry processors over one more powerful but hungrier one. Two motes can provide more coverage, or more robustness, or both.
But this exposes a real programming challenge, since it implies that we’re going to have to get used to building modern, open, adaptive software on machines whose capabilities are similar to those of a mid-1980s vintage home computer — and which might in fact even decrease over time, since the driving forces push towards coverage, lifetime and redundant replication. The performance of a network in aggregate might still increase, of course, but that means we have to extract the extra performance from co-ordinating distributed processors rather than from improving individual nodes. The history of distributed parallel processing should warn us not to be sanguine about that prospect.

Actually, though, the challenge will do us good. Modern machines encourage sloppy over-engineering and over-generalisation — building frameworks for situations that we anticipate but which might never occur. Targeting small machines will change this, and instead encourage us to build software that’s fit for immediate purpose, and that’s built to be evolved and extended over time alongside changing requirements and constraints. Building evolution into the core of the system will make for better engineering in the long run.

Programming with limited certainty

Sensor networks are all about uncertainty: if the sensor says it’s 20°C out there, then it might be 20°C plus or minus half a degree or so (limited precision); or it might be some different temperature, and the sensor’s just reported a duff value for some reason (limited accuracy). By contrast, computers most definitely aren’t about uncertainty, a fact enshrined in the famous maxim “garbage in, garbage out”. What does this mean for our ability to build really large, robust and flexible sensor networks?

All the really foundational models of computing — λ calculus, Turing machines and the like — pretty much reflect this notion that input is correct in some sense, and if it’s wrong then that’s an error to be corrected outside the computational system. That seems to mean that the computational system can’t itself either tolerate or express the notions of limited certainty — precision and accuracy — that lie at the core of sensor networks (and a lot of other modern systems, of course). That suggests to me that there might be a problem at the heart of computer science as we currently formulate it: it isn’t a realistic model of the world we’re trying to compute over.

In some ways this is neither surprising nor threatening. Any mathematical or computational model is only a simplified abstraction of the real world, for which we have to make often staggeringly bold simplifications if we’re to get anywhere. We should however always be prepared to challenge the validity and necessity of these simplifications, and that’s what I’d like to do here.

As far as validity is concerned, the simplification is quite patently invalid when it comes to any system that operates with real-world data: some of it is bound to be “wrong” in some sense. This isn’t the same as being tolerant of mistakes, such as when someone presses the delete key by mistake: that’s an action that certainly happened and to which the system responded correctly, albeit “wrongly” from the user’s perspective. Interesting problem, but a different one: we’re talking here about responding to inherently erroneous input — the delete key seeming to press itself, if you like.

Necessity, then: is it necessary to handle computation in this way? Clearly not: we can easily conjecture a computational model that’s more tolerant of input with limited certainty. Consider precision first. If the input is only known to a limited precision, then we don’t want that error margin to cause enormous errors in the output. If we have a function $latex f$, then we want $latex f$ to exhibit a tolerance of imprecision such that $latex \delta x < tol_x \Rightarrow \left | f(x + \delta x) - f(x) \right | < s \left | \delta x \right |$ for some scaling factor $latex s < 1$: $latex f$ doesn’t cause errors to blow up in unpredictable ways. A lot of functions behave in exactly this way. For example, in a sliding-window average function $latex f_w(\overline{x}, x) = \frac{x + \overline{x}(w - 1)}{w}$, for an average $latex \overline{x}$ computed from the $latex w$ most recent observations, we have $latex s = \frac{1}{w}$. Small errors therefore perturb the result significantly less than they perturb the input. If the errors are uniformly distributed, the function should converge on the “true” value. Conversely, a large, accurate new observation will perturb the average only slowly, so large step-changes will be detected only slowly, and it’s hard to distinguish such a change when it first happens from an inaccurate reading.
There are various ways of dealing with this, such as using a weighted sliding window with non-linear weighting.

This is a rather topological idea. Rather than simply mapping points in an input space (such as temperature) to an output space (average temperature over the past few minutes), we’re also requiring that the mapping take elements close in the input space to elements close in the result space: we require that it be a contraction mapping. Building systems from contraction mappings, perhaps combined with contraction-preserving operators, yields systems that are robust to small errors in precision from their sensors.

Of course not all systems are actually like this, and in many cases we want rapid changes to be detected quickly and/or propagated. The point, perhaps, is that this is a choice we should make rather than a consequence of choosing a particular model of computation. There might actually be a model of computation lurking about here, in which we define functions coupled with a model of how their input and output errors should behave. At the very least, this yields systems in which we can predict the consequences of errors and imprecisions, which is a major challenge to deal with at present.
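As a small numerical check of the claim above, here is the sliding-window average from the text written as an update function in C, together with a demonstration that perturbing one input by δ moves the output by only δ/w. The window length W = 10 is just a value picked for the example.

```c
/* The sliding-window average f_w(avg, x) = (x + avg*(w-1)) / w from the
   text, with a quick check that a perturbation delta in one input moves
   the output by only delta/w -- i.e. the scaling factor s = 1/w. */
#include <stdio.h>

#define W 10.0   /* window length (illustrative choice) */

/* one update step of the running average */
static double f_w(double avg, double x) {
  return (x + avg * (W - 1.0)) / W;
}

int main(void) {
  double avg   = 20.0;   /* current average                 */
  double x     = 20.0;   /* a new, clean observation        */
  double delta = 0.5;    /* imprecision on that observation */

  double clean     = f_w(avg, x);
  double perturbed = f_w(avg, x + delta);

  /* the output error is delta / W = 0.05, much smaller than delta */
  printf("clean = %.3f, perturbed = %.3f, output error = %.3f\n",
         clean, perturbed, perturbed - clean);
  return 0;
}
```

The same run also shows the flip side discussed above: a genuine step change of several degrees would move the average by only a tenth of its size per update, so it takes several samples before the change is distinguishable from noise.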