Both ends of the data-intensive spectrum

Data-intensive science (or “web science” as it is sometimes known) has received a major boost from the efforts of Google and others, with the availability of enormous data sets against which we can learn. It’s easy to see that such large-scale data affects experimental science, but there are lessons further down the scale too.

I spent part of last week at a workshop on data-intensive computing hosted by the National e-Science Centre in Edinburgh. It was a fascinating meeting, and I’m very grateful to have been invited. Most of the sessions focussed on the challenges of petabyte and exabyte data, but it struck me that many of the points were actually relative rather than absolute: problems are data-intensive because they have large amounts of data relative to the processing power you can deploy against them. This got me thinking about the extent to which the characteristics of data-intensive systems can be applied to sensor systems too.

One of the most interesting points was made early on by Alex Szalay of the Sloan Digital Sky Survey, who set out some desiderata for data-intensive computing first made by the late Jim Gray, known as “Gray’s laws”:

Scientific computing revolves around data — not computing
Build solutions that can intrinsically scale out as required
Take the analysis to the data — because the interesting data’s almost certainly too big to move, even with fast backbones
Start by asking for “20 queries”, the most important questions, and recognise that the first 5 will be by far the most important
Go from “working to working”: assume that the infrastructure will change every 2 to 5 years, and design hardware and software accordingly

This is advice hard-won from practice, and it’s easy to see how it affects the largest-scale systems. I think Gray’s laws also work at the micro-scale of sensor networks, and at points in between.

Data-intensive science is perhaps better envisioned as data-driven science, in which the data drives the design and evolution of the computation. This view unifies the large and small scales: a sensor network needs to respond to the observations it makes of the phenomena it’s sensing, even though the scale of the data (for an individual node) is so small.

By focusing on networking we can scale out solutions, but we also need to consider that several different networks may be needed to take in the different aspects of the systems being observed. It’s a mistake to think that we can grow the capabilities of individual nodes too much, since that starts to eat into power budgets. At a data centre level, scale-out tends to mean virtualisation and replication: proven parallel-processing design patterns. At a network level, though, I suspect it means composition and co-design, which we understand substantially less well.

Taking the analysis to the data means pushing processing down into the network and reducing the amount of raw data that needs to be returned to a base station. This is a slightly contentious point: do I want to limit the raw data I collect, or should I grab it all in case something emerges later that I need to check against a raw stream? In other words, can I define the information I want to extract from the data stream sufficiently clearly to avoid storing it all? Alan Heavens made the same point at the meeting, observing that one can do radical data discarding if one has a strong enough model of the phenomenon against which to evaluate the raw data.
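To make the idea of model-based discarding concrete, here is a minimal Python sketch. It is an illustrative assumption rather than anything presented at the workshop: the “model” is simply that the phenomenon stays at the last transmitted value, and the names `filter_readings`, `reconstruct` and `tolerance` are hypothetical. A node transmits a reading only when it departs from the model's prediction, and the base station fills the gaps using the same model.

```python
from typing import Iterable, List, Optional, Tuple


def filter_readings(readings: Iterable[float],
                    tolerance: float) -> List[Tuple[int, float]]:
    """Return the (index, value) pairs a node would actually transmit.

    Assumed model: the phenomenon stays at the last transmitted value.
    A reading is sent only if it differs from that prediction by more
    than `tolerance`; everything else is discarded at the node.
    """
    sent: List[Tuple[int, float]] = []
    predicted: Optional[float] = None  # nothing transmitted yet
    for i, value in enumerate(readings):
        if predicted is None or abs(value - predicted) > tolerance:
            sent.append((i, value))
            predicted = value  # the base station updates its copy likewise
    return sent


def reconstruct(sent: List[Tuple[int, float]],
                length: int) -> List[Optional[float]]:
    """Rebuild an approximate stream at the base station.

    Each missing sample is filled with the last transmitted value,
    which is exactly what the shared model predicts.
    """
    rebuilt: List[Optional[float]] = []
    pending = iter(sent)
    upcoming = next(pending, None)
    current: Optional[float] = None
    for i in range(length):
        if upcoming is not None and upcoming[0] == i:
            current = upcoming[1]
            upcoming = next(pending, None)
        rebuilt.append(current)
    return rebuilt
```

With a tolerance of 0.5, the readings 20.0, 20.1, 20.2, 21.0, 21.1, 25.0, 25.2 produce only three transmissions (at indices 0, 3 and 5), and the rebuilt stream stays within the tolerance of every original reading. The catch is the one raised above: anything the model does not anticipate is lost for good, so the approach is only as safe as the model is strong.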
Again, the point may be the scale of the data relative to the platform on which it’s being processed rather than in any absolute sense: astronomical data is too, well, “astronomical” to retain even in a large data centre, while sensor data is large relative to node capabilities. It’s an open question whether many systems have strong enough data models to support aggressive discarding, though.

The “20 queries” goal is, I think, key to many things: identify the large questions and address them first. (Richard Hamming made a similar point with regard to research as a whole.) Coupling sensing research to the science (and public policy formation) that needs it is the only way to do this, and strikes me as at least as important as theoretical advances in network sensing science. The engineering challenges of (for example) putting a sensor network into a field are at least as valuable, and as worthy of support, as the basic underpinnings.

The coupling of computer and physical science also speaks to the need for designing systems for upgrade. The techniques for doing this (component design and so forth) are well-explored by computer scientists and under-understood by many practitioners from other disciplines. Designing sensor and data systems for expansion and re-tasking should form a backbone of any research effort.

So I think Jim Gray’s pioneering insights into large-scale data may actually be broader than we think, and might also characterise the individually small-scale, but collectively widespread, instrumentation of the physical world. This also suggests that there are end-to-end issues that can usefully form part of the research agenda.