Big, or just rich?

The current focus on "big data" may be obscuring something more interesting: it's often not the sheer size of a dataset that's important.

The idea of extracting insight from large bodies of data promises significant advances in science and commerce. Given a large dataset, "big data" techniques cover a number of possible approaches:

  • Look through the data for recurring patterns (data mining)
  • Present a summary of the data to highlight features (analytics)
  • (Less commonly) Automatically identify from the dataset what's happening in the real world (situation recognition)

There's a wealth of UK government data available, for example. Making it machine-readable means it can be presented in different ways, such as geographically. The real opportunities seem to come from cross-overs between datasets, though, where they can be mined and manipulated to find relationships that might otherwise remain hidden, such as the effects of crime on house prices.

Although the size and availability of datasets clearly make a difference here -- big open data -- we might be confusing two issues. In some circumstances we might be better off looking for smaller but richer datasets, and for richer connections between them.

Big data is a strange name to start with: when is data "big"? The only meaningful definition I can think of is "a dataset that's large relative to the current computing and storage capacity being deployed against it" -- which of course means that big data has always been with us, and indeed always will be. It also suggests that data might become less "big" if we become sufficiently interested in it to deploy more computing power to process it. The alternative term popular in some places, data science, is equally tautologous, as I can't readily name a science that isn't based on data. (This isn't just academic pedantry, by the way: terms matter, if only to distinguish what topics are, and aren't, covered by big data/data science research.)

It's worth reviewing what big data lets us do. Having more data is useful when looking for patterns, since it makes the pattern stand out from the background noise. Those patterns in turn can reveal important processes at work in the world underlying the data, processes whose reach, significance, or even existence may be unsuspected. There may be patterns in the patterns, suggesting correlation or causality in the underlying processes, and these can then be used for prediction: if pattern A almost always precedes pattern B in the dataset, then when I see a pattern A in the future I may infer that there's an instance of B coming. The statistical machine learning techniques that let one do this kind of analysis are powerful, but dumb: it still requires human identification and interpretation of the underlying processes to conclude that A causes B, as opposed to A and B simply occurring together through some acausal correlation, or being related by some third, undetected process. A data-driven analysis won't reliably help you to distinguish between these options without further, non-data-driven insight.
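
To make the "A precedes B" idea concrete, here is a minimal sketch in Python. The event sequence, the function name precedence_rate and the window parameter are all invented for illustration; real pattern mining would be far more involved. It simply counts how often an occurrence of A is followed by B within a few later events.

```python
def precedence_rate(events, a="A", b="B", window=1):
    """Fraction of occurrences of `a` that are followed by `b`
    within the next `window` events (0.0 if `a` never occurs)."""
    hits = 0
    total = 0
    for i, event in enumerate(events):
        if event == a:
            total += 1
            # Look only at the `window` events immediately after this one.
            if b in events[i + 1 : i + 1 + window]:
                hits += 1
    return hits / total if total else 0.0


# Invented toy sequence of observed patterns, in time order.
events = ["A", "B", "C", "A", "B", "A", "C", "B", "A", "B"]

print(precedence_rate(events))            # B directly after A
print(precedence_rate(events, window=2))  # B within two events of A
```

A high rate supports using A to predict B, but says nothing about why: the same number would appear whether A causes B, or both are driven by some third process that isn't in the data.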

Are there cases in which less data is better? Our experience with situation recognition certainly suggests that this is the case. When you're trying to relate data to the real world, it's essential to have ground truth, a record of what actually happened. You can then make a prediction about what the data indicates about the real world, and check whether this prediction is true against known circumstances. Doing this well over a dataset provides some confidence that the technique will work well against other data, where your prediction is all you have. In this case, what matters is not simply the size of the dataset, but its relationship to another dataset recording the actual state of the world: it's the richness that matters, not strictly the size (although having more data to train against is always welcome).
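
As a rough illustration of checking predictions against ground truth, the following Python sketch compares two small, invented datasets keyed by time slot: what a recogniser predicted, and what was recorded as actually happening. The names, labels and data are assumptions made up for the example, not taken from any real system.

```python
def score_against_ground_truth(predicted, actual):
    """Compare predictions with ground truth on the time slots both cover.

    Slots with no ground-truth record cannot be checked and are ignored.
    """
    shared = predicted.keys() & actual.keys()
    correct = sum(1 for slot in shared if predicted[slot] == actual[slot])
    accuracy = correct / len(shared) if shared else None
    return {"checked": len(shared), "correct": correct, "accuracy": accuracy}


# Invented predictions from sensor data.
predicted = {
    "09:00": "meeting",
    "10:00": "working",
    "11:00": "meeting",
    "12:00": "lunch",
    "13:00": "working",  # no ground truth recorded for this slot
}

# Invented ground truth: what was actually observed to happen.
actual = {
    "09:00": "meeting",
    "10:00": "working",
    "11:00": "working",
    "12:00": "lunch",
}

result = score_against_ground_truth(predicted, actual)
print(result)
```

Only the slots that appear in both datasets can be scored, which is the point made above: the value comes from the link between the two datasets rather than from the size of either one.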

Moreover, rich connections may help with the more problematic part of data science, the identification of the processes underlying the dataset. While there may be no way to distinguish causality from correlation within a single dataset -- because they look indistinguishably alike -- the patterns of data points in one dataset may often be related to patterns of data points in another dataset in which they don't look alike. So the richness provides a translation from one system to another, where the second provides discrimination not available in the first.

I've been struggling to think of an example of this idea, and this is the best I've come up with (and it's not all that good). Suppose we have tracking data for people around an area, and we see that person A repeatedly seems to follow person B around. Is A following B? Stalking them? Or do they live together, or work together (or even just close together)? We can distinguish between these alternatives by having a link from people to their jobs, homes, relationships and the like.
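
The tracking example can be sketched in Python as follows. Everything here is hypothetical: the sightings, the people records and the function names are invented purely to show the shape of the idea. The first dataset only tells us that two people are often in the same place at the same time; the second, linked dataset suggests a reason.

```python
# Invented tracking data: (person, place, hour of day).
sightings = [
    ("A", "library", 9), ("B", "library", 9),
    ("A", "cafe", 12), ("B", "cafe", 12), ("C", "cafe", 12),
    ("A", "park", 17), ("B", "park", 17),
]

# Invented linked dataset: where each person lives and works.
people = {
    "A": {"home": "12 Elm St", "workplace": "library"},
    "B": {"home": "3 Oak Ave", "workplace": "library"},
    "C": {"home": "7 Pine Rd", "workplace": "cafe"},
}


def co_locations(sightings, p, q):
    """Count the (place, hour) pairs at which both p and q were seen."""
    where_p = {(place, hour) for who, place, hour in sightings if who == p}
    where_q = {(place, hour) for who, place, hour in sightings if who == q}
    return len(where_p & where_q)


def shared_attributes(people, p, q):
    """List the attributes (home, workplace) that p and q have in common."""
    return sorted(
        key
        for key in ("home", "workplace")
        if people[p].get(key) is not None
        and people[p].get(key) == people[q].get(key)
    )


print(co_locations(sightings, "A", "B"))   # how often A and B coincide
print(shared_attributes(people, "A", "B")) # a possible innocent explanation
```

In this toy data A and B coincide three times, and the linked dataset shows that they share a workplace. An empty list would not show that anything sinister is going on, only that this particular second dataset offers no explanation.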

There's a converse concern, which is that poor discrimination can lead to the wrong conclusions being drawn: classifying person A as a potential stalker when he's actually an innocent who happens to follow a similar schedule. An automated analysis of a single dataset risks finding spurious connections, and it's increasingly the case that these false positives (or false negatives, for that matter) could have real-world consequences.

Focusing on connections between data has its own dangers, of course, since we already know that we can make very precise classifications of people's actions from relatively small, but richly connected, datasets. Maybe the point here is that focusing exclusively on the size of a dataset masks both the advantages to be had from richer connections with other datasets, and the benefits and risks associated with smaller but better-connected datasets. Looking deeply can be as effective as looking broadly, or more so.