“Data science” and “data scientist” are not good terms — for anything.
The recent revolution in our ability to monitor more and more human (and indeed non-human) activity has spawned a whole new field of study. Known variously as “big data”, “data science”, and “digital humanities”, the idea is that — by studying the data collected — we can perform feats of prediction and customisation that defeat other approaches.
This is not all hype. There’s no doubt that deriving algorithms from data (known as “data-driven design”) can often work better than a priori design. The best-known example of this is Google Translate, whose translation approach is driven almost entirely by applying statistical learning to a large corpus of documents for which exact translations exist across languages, and using the correlations found as the basis for translation. (The documents in question were actually the core agreements governing the operation of the EU, the acquis communautaire, which explains why Google Translate works best on bureaucratese and worst on poetry.) It turns out that data-driven learning works better in most cases than grammar-directed parsing.
The data-driven approach rests on several pillars. Chief among them is applied machine learning as mentioned above, allowing algorithms to look through bodies of data and learn the correlations that exist between events. (We use similar techniques to analyse sensor data and perform situation recognition. See Ye, Dobson and McKeever. Situation identification techniques in pervasive computing: a review. Pervasive and Mobile Computing 8(1), pp. 36–66. 2012.) Various other statistical techniques can also be applied, ranging up from simple mean and variance calculations. One can usually augment the basic techniques using more structural methods: if you know the structure of the links in a web site, for example, you can learn more about how people navigate because you know that some routes are more probable (by clicking hyperlinks) than others. The results of these analyses then need to be presented as analytics for consumption by managers and decision-makers, and distilling large volumes of information into visually compelling and comprehensible form is a skill in itself.
Going a stage further, one may conduct experiments directly. If you see people searching for the same term, present the results to different people in different ways and see how that influences the links they click. Google again are at the forefront.
The excitement of these techniques has spilled over into the wider science and humanities landscapes. Just within St Andrews this week I’ve talked about projects for analysing DNA data using machine learning, improving clinical trials, building a concept map for two centuries-worth of literature pertaining to Edinburgh, detecting intruders in computer systems, and mapping the births, deaths, and marriages of Scotland from parish records — and it’s only Wednesday. All these activities are inherently data-driven: they only become possible because we can ingest and analyse large data volumes at high-enough speeds.
However, none of this implies that there is a subject called “data science”, and I’m starting to worry that a false premise is being established.
The term “data science” is a tricky one to parse. How many sciences do you know that aren’t based in data? “Data-driven” is the same. That’s not to say that they’re meaningless, exactly: they identify a sub-set of techniques that are qualitatively different to the more theory-driven approaches, and even differentiate themselves from experimentally-driven approaches by their attempt to correlate across datasets rather than being based on a single homogeneous sampling (however large). But that’s a nuanced reading, and a general reader would be forgiven for believing that “data science” was somehow a separate discipline from “ordinary” science, instead of it denoting a set of computational techniques with general applicability across the range of (non-)traditional sciences.
But it’s “data scientist” that really troubles me. This seems to suggest that one can find scientific meaning in the data and the data alone. It seems to suggest that there’s a short-cut to scientific insight through the data (and by implication, computer science) rather than through traditional scientific training. And this simply isn’t true.
The problem is the age-old difference between correlation and causality. Correlation is when things happen together: you leave a cup of tea and a biscuit on a table for long enough, the tea goes cold and the biscuit gets stale. Causality is when one thing happening triggers another thing happening: the tea got cold, and that made the biscuit become stale. Mistaking one for the other leads to all sorts of interesting possibilities: if we put the tea in an insulated cup, the biscuit will stay fresh longer.
Now, the final tea-and-biscuit example is clearly meaningless, but ask yourself this: how did you recognise it as meaningless? Because you understand that the two effects are happening independently because of the passage of time, not because of each other: they are correlated but not causative. You understand this because you have insight into the processes of the world.
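To make the tea-and-biscuit point concrete, here is a minimal sketch of it as a toy simulation. Everything in it is an assumption made up for illustration: the temperatures, the cooling and staling rates, the noise levels, and the function names `simulate` and `pearson`. Both quantities depend only on elapsed time, yet the data show a strong correlation between them; "intervening" with an insulated cup changes the tea and leaves the biscuit exactly as it was.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(minutes, cooling_rate, seed=0):
    """Toy model: tea temperature and biscuit staleness, one sample per minute.

    Both series are functions of elapsed time only; neither reads the other.
    All constants here are illustrative assumptions, not measurements.
    """
    rng = random.Random(seed)
    tea_temp = []
    staleness = []
    for t in range(minutes):
        # Tea cools from 90 degrees towards a 20 degree room, plus sensor noise.
        tea_temp.append(20 + 70 * math.exp(-cooling_rate * t) + rng.gauss(0, 0.5))
        # Biscuit staleness grows with time, independently of the tea.
        staleness.append(0.01 * t + rng.gauss(0, 0.02))
    return tea_temp, staleness


# Observation: an ordinary cup. Tea temperature and staleness are strongly
# (negatively) correlated, although neither causes the other.
tea, biscuit = simulate(120, cooling_rate=0.03)
r_observed = pearson(tea, biscuit)

# Intervention: an insulated cup (slower cooling), same random seed.
tea_insulated, biscuit_insulated = simulate(120, cooling_rate=0.003)

print(f"correlation(tea temperature, staleness) = {r_observed:.2f}")
print("biscuit unchanged by insulated cup:", biscuit == biscuit_insulated)
```

The correlation alone cannot tell you that the insulated cup is useless; only the structure of `simulate`, which stands in for knowledge of how the world works, does that.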
And it’s here that the problems start for data scientists. In order to interpret the machine learning, statistics, analytics, or other results, you have to have an understanding of the underlying processes. That understanding doesn’t come from the data at all. That’s fine for tea and biscuits, and is also probably fairly fine for sales of consumer goods on mass-market web sites, where we have a good intuitive understanding of the processes involved, but our ability to interpret the results will drop off rapidly as we approach areas that are more complex, more noisy, less intuitive, and less well-understood. How can you differentiate correlation from causality if you don’t understand what’s possible in the underlying domain? How, in fact, can you determine anything from the data in and of itself?
This suggests to me that data scientist isn’t a job description, or even an aspiration: it’s a misnomer that should really be read as “a trained scientist working with lots of rich empirical data”. Not as sexy, but probably more useful and less likely to disappoint everyone involved.