
No data scientists

"Data science" and "data scientist" are not good terms -- for anything.

The recent revolution in our ability to monitor more and more human (and indeed non-human) activity has spawned a whole new field of study. Known variously as "big data", "data science", and "digital humanities", the idea is that -- by studying the data collected -- we can perform feats of prediction and customisation that defeat other approaches.

This is not all hype. There's no doubt that deriving algorithms from data -- known as "data-driven design" -- can often work better than a priori design. The best-known example of this is Google Translate, whose translation approach is driven almost entirely by applying statistical learning to a large corpus of documents for which exact translations exist across languages, and using the correlations found as the basis for translation. (The documents in question were actually the core agreements governing the operation of the EU, the acquis communautaire, which explains why Google Translate works best on bureaucratese and worst on poetry.) It turns out that data-driven learning works better in most cases than grammar-directed parsing.

The data-driven approach rests on several pillars. Chief among them is applied machine learning as mentioned above, allowing algorithms to look through bodies of data and learn the correlations that exist between events. (We use similar techniques to analyse sensor data and perform situation recognition. See Ye, Dobson and McKeever. Situation identification techniques in pervasive computing: a review. Pervasive and Mobile Computing 8(1), pp. 36--66. 2012.) Various other statistical techniques can also be applied, ranging up from simple mean and variance calculations. One can usually augment the basic techniques using more structural methods: if you know the structure of the links in a web site, for example, you can learn more about how people navigate because you know that some routes are more probable (by clicking hyperlinks) than others. The results of these analyses then need to be presented as analytics for consumption by managers and decision-makers, and distilling large volumes of information into visually compelling and comprehensible form is a skill in itself.

Going a stage further, one may conduct experiments directly. If you see people searching for the same term, present the results to different people in different ways and see how that influences the links they click. Google again are at the forefront.
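To make the experimental idea concrete, here is a minimal Python sketch of how one might compare two presentations of the same search results. It assumes we have counted, for each presentation, how many people saw it and how many clicked, and it uses a standard two-proportion z-test. The function name and the counts are hypothetical, chosen purely for illustration, and this is not a description of how Google or anyone else actually runs such experiments.

```python
import math

def two_proportion_z(clicks_a, shown_a, clicks_b, shown_b):
    """Compare the click-through rates of two presentations, A and B.

    Returns the z statistic and a two-sided p-value under the null
    hypothesis that both presentations have the same click-through rate.
    """
    rate_a = clicks_a / shown_a
    rate_b = clicks_b / shown_b
    # Pooled rate: the best estimate of the shared rate if A and B do not differ.
    pooled = (clicks_a + clicks_b) / (shown_a + shown_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 1000 people saw each presentation.
z, p = two_proportion_z(clicks_a=120, shown_a=1000, clicks_b=150, shown_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Because the experimenter chooses who sees which presentation, a difference in click rates here says more about cause than a correlation found in passively collected logs would.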

The excitement of these techniques has spilled over into the wider science and humanities landscapes. Just within St Andrews this week I've talked about projects for analysing DNA data using machine learning, improving clinical trials, building a concept map for two centuries' worth of literature pertaining to Edinburgh, detecting intruders in computer systems, and mapping the births, deaths, and marriages of Scotland from parish records -- and it's only Wednesday. All these activities are inherently data-driven: they only become possible because we can ingest and analyse large data volumes at high-enough speeds.

However, none of this implies that there is a subject called "data science", and I'm starting to worry that a false premise is being established.

The term "data science" is a tricky one to parse. How many sciences do you know that aren't based in data? "Data-driven" is the same. That's not to say that they're meaningless, exactly: they identify a sub-set of techniques that are qualitatively different to the more theory-driven approaches, and even differentiate themselves from experimentally-driven approaches by their attempt to correlate across datasets rather than being based on a single homogeneous sampling (however large). But that's a nuanced reading, and a general reader would be forgiven for believing that "data science" was somehow a separate discipline from "ordinary" science, instead of it denoting a set of computational techniques with general applicability across the range of (non-)traditional sciences.

But it's "data scientist" that really troubles me. This seems to suggest that one can find scientific meaning in the data and the data alone. It seems to suggest that there's a short-cut to scientific insight through the data (and by implication, computer science) rather than through traditional scientific training. And this simply isn't true.

The problem is the age-old difference between correlation and causality. Correlation is when things happen together: you leave a cup of tea and a biscuit on a table for long enough, the tea goes cold and the biscuit gets stale. Causality is when one thing happening triggers another thing happening: the tea got cold, and that made the biscuit become stale. Mistaking one for the other leads to all sorts of interesting possibilities: if we put the tea in an insulated cup, the biscuit will stay fresh longer.

Now, the final tea-and-biscuit example is clearly meaningless, but ask yourself this: how did you recognise it as meaningless? Because you understand that the two effects are happening independently because of the passage of time, not because of each other: they are correlated but not causative. You understand this because you have insight into the processes of the world.
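The tea-and-biscuit argument can be sketched in a few lines of Python. This is a toy simulation under loudly stated assumptions: the tea cools exponentially towards room temperature, the biscuit goes stale as a function of time alone, and every rate constant is made up. The names `simulate` and `pearson` are hypothetical helpers, not anything from a library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate(insulated, minutes=range(0, 241, 10)):
    """Return (tea temperatures, biscuit staleness) sampled over four hours.

    Both series depend only on elapsed time. The biscuit never "sees" the tea.
    All constants are invented for illustration.
    """
    cooling_rate = 0.005 if insulated else 0.05  # an insulated cup cools more slowly
    staling_rate = 0.01
    tea = [20.0 + 60.0 * math.exp(-cooling_rate * m) for m in minutes]
    biscuit = [1.0 - math.exp(-staling_rate * m) for m in minutes]
    return tea, biscuit

# Observation: in an ordinary cup, colder tea goes with a staler biscuit.
tea, biscuit = simulate(insulated=False)
r = pearson(tea, biscuit)
print(f"correlation between tea temperature and staleness: {r:.2f}")

# Intervention: insulate the cup. The tea stays warmer, the biscuit doesn't care.
warm_tea, same_biscuit = simulate(insulated=True)
print("biscuit unchanged by insulating the cup:", same_biscuit == biscuit)
```

The data alone shows a strong negative correlation. Only the model of the process, which we wrote down ourselves, tells us that intervening on the tea leaves the biscuit untouched.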

And it's here that the problems start for data scientists. In order to interpret the machine learning, statistics, analytics, or other results, you have to have an understanding of the underlying processes. That understanding doesn't come from the data at all. That's fine for tea and biscuits, and is probably also fine for sales of consumer goods on mass-market web sites, where we have a good intuitive understanding of the processes involved, but our ability to interpret will drop off rapidly as we approach areas that are more complex, noisier, less intuitive, and less well understood. How can you differentiate correlation from causality if you don't understand what's possible in the underlying domain? How, in fact, can you determine anything from the data in and of itself?

This suggests to me that data scientist isn't a job description, or even an aspiration: it's a misnomer that should really be read as "a trained scientist working with lots of rich empirical data". Not as sexy, but probably more useful and less likely to disappoint everyone involved.

Tao Te Ching

Lao Tzu


Whatever your view on Eastern philosophy, this is a book worth reading for its poetry alone. This particular edition, with careful translations and wonderfully atmospheric accompanying photography, is a complete joy and a perfect encapsulation of the words.

5/5. Finished 09 March 2014.

(Originally published on Goodreads.)

The Nazi and the Psychiatrist: Hermann Göring, Dr. Douglas M. Kelley, and a Fatal Meeting of Minds at the End of WWII

Jack El-Hai


I came to this book at the recommendation of a friend, and it's one of the best I've read recently.

The book tells the story of the meeting of psychiatrist Douglas Kelley with Reichsmarschall Hermann Goering in the Nuremberg gaol where the latter was on trial. It's mainly a book about Kelley, Goering's personality having been dissected enough in other works (not least Kelley's own 22 Cells in Nuremberg, which is itself well worth reading for eyewitness value). It follows Kelley from his childhood as the driven and over-achieving child of a famous, eccentric, Californian family; through his committed treatment of shell-shocked soldiers and his appointment to oversee the mental health of the Nuremberg inmates; to his later career and eventual suicide. The fascination of his suicide is that Kelley chooses the same method as Goering himself -- cyanide -- in a dramatic and public gesture that seems to lack any real motive.

Kelley's commitment to traumatised soldiers notwithstanding, he was not an attractive personality. His psychiatric certainty is alarming, especially given his reliance on Rorschach blot tests and truth drugs that would not nowadays be well thought of. The author does a good job of highlighting the controversy that Rorschach interpretation engendered, even though the consensus now favours Kelley's view that there was no "Nazi personality" or particular criminal characteristic that set the Nuremberg prisoners apart.

This really is an excellent read, and the only reason for giving it four stars instead of five is this (which may be an unfair criticism anyway): the author never really nails the interaction between Kelley and Goering, in the sense of how the experience affected Kelley's personality. The doctor comes across as self-absorbed, opinionated, and controlling. He steps into the limelight whenever possible, offering his opinions with a quite hair-raising certainty, and one can't quite escape the suspicion that his certainty could easily have abetted miscarriages of justice in his later career. He tries to shape his eldest son in a particular image, is withdrawn and self-absorbed, and kills himself in what almost seems like a fit of pique.

But despite the book's title, the core question remains unanswered: to what extent was Kelley shaped by Nuremberg, and by Goering in particular? Did the experience of the prison change him from confident to arrogant, or was that transition inevitable and perhaps just slightly reinforced? He clearly always had a yearning for fame, and yet received his most famous assignment largely by chance. Was the experience formative for him (as wartime experiences were for so many), or did it simply provide a lever by which to accomplish pre-existing goals? I suspect Nuremberg affected Kelley less than one might imagine, his self-absorption protecting him while allowing him to function. In that sense he was the perfect choice for Nuremberg psychiatrist.

4/5. Finished 07 March 2014.

(Originally published on Goodreads.)

How to Thrive in the Digital Age

Tom Chatfield


A short exploration of the state of modern internet and social experience. In some ways the book is mis-named, in that it isn't in any way prescriptive or suggestive of how one should thrive, but rather illustrates some of the issues one should consider in order to do so: such issues as privacy, time away, imaginary vs real experience, and the like. Definitely worth a read, and with an excellent bibliography pointing to further information.

3/5. Finished 25 February 2014.

(Originally published on Goodreads.)