Let's teach everyone about big data
Demolishing the straw men of big data.

This post comes about from reading Tim Harford's opinion piece in the Financial Times, in which he offers a critique of "big data": the idea that we can perform all the science we want simply by collecting large datasets and then letting machine learning and other algorithms loose on them. Harford deploys a whole range of criticisms against this claim, all of which are perfectly valid: sampling bias will render a lot of datasets worthless; correlations will appear without causation; the search goes on without hypotheses to guide it, and so isn't well-founded in falsifiable predictions; and an investigator without a solid background in the science underlying the data will have no way to correct these errors. The critique is, in other words, damning.

The only problem is, that's not what most scientists with an interest in data-intensive research are claiming to do.

Let's consider the biggest data-driven project to date, the Large Hadron Collider's search for the Higgs boson. This project involved building a huge experiment that generated huge data volumes, which were then trawled for the signature of Higgs interactions. The challenge was so great that the consortium had to develop new computer architectures, data storage, and triage techniques just to keep up with the avalanche of data. None of this was, however, a "hypothesis-free" search through the data for correlations. On the contrary, the theory underlying the search for the Higgs made quite definite predictions as to what its signature should look like. Nonetheless, there would have been no way of confirming or refuting those predictions without collecting the data volumes necessary to make the signal stand out from the noise. That's data-intensive research: using new data-driven techniques to confirm or refute hypotheses about the world.
Data-intensive research gives us another suite of techniques to deploy, changing both the way we do science and the science that we do. It doesn't replace the other ways of doing science, any more than the introduction of any other technology necessarily invalidates what came before. Microscopes did not remove the need for, or value of, searching for or classifying new species: they just provided a new, complementary approach to both.

That's not to say that all the big data propositions are equally appropriate, and I'm certainly with Harford in the view that approaches like Google Flu are deeply and fundamentally flawed, over-hyped attempts to grab the limelight. Where he and I diverge is that Harford worries that all data-driven research falls into this category, and that's clearly not true. He may be right that a lot of big data research is a corporate plot to re-direct science, but he's wrong to worry that all projects working with big data are similarly driven.

I've argued before that "data scientist" is a nonsense term, and I still think so. Data-driven research is just research, and needs the same skills of understanding and critical thinking. The fact that some companies and others with agendas are hijacking the term worries me a little, but in reality it's no more significant than the New Age movement's hijacking of terms like "energy" and "quantum" -- and one doesn't stop doing physics because of that.

In fact, I think Harford's critique is a valuable and significant contribution to the debate precisely because it highlights the need for understanding beyond the data: it's essentially a call for scientists to use data-driven techniques only in the service of science, not as a replacement for it. It's an argument, in other words, for a broadly-based education in data-driven techniques for all scientists, and indeed all researchers, since the techniques are equally (if not more) applicable to the social sciences and humanities.
The new techniques open up new areas, and we have to understand their strengths and limitations, and use them to bring our subjects forwards -- not simply step away because we're afraid of their potential for misuse.

UPDATE 7Apr2014: An opinion piece in the New York Times agrees: "big data can work well as an adjunct to scientific inquiry but rarely succeeds as a wholesale replacement." The number of statistical land mines is enormous, but the right approach is to be aware of them and make the general research community aware too, so we can use the data properly and to best effect.