It's the tooling, not the terabytes

With all the hype about big data it’s sometimes hard to realise that it’s about more than just data. In fact, its real interest doesn’t come from big-ness at all.

The term big data is deceptively hard to parse. It’s a relative term, for a start: when you get down to it, all it really means is data that’s large relative to the available storage and processing capacity. From that perspective, big data has always been with us, and always will be. It’s also curiously technology-specific for something that’s garnering such broad interest. A volume of data may be “big” for one platform (a laptop, for example) and not for others (a computing cluster with a large network-attached store).

Getting away from the data, what researchers typically mean by big data is a set of computational techniques that can be applied broadly to data, somewhat independently of its origins and subject. I’m not a fan of “data science”, but the term does at least focus on techniques and not on data size alone. This is much more interesting, and poses a set of challenges to disciplines, and to computer science, to identify and expand the set of techniques and tools researchers can make use of.

What often frightens people about big data (or makes it an attractive career niche, depending on your point of view) is that there is a step-change in how you interact with data beyond a certain size. It’s this change in tooling requirements that I think is a more significant indicator of a big-data problem than simply the data size.

Suppose you collect a dataset consisting of a few hundred samples, each with maybe ten variables: something that could come from a small survey, for example. This size of data can easily be turned into a spreadsheet model, and a large class of researchers have become completely comfortable with building such models. (This didn’t use to be the case: I had a summer job in the late 1980s building spreadsheet models for a group that didn’t have the necessary experience. Research skills evolve.) Even a survey with thousands of rows would be possible.

But what if the survey has several million rows, for example because it came from an online survey, or because it was sensed automatically in some way? A typical spreadsheet program can’t ingest that much data (Excel, for one, stops at 1,048,576 rows per worksheet). A few million rows does not constitute what many people would regard as “big data”: it’ll be a few hundred megabytes at most. But that’s not the point. The point is that, in order to deal with it, the researchers have to change tools, and change them quite radically. Rather than using a spreadsheet, they have to become programmers; not just that, they have to become programmers who are familiar with languages like Python (to get the libraries) and with platforms like Hadoop and cloud computing (to be able to scale out the solutions). They could employ programmers, of course, but that removes them from the immediate and intimate contact with their data that many researchers find extremely valuable, and that personal computers have given us. To many people, hiring a programmer and running calculations in the cloud is suspiciously like a return to the mainframe computing we thought was rendered obsolete decades ago.

It’s not the terabytes, it’s the tooling. Big data starts when you cross the threshold from your familiar world of interactive tools into a more specialised, programmer’s world. This transition is becoming common in the humanities and social sciences, as well as in science and medicine, and to my mind it’s the major challenge of big data.

The basic problem is simple: changing tools takes time, expertise, and mental effort, all of which are taken away from the researcher’s actual research interest. A further disincentive is that the effort may not be rewarded: after all, if this really is research, one is often not sure whether there actually is anything valuable on the other side of a data analysis. In fields where competition is fierce, like medicine, this feels like a lot of risk for an uncertain reward. I suspect, though I can’t prove it, that people are steering clear of doing the experiments they know they should do because they know they’ll generate data volumes they’re uncomfortable with.

This, then, is where the big data challenge actually is: minimising the cost of transition from tools that are familiar to tools that are effective on the larger data volumes that are now becoming commonplace. This is a programming and user interface challenge: to make the complex infrastructure appear easy and straightforward to people who want to accomplish tasks on it. A large challenge, but not an unprecedented one: I’m writing this just after the 25th birthday of the World Wide Web, which took a complex infrastructure (the internet) and made it usable by everyone.

Let’s just not get too hung up on the idea that data needs to be really big to be interesting in this new data-driven world.
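
As a rough illustration of the change in tooling described above, here is a minimal Python sketch of what replaces "open it in a spreadsheet and add an AVERAGE row" once a survey runs to millions of rows. It assumes the data arrives as CSV with a header row; the function name `column_means` and the sample columns are hypothetical, chosen only for this example. It reads one row at a time, so memory use stays flat however long the file is.

```python
import csv
import io
from collections import defaultdict


def column_means(rows):
    """Return the mean of every numeric column, in one streaming pass.

    `rows` is any iterable of dicts (for example a csv.DictReader), so
    only one row is held in memory at a time.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        for name, value in row.items():
            try:
                number = float(value)
            except (TypeError, ValueError):
                continue  # skip blanks, text columns, and ragged rows
            totals[name] += number
            counts[name] += 1
    return {name: totals[name] / counts[name] for name in counts}


if __name__ == "__main__":
    # A tiny in-memory stand-in for a (hypothetical) multi-million-row file.
    sample = io.StringIO("age,score,town\n30,4.0,Fife\n40,5.0,Perth\n")
    print(column_means(csv.DictReader(sample)))
```

For a real file the call would look much the same, with `open(path, newline="")` in place of the `StringIO` object. Nothing here is difficult for a programmer, which is rather the point: it is a different way of working from a spreadsheet, and that difference is the cost the researcher has to pay.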

Simon Dobson
