We’re six weeks into 2014, and the concept of “big data” continues to make headlines nearly daily. The New York Times recently reported that over the past three years, there’s been a hundredfold increase in Google searches for the term. Last fall, Stanford enrolled more than 700 students in a course on machine learning and statistical algorithms. It was the largest on-campus class all semester.
On the one hand, I’m ecstatic about the promise of using big data to do amazing things:
- Create predictive models for personalized medicine
- Develop brain-like computers that can learn from experience
- Transform workplace strategies
And the massive amounts of data now being collected and stored cheaply require armies of quantitatively minded individuals to make sense of them all.
Some “quants” have even achieved rock-star status. Nate Silver is one example, and more are on the way: two female statisticians are on this year’s Forbes list of 30 most influential scientists under the age of 30, and both work in the big-data fields of imaging and genomics. But is bigger always better when it comes to data?
Big Data: One Term, Many Meanings
“Big data” probably has as many definitions as there are users, much as “bioinformatics” did when that term became popular over a decade ago. What the phrase means depends largely on the field of the person using it.
Statisticians working in genomics have long faced a big-data problem of their own; they just called it something else: “high-dimensional data,” or the “large p, small n” problem. They had lots of information per subject, such as expression levels for thousands of genes or other features (the large p), but the high cost of the assays meant they could obtain all this genetic information on only a few experimental units at a time (the small n). The challenge was to come up with ways to reduce the dimensionality of the genomic data using clustering, classification and other pattern-recognition approaches.
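The “large p, small n” setting is easy to sketch numerically. The figures below are made up for illustration: 2,000 gene-expression features on just 10 units, with an artificial two-group pattern planted in the first 50 genes, reduced via principal components.

```python
import numpy as np

# Hypothetical "large p, small n" data: expression levels for p = 2000
# genes measured on only n = 10 experimental units, with a planted
# two-group pattern in the first 50 genes.
rng = np.random.default_rng(0)
n, p = 10, 2000
X = rng.normal(size=(n, p))
X[:5, :50] += 6.0          # first five units share a strong shift

# Principal components via the SVD: once the columns are centered, a
# matrix with n rows has rank at most n - 1, so the 2000 features
# collapse to at most 9 informative coordinates.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s             # the 10 units projected onto the components

print("informative components:", int(np.sum(s > s[0] * 1e-10)))  # 9

# A crude split on the first component is one way the planted
# two-group structure can be recovered from the reduced data.
split = scores[:, 0] > np.median(scores[:, 0])
print("group sizes:", int(split.sum()), int((~split).sum()))
```

However many thousands of features are measured, the geometry only supports n − 1 directions of variation, which is why dimension reduction is the natural first move in this setting.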
With information that is captured digitally and relatively cheaply using modern technologies, and often as a by-product of other primary activities, such as Internet searches, online shopping or visits to the doctor, you have a different type of big data: a ton of information on a ton of subjects. Electronic health-record databases, for example, hold a lot of clinical and laboratory data on thousands of patients. From an analytic perspective, this means you now have “huge p, huge n”—the best of both worlds, right? Not necessarily.
Growing Bigger and Less Confident
As Malcolm Gladwell reminds us in his latest book, David and Goliath, sheer size can be a source of tremendous power, but it can also obscure major weaknesses. Similarly, just having a lot of data, even if they are messy and unreliable, can give people a false sense of confidence about the accuracy of their results. The caveat with big data is that they often do not come from carefully designed experiments with the goal of producing reliable and generalizable results. As a consequence, the data are often poorly understood, of variable quality, and drawn from a biased population. For example, users of social-networking sites such as Twitter tend to be younger than non-users. Similarly, patients in some electronic health systems have different socioeconomic characteristics from those of the general U.S. population.
Don’t Let Machines Do All the Work
Powerful computers and analytic tools are available to automate the data processing and analysis; however, it’s dangerous simply to sit back and let the machines do all the work. Machines don’t understand context; algorithms are often based on unverifiable assumptions, and informed decisions need to be made at each step: which method to choose, how to set tuning parameters, what to do with missing or messy data fields. And in the world of clinical research, it is more obvious than ever that interdisciplinary teams of information technology experts, clinicians, epidemiologists, computer scientists and statisticians are needed to integrate diverse knowledge bases and skill sets in tackling big-data issues.
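One of those judgment calls, what to do with missing data fields, is easy to demonstrate with a toy sketch (the variable names and numbers below are invented, not from any real study). When high readings are likelier to go missing, two defensible automated choices give two different wrong answers, and only a person who understands the context can notice:

```python
import random
import statistics

# Invented example: 200 blood-pressure-like readings. High values and
# a random 25% of the rest go missing (informative missingness).
random.seed(1)
values = [random.gauss(120.0, 15.0) for _ in range(200)]
observed = [v if v <= 140.0 and random.random() >= 0.25 else None
            for v in values]
present = [v for v in observed if v is not None]

# Decision A: complete-case analysis (drop the missing entries).
cc_mean = statistics.mean(present)

# Decision B: fill the holes with the observed mean.
imputed = [v if v is not None else cc_mean for v in observed]

# Because high readings went missing, both estimates sit below the
# truth, and mean imputation additionally understates the spread.
print(f"true mean       {statistics.mean(values):6.1f}")
print(f"complete-case   {cc_mean:6.1f}")
print(f"spread: imputed {statistics.pstdev(imputed):5.2f} "
      f"vs observed {statistics.pstdev(present):5.2f}")
```

No algorithm in this sketch is buggy; each simply encodes an assumption about why the data are missing, which is exactly the kind of decision that can't be left to the machine.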
No doubt there is tremendous potential for big data to transform what questions we ask and how we conduct research, and ultimately to advance scientific knowledge. But whether we are dealing with petabytes of data or measurements from a few lab animals, the bottom line is this: sound statistical thinking—good study design, generalizability, reliable measurements, robust algorithms—is the key to valid results. And we shouldn’t forget the value of the most low-tech and cheapest tools of all, common sense and human judgment, when dealing with all kinds of data, big and small.