Collecting massive amounts of information isn’t a big deal—it’s all in the analysis
The recent news that the National Security Agency has been involved in massive information-gathering activities, including widespread acquisition of phone records in the U.S. and monitoring of Internet activity by users abroad, has drawn attention to the phenomenon known as “big data”—the collection, storage and analysis of data sets far larger than those that could be created just a few years ago.
But it is not only government that deals in big data. As the ability to produce and collect digital information of all sorts continues to grow, big data has become a focus for the worlds of finance, commerce, science, agriculture, marketing and social media. Just about any enterprise that generates information can find itself wading into the territory of big data.
Remco Chang, assistant professor of computer science in the School of Engineering, specializes in the field of visual analytics, which uses visual techniques to sort information in a way that helps detect patterns and extract meaning. He points out that while “big data” is a popular buzzword, what makes a data set valuable as a problem-solving tool is not just the amount of information in it.
The ability to sort through, analyze and understand the information is crucial, he says. Think of a cluttered basement filled with old textbooks, outgrown toys and half-used craft supplies. They may all be useful for something, someday, but chances are that when they’re needed, you won’t be able to find them, or even remember that they’re there at all.
Tufts Now: How would you define big data? How “big” does it need to be?
Remco Chang: People really can’t seem to agree on that. Nobody seems to know how “big” your data has to be before it should be considered “big.” And is big data big simply because it’s large, or because it contains some amount of complexity? I’m naturally inclined to think the sheer size of the data is not the most important part. You can have data that is large, but not interesting. You can take a super high-res camera and film your wall for 24 hours, and the data—24 hours of wall—will be really big, but nobody cares.
So what is the value of big data? How does it fit into your research?
At the end of the day, and I think this is where big data is interesting, somebody has to analyze it. Collecting is step zero. But now that you have all this stuff sitting in some massive data warehouse, what do you do with it?
I use visualization to analyze data, to help people make decisions. Humans don’t make decisions by analyzing gigabytes and terabytes of information. In our heads, we’re generalizing, synthesizing, abstracting to the point of making a “yes or no” decision. We know from psychology and cognitive science research that people don’t consider every piece of data—yet we have to somehow make a decision. How do we reduce the information to a point where people are comfortable making a final call?
Data should serve people. When data is small, it’s still difficult to make decisions. Now the data is huge, and our capacity for decision making hasn’t changed—now you’re just throwing more stuff at me. That will play out in new challenges, and lead us to new perspectives on how people form decisions, how people synthesize information. How do we help people walk through all the noise? That’s where research is really interesting.
Is there any value to amassing data with the idea that it may come in useful at some future point?
Part of the problem of collecting data and hoping something will happen that will make it useful tomorrow, a month from now, a year from now is that you don’t know what’s in your data. What do you expect to be in there, and are you willing to pay for that one in a bazillion chance that you might capture something of interest?
If you’re a really big government or a huge organization, if all of this noise could potentially be useful, it’s possible that collecting data without a clear sense of why you’re doing it could be worthwhile, but I think this is an individual or organizational decision. In terms of financial transactions or medical records, you could use the data to perform some kind of analysis to find general patterns, trends, to potentially find what happens next, to make predictions. But in terms of that one-in-a-bazillion event, you don’t need analysis—you’re just going back to see if you captured it.
What about the risks of spending so much time, energy and money collecting data that you might not be able to do anything with?
If you’re going to talk about big data, you have to talk about different types of uses. Sometimes collecting as much background chatter as possible is useful. It’s like the way governments have dealt with nuclear threats. They devote lots of resources, disproportionate to the threat that a nuclear incident could occur. But if such an incident does occur, their work is incredibly impactful. Are you willing to make that tradeoff? I don’t have the answer.
Some of this sounds a bit like electronic hoarding.
A colleague once used those same words. He said big data for a lot of companies these days is like hoarding. Executives can’t make a decision or come to a consensus about what’s valuable, so they say, “We’re going to collect everything. Get as much as you can and we’ll figure it out later.” The hope is that as time goes on, these organizations will go from being hoarders to being collectors. Collectors actually know what they’re hoarding—they have a sense of what the value is and when it will become valuable.