Big Data

Popular Science this month published a special issue, “Data is Power: How Information is Driving the Future.” I’ve been thinking about this topic a lot lately (have I really filled up that many disk-and-cloud drives?). Mostly, I have been struck by how little we have really delved into this information treasure-house we have built.

In 2010, the total amount of data stored globally surpassed one zettabyte. How much data is that? A zettabyte is one trillion gigabytes. The 2010 total is the equivalent of 37.5 billion 32GB iPads (which works out to about 1.2 zettabytes), or the equivalent of all the books in all the academic libraries in the U.S., times half a million. Not only that, but it is growing exponentially: in 2011 we will create half again as much data as had been created from the start of humankind through 2010!
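For anyone who wants to check these conversions, here is a minimal sketch of the arithmetic. It assumes decimal (SI) units, where a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes; the function names are made up for illustration.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption; binary
# units such as GiB would give slightly different numbers).
GIGABYTE = 10**9
ZETTABYTE = 10**21


def gigabytes_in(total_bytes: int) -> int:
    """Number of whole gigabytes in a byte count."""
    return total_bytes // GIGABYTE


def devices_needed(total_bytes: int, device_gb: int) -> float:
    """How many devices of the given capacity (in GB) hold total_bytes."""
    return total_bytes / (device_gb * GIGABYTE)


def zettabytes_held(device_count: float, device_gb: int) -> float:
    """Total capacity, in zettabytes, of device_count devices."""
    return device_count * device_gb * GIGABYTE / ZETTABYTE


if __name__ == "__main__":
    print(f"{gigabytes_in(ZETTABYTE):,} GB in one zettabyte")
    print(f"{devices_needed(ZETTABYTE, 32) / 1e9:.2f} billion 32GB devices per zettabyte")
    print(f"{zettabytes_held(37.5e9, 32):.2f} ZB held by 37.5 billion 32GB devices")
```

Under those assumptions, exactly one zettabyte fills 31.25 billion 32GB devices, while 37.5 billion of them hold about 1.2 zettabytes, which fits a total that "surpassed" one zettabyte.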

This data explosion is coming from social media, science, government, entertainment and business. We Facebook, tweet, post photos, send text messages, and share videos by the billions. Scientists doing research create massive numbers of data points, especially in areas such as astronomy and genetics. Governments at all levels collect more data each year. The entertainment world generates data as music and videos. And of course, businesses collectively create and store trillions of transactions every year.

To some, this is scary as we come to realize how little privacy we actually have. To others, this is what we need to save the planet and make a better world for all.  To Amazon and Wal-Mart, it’s a business model.  But no one knows more about “big data” than Google.

Eric Schmidt, executive chairman of Google:

“Every two days now we create as much information as we did from the dawn of civilization up until 2003. That’s something like five exabytes [5 billion gigabytes] of data.”

In 2008, Hal Varian, the company’s chief economist, said:

“The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades.”

“Data science” has made significant progress in just the past couple of years and may prove to be one of the most important sciences of the 21st century. As computing and storage capacity increases and its cost drops, researchers everywhere will be able to store more and more information.

For example, for comparable costs, the same cancer research project that could track 5,000 patients in 1990 might track 500,000 patients in 2012, and it could track as many as 100 times the data points per patient as that 1990 study. The researcher would also have access to algorithms and tools for data analysis that would enable them to “mine” that data for all it’s worth. Further, that researcher’s data and analysis can be made available to other researchers to build on.

Our capacity to collect and store data far exceeds our ability to analyze and understand that data, but that will come. For now, let the great data gathering continue.