Let me start out by saying that I’m a huge fan of Google’s CEO, Eric Schmidt. Recently, however, I was a little perplexed by a statement he made at Google’s 2010 Atmosphere convention. He said the following:
“There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.”
This is a remarkable statement, and I wanted to understand the details behind it. I decided to poke around and see if I could find out where this data was actually coming from.
The first thing I discovered was that the press loves this quote. (Understandably so: it illustrates a big idea in a catchy sound bite.) The quote has shown up in TechCrunch, ReadWriteWeb, Fox News, Inc., and The Huffington Post, just to name a few. I even used it in my recent talk about big data at TEDxPhilly.
My search for primary sources was less fruitful. No article that mentioned this statement referenced any source except for Eric Schmidt. I came to the conclusion that, assuming these numbers came from published sources, they must have come from separate ones. I set out to find them.
5 Exabytes Every 2 Days
The second half of Schmidt’s quote (which claims that 5 exabytes are created every 2 days) seemed like it would be the easier half to verify. I found an EMC-funded May 2010 report from IDC called “The Digital Universe Decade – Are You Ready?”
That report sizes up the “Digital Universe,” or “the amount of digital information created and replicated in a year.” For 2010, the report put that number at 1.2 zettabytes (which is 1,200 exabytes). Spread evenly over 365 days, that’s about 6.6 exabytes every 2 days, which syncs up pretty well with Schmidt’s claim. Since this report came out shortly before his speech, it’s quite likely this was his primary source for that figure.
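For anyone who wants to check that conversion, here is a minimal sketch of the arithmetic. It assumes decimal units (1 zettabyte = 1,000 exabytes), a 365-day year, and data created at an even rate through the year. The function name is my own and does not come from the IDC report.

```python
# Assumption: decimal (SI) units, so 1 zettabyte = 1,000 exabytes.
EXABYTES_PER_ZETTABYTE = 1_000


def exabytes_per_period(annual_zettabytes, days, days_per_year=365):
    """Convert an annual volume in zettabytes to exabytes per `days` days.

    Assumes the data is created at a constant rate throughout the year.
    """
    annual_exabytes = annual_zettabytes * EXABYTES_PER_ZETTABYTE
    return annual_exabytes / days_per_year * days


# IDC's 2010 "Digital Universe" figure: 1.2 zettabytes per year.
two_day_volume = exabytes_per_period(1.2, days=2)
print(f"{two_day_volume:.1f} exabytes every 2 days")
```

Under those assumptions the result lands in the same neighborhood as Schmidt's "5 exabytes every 2 days," which is why the IDC report looks like a plausible source.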
5 Exabytes Through 2003
Finding the primary source for this number would be a little trickier, but the use of the year 2003 was a big clue. I looked for studies related to data creation from around that time and came up with a slam dunk.
One of the most widely cited reports on data creation is a 2003 study called “How Much Information?” The study came out of UC Berkeley (where Schmidt earned his PhD) and was designed to “estimate the annual size of the stock of new information recorded in storage media, and heard or seen each year in information flows.”
In the first line of the study’s findings, we see a very familiar number: “Print, film, magnetic, and optical storage media produced about 5 exabytes of new information in 2002.”
Five exabytes? Sounds like a direct hit. Unfortunately, Schmidt’s quote implies something entirely different about what that number actually represents.
Comparing Apples and Oranges
No one doubts that the volume of recorded information has grown at a tremendous rate in the past decade. Ironically, Eric Schmidt seems to have misrepresented some information in order to deliver that point in a grandiose way.
Schmidt’s quote implies that 5 exabytes of information was created in total between the dawn of civilization and 2003. The actual study states that 5 exabytes was created in 2002 alone.
Furthermore, the “5 exabytes in 2002” number from the Berkeley study covers recorded information only, whereas the “5 exabytes every 2 days” number from the IDC study includes information that is both “created and replicated.” If we include replicated information (flowing through telephone, radio, TV, and the Internet), which the Berkeley study puts at almost 18 exabytes, the closest comparable number comes to 23 exabytes in 2002 alone. That’s a far cry from “5 exabytes since the dawn of civilization.”
Based on the primary sources I’ve been able to piece together, the more accurate (but far less sensational) quote would be:
“23 Exabytes of information was recorded and replicated in 2002. We now record and transfer that much information every 7 days.”
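The "every 7 days" figure can be checked the same way. This is a rough sketch, assuming IDC's 1,200 exabytes for 2010 is spread evenly over a 365-day year. The function name is my own.

```python
def days_to_produce(target_exabytes, annual_exabytes, days_per_year=365):
    """Days needed to produce `target_exabytes` at a constant annual rate."""
    daily_exabytes = annual_exabytes / days_per_year
    return target_exabytes / daily_exabytes


# Berkeley's 2002 total (stored plus flows) against IDC's 2010 annual rate.
days = days_to_produce(target_exabytes=23, annual_exabytes=1_200)
print(f"{days:.1f} days")
```

This compares like with like only loosely, since the two studies define and measure information differently, but it shows the order of magnitude.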
Responsibility in Statistics
It has been said that 78% of all statistics are made up. It looks like we can add Schmidt’s quote to the list.
I think very highly of Schmidt, so I was disappointed when I discovered this discrepancy. In fairness, however, I’m just as guilty as he is. Much like the media outlets I mentioned above, I repeated this incorrect information in my recent talk.
Schmidt’s quote still does a good job of illustrating a valid point. The numbers are just wrong. It’s up to journalists and statisticians to vet figures like these before repeating them at face value. When flawed information like this starts influencing decisions, the results can be disastrous.
At RJMetrics, we help online businesses use their own data to make smarter decisions every day. We make sure that the data we present is accurate, and if you’d like to see how your data could be used to better grow your business, you can try out our demo to learn more.