
Let me start out by saying that I’m a huge fan of Google’s CEO, Eric Schmidt. Recently, however, I was a little perplexed by a statement he made at Google’s 2010 Atmosphere convention. He said the following:

“There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.”

This is a remarkable statement, and I wanted to understand the details behind it. I decided to poke around and see if I could find out where this data was actually coming from.

The first thing I discovered was that the press loves this quote. (Understandably so: it illustrates a big idea in a catchy sound bite.) The quote has shown up in TechCrunch, ReadWriteWeb, Fox News, Inc, and The Huffington Post, just to name a few. I even used it in my recent talk about big data at TEDxPhilly.

My search for primary sources was less fruitful. No article that mentioned this statement referenced any source except for Eric Schmidt. I came to the conclusion that, assuming these numbers came from published sources, they must have come from separate ones. I set out to find them.

5 Exabytes Every 2 Days

The second half of Schmidt’s quote (which claims that 5 exabytes are created every 2 days) seemed like it would be the easier half to verify. I found an EMC-funded May 2010 report from IDC called “The Digital Universe Decade – Are You Ready?”

That report sizes up the “Digital Universe,” or “the amount of digital information created and replicated in a year.” For 2010, the report put that number at 1.2 Zettabytes (which is 1,200 exabytes). That’s about 6.6 exabytes every 2 days, which syncs up pretty well with Schmidt’s claim. Since this report came out shortly before his speech, it’s quite likely this was his primary source for that figure.
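
The conversion is easy to check. Here is a minimal sketch in Python, assuming a 365-day year and decimal units (1 zettabyte = 1,000 exabytes). The function and constant names are my own for illustration and do not come from the IDC report.

```python
EXABYTES_PER_ZETTABYTE = 1_000  # decimal (SI) units
DAYS_PER_YEAR = 365             # assumption: ignore leap years


def exabytes_per_period(zettabytes_per_year: float, days: float) -> float:
    """Average exabytes produced over `days`, given an annual total in zettabytes."""
    exabytes_per_year = zettabytes_per_year * EXABYTES_PER_ZETTABYTE
    return exabytes_per_year / DAYS_PER_YEAR * days


# IDC's 2010 estimate of 1.2 zettabytes, expressed as a 2-day average
two_day_rate = exabytes_per_period(1.2, days=2)
print(f"{two_day_rate:.1f} exabytes every 2 days")
```

This treats the annual total as if it were produced at a constant daily rate, which is a simplification: data creation was growing throughout the year.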

5 Exabytes Through 2003

Finding the primary source for this number would be a little trickier, but the use of the year 2003 was a big clue. I looked for studies related to data creation from around that time and came up with a slam-dunk.

One of the most widely cited reports on data creation is a 2003 study called “How Much Information?” The study came out of UC Berkeley (where Schmidt earned his PhD) and was designed to “estimate the annual size of the stock of new information recorded in storage media, and heard or seen each year in information flows.”

In the first line of the study’s findings, we see a very familiar number: “Print, film, magnetic, and optical storage media produced about 5 exabytes of new information in 2002.”

Five exabytes? Sounds like a direct hit. Unfortunately, Schmidt’s quote implies something entirely different about what that number actually represents.

Comparing Apples and Oranges

No one doubts that the volume of recorded information has grown at a tremendous rate in the past decade. Ironically, Eric Schmidt seems to have misrepresented some information in order to deliver that point in a grandiose way.

Schmidt’s quote implies that 5 exabytes of information was created in total between the dawn of time and 2003. The actual study states that 5 exabytes was created in 2002 alone.

Furthermore, the “5 exabytes in 2002” number from the Berkeley study is recorded information only, whereas the “5 exabytes every 2 days” number from the IDC study includes both “created and replicated” information. If we include replicated information (flowing through telephone, radio, TV, and the Internet), the closest comparable number from the Berkeley study comes to 23 exabytes in 2002 alone. That’s a far cry from “5 exabytes since the dawn of civilization.”

Based on the primary sources I’ve been able to piece together, the more accurate (but far less sensational) quote would be:

“23 Exabytes of information was recorded and replicated in 2002. We now record and transfer that much information every 7 days.”
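
The 7-day figure can be reproduced from the two source numbers. This is a rough sketch, assuming the 2010 total was produced at a constant daily rate over a 365-day year; the variable names are mine.

```python
IDC_2010_EXABYTES = 1_200    # IDC: 1.2 zettabytes created and replicated in 2010
BERKELEY_2002_EXABYTES = 23  # Berkeley 2002 figure including information flows, as cited above
DAYS_PER_YEAR = 365          # assumption: ignore leap years

# Average exabytes per day in 2010
daily_rate_2010 = IDC_2010_EXABYTES / DAYS_PER_YEAR

# Days needed at the 2010 rate to match all of 2002
days_to_match_2002 = BERKELEY_2002_EXABYTES / daily_rate_2010
print(f"{days_to_match_2002:.1f} days")
```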

Responsibility In Statistics

It has been said that 78% of all statistics are made up. It looks like we can add Schmidt’s quote to the list.

I think really highly of Schmidt, so I was disappointed when I discovered this discrepancy. In fairness, however, I’m just as guilty as he is. Much like the media outlets I mentioned above, I perpetuated this wrong information in my recent talk.

Schmidt’s quote still does a good job of illustrating a valid point. The numbers are just wrong. It’s up to journalists and statisticians to vet this kind of data manipulation before perpetuating it at face value. When flawed information like this starts influencing decisions, the results can be disastrous.

At RJMetrics, we help online businesses use their own data to make smarter decisions every day. We make sure that the data we present is accurate, and if you’d like to see how your data could be used to better grow your business, you can try out our demo to learn more.

  • Duane Johnson

    I appreciate the fact-checking that you did here; it’s important not to replicate sound bites that are incorrect. However, I think “load of crap” is a bit of an exaggeration of Schmidt’s error. In the same way that you could argue “the earth is not spherical, it’s an oblate spheroid” you could argue that there were not 5 exabytes created or replicated in 2002, there were 23. (See Isaac Asimov’s great essay on this topic, The Relativity of Wrong).
    If the increase in information generation and transfer is exponential since the dawn of the written human record–and we have every reason to believe that this has been the case–then ALL previously recorded information (y) is less than any stepwise increase in time (x). For example, in a 2^n exponential growth situation, the third step (8) is greater than the sum of all previous steps (1+2+4=7). I think this is what Schmidt is getting at.

  • Robert J. Moore

    I definitely confess to a sensationalist title.
    However, I think it’s important to note that it’s not a matter of “5 vs. 23”; it’s actually “5 throughout all of recorded history” vs. “23 in one year.” The discrepancy is several orders of magnitude.

  • Al Brown

    How can they say it’s a load of crap without precisely defining how much crap is in the average load? And is crap even loaded any more? We got a pipe for that at my house.

  • Daniel

    Thx for the great post

  • Duane Johnson

    I’m not sure if you understood my point about the nature of exponentially increasing functions. If we are recording history in an exponentially increasing manner, on the order of twice as much information every year (for example), then “all of recorded history” prior to 2003 is going to be equal to just under one full year of data in 2003. The unintuitive mathematics of it is as valid in 2003 as it is this year–if the trend has continued, then in 2011 we will see more data than all prior years combined.

  • Robert J. Moore

    You make a totally valid point. The only thing I don’t totally agree with is that the increase in information generation and transfer is exponential since the dawn of the written human record.
    You said we have “every reason to believe that has been the case,” but the 2003 study estimated 30% annual growth in the number between 1999 and 2002. That puts those four years alone around 3x of the 2002 number he was using. I understand there is steep decay as you go back in time, but I don’t think it’s unrealistic to think that all recorded history might represent 5x 2002’s number or more, putting Schmidt’s statement 20x off the mark.

  • Al

    Duane – this is under the assumption that the equation is exponential. I’m no mathematician, but it seems your point hinges on a broad assumption. Regardless, I did learn something from your post that I hadn’t noticed before regarding exponential functions.

  • Oriste

    Good to see that “information” is still freely used when “data” is meant.

  • Adam

    This all depends where he actually got his sources. Just because you happened to find a study from 2003 that involved 5 exabytes doesn’t mean that he used it as his source.
    Don’t forget, he’s the CEO of Google, for fuck’s sake. THE information company. Whatever metrics he used, they’re subject to some wiggle room.

  • Roshan

    not a lot of difference there with the data! so not sure of the point that you are trying to raise. then again the amount of data depends upon the sources.
