Google Flu Trends’ Failure Shows Good Data > Big Data

by Kaiser Fung | 8:00 AM March 25, 2014

In their best-selling 2013 book Big Data: A Revolution That Will Transform How We Live, Work and Think, authors Viktor Mayer-Schönberger and Kenneth Cukier selected Google Flu Trends (GFT) as the lede of chapter one. They explained how Google’s algorithm mined five years of web logs, containing hundreds of billions of searches, and created a predictive model utilizing 45 search terms that “proved to be a more useful and timely indicator [of flu] than government statistics with their natural reporting lags.”

Unfortunately, no. The first sign of trouble emerged in 2009, shortly after GFT launched, when it completely missed the swine flu pandemic. Last year, Nature reported that Flu Trends overestimated the peak of the 2012 Christmas-season flu by 50%. Last week came the most damning evaluation yet. In Science, a team of Harvard-affiliated researchers published their findings that GFT has overestimated the prevalence of flu in 100 of the last 108 weeks; it’s been wrong since August 2011. The Science article further points out that a simplistic forecasting model—a model as basic as one that predicts the temperature by looking at recent-past temperatures—would have forecasted flu better than GFT.

In short, you wouldn’t have needed big data at all to do better than Google Flu Trends. Ouch.
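To see how low that bar is, here is a minimal sketch in Python of the kind of lagged baseline the Science authors describe. The numbers are invented for illustration; this is not their code or data, just the idea of predicting the current week from the most recently observed weeks.

```python
# A lagged baseline: predict this week's flu prevalence from the mean of
# the most recently observed CDC weeks, then compare its error against a
# hypothetical, over-shooting "GFT-like" series. All numbers are invented.
import numpy as np

def lagged_baseline(observed, lags=2):
    """Predict the next value as the mean of the last `lags` observations."""
    return np.mean(observed[-lags:])

cdc_weekly = [2.1, 2.4, 2.9, 3.5, 4.2, 4.8]  # "true" flu prevalence (%), weeks 1-6
gft_weekly = [2.0, 2.6, 3.8, 5.1, 6.4, 7.3]  # inflated GFT-style estimates

baseline_errors, gft_errors = [], []
for week in range(2, len(cdc_weekly)):
    prediction = lagged_baseline(cdc_weekly[:week])
    baseline_errors.append(abs(prediction - cdc_weekly[week]))
    gft_errors.append(abs(gft_weekly[week] - cdc_weekly[week]))

print("mean absolute error, lagged baseline:", round(np.mean(baseline_errors), 2))
print("mean absolute error, GFT-like series:", round(np.mean(gft_errors), 2))
```

In this contrived example the two-week lag model tracks the CDC series far more closely than the inflated series does; the Science authors report the same pattern using the real CDC and GFT data.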

In fact, GFT’s poor track record is hardly a secret to big data and GFT followers like me, and it points to a little bit of a big problem in the big data business that many of us have been discussing: Data validity is being consistently overstated. As the Harvard researchers warn: “The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.”

The amount of data still tends to dominate discussions of big data’s value. But more data does not in itself lead to better analysis, as Flu Trends amply demonstrates. Large datasets don’t guarantee valid datasets; assuming otherwise is a bad assumption, but one that’s used all the time to justify the use of, and the results from, big data projects. I constantly hear variations on the “N=All therefore it’s good data” argument from real data analysts: “Since Google has 80% of the search market, we can ignore the other search engines. They don’t matter.” Or, “Since Facebook has a billion accounts, it has substantively everyone.”
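A quick simulation makes the problem with the “N=All” argument concrete. Everything below is hypothetical, the population size, the flu rate, and the degree of over-representation are all assumptions, but it shows how an enormous convenience sample can be far less accurate than a small random one.

```python
# Illustrative simulation (all parameters are assumptions, not real data):
# a huge but non-random sample can badly misestimate flu prevalence, while
# a far smaller random sample lands close to the truth.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 1,000,000 people; 10% actually have the flu.
has_flu = rng.random(1_000_000) < 0.10

# "Convenience" sample of 800,000: sick people search about symptoms more,
# so assume they are 5x as likely to show up in the search logs.
weights = np.where(has_flu, 5.0, 1.0)
weights /= weights.sum()
convenience_sample = rng.choice(has_flu, size=800_000, p=weights)

# Small but properly randomized sample of 1,000 people.
random_sample = rng.choice(has_flu, size=1_000)

print("true prevalence:              ", round(has_flu.mean(), 3))
print("800,000-person biased sample: ", round(convenience_sample.mean(), 3))
print("1,000-person random sample:   ", round(random_sample.mean(), 3))
```

The biased sample covers roughly 80% of the population yet lands far from the true rate, while the 1,000-person random sample does not. Coverage is not the same thing as validity.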

Poor assumptions are neither new nor unpredictable. Consider how mainstream economists collectively failed to predict the housing bubble: their neoclassical models are built on several assumptions, including the Efficient Markets Hypothesis, which holds that market prices incorporate all available information and, as Paul Krugman says, leads to the “general belief that bubbles just don’t happen.”

In the wake of epic fails like these, the natural place to look for answers is in how things are being defined in the first place. In the business community, big data’s definition is often some variation on McKinsey’s widely-circulated big data report (PDF), which defines big data as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”

Can we do better? I started asking myself and other data analysts what the key differences are between the datasets that underlie today’s GFT-like projects and the datasets we were using five to 10 years ago. This has led to what I call the OCCAM framework, a more honest assessment of the current state of big data and the assumptions lurking in it.

Big data is:

Observational: much of the new data come from sensors or tracking devices that monitor continuously and indiscriminately without design, as opposed to questionnaires, interviews, or experiments with purposeful design

Lacking Controls: controls are typically unavailable, making valid comparisons and analysis more difficult

Seemingly Complete: the availability of data for most measurable units and the sheer volume of data generated is unprecedented, but more data creates more false leads and blind alleys, complicating the search for meaningful, predictable structure

Adapted: third parties collect the data, often for purposes unrelated to the data scientists’, presenting challenges of interpretation

Merged: different datasets are combined, exacerbating the problems relating to lack of definition and misaligned objectives

This is a far less optimistic definition, but a far more honest appraisal of where big data stands today.

The worst outcome from the Science article and the OCCAM framework, though, would be to use them as evidence that big data’s “not worth it.” Honest appraisals are meant to create honest progress, to advance the discipline rather than fuel the fad.

Progress will come when the companies involved in generating and crunching OCCAM datasets restrain themselves from overstating their capabilities without properly measuring their results. The authors of the Science article should be applauded for their bravery in raising this thorny issue. They did a further service to the science community by detailing the difficulty in assessing and replicating the algorithm developed by Google Flu Trends researchers. They discovered that the published information about the algorithm is both incomplete and inaccurate. Using the reserved language of academics, the authors noted: “Oddly, the few search terms offered in the papers [by Google researchers explaining their algorithm] do not seem to be strongly related with either GFT or the CDC data—we surmise that the authors felt an unarticulated need to cloak the actual search terms identified.” [emphasis added]

In other words, Google owes us an explanation as to whether it published doctored data without disclosure, or if its highly-touted predictive model is so inaccurate that the search terms found to be the most predictive a few years ago are no longer predictive. If companies want to participate in science, they need to behave like scientists.

Like the Harvard researchers, I am excited by the promises of data analytics. But I’d like to see our industry practice what we preach, conducting honest assessment of our own successes and failures. In the meantime, outsiders should be attentive to the challenges of big data analysis, as summarized in the OCCAM framework, and apply considerable caution in interpreting such analyses.
