By David Steinberg.
A New York Times article of November 11, 2008 trumpeted the success of Google Flu Trends (GFT) in tracking the progress of annual influenza outbreaks in the United States. GFT used data on search terms to estimate the number of individuals affected by influenza-like illness (ILI). Compared with the official tracking results of the US Centers for Disease Control and Prevention (CDC), GFT appeared to be just as accurate and much more rapidly available: search term counts can be collated online by Google, whereas the CDC estimates required more laborious data collection from sentinel clinics throughout the US.
However, subsequent years began to show serious discrepancies between GFT and the official data, leading to skepticism about the value of such immediate, but indirect, data for real-time tracking tasks. Researchers took a more careful look at the data and the methodology and identified a number of statistical flaws in the GFT approach. In turn, their work stimulated a number of major improvements.
This article by Yang, Santillana and Kou presents a statistical model that very effectively uses Google search data to track influenza outbreaks. The model is named ARGO (AutoRegression with Google search data). It does not rely on the search counts alone; it is much more sophisticated, while still keeping to some simple principles. Its main features are that it incorporates the CDC data as they become available (at a time lag behind the search counts), it reflects seasonality as estimated from past years, and it adapts to changes in users’ search behavior over time. Failure to adjust for such changes was one of the reasons for GFT's substantial overestimate of the 2012 influenza outbreak. This is a fascinating example showing how clear statistical thinking and understanding can be an essential component in exploiting “big data” opportunities.
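To make the three ingredients concrete, here is a minimal sketch of an autoregression that also uses a same-week search signal. It is an illustration only, not the estimator from the paper: the data are synthetic, the function names (`nowcast`, `fit_ridge`, `features`) are hypothetical, only two lags and a single search series are used, and a lightly ridge-penalised least squares fit stands in for whatever regularisation and transformations the authors apply. Lagged official ILI values enter as predictors (assumed here to arrive one week late), the lags carry the seasonal pattern, and refitting on a rolling window lets the coefficient on the search signal follow a slow drift in search behavior.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x

def fit_ridge(rows, targets, penalty=1e-3):
    """Least squares with a small ridge penalty on all but the intercept.

    Assumption: ridge is used here for numerical stability only; it is a
    stand-in, not the estimator used in the ARGO paper.
    """
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows)
            + (penalty if i == j and i > 0 else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def features(ili, search, week, n_lags):
    """Intercept, the n_lags most recent official ILI values, and the
    search volume for the week being estimated."""
    lags = [ili[week - k] for k in range(1, n_lags + 1)]
    return [1.0] + lags + [search[week]]

def nowcast(ili, search, week, n_lags=2, window=60):
    """Estimate ILI for `week` from search[week] and ILI reports of
    earlier weeks only, refitting on the most recent `window` weeks."""
    train_weeks = range(max(n_lags, week - window), week)
    rows = [features(ili, search, w, n_lags) for w in train_weeks]
    targets = [ili[w] for w in train_weeks]
    coef = fit_ridge(rows, targets)
    x_now = features(ili, search, week, n_lags)
    return sum(c * f for c, f in zip(coef, x_now))

# Synthetic weekly data (illustrative only): a yearly cycle plus a small
# deterministic wobble, and a search signal whose scale drifts over time.
weeks = 200
ili = [2.0 + 1.5 * math.sin(2 * math.pi * t / 52) + 0.05 * math.sin(1.7 * t)
       for t in range(weeks)]
search = [(1.0 + 0.002 * t) * ili[t] + 0.05 * math.cos(2.3 * t)
          for t in range(weeks)]

# One-step-ahead estimates for the last 20 weeks, compared with the truth.
errors = [abs(nowcast(ili, search, w) - ili[w]) for w in range(180, 200)]
print(f"mean absolute error over 20 weeks: {sum(errors) / len(errors):.3f}")
```

The rolling refit is the part that corresponds to adapting to changing search behavior: a model fitted once and then frozen would keep applying an outdated scaling between search volume and ILI as the drift accumulates.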
Read the paper:
Accurate estimation of influenza epidemics using Google search data via ARGO. Shihao Yang, Mauricio Santillana, and S. C. Kou. 2015, Proceedings of the National Academy of Sciences, 112(47), 14473-14478.