Looked up something on Wikipedia lately? That may be enough for a computer to guess whether you’re sick. According to a new study in PLOS Computational Biology, the most accurate, timely data about how the flu is spreading in the US may come from Wikipedia. Lookups of pages like Influenza, Flu Season, and Tamiflu matched the CDC’s official estimates of influenza-like illness better than the alternately lauded and derided Google Flu Trends.
While CDC data is considered a gold standard for flu tracking, it relies on clinic reports and lab testing, and so is always a week or two behind what’s happening in the real world. Timely flu tracking helps hospitals, health departments, and the like to plan ahead: Do we need a new shipment of flu vaccine? How many doctors and nurses should be working the emergency room this weekend?
The biggest name in digital epidemiology, for the moment, is Google Flu Trends, which launched in 2008 and has performed well for the most part, with notable blips in 2009 and 2012 that led to harsh criticism (and, of course, updates to the algorithm). But Google doesn’t make its data or methods public, so David McIver and John Brownstein from Boston Children’s Hospital ran their analysis on Wikipedia lookups, which are publicly available. “Making everything more open makes it more collaborative,” says McIver, who hopes that others will build on his project to improve the algorithm or expand its use to other diseases–maybe heart disease or diabetes, maybe STIs. “Maybe we’ll get a different answer [about a disease’s prevalence] than what traditional sources have given in the past.”
But is it a problem that disease data is so publicly available? I asked McIver if there was a data set he wishes he could get his hands on, and while he said there was no “holy grail,” he mentioned that some of his colleagues are trying to tease trends from electronic medical records. Hospitals keep those records accurate and up-to-date, but researchers can’t play with them willy-nilly: nobody wants their medical records made public.
So, for now, digital epidemiology works from public data and voluntary surveys. It’s hard to argue with disease tracking projects that can provide early detection of outbreaks, as with cholera in Haiti, or pinpoint restaurants that could give you food poisoning. Google tracks not just flu but also dengue. One flu-tracking project even gives participants nasal swab kits to verify their disease status.
If digital epidemiology does reach out to other diseases–actually, not if, but when–will we wish our seemingly innocuous online traces were less public? What if you were a data point in, say, a gonorrhea study? An early Facebook analysis found that individuals were indeed identifiable from “anonymized” data. “Just because [data is] accessible doesn’t make it ethical,” wrote a pair of analysts in 2011. Our digital footprints, like our DNA, are turning out to contain more information about us than we thought.