Saturday, June 22, 2013

There’s a Fly in My Tweets

Many important public health questions are difficult and costly to answer. What kind of risks do highly localized sources of pollution, like dry cleaners that use volatile chemicals, pose to the health of nearby residents? Are people with many friends healthier, or do those friendships increase the likelihood of infectious disease? Do frequent visits to public spaces like bars, gyms and restaurants affect a person’s health?

Researchers have been striving for generations to answer such questions, using health surveys of samples of individuals and computational studies of simulated populations. Now, however, the rise of social media and the burgeoning field of data science provide powerful tools to find high-precision, real-world answers with little cost or effort.

The millions of people posting to sites like Twitter and Facebook can be viewed as a vast organic sensor network, providing a real-time stream of data about the social, biological and physical worlds. While people use social media to build and maintain their social ties, the “data exhaust” of their postings can be analyzed to provide an enormous range of information at a population scale.

For example, my research group at the University of Rochester has analyzed Twitter postings from millions of cellphone users in New York City to develop a system to monitor food-poisoning outbreaks at restaurants.

We began by creating algorithms that can identify tweets about a given topic with near-perfect precision, even if the words and phrases used vary widely. The GPS information embedded in tweets sent from cellphones lets us integrate them with a variety of geographic databases.

We then feed the information into what we call the nEmesis system, whose development was led by our graduate student Adam Sadilek, now a researcher at Google. It begins by finding tweets that are sent from restaurants, which we can locate on Google Maps with 97 percent accuracy, thanks to GPS coordinates.

When a user is identified as having been at a restaurant, all of his or her tweets, from anywhere, are collected for the next 72 hours and analyzed to discover if any appear to report food poisoning symptoms, like vomiting, diarrhea, abdominal pain, fever or chills.

Such reports are rare but significant. Over a four-month period, our system collected 3.8 million tweets, from which we traced 23,000 restaurant visitors and found 480 reports of likely food poisoning. Restaurants were then scored by the number of food poisoning reports from their patrons.
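
As a rough illustration of the steps just described (a restaurant visit, a 72-hour follow-up window, and a per-restaurant count of reports), here is a minimal Python sketch. It is not the nEmesis code. The `Tweet` record, the `SYMPTOM_TERMS` keyword list and the `score_restaurants` function are hypothetical names invented for this example, and the plain keyword match is only a stand-in for the trained tweet-identification algorithms mentioned above. The sketch also assumes each tweet has already been matched to a restaurant, or to none, from its GPS coordinates.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical keyword list: a crude stand-in for a trained classifier.
SYMPTOM_TERMS = ("vomiting", "diarrhea", "abdominal pain", "fever", "chills")
FOLLOW_UP_WINDOW = timedelta(hours=72)  # follow-up period given in the article


@dataclass
class Tweet:
    user: str
    time: datetime
    text: str
    restaurant: Optional[str] = None  # set only if the GPS fix matched a restaurant


def looks_like_symptom_report(text: str) -> bool:
    """Return True if the text mentions any of the symptom keywords."""
    lowered = text.lower()
    return any(term in lowered for term in SYMPTOM_TERMS)


def score_restaurants(tweets: list[Tweet]) -> Counter:
    """Count, per restaurant, the patrons who reported symptoms within 72 hours."""
    by_user: dict[str, list[Tweet]] = {}
    for tweet in tweets:
        by_user.setdefault(tweet.user, []).append(tweet)

    flagged = set()  # (user, restaurant) pairs, so each patron counts once
    for user, user_tweets in by_user.items():
        for visit in user_tweets:
            if visit.restaurant is None:
                continue
            window_end = visit.time + FOLLOW_UP_WINDOW
            if any(
                visit.time < later.time <= window_end
                and looks_like_symptom_report(later.text)
                for later in user_tweets
            ):
                flagged.add((user, visit.restaurant))

    return Counter(restaurant for _, restaurant in flagged)


# Made-up sample data, for illustration only.
start = datetime(2013, 1, 1, 19, 0)
sample = [
    Tweet("ann", start, "Dinner time!", restaurant="Cafe A"),
    Tweet("ann", start + timedelta(hours=20), "Up all night vomiting, ugh"),
    Tweet("bob", start, "Great pasta", restaurant="Cafe A"),
    Tweet("bob", start + timedelta(hours=100), "Fever and chills today"),
    Tweet("cat", start, "Lunch", restaurant="Cafe B"),
]
scores = score_restaurants(sample)
```

In this toy data only one patron's report falls inside the 72-hour window, so "Cafe A" ends up with a score of 1 and "Cafe B" with 0. For a sense of scale in the real figures, 480 reports among 23,000 traced visitors is about 2 reports per 100 visitors.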

The Twitter reports are not an exact indicator; any individual case could well be caused by factors unrelated to the restaurant meal. But in aggregate the numbers are revealing. Working with Vincent Silenzio, who teaches in the department of community and preventive medicine at our medical school, we compared the results with the current database of restaurant inspections conducted by New York City’s Department of Health and Mental Hygiene. We found a significant correlation between restaurants’ violation scores and the Twitter-based scores.
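
The article does not say which statistic was used for this comparison. Purely as an illustration, the sketch below computes a Pearson correlation coefficient between two lists of per-restaurant scores. The `pearson` function and the numbers are assumptions made up for the example; they are not data or results from the study.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up toy scores for four hypothetical restaurants (not study data).
violation_scores = [5, 12, 20, 31]  # inspection violation scores
twitter_scores = [0, 1, 1, 3]       # Twitter-based report counts
r = pearson(violation_scores, twitter_scores)
```

A value of `r` near 1 would mean the two scores rise together, and a value near 0 would mean no linear relationship. Judging whether a correlation is statistically significant takes a further test that this sketch leaves out.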

Our project isn’t alone. While an army of corporations is busy data mining social media for marketing, a small but growing number of research groups have initiated similar efforts to leverage the torrent of online information for social good.

by Henry Kautz, NY Times
Image: Olimpia Zagnoli