ChatterGrabber.py Methods and Development A System for High Throughput Social Media Data Collection By James Schlitt, in collaboration with Elizabeth Musser,


1 ChatterGrabber.py Methods and Development A System for High Throughput Social Media Data Collection By James Schlitt, in collaboration with Elizabeth Musser, Dr. Bryan Lewis, and Dr. Stephen Eubank

2 Social media surveillance is a valuable tool for epidemiological research: – Pros: Cheap, consistent, and easy to parse data source. – Cons: Volume & specificity vary, content cannot be easily verified. – Tweepy provides easy Twitter API access via python. Introduction

3 The Virginia Department of Health requested a tool, developed under MIDAS funding, to track norovirus and gastrointestinal (GI) illness outbreaks within Montgomery County, VA, with the following capabilities: – Automated surveillance of social media. – No special skills required to use. – Forward compatible for GIS applications. Twitter was well suited to GI outbreak surveillance due to the short duration of infection and Tweet-worthy symptoms. For example: @ATweeter20 My hubs did some vomiting w/ his flu. Had stuff messing w/ his tummy - high fever & snot was bad. Get better! The study was challenged by low population density and a high degree of linguistic confounders. The Twitter Norovirus Study

4 Tweepy.py: Python wrapper for Twitter RESTful APIs. Gnip: Twitter commercial partner. Methods Considered

5 Search: Up to 100% of all tweets matching query, may use multiple queries. Search by location and/or keywords. ~35 mile search radius limit. All tweets within the last week. Query rate limited per Twitter OAuth key to 180 searches every 15 minutes. Narrower geographic coverage, but very flexible. Streaming: Up to 1% of total stream volume by location or keywords from Twitter. About 10 keywords per query, 1 query per stream, and 1 stream per OAuth key. Tweets come in real time. Tweet pull rate limited by Twitter. Most commonly used, great for whole-country studies with simple queries. Twitter Method Comparison
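The search rate limit quoted above (180 searches per 15 minutes per OAuth key) works out to one search every 5 seconds per key. A minimal sketch of that arithmetic follows; the function name and the assumption that each query and location pair costs exactly one search per cycle are illustrative, not taken from ChatterGrabber itself.

```python
# Rate limit figures as quoted on the slide.
WINDOW_SECONDS = 15 * 60
SEARCHES_PER_WINDOW = 180


def min_cycle_seconds(n_queries, n_locations=1, n_keys=1):
    """Shortest time in which every (query, location) pair can be searched
    once without exceeding the per-key limit.

    Assumes one search call per pair (hypothetical simplification).
    """
    searches = n_queries * n_locations
    seconds_per_search = WINDOW_SECONDS / SEARCHES_PER_WINDOW  # 5 s per key
    return searches * seconds_per_search / n_keys


if __name__ == "__main__":
    # 4 queries over 10 search circles, shared between 2 keys.
    print(min_cycle_seconds(4, 10, 2))
```

Under these assumptions, adding keys shortens the cycle proportionally, which matches the deck's point that multiple logins increase search frequency.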

6 Official Twitter data partner. Historic or real-time, variety of services. Large volume, representative sample. Excellent choice when affordable! Prices not public, quoted in a 2010 interview as: – 5% stream for $60k/year. – 50% stream for $360k/year. Gnip Method Comparison

7 Given a partial data sample, how can we accurately track tweets in an area with low engagement? – 12 potential norovirus (NRV)/GI Tweets per day. – 4 suspected hits after human confirmation. – Long keyword list requires multiple queries. Twitter's 1% streaming is limited by query length and volume, and Gnip was not affordable, which leaves the search method. Challenges

8 ChatterGrabber: A search-method-based social media data miner developed in Python. – Google Docs interface (GDI) included for simplified partner access. – Specialized hunters pull from GDI spreadsheets to set run parameters. – Multiple logins may be used to increase search frequency during collaborative experiments. – No limits on query length. – Data sent nightly to subscribers as CSV. – Summary of history presented in dashboard (under development). ChatterGrabber Introduction

9 High redundancy & error tolerance for long-term experiments: – If multiple API keys are used, functional keys take up the work of failed keys until they can be reconnected. – Daemon automatically executes & resumes experiments on startup and after an interruption. – Any hunter may be resumed up to 1 week after termination without loss of incoming data. ChatterGrabber Reliability
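One way the key failover described above could work is to redistribute the query list across whichever keys are currently functional. The sketch below is an assumption about the general idea, not ChatterGrabber's actual code; `assign_queries` and the `key_status` mapping are hypothetical names.

```python
def assign_queries(queries, key_status):
    """Spread queries round-robin over the keys currently marked as working.

    key_status maps a key label to True (functional) or False (failed).
    Both the mapping and this function are hypothetical illustrations.
    """
    live = [key for key, ok in key_status.items() if ok]
    if not live:
        raise RuntimeError("no functional API keys available")
    assignment = {key: [] for key in live}
    for i, query in enumerate(queries):
        assignment[live[i % len(live)]].append(query)
    return assignment


if __name__ == "__main__":
    status = {"key1": True, "key2": False, "key3": True}
    # key2 has failed, so key1 and key3 share its work.
    print(assign_queries(["q_a", "q_b", "q_c"], status))
```

Re-running the assignment whenever a key's status changes lets a reconnected key pick up its share again.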

10 General Execution (flowchart)
1. Pull list of condition phrases & config from Google spreadsheet.
2. Partition conditions into {x} queries.
3. Search radius > 35 miles?
   Yes: Generate {y} coordinate sets via covering algorithm, then prepare search with |x|*|y| queries.
   No: Prepare search with |x| queries.
4. Run Twitter search, from last tweet ID recorded for location and query pair.
5. Filter results by phrases, classifiers, and location; sleep.
6. Has a new day begun?
   Yes: Store data, send subscribers CSV and config link.
   No: Continue searching.
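The "partition conditions into {x} queries" step can be sketched as a greedy packing of condition phrases into OR-joined search strings. This is an illustrative sketch only: the `max_chars` limit is an assumed parameter, and the function name is hypothetical rather than ChatterGrabber's own.

```python
def partition_conditions(conditions, max_chars=400):
    """Greedily pack condition phrases into OR-joined query strings.

    max_chars is an assumed per-query length limit chosen for illustration.
    Multi-word phrases are quoted so they match as exact phrases.
    """
    queries, current = [], []
    for phrase in conditions:
        term = '"%s"' % phrase if " " in phrase else phrase
        candidate = " OR ".join(current + [term])
        if current and len(candidate) > max_chars:
            # Current query is full: close it and start a new one.
            queries.append(" OR ".join(current))
            current = [term]
        else:
            current.append(term)
    if current:
        queries.append(" OR ".join(current))
    return queries


if __name__ == "__main__":
    conditions = ["vomiting", "stomach flu", "norovirus"]
    queries = partition_conditions(conditions, max_chars=25)
    print(queries)
    # With y search circles, the run would issue len(queries) * y searches.
```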

11 ChatterGrabber GDI Interface Example

12 Pure Query Based: Conditions, qualifiers, & exclusions. Searches by conditions, keeps a tweet if a qualifier and no exclusions are present. Simple and easy to set up, but vulnerable to complexities of wording. NLTK* Based: Take output from conditions search, manually classify. Train NLTK MaxEnt or Naïve Bayes classifier via content n-grams. Classifier discards tweets that don't fit desired categories. Powerful, but requires longer setup and a representative tweet sample. ChatterGrabber Search Methods *NLTK: Natural Language Tool Kit
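To illustrate what "content n-grams" might look like as classifier input, the sketch below turns a tweet into a dictionary of unigram and bigram features. The tokenizer and feature naming are assumptions for illustration; the deck does not show ChatterGrabber's actual feature extractor.

```python
import re


def ngram_features(text, n_max=2):
    """Return a feature dict marking every 1..n_max word n-gram in the text.

    The regex tokenizer and the "contains(...)" naming are illustrative choices.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    features = {}
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            features["contains(%s)" % " ".join(tokens[i:i + n])] = True
    return features


if __name__ == "__main__":
    print(ngram_features("Stomach flu again"))
    # A list of (feature_dict, label) pairs built from manually scored tweets
    # is the input format accepted by nltk.NaiveBayesClassifier.train, if
    # NLTK is installed.
```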

13 Tweet Linguistic Classification (flowchart)
1. Tweet passed for classification.
2. Using NLTK mode?
   Yes: Extract features from Tweet, classify Tweet by features, then check: Is Tweet classification sought?
   No: Does Tweet contain an exclusion? If not, does Tweet contain a qualifier?
3. If the Tweet is a hit: Store Tweet data and derived data.
4. Otherwise: Keeping non-hits?
   Yes: Store Tweet data and derived data.
   No: Discard Tweet.
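The non-NLTK branch of this flow (keep a tweet if it contains a qualifier and no exclusion) can be sketched as a simple substring check. The word lists below are hypothetical examples, and plain substring matching is an assumed simplification of whatever matching ChatterGrabber performs.

```python
def query_filter(text, qualifiers, exclusions):
    """Return True if the text has at least one qualifier and no exclusion.

    Matching is case-insensitive substring matching (an assumption).
    """
    lowered = text.lower()
    if any(word.lower() in lowered for word in exclusions):
        return False
    return any(word.lower() in lowered for word in qualifiers)


if __name__ == "__main__":
    qualifiers = ["vomiting", "tummy"]      # hypothetical list
    exclusions = ["hangover"]               # hypothetical list
    tweet = "My hubs did some vomiting w/ his flu."
    print(query_filter(tweet, qualifiers, exclusions))
```

Substring matching is what makes this mode "vulnerable to complexities of wording": it cannot tell a symptom report from a joke or a quoted phrase that uses the same words.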

14 NLTK Classifier Example

15 Large lat/lon boxes filled via covering algorithm. Fine and coarse geolocations obtained via GoogleMapsV3 API: – If the tweet includes coordinates, finds the street address. – If a common place name is present, finds its coordinates, then searches by those coordinates for the proper name or street address of the position. – If the location is outside of the lat/lon box, discards the tweet. All geo queries are cached locally, shared between experiments, and pulled on demand to reduce API utilization. ChatterGrabber Geographic Methods
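The deck does not specify which covering algorithm is used, so the following is only one plausible sketch: tile the lat/lon box with square cells small enough that a search circle centered on each cell reaches the cell's corners. The constants, the 5% safety margin, and the function names are assumptions, and the sketch ignores the poles and the antimeridian.

```python
import math

MILES_PER_DEG_LAT = 69.0   # rough average, assumed for this sketch
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def cover_box(south, west, north, east, radius_miles=35.0):
    """Return circle centers (lat, lon) whose circles cover the box.

    Each cell has side <= 0.95 * radius * sqrt(2), so its corners lie
    within the circle centered on the cell (flat-earth approximation).
    """
    side = radius_miles * math.sqrt(2) * 0.95
    n_rows = max(1, math.ceil((north - south) * MILES_PER_DEG_LAT / side))
    lat_step = (north - south) / n_rows
    centers = []
    for i in range(n_rows):
        lat_lo = south + i * lat_step
        lat_hi = lat_lo + lat_step
        # Use the latitude in this row where a degree of longitude is widest.
        widest = 0.0 if lat_lo <= 0.0 <= lat_hi else min(abs(lat_lo), abs(lat_hi))
        miles_per_deg_lon = MILES_PER_DEG_LAT * math.cos(math.radians(widest))
        n_cols = max(1, math.ceil((east - west) * miles_per_deg_lon / side))
        lon_step = (east - west) / n_cols
        for j in range(n_cols):
            centers.append((lat_lo + lat_step / 2, west + (j + 0.5) * lon_step))
    return centers


if __name__ == "__main__":
    # Approximate bounding box around Virginia (illustrative coordinates).
    centers = cover_box(36.5, -83.7, 39.5, -75.2)
    print(len(centers), "search circles")
```

A hexagonal packing would need fewer circles for the same area; the square grid is used here only because it is easier to verify.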

16 Basic Execution: 1. Create GDI sheet, run initial experiment. 2. Check first results for confounders, update keyword lists. 3. Rerun experiment with new keywords, monitor periodically for new keywords & memes. If Greater Specificity Desired: 1. Run whole country experiment with desired query list. 2. Score output manually & enable NLTK classification. 3. Expand area as desired. Work Flow

17 Results: Found and geolocated 4,000-8,000 suspected norovirus tweets per day across the US during peak norovirus season. Preliminary estimates of 70-80% accuracy with a 2,000-tweet training set.

18 Results Continued

19 Results exceed the geographic and temporal resolution of existing surveillance systems, complicating verification. No true denominator: ChatterGrabber only collects queried hits. Not all desired information is available in social media, and some may be incomplete or falsified. ChatterGrabber is just an information-gathering method; external analysis and review are needed for validity. Twitter users will differ from the population at large. Limitations

20 ● ChatterGrabber provides an easy-to-use social media surveillance tool. – Natural Language Processing speeds illness identification. – Geographic region directed searching allows complete coverage of a user-defined jurisdiction. ● ChatterGrabber can successfully identify GI illness-related tweets in a population. – 220 million USA Tweets per day. – 6,000 matches per day by nationwide NLTK search. – 353 matches per day by Virginia keyword search. – 136 matches per day by Virginia NLTK search. Conclusions

21 Streamlined web interface needed for NDSSL long term studies. Real-time bioterrorism surveillance methods under evaluation using gun violence as a proof of concept. Norovirus visualization & dashboard under development by Elizabeth Musser. Tick bite zoonosis and unlicensed tattoo hepatitis risk tracking underway by Pyrros Telionis. Vaccine sentiment tracking underway by Meredith Wilson. Next Steps

23 Firearm violence-related tweets by time of day. Next Steps

26 Design and execution of real-world use by state and local public health offices. Dashboard deployed and customized for users across Virginia. Evaluation of pre- and post-deployment practice. Assessment of utility and iterative refinement. If interested, contact: blewis@vbi.vt.edu Next Steps: Public Health Outreach

27 References
Python Resources
I. Roesslein, J. (2009). Tweepy (Version 1.8) [Computer program]. Available at https://github.com/tweepy/tweepy (Accessed 1 November 2013)
II. Bird, S., Loper, E., & Klein, E. (2009). Natural Language Processing with Python. O'Reilly Media Inc. (Accessed 14 January 2014)
III. Google Developers (2012). gdata-python-client (Version 3.0) [Computer program]. Available at http://code.google.com/p/gdata-python-client/ (Accessed 6 January 2014)
IV. McKinney, W. (2010). Data structures for statistical computing in Python. In Proc. 9th Python Sci. Conf (pp. 51-56)
V. Tigas, M. (2014). GeoPy (Version 0.99) [Computer program]. Available at https://github.com/geopy/geopy (Accessed 21 December 2013)
VI. KilleBrew, K. (2013). query_places.py [Computer program]. Available at https://gist.github.com/flibbertigibbet/7956133 (Accessed 27 January 2014)
VII. Coutinho, R. (2007, August 22). Sending emails via Gmail with Python [Web log post]. Retrieved January 5th from http://kutuma.blogspot.com/2007/08/sending-emails-via-gmail-with-python.html
Relevant Papers
I. Rivers, C. M., & Lewis, B. L. (2014). Ethical research standards in a world of big data. F1000Research, 3.
II. Young, S. D., Rivers, C., & Lewis, B. (2014). Methods of using real-time social media technologies for detection and remote monitoring of HIV outcomes. Preventive Medicine.
III. Chakraborty, P., Khadivi, P., Lewis, B., Mahendiran, A., Chen, J., Butler, P., ... & Ramakrishnan, N. Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions. SDM14

28 Questions?

