BIG Data: Crawling Large-Scale and Real-Time Tweets With MySQL Database 2013 Open Seminar Series 6 Open Geospatial Informatics Cheng-Ying Liu (Sean)


2 BIG Data & Twitter

3 WHAT IS BIG DATA ? In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools. (Source: Wikipedia, "Big data")

4 WHAT IS BIG DATA ? In 2001, Doug Laney used the 3V model to describe Big Data:
‒ Volume: amount of data
‒ Velocity: speed of data in and out
‒ Variety: range of data types and sources
A fourth V is often added to the original three:
‒ Veracity: truthfulness or accuracy of the data

5 WHAT IS BIG DATA ? In 2012, Gartner updated the definition:
– Still advocates the 3V model for describing data
– Requires new forms of processing, in order to enable:
  – Enhanced decision making
  – Insight discovery
  – Process optimization

6 HOW BIG IS BIG DATA ? Beyond the ability of commonly used tools: from a few dozen terabytes (1 TB = 10^12 bytes) to many petabytes (1 PB = 10^15 bytes)
− 2008: Google processes 20 PB a day
− 2009: Facebook has 2.5 PB of user data + 15 TB/day
− 2009: eBay has 6.5 PB of user data + 50 TB/day
− 2011: Yahoo! has 180-200 PB of data
− 2012: Facebook ingests 500 TB/day

7 NEW TECHNOLOGY FOR BIG DATA
Hadoop
– Developed by the Apache Software Foundation
– Derived from Google's MapReduce and Google File System
– Able to process petabyte-scale data
NoSQL (Not Only SQL)
– Relational databases are not suitable for all cases
– NoSQL is a newer choice for non-relational databases
– Adopted by Google, Facebook, Twitter, etc.

8 WHAT IS TWITTER?
‒ The fastest, simplest way to communicate
‒ More than 140M active users
‒ The majority of usage comes from mobile
‒ 60% of users are outside the U.S.
‒ More than 400M visitors
‒ More than 400M tweets/day (peak: 25K/sec)
‒ 1,000 employees (majority in San Francisco)
‒ 50% of employees are engineers
‒ Expected to reach nearly $1 billion in global ad revenue in 2014 (eMarketer)

9 TWITTER HISTORY Evan Williams on the genesis of Twitter, ICWSM, April 2007:
− A side project started from Jack Dorsey's idea, Oct 2006
− Wanted a ubiquitous status message
− A community of people answering the question "what are you doing?"
− Exploded at SXSW, SF earthquakes (2011)
− Good for collective "backchanneling"
− High "ambient intimacy"
− Huge API usage was unexpected, as was the rise of the @ sign for replies




13 TWITTER TOWN HALL July 6, 2011

14 TWITTER STATS Source: Mapping the global Twitter heartbeat: The geography of Twitter, May 2013


16 Source: Pew Research Center's Internet & American Life Project Winter 2012 Tracking Survey, January 20 - February 19, 2012. N=2,253 adults age 18 and older, including 901 cell phone interviews. Interviews conducted in English and Spanish. The margin of error is +/-2.7 percentage points for internet users. **Represents significant difference compared with all other rows in group.



19 Twitter Dev

20 TWITTER ACCOUNT Register a Twitter account (required)

21 REGISTER A TWITTER APPLICATION Twitter developer web site: Select “My applications”

22 REGISTER A TWITTER APPLICATION Click “Create a new application” Application List

23 REGISTER A TWITTER APPLICATION Fill in the required information

24 REGISTER A TWITTER APPLICATION Agree to the developer rules and complete the CAPTCHA

25 REGISTER A TWITTER APPLICATION Go back to application list and click your application Click “Settings”

26 REGISTER A TWITTER APPLICATION Select “Read, Write and Access direct messages” Click “Update this Twitter application’s settings”

27 REGISTER A TWITTER APPLICATION Click “Create my access token”


29 Twitter API Resource

30 REST API Source:


32 TWEET CRAWL API Source:
Resource | Description | Request Limit (Per User) | Request Limit (Via OAuth)
GET statuses/show/:id | Returns a single Tweet, specified by the id parameter. | 180 / 15 mins |
POST statuses/update | Updates the authenticating user's current status, also known as tweeting. | -- |
GET search/tweets | Returns a collection of relevant Tweets matching a specified query. | 180 / 15 mins | 450 / 15 mins
POST statuses/filter | Returns public statuses that match one or more filter predicates. | -- |
GET statuses/firehose | This endpoint requires special permission to access. Returns all public statuses. | -- |

33 tmhOAuth LIBRARY
Website:
$ git clone
Current Version: 0.8.2
Author: Matt Harris @themattharris
Goal:
‒ Support OAuth 1.0A
‒ Use authorization headers instead of query string or POST parameters
‒ Allow uploading of images
‒ Provide enough information to assist with debugging

34 CRAWLING WITH REST API
‒ Create an OAuth object that holds the authentication tokens
‒ Set the parameters for the API call
‒ Use the Twitter REST API to obtain tweets
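The "set parameters" step can be sketched in Python as follows. This is only an illustration, not the seminar's PHP code: the helper name `build_search_request` is hypothetical, the URL is assumed to be the v1.1 search endpoint of that period, and OAuth signing is left out.

```python
from urllib.parse import urlencode

# Assumption: the v1.1 search endpoint as it existed around 2013.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_request(query, count=100, result_type="recent", **extra):
    """Return the GET URL for a search/tweets call (no authentication added)."""
    # The slides state that count may not exceed 100 per page.
    params = {"q": query, "count": min(count, 100), "result_type": result_type}
    params.update(extra)  # e.g. lang="en", since_id=..., max_id=...
    return SEARCH_URL + "?" + urlencode(params)


print(build_search_request("YouTube", lang="en"))
```

The returned URL would still need an OAuth Authorization header before Twitter accepts it.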

35 CRAWLING WITH STREAMING API
‒ Create an OAuth object that holds the authentication tokens
‒ Set the parameters for the API call
‒ Open a connection to the Twitter server

36 WHAT IS OAuth ? OAuth = Open Authorization
What is OAuth:
‒ An open protocol to allow secure API authorization in a simple and standard method from desktop and web applications.
Endpoints used by OAuth:
‒ Request token URL
‒ Authorize URL
‒ Access token URL
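Requests in an OAuth 1.0a flow are signed. As a rough sketch of what a library such as tmhOAuth does internally, the HMAC-SHA1 signature can be computed as below. The function names are hypothetical. A real request would also carry the oauth_consumer_key, oauth_nonce, oauth_timestamp, oauth_token and related parameters in `params`.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value):
    """Percent-encode per RFC 3986: only letters, digits and -._~ stay literal."""
    return quote(str(value), safe="~")


def signature_base_string(method, url, params):
    """METHOD & encoded URL & encoded, sorted parameter string."""
    encoded = sorted((pct(k), pct(v)) for k, v in params.items())
    param_str = "&".join(f"{k}={v}" for k, v in encoded)
    return "&".join([method.upper(), pct(url), pct(param_str)])


def sign(method, url, params, consumer_secret, token_secret=""):
    """Return the base64-encoded HMAC-SHA1 signature of the base string."""
    key = f"{pct(consumer_secret)}&{pct(token_secret)}".encode()
    base = signature_base_string(method, url, params).encode()
    return base64.b64encode(hmac.new(key, base, hashlib.sha1).digest()).decode()


# Placeholder secrets, for illustration only.
print(sign("GET", "https://api.twitter.com/1.1/search/tweets.json",
           {"q": "happy hour", "count": 2}, "consumer-secret", "token-secret"))
```

Because the signature covers a timestamp in real requests, a badly synchronized clock can cause authentication to fail.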

37 NORMAL SEARCH OPERATORS
Operator | Finds tweets...
twitter search | containing both "twitter" and "search". This is the default operator.
"happy hour" | containing the exact phrase "happy hour".
love OR hate | containing either "love" or "hate" (or both).
beer -root | containing "beer" but not "root".
#haiku | containing the hashtag "haiku".
from:alexiskold | sent from person "alexiskold".
to:techcrunch | sent to person "techcrunch".
@mashable | referencing person "mashable".
"happy hour" near:"san francisco" | containing the exact phrase "happy hour" and sent near "san francisco".
near:NYC within:15mi | sent within 15 miles of "NYC".
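The operators above are plain text inside the query, so a query can be assembled with simple string handling. The helper below is a hypothetical sketch covering only some of the operators. How Twitter groups OR with other terms is not specified on the slide, so mixed queries should be checked by hand.

```python
def build_query(all_of=(), phrase=None, any_of=(), none_of=(),
                hashtag=None, from_user=None):
    """Combine search operators from the table into one query string."""
    parts = list(all_of)                         # twitter search
    if phrase:
        parts.append(f'"{phrase}"')              # "happy hour"
    if any_of:
        parts.append(" OR ".join(any_of))        # love OR hate
    parts.extend(f"-{word}" for word in none_of)  # beer -root
    if hashtag:
        parts.append(f"#{hashtag}")              # #haiku
    if from_user:
        parts.append(f"from:{from_user}")        # from:alexiskold
    return " ".join(parts)


print(build_query(all_of=["beer"], none_of=["root"]))
print(build_query(phrase="happy hour", from_user="alexiskold"))
```

The resulting string still has to be URL-encoded before it is sent as the q parameter.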

38 SEARCH PARAMETERS (REST) Source:
Parameter | Description
q | A UTF-8, URL-encoded search query of 1,000 characters maximum.
geocode | Returns tweets within a given radius of the given coordinates.
lang | Restricts tweets to the given language, given by an ISO 639-1 code.
locale | Specifies the language of the query you are sending (only ja).
result_type | Selects which results to return: mixed, recent, or popular.
count | The number of tweets to return per page (<=100).
until | Returns tweets generated before the given date.
since_id | Returns results with an ID greater than the specified ID.
max_id | Returns results with an ID less than or equal to the specified ID.
include_entities | The entities node is omitted when set to false.
callback | The response will use the JSONP format with a callback.
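Because max_id is inclusive ("less than or equal to"), paging backwards through results without duplicates means asking for the lowest ID seen so far minus one. The sketch below shows that loop against a fake in-memory fetch function. `crawl_pages`, `fake_fetch` and `FAKE_TWEETS` are hypothetical names, and a real fetch would call search/tweets.

```python
def crawl_pages(fetch_page, max_pages=10):
    """Collect tweets page by page, walking max_id downwards.

    fetch_page(max_id) must return a list of dicts with an integer "id",
    newest first, and an empty list when nothing is left.
    """
    collected, max_id = [], None
    for _ in range(max_pages):
        page = fetch_page(max_id)
        if not page:
            break
        collected.extend(page)
        # max_id is inclusive, so step just below the oldest tweet seen.
        max_id = min(tweet["id"] for tweet in page) - 1
    return collected


# Stand-in for the API: ten tweets with IDs 10..1, four per page.
FAKE_TWEETS = [{"id": i} for i in range(10, 0, -1)]


def fake_fetch(max_id, count=4):
    eligible = [t for t in FAKE_TWEETS if max_id is None or t["id"] <= max_id]
    return eligible[:count]


print([t["id"] for t in crawl_pages(fake_fetch)])
```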

39 SEARCH PARAMETERS (STREAMING) Source:
Parameter | Description
follow | Indicates the users to return statuses for in the stream.
track | Keywords to track.
locations | Specifies a set of bounding boxes to track.
delimited | Specifies whether messages should be length-delimited.
stall_warnings | Specifies whether stall warnings should be delivered.
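A minimal sketch of assembling these streaming parameters follows. It assumes, as in the v1.1 documentation of that period, that track and follow are comma-separated lists and that each bounding box is given as longitude,latitude pairs with the south-west corner first. The function name `streaming_params` is hypothetical.

```python
def streaming_params(track=(), follow=(), boxes=(), stall_warnings=True):
    """Build the POST parameters for a statuses/filter connection.

    boxes: iterable of (sw_lon, sw_lat, ne_lon, ne_lat) tuples.
    """
    params = {}
    if track:
        params["track"] = ",".join(track)
    if follow:
        params["follow"] = ",".join(str(uid) for uid in follow)
    if boxes:
        values = []
        for sw_lon, sw_lat, ne_lon, ne_lat in boxes:
            if sw_lon > ne_lon or sw_lat > ne_lat:
                raise ValueError("south-west corner must come first")
            values += [sw_lon, sw_lat, ne_lon, ne_lat]
        params["locations"] = ",".join(str(v) for v in values)
    if stall_warnings:
        params["stall_warnings"] = "true"
    return params


# Example: two keywords plus a box roughly covering San Francisco.
print(streaming_params(track=["YouTube", "News"],
                       boxes=[(-122.75, 36.8, -121.75, 37.8)]))
```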


41 CRAWLING EFFICIENCY
Keyword | Streaming API Total | Streaming API TPS | REST API Total | REST API TPS | Proportion (S/R)
YouTube | 143,869,821 | 30.28 | 6,306,355 | 1.33 | 22.81
News | 41,482,108 | 8.73 | 7,906,215 | 1.66 | 5.25
Google | 28,720,525 | 6.04 | 7,474,687 | 1.57 | 3.84
Obama | 8,503,834 | 1.79 | 5,271,187 | 1.11 | 1.61
TPS: Tweets Per Second. S/R: Streaming/REST.
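The Proportion (S/R) column is the Streaming total divided by the REST total. The short check below reproduces it from the totals in the table. The variable and function names are illustrative only.

```python
# (keyword, streaming_total, rest_total) copied from the table above.
CRAWL_TOTALS = [
    ("YouTube", 143_869_821, 6_306_355),
    ("News", 41_482_108, 7_906_215),
    ("Google", 28_720_525, 7_474_687),
    ("Obama", 8_503_834, 5_271_187),
]


def streaming_to_rest_ratio(streaming_total, rest_total):
    """How many times more tweets Streaming delivered than REST."""
    return round(streaming_total / rest_total, 2)


ratios = {k: streaming_to_rest_ratio(s, r) for k, s, r in CRAWL_TOTALS}
print(ratios)
```

The gap is widest for high-volume keywords such as YouTube, where the REST rate limit caps what can be collected.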

42 LARGE-SCALE CRAWLING
Track Word | Size | Duration | From | To | #Tweet | 1 Year
YouTube | 12.0 G | 21 days | 2013-07-07 15:12:25 | 2013-07-28 13:10:01 | 52,913,498 | 209 G
News | 5.7 G | 22 days | 2013-07-07 15:07:15 | 2013-07-28 13:10:00 | 21,894,823 | 95 G
Http | 15.0 G | 21 days | 2013-07-07 15:44:13 | 2013-07-28 13:10:00 | 62,976,451 | 261 G
Apple | 1.0 G | 22 days | 2013-07-07 15:07:20 | 2013-07-28 13:10:01 | 4,038,241 | 17 G
Android | 4.1 G | 20 days | 2013-07-07 15:20:43 | 2013-07-28 13:10:00 | 16,605,070 | 75 G
Obama | 682 M | 22 days | 2013-07-07 15:07:05 | 2013-07-28 13:10:01 | 2,768,149 | 11 G
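The "1 Year" column appears to be a linear extrapolation of the measured size over 365 days. The sketch below reproduces the figures under that assumption, using the rounded Duration column and treating 682 M as 0.682 G.

```python
# (track word, size in GB, duration in days) from the table above.
CRAWLS = [
    ("YouTube", 12.0, 21),
    ("News", 5.7, 22),
    ("Http", 15.0, 21),
    ("Apple", 1.0, 22),
    ("Android", 4.1, 20),
    ("Obama", 0.682, 22),  # assumption: 682 M treated as 0.682 G
]


def yearly_size_gb(size_gb, days):
    """Linear extrapolation of storage needs to one year."""
    return size_gb * 365 / days


estimates = {word: round(yearly_size_gb(size, days)) for word, size, days in CRAWLS}
print(estimates)
```

Summing the estimates gives a first guess at the disk space one year of crawling these six track words would need.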

43 Twitter + MySQL

44 SINGLE NODE CRAWLING TYPE Guideline for single node crawling:
− Each stream needs to authenticate itself
− Total data volume appears to be bounded (i.e., the number of Tweets delivered to a crawler is limited)
− Avoid connecting to the Twitter server too aggressively
− Crawling with different Twitter accounts is recommended
[Diagram: one Tweet Crawler receiving Tweet streams A, B, C, … from the Twitter Server]
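One common way to avoid connecting too aggressively is to wait longer after each failed connection attempt. The generator below sketches exponential backoff. The 5-second start and 320-second cap are illustrative assumptions, not values from the slides.

```python
from itertools import islice


def backoff_delays(initial=5.0, factor=2.0, cap=320.0):
    """Yield reconnect delays in seconds, growing until the cap is reached."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First eight waits a crawler would apply after consecutive failures.
print(list(islice(backoff_delays(), 8)))
```

A crawler would create a fresh generator after every successful connection so that the delay starts again from the initial value.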

45 MULTI-NODE CRAWLING TYPE Guideline for multi-node crawling:
− Automatically check the connection status
− Automatically update the databases' summary information
− Give the crawl program a good log-file reporting function
− Design a good database schema for distributed access
[Diagram: two Tweet Crawlers, each receiving a Tweet stream (A, B) from the Twitter Server]

46 DESIGN TWEET TABLE
Name | Type | Description | Index Type
Id | BIGINT UNSIGNED | Unique index ID in database | PRIMARY
tweet_id | BIGINT UNSIGNED | Official Tweet ID | UNIQUE
text | VARCHAR(150) | Tweet content | -
screen_name | VARCHAR(255) | User screen name | -
user_id | BIGINT UNSIGNED | User ID | -
followers_count | INT | Number of followers | -
friends_count | INT | Number of friends | -
created_at | DATETIME | Tweet creation time | -
language | VARCHAR(5) | Language of the Tweet | -
source | VARCHAR(150) | Device or browser used to Tweet | -
urls_count | INT | Number of URLs in the Tweet | -
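The UNIQUE index on tweet_id is what keeps the same Tweet from being stored twice when it arrives on more than one stream. The sketch below demonstrates the idea using Python's built-in sqlite3 as a stand-in for MySQL, which is an assumption made only so the example runs anywhere. In MySQL the equivalent statement would be INSERT IGNORE, with an AUTO_INCREMENT primary key.

```python
import sqlite3

COLUMNS = ("tweet_id", "text", "screen_name", "user_id", "followers_count",
           "friends_count", "created_at", "language", "source", "urls_count")

# SQLite stand-in for the MySQL table described above.
SCHEMA = """
CREATE TABLE tweets (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    tweet_id BIGINT UNSIGNED NOT NULL UNIQUE,
    text VARCHAR(150),
    screen_name VARCHAR(255),
    user_id BIGINT UNSIGNED,
    followers_count INT,
    friends_count INT,
    created_at DATETIME,
    language VARCHAR(5),
    source VARCHAR(150),
    urls_count INT
)
"""

INSERT_SQL = "INSERT OR IGNORE INTO tweets ({}) VALUES ({})".format(
    ", ".join(COLUMNS), ", ".join(":" + c for c in COLUMNS))


def save_tweet(conn, row):
    """Insert one row; return False when tweet_id was already stored."""
    return conn.execute(INSERT_SQL, row).rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
row = dict.fromkeys(COLUMNS)          # all columns start as NULL
row.update(tweet_id=353927462, text="hello", screen_name="example_user")
print(save_tweet(conn, row), save_tweet(conn, row))  # second call is a duplicate
```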

47 SETTING ENVIRONMENT Install packages
‒ # apt-get install php5 php5-curl
‒ # apt-get install mysql-client mysql-server
‒ # apt-get install phpmyadmin
‒ Select Apache2 as the web server when installing phpmyadmin

48 SETTING ENVIRONMENT Create the database and table for Tweet crawling
− Create a *.sql file that defines the database format
− Change directory to the location of that file
− # mysql -h {$HOST} -u {$USER} -p{$PASSWORD}
− mysql> \. {$SQL_FILE}

49 SETTING ENVIRONMENT Check the database with phpmyadmin
− Open a browser and go to http://localhost/phpmyadmin
− Select the database and check its structure

50 CRAWLING REAL-TIME TWEETS
‒ Connect to the database
‒ Save each Tweet into the database
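Before a Tweet can be saved, the JSON object delivered by the API has to be flattened into the columns of the tweet table. The sketch below assumes the v1.1 Tweet JSON layout (id, text, user, created_at, lang, source, entities.urls). The function name `tweet_to_row` and the sample values are hypothetical.

```python
import json
from datetime import datetime


def tweet_to_row(tweet):
    """Flatten one decoded Tweet object into the columns of the tweet table."""
    user = tweet.get("user", {})
    # v1.1 timestamps look like "Sun Jul 07 15:12:25 +0000 2013".
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "tweet_id": tweet["id"],
        "text": tweet["text"],
        "screen_name": user.get("screen_name"),
        "user_id": user.get("id"),
        "followers_count": user.get("followers_count"),
        "friends_count": user.get("friends_count"),
        "created_at": created.strftime("%Y-%m-%d %H:%M:%S"),  # DATETIME format
        "language": tweet.get("lang"),
        "source": tweet.get("source"),
        "urls_count": len(tweet.get("entities", {}).get("urls", [])),
    }


# Made-up sample message, shaped like one line read from the stream.
sample = json.loads("""{
  "id": 353927462, "text": "Watching http://t.co/abc", "lang": "en",
  "source": "web", "created_at": "Sun Jul 07 15:12:25 +0000 2013",
  "user": {"id": 42, "screen_name": "example_user",
           "followers_count": 10, "friends_count": 20},
  "entities": {"urls": [{"url": "http://t.co/abc"}]}
}""")
print(tweet_to_row(sample))
```

The resulting dictionary can be passed to a parameterized INSERT statement, which also avoids building SQL by string concatenation.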

51 CRAWLING REAL-TIME TWEETS
Copy all files in twitter_watch to /var/www/twitter_watch
‒ # cp twitter_watch/server.php /var/www/twitter_watch
‒ # cp twitter_watch/logic.hjs /var/www/twitter_watch
‒ # cp twitter_watch/index.html /var/www/twitter_watch
Start crawling tweets
‒ $ php5 watch.php

52 CRAWLING REAL-TIME TWEETS Click "Browse" to show the crawled Tweets in the database

53 CRAWLING REAL-TIME TWEETS Update Tweets in real time with jQuery
‒ Browse to http://localhost/twitter_watch/index.html

54 TROUBLESHOOTING
Error: Access denied for user 'root'@'localhost' (using password: NO)
# /etc/init.d/mysql stop
# mysqld_safe --skip-grant-tables &
# mysql -u root mysql
mysql> UPDATE user SET Password=PASSWORD('xxx') WHERE User='root';
mysql> FLUSH PRIVILEGES;
mysql> quit;
# /etc/init.d/mysql restart
Be aware of time synchronization
# apt-get install ntp
# ntpdate -s
# hwclock --systohc

55 URL @ Tweet SURLMINE Incremental Mining of Significant URLs in Real-Time and Large-Scale Social Streams PAKDD 2013

56 WHY URL?
‒ A high percentage of Tweets have URLs embedded in them
  − Because of the content length limitation and the need for complete information
‒ A URL is a universal language without linguistic differences
‒ A URL is able to connect different social media platforms
‒ Tweets with URLs have been verified to have a low probability of being spam
Social Media | Character Limit | Nature
Twitter | 140 characters | Short message
Plurk | 140 characters | Short message
LinkedIn | 200 ~ 689 characters | Job opportunities
Google+ | 100,000 characters | Mixed information
Facebook | 63,206 characters | Mixed information
YouTube | 1,000 characters | Video sharing

57 CHALLENGE
‒ URL shorteners make URLs hard to analyze
‒ Usage differs across the various URL shortening services
‒ Shortened URLs are time-limited and can expire at any time
‒ A general solution is needed to expand shortened URLs to the original URLs
‒ Some URLs link to phishing websites
Keyword | original | bit.ly | tinyurl | ow.ly | goo.gl | others | URL %
YouTube | 96.49% | 0.95% | 0.14% | 0.10% | 0.12% | 2.20% | 90.80%
News | 37.92% | 17.92% | 1.10% | 0.00% | 2.17% | 40.89% | 75.77%
Google | 54.49% | 16.30% | 0.98% | 2.28% | 4.12% | 21.83% | 60.67%
Obama | 30.20% | 23.33% | 2.27% | 2.62% | 2.87% | 38.71% | 54.22%

58 EXPAND URL SHORTENERS Recursively track web page redirections
− Beware of being identified as a DNS attack (use a cache table)
− The redirection link may change depending on the browser
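A minimal sketch of redirect tracking with a cache table follows. It uses a loop rather than literal recursion, and the network step is passed in as a `resolve` function so the example runs offline. A real resolver would issue an HTTP request and return the Location header. All names are hypothetical and this is not the SURLMINE implementation.

```python
def expand_url(url, resolve, cache, max_hops=10):
    """Follow redirects from url until a final URL is reached.

    resolve(url) returns the redirect target, or None when url is final.
    cache maps already-seen URLs to their final URL, so repeated short
    links cost no further lookups.
    """
    path, current = [], url
    for _ in range(max_hops):
        if current in cache:          # already resolved earlier
            current = cache[current]
            break
        if current in path:           # redirect loop, give up here
            break
        path.append(current)
        target = resolve(current)
        if target is None:            # no further redirect
            break
        current = target
    for seen in path:                 # remember every hop on the way
        cache[seen] = current
    return current


# Offline stand-in for the network: a fixed redirect table with made-up URLs.
REDIRECTS = {"http://short.example/a": "http://short.example/b",
             "http://short.example/b": "http://www.example.com/video"}
cache = {}
print(expand_url("http://short.example/a", REDIRECTS.get, cache))
```

With the cache filled, expanding the same short link again returns immediately without calling `resolve`, which keeps the number of lookups sent to any one service low.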

59 URL STATS @ TWEET
Track Word | #Tweet | #URL | URL % | URLs Per Second
YouTube | 52,982,166 | 49,975,035 | 94.32 % | 27.62
News | 21,948,837 | 15,572,228 | 70.95 % | 8.60
Http | 62,976,451 | 42,249,898 | 67.09 % | 23.41
Apple | 4,045,333 | 2,670,731 | 66.02 % | 1.48
Android | 16,605,070 | 15,242,497 | 91.79 % | 8.44
Obama | 2,771,791 | 950,780 | 34.30 % | 0.53

60 TRACK “TAIWAN” ON TWITTER We demand the truth and justice!

61 Thank You Q & A

