Twitter, Big Data, and Other Ramblings Robert Dittmer.

2 Perspective on those V words  Volume- 1% of the Twitter stream for roughly one month was about 68 million Tweets. Now multiply that by 100. Facebook has the same problem.  Velocity- How do you analyze thousands of points of data in real-time? SQL Server sure isn’t going to do that.  Variety- Social Media, Manufacturing, Sales, Financial, CRM, Web Traffic, External  Think about what goes into Amazon recommending you a book or movie  Veracity- It all means nothing if it’s not at least somewhat clean

3 What do you do with a Tweet?  Sentiment Analysis is assigning a numerical value to a word  Positive, Negative, Neutral connotation  Methods for performing Sentiment Analysis  “Dumb” Method- Break down text into individual words and compare with a sentiment dictionary. AKA “Bag of Words”  “Smart” Method- Use a natural language processing tool to analyze parts of speech and calculate sentiment based on context  Example Tweet  “The Apple iPad sucks. The new Google Nexus 7 is awesome!”
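
A minimal sketch of the "Dumb" (Bag of Words) method, applied to the example Tweet above. The two-word dictionary and the `score` helper are assumptions made up for illustration; a real sentiment dictionary would hold thousands of scored words.

```python
import re

# Hypothetical mini sentiment dictionary (word -> numerical value).
SENTIMENT = {"sucks": -1, "awesome": 1}

def score(text):
    """Sum the dictionary value of every word; unknown words count as 0 (neutral)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return sum(SENTIMENT.get(word, 0) for word in words)

tweet = "The Apple iPad sucks. The new Google Nexus 7 is awesome!"

total = score(tweet)                                   # whole Tweet as one bag
per_sentence = [score(s) for s in tweet.split(". ")]   # one bag per sentence

print(total, per_sentence)
```

The whole Tweet nets out to 0 even though it is strongly negative about one product and strongly positive about another. That loss of context is what the "Smart" method tries to recover.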

4 Collecting Tweets  Twitter uses a RESTful service to stream Tweets  Steps to start streaming your own Tweets  Go to and create an application  Generate your OAuth credentials  Find an open-source Twitter library  Tweepy (Python)  Tweetinvi (C#)  Plug your credentials in and modify the example

5 The Tweet, the Whole Tweet, and Nothing but the Tweet  JSON Format (Key-Value Pair)  Notable Fields  ID  CreatedAt  Text  Entities  Hashtags  URLs  Latitude, Longitude

6 What does a Tweet look like?  {"filter_level":"medium","contributors":null,"text":"Iron man 3 was awesome =)","geo":{"type":"Point","coordinates":[50.73529254,-4.00720746]},"retweeted":false,"in_reply_to_screen_name":null,"truncated":false,"lang":"en","entities":{"symbols":[],"urls":[],"hashtags":[],"user_mentions":[]},"in_reply_to_status_id_str":null,"id":330043889589288960,"source":" Twitter for Android ","in_reply_to_user_id_str":null,"favorited":false,"in_reply_to_status_id":null,"retweet_count":0,"created_at":"Thu May 02 19:39:29 +0000 2013","in_reply_to_user_id":null,"favorite_count":0,"id_str":"330043889589288960","place":{"id":"0613276b16c0d59f","bounding_box":{"type":"Polygon","coordinates":[[[-4.335135,50.429347],[-4.335135,50.874614],[-3.732303,50.874614],[-3.732303,50.429347]]]},"place_type":"city","name":"West Devon","attributes":{},"country_code":"GB","url":"","country":"United Kingdom","full_name":"West Devon, Devon"},"user":{"location":"okehampton","default_profile":false,"statuses_count":1345,"profile_background_tile":true,"lang":"en","profile_link_color":"FC0AFC","profile_banner_url":"","id":503242961,"following":null,"favourites_count":492,"protected":false,"profile_text_color":"0084B4","description":"vicky pollards twin sister ( the nice one )","verified":false,"contributors_enabled":false,"profile_sidebar_border_color":"FFFFFF","name":"vicki phillips ","profile_background_color":"FA03DD","created_at":"Sat Feb 25 16:40:53 +0000 2012","default_profile_image":false,"followers_count":149,"profile_image_url_https":" 83034337/fb9a8158c125dbb5a0650f58206880e0_normal.jpeg","geo_enabled":true,"profile_background_image_url":"","profile_background_image_url_https":"","follow_request_sent":null,"url":"","utc_offset":0,"time_zone":"Casablanca","notifications":null,"profile_use_background_image":true,"friends_count":1059,"profile_sidebar_fill_color":"DDEEF6","screen_name":"vixoakleophill","id_str":"503242961","profile_image_url":" 
eg","listed_count":0,"is_translator":false},"coordinates":{"type":"Point","coordinates":[-4.00720746,50.73529254]}}
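
A short sketch of pulling the notable fields out of a Tweet like the one above, using only the Python standard library. The `raw` string is a hand-trimmed copy of the sample Tweet (an assumption for brevity); the field names are the ones visible in the sample.

```python
import json
from datetime import datetime

# Trimmed copy of the sample Tweet; only the notable fields are kept.
raw = '''{"id": 330043889589288960,
 "created_at": "Thu May 02 19:39:29 +0000 2013",
 "text": "Iron man 3 was awesome =)",
 "entities": {"symbols": [], "urls": [], "hashtags": [], "user_mentions": []},
 "coordinates": {"type": "Point", "coordinates": [-4.00720746, 50.73529254]}}'''

tweet = json.loads(raw)

# created_at uses a fixed English format with a numeric UTC offset.
created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")

# Each hashtag entity is an object; its "text" key holds the tag without "#".
hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]

# "coordinates" is [longitude, latitude]; the "geo" field lists them the other way round.
lon, lat = tweet["coordinates"]["coordinates"]

print(tweet["id"], created.isoformat(), tweet["text"], hashtags, lat, lon)
```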

7 My Tweet Collection  Collected for roughly one month  Lots of trial and error  Originally used Tweepy, but ran into errors  Switched to Tweetinvi and it worked  About 68 million Tweets  Apple  Amazon  Google  Microsoft  Netflix  Tesla  Ford (Probably should have used a different car company)

8 Yahoo! Finance Detour  Use an HTTP request to get stock data  ZN+TSLA+F&f=snb2b3opl1t1d1  Create a metric with stock data and compare the sentiment of a company to its performance

9 Big Data (and regular data) Tools  Talend Open Studio  Hadoop  SAP HANA

10 Talend Open Studio  Open Source ETL Tool  Built on Eclipse  Data Quality and Format Issues  Even though I saved Tweets in delimited format, issues remained  Iterated through all 12,736 files with 5000 tweets each  Verified each row against a schema  Mapped to different output files  Tweet (Fact table)  Tracks  User Mentions  Hashtags  URLs  Demo Time!
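
The "verify each row against a schema" step can be sketched outside Talend in a few lines of Python. This is not the Talend job itself: the three-column schema, the `|` delimiter, and the two bad sample rows are assumptions made up for illustration.

```python
import csv
import io

# Hypothetical schema: (column name, converter). A converter raises ValueError on bad input.
SCHEMA = [("tweet_id", int), ("created_at", str), ("text", str)]

def split_rows(lines, delimiter="|"):
    """Return (good, rejected): rows that match the schema and rows that do not."""
    good, rejected = [], []
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) != len(SCHEMA):          # wrong number of columns
            rejected.append(row)
            continue
        try:
            good.append({name: conv(value) for (name, conv), value in zip(SCHEMA, row)})
        except ValueError:                   # a value failed its type conversion
            rejected.append(row)
    return good, rejected

sample = io.StringIO(
    "330043889589288960|Thu May 02 19:39:29 +0000 2013|Iron man 3 was awesome =)\n"
    "not-a-number|Thu May 02 19:40:00 +0000 2013|bad id\n"
    "330043889589288961|missing the text column\n"
)
good, rejected = split_rows(sample)
print(len(good), "good,", len(rejected), "rejected")
```

Keeping the rejected rows instead of dropping them makes it possible to inspect what went wrong in each file.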

11 Hadoop Overview  Based on the Hadoop Distributed File System and MapReduce  MapReduce is a way of parallelizing code using batch processing  Map finds the data you’re looking for  Reduce aggregates that data (count, sum, average)  Embarrassingly parallel processing  Each server in a Hadoop cluster is referred to as a Node  NameNode  DataNode  Blocks of data are replicated to three nodes  Extremely fault tolerant
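
A single-process sketch of the Map and Reduce steps described above, counting company mentions in Tweets. It only illustrates the shape of the computation: on a real cluster the map calls would run in parallel on the DataNodes that hold each block. The function names and the `tracked` set are assumptions.

```python
from collections import defaultdict

def map_phase(tweet):
    """Map: find the data you want and emit (key, 1) for each tracked company."""
    tracked = {"apple", "google", "amazon"}
    for word in tweet.lower().split():
        word = word.strip(".,!?")
        if word in tracked:
            yield word, 1

def shuffle(pairs):
    """Group all emitted values by key (the framework does this between Map and Reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key (here, a count)."""
    return key, sum(values)

tweets = [
    "The Apple iPad sucks. The new Google Nexus 7 is awesome!",
    "Iron man 3 was awesome =)",
]
pairs = [pair for tweet in tweets for pair in map_phase(tweet)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Each call to `map_phase` depends only on its own Tweet, which is why the work is embarrassingly parallel.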


13 More Hadoop  Open-source technology  Cloudera vs. Hortonworks  Intel, IBM, MapR, Amazon EMR  Cloudera and Hortonworks are the two biggest faces of Hadoop  Intel actively contributes to optimize it for Xeon Processors  IBM and MapR also involved  Big companies and entities use it

14 Hadoop Projects  Hive  Data Warehouse on top of Hadoop  Uses HiveQL (essentially SQL with a few extras) to query data  Abstracts MapReduce processes  Has an ODBC connector that allows it to talk to anything that talks to databases  Pig  Uses a language called Pig Latin to analyze data  Data flow language that abstracts MapReduce so data analysts can use it easily  HBase  Billions of rows and millions of columns  Distributed column data store

15 Hadoop Trivia Time  Who created Hadoop?  Why is it called Hadoop?  Who developed the concept of MapReduce?  What does Facebook Messenger use to store its data?  Who created Hive?  What is Accumulo and who created it?

16 2nd Generation Hadoop  Much faster than previous versions  Hive 0.12 is up to 50X faster than previous versions  Hortonworks Stinger project aims for 100X performance improvement  Projects like Spark are moving towards real-time analysis  In-memory cluster compute analysis  Streaming processing with routines written in Python and Scala  Shark is an implementation of Hive using Spark instead of MapReduce

17 Hadoop Sentiment Analysis  Used the “Dumb” method of Sentiment Analysis  Import the data into HDFS and create Hive tables  Tweet  Sentiment Dictionary  Explode words in each tweet to create a view with TweetID and Word  Join with the Sentiment Dictionary on the word to get sentiment value  Demo Time!
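
The explode-and-join steps above can be sketched with Python's built-in `sqlite3` standing in for Hive. This is not the HiveQL from the demo: the table names, the Tweet IDs 1 and 2, and the two-word dictionary are assumptions, and the "explode" is done in Python because SQLite has no equivalent built in.

```python
import re
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tweet (tweet_id INTEGER, text TEXT)")
con.execute("CREATE TABLE dictionary (word TEXT, score INTEGER)")
con.execute("CREATE TABLE tweet_word (tweet_id INTEGER, word TEXT)")

tweets = [
    (1, "The Apple iPad sucks. The new Google Nexus 7 is awesome!"),
    (2, "Iron man 3 was awesome =)"),
]
con.executemany("INSERT INTO tweet VALUES (?, ?)", tweets)
con.executemany("INSERT INTO dictionary VALUES (?, ?)", [("sucks", -1), ("awesome", 1)])

# "Explode": one (tweet_id, word) row per word in each Tweet.
for tweet_id, text in con.execute("SELECT tweet_id, text FROM tweet").fetchall():
    words = re.findall(r"[a-z0-9']+", text.lower())
    con.executemany("INSERT INTO tweet_word VALUES (?, ?)", [(tweet_id, w) for w in words])

# Join on the word, then sum the sentiment values per Tweet.
rows = con.execute("""
    SELECT tw.tweet_id, SUM(d.score)
    FROM tweet_word tw
    JOIN dictionary d ON d.word = tw.word
    GROUP BY tw.tweet_id
    ORDER BY tw.tweet_id
""").fetchall()
print(rows)
```

Because this is an inner join, a Tweet with no dictionary words drops out of the result entirely; a left join from the Tweet table would keep it with a neutral score.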

18 SAP HANA  In-Memory, Column-Store database  Loads all data into main memory  Analyze billions of rows with sub-second response time  Column-store table structure  Allows for much better compression and parallelization than row-store  Used for real-time analytics  Available as an on-premises appliance or a cloud-based VM
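
A toy illustration of why a column-store layout compresses well. Run-length encoding is used here as one common column compression technique; this is a sketch of the general idea, not a description of SAP HANA's internals, and the rows are made-up sample data.

```python
from itertools import groupby

# Row store: one tuple per record (company, sentiment). Made-up sample data.
rows = [("Apple", 1), ("Apple", 0), ("Apple", -1),
        ("Google", 1), ("Google", 1), ("Tesla", 0)]

# Column store: each column is kept as its own contiguous list.
company_col = [row[0] for row in rows]
sentiment_col = [row[1] for row in rows]

def run_length_encode(column):
    """Collapse each run of repeated values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(column)]

encoded = run_length_encode(company_col)

# An aggregate only has to scan the one column it needs.
total_sentiment = sum(sentiment_col)
print(encoded, total_sentiment)
```

Values of the same type sit next to each other in a column, so repeats are easy to collapse, and separate columns can be scanned independently, which is where the parallelization comes from.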

19 Why is SAP HANA Awesome?  Column-stores are naturally very good at parallelization  In-memory means no waiting on disk I/O; main memory is still hundreds of times faster than SSD  Feature rich  Text analytics  Predictive Analysis Library  Application Server  It is an actual database and does everything a database does  Demo Time

20 SAP HANA Sentiment Analysis  Sentiment is calculated when creating a full-text index on the text of the tweet  Creates a sentiment value for each tweet  Analyze by many different dimensions  Aggregate sentiment by hour  Demo Time!
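
The "aggregate sentiment by hour" step, sketched in plain Python rather than in SAP HANA. The three (created_at, sentiment) pairs are made-up sample data, and `average_by_hour` is a hypothetical helper; the timestamp format is the one Twitter uses in `created_at`.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (created_at, sentiment) pairs, one per scored Tweet.
scored = [
    ("Thu May 02 19:39:29 +0000 2013", 1),
    ("Thu May 02 19:55:02 +0000 2013", -1),
    ("Thu May 02 20:10:45 +0000 2013", 1),
]

def average_by_hour(pairs):
    """Bucket Tweets by the hour they were created and average each bucket."""
    buckets = defaultdict(list)
    for created_at, sentiment in pairs:
        ts = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
        buckets[ts.replace(minute=0, second=0)].append(sentiment)
    return {
        hour.strftime("%Y-%m-%d %H:00"): sum(buckets[hour]) / len(buckets[hour])
        for hour in sorted(buckets)
    }

hourly = average_by_hour(scored)
print(hourly)
```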

21 Other Text Analysis Options  Python Natural Language Toolkit  Analyze parts of speech and context  Should be possible to integrate with Hadoop (The Google did not help)

22 Other Big Data Problems  A GE engine on a transatlantic flight generates 2TB of sensor data  There are four engines on a 747  What does the LHC at CERN do with the 15 petabytes of data it creates annually?  How does the NSA store a yottabyte of data?  How does a small online gaming company analyze its customer base to increase retention and margins?

23 How is Sentiment Analysis Being Used?  Companies ingest their social media feeds into these systems  If a Tweet or Facebook post meets certain criteria, an automated or human response can be requested

24 Hot vs. Cold Data  Hot data is the recent data you are most interested in  Keep this data in SAP HANA for real-time processing  Archive it after a period of time: 1 month, 3 months, 6 months, etc.  Cold data is your historical data  Data warehouses that can handle massive volumes of data are EXPENSIVE!!!!  Use Hadoop and Hive as your data warehouse  The only cost is the hardware  Still able to analyze cold data, store it cheaply, and integrate with SAP HANA
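
A small sketch of the hot/cold split as a date cutoff. The 30-day window, the record layout, and the `split_hot_cold` helper are assumptions for illustration; in practice the hot set would stay in SAP HANA and the cold set would be archived to Hadoop.

```python
from datetime import datetime, timedelta, timezone

def split_hot_cold(records, now, hot_window=timedelta(days=30)):
    """Records newer than the cutoff are hot; everything older is cold."""
    cutoff = now - hot_window
    hot = [r for r in records if r["created_at"] >= cutoff]
    cold = [r for r in records if r["created_at"] < cutoff]
    return hot, cold

# Hypothetical records and "current" date.
now = datetime(2013, 6, 15, tzinfo=timezone.utc)
records = [
    {"id": 1, "created_at": datetime(2013, 6, 10, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime(2013, 5, 2, tzinfo=timezone.utc)},
]

hot, cold = split_hot_cold(records, now)
print([r["id"] for r in hot], [r["id"] for r in cold])
```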

