Social Media, Data Integration, and Human Computation

1 Social Media, Data Integration, and Human Computation
AnHai Doan, University of Wisconsin and @WalmartLabs

2 Find houses with 2 bedrooms under 400K
A Journey Starting in
Worked in data integration
  combine multiple data sources into one
  e.g., aggregation/comparison shopping sites, Google Scholar
  use schema matching, information extraction, entity disambiguation
Ph.D. thesis focused on schema matching
[Figure: the query "Find houses with 2 bedrooms under 400K" posed over homes.com, realestate.com, and fsbo.com]

3 Schema Matching Developed automatic solution using machine learning
[Figure: two tables to be matched]
  Source 1 (address, price): 31 Bagley Ct ... 250K; 12 Hope St ... 375K
  Source 2 (location, sold-at): 14 Main St ... 249,000; 25 West St ... 324,000
  Matches: address = location, price = sold-at
Developed automatic solution using machine learning
Realized that automatic solutions are not good enough
  only 65-85% accuracy
  need human intervention
Proposed a crowdsourcing approach
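
The matching task on this slide can be illustrated with a minimal sketch. It is not the machine-learning solution from the thesis: as a stand-in, it compares columns by two hand-picked features of their values (share of numeric-looking values, mean number of tokens). The function names and the features are assumptions made for illustration only.

```python
import re

def features(values):
    """Crude per-column features: share of numeric-looking values, mean token count."""
    numeric = sum(bool(re.fullmatch(r"[\d,.]+[kK]?", v.strip())) for v in values)
    tokens = sum(len(v.split()) for v in values)
    return (numeric / len(values), tokens / len(values))

def match_columns(source, target):
    """Map each source column to the target column with the closest feature vector."""
    mapping = {}
    for s_name, s_vals in source.items():
        fs = features(s_vals)

        def dist(t_name):
            ft = features(target[t_name])
            return sum((a - b) ** 2 for a, b in zip(fs, ft))

        mapping[s_name] = min(target, key=dist)
    return mapping

# Sample values taken from the two tables on the slide.
source = {"address": ["31 Bagley Ct", "12 Hope St"], "price": ["250K", "375K"]}
target = {"location": ["14 Main St", "25 West St"], "sold-at": ["249,000", "324,000"]}

print(match_columns(source, target))
```

A heuristic like this breaks down quickly on real schemas, which is consistent with the slide's point that automatic matching alone is not accurate enough.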

4 Crowdsourced Schema Matching
[Figure: the same two tables; the crowd is asked whether address = location and answers Yes, Yes, No]
Can crowdsource other DI tasks too
Difficult to publish
  Building data integration systems via mass collaboration, WebDB-03
  Subsequent reviews: great work, I don’t believe it, neutral
Build a large-scale DI system on the Web
Show that crowdsourcing is practical
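
One simple way to combine crowd answers such as "Yes, Yes, No" is a majority vote. The sketch below assumes that rule; the slides do not say how answers were actually aggregated, and the `min_votes` threshold is a hypothetical parameter.

```python
from collections import Counter

def majority_vote(answers, min_votes=3):
    """Return 'yes' or 'no' by majority, or None if there are too few votes or a tie."""
    if len(answers) < min_votes:
        return None
    counts = Counter(a.strip().lower() for a in answers)
    yes, no = counts["yes"], counts["no"]
    if yes == no:
        return None
    return "yes" if yes > no else "no"

# Crowd answers for the candidate match "address = location" from the slide.
print(majority_vote(["Yes", "Yes", "No"]))
```

Returning None for ties and thin evidence leaves room to ask more users before committing to a match.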

5 Started DBLife Project in 2005
[Architecture diagram]
Data sources: researcher homepages, conference pages, group pages, DBworld mailing list, DBLP, Web pages
Extracted entities and relations, e.g., HV Jagadish give-talk SIGMOD-07
User services: superpages, keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary
Infrastructure: file system, RDBMS, Hadoop

6 Example Superpage

7 Example Crowdsourcing
Picture is removed if enough users vote “no”.

8 Project Status in 2009
Data integration
  overall methodology: VLDB-07a, VLDB-07b, CIDR-09
  DI operators: VLDB-07c
  optimization: VLDB-07c, SIGMOD-08, ICDE-08a, SIGMOD-09a
  provenance/others: ICDE-07a, ICDE-07b, VLDB-08a
Crowdsourcing / human computation
  schema matching: ICDE-08b
  best-effort information extraction: SIGMOD-08
  human feedback into the DI pipeline: SIGMOD-09b
  how lay users can query the database: SIGMOD-09c
System development
  hard to build/maintain systems in academia
Wanted to know what’s going on in industry
Wanted to take DBLife to the next level
Joined Kosmix in 2010 to do “DBLife on steroids”

9 Kosmix
Founded by Anand Rajaraman & Venky Harinarayan
  formerly of Junglee, sold to Amazon for 250M
55M in funding, 30+ engineers
Integrated Web data sources into a giant taxonomy
[Diagram: sources such as IMDB, Musicbrainz, Tripadvisor, and Wikipedia pass through information extraction, entity disambiguation, entity merging, ... into a taxonomy (all > places, people; people > actors > Angelina Jolie, Mel Gibson) that backs topic pages; infrastructure: file system, RDBMS, Hadoop]

10 Raised many interesting challenges
  e.g., incremental updates, recycling human edits
Very good in certain topics (e.g., health)
But hard to compete with Google and Wikipedia
Switched to social media in early 2010

11 Social Media Exploding
100 million tweets per day
1 billion Facebook shares per day
1.5 million Foursquare checkins per day
40,000 Flickr photos per second
“Every two days now we create as much information as we did from the dawn of civilization up until 2003.” -- Eric Schmidt
That’s something like five exabytes of data, he says. “The real issue is user-generated content,” Schmidt said. He noted that pictures, instant messages, and tweets all add to this.

12 Switching Made Much Business Sense
Lot of social media data
Lot of people using it, spending a lot of time on it
  lot of links now come from social media, not search engines
  Google is worried (hence Buzz, Google+, Google++)
New level playing field
Have a secret weapon: the giant taxonomy
Next hot Internet wave
  SoLoMo = social + local + mobile
But can we build interesting applications?
What is social media good for?

13 From Frivolous to Serious
95% of tweets are still junk
  I feel good today
  Help teenagers track Justin Bieber
  the background noise of Twitter
  Charlie Sheen, celebrity fighting, Weiner losing his job
Foster customer relationships
  follow your dentist
Spread news
Manage disasters
Promote e-commerce
Help organize events, movements
  revolutions

14 Lot of Companies / Actions in This Space
Build platforms for social media
  how to tweet more effectively
Understand social media
  social analytics / route relevant information to users
Use social media to make predictions
Use social media to effect real-world changes
Mostly operate at the keyword level
  how many times has the keyword “Obama” been mentioned today?
Kosmix: the leader in performing semantic analysis
  how many times has the entity President Obama been mentioned today?
  “Obama”, “Barack”, “Barry”, “BO”, “the Pres”, “the Messiah”, ...
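
The difference between keyword-level and entity-level counting can be sketched as follows. The alias list comes from the slide; the function names and sample tweets are made up for illustration, and plain alias matching is only a first step, since an alias such as "Barry" is ambiguous without further disambiguation.

```python
import re

# Aliases for one entity, taken from the slide.
ALIASES = {
    "President Obama": ["Obama", "Barack", "Barry", "BO", "the Pres", "the Messiah"],
}

def keyword_count(tweets, keyword):
    """Count tweets that contain the literal keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return sum(bool(pattern.search(t)) for t in tweets)

def entity_count(tweets, entity):
    """Count tweets that mention the entity under any of its known aliases."""
    alternatives = "|".join(re.escape(a) for a in ALIASES[entity])
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return sum(bool(pattern.search(t)) for t in tweets)

# Hypothetical sample tweets.
tweets = [
    "Obama speaks tonight",
    "Barack is on TV",
    "the Pres just landed",
    "I feel good today",
]

print(keyword_count(tweets, "Obama"), entity_count(tweets, "President Obama"))
```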

15 Kosmix Solution
[Architecture diagram]
Sources: IMDB, Musicbrainz, Wikipedia
Processing: information extraction, entity disambiguation, entity merging, schema matching, event detection, event monitoring, ...
Crowdsourcing: internal analysts, users, Mechanical Turk workers, others
Output: Social Genome, which feeds applications
Highly scalable real-time infrastructure: file system, RDBMS, Hadoop, Muppet, Slates, stream servers

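16 Social Genome
[Diagram: a graph that links taxonomy concepts, social media users, and events]
Taxonomy: all > places, people; people > Twitter users, FB users, actors
Actors: Angelina Jolie, Mel Gibson
Twitter users: @melgibson, @dsmith, …; FB users: mel-gibson, davesmith, …
Places: Egypt, Cairo, Tahrir (edges: capital-of, located-in)
Events: sports, celebrities, politics, …; e.g., Gibson car crash, Egyptian uprising
Other edges: tweet-about, the-same-as, related-to
Sample tweets: “@dsmith: Mel crashed Maserati is gone.” and “@far213: Tahrir is packed!”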

17

18 Building Social Genome: Three Sample Challenges
[Diagram: the Social Genome graph from slide 16, repeated]

19 Extraction and Disambiguation: Traditional Methods Ill Suited for Social Media
[Diagram: taxonomy fragment with people > actors, directors (Angelina Jolie, Mel Gibson, Mel Brooks) and events > sports, celebrities, politics, … (Gibson car crash, Egyptian uprising)]
Traditional text: “Mel was arrested again. What a dramatic fall since his Oscar-winning day.”
  Extraction: use rule-based / NLP / machine learning techniques
  Disambiguation: long-term, Web context: actor, movie, Oscar, Hollywood
Social media: “@dsmith: mel crashed maserati is gone.”
  Extraction: use dictionaries, use rules
  Disambiguation: short-term, social context: crash, car, Maserati
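
A minimal sketch of context-based disambiguation, assuming a simple word-overlap score: each candidate entity carries a long-term Web context and a short-term social context, and the candidate whose contexts share the most words with the text wins. The Mel Gibson contexts come from the slide; the Mel Brooks context and the scoring rule are hypothetical.

```python
import re

WEB_CONTEXT = {
    "Mel Gibson": {"actor", "movie", "oscar", "hollywood"},   # from the slide
    "Mel Brooks": {"director", "comedy", "producer"},         # hypothetical
}
SOCIAL_CONTEXT = {
    "Mel Gibson": {"crash", "car", "maserati"},               # from the slide
    "Mel Brooks": set(),                                      # nothing trending
}

def disambiguate(text, candidates):
    """Pick the candidate whose Web + social context overlaps the text most."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))

    def score(entity):
        return len(words & WEB_CONTEXT[entity]) + len(words & SOCIAL_CONTEXT[entity])

    best = max(candidates, key=score)
    return best if score(best) > 0 else None

candidates = ["Mel Gibson", "Mel Brooks"]
print(disambiguate("mel crashed maserati is gone.", candidates))
print(disambiguate("Mel was arrested again. What a dramatic fall since his Oscar-winning day.", candidates))
```

In this toy version the tweet is resolved only through the social context ("maserati"), while the longer sentence is resolved only through the Web context ("oscar"), which mirrors the contrast the slide draws.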

20 Must Maintain a Highly Dynamic Social Genome
[Diagram: taxonomy fragment (Angelina Jolie, Mel Gibson, Mel Brooks; Gibson car crash, Egyptian uprising) in which an entity carries two contexts]
  Long-term, Web context: actor, movie, Oscar, Hollywood
  Short-term, social context: crash, car, Maserati
Latency less than 2 seconds

21 The Giant Traditional Taxonomy is the Secret Weapon
[Diagram: taxonomy fragment: all > places (Egypt, Cairo, Tahrir; edges capital-of, located-in) and people > actors (Angelina Jolie, Mel Gibson)]
Without it, dictionary-based extraction is not possible
Provides a framework to “understand” social media, find related concepts, “hang” social contexts
Very hard to develop, takes years
  like learning a new foreign language
Partly explains why it was hard for others to catch up
Must integrate traditional data well, then bootstrap

22 Event Detection: Current Solutions
[Diagram: Twitter, Foursquare, Facebook, Myspace, and Flickr feed event detection, which outputs events (sports, celebrities, politics, …), e.g., Gibson car crash, Egyptian uprising]
Focus on Twitter + Foursquare
Lot of current work in academia / industry
Limitations of most of the current solutions
  exploit just one kind of heuristic
    e.g., find popular, strongly correlated words (Egypt, revolt)
  do not exploit crowdsourcing
  do not scale
    not designed explicitly for parallelism

23 Event Detection: Kosmix Solution
[Pipeline diagram: Twitter and Foursquare streams feed Detector 1, Detector 2, ..., Detector n; the detectors emit populations of candidate events (Population 1, 2, 3, ...); an event evaluator and ranker turns these into ranked events; infrastructure: Hadoop, Muppet, Slates, stream servers]
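
The shape of this pipeline (several independent detectors, then one evaluator and ranker) can be sketched in a few lines. The two detectors below, popular words and co-occurring word pairs, are simple stand-ins chosen for illustration; the slides do not describe the actual detectors or the ranking function, and the sample messages are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

STOPWORDS = {"is", "the", "in", "on", "a", "at"}

def tokens(message):
    """Lowercased words of a message, without stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", message.lower()) if w not in STOPWORDS]

def popular_words(messages, min_count=2):
    """Detector 1: single words that appear in at least min_count messages."""
    counts = Counter(w for m in messages for w in set(tokens(m)))
    return {(w,): c for w, c in counts.items() if c >= min_count}

def correlated_pairs(messages, min_count=2):
    """Detector 2: word pairs that co-occur in at least min_count messages."""
    counts = Counter(
        pair for m in messages for pair in combinations(sorted(set(tokens(m))), 2)
    )
    return {pair: c for pair, c in counts.items() if c >= min_count}

def rank_events(messages, detectors):
    """Evaluator and ranker: merge all candidates and sort by total score."""
    merged = Counter()
    for detect in detectors:
        merged.update(detect(messages))
    return merged.most_common()

messages = [
    "Tahrir is packed!",
    "Protest in Tahrir",
    "Egypt revolt in Tahrir",
    "Egypt revolt spreads",
]
ranked = rank_events(messages, [popular_words, correlated_pairs])
print(ranked)
```

Because each detector only reads the messages and returns its own candidates, detectors can run in parallel, which is the property the previous slide says most solutions lack.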

24 Event Monitoring: Current Solutions
[Example events and tweets: Egyptian uprising (“@far213: Tahrir is packed!”), Baltimore shooting (“@dsmith: Baltimore shooting on TV5!”)]
Manually write rules to match tweets to events
  e.g., tweet contains certain keywords / userids -> positive
  conceptually simple, relatively easy to implement
  often achieve high initial precision
Limitations
  expensive, don’t scale
  manually writing good rules can be hard
  rules often become invalid/inadequate over time
    e.g., Baltimore shooting -> Johns Hopkins shooting

25 Event Monitoring: Kosmix Solution
[Flow diagram]
Twitter firehose + initial profile for the event “Baltimore shooting”: {Baltimore, shoot}
-> matching tweets: “Baltimore shooting on TV5!”, “Baltimore shooting. Johns Hopkins shut down.”, ...
-> learning algorithm
-> new profile: {Baltimore, shoot, Johns Hopkins}
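
The profile-update loop can be sketched as follows. The slides do not name the learning algorithm, so this stand-in simply adds any word that recurs across tweets matched by the current profile; the matching rule, thresholds, and the third and fourth sample tweets are assumptions.

```python
import re
from collections import Counter

STOPWORDS = {"on", "at", "in", "the", "a"}

def words(tweet):
    return set(re.findall(r"[a-z0-9]+", tweet.lower()))

def matches(tweet, profile, min_hits=2):
    """A tweet matches if at least min_hits profile terms prefix one of its words."""
    ws = words(tweet)
    hits = sum(any(w.startswith(term) for w in ws) for term in profile)
    return hits >= min_hits

def update_profile(tweets, profile, min_support=2):
    """Add words that occur in at least min_support matched tweets to the profile."""
    matched = [t for t in tweets if matches(t, profile)]
    counts = Counter(w for t in matched for w in words(t))
    learned = {
        w for w, c in counts.items()
        if c >= min_support
        and w not in STOPWORDS
        and not any(w.startswith(term) for term in profile)
    }
    return set(profile) | learned

tweets = [
    "Baltimore shooting on TV5!",
    "Baltimore shooting. Johns Hopkins shut down.",
    "Shooting at Johns Hopkins in Baltimore",   # hypothetical
    "I feel good today",                        # hypothetical
]
initial = {"baltimore", "shoot"}
new_profile = update_profile(tweets, initial)
print(sorted(new_profile))
```

With the learned terms in place, a later tweet such as "Johns Hopkins shut down" matches the event even though it mentions neither Baltimore nor a shooting, which is the drift problem that hand-written rules struggle with.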

26

27 Social Analytics with The NYTimes
[Pipeline diagram: Tweets -> Annotators (e.g., location, sentiment, entity extraction) -> Tweets & Dimensions -> SocialCubes -> Stats]
Dimensions
  Location: New York, Arizona, California
  Topics: Barack Obama, Hillary Clinton, Medicare
  Sentiment: negative, positive, neutral
Entity aliases: Barack Obama, President Obama, the Pres, Barry, BO, ...
Sample queries
  How many are tweeting about Barack Obama in New York, by the minute for the last 60 mins, by hour for the last 24 hours, and by day for the last 10 days?
  How many people in Arizona feel positive about the new Medicare plan?
  How many feel negative about Barack Obama across the US?
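
The kind of query the slide lists can be answered from a small count cube over annotated tweets. This is only a sketch of the idea, assuming tweets already carry topic, location, and sentiment annotations; the field names, the minute-level cells, and the sample data are hypothetical, and nothing here reflects how SocialCubes was actually built.

```python
from collections import Counter
from datetime import datetime, timedelta

def build_cube(annotated_tweets):
    """Count tweets per (topic, location, sentiment, minute) cell."""
    cube = Counter()
    for t in annotated_tweets:
        minute = t["time"].replace(second=0, microsecond=0)
        cube[(t["topic"], t["location"], t["sentiment"], minute)] += 1
    return cube

def query(cube, topic=None, location=None, sentiment=None, since=None):
    """Sum the cells that satisfy every filter that is not None."""
    total = 0
    for (tp, loc, sent, minute), n in cube.items():
        if topic is not None and tp != topic:
            continue
        if location is not None and loc != location:
            continue
        if sentiment is not None and sent != sentiment:
            continue
        if since is not None and minute < since:
            continue
        total += n
    return total

now = datetime(2011, 5, 1, 12, 0)
tweets = [
    {"topic": "Barack Obama", "location": "New York", "sentiment": "negative", "time": now - timedelta(minutes=5)},
    {"topic": "Barack Obama", "location": "Arizona", "sentiment": "negative", "time": now - timedelta(hours=3)},
    {"topic": "Medicare", "location": "Arizona", "sentiment": "positive", "time": now - timedelta(minutes=30)},
    {"topic": "Barack Obama", "location": "New York", "sentiment": "positive", "time": now - timedelta(days=2)},
]
cube = build_cube(tweets)

# "How many feel negative about Barack Obama across the US?"
print(query(cube, topic="Barack Obama", sentiment="negative"))
```

Coarser time buckets (hour, day) can be derived by summing minute cells, so one cube can serve all three granularities mentioned in the first sample query.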

28 Social Monitoring with an Unknown Agency
[Monitoring dashboard: the Twitter firehose feeds counters of tweets related to each tracked topic, e.g., 146 in past 5 mins, 3267 in past 12 hours]
Tracked topics: Wael Ghonim, Egyptian uprising, Justin Bieber, Charlie Sheen, Jordan unrest, North China unrest, Tibet, West, Southeast
Bought by Walmart in May 2011
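
Counts such as "146 in past 5 mins" can be maintained over a stream with a sliding-window counter. The sketch below is one common way to do this, not a description of the Kosmix system; the class name is hypothetical, and it assumes timestamps (in seconds) arrive in non-decreasing order.

```python
from collections import deque

class WindowCounter:
    """Counts events whose timestamp falls within the last window_seconds."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.times = deque()

    def add(self, timestamp):
        # Assumes timestamps arrive in non-decreasing order.
        self.times.append(timestamp)

    def count(self, now):
        # Drop events that have fallen out of the window, then count the rest.
        while self.times and self.times[0] <= now - self.window:
            self.times.popleft()
        return len(self.times)

five_minutes = WindowCounter(300)
for ts in (0, 100, 290, 400):
    five_minutes.add(ts)
print(five_minutes.count(400))
```

One counter per tracked topic and per window length (5 minutes, 12 hours) would be enough to fill a dashboard like the one on the slide.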

29 The Walmart Acquisition
Deal reported to be M
Kosmix based in San Bruno
  local office in India
  plan new offices in China and Brazil
100 persons today, actively hiring

30 Why?
400+ B in revenue, only 5-10B online
  vs. 34B of Amazon
Major problems if it doesn’t catch up within 5-10 years
  see Borders
@WalmartLabs can help in many ways
Provides a core of technical people, attract more
Improve traditional e-commerce
  SEO, SEM, search on walmart.com
  build a vast product taxonomy
Helps build the e-commerce of the future
  social, local, and mobile
  a good way to catch up and leapfrog Amazon

31 Improve Traditional E-Commerce
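[Diagram: product data from thousands of vendors, in-house data, and Web data pass through information extraction, entity disambiguation, entity merging, ... into a product taxonomy (all products > books, cars; cars > US cars > Ford, Chevrolet) that supports search and ads; infrastructure: file system, RDBMS, Hadoop]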

32 Help Build the E-Commerce of Future: Social, Local, and Mobile
O2O (Online 2 Offline) emerging as a major trend
  increasingly tighter integration of online and offline parts
  e.g., Groupon, Living Social
Social, local, and mobile commerce examples
  gift recommendation: “I love salt!”
    “Your friend has just tweeted about the movie SALT. Would you like to buy something related for her birthday?”
  personalized “Groupon” with vendors:
    “You seem to be interested in gourmet coffee. If 50 persons sign up to buy the new DeLonghi coffee maker, you can get that for a 50% discount.”
  stocking a local store
  a Siri-like shopping assistant

33 Wrapping Up
Social media has become a major frontier on the Web
Integrating social data is fundamentally much harder than integrating “traditional” data
  lack of context
  dynamic environment, new concepts appear quickly
  quality issues, lots of spam
  quick spread of information, user activities
  fast data
  solution will change over time, need human in the loop to monitor
Must integrate “traditional” data well, then bootstrap
  giant taxonomy critical
Crowdsourcing becomes indispensable
  but raises interesting challenges

