Gerhard Weikum Harvesting and Organizing Knowledge from the Web In collaboration with Giorgiana.

Slides:



Advertisements
Similar presentations
Haystack: Per-User Information Environment 1999 Conference on Information and Knowledge Management Eytan Adar et al Presented by Xiao Hu CS491CXZ.
Advertisements

YAGO: A Large Ontology from Wikipedia and WordNet Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum Max-Planck-Institute for Computer Science, Saarbruecken,
YAGO-NAGA Project Presented By: Mohammad Dwaikat To: Dr. Yuliya Lierler CSCI 8986 – Fall 2012.
Web Mining Research: A Survey Authors: Raymond Kosala & Hendrik Blockeel Presenter: Ryan Patterson April 23rd 2014 CS332 Data Mining pg 01.
Graph Data Management Lab, School of Computer Science Put conference information here.
„IP“ is not always „Internet Protocol“ A long and a very short example for IP problems in Web 2.0 research Ralf Schenkel Joint work with Tom Crecelius,
OntoBlog: Informal Knowledge Management by Semantic Blogging Aman Shakya 1, Vilas Wuwongse 2, Hideaki Takeda 1, Ikki Ohmukai 1 1 National Institute of.
Research topics Semantic Web - Spring 2007 Computer Engineering Department Sharif University of Technology.
SFU, CMPT 741, Fall 2009, Martin Ester 418 Outlook Outline Trends in KDD research Graph mining and social network analysis Recommender systems Information.
Gerhard Weikum Harvesting, Searching, and Ranking Knowledge from the Web Joint work with Georgiana.
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Search Engines and Information Retrieval
Tagging Systems Austin Wester. Tags A keywords linked to a resource (image, video, web page, blog, etc) by users without using a controlled vocabulary.
Database and Information- Retrieval Methods for Knowledge Discovery Database and Information- Retrieval Methods for Knowledge Discovery Gerhard Weikum,
Information Retrieval in Practice
Information Retrieval
Which Nobel laureate survived both world wars and all his four children? Tandem What‘s this? Question AnsweringPhoto AnnotationTimeline Analysis.
The Social Web: A laboratory for studying s ocial networks, tagging and beyond Kristina Lerman USC Information Sciences Institute.
Overview of Web Data Mining and Applications Part I
Saarbrucken / Germany ¨
Semantic Web Technologies Lecture # 2 Faculty of Computer Science, IBA.
Databases & Information Retrieval Maya Ramanath ( Further Reading: Combining Database and Information-Retrieval Techniques for Knowledge Discovery. G.
Tag-based Social Interest Discovery
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Search Engines and Information Retrieval Chapter 1.
Institute of Informatics and Telecommunications – NCSR “Demokritos” Bootstrapping ontology evolution with multimedia information extraction C.D. Spyropoulos,
An Integrated Approach to Extracting Ontological Structures from Folksonomies Huairen Lin, Joseph Davis, Ying Zhou ESWC 2009 Hyewon Lim October 9 th, 2009.
Author: William Tunstall-Pedoe Presenter: Bahareh Sarrafzadeh CS 886 Spring 2015.
DBrev: Dreaming of a Database Revolution Gjergji Kasneci, Jurgen Van Gael, Thore Graepel Microsoft Research Cambridge, UK.
PAUL ALEXANDRU CHIRITA STEFANIA COSTACHE SIEGFRIED HANDSCHUH WOLFGANG NEJDL 1* L3S RESEARCH CENTER 2* NATIONAL UNIVERSITY OF IRELAND PROCEEDINGS OF THE.
Ontology-Driven Automatic Entity Disambiguation in Unstructured Text Jed Hassell.
Ontology-Based Information Extraction: Current Approaches.
Flexible Text Mining using Interactive Information Extraction David Milward
Keyword Searching and Browsing in Databases using BANKS Seoyoung Ahn Mar 3, 2005 The University of Texas at Arlington.
1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.
Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining.
Natural Language Questions for the Web of Data 1 Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany 2 Shady Elbassuoni.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
Summary Knowledge Bases from Web are Real, Big & Useful: Entities, Classes & Relations Key Asset for Intelligent Applications: Semantic Search, Question.
Natural Language Questions for the Web of Data Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany Shady Elbassuoni.
Harvesting Social Knowledge from Folksonomies Harris Wu, Mohammad Zubair, Kurt Maly, Harvesting social knowledge from folksonomies, Proceedings of the.
1 NAGA: Searching and Ranking Knowledge Gjergji Kasneci Joint work with: Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, and Gerhard Weikum.
Tutorial: Knowledge Bases for Web Content Analytics
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
Gaby Nativ, SDBI  Motivation  Other Ontologies  System overview  YAGO Dive IN  LEILA  NAGA  Conclusion.
Semantic Web COMS 6135 Class Presentation Jian Pan Department of Computer Science Columbia University Web Enhanced Information Management.
A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.
Social Information Processing March 26-28, 2008 AAAI Spring Symposium Stanford University
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
Chapter 8: Web Analytics, Web Mining, and Social Analytics
Big Data: Every Word Managing Data Data Mining TerminologyData Collection CrowdsourcingSecurity & Validation Universal Translation Monolingual Dictionaries.
GUILLOU Frederic. Outline Introduction Motivations The basic recommendation system First phase : semantic similarities Second phase : communities Application.
Efficient Top-k Querying over Social-Tagging Networks Ralf Schenkel, Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane Xavier Parreira,
Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.
Semantic Graph Mining for Biomedical Network Analysis: A Case Study in Traditional Chinese Medicine Tong Yu HCLS
Data mining in web applications
Information Retrieval in Practice
Einat Minkov University of Haifa, Israel CL course, U
Search Engine Architecture
Lecture 1: Introduction and the Boolean Model Information Retrieval
Information Retrieval (in Practice)
The Power of Networks Six Principles That Connect Our Lives
Optimize your research performance using SciVal
Information Retrieval
Smart Portal To Protect Child Online
Web Mining Department of Computer Science and Engg.
Information Networks: State of the Art
Introduction to Search Engines
Introduction Dataset search
Presentation transcript:

Gerhard Weikum Harvesting and Organizing Knowledge from the Web In collaboration with Giorgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchanek, Martin Theobald

Gerhard Weikum ADBIS Oct 1, /53 Vision Opportunity: Turn the Web (and Web 2.0 and Web ) into the world‘s most comprehensive knowledge base (semantic DB) Challenge: seize opportunity and make it happen! Approach: combine and exploit synergies of hand-crafted, high-quality knowledge sources  Semantic Web automatic knowledge extraction  Statistical Web social networks and human computing  Social Web

Gerhard Weikum ADBIS Oct 1, /53 Proof of Relevance There is a growing mountain of research. … A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory. Vannevar Bush: As We May Think, 1945.

Gerhard Weikum ADBIS Oct 1, /53 Proof of Relevance Tim Berners-Lee: In the Semantic Web information is given well-defined meaning. Jim Gray: … system can answer questions about the text as precisely and quickly as a human expert. Brewster Kahle: The goal of universal access to our cultural heritage is within our grasp. Jimmy Wales: Our big-picture vision is to share knowledge with all of humanity. Al Gore: The future will be better tomorrow.

Gerhard Weikum ADBIS Oct 1, /53 Proof of Relevance ? A journey of a thousand miles begins with a single step. You cannot open a book without learning something. To know that we know what we know, and that we do not know what we do not know, that is true knowledge. Confucius, BC Denis Diderot, Ignorance is less remote from the truth than prejudice. Sentences are like sharp nails, which force truth upon our memories. When science, art, literature, and philosophy are simply the manifestation of personality, they can make a man's name live for thousands of years.

Gerhard Weikum ADBIS Oct 1, /53 Why Google and Wikipedia Are Not Enough Nobel laureate who survived both world wars and his children drama with three women making a prophecy to a British nobleman that he will become king proteins that inhibit both protease and some other enzyme connection between Thomas Mann and Goethe differences in Rembetiko music from Greece and from Turkey neutron stars with Xray bursts > erg s -1 & black holes in 10‘‘ market impact of Web2.0 technology in December 2006 sympathy or antipathy for Germany from May to August 2006 Turn the Web, Web2.0, and Web3.0 into the world‘s most comprehensive knowledge base („semantic DB/graph“) ! Answer „knowledge queries“ such as:

Gerhard Weikum ADBIS Oct 1, /53 Outline Introduction: Search for Knowledge Conclusion Harvesting Knowledge Leibniz Approach Planck Approach Darwin Approach

Gerhard Weikum ADBIS Oct 1, /53 Three Roads to Knowledge Leibniz Approach: Handcrafted High-Quality Knowledge Sources (Semantic Web) Planck Approach: Large-scale Information Extraction & Harvesting (Statistical Web) Darwin Approach: Social Wisdom from Web 2.0 Communities (Social Web)

Gerhard Weikum ADBIS Oct 1, /53 Leibniz Approach (Semantic Web) Handcrafted High-Quality Knowledge: Ontologies and other Lexical Sources Build on Rigorous Knowledge Atoms („Characteristica Universalis“) Gottfried Wilhelm Leibniz ( )

Gerhard Weikum ADBIS Oct 1, /53 High-Quality Knowledge Sources General-purpose ontologies for Semantic Web: SUMO, Cyc, etc.

Gerhard Weikum ADBIS Oct 1, /53 High-Quality Knowledge Sources General-purpose thesauri and concept networks: WordNet family woman, adult female – (an adult female person) => amazon, virago – (a large strong and aggressive woman) => donna -- (an Italian woman of rank) => geisha, geisha girl -- (...) => lady (a polite name for any woman)... => wife – (a married woman, a man‘s partner in marriage) => witch – (a being, usually female, imagined to have special powers derived from the devil)

Gerhard Weikum ADBIS Oct 1, /53 High-Quality Knowledge Sources General-purpose thesauri and concept networks: WordNet family enzyme -- (any of several complex proteins that are produced by cells and act as catalysts in specific biochemical reactions) => protein -- (any of a large group of nitrogenous organic compounds that are essential constituents of living cells;...) => macromolecule, supermolecule... => organic compound -- (any compound of carbon and another element or a radical)... => catalyst, accelerator -- ((chemistry) a substance that initiates or accelerates a chemical reaction without itself being affected) => activator -- ((biology) any agency bringing about activation;...) concepts and relations; can be cast into description logics or graph, with weights for relation strengths (derived from co-occurrence statistics)

Gerhard Weikum ADBIS Oct 1, /53 High-Quality Knowledge Sources Domain ontologies (UMLS, GeneOntology, etc.) 1 Mio. biomedical concepts, 135 categories, 54 relationships (e.g. virus causes (disease | symptom) )

Gerhard Weikum ADBIS Oct 1, /53 High-Quality Knowledge Sources Wikipedia and other lexical sources 2 Mio. articles 40 Mio. hyperlinks many 1000‘s of categories and lists more than 100 languages growing very fast

Gerhard Weikum ADBIS Oct 1, /53 {{Infobox_Scientist | name = Max Planck | birth_date = [[April 23]], [[1858]] | birth_place = [[Kiel]], [[Germany]] | death_date = [[October 4]], [[1947]] | death_place = [[Göttingen]], [[Germany]] | residence = [[Germany]] | nationality = [[Germany|German]] | field = [[Physicist]] | work_institution = [[University of Kiel]] [[Humboldt-Universität zu Berlin]] [[Georg-August-Universität Göttingen]] | alma_mater = [[Ludwig-Maximilians-Universität München]] | doctoral_advisor = [[Philipp von Jolly]] | doctoral_students = [[Gustav Ludwig Hertz]] … | known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]] | prizes = [[Nobel Prize in Physics]] (1918) … Exploit Hand-Crafted Knowledge Wikipedia, WordNet, and other lexical sources

Gerhard Weikum ADBIS Oct 1, /53 YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, G. Weikum: WWW‘07] Turn Wikipedia into explicit knowledge base (semantic DB) Exploit hand-crafted categories and templates Represent facts as explicit knowledge triples: relation (entity1, entity2) (in FOL, compatible with RDF, OWL-lite, XML, etc.) Map (and disambiguate) relations into WordNet concept DAG entity1entity2 relation Max_PlanckKiel bornIn Kiel City isInstanceOf Examples:

Gerhard Weikum ADBIS Oct 1, /53 YAGO Knowledge Representation Entity Max_PlanckApril 23, 1858 Person CityCountry subclass Location subclass instanceOf subclass bornOn “Max Planck” means “Dr. Planck” means subclass October 4, 1947 diedOn Kiel bornIn Nobel Prize Erwin_Planck FatherOf hasWon Scientist means “Max Karl Ernst Ludwig Planck” Physicist instanceOf subclass Biologist subclass concepts individuals words Knowledge Base # Facts KnowItAll SUMO WordNet OpenCyc Cyc YAGO Online access and download at Accuracy  97%

Gerhard Weikum ADBIS Oct 1, /53 YAGO Disambiguation & Uncertainty additional harvesting of relations from natural-language texts by info-extraction tools Entity Paris(Myth.)Paris(France)France Person CityCountry subclass Mythological Figure instanceOf Location subclass instanceOf0.9 subclass1.0 subclass 1.0 instanceOf1.0 locatedIn 0.95 “Paris” means0.1 means0.7 “France” means0.9 subclass1.0 “La Grande Nation” means 0.2 capture confidence value for each fact Paris Hilton means0.05 Celebrity instanceOf 0.4 subclass 0.7

Gerhard Weikum ADBIS Oct 1, /53 NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW‘07] queries with regular expressions Ling$xscientist isa hasFirstName | hasLastName $yZhejiang locatedIn * worksFor discovery queries connectedness queries Beng Chin Ooi (coAuthor | advisor) * Thomas MannGoethe * German novelist isa Kiel$xscientist isa bornIn Graph-based search on YAGO-style knowledge bases with built-in ranking based on statistical language model $x Nobel prize hasWon $a diedOn $y hasSon $b diedOn >

Gerhard Weikum ADBIS Oct 1, /53 NAGA: Searching Knowledge q: Fisher isa scientist Fisher isa $x mathematician_ —subClassOf—> scientist_ Alumni_of_Gonville_and_Caius_College,_Cambridge —subClassOf—> alumnus_ "Fisher" —familyNameOf—> Ronald_Fisher Ronald_Fisher —type—> Alumni_of_Gonville_and_Caius_College,_Cambridge Ronald_Fisher —type—> 20th_century_mathematicians "scientist" —means—> scientist_ = Ronald_Fisher = scientist_ $X = alumnus_ = Irving_Fisher = scientist_ $X = social_scientist_ = James_Fisher = scientist_ $X = ornithologist_ = Ronald_Fisher = scientist_ $X = theorist_ = Ronald_Fisher = scientist_ $X = colleague_ = Ronald_Fisher = scientist_ $X = organism_ …

Gerhard Weikum ADBIS Oct 1, /53 NAGA: Searching & Ranking Knowledge q: Fisher isa scientist Fisher isa $x Score: E-13 mathematician_ —subClassOf—> scientist_ "Fisher" —familyNameOf—> Ronald_Fisher Ronald_Fisher —type—> 20th_century_mathematicians "scientist" —means—> scientist_ th_century_mathematicians —subClassOf—> mathematician_ = Ronald_Fisher = scientist_ $X = mathematician_ = Ronald_Fisher = scientist_ $X = statistician_ = Ronald_Fisher = scientist_ $X = president_ = Ronald_Fisher = scientist_ $X = geneticist_ = Ronald_Fisher = scientist_ $X = scientist_ … Online access at

Gerhard Weikum ADBIS Oct 1, /53 Ranking Factors Confidence: Prefer results that are likely to be correct  Certainty of IE  Authenticity and Authority of Sources Informativeness: Prefer results that are likely important May prefer results that are likely new to user  Frequency in answer  Frequency in corpus (e.g. Web)  Frequency in query log Compactness: Prefer results that are tightly connected  Size of answer graph bornIn (Max Planck, Kiel) from „Max Planck was born in Kiel“ (Wikipedia) livesIn (Elvis Presley, Mars) from „They believe Elvis hides on Mars“ (Martian Bloggeria) q: isa (Einstein, $y) isa (Einstein, scientist) isa (Einstein, vegetarian) q: isa ($x, vegetarian) isa (Einstein, vegetarian) isa (Al Nobody, vegetarian) Einstein vegetarian BohrNobel Prize Tom Cruise 1962 isa bornIn diedIn won

Gerhard Weikum ADBIS Oct 1, /53 Summary of Leibniz Approach Hand-crafted knowledge sources are great assets, but expensive, partial, and isolated Great mileage even from informal & semiformal sources Connecting & reconciling different sources gives added value (and sometimes is not even that hard) Challenge: Develop methods for comprehensive, highly accurate mappings across many knowledge sources Cross-lingual Cross-temporal Scalable

Gerhard Weikum ADBIS Oct 1, /53 Planck Approach (Statistical Web) Information Extraction & Harvesting: Gather Entities, Relations, Facts Live with Uncertainty Max Planck ( )

Gerhard Weikum ADBIS Oct 1, /53 Information Extraction (IE): Text to Records Max Planck 4/23, 1858 Kiel Albert Einstein 3/14, 1879 Ulm Mahatma Gandhi 10/2, 1869 Porbandar Person BirthDate BirthPlace... Person ScientificResult Max Planck Quantum Theory Person Collaborator Max Planck Albert Einstein Max Planck Niels Bohr Planck‘s constant  Js Constant Value Dimension combine NLP, pattern matching, lexicons, statistical learning

Gerhard Weikum ADBIS Oct 1, /53 IE Technology: Rules, Patterns, Learning For natural-language text and for heterogeneous sources: NLP techniques (parser, PoS tagging) for tokenization identify patterns (e.g. regular expressions) as features train statistical learners for segmentation and labeling use learned model to automatically tag newly seen input Ian Foster, father of the Grid, talks at the GES conference in Germany on 05/02/07. NPVBNNNPNNINDTPPINNP ADJDTINCD Training data: The WWW conference takes place in Banff in Canada. Today‘s keynote speaker is Dr. Berners-Lee from W3C. The panel in Edinburgh, chaired by Ron Brachman from Yahoo!, … …

Gerhard Weikum ADBIS Oct 1, /53 Knowledge Acquisition from the Web Learn Semantic Relations from Entire Corpora at Large Scale (as exhaustively as possible but with high accuracy) Examples: all cities, all basketball players, all composers headquarters of companies, CEOs of companies, synonyms of proteins birthdates of people, capitals of countries, rivers in cities which musician plays which instruments who discovered or invented what which enzyme catalyzes which biochemical reaction Existing approaches and tools use almost-unsupervised pattern matching and learning: seeds (known facts)  patterns (in text)  (extraction) rule  (new) facts

Gerhard Weikum ADBIS Oct 1, /53 Methods for Web-Scale Fact Extration Example: city (Seattle) in downtown Seattle city (Seattle) Seattle and other towns city (Las Vegas) Las Vegas and other towns plays (Zappa, guitar) playing guitar: … Zappa plays (Davis, trumpet) Davis … blows trumpet seeds  text  rules  new facts Example: city (Seattle) in downtown Seattle in downtown X city (Seattle) Seattle and other towns X and other towns city (Las Vegas) Las Vegas and other townsX and other towns plays (Zappa, guitar) playing guitar: … Zappaplaying Y: … X plays (Davis, trumpet) Davis … blows trumpetX … blows Y Example: city (Seattle) in downtown Seattle in downtown X city (Seattle) Seattle and other towns X and other towns city (Las Vegas) Las Vegas and other towns X and other towns plays (Zappa, guitar) playing guitar: … Zappaplaying Y: … X plays (Davis, trumpet) Davis … blows trumpetX … blows Y Example: city (Seattle) in downtown Seattle in downtown X city (Seattle) Seattle and other towns X and other towns city (Las Vegas) Las Vegas and other townsX and other towns plays (Zappa, guitar) playing guitar: … Zappaplaying Y: … X plays (Davis, trumpet) Davis … blows trumpetX … blows Y in downtown Delhicity(Delhi) Coltrane blows saxplays(C., sax) city(Delhi) plays(Coltrane, sax) city(Delhi) old center of Delhi plays(Coltrane, sax) sax player Coltrane city(Delhi) old center of Delhiold center of X plays(Coltrane, sax) sax player ColtraneY player X Assessment of facts & generation of rules based on statistics Rules can be more sophisticated: playing NN: (ADJ|ADV)* NP & class(head(NP))=person  plays (head(NP), NN)

Gerhard Weikum ADBIS Oct 1, /53 Performance of Web-IE State-of-the-art precision/recall results: Anecdotic evidence: invented (A.G. Bell, telephone) married (Hillary Clinton, Bill Clinton) isa (yoga, relaxation technique) isa (zearalenone, mycotoxin) contains (chocolate, theobromine) contains (Singapore sling, gin) invented (Johannes Kepler, logarithm tables) married (Segolene Royal, Francois Hollande) isa (yoga, excellent way) isa (your day, good one) contains (chocolate, raisins) plays (the liver, central role) makes (everybody, mistakes) relationprecision recall corpus systems countries80% 90% Web KnowItAll cities80% ??? Web KnowItAll scientists60% ??? WebKnowItAll CEOs80% 50% News Snowball, LEILA birthdates80% 70% Wikipedia LEILA instanceOf40% 20% WebText2Onto, LEILA precision value-chain: entities 80%, attributes 70%, facts 60%, events 50%

Gerhard Weikum ADBIS Oct 1, /53 Beyond Surface Learning with LEILA Almost-unsupervised Statistical Learning with Dependency Parsing (Cologne, Rhine), (Cairo, Nile), … (Cairo, Rhine), (Rome, 0911), ( ,  [0..9]*  ), … Paris was founded on an island in the Seine (Paris, Seine) SsPvMVpDs Js DG Js MVp NPVP PPNP PPNP Cologne lies on the banks of the Rhine SsMVpDMcMpDg JsJp NPPPVPNPPPNP People in Cairo like wine from the Rhine valley MpJsOs SpMvpDs Js AN NP PPVPPPNP Limitation of surface patterns: who discovered or invented what “Tesla’s work formed the basis of AC electric power” “Al Gore funded more work for a better basis of the Internet” Learning to Extract Information by Linguistic Analysis [F. Suchanek et al.: KDD’06] We visited Paris last summer. It has many museums along the banks of the Seine. LEILA outperforms other Web-IE methods in precision and recall, but dependency parser is slow

Gerhard Weikum ADBIS Oct 1, /53 IE Efficiency and Accuracy Tradeoffs precision vs. recall: two-stage processing (filter pipeline) 1)recall-oriented harvesting 2)precision-oriented scrutinizing preprocessing indexing: NLP trees & graphs, N-grams, PoS-tag patterns ? exploit ontologies? exploit usage logs ? turn crawl&extract into set-oriented query processing candidate finding efficient phrase, pattern, and proximity queries optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD‘06] IE is cool, but what‘s in it for DB folks? [see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi]

Gerhard Weikum ADBIS Oct 1, /53 Summary of Planck Approach Human text (and speech) is diverse and produced at higher rate than manual high-quality annotations ? Deep NLP and advanced ML are computational bottleneck IE offers reasonably robust and scalable methods for harvesting named entities and binary relations Challenge: Achieve Web-scale IE throughput that can sustain rate of new content production (e.g. blogs) (may need large-scale P2P/Grid) with > 90% accuracy and Wikipedia-like coverage ? Disambiguation (entity matching, record linkage) needed „Joe Hellerstein (UC Berkeley)“ = „Prof. Joseph M. Hellerstein, California)“ „Max Planck Institute“ = „MPI“ ≠ „MPI“ = „Message Passing Institute“

Gerhard Weikum ADBIS Oct 1, /53 Darwin Approach (Social Web) Social Wisdom & Natural Selection: Evolution of (Web 2.0) species Survival of the fittest Charles Darwin ( )

Gerhard Weikum ADBIS Oct 1, /53 „Wisdom of Crowds“ at Work on Web 2.0 Information enrichment & knowledge extraction by humans: Collaborative Recommendations & QA Amazon (product ratings & reviews, recommended products) Netflix: movie DVD rentals  $ 1 Mio. Challenge answers.yahoo, iknow.baidu, etc. Social Tagging and Folksonomies del.icio.us: Web bookmarks and tags flickr: photo annotation, categorization, rating YouTube: same for video Human Computing in Game Form ESP and Google Image Labeler: image tagging Peekaboom: image segmenting and tagging Verbosity: facts from natural-language sentences Online Communities dblife.cs.wisc.edu for database research for language technology Yahoo! Groups, Myspace, Facebook, etc. etc.

Gerhard Weikum ADBIS Oct 1, /53 Social Tagging: Example Flickr (1)

Gerhard Weikum ADBIS Oct 1, /53 Social Tagging: Example Flickr (2)

Gerhard Weikum ADBIS Oct 1, /53 Social Tagging: Example Flickr (3)

Gerhard Weikum ADBIS Oct 1, /53 Social-Tagging Community > 1 Mio. users > 100 Mio. photos > 1 Bio. tags 30% monthly growth Source:

Gerhard Weikum ADBIS Oct 1, /53 ESP Game [Luis von Ahn et al ] taboo: pyramid Louvre museum Paris art played against random, anonymous partner on Internet my labels: reflection your partner has suggested: 3 labels my labels: reflection water your partner has suggested: 7 labels my labels: reflection water Mitterand Mona Lisa your partner has suggested: 11 labels my labels: reflection water Mitterand Mona Lisa metro lignes 7, 14 your partner has suggested: 17 labels my labels: reflection water Mitterand Mona Lisa metro lignes 7, 1 Da Vinci code Congratulations! You scored 1 point! Congratulations! You scored 1 point! Game with a purpose Collects annotations (wisdom) Can exploit tag statistics (crowds) Attracts people, fun to play, some play hours ESP game collected > 10 Mio. tags from > users 5000 people could tag all photos on the Web in 4 weeks (human computing)

Gerhard Weikum ADBIS Oct 1, /53 More Human Computing Verbosity [von Ahn 2006]: Collect common-knowledge facts (relation instances) 2 players: Narrator (N) and Guessor (G) N gives stylized clues: is a kind of …, is used for …, is typically near/in/on …, is the opposite of … random pairing for independence, can build statistics over many games for same concept Peekaboom, Phetch, etc.: locating & tagging objects in images, finding images, etc. incentives to play ? game design for moving up the value-chain ?

Gerhard Weikum ADBIS Oct 1, /53 Dark Side of Social Wisdom Spam (Web spam – not just for anymore): lucky online casino, easy MBA diploma, cheap V!-4-gra, etc.; law suits about „appropriate Google rank“ Truthiness: degree to which something is truthy (not necessarily facty); truthy := property of something you know from your guts Disputes: editorial fights over critical Wikipedia articles; Citizendium: new endeavor with "gentle expert oversight" Dishonesty, Bias, …

Gerhard Weikum ADBIS Oct 1, /53

Gerhard Weikum ADBIS Oct 1, /53

Gerhard Weikum ADBIS Oct 1, /53 The Wisdom of Crowds: PageRank PageRank (PR): links are endorsements & increase page authority, authority is higher if links come from high-authority pages with and equivalent to principal eigenvector: random walk: uniformly random choice of links + random jumps; add bias to transitions and jumps for personal PR, TrustRank, etc. Authority (page q) = stationary prob. of visiting q Social Ranking

Gerhard Weikum ADBIS Oct 1, /53 The Wisdom of Crowds: Beyond PR Typed graphs: data items, users, friends, groups, postings, ratings, queries, clicks, … with weighted edges  spectral analysis of various graphs Evolving over time  tensor analysis users tags docs

Gerhard Weikum ADBIS Oct 1, /53 Decentralized Graph Analysis Decentralized computation in peer-to-peer network with arbitrary, a-priori unknown overlaps of graph fragments Graph spectral analysis applied to: pages, sites, tags, users, groups, queries, clicks, opinions, etc. as nodes assessment and interaction relations as weighted edges can compute various notions of authority, reputation, trust, quality local subgraph 3 local subgraph 1 local sub- graph 2 global graph

Gerhard Weikum ADBIS Oct 1, /53 JXP Algorithm [J.X. Parreira, G. Weikum: WebDB’05, VLDB’06] Decentralized, asynchronous, peer-to-peer algorithm based on theory of Markov-chain aggregation (state lumping) [P.J. Courtois 1977, C.D. Meyer 1988] each peer aggregates non-local part of global graph into „world node“ peers meet randomly, exchange data about their local computations, and iterate their local computations Theorem: authority scores from local computations converge to global scores supported by Minerva system

Gerhard Weikum ADBIS Oct 1, /53 Summary of Darwin Approach Social tagging and social networks (Web 2.0) are potentially valuable knowledge sources Challenges: Design a game that intrigues serious scientists to „semantically“ annotate their scholarly work Develop an analysis method that identifies the „best“ facts, resilient to egoistic and malicious behaviors (incl. coalitions) Games (human computing) are an interesting way of enticing „knowledge input“ and collecting statistics Spectral analysis is a highly versatile tool for rating & ranking that can be extended and scaled by decentralized algorithms

Gerhard Weikum ADBIS Oct 1, /53 Outline Introduction: Search for Knowledge Conclusion Harvesting Knowledge Leibniz Approach Planck Approach Darwin Approach

Gerhard Weikum ADBIS Oct 1, /53 Summary Not covered here: search and ranking graph IR (for ER graphs, RDF, cross-linked XML, etc.) new ranking models (e.g. statistical LM for graphs) efficient and scalable query processing Harvesting knowledge & organizing in semantic DB/graph for scholarly Web, digital libraries, enterprise know-how, online communities, etc. Three roads to knowledge: Leibniz / Semantic Web: ontologies, encyclopedia, etc. Planck / Statistical Web: large-scale IE from text, speech, etc. Darwin / Social Web: wisdom of crowds, tagging, folksonomies

Gerhard Weikum ADBIS Oct 1, /53 Major Challenges Generalize YAGO approach (Wikipedia + WordNet) Methods for comprehensive, highly accurate mappings across many knowledge sources cross-lingual, cross-temporal scalable in size, diversity, number of sources Pursue DB support towards efficient IE (and NLP) Achieve Web-scale IE throughput that can sustain rate of new content production (e.g. blogs) with > 90% accuracy and Wikipedia-like coverage Integrate handcrafted knowledge with NLP/ML-based IE Incorporate social tagging and human computing

Gerhard Weikum ADBIS Oct 1, /53 Potential Synergies among Leibniz, Planck, and Darwin Leibniz Semantic Web Planck Statistical Web Darwin Social Web bootstrap validateemerge communities statistics & feedback knowledge core

Gerhard Weikum ADBIS Oct 1, /53 Thank you !