Gerhard Weikum Max Planck Institute for Informatics Knowledge Harvesting from Text and Web Sources Part 3: Knowledge.

Slides:



Advertisements
Similar presentations
Arnd Christian König Venkatesh Ganti Rares Vernica Microsoft Research Entity Categorization Over Large Document Collections.
Advertisements

Date: 2014/05/06 Author: Michael Schuhmacher, Simon Paolo Ponzetto Source: WSDM’14 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Knowledge-based Graph Document.
CrowdER - Crowdsourcing Entity Resolution
1 gStore: Answering SPARQL Queries Via Subgraph Matching Presented by Guan Wang Kent State University October 24, 2011.
Gerhard Weikum Max Planck Institute for Informatics & Saarland University Semantic Search: from Names and Phrases to.
Linking Entities in #Microposts ROMIL BANSAL, SANDEEP PANEM, PRIYA RADHAKRISHNAN, MANISH GUPTA, VASUDEVA VARMA INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY,
Linked data: P redicting missing properties Klemen Simonic, Jan Rupnik, Primoz Skraba {klemen.simonic, jan.rupnik,
Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2007) Learning for Semantic Parsing Advisor: Hsin-His.
Gerhard Weikum Max Planck Institute for Informatics For a Few Triples More.
Leveraging Data and Structure in Ontology Integration Octavian Udrea 1 Lise Getoor 1 Renée J. Miller 2 1 University of Maryland College Park 2 University.
Building and Analyzing Social Networks Web Data and Semantics in Social Network Applications Dr. Bhavani Thuraisingham February 15, 2013.
1 Social Influence Analysis in Large-scale Networks Jie Tang 1, Jimeng Sun 2, Chi Wang 1, and Zi Yang 1 1 Dept. of Computer Science and Technology Tsinghua.
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
CBLOCK: An Automatic Blocking Mechanism for Large-Scale Deduplication Tasks Ashwin Machanavajjhala Duke University with Anish Das Sarma, Ankur Jain, Philip.
Information Retrieval in Practice
Jinhui Tang †, Shuicheng Yan †, Richang Hong †, Guo-Jun Qi ‡, Tat-Seng Chua † † National University of Singapore ‡ University of Illinois at Urbana-Champaign.
OMAP: An Implemented Framework for Automatically Aligning OWL Ontologies SWAP, December, 2005 Raphaël Troncy, Umberto Straccia ISTI-CNR
Semantic Interoperability Jérôme Euzenat INRIA & LIG France Natasha Noy Stanford University USA.
Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields Yong-Joong Kim Dept. of Computer Science Yonsei.
Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up in.
Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
C OLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT - Presented by Avinash S Bharadwaj ( )
Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Entity.
1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.
DEANNA: Natural Language Questions for the Web of Data Mohamed Yahya † Klaus Berberich † Shady Elbassiousni * Maya Ramanath ‡ Volker Tresp # Gerhard Weikum.
Survey of Semantic Annotation Platforms
Graphical models for part of speech tagging
DBrev: Dreaming of a Database Revolution Gjergji Kasneci, Jurgen Van Gael, Thore Graepel Microsoft Research Cambridge, UK.
A Two Tier Framework for Context-Aware Service Organization & Discovery Wei Zhang 1, Jian Su 2, Bin Chen 2,WentingWang 2, Zhiqiang Toh 2, Yanchuan Sim.
 Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University.
21/11/2002 The Integration of Lexical Knowledge and External Resources for QA Hui YANG, Tat-Seng Chua Pris, School of Computing.
ALIP: Automatic Linguistic Indexing of Pictures Jia Li The Pennsylvania State University.
Mining Reference Tables for Automatic Text Segmentation Eugene Agichtein Columbia University Venkatesh Ganti Microsoft Research.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
LOD for the Rest of Us Tim Finin, Anupam Joshi, Varish Mulwad and Lushan Han University of Maryland, Baltimore County 15 March 2012
Natural Language Questions for the Web of Data Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany Shady Elbassuoni.
Copyright © 2012, SAS Institute Inc. All rights reserved. ANALYTICS IN BIG DATA ERA ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY,
Natural Language Questions for the Web of Data Mohamed Yahya 1, Klaus Berberich 1, Shady Elbassuoni 2 Maya Ramanath 3, Volker Tresp 4, Gerhard Weikum 1.
Graph-based Text Classification: Learn from Your Neighbors Ralitsa Angelova , Gerhard Weikum : Max Planck Institute for Informatics Stuhlsatzenhausweg.
Natural Language Questions for the Web of Data 1 Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany 2 Shady Elbassuoni.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
Using linked data to interpret tables Varish Mulwad September 14,
Natural Language Questions for the Web of Data Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany Shady Elbassuoni.
Strategies for subject navigation of linked Web sites using RDF topic maps Carol Jean Godby Devon Smith OCLC Online Computer Library Center Knowledge Technologies.
2015/12/121 Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Proceeding of the 18th International.
Named Entity Disambiguation on an Ontology Enriched by Wikipedia Hien Thanh Nguyen 1, Tru Hoang Cao 2 1 Ton Duc Thang University, Vietnam 2 Ho Chi Minh.
Image Classification over Visual Tree Jianping Fan Dept of Computer Science UNC-Charlotte, NC
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
DeepDive Model Dongfang Xu Ph.D student, School of Information, University of Arizona Dec 13, 2015.
LINDEN : Linking Named Entities with Knowledge Base via Semantic Knowledge Date : 2013/03/25 Resource : WWW 2012 Advisor : Dr. Jia-Ling Koh Speaker : Wei.
A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.
1 Scalable Probabilistic Databases with Factor Graphs and MCMC Michael Wick, Andrew McCallum, and Gerome Miklau VLDB 2010.
Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
Ariel Fuxman, Panayiotis Tsaparas, Kannan Achan, Rakesh Agrawal (2008) - Akanksha Saxena 1.
Dan Roth University of Illinois, Urbana-Champaign 7 Sequential Models Tutorial on Machine Learning in Natural.
Sofus A. Macskassy Fetch Technologies
Nam Khanh Tran L3S Research Center, Leibniz Universität Hannover
Probabilistic Data Management
Result of Ontology Alignment with RiMOM at OAEI’06
Adaptive entity resolution with human computation
Property consolidation for entity browsing
Discovering Emerging Entities with Ambiguous Names
Effective Entity Recognition and Typing by Relation Phrase-Based Clustering
Information Networks: State of the Art
Summarization for entity annotation Contextual summary
Entity Linking Survey
Presentation transcript:

Gerhard Weikum Max Planck Institute for Informatics Knowledge Harvesting from Text and Web Sources Part 3: Knowledge Linking

Quiz Time 3-2 How many days do you need to visit all Shangri-La places on this planet? Answer: 365 Source: geonames.org 3-2

Quiz Time 3-3 How many days do you need to visit all Shangri-La places on this planet? 3-3

Linkied Data: RDF Triples on the Web 30 Bio. triples 500 Mio. links 3-4

owl:sameAs rdf.freebase.com/ns/ en.rome owl:sameAs data.nytimes.com/ Coord geonames.org/ / city_of_rome N 43° 12' 46'' W 75° 27' 20'' dbpprop:citizenOf dbpedia.org/resource/ Rome rdf:type rdf:subclassOf yago/ wordnet:Actor rdf:type rdf:subclassOf yago/ wikicategory:ItalianComposer yago/ wordnet: Artist prop:actedIn imdb.com/name/nm / Linked RDF Triples on the Web prop: composedMusicFor imdb.com/title/tt / dbpedia.org/resource/ Ennio_Morricone 3-5

owl:sameAs rdf.freebase.com/ns/ en.rome_ny owl:sameAs data.nytimes.com/ Coord geonames.org/ / city_of_rome N 43° 12' 46'' W 75° 27' 20'' dbpprop:citizenOf dbpedia.org/resource/ Rome rdf:type rdf:subclassOf yago/ wordnet:Actor rdf:type rdf:subclassOf yago/ wikicategory:ItalianComposer yago/ wordnet: Artist prop:actedIn imdb.com/name/nm / Linked RDF Triples on the Web prop: composedMusicFor imdb.com/title/tt / dbpedia.org/resource/ Ennio_Morricone Referential data quality? Hand-crafted sameAs links? generated sameAs links? ? ? ? 3-6

RDF Entities on the Web 3-7

RDF Entities on the Web 3-8

Entity-Name Ambiguity 3-9

Entities in HTML

Entity Markup in HTML: Towards Standardized Microformats

Entity Markup in HTML: Towards Standardized Microformats

Web Page in Standard HTML Jane Doe Professor Whitworth Institute 405 Whitworth Seattle WA (425) Jane's home page: janedoe.com Graduate students: Alice Jones Bob Smith 3-13

Web Page in HTML with Microdata Jane Doe Professor Whitworth Institute 405 N. Whitworth Seattle, WA (425) Jane's home page: janedoe.com Graduate students: Alice Jones Bob Smith

Web-of-Data vs. Web-of-Contents 3-15 Critical for knowledge linkage: entity name ambiguity  more structured data combined with text  boosted by knowledge harvesting methods

Embedding RDFa in Web Contents May 2, 2011 Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both Classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th. <html … May 2, 2011 Maestro Morricone <a rel="sameAs" resource="dbpedia…/Ennio_Morricone "/> … Smetana Hall … <span property="rdf:type" resource="yago:performance"> The concert will feature … <span property="event:date" content=" "> July 1 RDF data and Web contents need to be interconnected RDFa & microformats provide the mechanism Need ways of creating more embedded RDF triples! 3-16

Outline... Entity-Name Disambiguation Motivation  Wrap-up Mapping Questions into Queries Entity Linkage 3-17

Named-Entity Disambiguation Harry fought with you know who. He defeats the dark lord. 1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger) 2) co-reference resolution: link to preceding NP (trained classifier over linguistic features) 3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB) Three NLP tasks: Harry Potter Dirty Harry Lord Voldemort The Who (band) Prince Harry of England 3-18

Sergio talked to Ennio about Eli‘s role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio‘s trilogy of western films. Named Entity Disambiguation D5 Overview May 30, 2011 Sergio means Sergio_Leone Sergio means Serge_Gainsbourg Ennio means Ennio_Antonelli Ennio means Ennio_Morricone Eli means Eli_(bible) Eli means ExtremeLightInfrastructure Eli means Eli_Wallach Ecstasy means Ecstasy_(drug) Ecstasy means Ecstasy_of_Gold trilogy means Star_Wars_Trilogy trilogy means Lord_of_the_Rings trilogy means Dollars_Trilogy … … … KB Eli (bible) Eli Wallach Mentions (surface names) Entities (meanings) Dollars Trilogy Lord of the Rings Star Wars Trilogy Benny Andersson Benny Goodman Ecstasy of Gold Ecstasy (drug) ? 3-19

Sergio talked to Ennio about Eli‘s role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio‘s trilogy of western films. Mention-Entity Graph Dollars Trilogy Lord of the Rings Star Wars Ecstasy of Gold Ecstasy (drug) Eli (bible) Eli Wallach KB+Stats weighted undirected graph with two types of nodes Popularity (m,e): freq(e|m) length(e) #links(e) Similarity (m,e): cos/Dice/KL (context(m), context(e)) bag-of-words or language model: words, bigrams, phrases 3-20

Sergio talked to Ennio about Eli‘s role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio‘s trilogy of western films. Mention-Entity Graph Dollars Trilogy Lord of the Rings Star Wars Ecstasy of Gold Ecstasy (drug) Eli (bible) Eli Wallach KB+Stats weighted undirected graph with two types of nodes Popularity (m,e): freq(e|m) length(e) #links(e) Similarity (m,e): cos/Dice/KL (context(m), context(e)) joint mapping 3-21

Mention-Entity Graph 22 / 20 Dollars Trilogy Lord of the Rings Star Wars Ecstasy of Gold Ecstasy(drug) Eli (bible) Eli Wallach KB+Stats weighted undirected graph with two types of nodes Popularity (m,e): freq(m,e|m) length(e) #links(e) Similarity (m,e): cos/Dice/KL (context(m), context(e)) Coherence (e,e‘): dist(types) overlap(links) overlap (anchor words) Sergio talked to Ennio about Eli‘s role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio‘s trilogy of western films. 3-22

Mention-Entity Graph 23 / 20 KB+Stats weighted undirected graph with two types of nodes Popularity (m,e): freq(m,e|m) length(e) #links(e) Similarity (m,e): cos/Dice/KL (context(m), context(e)) Coherence (e,e‘): dist(types) overlap(links) overlap (anchor words) American Jews film actors artists Academy Award winners Metallica songs Ennio Morricone songs artifacts soundtrack music spaghetti westerns film trilogies movies artifacts Dollars Trilogy Lord of the Rings Star Wars Ecstasy of Gold Ecstasy (drug) Eli (bible) Eli Wallach Sergio talked to Ennio about Eli‘s role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio‘s trilogy of western films. 3-23

Mention-Entity Graph 24 / 20 KB+Stats weighted undirected graph with two types of nodes Popularity (m,e): freq(m,e|m) length(e) #links(e) Similarity (m,e): cos/Dice/KL (context(m), context(e)) Coherence (e,e‘): dist(types) overlap(links) overlap (anchor words) _the_Ugly Dollars Trilogy Lord of the Rings Star Wars Ecstasy of Gold Ecstasy (drug) Eli (bible) Eli Wallach Sergio talked to Ennio about Eli‘s role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio‘s trilogy of western films. 3-24

Mention-Entity Graph 25 / 20 KB+Stats Popularity (m,e): freq(m,e|m) length(e) #links(e) Similarity (m,e): cos/Dice/KL (context(m), context(e)) Coherence (e,e‘): dist(types) overlap(links) overlap (anchor words) Metallica on Morricone tribute Bellagio water fountain show Yo-Yo Ma Ennio Morricone composition The Magnificent Seven The Good, the Bad, and the Ugly Clint Eastwood University of Texas at Austin For a Few Dollars More The Good, the Bad, and the Ugly Man with No Name trilogy soundtrack by Ennio Morricone weighted undirected graph with two types of nodes Dollars Trilogy Lord of the Rings Star Wars Ecstasy of Gold Ecstasy (drug) Eli (bible) Eli Wallach Sergio talked to Ennio about Eli‘s role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio‘s trilogy of western films. 3-25

Collective Learning with Prob. Factor Graphs (Chakrabarti et al.: KDD‘09): model P[m|e] by similarity and P[e1|e2] by coherence consider likelihood of P[m1 … mk | e1 … ek] factorize by all m-e pairs and e1-e2 pairs use hill-climbing, LP, etc. for solution Different Approaches Combine Popularity, Similarity, and Coherence Features (Cucerzan: EMNLP‘07, Milne/Witten: CIKM‘08): for sim (context(m), context(e)): consider surrounding mentions and their candidate entities use their types, links, anchors as features of context(m) set m-e edge weights accordingly use greedy methods for solution 3-26

Joint Mapping Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB Compute high-likelihood mapping (ML or MAP) or dense subgraph such that: each m is connected to exactly one e (or at most one e)

Mention-Entity Popularity Weights Collect hyperlink anchor-text / link-target pairs from Wikipedia redirects Wikipedia links between articles Interwiki links between Wikipedia editions Web links pointing to Wikipedia articles … Build statistics to estimate P[entity | name] Need dictionary with entities‘ names: full names: Arnold Alois Schwarzenegger, Los Angeles, Microsoft Corporation short names: Arnold, Arnie, Mr. Schwarzenegger, New York, Microsoft, … nicknames & aliases: Terminator, City of Angels, Evil Empire, … acronyms: LA, UCLA, MS, MSFT role names: the Austrian action hero, Californian governor, the CEO of MS, … … plus gender info (useful for resolving pronouns in context): Bill and Melinda met at MS. They fell in love and he kissed her. [Milne/Witten 2008, Spitkovsky/Chang 2012] 3-28

Mention-Entity Similarity Edges Extent of partial matchesWeight of matched words Precompute characteristic keyphrases q for each entity e: anchor texts or noun phrases in e page with high PMI: Match keyphrase q of candidate e in context of mention m Compute overall similarity of context(m) and candidate e „Metallica tribute to Ennio Morricone“ The Ecstasy piece was covered by Metallica on the Morricone tribute album. 3-29

Entity-Entity Coherence Edges Precompute overlap of incoming links for entities e1 and e2 Alternatively compute overlap of anchor texts for e1 and e2 or overlap of keyphrases, or similarity of bag-of-words, or … Optionally combine with type distance of e1 and e2 (e.g., Jaccard index for type instances) For special types of e1 and e2 (locations, people, etc.) use spatial or temporal distance 3-30

Coherence Graph Algorithm Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) Greedy approximation: iteratively remove weakest entity and its edges Keep alternative solutions, then use local/randomized search [J. Hoffart et al.: EMNLP‘11]

Coherence Graph Algorithm Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) Greedy approximation: iteratively remove weakest entity and its edges Keep alternative solutions, then use local/randomized search [J. Hoffart et al.: EMNLP‘11]

Coherence Graph Algorithm Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) Greedy approximation: iteratively remove weakest entity and its edges Keep alternative solutions, then use local/randomized search [J. Hoffart et al.: EMNLP‘11]

Coherence Graph Algorithm Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) Greedy approximation: iteratively remove weakest entity and its edges Keep alternative solutions, then use local/randomized search [J. Hoffart et al.: EMNLP‘11]

Alternative: Random Walks for each mention run random walks with restart (like personalized PR with jumps to start mention(s)) rank candidate entities by stationary visiting probability very efficient, decent accuracy    3-35   

AIDA: Accurate Online Disambiguation

AIDA: Accurate Online Disambiguation

AIDA: Very Difficult Example 3-38

AIDA: Very Difficult Example 3-39

AIDA: Accurate Online Disambiguation

AIDA: Accurate Online Disambiguation

AIDA: Accurate Online Disambiguation

AIDA: Accurate Online Disambiguation

Some NED Online Tools for J. Hoffart et al.: EMNLP 2011, VLDB P. Ferragina, U. Scaella: CIKM R. Isele, C. Bizer: VLDB Reuters Open Calais S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD D. Milne, I. Witten: CIKM perhaps more some use Stanford NER tagger for detecting mentions

NED: Experimental Evaluation Benchmark: Extended CoNLL 2003 dataset: 1400 newswire articles originally annotated with mention markup (NER), now with NED mappings to Yago and Freebase difficult texts: … Australia beats India …  Australian_Cricket_Team … White House talks to Kreml …  President_of_the_USA … EDS made a contract with …  HP_Enterprise_Services Results: Best: AIDA method with prior+sim+coh + robustness test 82% recall, 87% mean average precision Comparison to other methods, see paper J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP

Ongoing Research & Remaining Challenges More efficient graph algorithms (multicore, etc.) Short and difficult texts: tweets, headlines, etc. fictional texts: novels, song lyrics, etc. incoherent texts Disambiguation beyond entity names: coreferences: pronouns, paraphrases, etc. common nouns, verbal phrases (general WSD) Leverage deep-parsing structures, leverage semantic types Example: Page played Kashmir on his Gibson subj obj mod Allow mentions of unknown entities, mapped to null Structured Web data: tables and lists 3-46

General Word Sense Disambiguation {songwriter, composer} {cover, perform} {cover, report, treat} {cover, help out} Which song writers covered ballads written by the Stones ? 3-47

Handling Out-of-Wikipedia Entities last.fm /Nick_Cave/Weeping_Song wikipedia.org /Weeping_(song) wikipedia.org/ Nick_Cave last.fm /Nick_Cave/O_Children last.fm /Nick_Cave/Hallelujah wikipedia /Hallelujah_(L_Cohen) wikipedia /Hallelujah_Chorus wikipedia /Children_(2011 film) wikipedia.org/ Good_Luck_Cave Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song. 3-48

Handling Out-of-Wikipedia Entities last.fm /Nick_Cave/Weeping_Song wikipedia.org /Weeping_(song) wikipedia.org/ Nick_Cave last.fm /Nick_Cave/O_Children last.fm /Nick_Cave/Hallelujah wikipedia /Hallelujah_(L_Cohen) wikipedia /Hallelujah_Chorus wikipedia /Children_(2011 film) wikipedia.org/ Good_Luck_Cave Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song. Gunung Mulu National Park Sarawak Chamber largest underground chamber eerie violin Bad Seeds No More Shall We Part Bad Seeds No More Shall We Part Murder Songs Leonard Cohen Rufus Wainwright Shrek and Fiona Nick Cave & Bad Seeds Harry Potter 7 movie haunting choir Nick Cave Murder Songs P.J. Harvey Nick and Blixa duet Messiah oratorio George Frideric Handel Dan Heymann apartheid system South Korean film 3-49

Handling Out-of-Wikipedia Entities Characterize all entities (and mentions) by sets of keyphrases Entity coherence then becomes: keyphrases overlap, no need for href link data For each mention add a „self“ candidate: out-of-KB entity with keyphrases computed by Web search Efficient comparison of two keyphrase-sets  two-stage hashing, using min-hash sketches and LSH KORE (e,f) =  p  e,q  f PO(p,q) 2  min(  e (p),  f (q))  p  e  e (p) +  q  f  f (q) entities e,f with phrase weights  PO(p,q) =  w  p  q min(  p (w),  q (w)) phrases p,q with word weights   w  p  q max(  p (w),  q (w)) [J. Hoffart et al.: CIKM‘12] 3-50

Variants of NED at Web Scale How to run this on big batch of 1 Mio. input texts?  partition inputs across distributed machines, organize dictionary appropriately, …  exploit cross-document contexts How to deal with inputs from different time epochs?  consider time-dependent contexts, map to entities of proper epoch (e.g. harvested from Wikipedia history) How to handle Web-scale inputs (100 Mio. pages) restricted to a set of interesting entities? (e.g. tracking politicians and companies) Tools can map short text onto entities in a few seconds 3-51

Outline... Entity-Name Disambiguation Motivation  Wrap-up Mapping Questions into Queries Entity Linkage  3-52

Word Sense Disambiguation for Question-to-Query Translation Select ?p Where { ?p type person. ?p actedIn Casablanca_(film). ?p isMarriedTo ?w. ?w type writer. ?w bornIn Rome. } “Who played in Casablanca and was married to a writer born in Rome?” Translation with WSD Translation with WSD Question SPARQL KB Answer ?p ?w 3-53 QA system DEANNA [M. Yahya et al.: EMNLP‘12] yago-naga/deanna/

DEANNA in a Nutshell DEANNA Question SPARQL KB Answers Phrase detection Phrase mapping Phrase mapping Dependency detection Dependency detection Joint Disambig. Joint Disambig. Query Generation Query Generation 3-54

DEANNA in a Nutshell DEANNA Question SPARQL KB Answers Phrase detection Phrase mapping Phrase mapping Dependency detection Dependency detection Joint Disambig. Joint Disambig. Query Generation Query Generation 3-55

DEANNA in a Nutshell DEANNA Question SPARQL KB Answers Phrase detection Phrase mapping Phrase mapping Dependency detection Dependency detection Joint Disambig. Joint Disambig. Query Generation Query Generation 3-56

DEANNA in a Nutshell DEANNA Question SPARQL KB Answers Phrase detection Phrase mapping Phrase mapping Dependency detection Dependency detection Joint Disambig. Joint Disambig. Query Generation Query Generation 3-57

DEANNA Components DEANNA Question SPARQL KB Phrase detection Phrase mapping Phrase mapping Dependency detection Dependency detection Joint Disambig. Joint Disambig. Query Generation Query Generation Answers

Phrase Detection Casablanca played played in Who married married to was married to a writer Concepts: entities & classes: dictionary-based Relations: mainly use Reverb [Fader et al: EMNLP’11]: V | VP | VW*P … was/VBD married/VBN to/TO a/DT… ConceptPhrase Casablanca Casablanca, Morocco Casablanca_(film) Casablanca the film Casablanca_(film) Casablanca … … 3-59

DEANNA Components DEANNA Question SPARQL KB Phrase detection Phrase mapping Phrase mapping Dependency detection Dependency detection Joint Disambig. Joint Disambig. Query Generation Query Generation

Phrase Mapping Casablanca played played in e:White_House e:Casablanca e:Casablanca_(film) e:Played_(film) r:actedIn r:hasMusicalRole Concepts: entities & classes: dictionary-based Relations: Dictionary -based ConceptPhrase Casablanca Casablanca, Morocco Casablanca_(film) Casablanca the film Casablanca_(film) Casablanca Played_(film) Played RelationPhrase actedIn acted in actedIn played in hasMusicalRole plays hasMusicalRole mastered 3-61

DEANNA Components DEANNA Question SPARQL KB Phrase detection Phrase mapping Phrase mapping Dependency detection Dependency detection Joint Disambig. Joint Disambig. Query Generation Query Generation

Dependency Detection Look for specific patterns in dependency graph [de Marneffe et al. LREC’06] a writer was born born Rome q1q1 c:writer r:bornInPlace r:bornOnDate e:Max_Born e:Born_(film) e:Sydne_Rome e:Rome 3-63

Disambiguation Graph q1q1 q2q2 q3q3 a writer Casablanca played played in Who married married to was married to was born born Rome c:writer r:bornInPlace r:bornOnDate e:Max_Born e:Born_(film) e:Sydne_Rome e:Rome e:White_House e:Casablanca e:Casablanca_(film) e:Played_(film) r:actedIn r:hasMusicalRole c:person e:Married_(series) c: married_person r:isMarriedTo q-nodes Phrase-nodes Semantic nodes 3-64

DEANNA Components DEANNA Question SPARQL KB Phrase detection Phrase mapping Phrase mapping Dependency detection Dependency detection Joint Disambig. Joint Disambig. Query Generation Query Generation

Joint Disambiguation - ILP ILP: Integer Linear Programming maximize α Σ i,j w i,j Y i,j + β Σ k,l v k,l Z k,l + … Subject to:  No token in multiple phrases,  Triples observe type constraints, … 3-66

Joint Disambiguation – Objective α Σ i,j w i,j Y i,j + β Σ k,l v k,l Z k,l Semantic nodes q1q1 a writer was born born Rome c:writer r:bornInPlace r:bornOnDate e:Max_Born e:Born_(film) e:Sydne_Rome e:Rome q-nodes Phrase nodes Coherence Edges Similarity Edges Prior 3-67

Joint Disambiguation – Objective α Σ i,j w i,j Y i,j + β Σ k,l v k,l Z k,l Semantic nodes Coherence q1q1 a writer was born born Rome c:writer r:bornInPlace r:bornOnDate e:Max_Born e:Born_(film) e:Sydne_Rome e:Rome q-nodes Phrase nodes Similarity Edges Coherence Edges 3-68

Joint Disambiguation – Constraints A phrase node can be assigned to only one semantic node: Casablanca e:White_House e:Casablanca e:Casablanca_(film) Phrase nodes Semantic nodes a Y a,1 Y a,2 Y a,3 α Σ i,j w i,j Y i,j + β Σ k,l v k,l Z k,l 3-69

Joint Disambiguation – Constraints Classes translate to type-constrained variables  Every semantic triple should have a class to join & project! person actedIn Casablanca_(film) ▼ ?x type person. ?x actedIn Casablanca_(film) q1q1 a writer was born Rome c:writer r:bornInPlace r:bornOnDate e:Sydne_Rome e:Rome q-nodes e:The_Writer (magazine) Phrase nodes Semantic nodes 3-70

DEANNA Components DEANNA Question SPARQL KB Phrase detection Phrase mapping Phrase mapping Dependency detection Dependency detection Joint Disambig. Joint Disambig. Query Generation Query Generation

Structured Query Generation SELECT ?p WHERE { ?w type writer. ?w bornIn Rome. ?p type person. ?p actedIn Casablanca_(film). ?p isMarriedTo ?w } q1q1 q2q2 q3q3 a writer Casablanca played in Who was married to was born Rome c:writer r:bornIn e:Rome e:Casablanca_(film) r:actedIn c:person r:isMarriedTo 3-72

Outline... Entity-Name Disambiguation Motivation  Wrap-up Mapping Questions into Queries Entity Linkage   3-73

Entity Linkage for the Web of Data owl:sameAs rdf.freebase.com/ns/ en.rome_ny owl:sameAs data.nytimes.com/ Coord geonames.org/ / city_of_rome N 43° 12' 46'' W 75° 27' 20'' dbpprop:citizenOf dbpedia.org/resource/ Rome rdf:type rdf:subclassOf yago/ wordnet:Actor rdf:type rdf:subclassOf yago/ wikicategory:ItalianComposer yago/ wordnet: Artist prop:actedIn imdb.com/name/nm / prop: composedMusicFor imdb.com/title/tt / dbpedia.org/resource/ Ennio_Morricone sameAs links ? Where? How? ? ? ? 30 Bio. triples 500 Mio. links 3-74

Record Linkage (Entity Resolution) Susan B. Davidson Peter Buneman University of Pennsylvania Yi Chen record 1record N Issues in … Int. Conf. on Very Large Data Bases O.P. Buneman S. Davison U Penn Y. Chen Issues in … VLDB Conf. Y. Davidson Penn Station S. Chen Issues in … XLDB Conference record 2 P. Baumann S. Davidson Penn State Cheng Y. Issues in … PVLDB record 3 … Sean Penn Halbert L. Dunn: Record Linkage. American Journal of Public Health H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science, I.P. Fellegi, A.B. Sunter: A Theory of Record Linkage. J. of American Statistical Soc., Find equivalence classes of entities, and records, based on: similarity of values (edit distance, n-gram overlap, etc.) joint agreement of linkage  similarity joins, grouping/clustering, collective learning, etc.  often domain-specific customization (similarity measures etc.) 3-75

Entity Linkage via Markov Logic Susan B. Davidson Peter Buneman University of Pennsylvania Yi Chen record 1record N Issues in … Int. Conf. on Very Large Data Bases O.P. Buneman S. Davison U Penn Y. Chen Issues in … VLDB Conf. Y. Davidson Penn Station S. Chen Issues in … XLDB Conference record 2 P. Baumann S. Davidson Penn State Cheng Y. Issues in … PVLDB record 3 … Find equivalence classes of entities, and records, based on: similarity of values (edit distance, n-gram overlap, etc.) joint agreement of linkage  similarity joins, grouping/clustering, collective learning, etc. Sean Penn Halbert L. Dunn: Record Linkage. American Journal of Public Health H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science, I.P. Fellegi, A.B. Sunter: A Theory of Record Linkage. J. of American Statistical Soc., prob. / uncertain rules: sameTitle(x,y)  sameAuths(x,y)  sameVenue(x,y)  sameAs(x,y) sameTitle(x,y)  sameAuths(x,y)  sameAffil(x,y)  sameAs(x,y) overlapAuths(x,y)  sameAffil(x,y)  sameAuths(x,y) sameAs(rec1.auth1, rec2.auth1) [0.2] sameAs(rec1.auth1, rec2.auth2) [0.9] … specify in Markov Logic or as factor graph generate MRF (or …) and solve by MCMC (or …) (Singla/Domingos: ICDM’06, Hall/Sutton/McCallum:KDD’08) 3-76

sameAs-Link Test across Sources LOD source 1LOD source 2 sameAs ? ? ? ? ? eiei ejej similarity: sim (e i, e j ) coherence: coh (x  N(e i ), y  N(e j )) neighborhoods: N(e i ), N(e j ) sameAs (e i, e j )  sim (e i, e j ) ≥ …   x,y coh(x,y) ≥ … record linkage problem 3-77

sameAs-Link Generation across Sources LOD source 1LOD source 2 LOD source 3 ekek sameAs ? ejej eiei … … … 3-78

sameAs-Link Generation across Sources LOD source 1LOD source 2 LOD source 3 ekek sameAs ? ejej eiei sim(e i, e j ): likelihood of being equivalent, mapped to [-1,1] coh(x, y): likelihood of being mentioned together, mapped to [-1,1] 0-1 decision variables: X ij … X jk … X ik … objective function:  ij ( X ij sim(e i,e j ) + X ij  x  N i, y  N j coh(x,y) ) +  jk ( … ) +  ik ( … ) = max! constraints:  j X ij  1 for all i … (1  X ij ) + (1  X jk )  (1  X ik ) for all i, j, k … Joint Mapping ILP model or prob. factor graph or … Use your favorite solver How? at Web scale ??? 3-79

Similarity Flooding Graph with record / entity pairs as nodes (sameAs candidates) and edges connecting related pairs: R(x,y) and S(u,w) and sameAs candidates (x,u), (y,w)  edge between (x,u) and (y,w) Node weights: belief strength in sameAs(x,u) Edge weights: degree of relatedness Iterate until convergence: propagate node weights to neighbors new node weight is linear combination of inputs Related to belief propagation algorithms, label propagation, etc. 3-80

Blocking of Match Candidates Avoid computing O(n 2 ) similarities between records / entities Group potentially matching records Run more accurate & more expensive method per group at risk of missing some matches Iterative Blocking: distribute found matches to other blocks, then repeat per-block runs Multi-type Joint Resolution blocks of different record types (author, venue, etc.) propagate matches to other types, then repeat runs Name Zip 1 John Doe John Doe Jon Foe Jane Foe Jane Fog Group by zip code: {1,4,5} and {2,3}  sameAs(4,5), sameAs(2,3) Group by 1st char of lastname: {1,2} and {3,4,5}  sameAs(1,2), sameAs(4,5) 3-81

Iterative Blocking for Joint Resolution with Multiple Entity Types [Whang et al. 2012] PublicationsAuthorsVenues heuristics for constructing efficient execution plans exploiting „influence graph“ after round 1 after round

RiMOM Method Risk Minimization Based Ontology Matching Method for joint matching of concepts (entities, classes) & properties (relations) [Juanzi Li et al.;TKDE‘09] Strategies using variety of matching criteria: Linguistic-based: edit distance context vector distance … Structure-based: similarity flooding … keg.cs.tsinghua.edu.cn/project/RiMOM/ 3-83

COMA++ Framework [E. Rahm et al.] Joint schema alignment and entity matching Comprehensive architecture with many plug-ins for customizing to specific application Blocked matchers parallelizable on Map-Reduce platform dbs.uni-leipzig.de/Research/coma.html/ 3-84

PARIS Method [F. Suchanek et al. 2012] webdam.inria.fr/paris/ Probabilistic Alignment of Relations, Instances, and Schema: joint reasoning on sameEntity, sameRelation, sameClass with direct probabilistic assessment P[literal1  literal2] = … same constant value P[r1  r2] = … sub-relation P[e1  e2] = … same entity P[c1  c2] = … sub-class Matching entities of DBpedia with YAGO: 90% precision, 73% recall, after 4 iterations, 5 h run-time Iterate through probabilistic equations Empirically converges to fixpoint 3-85

PARIS Method [F. Suchanek et al. 2012] webdam.inria.fr/paris/ P[literal1  literal2] = … based on similarity and co-occurrence P[x  y] = (1   r(x,u),r(y,w) (1  fun(r  1 )  P[u  w])) if relations were already aligned same entity   r(x,u) (1  fun(r)   r(y,w) (1  P[u  w]))) considering negative evidence fun(r) = #x:  y: r(x,y) #x,y: r(x,y)) degree to which r Is a function where 3-86 P[Shanri-La  Zhongdian] = … fun(bornIn -1 ) P[Jet Li  Li Lianjie]

PARIS Method [F. Suchanek et al. 2012] webdam.inria.fr/paris/ P[s  r]: P[s  r] = #x,u: s(x,u)  r(x,u) #x,u: s(x,u) if entities were already resolved P[s  r] = with same-entity probabilities  s(x,u) (1   r(y,w) (1  P[x  y]  P[u  w]))  s(x,u) (1   y,w (1  P[x  y]  P[u  w])) sub-relation 3-87

PARIS Method [F. Suchanek et al. 2012] webdam.inria.fr/paris/ P[x  y] = (1   s(x,u),r(y,w) (1  P[s  r]  fun(s  1 )  P[u  w])  with sub-relation probabily same entity revisited (1  P[s  r]  fun(r  1 )  P[u  w]))   s(x,u),r(y,w) (1  P[s  r]  fun(s)   r(y,w) (1  P[u  w]))  considering negative evidence (1  P[s  r]  fun(r)   r(y,w) (1  P[u  w])) 3-88

PARIS Method [F. Suchanek et al. 2012] webdam.inria.fr/paris/ P[c  d]: P[c  d] = #x type(x,c))  type(x,d) #x: type(x,c) if entities were already resolved P[c  d] = with same-entity probabilities  x:type(x,c) (1   y:type(y,d) (1  P[x  y])) #x: type(x,c) sub-class 3-89

Partitioned MLN Method V. Rastogi et al. 2011] Use Markov Logic Network for entity resolution Partition MLN with replication of nodes so that: Each node has its neighborhood in the same partition Repeat local computation: run MLN inference via MCMC on each partition (in parallel) message passing: exchange beliefs (on sameAs) among partitions with overlapping node sets Until convergence R1: sim(x,y)  sameAuthor(x,y) R2: sim(x,y)  coAuthor(x,a)  coAuthor(y,b)  s ameAuthor(a,b)  sameAuthor(x,y) 3-90

LINDA: Linked Data Alignment at Scale [C. Böhm et al. 2012] uses context sim and joint inference to process sameAs matrix with transitivity and other constraints alternates between setting sameAs and recomputing sim puts promising candidate pairs in priority queue queue is partitioned and processing parallelized Experiment with BTC+ dataset: 3 Bio. quads 345 Mio. triples 95 Mio. URIs Result after 30 h run-time: 12.3 Mio. sameAs 66% precision > 80% for Dbpedia-Yago 3-91

Cross-Lingual Linking Source: Z. Wang et al.: WWW‘12 + simpler than monolingual: natural equivalences, interwiki links  harder than monolingual: different terminologies & structures Z. Wang et al. WWW‘12: factor-graph learning  200,000 sameAs T. Nguyen et al. VLDB‘12: sim features & LSI  infobox mappings en.wikipedia.org: 3.5 Mio. articles baike.baidu.com: 4 Mio. articles 3-92

Challenges Remaining Entity linkage is at the heart of semantic data integration ! More than 50 years of research, still some way to go! Benchmarks: OAEI Ontology Alignment & Instance Matching: oaei.ontologymatching.orgoaei.ontologymatching.org TAC KBP Entity Linking: TREC Knowledge Base Acceleration: trec-kba.orgtrec-kba.org Highly related entities with ambiguous names George W. Bush (jun.) vs. George H.W. Bush (sen.) Out-of-Wikipedia entities with sparse context Enterprise data (perhaps combined with Web2.0 data) Entities with very noisy context (in social media) Records with complex DB / XML / OWL schemas 3-93

TREC Task: Knowledge Base Acceleration Goal: assist Wikipedia / KB editors recommend key citations as evidence of truth recommend infobox structure and categories recommend entity links and external links 3-94

TREC Task: Knowledge Base Acceleration 

Outline... Entity-Name Disambiguation Motivation  Wrap-up Mapping Questions into Queries Entity Linkage    3-96

Take-Home Lessons Web of Linked Data is great 100‘s of KB‘s with 30 Bio. triples and 500 Mio. links mostly reference data, dynamic maintenance is bottleneck connection with Web of Contents needs improvement Entity detection and disambiguation is key for creating sameAs links in text (RDFa, microformats) for machine reading, semantic authoring, knowledge base acceleration, … Integrated methods for aligning entities, classes and relations Linking entities across KB‘s is advancing combine popularity, similarity, and coherence extend towards general WSD (e.g. for QA) NED methods come close to human quality 3-97

Open Problems and Grand Challenges Automatic and continuously maintained sameAs links for Web of Linked Data with high accuracy & coverage Combine algorithms and crowdsourcing for NED & ER Robust disambiguation of entities, relations and classes with active learning, minimizing human effort or cost/accuracy Relevant for question answering & question-to-query translation Key building block for KB building and maintenance Entity name disambiguation in difficult situations Short and noisy texts about long-tail entities in social media 3-98

End of Part 3 Questions? 3-99

J. Hoffart, M. A. Yosef, I. Bordino, et al.: Robust Disambiguation of Named Entities in Text. EMNLP 2011 J. Hoffart et al.: KORE: Keyphrase Overlap Relatedness for Entity Disambiguation. CIKM 2012 R.C. Bunescu, M. Pasca: Using Encyclopedic Knowledge for Named entity Disambiguation. EACL 2006 S. Cucerzan: Large-Scale Named Entity Disambiguation Based on Wikipedia Data. EMNLP 2007 D.N. Milne, I.H. Witten: Learning to link with wikipedia. CIKM 2008 S. Kulkarni et al.: Collective annotation of Wikipedia entities in web text. KDD 2009 G.Limaye et al: Annotating and Searching Web Tables Using Entities, Types and Relationships. PVLDB 2010 A. Rahman, V. Ng: Coreference Resolution with World Knowledge. ACL 2011 L. Ratinov et al.: Local and Global Algorithms for Disambiguation to Wikipedia. ACL 2011 M. Dredze et al.: Entity Disambiguation for Knowledge Base Population. COLING 2010 P. Ferragina, U. Scaiella: TAGME: on-the-fly annotation of short text fragments. CIKM 2010 X. Han, L. Sun, J. Zhao: Collective entity linking in web text: a graph-based method. SIGIR 2011 M. Tsagkias, M. de Rijke, W. Weerkamp.: Linking Online News and Social Media. WSDM 2011 J. Du et al.: Towards High-Quality Semantic Entity Detection over Online Forums. SocInfo 2011 V.I. Spitkovsky, A.X. Chang: A Cross-Lingual Dictionary for English Wikipedia Concepts, LREC 2012 J.R. Finkel, T. Grenager, C. Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. ACL 2005 V. Ng: Supervised Noun Phrase Coreference Research: The First Fifteen Years. ACL 2010 S. Singh, A. Subramanya, F.C.N. Pereira, A. McCallum: Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models. ACL 2011 T. Lin et al.: No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities. EMNLP 2012 A. Rahman, V. Ng: Inducing Fine-Grained Semantic Classes via Hierarchical Classification. COLING 2010 X. Ling, D.S. Weld: Fine-Grained Entity Recognition. AAAI 2012 R. Navigli: Word sense disambiguation: A survey. ACM Comput. Surv. 41(2), 2009 M. Yahya et al.: Natural Language Questions for the Web of Data. EMNLP 2012 S. Shekarpour: Automatically Transforming Keyword Queries to SPARQL on Large-Scale KBs. ISWC 2011 Recommended Readings: Disambiguation 3-100

Recommended Readings: Linked Data and Entity Linkage T. Heath, C. Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan&Claypool, 2011 A. Hogan, et al.: An empirical survey of Linked Data conformance. J. Web Sem. 14, 2012 H. Glaser, A. Jaffri, I.C. Millard: Managing Co-Reference on the Semantic Web. LDOW 2009 J. Volz, C.Bizer, M.Gaedke, G.Kobilarov : Discovering and Maintaining Links on the Web of Data. ISWC 2009 F. Naumann, M. Herschel: An Introduction to Duplicate Detection. Morgan&Claypool, 2010 H.Köpcke et al: Learning-Based Approaches for Matching Web Data Entities. IEEE Internet Computing 2010 H. Köpcke et al.: Evaluation of entity resolution approaches on real-world match problems. PVLDB 2010 S. Melnik, H. Garcia-Molina, E. Rahm: Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. ICDE 2002 S. Chaudhuri, V. Ganti, R. Motwani: Robust Identification of Fuzzy Duplicates. ICDE 2005 S.E. Whang et al.: Entity Resolution with Iterative Blocking. SIGMOD 2009 S.E. Whang, H. Garcia-Molina: Joint Entity Resolution. ICDE 2012 L. Kolb, A. Thor, E. Rahm: Load Balancing for MapReduce-based Entity Resolution. ICDE 2012 J.Li, J.Tang, Y.Li, Q.Luo: RiMOM: A dynamic multistrategy ontology alignment framework. TKDE 21(8), 2009 P. Singla, P. Domingos: Entity Resolution with Markov Logic. ICDM 2006 I.Bhattacharya, L. Getoor: Collective Entity Resolution in Relational Data. TKDD 1(1), 2007 R. Hall, C.A. Sutton, A. McCallum: Unsupervised deduplication using cross-field dependencies. KDD 2008 V. Rastogi, N. Dalvi, M. Garofalakis: Large-Scale Collective Entity Matching. PVLDB 2011 F. Suchanek et al.: PARIS: Probabilistic Alignment of Relations, Instances, and Schema. PVLDB 2012 Z. Wang, J. Li, Z. Wang, J. Tang: Cross-lingual knowledge linking across wiki knowledge bases. WWW 2012 T. Nguyen et al.: Multilingual Schema Matching for Wikipedia Infoboxes. PVLDB 2012 A.Hogan et al.: Scalable and distributed methods for entity matching. J. Web Sem. 10, 2012 C. Böhm et al.: LINDA: Distributed Web-of-Data-Scale Entity Matching. CIKM 2012 J. Wang, T. Kraska, M. Franklin, J. Feng: CrowdER: Crowdsourcing Entity Resolution. PVLDB

Knowledge Harvesting: Overall Take-Home Lessons KB‘s are great opportunity in the big-data era: revive old AI vision, make it real & large-scale ! challenging, but high pay-off Strong success story on entities and classes Many opportunities remaining: temporal knowledge, spatial, visual, commonsense vertical domains: health, music, travel, … Good progress on relational facts Methods for open-domain relation discovery Search and ranking: Combine facts (SPO triples) with witness text Extend SPARQL, LM‘s for ranking, UI unclear Entity linking: From names in text to entities in KB sameAs between entities in different KB‘s / DB‘s 1-102

Knowledge Harvesting: Research Opportunities & Challenges Explore & exploit synergies between semantic, statistical, & social Web methods: statistical evidence + logical consistency + wisdom of the crowd ! For DB / AI / IR / NLP / Web researchers: efficiency & scalability consistency constraints & reasoning search and ranking deep linguistic patterns & statistics text (& speech) disambiguation killer app for uncertain data management knowledge-base life-cycle and more 1-103

de: vielen Dank en: thank you fr: Merci beaucoup es: muchas gracias cmn: 非常谢谢你 ru: Большое спасибо tib: ཐུགས་རྗེ་ཆེ་། yue: 唔該 wu: 谢谢侬 expression of gratitude dai: ขอบคุณ 3-104