
1 Gerhard Weikum Max Planck Institute for Informatics & Saarland University Semantic Search: from Names and Phrases to Entities and Relations

2 Acknowledgements

3 Big Picture: Opportunities Now ! KB Population Info Extraction Semantic Authoring Entity Linkage Web of Data Web of Users & Contents Very Large Knowledge Bases Semantic Docs Disambiguation

4 Big Picture: Opportunities Now! KB Population, Info Extraction, Semantic Authoring, Entity Linkage, Web of Data, Web of Users & Contents, Very Large Knowledge Bases, Semantic Docs, Disambiguation. This talk: how do we search this world of knowledge, data, and text (and cope with ambiguity)? For knowledge harvesting, see talks at Collège de France and at VLDB School in Kunming.

5 Web of Data: RDF, Tables, Microdata. YAGO, Cyc, TextRunner/ReVerb, WikiTaxonomy/WikiNet, SUMO, ConceptNet 5, BabelNet, ReadTheWeb. 30 billion SPO triples (RDF) and growing.

6 Web of Data: RDF, Tables, Microdata. 30 billion SPO triples (RDF) and growing.
YAGO: 10M entities in 350K classes; 120M facts for 100 relations; 100 languages; 95% accuracy.
DBpedia: 4M entities in 250 classes; 500M facts for 6000 properties; live updates.
Freebase: 25M entities in 2000 topics; 100M facts for 4000 properties; powers Google knowledge graph.
Example triples:
Ennio_Morricone type composer
Ennio_Morricone type GrammyAwardWinner
composer subclassOf musician
Ennio_Morricone bornIn Rome
Rome locatedIn Italy
Ennio_Morricone created Ecstasy_of_Gold
Ennio_Morricone wroteMusicFor The_Good,_the_Bad,_and_the_Ugly
Sergio_Leone directed The_Good,_the_Bad,_and_the_Ugly

7 Linked RDF Triples on the Web: 500 million links. Resources shown: dbpedia.org/resource/Ennio_Morricone, dbpedia.org/resource/Rome, rdf.freebase.com/ns/en.rome, data.nytimes.com/, geonames.org/ /roma (Coord: N 41° 54' 10'' E 12° 29' 2''), yago/wordnet:Actor, yago/wordnet:Artist, yago/wikicategory:ItalianComposer, imdb.com/name/nm /, imdb.com/title/tt /. Link predicates shown: owl:sameAs, rdf:type, rdfs:subclassOf, dbpprop:citizenOf, prop:actedIn, prop:composedMusicFor.

8 Embedding (RDF) Microdata in HTML Pages May 2, 2011 Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both Classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th. … Smetana Hall … The concert will feature … July 1 Supported by RDFa and microformats like schema.org

9 Outline: ✓ Opportunities Now; Semantic Search Today; Entity Name Disambiguation; Question Answering; Disambiguation Reloaded; Wrap-Up

10 Semantic Search Today (1)


15 Semantic Search Today (2) Select ?x Where { ?x type composer [western movie]. ?x wasBornIn ?y. ?y locatedIn Europe. }

16 Semantic Search Today (2) Select ?x Where { ?x type composer. ?x participatedIn ?y. ?y type western_film. }
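
The queries on slides 15 and 16 are conjunctions of triple patterns that share variables. The following is a minimal Python sketch of how such patterns could be evaluated over the example triples from slide 6. The names `match` and `query` are illustrative assumptions, not part of any system mentioned in the talk, and a real engine would use indexes and join ordering instead of this nested scan.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Example SPO triples taken from slide 6.
KB: List[Triple] = [
    ("Ennio_Morricone", "type", "composer"),
    ("Ennio_Morricone", "type", "GrammyAwardWinner"),
    ("composer", "subclassOf", "musician"),
    ("Ennio_Morricone", "bornIn", "Rome"),
    ("Rome", "locatedIn", "Italy"),
    ("Ennio_Morricone", "created", "Ecstasy_of_Gold"),
    ("Sergio_Leone", "directed", "The_Good,_the_Bad,_and_the_Ugly"),
]


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it against an earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            # Constant: must match exactly.
            return None
    return extended


def query(patterns: List[Triple], kb: List[Triple]) -> List[Binding]:
    """Evaluate a conjunction of triple patterns by nested-loop joins."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        next_bindings: List[Binding] = []
        for binding in bindings:
            for triple in kb:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Composers born in a place that is located in Italy.
results = query(
    [("?x", "type", "composer"), ("?x", "bornIn", "?y"), ("?y", "locatedIn", "Italy")],
    KB,
)
print(results)
```

The keyword conditions in square brackets on slide 15 are not covered here; they would need a text index next to the triples.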

17 Semantic Search Today (3)


20 Semantic Search Today (4)

21 Key problem in semantic search: diversity and ambiguity of names and phrases!

22 Outline: ✓ Opportunities Now; ✓ Semantic Search Today; Entity Name Disambiguation; Question Answering; Disambiguation Reloaded; Wrap-Up

23 Three Different NLP Problems. Example: "Harry fought with you know who. He defeats the dark lord."
1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger)
2) co-reference resolution: link to preceding NP (trained classifier over linguistic features)
3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB)
Candidate entities: Harry Potter, Dirty Harry, Prince Harry of England, Lord Voldemort, The Who (band)

24 Named Entity Disambiguation. Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."
Mentions (surface names) and candidate entities (meanings) from the KB:
Sergio: Sergio_Leone, Serge_Gainsbourg
Ennio: Ennio_Antonelli, Ennio_Morricone
Eli: Eli_(bible), ExtremeLightInfrastructure, Eli_Wallach
Ecstasy: Ecstasy_(drug), Ecstasy_of_Gold
trilogy: Star_Wars_Trilogy, Lord_of_the_Rings, Dollars_Trilogy
Other KB entities shown: Benny Andersson, Benny Goodman, …

25 Mention-Entity Graph (KB+Stats): weighted undirected graph with two types of nodes. Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films." Candidate entities: Eli (bible), Eli Wallach; Ecstasy (drug), Ecstasy of Gold; Star Wars, Lord of the Rings, Dollars Trilogy.
Popularity(m,e): freq(e|m), length(e), #links(e)
Similarity(m,e): cos/Dice/KL(context(m), context(e)) over bag-of-words or language model: words, bigrams, phrases

26 Mention-Entity Graph (KB+Stats): weighted undirected graph with two types of nodes; joint mapping. Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films." Candidate entities: Eli (bible), Eli Wallach; Ecstasy (drug), Ecstasy of Gold; Star Wars, Lord of the Rings, Dollars Trilogy.
Popularity(m,e): freq(e|m), length(e), #links(e)
Similarity(m,e): cos/Dice/KL(context(m), context(e))

27 Mention-Entity Graph (KB+Stats): weighted undirected graph with two types of nodes. Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films." Candidate entities: Eli (bible), Eli Wallach; Ecstasy (drug), Ecstasy of Gold; Star Wars, Lord of the Rings, Dollars Trilogy.
Popularity(m,e): freq(e|m), length(e), #links(e)
Similarity(m,e): cos/Dice/KL(context(m), context(e))
Coherence(e,e'): dist(types), overlap(links), overlap(anchor words)

28 Mention-Entity Graph (KB+Stats): weighted undirected graph with two types of nodes. Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films." Candidate entities: Eli (bible), Eli Wallach; Ecstasy (drug), Ecstasy of Gold; Star Wars, Lord of the Rings, Dollars Trilogy.
Popularity(m,e): freq(e|m), length(e), #links(e)
Similarity(m,e): cos/Dice/KL(context(m), context(e))
Coherence(e,e'): dist(types), overlap(links), overlap(anchor words)
Example types used for coherence: American Jews, film actors, artists, Academy Award winners; Metallica songs, Ennio Morricone songs, artifacts, soundtrack music; spaghetti westerns, film trilogies, movies, artifacts

29 Mention-Entity Graph (KB+Stats): weighted undirected graph with two types of nodes. Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films." Candidate entities: Eli (bible), Eli Wallach; Ecstasy (drug), Ecstasy of Gold; Star Wars, Lord of the Rings, Dollars Trilogy.
Popularity(m,e): freq(e|m), length(e), #links(e)
Similarity(m,e): cos/Dice/KL(context(m), context(e))
Coherence(e,e'): dist(types), overlap(links), overlap(anchor words)

30 Mention-Entity Graph (KB+Stats): weighted undirected graph with two types of nodes. Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films." Candidate entities: Eli (bible), Eli Wallach; Ecstasy (drug), Ecstasy of Gold; Star Wars, Lord of the Rings, Dollars Trilogy.
Popularity(m,e): freq(e|m), length(e), #links(e)
Similarity(m,e): cos/Dice/KL(context(m), context(e))
Coherence(e,e'): dist(types), overlap(links), overlap(anchor words)
Example anchor words and keyphrases: Metallica on Morricone tribute; Bellagio water fountain show; Yo-Yo Ma; Ennio Morricone composition / The Magnificent Seven; The Good, the Bad, and the Ugly; Clint Eastwood; University of Texas at Austin / For a Few Dollars More; The Good, the Bad, and the Ugly; Man with No Name trilogy; soundtrack by Ennio Morricone

31 Joint Mapping: Build a mention-entity graph or joint-inference factor graph from knowledge and statistics in the KB. Compute a high-likelihood mapping (ML or MAP) or a dense subgraph such that each m is connected to exactly one e (or at most one e).

32 Coherence Graph Algorithm: Compute a dense subgraph that maximizes the minimum weighted degree among entity nodes, such that each m is connected to exactly one e (or at most one e). Greedy approximation: iteratively remove the weakest entity and its edges. Keep alternative solutions, then use local/randomized search. [J. Hoffart et al.: EMNLP'11]
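
The greedy step on slide 32 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published AIDA implementation: it only performs the "remove the weakest entity" loop, it does not keep alternative solutions or run the local/randomized search, and every weight in the example is invented for demonstration. The function name `greedy_disambiguate` is hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple


def greedy_disambiguate(
    me_weights: Dict[Tuple[str, str], float],
    ee_weights: Dict[FrozenSet[str], float],
) -> Dict[str, str]:
    """Greedily drop the weakest entity until each mention keeps one candidate."""
    candidates: Dict[str, Set[str]] = {}
    for mention, entity in me_weights:
        candidates.setdefault(mention, set()).add(entity)

    def degree(entity: str, alive: Set[str]) -> float:
        # Weighted degree: mention-entity edges plus coherence edges
        # to entities that are still in the graph.
        d = sum(w for (_, e), w in me_weights.items() if e == entity)
        d += sum(w for pair, w in ee_weights.items() if entity in pair and pair <= alive)
        return d

    while True:
        alive = set().union(*candidates.values())
        # An entity may be removed only if no mention loses its last candidate.
        removable = [
            e for e in alive
            if all(len(c) > 1 for c in candidates.values() if e in c)
        ]
        if not removable:
            break
        weakest = min(removable, key=lambda e: (degree(e, alive), e))
        for c in candidates.values():
            c.discard(weakest)

    # If a mention still has several candidates, fall back to its heaviest edge.
    return {
        m: max(c, key=lambda e: (me_weights[(m, e)], e))
        for m, c in candidates.items()
    }


# Invented weights: popularity favours the wrong entities,
# coherence among the film-related entities corrects that.
me_example = {
    ("Eli", "Eli_(bible)"): 0.6, ("Eli", "Eli_Wallach"): 0.4,
    ("Ecstasy", "Ecstasy_(drug)"): 0.7, ("Ecstasy", "Ecstasy_of_Gold"): 0.3,
    ("trilogy", "Star_Wars_Trilogy"): 0.6, ("trilogy", "Dollars_Trilogy"): 0.4,
}
ee_example = {
    frozenset({"Eli_Wallach", "Ecstasy_of_Gold"}): 0.8,
    frozenset({"Eli_Wallach", "Dollars_Trilogy"}): 0.9,
    frozenset({"Ecstasy_of_Gold", "Dollars_Trilogy"}): 0.9,
}
mapping = greedy_disambiguate(me_example, ee_example)
print(mapping)
```

With these made-up numbers the coherence edges outweigh the popularity prior, so the film-related entities survive the pruning.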

33 Mention-Entity Popularity Weights [Milne/Witten 2008, Spitkovsky/Chang 2012]
Need dictionary with entities' names:
full names: Arnold Alois Schwarzenegger, Los Angeles, Microsoft Corp.
short names: Arnold, Arnie, Mr. Schwarzenegger, New York, Microsoft, …
nicknames & aliases: Terminator, City of Angels, Evil Empire, …
acronyms: LA, UCLA, MS, MSFT
role names: the Austrian action hero, Californian governor, CEO of MS, …
… plus gender info (useful for resolving pronouns in context): "Bill and Melinda met at MS. They fell in love and he kissed her."
Collect hyperlink anchor-text / link-target pairs from: Wikipedia redirects; Wikipedia links between articles; Interwiki links between Wikipedia editions; Web links pointing to Wikipedia articles; …
Build statistics to estimate P[entity | name]
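
Estimating P[entity | name] from anchor-text / link-target pairs amounts to relative-frequency counting. A minimal sketch in Python, assuming the pairs have already been extracted; the function name `estimate_prior` and the sample pairs (including the second candidate for "Arnold") are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def estimate_prior(anchor_pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Relative frequency of each link target per anchor text: P[entity | name]."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for name, entity in anchor_pairs:
        counts[name][entity] += 1
    return {
        name: {entity: c / sum(ctr.values()) for entity, c in ctr.items()}
        for name, ctr in counts.items()
    }


# Made-up (anchor text, link target) pairs.
pairs = [
    ("Arnold", "Arnold_Schwarzenegger"),
    ("Arnold", "Arnold_Schwarzenegger"),
    ("Arnold", "Arnold_Schwarzenegger"),
    ("Arnold", "Arnold_Palmer"),
    ("LA", "Los_Angeles"),
    ("LA", "Los_Angeles"),
]
prior = estimate_prior(pairs)
print(prior["Arnold"])
```

In practice the counts would be smoothed and the name dictionary normalized (case, punctuation), which this sketch leaves out.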

34 Mention-Entity Similarity Edges: Precompute characteristic keyphrases q for each entity e: anchor texts or noun phrases in e's page with high PMI. Match keyphrase q of candidate e in the context of mention m, taking into account the extent of partial matches and the weight of matched words. Compute the overall similarity of context(m) and candidate e. Example: keyphrase "Metallica tribute to Ennio Morricone" against the context "The Ecstasy piece was covered by Metallica on the Morricone tribute album."
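
The exact scoring formula of slide 34 did not survive in this transcript, so the Python sketch below is only a stand-in for the idea of partial keyphrase matching: it scores a keyphrase by the weighted fraction of its words that occur in the mention context. The function names, the uniform default word weight of 1.0, and the simple tokenizer are all assumptions.

```python
import re
from typing import Dict, Iterable, List, Optional


def tokens(text: str) -> List[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def keyphrase_match(
    keyphrase: str, context: str, weights: Optional[Dict[str, float]] = None
) -> float:
    """Weighted fraction of keyphrase words that occur in the context."""
    weights = weights or {}
    context_words = set(tokens(context))
    phrase_words = tokens(keyphrase)
    total = sum(weights.get(w, 1.0) for w in phrase_words)
    matched = sum(weights.get(w, 1.0) for w in phrase_words if w in context_words)
    return matched / total if total else 0.0


def similarity(
    keyphrases: Iterable[str], context: str, weights: Optional[Dict[str, float]] = None
) -> float:
    """Overall similarity: sum of the partial-match scores of all keyphrases."""
    return sum(keyphrase_match(q, context, weights) for q in keyphrases)


context = "The Ecstasy piece was covered by Metallica on the Morricone tribute album."
score = keyphrase_match("Metallica tribute to Ennio Morricone", context)
print(score)
```

Passing per-word weights (for example derived from PMI or IDF) lets rare, characteristic words such as "Metallica" count for more than function words such as "to".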

35 Entity-Entity Coherence Edges: Precompute the overlap of incoming links for entities e1 and e2. Alternatively, compute the overlap of anchor texts for e1 and e2, or the overlap of keyphrases, or the similarity of bag-of-words, or … Optionally combine with the type distance of e1 and e2 (e.g., Jaccard index for type instances). For special types of e1 and e2 (locations, people, etc.), use spatial or temporal distance.
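
As a small illustration of link-overlap coherence, the following Python sketch uses the Jaccard index over sets of incoming links. Jaccard is chosen here only because it is the simplest overlap measure (the slide mentions it for type instances); the function name and the in-link sets are invented for demonstration.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |a ∩ b| / |a ∪ b|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up sets of pages that link to each entity.
inlinks: Dict[str, Set[str]] = {
    "Ecstasy_of_Gold": {"Ennio_Morricone", "Dollars_Trilogy", "Metallica"},
    "Eli_Wallach": {"Dollars_Trilogy", "Sergio_Leone", "Ennio_Morricone"},
    "Ecstasy_(drug)": {"MDMA", "Rave"},
}

coh_film = jaccard(inlinks["Ecstasy_of_Gold"], inlinks["Eli_Wallach"])
coh_drug = jaccard(inlinks["Ecstasy_(drug)"], inlinks["Eli_Wallach"])
print(coh_film, coh_drug)
```

The same function applies unchanged to sets of anchor texts, keyphrases, or type instances.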

36 AIDA: Accurate Online Disambiguation

37 AIDA: Accurate Online Disambiguation

38 AIDA: Very Difficult Example

39 AIDA: Very Difficult Example

40 AIDA: Accurate Online Disambiguation

41 AIDA: Accurate Online Disambiguation

42 Some NED Online Tools: J. Hoffart et al.: EMNLP 2011, VLDB 2011 (https://d5gate.ag5.mpi-sb.mpg.de/webaida/); P. Ferragina, U. Scaiella: CIKM; R. Isele, C. Bizer: VLDB; Reuters Open Calais; S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD; D. Milne, I. Witten: CIKM; perhaps more. Some use the Stanford NER tagger for detecting mentions.

43 NED: Experimental Evaluation. Benchmark: extended CoNLL 2003 dataset: 1400 newswire articles, originally annotated with mention markup (NER), now with NED mappings to YAGO and Freebase. Difficult texts: "… Australia beats India …" → Australian_Cricket_Team; "… White House talks to Kremlin …" → President_of_the_USA; "… EDS made a contract with …" → HP_Enterprise_Services. Results: best is the AIDA method with prior+sim+coh + robustness test: 82% recall, 87% mean average precision. For a comparison to other methods, see the paper: J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP

44 Ongoing Research & Remaining Challenges
More efficient graph algorithms (multicore, etc.)
Short and difficult texts: tweets, headlines, etc.; fictional texts: novels, song lyrics, etc.; incoherent texts
Disambiguation beyond entity names: coreferences (pronouns, paraphrases, etc.); common nouns, verbal phrases (general WSD)
Leverage deep-parsing structures, leverage semantic types. Example: "Page played Kashmir on his Gibson" (dependency labels: subj, obj, mod)
Allow mentions of unknown entities, mapped to null
Structured Web data: tables and lists

45 Variants of NED at Web Scale. Tools can map short text onto entities in a few seconds. How to run this on a big batch of 1 million input texts? → partition inputs across distributed machines, organize dictionary appropriately, … → exploit cross-document contexts. How to handle Web-scale inputs (100 million pages) restricted to a set of interesting entities (e.g. tracking politicians and companies)?

46 Outline: ✓ Opportunities Now; ✓ Semantic Search Today; ✓ Entity Name Disambiguation; Question Answering; Disambiguation Reloaded; Wrap-Up

47 Deep Question Answering. Example clues:
"99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain"
"This town is known as 'Sin City' & its downtown is 'Glitter Gulch'"
"William Wilkinson's 'An Account of the Principalities of Wallachia and Moldavia' inspired this author's most famous novel"
"As of 2010, this is the only former Yugoslav republic in the EU"
Components shown: question classification & decomposition; knowledge back-ends; YAGO.
D. Ferrucci et al.: Building Watson. AI Magazine, Fall. IBM Journal of R&D 56(3/4), 2012: This is Watson.

48 Semantic Keyword Search. Need to map (groups of) keywords onto entities & relationships based on name-entity similarities/probabilities [Ilyas et al. SIGMOD'10].
q: composer Rome scores westerns
Candidate meanings shown: Media Composer video editor; composer (creator of music); Rome (Italy); Rome (NY); Lazio Roma; AS Roma; goal in football; film music; western movies; western world; Western Digital; Western (airline); Western (NY)
Candidate relationships shown: … born in …; … plays for …; … used in …; … recorded at …

49 Natural Language Questions are Natural. Example: "Who composed scores for westerns and is from Rome?" Translate the question into a SPARQL query: dependency parsing to decompose the question; mapping of question units onto entities, classes, relations. Map results into a tabular or visual presentation or speech.

50 From Questions to Queries. NL question: "Who composed scores for westerns and is from Rome?" Dependency parsing exposes the structure of the question → "triploids" (sub-cues): "Who composed scores", "scores for westerns", "is from Rome".

51 From Triploids to Triples. Question: "Who composed scores for westerns and is from Rome?" Triploids: "Who composed scores", "scores for westerns", "Who is from Rome". Triples:
?x composed scores
?x bornIn Rome
scores contributesTo ?y
?y type westernMovie
?x type composer
?x composed ?s
?s contributesTo ?y
?s type music

52 Pattern Dictionary for Relations [N. Nakashole et al.: EMNLP 2012]. Problem: cope with language diversity & ambiguity. Example: composed …, wrote …, created …, … WordNet-style dictionary/taxonomy for relational phrases based on SOL patterns (syntactic-lexical-ontological). Relational phrases are typed. Relational phrases can be synonymous: "graduated from" ⇔ "obtained degree in * from"; "and $PRP ADJ advisor" ⇔ "under the supervision of". One relational phrase can subsume another: "wife of" ⇒ "spouse of".

53 PATTY: Pattern Taxonomy for Relations [N. Nakashole et al.: EMNLP 2012, demo at VLDB 2012]. SOL patterns with 4 million instances, derived from large data (Wikipedia, NYT, ClueWeb) by scalable sequence mining. Accessible at:

54 Disambiguation Mapping for Triploids. Question: "Who composed scores for westerns and is from Rome?" Phrases: Who; composed scores; scores for westerns; is from Rome. Question nodes: q1, q2, q3, q4. Candidate semantic items: e:Rome (Italy), e:Lazio Roma, c:person, c:musician, e:WHO, r:created, r:wroteComposition, r:wroteSoftware, c:soundtrack, r:soundtrackFor, r:shootsGoalFor, r:bornIn, r:actedIn, c:western movie, e:Western Digital. Weighted edges (coherence, similarity, etc.). Combinatorial optimization by ILP (with type constraints etc.).

55 Relaxing Overconstrained Queries, with extended SPARQL-FullText: SPOX quad patterns (S. Elbassuoni et al.: CIKM'10, ESWC'11, SIGIR'12)
Select ?p Where { ?p composed ?s. ?s type music. ?s for ?m. ?m type movie. ?p bornIn Rome. }
Select ?p Where { ?p composed ?s. ?s type music. ?s for ?m. ?m type movie [western]. ?p bornIn Rome. }
Select ?p Where { ?p ?rel1 ?s [composed]. ?s type music. ?s ?rel2 ?m. ?m type movie [western]. ?p bornIn Rome. }
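
One relaxation step from slide 55, replacing a relation constant by a variable and optionally keeping the original relation name as a keyword condition, can be sketched in Python as follows. The quad representation (S, P, O, keywords) and the helper names `relax_predicate` and `render` are assumptions made for this illustration, not the syntax or API of the cited SPARQL-FullText work.

```python
from typing import List, Tuple

# A quad pattern: subject, predicate, object, and optional keyword conditions (the X part).
Quad = Tuple[str, str, str, List[str]]


def relax_predicate(
    patterns: List[Quad], index: int, var: str, keep_keyword: bool = True
) -> List[Quad]:
    """Replace the predicate of one pattern by a variable.

    With keep_keyword=True the original predicate name becomes a keyword
    condition, so matches are still steered towards the intended relation.
    """
    s, p, o, keywords = patterns[index]
    relaxed = list(patterns)
    relaxed[index] = (s, var, o, keywords + [p] if keep_keyword else list(keywords))
    return relaxed


def render(select_var: str, patterns: List[Quad]) -> str:
    """Render quad patterns in the notation used on the slide."""
    parts = []
    for s, p, o, keywords in patterns:
        text = f"{s} {p} {o}"
        if keywords:
            text += " [" + ", ".join(keywords) + "]"
        parts.append(text + ".")
    return f"Select {select_var} Where {{ " + " ".join(parts) + " }"


strict: List[Quad] = [
    ("?p", "composed", "?s", []),
    ("?s", "type", "music", []),
    ("?s", "for", "?m", []),
    ("?m", "type", "movie", ["western"]),
    ("?p", "bornIn", "Rome", []),
]
step1 = relax_predicate(strict, 0, "?rel1")
step2 = relax_predicate(step1, 2, "?rel2", keep_keyword=False)
print(render("?p", step2))
```

Applied to the second query on the slide, the two steps reproduce the third, most relaxed query.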

56 Preliminary Results (M. Yahya et al.: WWW‘12, EMNLP‘12)

57 Outline: ✓ Opportunities Now; ✓ Semantic Search Today; ✓ Entity Name Disambiguation; ✓ Question Answering; Disambiguation Reloaded; Wrap-Up

58 Disambiguation Mapping [M. Yahya et al.: EMNLP'12]. Question: "Who composed scores for westerns and is from Rome?" Phrases: Who; composed scores; scores for westerns; is from Rome. Question nodes: q1, q2, q3, q4. Candidate semantic items: e:Rome (Italy), e:Lazio Roma, c:person, c:musician, e:WHO, r:created, r:wroteComposition, r:wroteSoftware, c:soundtrack, r:soundtrackFor, r:shootsGoalFor, r:bornIn, r:actedIn, c:western movie, e:Western Digital. Weighted edges (coherence, similarity, etc.).
Variables: Selection $X_i$, Assignment $Y_{ij}$, Joint Mapping $Z_{kl}$

59 Disambig. Mapping: Objective Function. Question: "Who composed scores for westerns and is from Rome?" Phrases: Who; composed scores; scores for westerns; is from Rome. Question nodes: q1, q2, q3, q4. Candidate semantic items: e:Rome (Italy), e:Lazio Roma, c:person, c:musician, e:WHO, r:created, r:wroteComposition, r:wroteSoftware, c:soundtrack, r:soundtrackFor, r:shootsGoalFor, r:bornIn, r:actedIn, c:western movie, e:Western Digital. Weighted edges (coherence, similarity, etc.) with weights $w_{ij}$ and $v_{kl}$.
Variables: Selection $X_i$, Assignment $Y_{ij}$, Joint Mapping $Z_{kl}$
maximize $\alpha \sum_{i,j} w_{ij} Y_{ij} + \beta \sum_{k,l} v_{kl} Z_{kl} + \dots$
subject to:
1) $Y_{ij} \le X_i$ for all $i,j$
2) $\sum_j Y_{ij} \le 1$ for all $i$
3) $Z_{kl} \le \sum_i Y_{ik}$ and $Z_{kl} \le \sum_j Y_{jl}$ for all $k,l$
4) $X_i, Y_{ij}, Z_{kl} \in \{0,1\}$

60 Disambig. Mapping: Constraints. Question: "Who composed scores for westerns and is from Rome?" Phrases: Who; composed scores; scores for westerns; is from Rome. Question nodes: q1, q2, q3, q4. Candidate semantic items: e:Rome (Italy), e:Lazio Roma, c:person, c:musician, e:WHO, r:created, r:wroteComposition, r:wroteSoftware, c:soundtrack, r:soundtrackFor, r:shootsGoalFor, r:bornIn, r:actedIn, c:western movie, e:Western Digital. Weighted edges (coherence, similarity, etc.) with weights $w_{ij}$ and $v_{kl}$.
Variables: Selection $X_i$, Assignment $Y_{ij}$, Joint Mapping $Z_{kl}$, Selection $Q_{hi}$
maximize $\alpha \sum_{i,j} w_{ij} Y_{ij} + \beta \sum_{k,l} v_{kl} Z_{kl} + \dots$
subject to:
5) $Q_{hi} = 1 \Rightarrow \sum_g Q_{hg} = 3$ for all $h,i$
6) $X_i + X_g \le 1$ for all mutually exclusive $i,g$
7) $Q_{hi} = 1 \Rightarrow \sum_{g,j} Q_{hg} Y_{gj} = 1$ for relation nodes $j$

61 Disambig. Mapping: Type Constraints. Question: "Who composed scores for westerns and is from Rome?" Phrases: Who; composed scores; scores for westerns; is from Rome. Question nodes: q1, q2, q3, q4. Candidate semantic items: e:Rome (Italy), e:Lazio Roma, c:person, c:musician, e:WHO, r:created, r:wroteComposition, r:wroteSoftware, c:soundtrack, r:soundtrackFor, r:shootsGoalFor, r:bornIn, r:actedIn, c:western movie, e:Western Digital. Weighted edges (coherence, similarity, etc.) with weights $w_{ij}$ and $v_{kl}$.
Variables: Selection $X_i$, Assignment $Y_{ij}$, Joint Mapping $Z_{kl}$, Selection $Q_{hi}$
maximize $\alpha \sum_{i,j} w_{ij} Y_{ij} + \beta \sum_{k,l} v_{kl} Z_{kl} + \dots$
subject to:
8) $Y_{ij} = 1$ and $j$ is a relation node and $Z_{kj} = 1$ and $Z_{jl} = 1$ $\Rightarrow$ $\mathrm{domain}(j) \in \mathrm{types}(k)$ and $\mathrm{range}(j) \in \mathrm{types}(l)$
ILP optimizers like Gurobi solve this in 1 or 2 seconds.
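
To make the objective concrete, here is a tiny Python sketch that enumerates all assignments instead of calling an ILP solver. It covers only the assignment and coherence parts (each phrase maps to at most one semantic item, and a coherence edge counts when both of its endpoints are chosen) and ignores the q-node and type constraints. The phrase-to-candidate grouping, all weights, the coefficients `alpha` and `beta`, and the function name `solve` are invented for illustration; for real problem sizes an ILP optimizer such as Gurobi would be used, as the slide notes.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple


def solve(
    candidates: Dict[str, List[str]],
    w: Dict[Tuple[str, str], float],
    v: Dict[Tuple[str, str], float],
    alpha: float = 1.0,
    beta: float = 1.0,
) -> Tuple[Dict[str, Optional[str]], float]:
    """Exhaustively maximize alpha * sum(w * Y) + beta * sum(v * Z)."""
    phrases = list(candidates)
    best_score = float("-inf")
    best: Dict[str, Optional[str]] = {}
    # Each phrase picks one candidate or None (at most one assignment per phrase).
    for choice in product(*[candidates[p] + [None] for p in phrases]):
        chosen = {c for c in choice if c is not None}
        score = alpha * sum(w[(p, c)] for p, c in zip(phrases, choice) if c is not None)
        # A coherence edge contributes only if both endpoints are assigned.
        score += beta * sum(val for (k, l), val in v.items() if k in chosen and l in chosen)
        if score > best_score:
            best_score, best = score, dict(zip(phrases, choice))
    return best, best_score


# Invented similarity weights w and coherence weights v.
cands = {
    "composed": ["r:wroteComposition", "r:wroteSoftware"],
    "scores for": ["r:soundtrackFor", "r:shootsGoalFor"],
    "Rome": ["e:Rome (Italy)", "e:Lazio Roma"],
}
w_ex = {
    ("composed", "r:wroteComposition"): 0.5, ("composed", "r:wroteSoftware"): 0.5,
    ("scores for", "r:soundtrackFor"): 0.4, ("scores for", "r:shootsGoalFor"): 0.6,
    ("Rome", "e:Rome (Italy)"): 0.7, ("Rome", "e:Lazio Roma"): 0.3,
}
v_ex = {
    ("r:wroteComposition", "r:soundtrackFor"): 0.9,
    ("r:shootsGoalFor", "e:Lazio Roma"): 0.8,
    ("r:soundtrackFor", "e:Rome (Italy)"): 0.2,
}
best_mapping, best_value = solve(cands, w_ex, v_ex)
print(best_mapping, best_value)
```

With these numbers the coherence between r:wroteComposition and r:soundtrackFor outweighs the higher similarity weight of r:shootsGoalFor, which is the effect the joint mapping is meant to achieve. Enumeration is exponential in the number of phrases, so this only works for toy inputs.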

62 Outline: ✓ Opportunities Now; ✓ Semantic Search Today; ✓ Entity Name Disambiguation; ✓ Question Answering; ✓ Disambiguation Reloaded; Wrap-Up

63 Summary
Web of Data & Knowledge & Text (RDF + Phrases) calls for Semantic Search by Entities, Classes & Relations.
Diversity & Ambiguity of Names and Phrases calls for Disambiguation Mapping.
Strong story for Entity Name Disambiguation; ongoing work on Relation Phrase Disambiguation.
Cornerstone of Question Answering with Natural Language or Advanced Keywords.
Great opportunity towards next-generation search.
Challenging problems: robustness, scale, dynamics & transfer.

64 Take-Home Message: Solve "Who composed the Ecstasy and other pieces for westerns?" → can solve semantic search with natural-language disambiguation

