Gerhard Weikum, Max Planck Institute for Informatics: For a Few Triples More

1 Gerhard Weikum, Max Planck Institute for Informatics: For a Few Triples More

2 Acknowledgements

3 LOD: RDF Triples on the Web

4 LOD: Linked RDF Triples on the Web
[Figure: graph of linked RDF triples around dbpedia.org/resource/Ennio_Morricone.
Recoverable node labels: dbpedia.org/resource/Rome, rdf.freebase.com/ns/en.rome, data.nytimes.com/, geonames.org/ /roma (Coord N 41° 54' 10'' E 12° 29' 2''), yago/wikicategory:ItalianComposer, yago/wordnet:Artist, yago/wordnet:Actor, imdb.com/name/nm /, imdb.com/title/tt /.
Recoverable edge labels: owl:sameAs, dbpprop:citizenOf, rdf:type, rdfs:subclassOf, prop:actedIn, prop:composedMusicFor.]

5 LOD: Linked RDF Triples on the Web
Size: 30 billion triples
Linkage: 500 million links
Dynamics: encyclopedic reference data

6 The Good, the Bad, and the Ugly

7 For a Few Triples More
30 billion triples: still not enough? No! Consider:
1. Dynamics
2. Linkage
3. Ubiquity

8 Outline
- Explain Title (done)
- Why More Triples: Dynamics, Linkage, Ubiquity
- Linkage & Ubiquity: Named-Entity Disambiguation
- Web-Scale Linkage
- Wrap-up

9 1. Dynamics: In a Fast-Paced World
Anecdotal examples: Chairman and CEO, Apple Inc.

10 1. Dynamics: As Fresh As Possible

11 1. Dynamics: Updates in the Web of Data

12 1. Dynamics: Closer to the Sources
RDF data on the Web is produced by:
- maintained, but mostly "static" reference collections (e.g. geo)
- periodic exports from curated databases (e.g. gov, bio, music)
- periodic extraction from Web sources (e.g. encyclopedia, news)
- tags in social streams and advertisements
Quality labels on the slide (their assignment to the sources above is not recoverable from the transcript): mostly fresh, often stale, very noisy
Get closer to the data origin:
- RDF engines (SPARQL APIs) for production DBs
- view maintenance by pub-sub push (feeds)
- Deep-Web crawl/query for surfacing of RDF data

13 1. Dynamics: Nothing Lasts Forever
Even old and "static" data often needs temporal scope (timepoint, timespan) for proper interpretation.
Need to add temporal properties to RDF and SPARQL with reification, or use quads (quints, pints, etc.), but this should be principled, expressive, and easy to use.
Example facts with temporal scope:
1: PaulMcCartney hasSpouse HeatherMills [11-Jun-2002, 2008]
2: PaulMcCartney hasSpouse NancyShevell [Oct-2011, now]
3: PaulMcCartney gotHonor SirPaul [1999]
Reified temporal properties (values as given in the brackets above):
1 validFrom 11-Jun-2002
1 validUntil 2008
2 validFrom Oct-2011
3 happendOn 1999
Query:
Select ?w Where {
  ?id1: PM gotHonor SirPaul .
  ?id1 happendOn ?t .
  ?id2: PM hasSpouse ?w .
  ?id2 validFrom ?b .
  ?id2 validUntil ?e .
  ?t containedIn [?b,?e] .
}
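The reification idea on this slide can be sketched in Python rather than SPARQL. This is only an illustration under stated assumptions: the `Fact` class, the year-only encoding of the dates, the `NOW` sentinel, and both function names are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass

NOW = 9999  # assumed sentinel for an open-ended interval ("now")


@dataclass(frozen=True)
class Fact:
    fact_id: int
    subject: str
    predicate: str
    obj: str


# Base triples; each has an id so that meta-facts can refer to it.
FACTS = [
    Fact(1, "PaulMcCartney", "hasSpouse", "HeatherMills"),
    Fact(2, "PaulMcCartney", "hasSpouse", "NancyShevell"),
    Fact(3, "PaulMcCartney", "gotHonor", "SirPaul"),
]

# Temporal meta-facts keyed by (fact id, property); years only (a simplification).
META = {
    (1, "validFrom"): 2002, (1, "validUntil"): 2008,
    (2, "validFrom"): 2011, (2, "validUntil"): NOW,
    (3, "happendOn"): 1999,
}


def spouses_at(subject, year):
    """Return all spouses whose validity interval contains the given year."""
    result = []
    for f in FACTS:
        if f.subject == subject and f.predicate == "hasSpouse":
            begin = META.get((f.fact_id, "validFrom"))
            end = META.get((f.fact_id, "validUntil"))
            if begin is not None and end is not None and begin <= year <= end:
                result.append(f.obj)
    return result


def spouses_when_honored(subject, honor):
    """Mirror the slide's query: spouse(s) at the time point of the honor."""
    result = []
    for f in FACTS:
        if f.subject == subject and f.predicate == "gotHonor" and f.obj == honor:
            t = META.get((f.fact_id, "happendOn"))
            if t is not None:
                result.extend(spouses_at(subject, t))
    return result
```

With only the three facts listed on the slide, `spouses_when_honored("PaulMcCartney", "SirPaul")` yields no binding, because 1999 lies in neither validity interval; `spouses_at("PaulMcCartney", 2005)` returns the single spouse valid in that year.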

14 1. Dynamics: Nothing Lasts Forever

15 2. Linkage: sameAs Links
Example link:
dbpedia.org/resource/Linda_Louise_Eastman owl:sameAs yago-knowledge.org/resource/Linda_McCartney
Further owl:sameAs examples on the slide involve dbpedia.org/page/Clint_Eastwood, data.linkedmdb.org/page/film/38166, and de.dbpedia.org/page/Zwei_glorreiche_Halunken (the exact pairing is not recoverable from the transcript).
LOD statistics:
- 30 billion triples, 500 million links
- 330 million links are trivial (ID-based): within pub, within bio
- tens of millions of links are near-trivial: among DBpedia, Freebase, Yago, GeoNames
- sameas.org: 17 million bundles for 50 million URIs
- data.nytimes.com: 5000 people, 2000 locations
Way too few for a world with 1 million people, 10 million locations, tens of millions of species, 6 million books, 2 million movies, 10 million songs, etc.
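A "bundle" of URIs can be read as an equivalence class under owl:sameAs. The sketch below computes such bundles with a union-find structure; this is an assumed reading of the term, not a description of how sameas.org is implemented, and the function name and the `example.org` URIs are hypothetical.

```python
from collections import defaultdict


def same_as_bundles(links):
    """Group URIs into bundles (connected components) of sameAs links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    bundles = defaultdict(set)
    for uri in list(parent):
        bundles[find(uri)].add(uri)
    # Sorted output so that results are deterministic.
    return sorted((sorted(b) for b in bundles.values()), key=lambda b: b[0])


links = [
    ("dbpedia.org/resource/Linda_Louise_Eastman",
     "yago-knowledge.org/resource/Linda_McCartney"),
    ("example.org/a", "example.org/b"),  # hypothetical URIs
    ("example.org/b", "example.org/c"),  # hypothetical URIs
]
bundles = same_as_bundles(links)
```

Because the closure is transitive, a single wrong sameAs link merges two bundles that should stay apart, which is one reason link accuracy matters as much as link coverage.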

16 2. Linkage: sameAs Coverage

17 2. Linkage: sameAs Accuracy

18 3. Ubiquity: Web-of-Data & Web-of-Contents

19 3. Ubiquity: Web of Data & Other Contents
- RDF data and Web contents need to be interconnected
- RDFa & microformats provide the mechanism
- How do we get the Web RDF-annotated (at large scale)?
- Largely automated, but allow humans in the loop

20 3. Ubiquity: Web of Data & Other Contents May 2, 2011 Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both Classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th. … Smetana Hall … The concert will feature … July 1

21 Why a Few Triples More?
Linked Data is great! But it is still in its infancy.
Need to add triples to capture further issues:
- Dynamics: Where is the live data?
- Linkage: Where are the links in Linked Data?
- Ubiquity: Where are the paths between the Web-of-Data and the Web?

22 Outline
- Explain Title (done)
- Why More Triples: Dynamics, Linkage, Ubiquity (done)
- Linkage & Ubiquity: Named-Entity Disambiguation
- Web-Scale Linkage
- Wrap-up

23 Entities on the Web

24 Named-Entity Disambiguation (NED)
Example: "Harry fought with you know who. He defeats the dark lord."
Candidate entities: Harry Potter, Dirty Harry, Prince Harry of England, Lord Voldemort, The Who (band)
Three NLP tasks:
1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger)
2) co-reference resolution: link to preceding NP (trained classifier over linguistic features)
3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB)

25 Mentions, Meanings, Mappings
Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."
Mentions (surface names) and their candidate entities (meanings) in the KB:
Sergio means Sergio_Leone
Sergio means Serge_Gainsbourg
Ennio means Ennio_Antonelli
Ennio means Ennio_Morricone
Eli means Eli_(bible)
Eli means ExtremeLightInfrastructure
Eli means Eli_Wallach
Ecstasy means Ecstasy_(drug)
Ecstasy means Ecstasy_of_Gold
trilogy means Star_Wars_Trilogy
trilogy means Lord_of_the_Rings
trilogy means Dollars_Trilogy
…
Entity nodes shown in the KB figure: Eli (bible), Eli Wallach, Ecstasy of Gold, Ecstasy (drug), Dollars Trilogy, Lord of the Rings, Star Wars Trilogy, Benny Andersson, Benny Goodman
The mapping from mentions to entities is the open question (?).
(slide footer: D5 Overview, May 30, 2011)
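The "means" pairs above act as a dictionary from surface names to candidate entities. A minimal sketch, assuming a plain in-memory dictionary (the function names are hypothetical), shows the lookup and how quickly the number of possible joint mappings grows:

```python
import math
from collections import defaultdict

# The "means" pairs from the slide: (mention, candidate entity).
MEANS = [
    ("Sergio", "Sergio_Leone"), ("Sergio", "Serge_Gainsbourg"),
    ("Ennio", "Ennio_Antonelli"), ("Ennio", "Ennio_Morricone"),
    ("Eli", "Eli_(bible)"), ("Eli", "ExtremeLightInfrastructure"),
    ("Eli", "Eli_Wallach"),
    ("Ecstasy", "Ecstasy_(drug)"), ("Ecstasy", "Ecstasy_of_Gold"),
    ("trilogy", "Star_Wars_Trilogy"), ("trilogy", "Lord_of_the_Rings"),
    ("trilogy", "Dollars_Trilogy"),
]


def build_dictionary(pairs):
    """Map each mention to the list of its candidate entities."""
    dictionary = defaultdict(list)
    for mention, entity in pairs:
        dictionary[mention].append(entity)
    return dict(dictionary)


def candidates(dictionary, mention):
    """Candidate entities for a mention; empty if the name is unknown."""
    return dictionary.get(mention, [])


def num_joint_mappings(dictionary, mentions):
    """Number of ways to pick exactly one entity per mention."""
    return math.prod(len(candidates(dictionary, m)) for m in mentions)


dictionary = build_dictionary(MEANS)
mentions = ["Sergio", "Ennio", "Eli", "Ecstasy", "trilogy"]
```

Even for this two-sentence example with the listed candidates, there are 2 · 2 · 3 · 2 · 3 = 72 joint mappings, which is why the later slides rely on graph algorithms rather than enumeration.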

26 Mention-Entity Graph
Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."
Entity nodes (from KB+Stats): Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Dollars Trilogy, Lord of the Rings, Star Wars
Weighted undirected graph with two types of nodes
Popularity (m,e): freq(e|m), length(e), #links(e)
Similarity (m,e): cos/Dice/KL (context(m), context(e)); bag-of-words or language model: words, bigrams, phrases
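As a rough illustration of the mention-entity edge weights, the sketch below combines a popularity prior with a bag-of-words cosine similarity. The linear combination, the weight `alpha`, the prior values, and the entity context words are all assumptions made for this example; the slide only names the ingredients.

```python
import math
from collections import Counter


def bag(text):
    """Bag-of-words representation of a text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def local_score(prior, mention_context, entity_context, alpha=0.3):
    """Assumed combination: alpha * popularity + (1 - alpha) * similarity."""
    sim = cosine(bag(mention_context), bag(entity_context))
    return alpha * prior + (1 - alpha) * sim


mention_context = "role in the ecstasy scene graveyard western films"
# entity -> (hypothetical prior, hypothetical context keywords)
entity_data = {
    "Ecstasy_(drug)": (0.8, "psychoactive drug mdma"),
    "Ecstasy_of_Gold": (0.2, "soundtrack western graveyard scene morricone"),
}
scores = {
    entity: local_score(prior, mention_context, context)
    for entity, (prior, context) in entity_data.items()
}
```

With these made-up numbers the context similarity outweighs the higher prior of the drug sense. With a larger `alpha` the popular sense would win, which is the kind of error that the coherence edges between entities are meant to correct.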

27 Mention-Entity Graph
Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."
Entity nodes (from KB+Stats): Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Dollars Trilogy, Lord of the Rings, Star Wars
Weighted undirected graph with two types of nodes
Popularity (m,e): freq(e|m), length(e), #links(e)
Similarity (m,e): cos/Dice/KL (context(m), context(e))
Joint mapping

28 Mention-Entity Graph
Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."
Entity nodes (from KB+Stats): Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Dollars Trilogy, Lord of the Rings, Star Wars
Weighted undirected graph with two types of nodes
Popularity (m,e): freq(m,e|m), length(e), #links(e)
Similarity (m,e): cos/Dice/KL (context(m), context(e))
Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)
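One simple way to instantiate overlap(links) is the Jaccard overlap of the sets of pages that link to each entity. This is a stand-in chosen for brevity, not necessarily the measure used in the talk, and the in-link sets below are hypothetical.

```python
def link_overlap(links_a, links_b):
    """Jaccard overlap of two in-link sets; 0.0 if either set is empty."""
    a, b = set(links_a), set(links_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical in-link sets (pages that link to the entity).
inlinks = {
    "Ecstasy_of_Gold": {"Ennio_Morricone", "Metallica",
                        "The_Good,_the_Bad_and_the_Ugly"},
    "Dollars_Trilogy": {"Ennio_Morricone", "Sergio_Leone",
                        "The_Good,_the_Bad_and_the_Ugly"},
    "Ecstasy_(drug)": {"MDMA", "Rave"},
}

coh_gold_dollars = link_overlap(inlinks["Ecstasy_of_Gold"],
                                inlinks["Dollars_Trilogy"])
coh_drug_dollars = link_overlap(inlinks["Ecstasy_(drug)"],
                                inlinks["Dollars_Trilogy"])
```

In this toy data the soundtrack and the film trilogy share two of four distinct in-links, while the drug sense shares none, so the entity-entity edge favors the coherent pair.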

29 Mention-Entity Graph
Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."
Entity nodes (from KB+Stats): Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Dollars Trilogy, Lord of the Rings, Star Wars
Weighted undirected graph with two types of nodes
Popularity (m,e): freq(m,e|m), length(e), #links(e)
Similarity (m,e): cos/Dice/KL (context(m), context(e))
Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)
Type labels shown next to entity nodes (three groups, in transcript order):
- American Jews, film actors, artists, Academy Award winners
- Metallica songs, Ennio Morricone songs, artifacts, soundtrack music
- spaghetti westerns, film trilogies, movies, artifacts

30 Mention-Entity Graph
Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."
Entity nodes (from KB+Stats): Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Dollars Trilogy, Lord of the Rings, Star Wars
Weighted undirected graph with two types of nodes
Popularity (m,e): freq(m,e|m), length(e), #links(e)
Similarity (m,e): cos/Dice/KL (context(m), context(e))
Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)
Link label shown in the figure (truncated in the transcript): _the_Ugly

31 Mention-Entity Graph
Example text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."
Entity nodes (from KB+Stats): Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Dollars Trilogy, Lord of the Rings, Star Wars
Weighted undirected graph with two types of nodes
Popularity (m,e): freq(m,e|m), length(e), #links(e)
Similarity (m,e): cos/Dice/KL (context(m), context(e))
Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)
Anchor words shown next to entity nodes (three groups, in transcript order):
- Metallica on Morricone tribute; Bellagio water fountain show; Yo-Yo Ma; Ennio Morricone composition
- The Magnificent Seven; The Good, the Bad, and the Ugly; Clint Eastwood; University of Texas at Austin
- For a Few Dollars More; The Good, the Bad, and the Ugly; Man with No Name trilogy; soundtrack by Ennio Morricone

32 Joint Mapping
- Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB
- Compute high-likelihood mapping (ML or MAP) or dense subgraph such that each m is connected to exactly one e (or at most one e)

33 Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11]
- Compute dense subgraph to maximize min weighted degree among entity nodes such that each m is connected to exactly one e (or at most one e)
- Greedy approximation: iteratively remove weakest entity and its edges
- Keep alternative solutions, then use local/randomized search
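The greedy approximation can be sketched as follows. This is a simplified illustration and not the published algorithm: it never removes the last remaining candidate of a mention, it keeps the intermediate graph with the highest minimum weighted degree, and it replaces the final local/randomized search with a plain "highest weighted degree wins" choice. The function name, the data layout, and all edge weights are hypothetical.

```python
def greedy_disambiguate(me_weights, ee_weights):
    """me_weights: {(mention, entity): w}; ee_weights: {frozenset({e1, e2}): w}."""
    candidates = {}
    for mention, entity in me_weights:
        candidates.setdefault(mention, set()).add(entity)
    if not candidates:
        return {}
    active = set().union(*candidates.values())

    def degree(entity, nodes):
        # Weighted degree: mention edges plus coherence edges inside `nodes`.
        d = sum(w for (_, e), w in me_weights.items() if e == entity)
        d += sum(w for pair, w in ee_weights.items()
                 if entity in pair and pair <= nodes)
        return d

    def removable(entity, nodes):
        # An entity may go only if every mention keeps another candidate.
        return all(len(c & nodes) > 1
                   for c in candidates.values() if entity in c)

    best = set(active)
    best_min = min(degree(e, active) for e in active)
    while True:
        options = [e for e in active if removable(e, active)]
        if not options:
            break
        weakest = min(options, key=lambda e: (degree(e, active), e))
        active = active - {weakest}  # drops the entity and, implicitly, its edges
        current_min = min(degree(e, active) for e in active)
        if current_min >= best_min:
            best, best_min = set(active), current_min

    # Simplified final step: best remaining candidate per mention.
    return {m: max(c & best, key=lambda e: (degree(e, best), e))
            for m, c in candidates.items()}


me = {  # hypothetical mention-entity weights
    ("Eli", "Eli_(bible)"): 0.6, ("Eli", "Eli_Wallach"): 0.4,
    ("Ecstasy", "Ecstasy_(drug)"): 0.7, ("Ecstasy", "Ecstasy_of_Gold"): 0.3,
    ("trilogy", "Star_Wars_Trilogy"): 0.5, ("trilogy", "Dollars_Trilogy"): 0.3,
}
ee = {  # hypothetical entity-entity coherence weights
    frozenset({"Eli_Wallach", "Ecstasy_of_Gold"}): 0.8,
    frozenset({"Eli_Wallach", "Dollars_Trilogy"}): 0.9,
    frozenset({"Ecstasy_of_Gold", "Dollars_Trilogy"}): 0.9,
}
mapping = greedy_disambiguate(me, ee)
```

In this toy graph every mention-entity weight favors the wrong sense, yet the mutually coherent entities survive the pruning because their coherence edges give them a high weighted degree.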

37 Named-Entity Disambiguation: State-of-the-Art
Online tools: https://d5gate.ag5.mpi-sb.mpg.de/webaida/ etc.
Literature:
- Razvan Bunescu, Marius Pasca: EACL 2006
- Silviu Cucerzan: EMNLP 2007
- David Milne, Ian Witten: CIKM 2008
- S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009
- G. Limaye, S. Sarawagi, S. Chakrabarti: VLDB 2010
- Paolo Ferragina, Ugo Scaiella: CIKM 2010
- Mark Dredze et al.: COLING 2010
- Johannes Hoffart et al.: EMNLP 2011
- etc.

38 NED: Experimental Evaluation
Benchmark: extended CoNLL 2003 dataset: 1400 newswire articles, originally annotated with mention markup (NER), now with NED mappings to Yago and Freebase
Difficult texts:
… Australia beats India … → Australian_Cricket_Team
… White House talks to Kremlin … → President_of_the_USA
… EDS made a contract with … → HP_Enterprise_Services
Results: best is the AIDA method with prior+sim+coh + robustness test: 82% recall, 87% mean average precision
For a comparison to other methods, see the paper: J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011

39 AIDA: Accurate Online Disambiguation

44 Interesting Research Issues
- More efficient graph algorithms (multicore, etc.)
- Allow mentions of unknown entities, mapped to null
- Short and difficult texts: tweets, headlines, etc.; fictional texts: novels, song lyrics, etc.; incoherent texts
- Disambiguation beyond entity names: coreferences (pronouns, paraphrases, etc.); common nouns, verbal phrases (general WSD)
- Leverage deep-parsing structures, leverage semantic types

45 Why Named Entity Disambiguation is Key
- Linked data is best if it has many good links
- New & rich contents mostly in traditional Web
- Create sameAs links in (X)HTML contents, via RDFa
- Links for named entities give best mileage/effort
- Methods & tools greatly advanced & gradually maturing
- Keep human in the loop, embed NED in authoring tools

46 Outline
- Explain Title (done)
- Why More Triples: Dynamics, Linkage, Ubiquity (done)
- Linkage & Ubiquity: Named-Entity Disambiguation (done)
- Web-Scale Linkage
- Wrap-up

47 Variants of NED at Web Scale
Tools can map short text onto entities in a few seconds.
- How to run this on a big batch of 1 million input texts?
  → partition inputs across distributed machines, organize dictionary appropriately, …
  → exploit cross-document contexts
- How to deal with inputs from different time epochs?
  → consider time-dependent contexts, map to entities of proper epoch (e.g. harvested from Wikipedia history)
- How to handle Web-scale inputs (100 million pages) restricted to a set of interesting entities? (e.g. tracking politicians and companies)
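Two of the ideas above can be sketched in a few lines: assigning input texts to machines by a stable hash, and cheaply pre-filtering pages for a fixed set of interesting surface names before running full NED. Both functions are illustrative assumptions about how one might start, not a description of an existing system.

```python
import zlib


def shard_of(doc_id, num_shards):
    """Stable shard assignment (CRC32 gives the same value on every machine)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def partition(doc_ids, num_shards):
    """Split document ids into num_shards lists, one per worker."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_of(doc_id, num_shards)].append(doc_id)
    return shards


def mentions_any(text, surface_names):
    """Cheap pre-filter: does the text contain any interesting surface name?"""
    lowered = text.lower()
    return any(name.lower() in lowered for name in surface_names)


doc_ids = [f"doc{i}" for i in range(100)]  # hypothetical document ids
shards = partition(doc_ids, 4)
```

A substring pre-filter like `mentions_any` only reduces the volume of pages; the pages that pass still need disambiguation, since a matching name may refer to a different entity.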

48 Linked RDF Triples on the Web
[Figure: the graph of linked RDF triples around dbpedia.org/resource/Ennio_Morricone, now with different link targets for Rome.
Recoverable node labels: dbpedia.org/resource/Rome, rdf.freebase.com/ns/en.rome_ny, data.nytimes.com/, geonames.org/ /city_of_rome (Coord N 43° 12' 46'' W 75° 27' 20''), yago/wikicategory:ItalianComposer, yago/wordnet:Artist, yago/wordnet:Actor, imdb.com/name/nm /, imdb.com/title/tt /.
Recoverable edge labels: owl:sameAs, dbpprop:citizenOf, rdf:type, rdfs:subclassOf, prop:actedIn, prop:composedMusicFor.
Three question marks (?) mark links in the figure.]
Referential data quality: automatic, dynamic, high coverage!

49 Outline
- Explain Title (done)
- Why More Triples: Dynamics, Linkage, Ubiquity (done)
- Linkage & Ubiquity: Named-Entity Disambiguation (done)
- Web-Scale Linkage (done)
- Wrap-up

50 Summary
Linked Data is great! But it needs more triples to capture:
- Dynamics: (Deep-Web) sources → feeds, pub-sub, … ? → fresh & versioned triples
- Linkage: LOD → entity mapping → user → community
- Ubiquity: RDFa → entity disambiguation → authoring

51 Outlook
For a Few Triples More. Challenge 1: generate high-quality sameAs links in RDFa & across all LOD sources
For a Few Triples Less. Challenge 2: add efficient top-k ranking to queries over RDF-in-context

52 Thank You !

