
Slide 1: Outline
1. Introduction
2. Harvesting Classes
3. Harvesting Facts
4. Common Sense Knowledge
5. Knowledge Consolidation: in YAGO, in NELL, in the Google Knowledge Vault, KB Alignment, Linked Data
6. Web Content Analytics
7. Wrap-Up

Slide 2: Goal: combine several extractors
(Figure: one extractor reads text, another reads tables; both feed facts into the knowledge base.) Different extractors may produce conflicting candidate facts from different documents, e.g. is(Elvis, alive) from two sources and is(Elvis, dead) from a third. Which fact should the knowledge base keep?
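As a naive baseline (a sketch of my own for illustration, not a technique from the tutorial; all facts and confidence values are made up), one could let the extractors vote, weighting each candidate fact by the confidence of the extractors that produced it. The following slides show how real systems go beyond this:

```python
from collections import defaultdict

# Candidate facts with the extractor that produced them and its confidence.
extractions = [
    ("is(Elvis, alive)", "text_extractor",  0.6),
    ("is(Elvis, dead)",  "text_extractor",  0.7),
    ("is(Elvis, alive)", "table_extractor", 0.5),
]

votes = defaultdict(float)
for fact, extractor, confidence in extractions:
    votes[fact] += confidence          # confidence-weighted voting

# Keep the better-supported fact: is(Elvis, alive) scores 1.1 vs. 0.7.
print(max(votes, key=votes.get))
```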

Slide 3: YAGO combines 170 extractors
(Figure: candidate facts flow through a pipeline of extractors and filters: Infobox Extractor, TypeChecker, MultilingualMerger, ...) For example, the candidates "JimGray bornIn 'January 12, 1944'" and "JimGray bornIn SanFrancisco" enter the pipeline; only the type-correct fact JimGray bornIn SanFrancisco survives.
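A toy sketch of what a TypeChecker stage does (my illustration, not YAGO code; the relation ranges and entity types below are made up):

```python
# Expected object type (range) per relation, and known types of candidate objects.
RANGE = {"bornIn": "city"}
TYPES = {"SanFrancisco": "city", '"January 12, 1944"': "date"}

candidates = [
    ("JimGray", "bornIn", '"January 12, 1944"'),   # wrong type: a date
    ("JimGray", "bornIn", "SanFrancisco"),         # correct type: a city
]

# Keep only facts whose object has the type the relation expects.
accepted = [(s, p, o) for s, p, o in candidates if TYPES.get(o) == RANGE.get(p)]
print(accepted)   # [('JimGray', 'bornIn', 'SanFrancisco')]
```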

Slide 4: YAGO combines 170 extractors [Mahdisoltani et al.: CIDR 2015]
- type checking
- type coherence checking
- translation
- learning of foreign-language attributes
- deduplication
- Horn rule inference
- functional constraint checking (simple preference over sources)
=> 10 languages, precision of 95%
http://yago-knowledge.org

Slide 5: Outline — section 5, Knowledge Consolidation: in YAGO ✓; up next: in NELL, in the Google Knowledge Vault, KB Alignment, Linked Data.

Slide 6: NELL couples different learners [Carlson et al. 2010 and follow-ups] — http://rtw.ml.cmu.edu/rtw/
(Figure: several components constrain each other during learning.) A Natural Language Pattern Extractor (e.g. from "Krzewski coaches the Blue Devils."), a Table Extractor (e.g. a table pairing Krzewski with the Blue Angels and Miller with the Red Angels), mutual-exclusion constraints (a coach is not a scientist), and type checks ("If I coach, am I a coach?"), all seeded by an initial ontology.

Slide 7: NELL couples different learners (cont.)
Different learners benefit from each other:
- table extraction
- text extraction
- path ranking (rule learning)
- morphological features ("...ism" is something abstract)
- active learning (ask for answers in online forums)
- learning from images
- learning from several languages (?)
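A minimal sketch of the mutual-exclusion coupling (my illustration with made-up categories and beliefs, not NELL code): a candidate category label is rejected if it contradicts what is already believed about the entity.

```python
# Pairs of mutually exclusive categories, and current beliefs per entity.
MUTEX = {frozenset({"coach", "scientist"})}
beliefs = {"Krzewski": {"coach"}}

def consistent(entity, label):
    """Accept a candidate label only if it is not mutually exclusive
    with any category already believed for the entity."""
    return all(frozenset({label, old}) not in MUTEX
               for old in beliefs.get(entity, set()))

print(consistent("Krzewski", "scientist"))  # False: a coach is not a scientist
print(consistent("Krzewski", "coach"))      # True
```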

Slide 8: Estimating Accuracy from Unlabeled Data [Platanios, Blum, Mitchell: UAI 2014]
Given: extractors f_1, ..., f_n. Find: the error probability e_i of each extractor.
The key observable is the agreement rate of each pair of extractors, which requires no labels:
  a_ij = P_x(f_i(x) = f_j(x)) = P(both make an error) + P(neither makes an error)
       = 1 - e_i - e_j + 2*e_ij,
where e_ij is the probability of a simultaneous error.
Case 1: independent errors and accuracies > 0.5. Then e_ij = e_i*e_j, so a_ij = 1 - e_i - e_j + 2*e_i*e_j. Since the agreements a_ij are known, the problem reduces to solving a system of N(N-1)/2 equations in N unknowns, solvable if N >= 3.
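For three extractors, Case 1 even has a closed form: writing c_i = 1 - 2*e_i turns the agreement equation into 2*a_ij - 1 = c_i*c_j, and three pairwise agreements pin down the three unknowns. A small sketch (my derivation from the slide's equation, not code from the paper):

```python
import math

def error_rates(a12, a13, a23):
    """Recover the error rates of three extractors from their pairwise
    agreement rates, assuming independent errors and accuracy > 0.5.
    Uses a_ij = 1 - e_i - e_j + 2*e_i*e_j, i.e. 2*a_ij - 1 = c_i*c_j
    with c_i = 1 - 2*e_i."""
    c1 = math.sqrt((2*a12 - 1) * (2*a13 - 1) / (2*a23 - 1))
    c2 = (2*a12 - 1) / c1
    c3 = (2*a13 - 1) / c1
    return [(1 - c) / 2 for c in (c1, c2, c3)]

# Extractors with true error rates 0.1, 0.2, 0.3 would agree with rates
# a12 = 0.74, a13 = 0.66, a23 = 0.62; the error rates are recovered:
print(error_rates(0.74, 0.66, 0.62))   # approx. [0.1, 0.2, 0.3]
```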

Slide 9: Estimating Accuracy from Unlabeled Data (cont.) [Platanios, Blum, Mitchell: UAI 2014]
Same setup: the known agreement rates satisfy a_ij = 1 - e_i - e_j + 2*e_ij, with e_ij the probability of a simultaneous error.
Case 2: errors are not independent. Idea: among all error rates consistent with the observed agreements, prefer those that minimize the dependence terms e_ij - e_i*e_j, i.e., treat the extractors as being as close to independent as the data allows.

Slide 10: Outline — section 5, Knowledge Consolidation: in YAGO ✓, in NELL ✓; up next: in the Google Knowledge Vault, KB Alignment, Linked Data.

Slide 11: Google Knowledge Vault [Dong et al.: KDD 2014]
Given: Freebase, a relation r, and extractors e_1, ..., e_n over four kinds of sources: text, DOM trees, Web tables, and schema.org annotations. Train a fusion classifier that, given the confidences of the individual extractors for a candidate fact, predicts whether the extracted statement is true. A prior computed by Path Ranking over the existing KB complements the extractors.
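A toy version of such a fusion classifier (a sketch under my assumptions, not the actual Knowledge Vault pipeline; all values are invented): each candidate triple is described by the confidence of each extractor family, with 0 where an extractor did not fire, and the training label says whether Freebase already contains the triple (distant supervision).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per candidate triple; columns are the confidences of the four
# extractor families: [text, DOM, tables, annotations].
X = np.array([
    [0.9, 0.8, 0.0, 0.7],   # proposed by three extractors, high confidence
    [0.4, 0.0, 0.0, 0.0],   # weak single-source extraction
    [0.0, 0.6, 0.9, 0.0],
    [0.2, 0.1, 0.0, 0.0],
])
y = np.array([1, 0, 1, 0])  # distant supervision: is the triple in Freebase?

fusion = LogisticRegression().fit(X, y)

# Estimated probability that a new candidate triple is true:
print(fusion.predict_proba([[0.8, 0.0, 0.7, 0.0]])[:, 1])
```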

Slide 12: RDFa Annotations [Guha: "Schema.org", keynote at AKBC 2014]
(Figure: the same page as seen by a browser — "My name is Elvis." — and by an RDFa analyzer, which extracts structured data from the markup.) About 30% of Web pages are annotated this way. schema.org is a common vocabulary designed by Google, Microsoft, Yandex, and others for this purpose.

Slide 13: Trustworthiness of Web Sources [Dong et al.: VLDB 2015]
(Figure: scatter plot of Knowledge-Based Trust vs. PageRank.) PageRank and trustworthiness are not always correlated: there are tail sources with high trustworthiness but low PageRank, and many gossip websites with high PageRank but low trustworthiness.

Slide 14: Outline — section 5, Knowledge Consolidation: in YAGO ✓, in NELL ✓, in Google ✓; up next: KB Alignment, Linked Data.

Slide 15: Knowledge bases are complementary

Slide 16: No links → no use
Who is the spouse of the guitar player? Without links between knowledge bases, such a query fails: one KB knows who plays the guitar, another knows the spouse, and only a link between the two entities connects them.

Slide 17: Linking Records vs. Linking Knowledge
(Figure: a database record — Susan B. Davidson, Peter Buneman, Yi Chen, University of Pennsylvania — vs. a KB/ontology with a "university" class.) Differences between DB records and KB entities:
- links have rich semantics (e.g. subclassOf)
- KBs have only binary predicates
- KBs have no schema
- we must match not just entities, but also classes and predicates (relations)

Slide 18: Similarity Flooding matches entities at scale
Build a graph:
- nodes: pairs of entities, weighted with a similarity (e.g. 0.9, 0.7)
- edges: weighted with a degree of relatedness (e.g. 0.8)
Iterate until convergence: similarity := weighted sum of the neighbors' similarities (e.g. a node's similarity moves from 0.7 to 0.8).
Many variants exist (belief propagation, label propagation, etc.).
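A compact sketch of one such propagation variant (my simplification, not the original Similarity Flooding algorithm; the entity pairs and weights are made up): each pair-node's score is pulled toward the relatedness-weighted average of its neighbors' scores, mixed with its initial string similarity.

```python
def propagate_similarity(init_sim, neighbors, alpha=0.5, iterations=20):
    """init_sim: {pair: initial similarity}, e.g. from string matching.
    neighbors: {pair: [(other_pair, relatedness_weight), ...]}."""
    sim = dict(init_sim)
    for _ in range(iterations):
        new = {}
        for pair, base in init_sim.items():
            nbrs = neighbors.get(pair, [])
            total = sum(w for _, w in nbrs)
            # Relatedness-weighted average of the neighbors' current scores.
            prop = sum(w * sim[q] for q, w in nbrs) / total if total else base
            new[pair] = (1 - alpha) * base + alpha * prop
        sim = new
    return sim

# Two candidate pairs whose KB neighborhoods are linked by matching relations:
elvis = ("Elvis@KB1", "Elvis@KB2")
priscilla = ("Priscilla@KB1", "Priscilla@KB2")
init = {elvis: 0.9, priscilla: 0.7}
nbrs = {elvis: [(priscilla, 0.8)], priscilla: [(elvis, 0.8)]}
print(propagate_similarity(init, nbrs))
```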

Slide 19: Some neighborhoods are more indicative
(Figure: two candidate sameAs links between entities in different KBs.) Many people were born in 1935 → sharing the birth year is not indicative. Few people are married to Priscilla → sharing the spouse is highly indicative.

Slide 20: Inverse functionality as indicativeness [Suchanek et al.: VLDB 2012]
(Figure: the same sameAs candidates as on the previous slide.) A relation whose inverse is nearly functional (few subjects per object, like being married to Priscilla) gives strong evidence for a sameAs link; a relation with many subjects per object (like being born in 1935) gives weak evidence.
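One way to quantify this, roughly following the notion of functionality in the PARIS paper (the exact normalization may differ from the paper's):

```latex
% Functionality of a relation r: how close r is to mapping each
% subject to a single object; inverse functionality is the same
% measure applied to the inverse relation r^{-}.
\[
  \mathit{fun}(r) \;=\;
    \frac{\bigl|\{x \mid \exists y:\, r(x,y)\}\bigr|}
         {\bigl|\{(x,y) \mid r(x,y)\}\bigr|},
  \qquad
  \mathit{fun}^{-1}(r) \;=\; \mathit{fun}(r^{-}).
\]
% marriedTo has high inverse functionality (few subjects share an
% object), so a shared spouse is strong sameAs evidence; bornIn
% has low inverse functionality for an object like 1935.
```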

Slides 21-23: Match entities, classes, and relations
(Figure build-up over three slides: sameAs links between entities and subPropertyOf links between relations are established step by step.)

Slide 24: Match entities, classes, and relations [Suchanek et al.: VLDB 2012]
(Figure: sameAs, subPropertyOf, and subClassOf links across two KBs.) PARIS matches YAGO and DBpedia:
- time: 1:30 hours
- precision for instances: 90%
- precision for classes: 74%
- precision for relations: 96%
http://webdam.inria.fr/paris

Slide 25: Many challenges remain
Entity linkage is at the heart of semantic data integration: more than 50 years of research, and still some way to go!
Benchmarks:
- OAEI Ontology Alignment & Instance Matching: oaei.ontologymatching.org
- TAC KBP Entity Linking: www.nist.gov/tac/2012/KBP/
- TREC Knowledge Base Acceleration: trec-kba.org
Open issues:
- highly related entities with ambiguous names (George W. Bush (jun.) vs. George H. W. Bush (sen.))
- long-tail entities with sparse context
- records with complex DB / XML / OWL schemas
- ontologies with non-isomorphic structures

Slide 26: Outline — section 5, Knowledge Consolidation: in YAGO ✓, in NELL ✓, in Google ✓, KB Alignment ✓; up next: Linked Data.
Warning: the numbers mentioned here are not authoritative, because (1) they are based on incomplete crawls or (2) they may be outdated. See the respective sources for details.

Slide 27: Linked Open Data Cloud — http://lod-cloud.net
As of April 2011: about 30 billion triples and 500 million links. KBs by domain: social networking: 520, government: 183, publications: 138, life sciences: 85, user-generated content: 51, cross-domain: 47, geographic: 27, media: 24, plus linguistic KBs. From 2011 to 2014, the number of KBs tripled from 297 to 1091. [Schmachtenberg et al.: ISWC 2014]

Slide 28: Links between KBs [Schmachtenberg et al.: ISWC 2014]
- #links as of April 2014: unknown (only a sample was crawled)
- #links as of April 2011: 500 million
- #"sameAs" links at sameAs.org: 150 million
- 44% of KBs are not linked at all
Top linking predicates: owl:sameAs, rdfs:seeAlso, dct:source, dct:language, dct:creator, skos:exactMatch, skos:closeMatch, geographic predicates.
Watch out: "sameAs" has developed five meanings [Halpin & Hayes: "When owl:sameAs isn't the Same", LDOW 2010]:
1. identical to
2. same in a different context
3. same but referentially opaque
4. represents
5. very similar to

Slide 29: Dereferencing URIs [Schmachtenberg et al.: ISWC 2014] [Hogan et al.: "Weaving the Pedantic Web", LDOW 2010]
In a crawl of 1.6m dereferenceable URIs, dereferencability of schemas:
  full: 19% | partial: 9% | none: 72%
Example of what a dereferenced URI should return:
  @prefix y: <http://yago-knowledge.org/resource/> .
  y:Elvis rdf:type y:livingPerson .
  y:Elvis y:wasBornIn y:USA .
  ...
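A quick way to test dereferencability yourself (a sketch; the URI is illustrative, and whether a given server honors the Accept header is exactly what the full/partial/none statistics above measure):

```python
import requests

# Ask the server for a machine-readable representation of the entity
# via HTTP content negotiation.
response = requests.get(
    "http://yago-knowledge.org/resource/Elvis_Presley",  # illustrative URI
    headers={"Accept": "text/turtle"},
    timeout=10,
)
print(response.status_code, response.headers.get("Content-Type"))
print(response.text[:200])
```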

Slide 30: Publish the Rubbish [Suchanek: keynote at WOLE 2012]

Slide 31: Vocabularies [Schmachtenberg et al.: ISWC 2014]
Usage of standard vocabulary terms (% of KBs):
  rdfs:range          10%   |  rdfs:seeAlso            2%
  rdfs:subClassOf      9%   |  owl:equivalentClass     2%
  rdfs:subPropertyOf   7%   |  owl:inverseOf           1%
  rdfs:domain          6%   |  swivt:type              1%
  rdfs:isDefinedBy     4%   |  owl:equivalentProperty  1%
Adoption of standard vocabularies grew:
  FOAF:        27% (2011) -> 69% (2014)
  Dublin Core: 31% (2011) -> 56% (2014)

Slide 32: Open Problems and Grand Challenges
- automatic, continuously maintained sameAs links for the Web of Linked Data, with high accuracy and coverage
- distilling out the high-quality pieces of information
- Web-scale, robust entity linking with high quality
- handling huge numbers of linked-data sources, Web tables, ...

