YAGO: A LARGE ONTOLOGY FROM WIKIPEDIA AND WORDNET. FABIAN M. SUCHANEK, GJERGJI KASNECI, GERHARD WEIKUM. Subbalakshmi Iyer.

1 YAGO: A LARGE ONTOLOGY FROM WIKIPEDIA AND WORDNET  FABIAN M. SUCHANEK, GJERGJI KASNECI, GERHARD WEIKUM  Subbalakshmi Iyer

2 Motivation for an Ontology  Natural Language communication  Automated text translation  Finding information on internet  Computer-processable collection of knowledge

3 What is an Ontology?  An ontology is the description of a domain, its classes and properties and relationships between those classes by means of a formal language.  collection of knowledge about the world, a knowledge base  Example ontologies:  large taxonomies categorizing Web sites (such as on Yahoo!)  categorizations of products for sale and their features (such as on Amazon.com)

4 Uses of Ontologies  Machine Translation  Word Sense Disambiguation  Document Classification  Question Answering  Entity and fact-oriented Web Search

5 What is Yago  Yet Another Great Ontology  Part of Yago-Naga project  Goal to build a knowledge base that is  Large Scale  Domain-independent  Automatic Construction  High Accuracy  Uses Wikipedia and WordNet

6 More about YAGO  2 million entities  20 million facts  Facts represented as RDF triples  Accuracy of 95%  Examples:  Elvis Presley isA singer  singer subClassOf person  Elvis Presley bornOnDate 1935-01-08  Elvis Presley bornIn Tupelo  Tupelo locatedIn Mississippi(state)  Mississippi(state) locatedIn USA
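The example facts on this slide can be held as plain subject, relation, object triples. The following is a minimal Python sketch, not YAGO's actual storage format; the names `Fact`, `objects` and `transitive_objects` are hypothetical helpers introduced only for illustration.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str


# The six example facts from the slide, copied as triples.
FACTS = [
    Fact("Elvis Presley", "isA", "singer"),
    Fact("singer", "subClassOf", "person"),
    Fact("Elvis Presley", "bornOnDate", "1935-01-08"),
    Fact("Elvis Presley", "bornIn", "Tupelo"),
    Fact("Tupelo", "locatedIn", "Mississippi(state)"),
    Fact("Mississippi(state)", "locatedIn", "USA"),
]


def objects(subject, relation, facts=FACTS):
    """Return every object o such that (subject, relation, o) is a stored fact."""
    return [f.obj for f in facts if f.subject == subject and f.relation == relation]


def transitive_objects(subject, relation, facts=FACTS):
    """Follow `relation` repeatedly and collect everything reachable."""
    seen = []
    frontier = [subject]
    while frontier:
        current = frontier.pop()
        for nxt in objects(current, relation, facts):
            if nxt not in seen:
                seen.append(nxt)
                frontier.append(nxt)
    return seen
```

Calling `transitive_objects("Tupelo", "locatedIn")` walks from Tupelo to Mississippi(state) and on to USA, which is the kind of chained lookup that a triple representation makes straightforward.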

7 The YAGO model  Slight extension of RDFS  Represents knowledge as  Entities  Classes  Relations  Facts  Properties of relations like transitivity  Simple and decidable model

8 Knowledge Representation in YAGO  All objects are entities  e.g. Elvis Presley, Grammy Award  2 entities can stand in a relationship  e.g. hasWonAward  Elvis Presley hasWonAward Grammy Award  The triple of entity, relationship, entity is a fact  e.g. Elvis Presley hasWonAward Grammy Award is a fact

9 Knowledge Representation in YAGO -2  Numbers, dates and strings are also entities.  Elvis Presley BornInYear 1935  Words are entities  “Elvis” means Elvis Presley  Entity is instance of class  Elvis Presley Type Singer  Classes are also entities  Singer Type class

10 Knowledge Representation in YAGO - 3  Classes have hierarchies  Singer SubClassOf Person  Relations are also entities  subClassOf Type atr (acyclic transitive relation)  Each fact has a fact identifier  Fact identifiers are entities too, so facts can be stated about facts  #1 FoundIn Wikipedia
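The three slides above can be summarised in a small sketch: every fact gets an identifier, an identifier can itself appear as the subject of another fact, and subClassOf is followed transitively when checking class membership. This is an illustrative Python sketch under those assumptions; the class `KnowledgeBase` and its methods are hypothetical and are not part of YAGO.

```python
from itertools import count


class KnowledgeBase:
    def __init__(self):
        self.facts = {}  # fact identifier -> (subject, relation, object)
        self._ids = count(1)

    def add(self, subject, relation, obj):
        """Store a fact and return its identifier, e.g. '#1'."""
        fact_id = f"#{next(self._ids)}"
        self.facts[fact_id] = (subject, relation, obj)
        return fact_id

    def superclasses(self, cls):
        """All classes reachable from cls via subClassOf, treated as transitive."""
        found = []
        frontier = [cls]
        while frontier:
            current = frontier.pop()
            for s, r, o in self.facts.values():
                if s == current and r == "subClassOf" and o not in found:
                    found.append(o)
                    frontier.append(o)
        return found

    def is_instance(self, entity, cls):
        """True if entity has type cls directly or through the class hierarchy."""
        direct = [o for s, r, o in self.facts.values() if s == entity and r == "type"]
        return any(c == cls or cls in self.superclasses(c) for c in direct)


kb = KnowledgeBase()
f1 = kb.add("Elvis Presley", "type", "singer")
kb.add("singer", "subClassOf", "person")
f3 = kb.add(f1, "foundIn", "Wikipedia")  # a fact whose subject is fact #1
```

Because the identifier `#1` is used like any other entity, provenance statements such as "#1 foundIn Wikipedia" need no separate mechanism.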

11 Key Contributions of YAGO  Information Extraction from Wikipedia  Infoboxes  Category Pages  Combination with WordNet  Taxonomy  Quality Control  Canonicalization  Type Checking

12 Information Extraction -1  Entities from Wikipedia  Each page title is candidate entity  Wiki Markup Language  Wikipedia dump as of September, 2008

13 Information Extraction - WML

14 Information Extraction Techniques  Infobox Harvesting  Wikipedia Infoboxes  Word-Level Techniques  Wikipedia Redirects  Category Harvesting  Wikipedia Categories  Type Extraction  Wikipedia Categories, WordNet Classes

15 1. Information Extraction from Wikipedia – Infobox Harvesting Wikipedia Infobox

16 Infobox attribute: Born: January 8, 1935  Attribute Map (Attribute, Relation, Inverse, Manifold, Indirect): Born, bornOnDate  Relation Map (Relation, Domain, Range): bornOnDate, person, yagoDate  Fact: Elvis Presley bornOnDate January 8, 1935

17 Infobox attribute: Died: August 16, 1977  Attribute Map (Attribute, Relation, Inverse, Manifold, Indirect): Died, diedOnDate  Relation Map (Relation, Domain, Range): diedOnDate, person, yagoDate  Fact: Elvis Presley diedOnDate August 16, 1977

18 Infobox attribute: Genre: Rock and Roll  Attribute Map (Attribute, Relation, Inverse, Manifold, Indirect): Genre, isOfGenre  Relation Map (Relation, Domain, Range): isOfGenre, entity, yagoClass  Fact: Elvis Presley isOfGenre Rock and Roll

19 Infobox attribute: Birth Name: Elvis Aaron Presley  Attribute Map (Attribute, Relation, Inverse, Manifold, Indirect): birth name, means  Relation Map (Relation, Domain, Range): means, yagoWord, entity  Fact: “Elvis Aaron Presley” means Elvis Presley (the word is the first argument, matching the domain yagoWord)
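The four infobox examples above all follow the same lookup: find the attribute in the attribute map, take the relation, and emit a fact about the article entity. A minimal Python sketch of that lookup follows. The `inverse` flag on "birth name" is an assumption based on the Inverse column and the yagoWord domain of means; `AttributeRule` and `harvest_infobox` are hypothetical names.

```python
from typing import NamedTuple


class AttributeRule(NamedTuple):
    relation: str
    inverse: bool = False  # if True, the article entity becomes the second argument


# Attribute map with the four attributes shown on the slides.
ATTRIBUTE_MAP = {
    "born": AttributeRule("bornOnDate"),
    "died": AttributeRule("diedOnDate"),
    "genre": AttributeRule("isOfGenre"),
    "birth name": AttributeRule("means", inverse=True),  # assumed to be inverse
}


def harvest_infobox(entity, infobox):
    """Turn infobox attribute/value pairs into (subject, relation, object) facts."""
    facts = []
    for attribute, value in infobox.items():
        rule = ATTRIBUTE_MAP.get(attribute.strip().lower())
        if rule is None:
            continue  # attribute has no entry in the map, so no fact is generated
        if rule.inverse:
            facts.append((value, rule.relation, entity))
        else:
            facts.append((entity, rule.relation, value))
    return facts


elvis_facts = harvest_infobox(
    "Elvis Presley",
    {"Born": "January 8, 1935", "Birth Name": "Elvis Aaron Presley", "Website": "n/a"},
)
```

The sketch leaves values as raw strings; checking them against the domain and range in the relation map is a separate step.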

20 Manifold Attributes  Some attributes may have multiple values  e.g. a person may have multiple children  Multiple facts are generated  e.g. one hasChild fact for each child

21 Indirect Attributes - 1  Some attributes do not concern the article entity, but another fact  e.g. the GDP year attribute does not describe the article entity (Republic of Singapore); it qualifies the GDP fact with the year 2008  Therefore, facts generated:  Singapore hasGDP 238.755 billion (fact #14)  #14 during 2008  i.e. Singapore hasGDP 238.755 billion during 2008  Attribute Map (Attribute, Relation, Inverse, Manifold, Indirect): gdp ppp, hasGDP; gdp year, during (Indirect)

22 Indirect Attributes - 2 Singapore Infobox
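Manifold and indirect attributes can be sketched together: a manifold attribute yields one fact per value, and an indirect attribute yields a fact whose subject is the identifier of another fact. In the Python sketch below, the `indirect_of` field (naming the attribute whose fact is being qualified) and the comma-splitting of manifold values are assumptions made for illustration, not details taken from the slides.

```python
from itertools import count

# Hypothetical attribute map covering the slide examples.
ATTRIBUTE_MAP = {
    "children": {"relation": "hasChild", "manifold": True},
    "gdp ppp": {"relation": "hasGDP"},
    "gdp year": {"relation": "during", "indirect_of": "gdp ppp"},
}


def harvest(entity, infobox):
    """Return a dict mapping fact identifiers to (subject, relation, object)."""
    ids = count(1)
    facts = {}
    fact_of_attribute = {}  # attribute -> identifier of the last fact it produced

    # First pass: direct attributes, split into several facts if manifold.
    for attribute, value in infobox.items():
        rule = ATTRIBUTE_MAP.get(attribute)
        if rule is None or "indirect_of" in rule:
            continue
        values = [v.strip() for v in value.split(",")] if rule.get("manifold") else [value]
        for v in values:
            fact_id = f"#{next(ids)}"
            facts[fact_id] = (entity, rule["relation"], v)
            fact_of_attribute[attribute] = fact_id

    # Second pass: indirect attributes describe a fact, not the article entity.
    for attribute, value in infobox.items():
        rule = ATTRIBUTE_MAP.get(attribute)
        if rule and "indirect_of" in rule:
            target = fact_of_attribute.get(rule["indirect_of"])
            if target is not None:
                facts[f"#{next(ids)}"] = (target, rule["relation"], value)
    return facts


singapore = harvest("Singapore", {"gdp year": "2008", "gdp ppp": "238.755 billion"})
```

Running the direct attributes first means the year can be attached even when it appears before the GDP value in the infobox.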

23 Type of Infobox  Song Infobox (American Pie): Released October, 1971; Format vinyl record; Genre Folk Rock; Length 8:33 mins; Label United Artists; Writer Don McLean  Car Infobox (Tesla Roadster): Manufacturer Tesla Motors; Production 2008-present; Class Roadster; Length 3,946 mm; Width 1,873 mm; Height 1,127 mm

24 Type of Infobox: Attribute Map  Attribute Map (Attribute, Relation, Inverse, Manifold, Indirect): car #length, hasLength; song #length, hasDuration  Song Infobox: American Pie hasDuration 8:33  Car Infobox: Tesla Roadster hasLength 3946
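The same attribute name ("Length") maps to different relations depending on the infobox type. One way to sketch this in Python is to key the map on the infobox type plus attribute and fall back to the bare attribute name. The fallback behaviour and the key format `type#attribute` are assumptions for illustration; `relation_for` is a hypothetical helper.

```python
# Type-qualified entries take precedence over plain attribute entries.
ATTRIBUTE_MAP = {
    "car#length": "hasLength",
    "song#length": "hasDuration",
    "genre": "isOfGenre",
}


def relation_for(infobox_type, attribute):
    """Look up the relation for an attribute, preferring a type-specific entry."""
    specific = f"{infobox_type}#{attribute}".lower()
    return ATTRIBUTE_MAP.get(specific, ATTRIBUTE_MAP.get(attribute.lower()))
```

With this map, `relation_for("Song", "Length")` and `relation_for("Car", "Length")` resolve to different relations, while an attribute with no matching entry yields `None` and produces no fact.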

25 Information Extraction - Word Level Techniques  Wikipedia Redirects  virtual redirect page for “Presley, Elvis” links to “Elvis Presley”  Each redirect gives a ‘means’ fact  e.g. “Presley, Elvis” means Elvis Presley  Parsing Person Names  extract the name components  establish relations givenNameOf and familyNameOf  e.g. Presley familyNameOf Elvis Presley; Elvis givenNameOf Elvis Presley
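Both word-level techniques reduce to small transformations, sketched below in Python. The name split (first token as given name, last token as family name) is a deliberately naive assumption; the slides do not describe how the actual name parser works. `means_facts` and `name_facts` are hypothetical helpers.

```python
def means_facts(redirects):
    """Each redirect (source title -> target page) yields one 'means' fact."""
    return [(source, "means", target) for source, target in redirects.items()]


def name_facts(entity):
    """Naive name parsing: first token is the given name, last is the family name."""
    parts = entity.split()
    if len(parts) < 2:
        return []  # a single token cannot be split into given and family name
    return [
        (parts[0], "givenNameOf", entity),
        (parts[-1], "familyNameOf", entity),
    ]
```

For example, `means_facts({"Presley, Elvis": "Elvis Presley"})` produces the fact from the slide, and `name_facts("Elvis Presley")` produces the two name facts.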

26 Wikipedia Categories Categories: Presidents of the United States | Lists of office-holders | Lists of Presidents Categories: Rift Valleys | North Sea | Rivers of Germany | Articles needing translation from German Wikipedia | Rivers of Netherlands Categories: Canadian Singers| Canadian male singers| 1959 births | English-language singers | Living people | Grammy Award Winners | Portrait photographers

27 Facts created from Wikipedia Categories  Rhine locatedIn Germany  Bryan Adams bornOnDate 1959  Bryan Adams hasWonAward Grammy Award  Abraham Lincoln politicianOf United States

28 Information Extraction - Category Harvesting  Relational Categories
Table: Some Category Heuristics
Regular Expression                 Relation
([0-9]{3,4}) births                bornOnDate
([0-9]{3,4}) deaths                diedOnDate
([0-9]{3,4}) establishments        establishedOnDate
([0-9]{3,4}) books|novels          writtenOnDate
Mountains|Rivers in (.*)           locatedIn
Presidents|Governors of (.*)       politicianOf
(.*) winners                       hasWonPrize
[A-Za-z]+ (.*) winners             hasWonPrize
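The category heuristics table above can be applied with ordinary regular expressions. The Python sketch below takes the first pattern that matches the whole category name and uses the captured group as the second argument of the fact. Grouping the alternations (for example `(?:books|novels)`) is my reading of the table, and first-match-wins ordering is an assumption; `category_facts` is a hypothetical helper.

```python
import re

# (pattern, relation) pairs following the table; alternations are grouped explicitly.
CATEGORY_HEURISTICS = [
    (re.compile(r"([0-9]{3,4}) births"), "bornOnDate"),
    (re.compile(r"([0-9]{3,4}) deaths"), "diedOnDate"),
    (re.compile(r"([0-9]{3,4}) establishments"), "establishedOnDate"),
    (re.compile(r"([0-9]{3,4}) (?:books|novels)"), "writtenOnDate"),
    (re.compile(r"(?:Mountains|Rivers) in (.*)"), "locatedIn"),
    (re.compile(r"(?:Presidents|Governors) of (.*)"), "politicianOf"),
    (re.compile(r"(.*) winners"), "hasWonPrize"),
    (re.compile(r"[A-Za-z]+ (.*) winners"), "hasWonPrize"),
]


def category_facts(entity, categories):
    """Generate one fact per category that matches a relational heuristic."""
    facts = []
    for category in categories:
        for pattern, relation in CATEGORY_HEURISTICS:
            match = pattern.fullmatch(category)
            if match:
                facts.append((entity, relation, match.group(1)))
                break  # first matching heuristic wins
    return facts
```

Categories that match no pattern, such as "Living people", simply produce no fact here; they are candidates for the type extraction step instead.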

29 2. Connecting Wikipedia and WordNet – What is WordNet  Lexical database for the English language  Created at the Cognitive Science Laboratory of Princeton University  Groups English words into sets of synonyms called synsets  Provides short, general definitions  Provides hypernym/hyponym relations  e.g. canine is hypernym, dog is hyponym


31 Connecting Wikipedia and WordNet – Type Extraction  Goal: create class hierarchy  e.g. singer subClassOf performer; performer subClassOf artist  hyponymy relation from WordNet  Wikipedia class ‘American people in Japan’ is a subclass of WordNet class ‘person’

32 Classifications of Categories  Conceptual Categories  e.g. Albert Einstein is in ‘Naturalized citizens of the United States’  Administrative Categories  e.g. Albert Einstein is in ‘Articles with unsourced statements’  Relational Information  1879 births  Thematic Vicinity  Physics

33 Identification of Conceptual Categories  Only conceptual categories are used  Shallow linguistic parsing of category names  e.g. category ‘American people in Japan’  Break category into pre-modifier - ‘American’ head - ‘people’ post-modifier - ‘in Japan’  If head is plural, then category is conceptual category  Extract class from Wikipedia category  Connect to class from WordNet  e.g. the Wikipedia class ‘American people in Japan’ has to be made a subclass of the WordNet class ‘person’

34 Algorithm
Function wiki2wordnet(c)
Input: Wikipedia category name c
Output: WordNet synset
1 head = headCompound(c)
2 pre = preModifier(c)
3 post = postModifier(c)
4 head = stem(head)
5 If there is a WordNet synset s for pre + head
6   return s
7 If there are WordNet synsets s1, …, sn for head
8   (ordered by their frequency for head)
9   return s1
10 fail

35 Explanation of Algorithm
Input: American people in Japan
1. Pre-modifier: American
2. Head: people
3. Post-modifier: in Japan
4. Stem(head): person
5. If there is a WordNet synset for ‘American person’
6. return that synset
7. If there are s1, …, sn synsets for ‘person’
8. (ordered by frequency for ‘person’)
9. return s1
10. Fail
Output: person
Result: American People in Japan subClassOf person
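The algorithm and its walkthrough above can be sketched in runnable Python with stand-ins for the linguistic parts. Everything marked as a toy is an assumption: the preposition list, the tiny stemmer, and the `TOY_WORDNET` dictionary (with placeholder synset labels ordered by frequency) replace the shallow parser and the real WordNet lookup, which the slides do not specify.

```python
# Toy resources; the real system uses a noun-group parser and WordNet itself.
PREPOSITIONS = {"in", "of", "from", "by", "at", "for"}
IRREGULAR_PLURALS = {"people": "person"}
TOY_WORDNET = {
    "person": ["person#1", "person#2"],  # placeholder labels, most frequent first
    "singer": ["singer#1"],
}


def split_category(name):
    """Split a non-empty category name into (pre-modifier, head, post-modifier)."""
    words = name.split()
    # The post-modifier starts at the first preposition after the first word.
    cut = next(
        (i for i, w in enumerate(words) if i > 0 and w.lower() in PREPOSITIONS),
        len(words),
    )
    return " ".join(words[: cut - 1]), words[cut - 1], " ".join(words[cut:])


def stem(word):
    """Toy stemmer: irregular plural table, otherwise strip a trailing 's'."""
    lower = word.lower()
    if lower in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[lower]
    if lower.endswith("s") and not lower.endswith("ss"):
        return lower[:-1]
    return lower


def wiki2wordnet(category, wordnet=TOY_WORDNET):
    """Return a synset label for the category, or None where the slide says 'fail'."""
    pre, head, _post = split_category(category)  # steps 1-3
    head = stem(head)                            # step 4
    compound = f"{pre} {head}".strip().lower()
    if pre and wordnet.get(compound):            # steps 5-6: synset for pre + head
        return wordnet[compound][0]
    if wordnet.get(head):                        # steps 7-9: most frequent synset
        return wordnet[head][0]
    return None                                  # step 10: fail
```

For "American people in Japan" the toy dictionary has no entry for "american person", so the function falls through to the most frequent entry for "person", mirroring the walkthrough on the slide.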

36 Fig.: WordNet search for “person” Fig.: WordNet search for ‘American Person’

37 Exceptions  Complete hierarchy of classes  Upper classes from WordNet  Leaves from Wikipedia  2 dozen cases failed  Categories with head compound “capital”  In Wikipedia, it means “capital city”  In WordNet, it means “financial asset”  These cases were corrected manually

38 3. Quality Control  Canonicalization  Each fact and each entity reference unique  an entity is always referred to by the same identifier in all facts in YAGO  Type Checking  eliminates individuals that do not have class  eliminates facts that do not respect domain and range constraints  an argument of a fact in YAGO is always an instance of the class required by the relation

39 Canonicalization - 1  Redirect Resolution  infobox heuristics deliver facts that have Wikipedia entities (i.e. Wikipedia links) as arguments  These links may not be correct Wikipedia page identifiers  Check if each argument is a correct Wikipedia identifier  If not, replace it by the correct, redirected identifier  e.g. Hermitage Museum locatedIn St. Petersburg becomes Hermitage Museum locatedIn Saint Petersburg

40 Canonicalization - 2  Removal of Duplicate facts  Sometimes, 2 heuristics deliver the same fact.  canonicalization eliminates one of them  e.g., category ‘1935 births’ yields the fact: Elvis Presley bornOnDate 1935  Infobox attribute ‘Born: January 8, 1935’ yields the fact: Elvis Presley bornOnDate January 8, 1935
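The two canonicalization steps above can be sketched as one pass in Python: first map every argument through the redirect table, then drop facts that have become identical. This sketch only removes exact duplicates; deciding between a year-only date and a full date, as in the bornOnDate example, is not covered. `canonicalize` is a hypothetical helper.

```python
def canonicalize(facts, redirects):
    """Resolve redirects in both arguments, then drop exact duplicate facts."""
    seen = set()
    result = []
    for subject, relation, obj in facts:
        # Replace a redirect title by the page identifier it points to.
        subject = redirects.get(subject, subject)
        obj = redirects.get(obj, obj)
        fact = (subject, relation, obj)
        if fact not in seen:  # keep only the first copy, preserving order
            seen.add(fact)
            result.append(fact)
    return result


canonical = canonicalize(
    [
        ("Hermitage Museum", "locatedIn", "St. Petersburg"),
        ("Hermitage Museum", "locatedIn", "Saint Petersburg"),
    ],
    {"St. Petersburg": "Saint Petersburg"},
)
```

Resolving redirects before deduplicating matters: the two Hermitage facts only become duplicates once "St. Petersburg" has been replaced by its canonical identifier.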

41 Type Checking - 1  Reductive Type Checking  Sometimes class of entity cannot be determined  Such facts are discarded e.g. Wikipedia entities that have been proposed for an article, but that do not have a page yet  Inductive Type Checking  Type constraints can be used to generate facts  e.g. Elvis Presley bornOnDate January 8, 1935  So, Elvis Presley is a person  Regular expression check to ensure entity name pattern of given name and family name
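Inductive and reductive type checking can be sketched as two passes in Python. The regular expression for a "given name plus family name" pattern is an assumption, since the slide does not give the actual expression, and `type_check` is a hypothetical helper.

```python
import re

# Assumed pattern: two or more capitalised words, e.g. "Elvis Presley".
PERSON_NAME = re.compile(r"[A-Z][a-z]+( [A-Z][a-z]+)+")


def type_check(facts, types, domains):
    """Return (kept_facts, types) after inductive and reductive type checking.

    types:   entity -> set of known classes
    domains: relation -> class required for the first argument
    """
    types = {entity: set(classes) for entity, classes in types.items()}

    # Inductive: a person-only relation plus a plausible name implies 'person'.
    for subject, relation, _obj in facts:
        if (
            subject not in types
            and domains.get(relation) == "person"
            and PERSON_NAME.fullmatch(subject)
        ):
            types[subject] = {"person"}

    # Reductive: discard facts about entities that still have no class.
    kept = [fact for fact in facts if fact[0] in types]
    return kept, types


kept, inferred = type_check(
    [
        ("Elvis Presley", "bornOnDate", "January 8, 1935"),
        ("Unwritten article", "locatedIn", "Nowhere"),
    ],
    {},
    {"bornOnDate": "person"},
)
```

The inductive pass runs first so that entities rescued by a domain constraint are not thrown away by the reductive pass.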

42 Type Checking - 2  Type Coherence Checking  Sometimes, classification yields wrong results  e.g. Abraham Lincoln is an instance of 13 classes  12 are subclasses of class ‘person’, e.g. lawyer, president  The 13th class is class ‘cabinet’  Class hierarchy of YAGO is partitioned into branches  e.g. locations, artifacts, people, other physical entities, and abstract entities  The branch that most types lead to is determined  Other types are purged
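Type coherence checking amounts to a majority vote over branches. In the Python sketch below, the `branch_of` mapping from class to branch is toy data (placing ‘cabinet’ under artifacts is an assumption), and `purge_incoherent_types` is a hypothetical helper.

```python
from collections import Counter


def purge_incoherent_types(classes, branch_of):
    """Keep only the classes that lie in the branch most of the classes lead to."""
    if not classes:
        return []
    counts = Counter(branch_of[c] for c in classes)
    majority_branch, _ = counts.most_common(1)[0]
    return [c for c in classes if branch_of[c] == majority_branch]


lincoln_types = purge_incoherent_types(
    ["lawyer", "president", "cabinet"],
    {"lawyer": "people", "president": "people", "cabinet": "artifacts"},
)
```

In the Abraham Lincoln example, the classes under ‘person’ outvote ‘cabinet’, so only the person-branch classes survive.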

43 References  Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. Max-Planck-Institute for Computer Science, Saarbruecken, Germany  Fabian M. Suchanek. Automated Construction and Growth of a Large Ontology. Thesis for obtaining the title of Doctor of Engineering of the Faculties of Natural Sciences and Technology of Saarland University  Wikipedia http://en.wikipedia.org/wiki/Main_Page  WordNet http://wordnet.princeton.edu/

44 Thank You, Any Questions?

