Presentation is loading. Please wait.

Presentation is loading. Please wait.

Wikitology: A Wikipedia Derived Knowledge Base Zareen Syed Advisor: Dr. Tim Finin February 6th, 2009 Zareen Syed Advisor: Dr. Tim Finin February 6th, 2009.

Similar presentations


Presentation on theme: "Wikitology: A Wikipedia Derived Knowledge Base Zareen Syed Advisor: Dr. Tim Finin February 6th, 2009 Zareen Syed Advisor: Dr. Tim Finin February 6th, 2009."— Presentation transcript:

1 Wikitology: A Wikipedia Derived Knowledge Base Zareen Syed Advisor: Dr. Tim Finin February 6th, 2009 Zareen Syed Advisor: Dr. Tim Finin February 6th, 2009

2 2 October 2015Page 2 Outline Introduction and MotivationIntroduction and Motivation Related WorkRelated Work Proposed WorkProposed Work TimelineTimeline Work ProgressWork Progress ConclusionConclusion

3 2 October 2015Page 3 Introduction Wikipedia Encyclopedia Developed Collaboratively Freely available online Millions of articles English Wikipedia (2,723,767 articles) Multiple Languages (More than 260) Structured and un-structured content

4 2 October 2015Page 4 Introduction Wikipedia Content and Organization Article Text Categories and Category Hierarchy Inter-article Links Info-boxes Disambiguation Pages Redirection Pages Talk Pages History Pages Meta-data

5 2 October 2015Page 5 Motivation Challenges Human Understandable Content (not machine readable) How to make it more structured and organized to improve machine readability How to automatically exploit the knowledge in Wikipedia to solve some real world problems

6 2 October 2015Page 6 Thesis Statement We can exploit Wikipedia and other related knowledge sources to automatically create knowledge about the world supporting a set of common use cases such as: Concept Prediction Information Retrieval Information Extraction

7 2 October 2015Page 7 Proposed Contributions Developing a Novel Hybrid Knowledge Base composed of structured, semi-structured and un- structured information extracted from Wikipedia and other related sources Developing Novel Application Specific Algorithms for exploiting the hybrid knowledge base Task Based Evaluation of the system on common use-cases such as Concept Prediction, Information Retrieval and Information Extraction

8 2 October 2015Page 8 Outline Introduction and MotivationIntroduction and Motivation Related WorkRelated Work Proposed WorkProposed Work TimelineTimeline Work ProgressWork Progress ConclusionConclusion

9 2 October 2015Page 9 Related Work Information Extraction Relation extraction [35] Co-reference resolution [25] Named Entity Classification [52] Natural Language Processing Automatic word sense disambiguation [27] Searching synonyms [28]

10 2 October 2015Page 10 Related Work Information Retrieval Text categorization [24] Computing semantic relatedness [30,31,32] Predicting document topics [26] Search Engine [69] Semantic Web DBPedia [46] Semantic MediaWiki [46] Linked Open Data Project [23] Freebase [22]

11 2 October 2015Page 11 Outline Introduction and MotivationIntroduction and Motivation Related WorkRelated Work Proposed WorkProposed Work TimelineTimeline Work ProgressWork Progress

12 2 October 2015Page 12 Proposed Work Refining, Enriching and Exploiting Structured Content in Wikipedia Integrating other related knowledge sources Developing application specific algorithms Developing a dynamic and scalable architecture

13 2 October 2015Page 13 Issues Single document in too many categories: George W. Bush is included in about 30 categories Links between articles belonging to very different categories John F. Kennedy has a link for “coincidence theory” which belongs to the Mathematical Analysis/ Topology/Fixed Points. Number of articles with in a category: Some categories are under represented where as others have many articles Administrative Categories For eg: Clean up from Sep 2006 Articles with unsourced statements Links to words in an article For eg. If the word United States appears in the document then that word might be linked to the page on “United States”

14 2 October 2015Page 14 Issues Category Hierarchy: Multiple Parents (Thesaurus) Noisy “Animals” category defined in the sub-tree rooted at “People” loose subsumption Geography-> Geography by place -> Regions-> Regions of Asia- >Middle East -> Dances of Middle East Events->Events by year->Lists of leaders by year

15 2 October 2015Page 15 Category Hierarchy Filtering out Administrative Categories Algorithms for Selecting and Ranking Categories Inferring and Labeling Semantic Relations between Categories Refining Subsumption (Taxonomy) Instance-of Relation Using Information in Wikipedia Lists_of_Topics Using Specific Administrative Categories Done Refining, Enriching and Exploiting Structured Content in Wikipedia

16 2 October 2015Page 16 Refining, Enriching and Exploiting Structured Content in Wikipedia Inter-Article Links: Problem: Don’t imply semantic relatedness Links to locations, term definitions, dates, entities Possible solutions: Classifying Link Types Introducing Link Weights Done

17 2 October 2015Page 17 Redirection Pages Refining, Enriching and Exploiting Structured Content in Wikipedia Disambiguation pages

18 2 October 2015Page 18 Proposed Work Exploring Other structured content Talk pages, user pages, history pages and meta-data Other structured resources Integrating structured information from other sources like DBpedia and Freebase in Wikitology How and When to employ reasoning over the RDF triples

19 2 October 2015Page 19 Proposed Work Developing Novel Application Specific Algorithms on top of the hybrid Wikitology Knowledge Base for applications such as Concept Prediction Information Retrieval Information Extraction

20 2 October 2015Page 20 Proposed Work Evaluation Main Approaches to Evaluating Ontologies Gold Standard Evaluation (Comparison to an existing Ontology) Criteria based Evaluation (By humans) Task based Evaluation (Application based) Comparison with Source of Data (Data driven) Using a Reasoning Engine Our Approach to Evaluation Task based Evaluation (Application based)

21 2 October 2015Page 21 IR Index Relational Database Relational Database Triple Store RDF Reasoner Page Links Graph Category Links Hierarchical Graph Category Links Hierarchical Graph Articles Wikitology Code Application Specific Algorithms Application Specific Algorithms Application Specific Algorithms Application Specific Algorithms Application Specific Algorithms Application Specific Algorithms Wikitology Overview

22 2 October 2015Page 22 Outline Introduction and MotivationIntroduction and Motivation Related WorkRelated Work Proposed WorkProposed Work Time LineTime Line Work ProgressWork Progress ConclusionConclusion

23 2 October 2015Page 23 Time Line No.Mile StonesExpected Completion Date 1Enriching Wikitology by extracting additional information from Wikipedia May, 2009 2Studying other related knowledge sources in detail such as Freebase, DBPedia, YAGO etc. May, 2009 3Incorporating additional knowledge sources to enrich Wikitology May, 2009 4Working on techniques to improve applications in Information Retrieval and Information Extraction using additional features generated from Wikitology Dec, 2009 5Evaluating the Wikitology knowledge baseMay, 2010 6Thesis write upAug, 2010

24 2 October 2015Page 24 Outline Introduction and MotivationIntroduction and Motivation Related WorkRelated Work Proposed WorkProposed Work Time LineTime Line Work ProgressWork Progress ConclusionConclusion

25 2 October 2015Page 25 Work Done Case Study 1: Concept Prediction Case Study 2: Document Expansion for Information Retrieval Case Study 3: Named Entity Classification Case Study 4: Co-reference Resolution Case Study 5: Concept Based Features for Information Retrieval In Progress

26 2 October 2015Page 26 Case Study 1 Concept Prediction [2] Problem: Predict the individual document topics as well as concepts common to a set of documents Approach: Hybrid Knowledge base: Wikitology 1.0 Algorithms for selecting and aggregating terms

27 2 October 2015Page 27 Wikitology 1.0 Wikipedia as an Ontology Each article is a concept in the ontology Terms linked via Wikipedia’s category system and inter- article links It’s a consensus ontology created, kept current and maintained by a diverse community Overall content quality is high Terms have unique IDs (URLs) and are “self describing” for people

28 2 October 2015Page 28 Wikitology 1.0 Structured Data Specialized Concepts (article titles) Generalized Concepts (category titles) Inter-category and Inter-article links as relations between concepts Article-Category links as relations between specialized and generalized concepts Un-Structured Data Article Text ( A way to map ontology terms to free text) Algorithms Algorithms to select, rank and aggregate concepts using the hybrid knowledge base

29 2 October 2015Page 29 Method 1 Query doc(s) similar to Cosine similarity Similar Wikipedia Articles Using Wikipedia Article Text and Categories to Predict Concepts 0.2 0.1 0.3 0.8 0.2 Input

30 2 October 2015Page 30 Method 1 Query doc(s) similar to Cosine similarity Wikipedia Category Graph Similar Wikipedia Articles Using Wikipedia Article Text and Categories to Predict Concepts 0.2 0.1 0.3 0.8 0.2 Input

31 2 October 2015Page 31 Method 1 Query doc(s) similar to Rank Categories 1.Links 2.Cosine similarity Cosine similarity Wikipedia Category Graph Similar Wikipedia Articles Using Wikipedia Article Text and Categories to Predict Concepts 0.2 0.1 0.3 0.8 0.2 0.9 3 Input Output

32 2 October 2015Page 32 Method 2 Query doc(s) Similar to Cosine similarity Wikipedia Category Graph Using Spreading Activation on Category Links Graph to get Aggregated Concepts 0.2 0.1 0.3 0.8 0.2 Input Ranked Concepts based on Final Activation Score Output Spreading Activation Input Function Output Function

33 2 October 2015Page 33 Method 3 Query doc(s) Similar To Ranked Concepts based on Final Activation Score Spreading Activation Threshold: Ignore Spreading Activation to articles with less than 0.4 Cosine similarity score Edge Weights: Cosine similarity between linked articles Wikipedia Article Links Graph Using Spreading Activation on Article Links Graph Node Input Function Node Output Function Output Input

34 2 October 2015Page 34 Wikitology 1.0 The system was evaluated by predicting the categories and article links of existing Wikipedia articles and comparing with the ground truth It was observed that Wikitology 1.0 system was able to predict the document topics and common concepts with high accuracy when the article concepts were well represented within Wikipedia

35 2 October 2015Page 35 * In Collaboration with Paul McNamee, John Hopkins University Applied Physics Laboratory Doc: FT921-4598 (3/9/92)... Alan Turing, described as a brilliant mathematician and a key figure in the breaking of the Nazis' Enigma codes. Prof IJ Good says it is as well that British security was unaware of Turing's homosexuality, otherwise he might have been fired 'and we might have lost the war'. In 1950 Turing wrote the seminal paper 'Computing Machinery And Intelligence', but in 1954 killed himself... Turing_machine, Turing_test, Church_Turing_thesis, Halting_problem, Computable_number, Bombe, Alan_Turing, Recusion_theory, Formal_methods, Computational_models, Theory_of_computation, Theoretical_computer_science, Artificial_Intelligence IR Effectiveness Using Wikipedia Concepts MAPP@10 base0.20760.4207 Base + rf0.24700.4480 Concepts + rf0.24000.4553 * In Collaboration with Paul McNamee, John Hopkins University Applied Physics Laboratory Preliminary work with TREC documents Case Study 2 Document Expansion with Wikipedia Derived Ontology Terms [21]*

36 2 October 2015Page 36 Semi-automated generation of Training data Persons, Locations and Events Experimenting with different feature sets Inter-article link labeling Results showing accuracy obtained using different feature sets Case Study 3 Named Entity Classification

37 2 October 2015Page 37 Problem: To determine whether various named people, organizations or relations from different documents refer to the same object in the world. For example, does the “Condoleezza Rice” mentioned in one document refer to the same person as the “Secretary Rice” from another? * In Collaboration with John Hopkins University Human Language Technology Center of Excellence Case Study 4 Cross Document Entity Co-reference Resolution [21]*

38 2 October 2015Page 38 Entity Document (EDOC) ABC19980430.1830.0091.LDC2000T44-E2 Webb Hubbell PER Individual NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell" NOM: "Mr. " "friend" "income" PRO: "he" "him" "his",. abc's accountant after again ago all alleges alone also and arranged attorney avoid been b efore being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough eva sion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictme nt inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady la te law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutor s questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years Entity documents capture information about entities extracted from documents, including mention strings, type and subtype, and text surrounding the mentions.

39 2 October 2015Page 39 Wikitology 2.0 Enhancements Structured Data Specialized Concepts (article titles) Generalized Concepts (category titles) Inter-category and Inter-article links as relations between concepts Article-Category links as relations between specialized and generalized concepts YAGO types (to identify entity type) Table with Disambiguation set (to identify highly confused entities) Aliases using Redirect pages Un-Structured Data Article Text Redirect titles (added to article text)

40 2 October 2015Page 40 Wikitology 2.0 Data Structures Lucene Index Concept Title + Redirected Titles (field) Article Text + Redirected Titles (field) RDF field with Entity Type (YAGO type) Graphs Category links graph Article links graph Article-Category links Tables Disambiguation Set derived from disambiguation pages

41 2 October 2015Page 41 Wikitology 2.0 Custom Query Front end The EDOC’s name mention strings Wikitology’s title field slightly higher weight to the longest mention, i.e., “Webb Hubbell” The EDOC type RDF Field: Yago Type Name mention strings + Contextual text Text (Wikitology Article Contents)

42 2 October 2015Page 42 Wikitology Features Article Vector for ABC19980430.1830.0091.LDC2000T44-E2 1.0000 Webster_Hubbell 0.3794 Hubbell_Trading_Post_National_Historic_Site 0.3770 United_States_v._Hubbell 0.2263 Hubbell_Center 0.2221 Whitewater_controversy Category Vector for ABC19980430.1830.0091.LDC2000T44-E2 0.2037 Clinton_administration_controversies 0.2037 American_political_scandals 0.2009 Living_people 0.1667 1949_births 0.1667 People_from_Arkansas 0.1667 Arkansas_politicians 0.1667 American_tax_evaders 0.1667 Arkansas_lawyers Each entity document is tagged by Wikitology, producing vectors of article and category tags. Note the clear match with a known person in Wikipedia.

43 2 October 2015Page 43 Features Derived from Wikitology 2.0 NameRangeTypeDescription APL20WAS{0,1}sim1 if the top article tags for the two entities are identical, 0 otherwise APL21WCS{0,1}sim1 if the top category tags for the two entities are identical, 0 otherwise APL22WAM[0..1]simThe cosine similarity of the medium length article vectors (N=5) for the two entities APL23WcM[0..1]simThe cosine similarity of the medium length category vectors (N=4) for the two entities APL24WAL[0..1]simThe cosine similarity of the long length article vectors (N=8) for the two entities APL31WAS2[0..1]simmatch of entities top Wikitology article tag, weighted by avg(score1,score2) APL32WCS2[0..1]simmatch of entities top Wikitology category tag, weighted by avg(score1,score2) APL26WDP{0,1}dissim1 if both entities are of type PER and their top article tags are different, 0 otherwise APL27WDD{0,1}dissim1 if the two top article tags are members of the same disambiguation set, 0 otherwise APL28WDO{0,1}dissim1 if both entities are of type ORG and their top article tags are different, 0 otherwise APL29WDP2[0..1]dissimMatch both entities are of type PER and their top article tags are different, weighted by 1- abs(score1-score2), 0 otherwise APL30WDP2[0..1]dissimMatch if both entities are of type ORG and their top article matches are different organizations, weighted by 1-abs(score1-score2), 0 otherwise Twelve features were computed for each pair of entities using Wikitology, seven aimed at measuring their similarity and five for measuring their dissimilarity.

44 2 October 2015Page 44 Evaluation Evaluation results for cross-document entity co- reference task using Wikitology features match TP rateFP ratePrecisionRecallF-Measure yes.722.001.966.722.826 no.999.278.99.999.994

45 2 October 2015Page 45 Case Study 5 Feature Generation to Improve Information Retrieval Performance* Incorporating Generalized Concept Features in MORAG [69] search engine * Work being done during internship at RiverGlass Company

46 2 October 2015Page 46 Incorporating Wikitology based features in MORAG search engine MORAG Search Engine Concept features generated using Wikipedia (ESA) Feature Selection using pseudo- relevance feedback Merged Ranking of Concept scores and BOW scores MORAG Search Engine Concept features generated using Wikipedia (ESA) Feature Selection using pseudo- relevance feedback Merged Ranking of Concept scores and BOW scores

47 2 October 2015Page 47 Outline Introduction and MotivationIntroduction and Motivation Related WorkRelated Work Proposed WorkProposed Work TimelineTimeline Work ProgressWork Progress ConclusionConclusion

48 2 October 2015Page 48 Thesis Statement We can exploit Wikipedia and other related knowledge sources to automatically create knowledge about the world supporting a set of common use cases such as: Concept Prediction Information Retrieval Information Extraction

49 2 October 2015Page 49 Proposed Contributions 1.Developing a Novel Hybrid Knowledge base composed of structured and un-structured information extracted from Wikipedia and other related sources Wikitology 1.0 Wikitology 2.0

50 2 October 2015Page 50 Proposed Contributions 2.Developing Novel Application Specific Algorithms for exploiting the hybrid knowledge base Methods for Concept Prediction Ranking methods and Spreading Activation Co-reference Resolution Novel Entity representation and Hybrid Querying Information Retrieval Document Expansion, Generalized Concept Features augmentation

51 2 October 2015Page 51 Proposed Contributions 3.Task Based Evaluation of the system on common use-cases such as Concept Prediction, Information Retrieval and Information Extraction Metrics: Precision and Recall

52 2 October 2015Page 52 The End Thank you Questions?


Download ppt "Wikitology: A Wikipedia Derived Knowledge Base Zareen Syed Advisor: Dr. Tim Finin February 6th, 2009 Zareen Syed Advisor: Dr. Tim Finin February 6th, 2009."

Similar presentations


Ads by Google