
1 8/28/08 Wikitology: Wikipedia as an Ontology
Tim Finin, Zareen Syed and Anupam Joshi
University of Maryland, Baltimore County
http://ebiquity.umbc.edu/resource/html/id/250/

2 Motivation
Identifying the topics and concepts associated with text or text entities is a task common to many applications:
– Annotation and categorization of documents
– Modelling user interests
– Business intelligence
– Selecting advertisements
– Improving information retrieval
– Better named entity extraction and disambiguation

3 What’s a document about?
Two common approaches:
(1) Select words and phrases, using TF-IDF, that characterize the document
(2) Map the document to a list of terms from a controlled vocabulary or ontology
Approach (1) is flexible and doesn’t require creating and maintaining an ontology; approach (2) can connect documents to a rich knowledge base.
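The TF-IDF scoring in approach (1) can be sketched in a few lines of Python. This is a minimal illustration only; the corpus, documents, and tokenization below are invented, not from the deck:

```python
import math
from collections import Counter

def tfidf(doc_tokens, corpus):
    """Score each term in doc_tokens by TF-IDF against a corpus of token lists."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)        # document frequency
        idf = math.log(n_docs / (1 + df))               # rarer terms score higher
        scores[term] = (count / len(doc_tokens)) * idf  # term frequency x idf
    return scores

# toy corpus of three already-tokenized documents
corpus = [["wikipedia", "ontology", "article"],
          ["entity", "extraction", "text"],
          ["wikipedia", "category", "graph"]]
doc = ["wikipedia", "ontology", "ontology", "graph"]
scores = tfidf(doc, corpus)
# "ontology" outranks "wikipedia": it is frequent in doc but rare in the corpus
```

Terms common across the corpus (like "wikipedia" here) get a low IDF and drop out, which is exactly why TF-IDF picks out characterizing terms.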

4 Wikitology!
Using Wikipedia as an ontology offers the best of both approaches:
– each article (~4M) is a concept in the ontology
– terms are linked via Wikipedia’s category system (~200K categories) and inter-article links
– lots of structured and semi-structured data
It’s a consensus ontology created and maintained by a diverse community
Broad coverage, multilingual, very current
Overall content quality is high

5 Wikitology features
Terms have unique IDs (URLs) and are “self-describing” for people
Underlying graphs provide structure: categories, article links, disambiguation
Article history contains useful meta-data (e.g., for trust and provenance)
External sources provide more info (e.g., Google’s PageRank)
Annotated with structured data: RDF from DBpedia and Freebase

6 Constructing the Wikitology KB
[diagram: WordNet, Yago, Freebase, databases, and human input & editing feed the Wikitology KB, which is annotated with RDF and OWL statements]

7 ACE 2008
ACE 2008 is a NIST-sponsored exercise in entity extraction from text
Focus on resolving entities across documents, e.g., “Dr. Rice” mentioned in doc 18397 is the same as “Secretary of State” in doc 46281
20K documents in English and Arabic
We participated on a team from the JHU Human Language Technology Center of Excellence
[pipeline diagram: Documents → NLP → FEAT → ML → clust → KB entities]

8 ACE 2008
BBN’s Serif system produces text annotated with named entities (people or organizations): Dr. Rice, Ms. Rice, the secretary, she, secretary Rice
Featurizers score pairs of entities for co-reference: (CNN-264772-E32, AFP-7373726-E19, 0.6543)
A machine learning system combines the evidence
A simple clustering algorithm identifies clusters
[pipeline diagram: Documents → NLP → FEAT → ML → clust → KB entities]

9 Wikitology tagging
Using Serif’s output, we produced an entity document for each entity, including the entity’s name, nominal and pronominal mentions, APF type and subtype, and words in a window around the mentions
We tagged entity documents using Wikitology, producing vectors of (1) terms and (2) categories for the entity
We used the vectors to compute features measuring entity-pair similarity/dissimilarity
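The similarity features over these tag vectors are cosine similarities. A minimal sketch, with sparse vectors as {tag: weight} dicts; the two example vectors below are abbreviated/invented for illustration, not the system's actual output:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {tag: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Wikitology article tag vectors for two entity documents (illustrative weights)
e1 = {"Webster_Hubbell": 1.0, "Whitewater_controversy": 0.22}
e2 = {"Webster_Hubbell": 0.9, "United_States_v._Hubbell": 0.38}
sim = cosine(e1, e2)   # high: both entity docs match the same top article
```

Each of the seven similarity features would apply this to article or category vectors truncated to different lengths.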

10 Entity Document & Tags
Entity document ABC19980430.1830.0091.LDC2000T44-E2: Webb Hubbell, PER, Individual
NAM mentions: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell"
NOM mentions: "Mr." "friend" "income"
PRO mentions: "he" "him" "his"
Context words: abc's accountant after again ago all alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictment inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years
Wikitology article tag vector:
Webster_Hubbell 1.000
Hubbell_Trading_Post_National_Historic_Site 0.379
United_States_v._Hubbell 0.377
Hubbell_Center 0.226
Whitewater_controversy 0.222
Wikitology category tag vector:
Clinton_administration_controversies 0.204
American_political_scandals 0.204
Living_people 0.201
1949_births 0.167
People_from_Arkansas 0.167
Arkansas_politicians 0.167
American_tax_evaders 0.167
Arkansas_lawyers 0.167

11 Wikitology-derived features
Seven features measured entity similarity using cosine similarity of article or category vectors of various lengths
Five features measured entity dissimilarity:
– two PER entities match different Wikitology persons
– two entities match Wikitology tags in a disambiguation set
– two ORG entities match different Wikitology organizations
– two PER entities match different Wikitology persons, weighted by 1-abs(score1-score2)
– two ORG entities match different Wikitology organizations, weighted by 1-abs(score1-score2)

12 Challenges
Wikitology tagging is expensive:
– ~2 seconds/document on a single processor
– took ~24 hrs on a cluster for 150K entity docs
– a spreading-activation algorithm on the underlying graphs improves accuracy at even more cost
Exploiting the RDF metadata and data and the underlying graphs requires reasoning and graph processing
Extracting entities from Wikipedia text to find more relations requires more graph processing

13 Next Steps
Construct a Web-based API and demo system to facilitate experimentation
Process Wikitology updates in real time
Exploit machine learning to classify pages and improve performance
Make better use of the cluster using Hadoop, etc.
Exploit Cell processor technology for spreading activation and other graph-based algorithms
– e.g., recognize people by the graph of relations they are part of

14 Spreading Activation
Spreading activation is a network-based algorithm inspired by brain models
Associative retrieval finds relevant documents associated with documents that a user considers relevant
The documents can be represented as nodes and their associations as links in a network

15–18 Spreading activation example
[four slides stepping a six-node example graph through the update a(t) = W · a(t-1), starting from an initial activation vector a0; the weight-matrix and activation-vector figures did not survive extraction]

19 SA as matrix multiplication
Good news: SA is matrix multiplication
– Model the graph as an n x n matrix W where Wij is the strength of the connection from node i to node j
– Vector A of length n, where Ai is node i’s activation
– A(t) = W*A(t-1)
Bad news: n is huge
– 140K category nodes and 4.2M edges
– 2.9M articles and 50M edges
Good news: the matrices are sparse
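One iteration A(t) = W*A(t-1) with a sparse W can be sketched as follows. The toy graph, node IDs, and weights here are invented for illustration and are not the deck's example:

```python
def spread(W, a, steps=1):
    """Spreading activation as repeated sparse matrix-vector products a <- W @ a.
    W is stored sparsely as {to_node: {from_node: weight}}; a as {node: activation}.
    Only edges that exist are touched, which is what makes sparsity pay off."""
    for _ in range(steps):
        a = {i: sum(w * a.get(j, 0.0) for j, w in row.items())
             for i, row in W.items()}
    return a

# toy graph: node 1 keeps its own activation and spreads damped copies to 2 and 3
W = {1: {1: 1.0}, 2: {1: 0.5}, 3: {1: 0.8}}
a0 = {1: 1.0}
a1 = spread(W, a0)   # activation flows from node 1 to nodes 2 and 3
```

With Wikipedia-scale n (millions of nodes, tens of millions of edges) the same computation would of course use a real sparse-matrix library and a parallel machine, as the following slides discuss.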

20 Overview of Cell Architecture
A traditional “control” CPU, the PowerPC Processing Element (PPE), surrounded by a set of (6 or 8) Synergistic Processing Elements (SPEs)
The PPE is intended for control and implements the full PowerPC instruction set
SPEs are specialized CPUs intended for fast computation
A large amount of bandwidth is available between the PPE, the SPEs, and main memory
SPEs cannot access main memory directly; each operates out of a 256 KB local store
Each SPE has a memory flow controller that it can use to fetch outside data into its store

21 Sparse Matrix Vector Multiplication
Exploiting parallelism for sparse matrix-vector multiplication (SPMV) poses several challenges:
– high storage overhead
– indirect and irregular memory access patterns
– how to parallelize
– load balancing

22 Sparse Matrix Representation
Compressed Sparse Row format (CSR) is a simple storage format:
Values: the non-zero values in the matrix, row by row
Columns: the column index of each non-zero value
Pointer B: the index (into Values) of the first non-zero value in each row
Pointer E: the index (into Values) one past the last non-zero value in each row
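The four CSR arrays can be illustrated on a small matrix, along with the SPMV kernel they enable. A minimal sketch; the 3x3 matrix is a made-up example:

```python
# Toy 3x3 matrix:
#   [[10, 0, 2],
#    [ 0, 3, 0],
#    [ 0, 0, 7]]
values   = [10, 2, 3, 7]   # non-zero values, row by row
columns  = [0, 2, 1, 2]    # column index of each non-zero value
pointerB = [0, 2, 3]       # index in `values` of each row's first non-zero
pointerE = [2, 3, 4]       # index one past each row's last non-zero

def csr_matvec(values, columns, pointerB, pointerE, x):
    """Multiply a CSR matrix by a dense vector x: one pass over the non-zeros."""
    y = []
    for b, e in zip(pointerB, pointerE):
        y.append(sum(values[k] * x[columns[k]] for k in range(b, e)))
    return y

y = csr_matvec(values, columns, pointerB, pointerE, [1, 1, 1])
# multiplying by the all-ones vector yields the row sums: [12, 3, 7]
```

The `columns[k]` indirection in the inner loop is the “indirect and irregular memory access” challenge named on the previous slide: `x` is gathered in a data-dependent order.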

23 Thread Level Parallelism
Partition matrix rows among processors
Statically load-balance SPMV by distributing the non-zero values approximately equally among processors/threads

24 Heuristic Load Balancing
Sort rows in decreasing order of number of non-zeros
Assign rows to processes/threads iteratively:
– assign row #1 to process 0
– assign each subsequent row to the process with the smallest total number of non-zeros
This guarantees that the maximum difference in the number of non-zero values between any two processes/threads is at most the largest number of non-zeros in a row
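The greedy heuristic above (sort descending, always feed the lightest process) can be sketched as follows. The row counts and process count are invented for illustration:

```python
import heapq

def balance_rows(row_nnz, n_procs):
    """Assign rows to processes greedily: sort rows by non-zero count
    (descending), then give each row to the currently lightest process."""
    heap = [(0, p) for p in range(n_procs)]          # (total nnz so far, process id)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(n_procs)}
    for row, nnz in sorted(enumerate(row_nnz), key=lambda r: -r[1]):
        total, p = heapq.heappop(heap)               # lightest process so far
        assignment[p].append(row)
        heapq.heappush(heap, (total + nnz, p))
    return assignment

# toy example: six rows with these non-zero counts, balanced over 2 processes
assignment = balance_rows([9, 1, 4, 7, 3, 2], 2)
# both processes end up with 13 non-zeros: {0: [0, 4, 1], 1: [3, 2, 5]}
```

The min-heap makes “the process with the smallest total” an O(log p) lookup, and the guarantee on the slide follows because each assignment can overshoot the lightest load by at most one row's worth of non-zeros.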

25 Conclusion
Our initial experiments showed that the Wikitology idea has merit
Wikipedia is increasingly being used as a knowledge source of choice
The approach is easily extendable to other wikis and collaborative KBs, e.g., Intellipedia
Serious use requires exploiting cluster machines and Cell processing
Key processing of the associated graph data can exploit the Cell architecture


