8//2808 Wikitology Wikipedia as an Ontology Tim Finin, Zareen Syed and Anupam Joshi University of Maryland, Baltimore County


Motivation: Identifying the topics and concepts associated with text or text entities is a task common to many applications: annotation and categorization of documents; modelling user interests; business intelligence; selecting advertisements; improving information retrieval; better named entity extraction and disambiguation.

What’s a document about? Two common approaches: (1) select words and phrases that characterize the document using TF-IDF; (2) map the document to a list of terms from a controlled vocabulary or ontology. Approach (1) is flexible and doesn’t require creating and maintaining an ontology. Approach (2) can connect documents to a rich knowledge base.
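As an illustration of approach (1), the following is a minimal TF-IDF sketch. The function name `tfidf_top_terms`, the whitespace tokenizer, the unsmoothed `log(N / df)` weighting, and the three toy sentences are all assumptions made for this example; the slides do not say which TF-IDF variant was used.

```python
import math
from collections import Counter


def tfidf_top_terms(docs, doc_index, k=3):
    """Return the k highest-scoring TF-IDF terms for docs[doc_index].

    Hypothetical helper: tokenization is a plain lowercase split and
    IDF is log(N / df) without smoothing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    target = tokenized[doc_index]
    tf = Counter(target)
    scores = {
        term: (count / len(target)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


docs = [
    "the jury indicted the lawyer",
    "the president met the press",
    "the lawyer advised the president",
]
print(tfidf_top_terms(docs, 0, k=2))
```

Terms that occur in every document (here "the") get an IDF of zero, so only words that distinguish the document from the rest of the collection are selected.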

Wikitology! Using Wikipedia as an ontology offers the best of both approaches: each article (~4M) is a concept in the ontology; terms are linked via Wikipedia’s category system (~200k) and inter-article links; there is lots of structured and semi-structured data. It’s a consensus ontology created and maintained by a diverse community. It has broad coverage, is multilingual, and is very current. Overall content quality is high.

Wikitology features: Terms have unique IDs (URLs) and are “self describing” for people. The underlying graphs provide structure: categories, article links, disambiguation. Article history contains useful metadata (e.g., for trust and provenance). External sources provide more info (e.g., Google’s PageRank). Articles are annotated with structured data: RDF from DBpedia and Freebase.

Constructing the Wikitology KB (diagram labels): WordNet; Yago; human input and editing; databases; Freebase KB; RDF and OWL statements.

ACE 2008: ACE 2008 is a NIST-sponsored exercise in entity extraction from text. The focus is on resolving entities across documents, e.g., “Dr. Rice” mentioned in one document is the same as “Secretary of State” in another. K documents in English and Arabic. We participated on a team from the JHU Human Language Technology Center of Excellence. (Pipeline diagram labels: Documents, NLP, FEAT, ML, clust, KB entities.)

ACE 2008: BBN’s Serif system produces text annotated with named entities (people or organizations), e.g., Dr. Rice, Ms. Rice, the secretary, she, secretary Rice. Featurizers score pairs of entities for co-reference, e.g., (CNN E32, AFP E19, ). A machine learning system combines the evidence. A simple clustering algorithm identifies clusters. (Pipeline diagram labels: Documents, NLP, FEAT, ML, clust, KB entities.)

Wikitology tagging: Using Serif’s output, we produced an entity document for each entity. It included the entity’s name, nominal and pronominal mentions, APF type and subtype, and words in a window around the mentions. We tagged entity documents using Wikitology, producing vectors of (1) terms and (2) categories for the entity. We used the vectors to compute features measuring entity pair similarity/dissimilarity.

Entity Document & Tags

Entity document:
ABC LDC2000T44-E2 Webb Hubbell PER Individual
NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell"
NAM: "Mr. " "friend" "income"
PRO: "he" "him" "his"
abc's accountant after again ago all alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictment inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years

Wikitology article tag vector: Webster_Hubbell Hubbell_Trading_Post National Historic Site United_States_v._Hubbell Hubbell_Center Whitewater_controversy

Wikitology category tag vector: Clinton_administration_controversies American_political_scandals Living_people _births People_from_Arkansas Arkansas_politicians American_tax_evaders Arkansas_lawyers 0.167

Wikitology derived features: Seven features measured entity similarity using cosine similarity of various length article or category vectors. Five features measured entity dissimilarity: (1) two PER entities match different Wikitology persons; (2) two entities match Wikitology tags in a disambiguation set; (3) two ORG entities match different Wikitology organizations; (4) two PER entities match different Wikitology persons, weighted by 1-abs(score1-score2); (5) two ORG entities match different Wikitology orgs, weighted by 1-abs(score1-score2).
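The two kinds of features above can be sketched as follows. This is only an illustration: the sparse dict representation, the function names `cosine` and `weighted_mismatch`, the tag weights, and the way the 1-abs(score1-score2) weight is applied to the top-ranked tags are assumptions, not details given in the slides.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse tag vectors given as {tag: weight}."""
    dot = sum(weight * v[tag] for tag, weight in u.items() if tag in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def weighted_mismatch(tag1, score1, tag2, score2):
    """Dissimilarity evidence when two entities' best tags differ.

    Assumed reading of the slide: if the best-matching Wikitology
    entries differ, the feature fires with weight 1 - |score1 - score2|.
    """
    if tag1 == tag2:
        return 0.0
    return 1 - abs(score1 - score2)


# Hypothetical article tag vectors for two entity documents.
entity_a = {"Webster_Hubbell": 1.0, "Whitewater_controversy": 0.5}
entity_b = {"Webster_Hubbell": 0.8, "United_States_v._Hubbell": 0.4}
print(cosine(entity_a, entity_b))
print(weighted_mismatch("Webster_Hubbell", 0.9, "Hubbell_Center", 0.7))
```

Storing only the non-zero tags keeps each comparison proportional to the number of tags per entity rather than to the size of Wikipedia.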

Challenges: Wikitology tagging is expensive: ~2 seconds/document on a single processor; it took ~24 hrs on a cluster for 150K entity docs; a spreading activation algorithm on the underlying graphs improves accuracy at even more cost. Exploiting the RDF metadata and data and the underlying graphs requires reasoning and graph processing. Extracting entities from Wiki text to find more relations requires more graph processing.

Next Steps: Construct a Web-based API and demo system to facilitate experimentation. Process Wikitology updates in real time. Exploit machine learning to classify pages and improve performance. Make better use of the cluster using Hadoop, etc. Exploit Cell processor technology for spreading activation and other graph-based algorithms, e.g., recognize people by the graph of relations they are part of.

Spreading Activation: Spreading activation is a network-based algorithm inspired by brain models. Associative retrieval finds relevant documents associated with documents that a user considers relevant. The documents can be represented as nodes and their associations as links in a network.

Spreading activation example (figure): $a_t = W a_{t-1}$, shown on a weight matrix W with "from" and "to" axes, for the steps $a_0 \to a_1$ and $a_1 \to a_2$.

SA as matrix multiplication. Good news: SA is matrix multiplication. Model the graph as an n x n matrix W where Wij is the strength of the connection from node i to j; vector A has length n, where Ai is node i’s activation; A(t) = W*A(t-1). Bad news: n is huge: 140K category nodes and 4.2M edges; 2.9M articles and 50M edges. Good news: the matrices are sparse. 1/9/2007
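A minimal sketch of one spreading-activation step over a sparse graph follows. It assumes activation flows along each edge from its source node to its target node, with no decay, thresholding, or normalization; the function name `spread`, the nested-dict edge format, and the toy graph are hypothetical. Only the non-zero weights are stored, which is what makes the computation feasible when n is in the millions.

```python
def spread(edges, activation, steps=1):
    """Propagate activation along weighted edges for a number of steps.

    edges:      {source: {target: weight}}, only non-zero weights stored
    activation: {node: value}, only active nodes stored
    Each step is a sparse matrix-vector product: every target receives
    the sum of weight * activation over its incoming edges.
    """
    current = dict(activation)
    for _ in range(steps):
        following = {}
        for source, value in current.items():
            for target, weight in edges.get(source, {}).items():
                following[target] = following.get(target, 0.0) + weight * value
        current = following
    return current


# Toy graph: A links to B and C, B links to C.
edges = {"A": {"B": 0.5, "C": 0.5}, "B": {"C": 1.0}}
print(spread(edges, {"A": 1.0}, steps=1))
print(spread(edges, {"A": 1.0}, steps=2))
```

The cost of a step is proportional to the number of edges leaving currently active nodes, not to n squared.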

Overview of Cell Architecture: A traditional “control” CPU called the PowerPC Processing Element (PPE) is surrounded by a series (6 or 8) of Synergistic Processing Elements (SPEs). The PPE is intended for control and contains a full PowerPC instruction set. SPEs are specialized CPUs intended for fast computations. A large amount of bandwidth is available between the PPE, the SPEs and main memory. SPEs cannot access main memory directly, but operate off of a local store which is 256 KB. Each SPE has a memory flow controller that it can use to fetch outside data into its store.

Sparse Matrix Vector Multiplication: Exploiting parallelism for sparse matrix-vector multiplication (SPMV) has several challenges: high storage overhead; indirect and irregular memory access patterns; how to parallelize; load balancing.

Sparse Matrix Representation: Compressed Sparse Row format (CSR) is a simple storage format. Values: the non-zero values in the matrix, stored row by row. Columns: the column indices of the non-zero values. Pointer B: for each row, the position in Values of the row's first non-zero value. Pointer E: for each row, the position in Values where the row's non-zero values end.
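The four CSR arrays can be illustrated with a short sketch. It assumes 0-based indexing and that Pointer E holds the position one past a row's last non-zero, which is one common convention; the helper names `to_csr` and `csr_matvec` and the 3 x 3 example matrix are made up for this example.

```python
def to_csr(dense):
    """Convert a dense list-of-lists matrix to four-array CSR."""
    values, columns, pointer_b, pointer_e = [], [], [], []
    for row in dense:
        pointer_b.append(len(values))  # where this row starts in values
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                columns.append(j)
        pointer_e.append(len(values))  # one past this row's last entry
    return values, columns, pointer_b, pointer_e


def csr_matvec(values, columns, pointer_b, pointer_e, x):
    """Compute y = M * x using only the stored non-zero entries."""
    return [
        sum(values[k] * x[columns[k]] for k in range(start, end))
        for start, end in zip(pointer_b, pointer_e)
    ]


dense = [
    [1, 0, 2],
    [0, 0, 3],
    [4, 5, 0],
]
values, columns, pointer_b, pointer_e = to_csr(dense)
print(values, columns, pointer_b, pointer_e)
print(csr_matvec(values, columns, pointer_b, pointer_e, [1, 2, 3]))
```

The lookup `x[columns[k]]` is the indirect, irregular memory access that makes SPMV hard to run efficiently out of a small local store.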

Thread Level Parallelism: Partition the matrix rows among processors. Statically load balance SPMV by distributing the non-zero values approximately equally among processors/threads.

Heuristic Load Balancing: Sort rows in decreasing order of number of non-zeros. Assign rows to processes/threads iteratively: assign row #1 to process 0; assign each subsequent row to the process with the smallest total number of non-zeros. This guarantees that the maximum difference in the number of non-zero values between any two processes/threads will be at most the largest number of non-zeros in a row.
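The heuristic described above can be sketched with a min-heap keyed on each process's running total. The function name `balance_rows` and the example row counts are hypothetical; ties between equally loaded processes are broken by the lower process number, which is an assumption of this sketch.

```python
import heapq


def balance_rows(row_nnz, n_procs):
    """Greedily assign rows to processes by non-zero count.

    row_nnz: number of non-zeros in each row, indexed by row number
    Returns (assignment, loads): the rows given to each process and
    each process's total number of non-zeros.
    """
    # Heap of (current load, process id); the least-loaded process is on top.
    heap = [(0, p) for p in range(n_procs)]
    assignment = [[] for _ in range(n_procs)]
    loads = [0] * n_procs
    # Visit rows from most to fewest non-zeros.
    order = sorted(range(len(row_nnz)), key=lambda r: -row_nnz[r])
    for row in order:
        load, proc = heapq.heappop(heap)
        assignment[proc].append(row)
        loads[proc] = load + row_nnz[row]
        heapq.heappush(heap, (loads[proc], proc))
    return assignment, loads


assignment, loads = balance_rows([7, 5, 4, 3, 1], n_procs=2)
print(assignment, loads)
```

Because every row is handed to the currently least-loaded process, no process can get ahead of another by more than the size of a single row, which is the bound stated on the slide.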

Conclusion: Our initial experiments showed that the Wikitology idea has merit. Wikipedia is increasingly being used as a knowledge source of choice. The approach is easily extendable to other wikis and collaborative KBs, e.g., Intellipedia. Serious use requires exploiting cluster machines and cell processing. Key processing of associated graph data can exploit the Cell architecture.