Toward Entity Retrieval over Structured and Text Data
Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai
Department of Computer Science, University of Illinois, Urbana-Champaign
Presentation at the ACM SIGIR 2004 Workshop on Information Retrieval and Databases, July 29, 2004

Slide 2: Motivation
Management of textual data and structured data is currently separated, yet a user is often interested in finding information from both databases and text collections. E.g.:
– Course information may be stored in a database; course web sites are mostly in text
– Product information may be stored in a database; product reviews are in text
How do we find information from databases and text collections in an integrative way?

Slide 3: Entity Retrieval (ER) over Structured and Text Data
Problem definition:
– Given collections of structured and text data
– Given some known information about a real-world entity
– Find more information about the entity
Example:
– Data = DBLP (bibliographic database) + Web (text)
– Entity = a researcher
– Known information = the researcher's name and/or a paper published by the researcher
– Goal = find all papers in DBLP and all web pages mentioning this researcher

Slide 4: Entity Retrieval vs. Traditional Retrieval
What's new about ER?
ER vs. database search:
– ER requires semantic-level matching
– DB search matches information at the syntactic level
ER vs. text search:
– ER represents a special category of information need, one that is more objectively defined

Slide 5: Challenges in ER
ER requires semantic-level matching:
– Both DB search and text search generally match at the syntactic level
– E.g., name = "John Smith" would return all records matching the name in DB search
– E.g., query = "John Smith" would return documents matching one or both words
– But "John Smith" could refer to multiple real-world entities: the same name may denote different entities
A single entity name may also appear in different syntactic forms in a DB and in a text collection:
– E.g., "John Smith" -> "J. Smith" (illustrated in the sketch below)
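To make the syntactic-variant problem concrete, the following minimal Python sketch (not from the slides) generates a few common name variants and checks whether a document mentions any of them. The helper names and the variant rules are illustrative assumptions, not the authors' method.

```python
import re

def name_variants(full_name):
    """Generate simple syntactic variants of a person name, e.g. "John Smith" -> "J. Smith".
    Illustrative only; the slides do not prescribe a specific variant generator."""
    parts = full_name.split()
    variants = {full_name}
    if len(parts) >= 2:
        first, last = parts[0], parts[-1]
        variants.add(f"{first[0]}. {last}")    # abbreviated first name
        variants.add(f"{last}, {first}")       # inverted order
        variants.add(f"{last}, {first[0]}.")   # inverted + abbreviated
    return variants

def mentions_entity(text, full_name):
    """True if the text contains any syntactic variant of the name."""
    return any(re.search(re.escape(v), text, re.IGNORECASE) for v in name_variants(full_name))

print(mentions_entity("Papers by J. Smith on entity retrieval", "John Smith"))  # True
```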

Slide 6: Definition of a Simplified ER Problem
Data:
– A relational table T with attributes A = {A1, A2, …, Ak}
– A document set D
Query Q = (q, R, C, T):
– q = a text query
– R = {r1, r2, …, rm}, examples of relevant documents, each ri ∈ D
– C = {c1 = v1, c2 = v2, …, cn = vn}, constraints on attributes, each ci ∈ A
– T = {t1, t2, …, tl}, target attributes, each ti ∈ A
Results: values of the target attributes t1, t2, …, tl together with the relevant documents
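As an illustration of the query structure, here is one possible in-memory representation of Q = (q, R, C, T) as a Python dataclass. The class and field names are hypothetical, since the slides define the four components only abstractly; the example instance anticipates the "John Smith" query on the next slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ERQuery:
    """One way to represent the simplified ER query Q = (q, R, C, T)."""
    q: str                                           # free-text query, e.g. the entity name
    R: List[str] = field(default_factory=list)       # example relevant documents (text)
    C: Dict[str, str] = field(default_factory=dict)  # attribute constraints, ci = vi
    T: List[str] = field(default_factory=list)       # target attributes to retrieve

# The running example from the next slide, expressed in this representation:
query = ERQuery(
    q='John Smith',
    R=['<home page text of John Smith>'],
    C={'author': 'John Smith', 'paper.conference': 'SIGIR'},
    T=['paper.title', 'paper.conference'],
)
```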

Slide 7: Finding All Information about "John Smith"
Data: the DBLP bibliographic database + the Web
Query Q = (q, R, C, T):
– q = "John Smith"
– R = home page of "John Smith"
– C = {author = "John Smith", paper.conference = SIGIR}
– T = {paper.title, paper.conference}
Results: matching DBLP records (author, title, conference, date, …) and relevant web pages
"John Smith" is highly ambiguous!

Slide 8: ER Strategies
Separate ER on the DB and on text:
– Given Q = (q, R, C, T), use Q1 = (q, R) to search the text collection and Q2 = (C, T) to search the DB
– The main challenge is entity disambiguation
Integrative ER on DB + text:
– Q = (q, R, C, T): use Q to search both the text collection and the DB
– Relevant information in the DB can help improve search over text (the hypothesis tested in this work)
– Relevant information in text can help improve search over the DB

Slide 9: Exploit Structured Information to Improve ER on Text
Given an ER query Q = (q, R, C, T) and a basic text search engine, we may exploit structured information to construct a different text query Qi (see the sketch after this slide):
– The structured query QS = (C, T) is sent to the DB search, which returns structured values s1, …, sF
– Attribute selection reduces these to a subset s1', …, sF'
– The constructed text query Qi is sent to the text search engine; its text results form the ER results
Methods for constructing Qi:
– Method 1: Text Only (baseline): Q1 = QT = (q, R)
– Method 2: Add Immediate Structure: Q2 = (q + s1, R)
– Method 3: Add All Structures: Q3 = (q + s1 + … + sF, R)
– Method 4: Add Selective Structures: Q4 = (q + s1' + … + sF', R)
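The four query-construction strategies amount to concatenating different subsets of the DB-returned values onto the text query. A minimal sketch, assuming plain string concatenation; the slides do not specify how the added values are weighted, and the function and method names here are illustrative.

```python
def build_text_query(q, structured_values, method, selected_values=None):
    """Construct the expanded text query for the four strategies on slide 9.
    structured_values = [s1, ..., sF] returned by the DB search with QS = (C, T);
    selected_values = [s1', ..., sF'] chosen by attribute selection."""
    if method == 'text_only':        # Method 1: Q1 = (q, R)
        return q
    if method == 'add_immediate':    # Method 2: Q2 = (q + s1, R)
        return q + ' ' + structured_values[0]
    if method == 'add_all':          # Method 3: Q3 = (q + s1 + ... + sF, R)
        return q + ' ' + ' '.join(structured_values)
    if method == 'add_selective':    # Method 4: Q4 = (q + s1' + ... + sF', R)
        return q + ' ' + ' '.join(selected_values or [])
    raise ValueError(f'unknown method: {method}')

db_values = ['Toward Entity Retrieval', 'SIGIR', 'A. Coauthor']  # hypothetical DB results
print(build_text_query('John Smith', db_values, 'add_all'))
```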

Slide 10: Attribute Selection Method
Assumption: an attribute is more useful if its values occur more frequently in the top text documents returned by the baseline TextOnly method.
Attribute selection procedure (sketched below):
– Use the top 25% of the documents returned by TextOnly as the reference document set
– Score each attribute by the average frequency of its attribute values in the reference document set
– Select the attribute with the highest score to expand the query
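A small Python sketch of this attribute-selection procedure, under the assumptions stated on the slide. Case-insensitive substring counting is used here as a stand-in for whatever frequency counting the authors actually used, and the function name and sample data are hypothetical.

```python
from collections import defaultdict

def select_attribute(db_tuples, reference_docs):
    """Score each attribute by the average frequency of its values in the reference
    documents (the top 25% of the TextOnly results) and return the best attribute."""
    scores = defaultdict(float)
    counts = defaultdict(int)
    docs = [d.lower() for d in reference_docs]
    for record in db_tuples:                      # each record: {attribute: value}
        for attr, value in record.items():
            freq = sum(doc.count(str(value).lower()) for doc in docs)
            scores[attr] += freq
            counts[attr] += 1
    avg = {attr: scores[attr] / counts[attr] for attr in scores}
    return max(avg, key=avg.get), avg

db_tuples = [
    {'title': 'Toward Entity Retrieval', 'conference': 'SIGIR', 'coauthor': 'A. Doan'},
    {'title': 'Active Feedback in Ad Hoc IR', 'conference': 'SIGIR', 'coauthor': 'X. Shen'},
]
reference_docs = ['John Smith presented at SIGIR ...', 'His SIGIR paper on entity retrieval ...']
best, scores = select_attribute(db_tuples, reference_docs)
print(best, scores)
```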

Slide 11: Experiments
Setup:
– ER queries: 11 researchers, q = the researcher's name (no relevant text document examples)
– DB = DBLP (> 460,000 articles)
– Text collection = the top 100 web pages returned by Google for the names of the 11 researchers
Measures:
– Precision: the percentage of retrieved pages that are relevant
– Recall: the percentage of relevant pages that are retrieved
– F1: the harmonic mean of precision and recall
Retrieval method:
– Vector space model with BM25 TF weighting
– Scores normalized by the score of the top-ranked document
– A fixed score threshold is used to retrieve a subset of the top 100 pages returned by Google
– Implemented in the Lemur toolkit
ER on the DB: the DBLP search engine on the Web, with manual selection of relevant tuples
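For reference, a minimal sketch of the evaluation measures and the score-threshold retrieval step described above. The threshold constant and the helper names are assumptions; the slides say only that the threshold is held constant, not what its value is.

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall, and F1 as defined on slide 11."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def threshold_retrieve(scored_pages, threshold=0.5):
    """Normalize scores by the top-ranked score and keep pages above a fixed threshold."""
    if not scored_pages:
        return []
    top = max(score for _, score in scored_pages)
    return [page for page, score in scored_pages if score / top >= threshold]

pages = [('p1', 12.0), ('p2', 9.0), ('p3', 2.0)]  # hypothetical (page, retrieval score) pairs
kept = threshold_retrieve(pages, threshold=0.5)
print(kept, precision_recall_f1(kept, relevant={'p1', 'p3'}))
```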

Slide 12: Effect of Exploiting Structured Information
(Results figure.) F1 improves as more structured information is exploited.

Slide 13: Effect of Attribute Selection
(Results figure.) Conference is a better attribute than co-authors or titles.

Slide 14: Automatic Attribute Selection
(Results figure.) The attribute score based on value frequency predicts the usefulness of an attribute well.

Slide 15: Conclusions
– We address the problem of finding information from databases and text collections in an integrative way
– We introduced the entity retrieval problem and proposed several methods that exploit structured information to improve ER on text
– Preliminary experimental results show that exploiting relevant structured information can improve ER performance on text

Slide 16: Many Further Research Questions
– What is an appropriate query language for ER?
– What is an appropriate formal retrieval framework for ER?
– What are the best strategies and methods for ER?
– …

Thank You!