An Efficient Algorithm for Incremental Update of Concept Space


An Efficient Algorithm for Incremental Update of Concept Space
Presented by Felix Cheung

Overview
- Background
- Introduction to Concept Space
- The Problem of Concept Space
- The Idea of the Solution
- Performance Evaluation
- Conclusion

Background
- The vocabulary problem: searches fail because people use a variety of terms for the same concept, such as HIV vs. AIDS
- Two people choose the same word for a concept less than 20% of the time
- One solution: thesauri

Thesauri
- A thesaurus is a book of words grouped according to the connections between their meanings
- It addresses the vocabulary problem: if a search retrieves too few documents, the user can expand the query
- The problem with thesauri: manual construction is very complex

Introduction to Concept Space
- A concept space is an automatic approach to thesaurus construction
- Given terms j and k, a concept space holds the associations Wjk and Wkj
- The associations are asymmetric: Wjk and Wkj generally differ
- An association is a value between 0 and 1

Concept Space Construction
- The construction of a concept space consists of two phases
- Phase 1, automatic indexing: the document collection is processed to build inverted lists

Inverted Lists
[Figure: inverted lists for terms a and b; each entry holds a document id (doc. id) and a term frequency (tf)]
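To make the indexing phase concrete, here is a minimal sketch of building inverted lists in which each posting is a (doc. id, tf) pair. The function name `build_inverted_lists`, the toy documents, and the assumption that documents arrive already tokenized are illustrative and do not come from the slides.

```python
from collections import Counter, defaultdict

def build_inverted_lists(docs):
    """Map each term to a list of (doc_id, tf) postings.

    `docs` maps a document id to its list of already tokenized terms
    (tokenization, stop-word removal and stemming are assumed done).
    """
    inverted = defaultdict(list)
    for doc_id in sorted(docs):
        # Counter gives the term frequency of every term in this document.
        for term, tf in sorted(Counter(docs[doc_id]).items()):
            inverted[term].append((doc_id, tf))
    return dict(inverted)

# Hypothetical toy collection with three documents.
docs = {
    1: ["a", "b", "a"],
    2: ["b", "c"],
    3: ["a", "c", "c"],
}
inverted = build_inverted_lists(docs)
print(inverted["a"])  # [(1, 2), (3, 1)]
```

Postings are kept in document-id order, which makes it cheap to intersect two lists when looking for documents that contain both terms.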

Concept Space Construction
- The construction of a concept space consists of two phases
- Phase 2, co-occurrence analysis: the association of every term pair is computed from an equation
[Equation not preserved in the transcript]

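The Sum of TF-IDF Scores
- Computes the sum of all TF-IDF scores of keyword j over all the documents
[Equation not preserved in the transcript]
- The equation is defined in terms of:
  - the term frequency of j in document i
  - the number of documents containing j
  - the number of documents in the database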

Weighting Factor
- The weighting factor is used to penalize general terms

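The Sum of Co-occurrence TF-IDF Scores
- Computes the sum of all co-occurrence TF-IDF scores of keywords j and k over all the documents
[Equation not preserved in the transcript]
- The equation is defined in terms of:
  - the number of documents containing both j and k
  - min(tfij, tfik)
  - the number of documents in the database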
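The equations on these slides did not survive extraction. As a hedged reconstruction, the following form is consistent with the quantities the slides list and with the formulation commonly used in the concept-space literature. The symbol names df_j, df_jk, df_k (document frequencies) and N (number of documents in the database) are assumptions, and the original slides may have used a variant.

```latex
\begin{align*}
W_{jk} &= \frac{\sum_{i=1}^{N} d_{ijk}}{\sum_{i=1}^{N} d_{ij}} \times \mathrm{WeightingFactor}(k) \\
d_{ij} &= tf_{ij} \times \log\frac{N}{df_j} \\
d_{ijk} &= \min(tf_{ij},\, tf_{ik}) \times \log\frac{N}{df_{jk}} \\
\mathrm{WeightingFactor}(k) &= \frac{\log (N / df_k)}{\log N}
\end{align*}
```

Under this reading, a term k that appears in almost every document has a weighting factor close to 0, which is how general terms are penalized.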

A Complete Concept Space
- A complete concept space is gigantic
- Each term may have a few thousand related terms, which would overwhelm searchers
- Only highly related terms are suggested

Highly Related Terms
- There are 1,708,551 co-occurrence pairs
- The maximum number of related terms per term is 100
- If a term has more than 100 related terms, only the 100 with the highest association values are retained (the strong associations)
- A concept space that contains only these highly ranked associations is called a partial concept space
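As an illustration of how a partial concept space could be derived from inverted lists, the sketch below computes an association for every co-occurring term pair and keeps only the n strongest per term. The weighting formulas (TF-IDF sums with log(N/df) and a weighting factor log(N/df_k)/log N) follow the common concept-space formulation and are an assumption here, as are the function name `concept_space` and the toy data.

```python
import math

def concept_space(inverted, num_docs, n):
    """Return {j: [(k, W_jk), ...]} with at most n strongest associations per term.

    `inverted` maps a term to its (doc_id, tf) postings; `num_docs` (> 1) is
    the number of documents in the database. Formulas are assumed, see lead-in.
    """
    tf = {term: dict(postings) for term, postings in inverted.items()}
    space = {}
    for j, docs_j in tf.items():
        # Sum of TF-IDF scores of j over all documents containing j.
        sum_dij = sum(docs_j.values()) * math.log(num_docs / len(docs_j))
        weights = []
        for k, docs_k in tf.items():
            shared = docs_j.keys() & docs_k.keys()
            if k == j or not shared or sum_dij == 0:
                continue
            # Sum of co-occurrence TF-IDF scores, using min(tf_ij, tf_ik).
            sum_dijk = sum(min(docs_j[i], docs_k[i]) for i in shared)
            sum_dijk *= math.log(num_docs / len(shared))
            # Weighting factor that penalizes general terms k.
            factor = math.log(num_docs / len(docs_k)) / math.log(num_docs)
            weights.append((k, sum_dijk / sum_dij * factor))
        # Strongest first; ties broken alphabetically for determinism.
        weights.sort(key=lambda kw: (-kw[1], kw[0]))
        space[j] = weights[:n]
    return space

# Hypothetical toy index over 4 documents.
toy_index = {
    "aids": [(1, 2), (2, 1)],
    "hiv": [(1, 1), (2, 1), (3, 1)],
    "virus": [(3, 2), (4, 1)],
}
space = concept_space(toy_index, num_docs=4, n=1)
w_aids_hiv = space["aids"][0][1]
w_hiv_aids = space["hiv"][0][1]
print(w_aids_hiv != w_hiv_aids)  # True: the associations are asymmetric
```

The nested loop compares every pair of terms, which is only practical for a toy index; it is meant to show what is computed, not how to compute it at scale.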

The Problem of Concept Space
- In a dynamic environment the collection changes with time, so the concept space must be updated
- The simplest approach is to reconstruct it from scratch
- Disadvantage: reconstruction is time consuming
- Goal: study the incremental update problem for partial concept spaces

The Definition
- A set of documents is added to a document collection D, giving a new document collection D’
- The concept space CSD constructed from D is updated to the concept space CSD’ of D’
- Only the n strong associations are kept

The Idea of the Pruning Algorithm
- Avoid scanning the inverted lists directly
- Calculate an easily computed upper bound of W’jk
- Compare the upper bound with a threshold for term j
- Property of the threshold: if the upper bound falls below the threshold of j, W’jk cannot be a strong association

The upper bound

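How to Determine the Threshold of Term j
- Compute the n associations W’jki (1 ≤ i ≤ n) for which Wjki is strong with respect to the document collection D
- Set the threshold of j to min(W’jki)
- Given a term p, if the threshold of j is greater than the upper bound of W’jp, then W’jp is smaller than all n values W’jki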

Pruning Algorithm
For each term j:
- Compute the association W’jki with respect to D’ for every Wjki that is strong with respect to D
- Determine the threshold of j among these n associations
- Compute the upper bound of W’jp for every Wjp that is weak with respect to D
- Compute W’jp only if its upper bound is not below the threshold of j
- Keep only the n largest associations of j
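The steps above can be sketched for a single term j as follows. The slides do not preserve the upper-bound formula, so the sketch takes both the exact computation and the upper bound as caller-supplied functions; `update_term`, `exact`, `upper_bound` and the sample numbers are hypothetical names and values used only for illustration.

```python
def update_term(j, old_strong, candidates, exact, upper_bound, n):
    """Recompute the n strongest associations of term j after an update.

    old_strong  -- terms k whose W_jk was strong w.r.t. the old collection D
    candidates  -- all terms that may be associated with j in D'
    exact       -- exact(j, k) returns W'_jk (expensive: scans inverted lists)
    upper_bound -- upper_bound(j, k) returns a cheap value >= W'_jk
    n           -- number of strong associations to keep (n >= 1)
    """
    # Step 1: exact new values for the associations that were strong in D.
    new = {k: exact(j, k) for k in old_strong}
    # Step 2: the threshold is the smallest of these n values. With fewer
    # than n of them nothing can be pruned safely, so fall back to 0.
    threshold = min(new.values()) if new and len(new) >= n else 0.0
    # Steps 3-4: for formerly weak associations, pay for the exact value
    # only when the cheap upper bound does not rule the term out.
    for p in candidates:
        if p == j or p in new:
            continue
        if upper_bound(j, p) >= threshold:
            new[p] = exact(j, p)
    # Step 5: keep only the n largest associations of j.
    ranked = sorted(new.items(), key=lambda kw: (-kw[1], kw[0]))
    return ranked[:n]

# Hypothetical values: true W'_ak and upper bounds for the weak terms d, e.
true_w = {"b": 0.9, "c": 0.5, "d": 0.7, "e": 0.1}
bounds = {"d": 0.8, "e": 0.2}
exact_calls = []

def exact(j, k):
    exact_calls.append(k)  # record which exact computations were needed
    return true_w[k]

def upper_bound(j, k):
    return bounds[k]

result = update_term("a", ["b", "c"], ["b", "c", "d", "e"], exact, upper_bound, n=2)
print(result)       # [('b', 0.9), ('d', 0.7)]
print(exact_calls)  # ['b', 'c', 'd']  (term e was pruned)
```

In this toy run the threshold is 0.5, so term e (upper bound 0.2) is discarded without an exact computation, while term d (upper bound 0.8) is computed and displaces c.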

Quantization
- The amount of storage required is very large
- High precision is not needed
- Quantization techniques can be applied to reduce the storage requirement
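The slides do not say which quantization technique is used. As one possible illustration, a uniform quantizer can store each association, a value between 0 and 1, as a small integer code. The 8-bit width and the function names below are assumptions.

```python
def quantize(weight, bits=8):
    """Map an association in [0, 1] to an integer code in [0, 2**bits - 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("association must lie between 0 and 1")
    levels = (1 << bits) - 1
    return round(weight * levels)

def dequantize(code, bits=8):
    """Recover an approximate association from its integer code."""
    levels = (1 << bits) - 1
    return code / levels

code = quantize(0.8035)
approx = dequantize(code)
# The round-trip error is at most half a quantization step.
print(abs(approx - 0.8035) <= 0.5 / 255)  # True
```

With 8 bits each association fits in one byte instead of a 4- or 8-byte floating-point value, at the cost of an error of at most about 0.002.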

Performance Evaluation
- The Ohsumed Test Collection is used
- 348,566 abstracts with 240,247 terms
- 169 MB in size (after stop-word removal and stemming)
- The algorithm is run on a 700 MHz Pentium III Xeon machine

Experiment I
- Half of the documents are picked as the original collection D
- The other half of the documents are partitioned into 10 equal parts
- These parts are added to D successively and cumulatively
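The experimental setup can be sketched as below. How the documents were picked is not stated on the slide, so taking the first half in order is an assumption made for determinism; `split_for_experiment` is a hypothetical helper, and the document ids are placeholders for the 348,566 abstracts.

```python
def split_for_experiment(doc_ids, parts=10):
    """Return (original, batches): half of the ids, and the rest in `parts` chunks."""
    half = len(doc_ids) // 2
    original, rest = doc_ids[:half], doc_ids[half:]
    # Spread any remainder over the first chunks so sizes differ by at most 1.
    base, extra = divmod(len(rest), parts)
    batches, start = [], 0
    for p in range(parts):
        end = start + base + (1 if p < extra else 0)
        batches.append(rest[start:end])
        start = end
    return original, batches

doc_ids = list(range(348_566))  # placeholder ids, one per abstract
original, batches = split_for_experiment(doc_ids)

collection = list(original)
for batch in batches:
    # At this point the concept space of `collection` would be updated
    # incrementally with `batch`; the update itself is not shown here.
    collection.extend(batch)

print(len(original), len(batches), len(collection))  # 174283 10 348566
```

Each part holds roughly 17,400 documents, so the cumulative additions range from about 17,400 up to about 174,000 documents.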

Experiment I Result (I)

Experiment I Result (II)

Experiment I Result (III)

Experiment I Result (IV)

Experiment I Result (V)

Experiment II
- Another factor that affects performance is the size of the added document set
- The size of the added document set is varied from 17,400 to 174,000

Experiment II Result

Storage requirement

Conclusion
- The concept space approach is a very useful tool for information retrieval
- Construction and incremental update are very time consuming
- In many applications, only a partial concept space is needed
- To reduce the storage requirement, some quantization methods are proposed

Conclusion (Cont’d)
- The pruning algorithms are effective in avoiding the computation of weak associations
- A 9-fold speedup can be achieved