Dataware’s Document Clustering and Query-By-Example Toolkits
John Munson, Dataware Technologies
1999 BRS User Group Conference



Document Clustering
 Automatically creates clusters of similar documents
 General benefit: provides an overview of the range of topics in a set
 Multiple specific uses
– Familiarization with database before searching
– Familiarization with a result set after searching
– Assistance in category definition for other uses
- Category tree construction
- FAQ construction

Dataware’s Clustering Toolkit
 One API function
 Source of documents is a BRS result set
– which could be backref 0 for the entire database
– Can specify certain fields for analysis
 Output indicates member documents for each cluster
 Application can specify number and max/min size of clusters, etc.
 US PTO (Patent and Trademark Office) plans to do category tree construction

How It Works
 Extracts keywords from each document
– using our keyword-generation library
- which is also in the 6.3 keyword generation load filter
 Repeats these steps:
– Compare document and cluster pairs using the keyword lists
- How many keywords do two lists share, and how similar are their weights?
– Combine the most similar pair into one cluster
 Stops when n clusters remain (n is configurable)
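The merge loop above can be sketched as a greedy agglomerative clustering. This is a minimal illustration, not Dataware's actual implementation: each document is a keyword-to-weight dict, and the most similar pair of clusters is combined until n remain.

```python
# Sketch of the merge loop: documents are keyword->weight dicts, and the
# most similar pair of clusters is merged until only n clusters remain.
import itertools
import math

def similarity(a, b):
    """Cosine similarity over the keywords two lists share."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def merge(a, b):
    """Combine two keyword lists by summing weights."""
    out = dict(a)
    for k, w in b.items():
        out[k] = out.get(k, 0.0) + w
    return out

def cluster(keyword_lists, n):
    """Greedy agglomerative clustering down to n clusters.
    Returns (member_doc_indexes, combined_keywords) pairs."""
    clusters = [([i], dict(kw)) for i, kw in enumerate(keyword_lists)]
    while len(clusters) > n:
        # Compare all cluster pairs via their keyword lists...
        i, j = max(itertools.combinations(range(len(clusters)), 2),
                   key=lambda p: similarity(clusters[p[0]][1],
                                            clusters[p[1]][1]))
        ids_j, kw_j = clusters.pop(j)   # j > i, so pop j first
        ids_i, kw_i = clusters[i]
        # ...and combine the most similar pair into one cluster.
        clusters[i] = (ids_i + ids_j, merge(kw_i, kw_j))
    return clusters
```

Comparing every pair on every pass is quadratic in the number of documents, which is exactly why the speed tricks later in the deck matter.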

How It Works
 Output is a list of clusters, including:
– a cluster quality score
- Measures how cohesive the cluster is
– a ranked list of keywords describing the cluster
– a ranked list of member documents
- Highest-ranked docs are the most “central”

Speed Tricks
 Speed is a big issue in clustering
– especially for interactive searching
– Keyword extraction takes time
– Pairwise comparisons don’t scale up well at all
– Thus, we use a couple of speed tricks
- One trick for database design
- One trick inside the clustering function
 Trick 1: Pre-generate keywords
– Use the BRS 6.3 keyword generation load filter
– The filter produces a keyword paragraph that looks like this...

Speed Tricks
..Keywords: compartment (187.80). mass (156.56). methylhistidine (118.12)....
 At clustering time, we don’t need to do keyword analysis
– Just retrieve keyword lists from the engine
– Cuts execution time in half
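Reading a pre-generated paragraph back into a keyword list is straightforward. A possible parser for the format shown above (the regex is an assumption about the paragraph syntax, not a BRS specification):

```python
# Parse a pre-generated keyword paragraph like the one above into a
# keyword -> weight dict (the regex is an assumption about the format).
import re

def parse_keywords(paragraph):
    body = paragraph.split(":", 1)[1]
    return {m.group(1): float(m.group(2))
            for m in re.finditer(r"([\w-]+)\s*\(([\d.]+)\)", body)}
```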

Speed Tricks
 Trick 2: Cluster a sample of the set (Cutting et al.)
– Create the desired number of clusters from a small sample
– Then compare the remaining documents only to those few clusters, not to all other documents
– Saves a huge amount of execution time
 Another trick for result-set clustering:
– Cluster only the top-ranked 100 to 1000 docs
 A final speed note: CPU speed helps a lot
– Clustering is very processor-intensive
- 2x CPU speed gives almost 2x clustering speed
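The sampling trick can be sketched as follows. This is an illustration in the spirit of Cutting et al.'s approach, not Dataware's code; `cluster_fn` and `similarity` are caller-supplied stand-ins.

```python
# Sketch of the sampling trick: build n clusters from a small sample,
# then attach each remaining document to its nearest cluster.
# cluster_fn and similarity are caller-supplied; names are illustrative.
def sample_cluster(docs, n, sample_size, cluster_fn, similarity):
    sample, rest = docs[:sample_size], docs[sample_size:]
    clusters = cluster_fn(sample, n)   # the expensive step runs on few docs
    for doc in rest:
        # One comparison per cluster rather than per document:
        # roughly O(N*n) work instead of O(N^2).
        best = max(clusters, key=lambda c: similarity(c, doc))
        best.append(doc)
    return clusters
```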

Query-By-Example (QBE)
 Allows an example passage or document to serve as a query
 Useful when we already have some text or a document about our topic
– “Find more like this”
– No query formulation required
– QBE analyzes the text, then constructs and executes a query

Dataware’s QBE Toolkit
 One API function
 Source of example text can be:
– a text buffer
- e.g. text selected with the mouse
– a BRS document (or documents) from a result set
- e.g. selected from a title list
- Can specify certain fields for analysis
– a word list with weights or occurrence counts
 Output is a standard ranked document list

How It Works
 Extracts keywords from the example text
– using... all together now... our keyword-generation library, yet again
 Keyword selection process likes words that:
– occur frequently in the example text
– are rare in the database as a whole
 Getting database statistics can be done:
– using field qualification
- most accurate but slow
– using no qualification
- still good, much faster
– not at all: just use occurrence counts in the example text
- fastest, but trickier
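The selection rule above (frequent in the example, rare in the database) is essentially TF-IDF weighting. A minimal sketch, assuming per-word document frequencies and the collection size come from engine statistics; `doc_freq` and `total_docs` are hypothetical stand-ins for those statistics:

```python
# TF-IDF-style keyword selection: favor words that are frequent in the
# example text but rare in the database (doc_freq is a hypothetical
# stand-in for per-word document frequencies from the engine).
import math
from collections import Counter

def select_keywords(example_text, doc_freq, total_docs, top_n=10):
    counts = Counter(example_text.lower().split())
    weights = {word: tf * math.log(total_docs / (1 + doc_freq.get(word, 0)))
               for word, tf in counts.items()}
    return sorted(weights.items(), key=lambda kv: -kv[1])[:top_n]
```

The three statistics options in the slide trade accuracy of `doc_freq` for speed; the "not at all" option amounts to dropping the logarithm term and ranking on raw occurrence counts alone.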

How It Works
 Performs a ranked search using the keywords and their weights
 Flexible fielding:
– Analysis of example document(s) can use one set of BRS paragraphs
– Search can use a different set
 Speed trick:
– Generate a keyword field for the database (load filter)
– Field-level index it
– Use it for QBE searches
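One simple way to turn the weighted keyword list into a ranked result, shown here only as a sketch (not BRS's actual ranking algorithm): score each document by the summed weights of the query keywords it contains, then sort by score.

```python
# Rank documents by the summed weights of matched query keywords
# (illustrative only; not the BRS ranking algorithm).
def rank(doc_terms, query_weights):
    """doc_terms: {doc_id: set of terms}; query_weights: {term: weight}."""
    scores = {doc_id: sum(w for term, w in query_weights.items()
                          if term in terms)
              for doc_id, terms in doc_terms.items()}
    return sorted(scores, key=scores.get, reverse=True)
```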

That’s all, folks!