December 16, 2003 INEX 2003 -- Ray R. Larson Cheshire II at INEX 2003: Component and Algorithm Fusion Ray R. Larson School of Information Management and.

Presentation transcript:

December 16, 2003 INEX Ray R. Larson Cheshire II at INEX 2003: Component and Algorithm Fusion Ray R. Larson School of Information Management and Systems University of California, Berkeley

Overview
- Cheshire II feature overview
  – Logistic Regression Ranking and Boolean Operations
- Additions from INEX ‘02
  – XML Schemas and Element Retrieval
  – CORI, Okapi BM-25 ranking algorithms
  – Result Set sorting, merging and ranking operators
- Evaluation Results

Overview of Cheshire II
- It supports SGML and XML with components and component indexes
- It is a client/server application
- Uses the Z39.50 Information Retrieval Protocol; support for SRW, OAI, SOAP and SDLIP is also implemented
- Server supports a Relational Database Gateway
- Supports Boolean searching of all servers
- Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
- Search engine supports "nearest neighbor" searches and relevance feedback
- GUI interface on X Window displays and Windows NT
- WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
- Scriptable clients using Tcl and (new) Python
- Store SGML/XML as files or “Datastore” database

XML Element Extraction
- A new search “ElementSetName” is XML_ELEMENT_
- Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request
- The matching elements are extracted from the records matching the search and delivered in a simple format.

XML Extraction
% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML }} { Singularitâes áa Cargáese … etc…
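The extraction step in the session above can be pictured with a short sketch. It is not Cheshire's code: `extract_elements`, the sample record and its field contents are hypothetical, and only element names and the relative XPath subset supported by Python's standard `xml.etree.ElementTree` are handled (regular expressions are left out).

```python
import xml.etree.ElementTree as ET

PREFIX = "XML_ELEMENT_"

def extract_elements(record_xml, element_set_name):
    """Return the serialized elements selected by an XML_ELEMENT_<spec> name."""
    if not element_set_name.startswith(PREFIX):
        raise ValueError("not an XML_ELEMENT_ element set name")
    # Everything after the final underscore of the prefix is the selector.
    spec = element_set_name[len(PREFIX):]
    root = ET.fromstring(record_xml)
    # A bare name is treated as "find this element anywhere below the root".
    path = spec if spec.startswith(".") else ".//" + spec
    return [ET.tostring(el, encoding="unicode").strip() for el in root.findall(path)]

# Hypothetical record, loosely modelled on the Fld245 example above.
record = "<USMARC><Fld100>Author</Fld100><Fld245><a>Some title</a></Fld245></USMARC>"
print(extract_elements(record, "XML_ELEMENT_Fld245"))
```

Only the matching elements are returned, not the whole record, which is the point of the element set name.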

Boolean Search Capability
- All Boolean operations are supported
  – “zfind author x and (title y or subject z) not subject A”
- Named sets are supported and stored on the server
- Boolean operations between stored sets are supported
  – “zfind SET1 and subject widgets or SET2”
- Nested parentheses and truncation are supported
  – “zfind xtitle Alice#”
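As a rough illustration of what these queries compute, the sketch below evaluates the first two examples with Python sets over a toy inverted index. The index contents, `lookup` and `store` are made up for illustration, and the grouping of "and"/"or" in the second query is an assumption, since the slides do not state operator precedence.

```python
# Toy inverted index: (index name, term) -> set of record ids. All data is made up.
INDEX = {
    ("author", "x"): {1, 2, 3, 4},
    ("title", "y"): {2, 5},
    ("subject", "z"): {3, 4},
    ("subject", "a"): {4},
    ("subject", "widgets"): {1, 4},
}

# Named result sets, kept between searches as a server would keep them.
STORED_SETS = {}

def lookup(index, term):
    """Return a copy of the posting set for a term (empty if unknown)."""
    return set(INDEX.get((index, term.lower()), set()))

def store(name, result):
    """Save a result set under a name so later queries can reuse it."""
    STORED_SETS[name] = set(result)
    return STORED_SETS[name]

# author x and (title y or subject z) not subject A
hits = (lookup("author", "x") & (lookup("title", "y") | lookup("subject", "z"))) - lookup("subject", "A")
store("SET1", hits)
store("SET2", lookup("title", "y"))

# SET1 and subject widgets or SET2, grouped here as (SET1 and widgets) or SET2
combined = (STORED_SETS["SET1"] & lookup("subject", "widgets")) | STORED_SETS["SET2"]
print(sorted(hits), sorted(combined))
```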

Probabilistic Retrieval
- Uses Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with new algorithm for weight calculation at retrieval time
- Z39.50 “relevance” operator used to indicate probabilistic search
- Any index can have Probabilistic searching performed:
  – zfind “cheshire cats, looking glasses, march hares and other such things”
  – zfind caucus races
- Boolean and Probabilistic elements can be combined:
  – zfind government documents and title guidebooks

Probabilistic Retrieval: Logistic Regression
- Probability of relevance is based on Logistic regression from a sample set of documents (TREC) to determine values of the coefficients.
- At retrieval the probability estimate is obtained from the 6 X attribute measures shown on the next slide.
- Note that we did NOT retrain the coefficients this year.

Probabilistic Retrieval: Logistic Regression attributes
- Average Absolute Query Frequency
- Query Length
- Average Absolute Component Frequency
- Document Length
- Average Inverse Component Frequency
- Inverse Component Frequency
- Number of Terms in common between query and Component -- logged
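A minimal sketch of how a logistic regression estimate of this kind is computed, assuming the usual form: the log odds of relevance are an intercept plus a weighted sum of the attribute values, and the logistic function turns the log odds into a probability. The coefficients and attribute values below are placeholders, not the trained TREC values, which the slides do not give.

```python
import math

def log_odds(attributes, coefficients, intercept):
    """Linear combination of attribute values: intercept + sum(c_i * X_i)."""
    if len(attributes) != len(coefficients):
        raise ValueError("need one coefficient per attribute")
    return intercept + sum(c * x for c, x in zip(coefficients, attributes))

def probability_of_relevance(attributes, coefficients, intercept):
    """Convert the log odds into a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds(attributes, coefficients, intercept)))

# Placeholder values for six attributes X1..X6 of one query/component pair.
x = [0.7, 1.4, 1.1, 12.0, 2.3, 0.69]
# Hypothetical coefficients; a real system would fit these on assessed data.
c = [0.5, -0.1, 0.2, -0.05, 0.3, 0.4]
p = probability_of_relevance(x, c, intercept=-3.0)
print(round(p, 4))
```

Components are then ranked by this estimated probability, highest first.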

Combining Boolean and Probabilistic Search Elements
- Two original approaches:
  – Boolean Approach
  – Non-probabilistic “Fusion Search”
- Set merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries
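One way to picture a weighted merger of Boolean and probabilistic results is sketched below. The function name, the score of 1.0 assigned to Boolean matches and the default weight are all assumptions made for illustration; the slides do not give the weighting actually used.

```python
def weighted_merge(boolean_hits, prob_scores, boolean_weight=0.5):
    """Merge a Boolean result set with probabilistic scores into one ranked list.

    boolean_hits: set of document ids that satisfied the Boolean query.
    prob_scores: dict mapping document id -> estimated probability of relevance.
    A Boolean match counts as 1.0 and a miss as 0.0 (an assumption of this sketch).
    """
    merged = {}
    for doc_id in set(boolean_hits) | set(prob_scores):
        b = 1.0 if doc_id in boolean_hits else 0.0
        p = prob_scores.get(doc_id, 0.0)
        merged[doc_id] = boolean_weight * b + (1.0 - boolean_weight) * p
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))

ranked = weighted_merge({"d1", "d2"}, {"d2": 0.8, "d3": 0.6})
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Documents found by both queries rise to the top, while documents found by only one of them are still retained.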

Ranking Methods added since INEX ‘02
- CORI -- From Jamie Callan: Simple implementation of a weighting scheme for distributed search. Very effective for distributed search collection selection. Not used for official INEX runs.
- OKAPI BM-25 -- From Steve Robertson. This now seems to be the “default” retrieval algorithm in experimental IR
- New operators (later) let us mix and match ranking methods and Boolean operations

Okapi BM25
Where:
- Q is a query containing terms T
- K is k1((1-b) + b.dl/avdl)
- k1, b and k3 are parameters, usually 1.2, 0.75 and …
- tf is the frequency of the term in a specific document
- qtf is the frequency of the term in a topic from which Q was derived
- dl and avdl are the document length and the average document length measured in some convenient unit
- w(1) is the Robertson-Sparck Jones weight.
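A minimal sketch of BM25 scoring using the symbols defined above, assuming the standard published form: the score sums, over the query terms T, w(1) times (k1+1)tf/(K+tf) times (k3+1)qtf/(k3+qtf). The function names, the toy statistics and the default `k3=7.0` are assumptions of this sketch, and w(1) is computed in its form without relevance information.

```python
import math

def rsj_weight(num_docs, doc_freq):
    """Robertson-Sparck Jones weight when no relevance information is available."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, doc_freqs, num_docs,
               k1=1.2, b=0.75, k3=7.0):
    """Score one document (or component) against one query.

    query_tf: term -> qtf, doc_tf: term -> tf, doc_freqs: term -> document frequency.
    k3=7.0 is a placeholder default chosen for this sketch.
    """
    # Length normalization factor K from the slide.
    K = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        w1 = rsj_weight(num_docs, doc_freqs[term])
        score += w1 * ((k1 + 1.0) * tf / (K + tf)) * ((k3 + 1.0) * qtf / (k3 + qtf))
    return score

# Toy statistics, made up for illustration.
freqs = {"internet": 10, "traffic": 50}
query = {"internet": 1, "traffic": 1}
short_doc = bm25_score(query, {"internet": 3, "traffic": 1}, 80, 100, freqs, 1000)
long_doc = bm25_score(query, {"internet": 3, "traffic": 1}, 400, 100, freqs, 1000)
print(short_doc > long_doc)
```

With identical term counts, the shorter document scores higher because K grows with dl/avdl.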

INEX ‘02 Fusion Search
[Diagram: Query, Results, Sort/Merge, Final Ranked List]
- Merge multiple resultsets and sort new set
  – Sort by index name/key (ATTRIBUTE)
  – Sort by rank (ELEMENTS)
- Merges ranked results and Boolean results
  – Sort by XML/SGML Tag contents (TAG)

Merging and Ranking Operators
- Extends the capabilities of merging to include merger operations in queries like Boolean operators
- Fuzzy Logic Operators (not used for INEX)
  – !FUZZY_AND
  – !FUZZY_OR
  – !FUZZY_NOT
- Containment operators: Restrict components to or with a particular parent
  – !RESTRICT_FROM
  – !RESTRICT_TO
- Merge Operators
  – !MERGE_SUM
  – !MERGE_MEAN
  – !MERGE_NORM
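The slides name these operators without defining them, so the sketch below is only one plausible reading: fuzzy AND/OR/NOT as the classic min, max and complement of fuzzy logic, and the merge operators as the sum, the mean, and the mean after min-max normalization of the scores of two result sets. Every function here is hypothetical, and Cheshire's actual definitions may differ.

```python
def fuzzy_and(a, b):
    """Minimum of the two scores; items missing from either side drop out."""
    return {k: min(a[k], b[k]) for k in a.keys() & b.keys()}

def fuzzy_or(a, b):
    """Maximum of the two scores over the union of items."""
    return {k: max(a.get(k, 0.0), b.get(k, 0.0)) for k in a.keys() | b.keys()}

def fuzzy_not(a, b):
    """Items of a, discounted by how strongly they match b: min(a, 1 - b)."""
    return {k: min(v, 1.0 - b.get(k, 0.0)) for k, v in a.items()}

def _combine(a, b, fn):
    # Items missing from one result set are treated as scoring 0.0 there.
    return {k: fn(a.get(k, 0.0), b.get(k, 0.0)) for k in a.keys() | b.keys()}

def merge_sum(a, b):
    return _combine(a, b, lambda x, y: x + y)

def merge_mean(a, b):
    return _combine(a, b, lambda x, y: (x + y) / 2.0)

def normalize(scores):
    """Min-max normalize scores to [0, 1]; a constant set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def merge_norm(a, b):
    return merge_mean(normalize(a), normalize(b))

# Toy result sets: a ranked search and a Boolean phrase match (score 1.0).
ranked = {"c1": 0.2, "c2": 0.6}
phrase = {"c2": 1.0, "c3": 1.0}
print(sorted(merge_mean(ranked, phrase).items()))
print(sorted(merge_norm(ranked, phrase).items()))
```

Normalizing before averaging keeps a result set with small raw scores from being swamped by one whose scores are all 1.0.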

Query Generation - CO
# 91 TITLE = Internet traffic
{Internet traffic internet, web, traffic, measurement, congestion}) !MERGE_NORM {Internet traffic}) !MERGE_NORM {Internet traffic}) !MERGE_NORM {Internet traffic internet, web, traffic, measurement, congestion}) !MERGE_NORM {Internet traffic}) !MERGE_NORM {Internet traffic})
TARGETPATH = XML_ELEMENT_article

INEX CO Runs

Query Generation - SCAS
#66 TITLE = /article[./fm//yr < '2000'] //sec[about(.,'"search engines"')]
((date < '2000')) !RESTRICT_FROM {"search engines"} !MERGE_MEAN (sec_words {$search engines$})))
TARGETPATH = XML_ELEMENT_sec

Query Generation -- SCAS
- This run uses Logistic regression matching combined with Boolean phrase matching and MERGE_MEAN partial result combinations.
- FUZZY_AND and FUZZY_OR operators were used in combining AND and OR elements within an "about" predicate.
- Containment operators were used to constrain component searches within ancestor elements.
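The idea of constraining component hits by an ancestor can be sketched as a simple filter. This is an illustration only: `restrict_to_parents`, the ids and the paths are hypothetical, and the sketch does not distinguish between !RESTRICT_FROM and !RESTRICT_TO, whose exact semantics the slides do not spell out.

```python
def restrict_to_parents(component_scores, parent_ids):
    """Keep only components whose parent document id is in parent_ids.

    component_scores maps (doc_id, component_path) -> score.
    """
    return {key: score for key, score in component_scores.items()
            if key[0] in parent_ids}

# Ranked section components (made-up ids, paths and scores).
sections = {
    ("a1", "/article/bdy/sec[1]"): 0.7,
    ("a2", "/article/bdy/sec[3]"): 0.9,
}
# Articles that satisfied the Boolean condition on the date, e.g. date < '2000'.
articles_before_2000 = {"a1"}
print(restrict_to_parents(sections, articles_before_2000))
```

Here the higher-scoring section from article a2 is dropped because its parent article failed the date condition.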

INEX SCAS Runs

Future Plans
- Bug fixes -- incorrect query generation for some SCAS queries, for example…
  – TITLE = //article[about(.,'security +biometrics') AND about(.//sec,'"facial recognition"')]
  – Submitted: {security biometrics} !MERGE_MEAN {biometrics biometrics biometrics biometrics}) ) !FUZZY_AND {"facial recognition"} !MERGE_MEAN (sec_title {$facial recognition$}))
  – Should have included sec_words and Boolean subquery for biometrics merged with ranked subquery

Future Plans
- Add Language Model ranking for components
- Retrain Logistic Regression coefficients on INEX assessment data -- and experiment with including new variables, such as relative component size
- Find bugs in Okapi BM-25
- Find more bugs ahead of time, and be more consistent in runs!