Cheshire II at INEX: Using A Hybrid Logistic Regression and Boolean Model for XML Retrieval (Ray R. Larson, December 9, 2002)

1 Cheshire II at INEX: Using A Hybrid Logistic Regression and Boolean Model for XML Retrieval
Ray R. Larson
School of Information Management and Systems
University of California, Berkeley
December 9, 2002

2 Overview of Cheshire II
It supports SGML and XML
It is a client/server application
Uses the Z39.50 Information Retrieval Protocol
Server supports a Relational Database Gateway
Supports Boolean searching of all servers
Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
Search engine supports "nearest neighbor" searches and relevance feedback
GUI interface on X window displays and Windows NT
WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
Scriptable clients using Tcl and (new) Python

3 SGML/XML Support
Underlying native format for all data is SGML or XML
The DTD defines the database contents
Full SGML/XML parsing
SGML/XML Format Configuration Files define the database location and indexes
Various format conversions and utilities available for Z39.50 support (MARC, GRS-1

4 SGML/XML Support
Configuration files for the Server are SGML/XML:
  –They include elements describing all of the data files and indexes for the database.
  –They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

5 Indexing
Any SGML/XML tagged field or attribute can be indexed:
  –B-Tree and Hash access via Berkeley DB (Sleepycat)
  –Stemming, keyword, exact keys and “special keys”
  –Mapping from any Z39.50 Attribute combination to a specific index
  –Underlying postings information includes term frequency for probabilistic searching
Component extraction with separate component indexes
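
The idea of postings that carry term frequency can be illustrated with a minimal Python sketch. It is not Cheshire II's Berkeley DB layout: the function name `build_index`, the in-memory dictionaries, and the whitespace tokenizer are assumptions made only for illustration.

```python
from collections import Counter, defaultdict

def build_index(records):
    """Map each term to {record_id: term_frequency}.

    `records` is a hypothetical dict of record id -> field text.
    A real system would stem terms and store postings on disk.
    """
    index = defaultdict(dict)
    for rec_id, text in records.items():
        # Count how often each lowercased token occurs in this record.
        for term, tf in Counter(text.lower().split()).items():
            index[term][rec_id] = tf
    return index

records = {
    1: "Alice in Wonderland",
    2: "Alice and the looking glass looking back",
}
index = build_index(records)
```

Keeping the frequency next to each record id is what lets a ranking function read term counts at query time without re-parsing the records.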

6 Boolean Search Capability
All Boolean operations are supported
  –“zfind author x and (title y or subject z) not subject A”
Named sets are supported and stored on the server
Boolean operations between stored sets are supported
  –“zfind SET1 and subject widgets or SET2”
Nested parentheses and truncation are supported
  –“zfind xtitle Alice#”
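
The first query on this slide can be read as plain set algebra over postings. The sketch below is a hedged illustration in Python, not the Cheshire II query evaluator: the `postings` table, the `find` helper, and the record ids are all hypothetical.

```python
# Hypothetical postings: (index name, term) -> set of matching record ids.
postings = {
    ("author", "x"): {1, 2, 3, 4},
    ("title", "y"): {1, 5},
    ("subject", "z"): {2, 3},
    ("subject", "a"): {3},
}

def find(index_name, term):
    """Return the record ids for one index/term pair (empty set if absent)."""
    return postings.get((index_name, term.lower()), set())

# author x and (title y or subject z) not subject A
result = (
    find("author", "x") & (find("title", "y") | find("subject", "z"))
) - find("subject", "A")

# A named set kept for later queries, then reused in another Boolean operation.
stored_sets = {"SET1": result}
narrowed = stored_sets["SET1"] & find("subject", "z")
```

With these sample postings, `result` holds records 1 and 2, and intersecting the stored set with `subject z` leaves record 2.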

7 Probabilistic Retrieval
Uses Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with new algorithm for weight calculation at retrieval time
Z39.50 “relevance” operator used to indicate probabilistic search
Any index can have Probabilistic searching performed:
  –zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
  –zfind title @ caucus races
Boolean and Probabilistic elements can be combined:
  –zfind topic @ government documents and title guidebooks

8 Probabilistic Retrieval: Logistic Regression
Probability of relevance is based on Logistic regression from a sample set of documents to determine values of the coefficients (TREC).
At retrieval the probability estimate is obtained by:
[equation not preserved in the transcript]
For the 6 X attribute measures shown on the next slide
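
For orientation, a logistic regression estimate over six attributes generally takes the form below. This is the standard textbook form written under assumed notation, not a recovery of the slide's own equation: the symbols R (relevance), Q (query), D (document) and the coefficients c_0 through c_6 are assumptions, and only the six X measures come from the slide.

```latex
\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i,
\qquad
P(R \mid Q, D) = \frac{e^{\log O(R \mid Q, D)}}{1 + e^{\log O(R \mid Q, D)}}
```

The first expression gives the log odds of relevance as a weighted sum of the attributes; the second converts those log odds into a probability between 0 and 1.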

9 Probabilistic Retrieval: Logistic Regression attributes
Average Absolute Query Frequency
Query Length
Average Absolute Document Frequency
Document Length
Average Inverse Document Frequency
Inverse Document Frequency
Number of Terms in common between query and document -- logged
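
A rough Python sketch of how such attributes could be computed for one query/document pair follows. Several points are assumptions rather than facts from the slide: seven labels survive in the transcript for six measures, and the sketch treats "Inverse Document Frequency" as the per-term quantity averaged in the fifth measure; the log and square-root transforms and the `log(N / n)` form of IDF are illustrative choices (the slide only says the count of common terms is logged); the coefficients are placeholders, not trained values.

```python
import math

def lr_features(query_tf, doc_tf, doc_len, n_docs, doc_freq):
    """Return six attribute values, or None if no terms are shared.

    query_tf / doc_tf: term -> frequency in the query / document.
    doc_freq: term -> number of documents containing the term.
    All transforms here are assumptions made for illustration.
    """
    common = [t for t in query_tf if t in doc_tf]
    m = len(common)
    if m == 0:
        return None
    query_len = sum(query_tf.values())
    x1 = sum(math.log(query_tf[t]) for t in common) / m  # avg. query frequency
    x2 = math.sqrt(query_len)                            # query length
    x3 = sum(math.log(doc_tf[t]) for t in common) / m    # avg. document frequency
    x4 = math.sqrt(doc_len)                              # document length
    x5 = sum(math.log(n_docs / doc_freq[t]) for t in common) / m  # avg. IDF
    x6 = math.log(m)                                     # common terms, logged
    return [x1, x2, x3, x4, x5, x6]

def probability(features, coeffs):
    """Logistic transform of c0 + sum(c_i * X_i); coeffs has 7 entries."""
    log_odds = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], features))
    return 1.0 / (1.0 + math.exp(-log_odds))

x = lr_features(
    query_tf={"cheshire": 1, "cat": 1},
    doc_tf={"cheshire": 2, "cat": 1, "grin": 1},
    doc_len=4,
    n_docs=10,
    doc_freq={"cheshire": 2, "cat": 5},
)
# Placeholder coefficients (all zero) give even odds; real values would
# come from regression on a training sample such as the TREC data.
p = probability(x, [0.0] * 7)
```

In practice the coefficients are fixed once from training data, so only the attribute values need to be computed at retrieval time.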

10 Combining Boolean and Probabilistic Search Elements
Two approaches:
  –Boolean Approach
  –Non-probabilistic “Fusion Search”
Set merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries
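
A weighted merger of scores from two separate result sets might look like the Python sketch below. The slide does not give the weighting formula, so everything here is hypothetical: the function name `fuse`, the single `weight` parameter, and the choice to score a Boolean match as 1.0 and a non-match as 0.0.

```python
def fuse(prob_scores, boolean_hits, weight=0.5):
    """Merge probabilistic scores with Boolean matches into one ranking.

    prob_scores: record id -> probability estimate from the ranked query.
    boolean_hits: set of record ids matched by the Boolean query.
    weight: share given to the probabilistic score (an assumed scheme).
    """
    merged = {}
    for rec_id in set(prob_scores) | set(boolean_hits):
        p = prob_scores.get(rec_id, 0.0)
        b = 1.0 if rec_id in boolean_hits else 0.0
        merged[rec_id] = weight * p + (1.0 - weight) * b
    # Highest combined score first; ties broken by record id.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))

ranked = fuse({1: 0.9, 2: 0.4, 3: 0.2}, {2, 4})
ranked_ids = [rec_id for rec_id, _ in ranked]
```

Unlike a strict Boolean filter, this kind of merger keeps records that match only one of the two queries and simply ranks them lower.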

11 Cheshire II Searching
[Diagram; labels: Z39.50, Internet, Images, Scanned Text, Local, Remote, Z39.50]

12 INEX Overview
[Diagram; labels: Local, Net, UI Or Scripts, Map Query, Map Results, Map Query, Map Results, INEX, Search Engine]

13 Fusion Search
[Diagram; labels: Query, Results, Sort/Merge, Final Ranked List]

14 Distributed Metadata Servers
[Diagram; labels: Replicated servers, Meta-Topical Servers, General Servers, Database Servers]

15 Further Information
Full Cheshire II client and server is open source and available for academic and government use: ftp://cheshire.berkeley.edu/pub/cheshire/
  –Includes HTML documentation
Project Web Site http://cheshire.berkeley.edu/
Archives Hub http://www.archiveshub.ac.uk/

