Cheshire II at INEX: Using a Hybrid Logistic Regression and Boolean Model for XML Retrieval (Ray R. Larson, December 9, 2002)

Presentation transcript:

Cheshire II at INEX: Using a Hybrid Logistic Regression and Boolean Model for XML Retrieval
Ray R. Larson
School of Information Management and Systems
University of California, Berkeley
December 9, 2002

Overview of Cheshire II
It supports SGML and XML
It is a client/server application
Uses the Z39.50 Information Retrieval Protocol
Server supports a Relational Database Gateway
Supports Boolean searching of all servers
Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
Search engine supports "nearest neighbor" searches and relevance feedback
GUI interface on X window displays and Windows NT
WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
Scriptable clients using Tcl and (new) Python

SGML/XML Support
Underlying native format for all data is SGML or XML
The DTD defines the database contents
Full SGML/XML parsing
SGML/XML Format Configuration Files define the database location and indexes
Various format conversions and utilities available for Z39.50 support (MARC, GRS-1)

SGML/XML Support
Configuration files for the Server are SGML/XML:
– They include elements describing all of the data files and indexes for the database.
– They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

Indexing
Any SGML/XML tagged field or attribute can be indexed:
– B-Tree and Hash access via Berkeley DB (Sleepycat)
– Stemming, keyword, exact keys and “special keys”
– Mapping from any Z39.50 Attribute combination to a specific index
– Underlying postings information includes term frequency for probabilistic searching
Component extraction with separate component indexes
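The idea of postings that carry term frequency can be illustrated with a minimal Python sketch. This is not Cheshire II code: plain dictionaries stand in for the Berkeley DB B-Tree and Hash files, there is no stemming, and the function name `build_postings` and the sample records are assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_postings(records):
    """Map each term to {record_id: term_frequency}.

    Hypothetical helper: a dict stands in for the on-disk index.
    """
    postings = defaultdict(dict)
    for rec_id, text in records.items():
        # Simple keyword extraction: lowercase alphanumeric tokens, no stemming.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings[term][rec_id] = postings[term].get(rec_id, 0) + 1
    return postings


# Made-up records standing in for the text of one indexed field.
records = {
    1: "Alice in Wonderland",
    2: "Through the Looking Glass: Alice and the looking glass",
}
postings = build_postings(records)
```

Keeping the per-record frequency next to each record identifier is what lets a ranking method use term frequency at retrieval time without re-reading the records.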

Boolean Search Capability
All Boolean operations are supported
– “zfind author x and (title y or subject z) not subject A”
Named sets are supported and stored on the server
Boolean operations between stored sets are supported
– “zfind SET1 and subject widgets or SET2”
Nested parentheses and truncation are supported
– “zfind xtitle Alice#”
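The Boolean operations above map naturally onto set algebra over record identifiers. The following Python sketch is an illustration only, not the Cheshire II implementation: the in-memory `index`, the `lookup` helper, and the record numbers are invented, and because the slide does not state zfind operator precedence, the grouping is written out with explicit parentheses.

```python
# Hypothetical index: (index name, term) -> set of matching record ids.
index = {
    ("author", "x"): {1, 2, 3},
    ("title", "y"): {2, 5},
    ("subject", "z"): {3, 4},
    ("subject", "a"): {3},
}


def lookup(field, term):
    """Return the set of record ids for a term in a named index."""
    return index.get((field, term.lower()), set())


# author x and (title y or subject z) not subject A
result = (
    lookup("author", "x") & (lookup("title", "y") | lookup("subject", "z"))
) - lookup("subject", "A")

# A named set kept for later reuse, as a server-side stored set would be.
named_sets = {"SET1": result}
combined = named_sets["SET1"] | lookup("title", "y")
```

Here `and` becomes intersection, `or` becomes union, and `not` becomes set difference, which is also how operations between stored sets can reuse earlier results.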

Probabilistic Retrieval
Uses Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time
Z39.50 “relevance” operator used to indicate probabilistic search
Any index can have probabilistic searching performed:
– zfind “cheshire cats, looking glasses, march hares and other such things”
– zfind caucus races
Boolean and probabilistic elements can be combined:
– zfind government documents and title guidebooks

Probabilistic Retrieval: Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC).
At retrieval the probability estimate is obtained by:
[equation not preserved in this transcript]
for the 6 X attribute measures shown on the next slide.
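The slide's equation is missing from this transcript. As a hedged reconstruction, the general logistic regression form consistent with the text (coefficients fitted on a sample set, six attribute measures) would look as follows. The symbols c_0 ... c_6 and X_1 ... X_6 are assumed notation, and no coefficient values are given in the source.

```latex
\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i
\qquad
P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}
```

Here R is relevance, Q the query, D the document, and the second expression converts the estimated log-odds into a probability.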

Probabilistic Retrieval: Logistic Regression attributes
Average Absolute Query Frequency
Query Length
Average Absolute Document Frequency
Document Length
Average Inverse Document Frequency
Inverse Document Frequency
Number of Terms in common between query and document -- logged
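A minimal Python sketch of how attribute values of this kind could be turned into a probability estimate with a fitted logistic regression model. The function name, the coefficient values, and the attribute values are all hypothetical; the transcript does not give the fitted coefficients or the exact formula for each attribute.

```python
import math


def relevance_probability(coefficients, attributes):
    """Estimate probability of relevance from a linear log-odds model.

    coefficients: intercept followed by one weight per attribute.
    attributes:   the attribute measures for one query/document pair.
    """
    intercept, *weights = coefficients
    if len(weights) != len(attributes):
        raise ValueError("need exactly one weight per attribute")
    log_odds = intercept + sum(c * x for c, x in zip(weights, attributes))
    # The logistic function maps log-odds onto the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up numbers, for illustration only.
example_coefficients = [-3.5, 1.2, -0.3, 0.4, -0.1, 0.2, 1.1]
example_attributes = [0.7, 2.0, 1.5, 10.0, 3.2, 1.4]
estimate = relevance_probability(example_coefficients, example_attributes)
```

Documents can then be ranked by this estimate in descending order.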

Combining Boolean and Probabilistic Search Elements
Two approaches:
– Boolean Approach
– Non-probabilistic “Fusion Search”: the set merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries
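A weighted merger of this kind can be sketched in Python as follows. The slide says only that scores from separate Boolean and probabilistic queries are merged with weights, so the details here are assumptions: Boolean matches are given a score of 1.0, the two weights sum to 1, and the function name `fusion_merge` is invented.

```python
def fusion_merge(boolean_hits, probabilistic_scores, boolean_weight=0.5):
    """Merge a Boolean result set with probabilistic scores into one ranking.

    boolean_hits:         set of document ids matched by the Boolean query.
    probabilistic_scores: {document id: probability estimate}.
    boolean_weight:       assumed weight of the Boolean component (0..1).
    """
    probabilistic_weight = 1.0 - boolean_weight
    merged = {}
    for doc_id in boolean_hits:
        # Assumption: a Boolean match contributes a fixed score of 1.0.
        merged[doc_id] = boolean_weight * 1.0
    for doc_id, score in probabilistic_scores.items():
        merged[doc_id] = merged.get(doc_id, 0.0) + probabilistic_weight * score
    # Highest combined score first; ties broken by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


ranked = fusion_merge({1, 2}, {2: 0.8, 3: 0.6})
```

In this sketch a document found by both queries rises above documents found by only one, which is the intended effect of merging the two result lists before the final sort.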

Cheshire II Searching
[Diagram; labels: Z39.50, Internet, Images, Scanned Text, Local, Remote]

INEX Overview
[Diagram; labels: Local, Net, UI Or Scripts, Map Query, Map Results, INEX Search Engine]

Fusion Search
[Diagram; labels: Query, Results, Sort/Merge, Final Ranked List]

Distributed Metadata Servers
[Diagram; labels: Replicated servers, Meta-Topical Servers, General Servers, Database Servers]

Further Information
Full Cheshire II client and server are open source and available for academic and government use: ftp://cheshire.berkeley.edu/pub/cheshire/
– Includes HTML documentation
Project Web Site
Archives Hub