Cheshire II: Features and Internals and Cheshire III overview

Ray R. Larson
School of Information Management and Systems
University of California, Berkeley
March 2, 2004

Overview
Cheshire II feature overview
  Logistic Regression Ranking, Okapi BM-25 and Boolean Operations
  Fusion Operators
Additions from INEX '03
  Element/Index level re-estimation of LR coefficients
Adhoc and Heterogeneous Track Methodology
Evaluation Results - Adhoc

Overview of Cheshire II
It supports SGML and XML with components and component indexes
It is a client/server application
It uses the Z39.50 Information Retrieval Protocol; support for SRW, OAI, SOAP, and SDLIP is also implemented
The server supports a Relational Database Gateway
Supports Boolean searching of all servers
Supports probabilistic ranked retrieval in the Cheshire search engine, as well as Boolean and proximity search
The search engine supports “nearest neighbor” searches and relevance feedback
GUI interface on X Window displays and Windows NT
WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
Scriptable clients using Tcl and Python
Stores SGML/XML as files or in a “Datastore” database

Cheshire II Searching
[Figure: diagram labels: Z39.50, Internet, Images, Scanned Text, Local, Remote]

INEX Overview
[Figure: diagram labels: UI or Scripts, Map Query, INEX Search Engine, Results, Map Results, Local, Net]

Boolean Search Capability
All Boolean operations are supported
  “zfind author x and (title y or subject z) not subject A”
Named sets are supported and stored on the server
Boolean operations between stored sets are supported
  “zfind SET1 and subject widgets or SET2”
Nested parentheses and truncation are supported
  “zfind xtitle Alice#”

Probabilistic Retrieval
Uses the Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen), with a new algorithm for weight calculation at retrieval time
The Z39.50 “relevance” operator is used to indicate a probabilistic search
Probabilistic searching can be performed on any index:
  zfind “cheshire cats, looking glasses, march hares and other such things”
  zfind caucus races
Boolean and Probabilistic elements can be combined:
  zfind government documents and title guidebooks

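Probabilistic Retrieval: Logistic Regression
The probability of relevance is based on logistic regression over a sample set of documents, which determines the values of the coefficients. At retrieval time the probability estimate is obtained by:

\[ \log O(R \mid Q, C) = b_0 + \sum_{i=1}^{6} b_i X_i, \qquad P(R \mid Q, C) = \frac{e^{\log O(R \mid Q, C)}}{1 + e^{\log O(R \mid Q, C)}} \]

where R is relevance, Q is the query, C is the component, and the X_i are the 6 attribute measures shown on the next slide.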

Probabilistic Retrieval: Logistic Regression attributes
X1: Average Absolute Query Frequency
X2: Query Length
X3: Average Absolute Component Frequency
X4: Document Length
X5: Average Inverse Component Frequency (derived from the Inverse Component Frequency of each term)
X6: Number of Terms in common between query and Component (logged)
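As a rough sketch of how such an estimate could be computed once the six attribute values are known, the Python below applies the standard logistic transformation to a linear combination of the attributes. The coefficients are the "Base" row from the New LR Coefficients slide later in this deck; the function name, the argument layout, and the idea of passing precomputed X values are illustrative assumptions, and the way each X is measured from the index is not shown.

```python
import math

# "Base" coefficients b0..b6, copied from the New LR Coefficients slide.
BASE_COEFFICIENTS = (-3.700, 1.269, -0.310, 0.679, -0.021, 0.223, 4.010)


def relevance_probability(x, coefficients=BASE_COEFFICIENTS):
    """Return the estimated probability of relevance for x = (X1, ..., X6).

    Hypothetical helper: it assumes the six attribute values have already
    been computed for one query/component pair.
    """
    if len(x) != 6:
        raise ValueError("expected six attribute values")
    b0, *b = coefficients
    # Log-odds of relevance: b0 + b1*X1 + ... + b6*X6.
    log_odds = b0 + sum(bi * xi for bi, xi in zip(b, x))
    # Logistic transformation of the log-odds into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Components would then be ranked by this value in descending order; swapping in another row of the coefficient table (for example the one for a specific index) only changes the `coefficients` argument.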

Combining Boolean and Probabilistic Search Elements
Two original approaches:
  Boolean Approach
  Non-probabilistic “Fusion Search”: the set merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries
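The set merger idea can be illustrated with a minimal sketch. The slide does not give the weights or say how Boolean matches are scored, so the code below assumes that each result set is a mapping from document ID to score, that a document missing from one set contributes 0 for that set, and that the two weights sum to 1. The function name and the default weight are hypothetical.

```python
def weighted_merge(boolean_scores, probabilistic_scores, boolean_weight=0.5):
    """Merge two {doc_id: score} mappings into one ranked list.

    Illustrative assumption: the merged score is a weighted sum, with the
    probabilistic set receiving the remaining weight (1 - boolean_weight).
    """
    prob_weight = 1.0 - boolean_weight
    merged = {}
    for doc_id in set(boolean_scores) | set(probabilistic_scores):
        merged[doc_id] = (boolean_weight * boolean_scores.get(doc_id, 0.0)
                          + prob_weight * probabilistic_scores.get(doc_id, 0.0))
    # Highest merged score first; ties are broken by document ID.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))
```

Under these assumptions a document found by both queries outranks one found by only one of them when the individual scores are comparable, which is the usual motivation for fusing result sets.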

Okapi BM25

\[ \sum_{T \in Q} w^{(1)} \, \frac{(k_1 + 1)\, tf}{K + tf} \, \frac{(k_3 + 1)\, qtf}{k_3 + qtf} \]

Where:
Q is a query containing terms T
K is k1((1-b) + b.dl/avdl)
k1, b and k3 are parameters; k1 and b are usually 1.2 and 0.75
tf is the frequency of the term in a specific document
qtf is the frequency of the term in a topic from which Q was derived
dl and avdl are the document length and the average document length measured in some convenient unit
w(1) is the Robertson-Sparck Jones weight
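The definitions above match the standard Okapi BM25 sum over query terms, which can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cheshire's implementation: the Robertson-Sparck Jones weight is taken in its common form without relevance information, log((N - n + 0.5) / (n + 0.5)), the function and argument names are hypothetical, and k3 is a required argument because the slide does not give a usable value for it.

```python
import math


def rsj_weight(n_docs, doc_freq):
    """Robertson-Sparck Jones weight without relevance information (assumed form)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, n_docs, doc_freqs,
               k3, k1=1.2, b=0.75):
    """Score one document against one query.

    query_tf and doc_tf map terms to frequencies (qtf and tf);
    doc_freqs maps each query term to the number of documents containing it.
    """
    # K normalises term frequency by document length.
    K = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        w1 = rsj_weight(n_docs, doc_freqs[term])
        score += w1 * ((k1 + 1.0) * tf / (K + tf)) * ((k3 + 1.0) * qtf / (k3 + qtf))
    return score
```

When the document has exactly average length, K reduces to k1, so a term occurring once in both query and document contributes exactly its w(1) weight.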

Merging and Ranking Operators
Extends the merging capabilities so that merge operations can be used in queries like Boolean operators
Fuzzy Logic Operators (not used for INEX)
  !FUZZY_AND
  !FUZZY_OR
  !FUZZY_NOT
Containment operators: restrict components to or with a particular parent
  !RESTRICT_FROM
  !RESTRICT_TO
Merge Operators
  !MERGE_SUM
  !MERGE_MEAN
  !MERGE_NORM
  !MERGE_CMBZ

INEX '04 Fusion Search
[Figure: diagram labels: Subquery (x4), Comp. Query Results (x2), Fusion/Merge, Final Ranked List]
Merge multiple ranked and Boolean index searches within each query, and multiple component search result sets
Major components merged are Articles, Body, Sections, subsections, paragraphs

New LR Coefficients

Index            b0       b1       b2       b3       b4       b5       b6
Base         -3.700    1.269   -0.310    0.679   -0.021    0.223    4.010
topic        -7.758    5.670   -3.427    1.787   -0.030    1.952    5.880
topicshort   -6.364    2.739   -1.443    1.228   -0.020    1.280    3.837
abstract     -5.892    2.318   -1.364    0.860   -0.013    1.052    3.600
alltitles    -5.243    2.319   -1.361    1.415   -0.037    1.180    3.696
sec_words    -6.392    2.125   -1.648    1.106   -0.075    1.174    3.632
para_words   -8.632    1.258   -1.654    1.485   -0.084    1.143    4.004

Estimates using INEX '03 relevance assessments for:
b1 = Average Absolute Query Frequency
b2 = Query Length
b3 = Average Absolute Component Frequency
b4 = Document Length
b5 = Average Inverse Component Frequency
b6 = Number of Terms in common between query and Component

SGML/XML Support
Underlying native format for all data is SGML or XML
The DTD defines the file format for each file
Full SGML/XML parsing
SGML/XML Format Configuration Files define the database
USMARC DTD and MARC to SGML conversion (and back again)
Access to full-text via special SGML/XML tags

Indexing
Any SGML/XML tagged field or attribute can be indexed:
  B-Tree and Hash access via Berkeley DB (Sleepycat)
  Stemming, keyword, exact keys and “special keys”
  Mapping from any Z39.50 Attribute combination to a specific index
  Underlying postings information includes term frequency for probabilistic searching
Component extraction with separate component indexes

XML Element Extraction
A new search “ElementSetName” is XML_ELEMENT_
Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request
The matching elements are extracted from the records matching the search and delivered in a simple format.

XML Extraction
% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML }} {
<RESULT_DATA DOCID="1">
<ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
<Fld245 AddEnty="No" NFChars="0"><a>Singularités à Cargèse</a></Fld245>
</ITEM>
<RESULT_DATA>
… etc…

SGML/XML Support
Configuration files for the Server are SGML/XML:
  They include elements describing all of the data files and indexes for the database.
  They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

SGML/XML Support
Example XML record for a DL document
<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0</BIB-VERSION>
<ID>756</ID>
<ENTRY>June 12, 1996</ENTRY>
<DATE>June 1996</DATE>
<TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE>
<ORGANIZATION>University of California</ORGANIZATION>
<TYPE>report</TYPE>
<AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
<AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
<PROJECT>SNEP</PROJECT>
<SERIES>Vol 3</SERIES>
<PAGES>40</PAGES>
<TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
<PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>

SGML Support Example SGML/MARC Record
<USMARC Material="BK" ID=" "><leader><LRL>00722</LRL><RecStat>n</RecStat> <RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount> <SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel> <DscCatFm></DscCatFm><LinkRec></LinkRec><EntryMap><FLength>4</Flength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader> <Directry> </Directry><VarFlds> <VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005> </Fld005> <Fld008> nyu eng u</Fld008></VarCFlds> <VarDFlds><NumbCode><Fld010 I1="Blank" I2="Blnk"><a> </a></Fld010> <Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm </a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode><MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty><Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a>theory and practice /<c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles><EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a>J. Wiley,<c>1983</c></Fld260></EdImprnt><PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a>ill. ;<c>24 cm</c></Fld300></PhysDesc><Series></Series><Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes><SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ... March 2, 2004 Ray R. Larson

SGML/XML Support
TREC document…
<DOC>
<DOCNO>FT </DOCNO>
<PROFILE>_AN-DCPCCAA3FT</PROFILE>
<DATE>930316 </DATE>
<HEADLINE> FT 16 MAR 93 / Italy's Corruption Scandal: Magistrates hold key to unlocking Tangentopoli - They will set the investigation agenda </HEADLINE>
<BYLINE> By ROBERT GRAHAM </BYLINE>
<TEXT> OVER the weekend the Italian media felt obliged to comment on a non-event. No new arrests had taken place in any of the country's ever more numerous corruption scandals which centre on the illicit funding of political parties ... </TEXT>
<XX>
…
Companies:-
</XX>
<CO>Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica. Ente Partecipazioni E Finanziamento Industria Manifatturiera. IRI Istituto per La Ricostruzione Industriale. </CO>
<XX>
Countries:-
<CN>ITZ Italy, EC. </CN>
Industries:-
<IN>P9222 Legal Counsel and Prosecution. P91 Executive, Legislative and General Government. P13 Oil and Gas Extraction. P9631 Regulation, Administration of Utilities. P6719 Holding Companies, NEC. </IN>
Types:-
</XX>
…
<TP>CMMT Comment & Analysis. GOVT Legal issues. </TP>
<PUB>The Financial Times </PUB>
<PAGE> London Page 4 </PAGE>
</DOC>

SGML/XML Support INEX Document <article>
<fno>C1050</fno> <doi> /C1050s-2000</doi> <fm> <hdr><hdr1><ti>COMPUTING IN SCIENCE & ENGINEERING</ti> <crt><issn> </issn>/00/$10.00 <cci><onm>© 2000 IEEE</onm></cci></crt></hdr1> <hdr2><obi><volno>Vol. 2</volno><issno>No. 1</issno></obi> <pdt><mo>JANUARY/FEBRUARY</mo><yr>2000</yr></pdt> <pp>pp </pp></hdr2> </hdr> <tig><atl>The Decompositional Approach to Matrix Computation</atl> <pn>pp </pn></tig> <au sequence="first"><fnm>G.W.</fnm><snm>Stewart</snm><aff><onm>University of Maryland</onm></aff></au> <fig><art file="c1050x1.gif" w="425" h="321" tw="150" th="113"/></fig> <abs>The introduction of matrix decomposition into numerical linear algebra revolutionized matrix computations. This article outlines the decompositional approach, comments on its history, and surveys the six most widely used decompositions. </abs> </fm> <bdy> <sec><st></st> <ip1>In 1951, Paul S. Dwyer published <it>Linear Computations</it>, perhaps the first book devoted entirely to numerical linear algebra.<ref rid="bibc10501" type="bib">1</ref> Digital computing was in its infancy, and Dwyer focused on computation with mechanical calculators. Nonetheless, the book was state of the art. <ref rid="c10501" type="fig">Figure 1</ref> reproduces a page of the book dealing with Gaussian elimination. In 1954, Alston S. Householder published <it>Principles of Numerical Analysis</it>,<ref rid="bibc10502" type="bib">2</ref> one of the first modern treatments of high-speed digital computation. <ref rid="c10502" type="fig">Figure 2</ref> reproduces a page from this book, also dealing with Gaussian elimination.</ip1> <fig id="c10501"><art file="c10501.gif" w="600" h="970" tw="150" th="243"/><no>1</no><fgc>This page from <it>Linear Computations</it> shows that Paul Dwyer's approach begins with a system of scalar equations. 
Courtesy of John Wiley & Sons.</fgc></fig> <fig id="c10502"><art file="c10502.gif" w="500" h="807" tw="150" th="242"/><no>2</no><fgc>On this page from <it>Principles of Numerical Analysis</it>, Alston Householder uses partitioned matrices and LU decomposition. Courtesy of McGraw-Hill.</fgc></fig> The contrast between these two excerpts is striking. The most obvious difference is that Dwyer used scalar equations whereas Householder used partitioned matrices. … March 2, 2004 Ray R. Larson

SGML/XML Support …<sec><st>CONCLUSION</st>
<ip1>The big six are not the only decompositions in use; in fact, there are many more. As mentioned earlier, certain intermediate forms—such as tridiagonal and Hessenberg forms—have come to be regarded as decompositions in their own right. Since the singular value decomposition is expensive to compute and not readily updated, rank-revealing alternatives have received considerable attention.<ref rid="bibc105054" type="bib">54</ref><super>,</super><ref rid="bibc105055" type="bib">55</ref> There are also generalizations of the singular value decomposition and the Schur decomposition for pairs of matrices. <ref rid="bibc105056" type="bib">56</ref><super>,</super><ref rid="bibc105057" type="bib">57</ref> All crystal balls become cloudy when they look to the future, but it seems safe to say that as long as new matrix problems arise, new decompositions will be devised to solve them.</ip1> </sec> </bdy> <bm> <ack><h>Acknowledgment</h> <ip1><it>This work was supported by the National Science Foundation under Grant No </it></ip1> </ack> <bib><bibl><h>References</h> <bb id="bibc10501"><au><fnm>P.S.</fnm><snm>Dwyer</snm></au><ti>Linear Computations,</ti> <obi>John Wiley & Sons,</obi><loc><cty>New York,</cty></loc><pdt><yr>1951.</yr></pdt></bb> <bb id="bibc10502"><au><fnm>A.S.</fnm><snm>Householder</snm></au><ti>Principles of Numerical Analysis,</ti> <obi>McGraw-Hill,</obi><loc><cty>New York,</cty></loc><pdt><yr>1953.</yr></pdt></bb> <bb id="bibc10503"><au><fnm>J.H.</fnm><snm>Wilkinson</snm></au><obi>and</obi> <au><fnm>C.</fnm><snm>Reinsch</snm></au><ti>Handbook for Automatic Computation, Vol. 
II, Linear Algebra,</ti> <obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt><yr>1971.</yr></pdt></bb> <bb id="bibc10504"><au><fnm>B.S.</fnm><snm>Garbow</snm></au> <obi>et al.,</obi><atl>"Matrix Eigensystem Routines—Eispack Guide Extension,"</atl> <ti>Lecture Notes in Computer Science,</ti><obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt> <yr>1977.</yr></pdt></bb> <bb id="bibc10505"><au><fnm>J.J.</fnm><snm>Dongarra</snm></au><obi>et al.,</obi> <ti>LINPACK User's Guide,</ti> <obi>SIAM,</obi><loc><cty>Philadelphia,</cty></loc><pdt><yr>1979.</yr></pdt></bb> … March 2, 2004 Ray R. Larson

SGML/XML Support
INEX CAS Query
<?xml version="1.0" encoding="ISO "?>
<!DOCTYPE inex_topic SYSTEM "topic.dtd">
<inex_topic topic_id="70" query_type="CAS" ct_no="49">
<title> /article[about(./fm/abs,'"information retrieval" "digital libraries"')]</title>
<description>Retrieve articles with an abstract indicating the article is about information retrieval and/or digital libraries</description>
<narrative>To be relevant the retrieved articles must be about information retrieval, digital libraries or, preferably both. Articles about information retrieval from digital libraries will receive the highest relevance judgements.</narrative>
<keywords>information retrieval,digital libraries</keywords>
</inex_topic>

SGML/XML Support
Configuration files for the Server are also SGML/XML:
  They include tags describing all of the data files and indexes for the database.
  They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

Cheshire Configuration Files
<DBCONFIG>
<DBENV>/projects/is240/GroupX/indexes </DBENV>
<! >
<FILEDEF TYPE=SGML>
  <DEFAULTPATH>/projects/is240/GroupX </DEFAULTPATH>
  <FILETAG> trec </FILETAG>
  <FILENAME> /projects/is240/ft </FILENAME>
  <CONTINCLUDE> /projects/is240/ft.CONT </CONTINCLUDE>
  <FILEDTD> /projects/is240/TREC.FT.DTD </FILEDTD>
  <ASSOCFIL> ft.assoc </ASSOCFIL>
  <HISTORY> cheshire_index/TESTDATA.history </HISTORY>
  …

Indexing
Any SGML/XML tagged field or attribute can be indexed:
  B-Tree and Hash access via Berkeley DB (Sleepycat)
  Stemming, keyword, exact keys and “special keys”
  Mapping from any Z39.50 Attribute combination to a specific index
  Underlying postings information includes term frequency for probabilistic searching
SGML may include the address of full-text for indexing
New indexes can be easily added, or old ones deleted

Bitmapped Indexes
Bitmap indexes can be used for Boolean operations where the data has only a few values and very large numbers of items with each value
Only one bit per record is stored in the index
Processed on a demand basis, so only blocks with the bits needed to resolve a query are fetched
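A small in-memory sketch shows why bitmaps suit low-cardinality fields: each distinct value gets one bit string with one bit per record, and Boolean operations become bitwise operations. The class and method names are hypothetical, Python integers stand in for the stored bit blocks, and the on-demand block fetching described above is not modelled.

```python
class BitmapIndex:
    """One bitmap per distinct value; bit i is set when record i has that value."""

    def __init__(self):
        self._bitmaps = {}
        self._n_records = 0

    def add(self, record_id, value):
        """Mark record_id (a non-negative integer) as having the given value."""
        self._bitmaps[value] = self._bitmaps.get(value, 0) | (1 << record_id)
        self._n_records = max(self._n_records, record_id + 1)

    def bitmap(self, value):
        """Return the bitmap for a value (0 if the value was never seen)."""
        return self._bitmaps.get(value, 0)

    def complement(self, bits):
        """Boolean NOT, restricted to the records known to the index."""
        return ~bits & ((1 << self._n_records) - 1)

    @staticmethod
    def record_ids(bits):
        """Expand a bitmap into a sorted list of record numbers."""
        ids = []
        position = 0
        while bits:
            if bits & 1:
                ids.append(position)
            bits >>= 1
            position += 1
        return ids
```

With this layout, AND and OR between two values are `a & b` and `a | b` on the bitmaps, and the result only needs expanding into record numbers at the end.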

<INDEXES>
<!-- The following provides document number access -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE PRIMARYKEY=IGNORE>
  <INDXNAME> cheshire_index/trec.docno.index </INDXNAME>
  <INDXTAG> docno </INDXTAG>
  <INDXMAP> <USE> 12 </USE><struct> 1 </struct> </INDXMAP>
  <INDXMAP> <USE> 12 </USE><struct> 2 </struct> </INDXMAP>
  <INDXMAP> <USE> 12 </USE><struct> 6 </struct> </INDXMAP>
  <INDXKEY>
    <TAGSPEC> <FTAG>DOCNO </FTAG> </TAGSPEC>
  </INDXKEY>
</INDEXDEF>
…

<!-- The following is the primary index for probabilistic searches -->
<!-- It includes headlines, datelines, bylines, and full text -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
  <INDXNAME> cheshire_index/trec.topic.index </INDXNAME>
  <INDXTAG> topic </INDXTAG>
  <INDXMAP> <USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <INDXMAP> <USE> 29 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  …
  <STOPLIST> cheshire_index/topicstoplist </STOPLIST>
  <INDXKEY>
    <TAGSPEC>
      <FTAG>HEADLINE </FTAG>
      <FTAG>DATELINE </FTAG>
      <FTAG>BYLINE </FTAG>
      <FTAG>TEXT </FTAG>
    </TAGSPEC>
  </INDXKEY>
</INDEXDEF>

Cheshire II – EVI Generation
Entry Vocabulary Indexes can improve access to data with controlled index terms
Define the basis for clustering records:
  Select the field that forms the basis of the cluster.
  Select the evidence fields to use as the contents of the pseudo-documents.
During indexing, cluster keys are generated with basis and evidence from each record.
Cluster keys are sorted and merged on basis, and a pseudo-document is created for each unique basis element, containing all evidence fields.
Pseudo-documents (class clusters) are indexed on the combined evidence fields.
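The clustering steps above can be sketched in Python. This is an in-memory illustration only: grouping by dictionary key stands in for the sort-and-merge pass over cluster keys, records are assumed to be dictionaries mapping field names to lists of strings, and the function name and the field names in the usage example are hypothetical.

```python
from collections import defaultdict


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Group evidence from all records that share a basis value.

    Returns {basis_value: {evidence_field: [values...]}}, ordered by basis value.
    """
    clusters = defaultdict(lambda: {field: [] for field in evidence_fields})
    for record in records:
        for basis_value in record.get(basis_field, []):
            for field in evidence_fields:
                # Append this record's evidence to the cluster's pseudo-document.
                clusters[basis_value][field].extend(record.get(field, []))
    return {basis: clusters[basis] for basis in sorted(clusters)}


records = [
    {"class": ["QA76"], "title": ["information systems"],
     "subject": ["management information systems"]},
    {"class": ["QA76"], "title": ["database design"], "subject": []},
    {"class": ["Z699"], "title": ["information retrieval"]},
]
pseudo_documents = build_pseudo_documents(records, "class", ("title", "subject"))
```

Each resulting pseudo-document could then be indexed like any other record, so that a free-text query retrieves the controlled terms (the basis values) most associated with its words.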

EVI/Cluster Definitions
<CLUSTER>
<clusname> classcluster </clusname>
<cluskey normal=CLASSCLUS>
  <tagspec> <FTAG>FLD950 </FTAG> <s> ^a </s> </tagspec>
</cluskey>
<stoplist> /usr3/cheshire2/data2/clasclusstoplist </stoplist>
<clusmap>
  <from> <tagspec>
    <ftag>FLD245</ftag><s>^[ab]</s>
    <ftag>FLD440</ftag><s>^a</s>
    <ftag>FLD490</ftag><s>^a</s>
    <ftag>FLD830</ftag><s>^a</s>
    <ftag>FLD740</ftag><s>^a</s>
  </tagspec></from>
  <to> <tagspec> <ftag>titles</ftag> </tagspec></to>
  <from> <tagspec>
    <ftag>FLD6..</ftag><s>^[abcdxyz]</s>
  </tagspec></from>
  <to> <tagspec> <ftag>subjects</ftag> </tagspec></to>
  <summarize>
    <maxnum> 5 </maxnum>
    <tagspec> <ftag>subjsum</ftag> </tagspec>
  </summarize>
</clusmap>
</CLUSTER>

Component Extraction and Indexing
Any element (or range of SGML/XML data starting with one element and ending with another) can be defined as a ‘component’ and accessed and indexed as if it were an entire document.
Component indexes and document-level indexes can be combined in search operations, and special operators permit selection of documents or components as the result.
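To make the idea concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It covers only the simplest case, where a component is a single element identified by its tag name; the start-tag-to-end-tag range case is not handled, and the function name and returned structure are illustrative assumptions rather than anything Cheshire exposes.

```python
import xml.etree.ElementTree as ET


def extract_components(xml_text, start_tag):
    """Return every start_tag element of a document as a separate component.

    Each component carries its own whitespace-normalised text, so it can be
    indexed as if it were a document in its own right.
    """
    root = ET.fromstring(xml_text)
    components = []
    for element in root.iter(start_tag):
        # Collect all text inside the element, including nested elements.
        text = " ".join(" ".join(element.itertext()).split())
        components.append({"tag": element.tag, "text": text})
    return components


article = (
    "<article><bdy>"
    "<sec><st>Intro</st><p>First  para.</p></sec>"
    "<sec><st>End</st></sec>"
    "</bdy></article>"
)
sections = extract_components(article, "sec")
```

Each entry in `sections` could then be fed to a separate component index, while the whole article continues to be indexed at document level.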

Component Definitions
<COMPONENTS>
<COMPONENTDEF>
  <COMPONENTNAME> TESTDATA/COMPONENT_DB1 </COMPONENTNAME>
  <COMPONENTNORM>NONE</COMPONENTNORM>
  <COMPSTARTTAG>
    <TAGSPEC>
      <FTAG>mainenty </FTAG>
      <FTAG>titles </FTAG>
    </TAGSPEC>
  </COMPSTARTTAG>
  <COMPENDTAG>
    <TAGSPEC><FTAG>Fld300 </FTAG></TAGSPEC>
  </COMPENDTAG>
  <COMPONENTINDEXES>
    <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
      <INDXNAME> TESTDATA/comp1index1.author
      …
    </INDEXDEF>
</COMPONENTDEF>
</COMPONENTS>

Result Formatting (Display)
<DISPOPTIONS> KEEP_ENTITIES </DISPOPTIONS>
<DISPLAY>
<FORMAT NAME="B" OID=" " DEFAULT>
  <convert function="TAGSET-G">
    <clusmap>
      <from> <tagspec> <ftag>DOCNO</ftag> </tagspec></from>
      <to> <tagspec> <ftag>28</ftag> </tagspec></to>
      <from> <tagspec> <ftag>#DOCID#</ftag> </tagspec></from>
      <to> <tagspec> <ftag>5</ftag> </tagspec></to>
    </clusmap>
  </convert>
</FORMAT>
</DISPLAY>

INEX Configuration Example
<DBCONFIG>
<DBENV>/projects/metadata/cheshire/TREC/cheshire_index </DBENV>
<! >
<FILEDEF TYPE=XML>
  <DEFAULTPATH> /projects/metadata/cheshire/INEX </DEFAULTPATH>
  <FILETAG> INEX </FILETAG>
  <FILENAME> inex-1.3/xml </FILENAME>
  <CONTINCLUDE> inex-1.3/xml_main.cont </CONTINCLUDE>
  <FILEDTD> inex-1.3/dtd/wrapper.dtd </FILEDTD>
  <SGMLCAT> inex-1.3/dtd/catalog </SGMLCAT>
  <ASSOCFIL> inex-1.3/xml_main.assoc </ASSOCFIL>
  <HISTORY> inex.history </HISTORY>

<INDEXES>
<!-- The following provides document number access -->
<INDEXDEF ACCESS=BTREE EXTRACT=EXACTKEY NORMAL=DO_NOT_NORMALIZE PRIMARYKEY=IGNORE>
  <INDXNAME> indexes/docno.index </INDXNAME>
  <INDXTAG> docno </INDXTAG>
  <INDXMAP> <USE> 12 </USE><struct> 1 </struct> </INDXMAP>
  <INDXMAP> <USE> 12 </USE><struct> 2 </struct> </INDXMAP>
  <INDXMAP> <USE> 12 </USE><struct> 6 </struct> </INDXMAP>
  <INDXKEY> <TAGSPEC> <FTAG> doi </FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
  <INDXNAME> indexes/pauthor.index </INDXNAME>
  <INDXTAG> pauthor </INDXTAG>
  <STOPLIST> indexes/authorstoplist </STOPLIST>
  <INDXKEY>
    <TAGSPEC>
      <FTAG>fm</FTAG><S>au</S><S>snm</S>
      <FTAG>fm</FTAG><S>au</S><S>fnm</S>
    </TAGSPEC>
  </INDXKEY>
</INDEXDEF>

<!-- The following provides keyword title access -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM>
  <INDXNAME> indexes/title.index </INDXNAME>
  <INDXTAG> title </INDXTAG>
  <INDXMAP> <USE> 4 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <INDXMAP> <USE> 5 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <INDXMAP> <USE> 6 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <STOPLIST> indexes/titlestoplist </STOPLIST>
  <INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><S>tig</S><S>atl</S> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<!-- The following is the primary index for probabilistic searches -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM>
  <INDXNAME> indexes/topic.index </INDXNAME>
  <INDXTAG> topic </INDXTAG>
  <INDXMAP> <USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  …
  <INDXMAP> <USE> 1017 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <STOPLIST> indexes/topicstoplist </STOPLIST>
  <INDXKEY>
    <TAGSPEC>
      <FTAG>fm</FTAG><S>tig</S><S>atl</S>
      <FTAG>abs</FTAG>
      <FTAG>bdy</FTAG>
      <FTAG>bibl</FTAG><S>bb</S><S>atl</S>
      <FTAG>app</FTAG>
    </TAGSPEC>
  </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=DATE NORMAL=YEAR>
  <INDXNAME> indexes/date.index </INDXNAME>
  <INDXTAG> date </INDXTAG>
  <!-- the appropriate Z39.50 BIB1 attribute numbers -->
  <INDXMAP> <USE> 30 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <INDXMAP> <USE> 30 </USE><POSIT> 3 </posit> <struct> 5 </struct> </INDXMAP>
  <INDXKEY> <TAGSPEC> <FTAG>hdr2</FTAG><s>yr</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
  <INDXNAME> indexes/journal.index </INDXNAME>
  <INDXTAG> journal </INDXTAG>
  <INDXMAP> <USE> 1022 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <INDXMAP> <USE> 1022 </USE><POSIT> 3 </posit> <struct> 5 </struct> </INDXMAP>
  <INDXKEY> <TAGSPEC> <FTAG>hdr1</FTAG><s>ti</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
  <INDXNAME> indexes/keywords.index </INDXNAME>
  <INDXTAG> kwd </INDXTAG>
  <INDXMAP> <USE> 3121 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <STOPLIST> indexes/topicstoplist </STOPLIST>
  <INDXKEY> <TAGSPEC> <FTAG>kwd</FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
  <INDXNAME> indexes/abstract.index </INDXNAME>
  <INDXTAG> abstract </INDXTAG>
  <INDXMAP> <USE> 62 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <STOPLIST> indexes/topicstoplist </STOPLIST>
  <INDXKEY> <TAGSPEC> <FTAG>abs</FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
  <INDXNAME> indexes/author_seq.index </INDXNAME>
  <INDXTAG> author_seq </INDXTAG>
  <INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><S>au</S><ATTR>sequence</ATTR> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
  <INDXNAME> indexes/bib_author_fnm.index </INDXNAME>
  <INDXTAG> bib_author_fnm </INDXTAG>
  <INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <INDXKEY> <TAGSPEC> <FTAG>bb</FTAG><s>au</s><s>fnm</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
  <INDXNAME> indexes/bib_author_snm.index </INDXNAME>
  <INDXTAG> bib_author_snm </INDXTAG>
  <INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <INDXKEY> <TAGSPEC> <FTAG>bb</FTAG><s>au</s><s>snm</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM>
  <INDXNAME> indexes/fig.index </INDXNAME>
  <INDXTAG> fig </INDXTAG>
  <INDXMAP> <USE> 3150 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <STOPLIST> indexes/topicstoplist </STOPLIST>
  <INDXKEY> <TAGSPEC> <FTAG>fig</FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM>
  <INDXNAME> indexes/ack.index </INDXNAME>
  <INDXTAG> ack </INDXTAG>
  <INDXMAP> <USE> 3188 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <STOPLIST> indexes/topicstoplist </STOPLIST>
  <INDXKEY> <TAGSPEC> <FTAG>ack</FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
  <INDXNAME> indexes/alltitles.index </INDXNAME>
  <INDXTAG> alltitles </INDXTAG>
  <INDXMAP> <USE> 3188 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <STOPLIST> indexes/titlestoplist </STOPLIST>
  <INDXKEY>
    <TAGSPEC>
      <FTAG>atl</FTAG>
      <FTAG>st</FTAG>
    </TAGSPEC>
  </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
  <INDXNAME> indexes/affil.index </INDXNAME>
  <INDXTAG> affil </INDXTAG>
  <INDXMAP> <USE> 3189 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <STOPLIST> indexes/titlestoplist </STOPLIST>
  <INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><s>aff</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=none>
  <INDXNAME> indexes/fno.index </INDXNAME>
  <INDXTAG> fno </INDXTAG>
  <INDXMAP> <USE> 3192 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <INDXKEY> <TAGSPEC> <FTAG>fno</FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=INTEGER NORMAL=NONE>
  <INDXNAME> indexes/figno.index </INDXNAME>
  <INDXTAG> figno </INDXTAG>
  <INDXMAP> <USE> 3193 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <INDXKEY> <TAGSPEC> <FTAG>fig</FTAG><s>no</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
  <INDXNAME> indexes/topicshort.index </INDXNAME>
  <INDXTAG> topicshort </INDXTAG>
  <INDXMAP> <USE> 3192 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <INDXKEY>
    <TAGSPEC>
      <FTAG>fm</FTAG><S>tig</S><S>atl</S>
      <FTAG>abs</FTAG>
      <FTAG>kwd</FTAG>
      <FTAG>st</FTAG>
    </TAGSPEC>
  </INDXKEY>
</INDEXDEF>
</INDEXES>

<COMPONENTS>
<COMPONENTDEF>
  <COMPONENTNAME> indexes/COMPONENT_SECTION </COMPONENTNAME>
  <COMPONENTNORM>NONE</COMPONENTNORM>
  <COMPSTARTTAG> <TAGSPEC> <FTAG>sec</FTAG> </TAGSPEC> </COMPSTARTTAG>
  <COMPONENTINDEXES>
    <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE>
      <INDXNAME> indexes/sec_title2.index </INDXNAME>
      <INDXTAG> sec_title </INDXTAG>
      <STOPLIST> indexes/titlestoplist </STOPLIST>
      <INDXKEY> <FTAG>sec</FTAG><s>st</s> </INDXKEY>
    </INDEXDEF>
    <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
      <INDXNAME> indexes/sec_words.index </INDXNAME>
      <INDXTAG> sec_words </INDXTAG>
      <STOPLIST> indexes/topicstoplist </STOPLIST>
      <INDXKEY> <TAGSPEC> <FTAG>sec</FTAG> </TAGSPEC> </INDXKEY>
    </INDEXDEF>
  </COMPONENTINDEXES>
</COMPONENTDEF>

<COMPONENTDEF>
  <COMPONENTNAME> indexes/COMPONENT_BIB </COMPONENTNAME>
  <COMPONENTNORM>NONE</COMPONENTNORM>
  <COMPSTARTTAG> <TAGSPEC> <FTAG>bm</FTAG><S>bib</S><s>bibl</s><s>bb</s> </TAGSPEC> </COMPSTARTTAG>
  <COMPONENTINDEXES>
    <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
      <INDXNAME> indexes/bib_author.index </INDXNAME>
      <INDXTAG> bib_author </INDXTAG>
      <INDXKEY> <FTAG>au</FTAG> </INDXKEY>
    </INDEXDEF>
    <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE>
      <INDXNAME> indexes/bib_title.index </INDXNAME>
      <INDXTAG> bib_title </INDXTAG>
      <INDXKEY> <TAGSPEC> <FTAG>atl</FTAG> </TAGSPEC> </INDXKEY>
    </INDEXDEF>
    <INDEXDEF ACCESS=BTREE EXTRACT=DATE NORMAL=YEAR>
      <INDXNAME> indexes/bib_date.index </INDXNAME>
      <INDXTAG> bib_date </INDXTAG>
      <INDXKEY> <TAGSPEC> <FTAG>pdt</FTAG><s>yr</s> </TAGSPEC> </INDXKEY>
    </INDEXDEF>
  </COMPONENTINDEXES>
</COMPONENTDEF>

<COMPONENTDEF>
  <COMPONENTNAME> indexes/COMPONENT_PARAS </COMPONENTNAME>
  <COMPONENTNORM>NONE</COMPONENTNORM>
  <COMPSTARTTAG>
    <TAGSPEC> <FTAG>^ilrj$|^ip1$|^ip2$|^ip3$|^ip4$|^ip5$|^item-none$|^p$|^p1$|^p2$|^p3$|^tmath$|^tf$</FTAG> </TAGSPEC>
  </COMPSTARTTAG>
  <COMPONENTINDEXES>
    <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
      <INDXNAME> indexes/para_words.index </INDXNAME>
      <INDXTAG> para_words </INDXTAG>
      <STOPLIST> indexes/topicstoplist </STOPLIST>
      <INDXKEY> <FTAG>.*</FTAG> </INDXKEY>
    </INDEXDEF>
  </COMPONENTINDEXES>
</COMPONENTDEF>

<COMPONENTDEF>
  <COMPONENTNAME> indexes/COMPONENT_FIG </COMPONENTNAME>
  <COMPONENTNORM>NONE</COMPONENTNORM>
  <COMPSTARTTAG> <TAGSPEC> <FTAG>fig</FTAG> </TAGSPEC> </COMPSTARTTAG>
  <COMPONENTINDEXES>
    <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
      <INDXNAME> indexes/fig_caption.index </INDXNAME>
      <INDXTAG> fig_caption </INDXTAG>
      <STOPLIST> indexes/titlestoplist </STOPLIST>
      <INDXKEY> <FTAG>fgc</FTAG> </INDXKEY>
    </INDEXDEF>
  </COMPONENTINDEXES>
</COMPONENTDEF>

<COMPONENTDEF>
  <COMPONENTNAME> indexes/COMPONENT_VITAE </COMPONENTNAME>
  <COMPONENTNORM>NONE</COMPONENTNORM>
  <COMPSTARTTAG> <TAGSPEC> <FTAG>vt</FTAG> </TAGSPEC> </COMPSTARTTAG>
  <COMPONENTINDEXES>
    <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE>
      <INDXNAME> indexes/vitae_words.index </INDXNAME>
      <INDXTAG> vt_vitae </INDXTAG>
      <STOPLIST> indexes/titlestoplist </STOPLIST>
      <INDXKEY> </INDXKEY>
    </INDEXDEF>
  </COMPONENTINDEXES>
</COMPONENTDEF>
</COMPONENTS>

<DISPOPTIONS> KEEP_ENTITIES </DISPOPTIONS> <DISPLAY> <DISPLAYDEF NAME="B" OID=" " DEFAULT> <convert function="MIXED"> <clusmap> <from> <tagspec> <ftag>doi</ftag> </tagspec></from> <to> <ftag>28</ftag> </tagspec></to> <ftag>#DOCID#</ftag> <ftag>5</ftag> <ftag>#DBNAME#</ftag> </tagspec></from>… March 2, 2004 Ray R. Larson

<DISPLAYDEF name="XML_ELEMENT_" OID=" "> <convert function="XML_ELEMENT"> <clusmap> <from> <tagspec> <ftag>#FILENAME#</ftag> </tagspec></from> <to> <ftag>FILENAME</ftag> </tagspec></to> <ftag>#RANK#</ftag> <ftag>RANK </ftag> … …<from> <tagspec> <ftag>#RAWSCORE#</ftag> </tagspec></from> <to> <ftag>RAWSCORE </ftag> </tagspec></to> <from> <ftag> SUBST_ELEMENT </ftag> </tagspec> </to> </clusmap> </convert> </DISPLAYDEF> </DISPLAY> </FILEDEF> </DBCONFIG> March 2, 2004 Ray R. Larson

XML Schemas and Element Retrieval

XML Schema Support XML Schemas or DTD’s can be used to define the data contents Tested with a wide variety of schemas including METS (with various supporting schemas) March 2, 2004 Ray R. Larson

XML Element Extraction
A new search “ElementSetName” is available: XML_ELEMENT_. Any XPath, element name, or regular expression can be appended after the final underscore when submitting a present request (note that only a subset of full XPath is available). The matching elements are extracted from the records that match the search and delivered in a simple format.

XML Extraction
% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML }} {
<RESULT_DATA DOCID="1">
<ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
<Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245>
</ITEM>
<RESULT_DATA> … etc…
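The extraction step can be pictured with a small Python sketch. It is not the Cheshire II implementation: it only handles a plain element name (no XPath subset or regular expressions), the function name `extract_elements` is made up for illustration, and the RESULT_DATA/ITEM wrapper simply imitates the output shown in the session above.

```python
import xml.etree.ElementTree as ET

def extract_elements(record_xml, tag, docid):
    """Collect every descendant element named `tag` into a RESULT_DATA wrapper.

    Hypothetical helper: mirrors the shape of the XML_ELEMENT_ output above,
    but matches on element name only and ignores the root element itself.
    """
    root = ET.fromstring(record_xml)
    result = ET.Element("RESULT_DATA", {"DOCID": str(docid)})

    def walk(elem, path):
        seen = {}  # per-parent counter so each step gets a 1-based position
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            child_path = f"{path}/{child.tag}[{seen[child.tag]}]"
            if child.tag == tag:
                item = ET.SubElement(result, "ITEM", {"XPATH": child_path})
                item.append(child)
            walk(child, child_path)

    walk(root, f"/{root.tag}[1]")
    return result

record = (
    "<USMARC><VarFlds><VarDFlds><Titles>"
    "<Fld245><a>Alice</a></Fld245>"
    "</Titles></VarDFlds></VarFlds></USMARC>"
)
out = extract_elements(record, "Fld245", 1)
print(ET.tostring(out, encoding="unicode"))
```

Each ITEM carries the positional path of the element it wraps, so a client can tell where in the record the fragment came from.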

Database Storage
All data is stored as SGML/XML flat text files, plus optional linked full-text (non-XML) files.
The file format is defined through an SGML/XML DTD (also a flat text file) or Schema.
“Associator” files provide indexed direct access to each record in the SGML/XML files. They contain the offset and record length for each “record”.
Associators can be built to index any conformant document in a directory sub-tree.
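The associator idea (an offset and a length per record) can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the Cheshire II file format: records are located by a literal closing tag, the table is kept in memory rather than in a file, and `build_associator` and `fetch_record` are hypothetical names.

```python
import os
import re
import tempfile

def build_associator(path, end_tag):
    """Return [(offset, length), ...], one entry per record ending in end_tag."""
    with open(path, "rb") as f:
        data = f.read()
    entries = []
    start = 0
    for match in re.finditer(re.escape(end_tag), data):
        # Skip whitespace that separates one record from the next.
        while start < match.start() and data[start:start + 1].isspace():
            start += 1
        entries.append((start, match.end() - start))
        start = match.end()
    return entries

def fetch_record(path, entries, n):
    """Read record n directly, without scanning the records before it."""
    offset, length = entries[n]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Small demonstration with a temporary two-record file.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(b"<rec>one</rec>\n<rec>two</rec>\n")
    demo_path = tmp.name

entries = build_associator(demo_path, b"</rec>")
second = fetch_record(demo_path, entries, 1)
os.remove(demo_path)
print(entries, second)
```

Because every lookup is a seek plus a fixed-size read, the data file itself can stay a plain flat text file.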

INEX CO Runs
Three official runs and one later run, all title-only:
Fusion: combines Okapi and LR using the MERGE_CMBZ operator
NewParms (LR): uses only LR with the new parameters
Feedback: an attempt at blind relevance feedback
PostFusion: fusion of the new LR coefficients and Okapi

Query Generation - CO # 162 TITLE = Text and Index Compression Algorithms QUERY: {Text and Index Compression Algorithms}) !MERGE_CMBZ {Text and Index Compression Algorithms}) !MERGE_CMBZ {Text and Index Compression Algorithms}) !MERGE_CMBZ {Text and Index Compression Algorithms}) @+ is is LR !MERGE_CMBZ is a normalized score summation and enhancement March 2, 2004 Ray R. Larson
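As a rough picture of "normalized score summation and enhancement", here is a CombMNZ-style sketch in Python. It is an assumption about what such an operator does, not the MERGE_CMBZ code: scores are min-max normalized per result list, summed, and then multiplied by the number of lists an item appears in. The function names are hypothetical.

```python
def normalize(scores):
    """Min-max normalize a {doc_id: score} mapping into the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse_results(*result_lists):
    """Sum normalized scores and boost items found in several result lists."""
    normed = [normalize(r) for r in result_lists if r]
    fused = {}
    for doc in set().union(*normed):
        present = [n[doc] for n in normed if doc in n]
        # "Enhancement" is modelled here as multiplying by the hit count.
        fused[doc] = sum(present) * len(present)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

okapi = {"a": 10.0, "b": 5.0, "c": 1.0}   # made-up Okapi scores
lr = {"b": 0.9, "d": 0.1}                 # made-up LR probabilities
ranked = fuse_results(okapi, lr)
print(ranked)
```

Normalizing first matters because Okapi scores and LR probabilities live on different scales; without it the summation would be dominated by whichever method produces larger raw numbers.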

INEX CO Runs Strict Generalized Avg Prec FUSION = 0.0642
NEWPARMS = FDBK = POSTFUS = Avg Prec FUSION = NEWPARMS = FDBK = POSTFUS = March 2, 2004 Ray R. Larson

INEX VCAS Runs
Two official runs:
FUSVCAS: element fusion using LR and various operators for path restriction
NEWVCAS: uses the new LR coefficients for each appropriate index and various operators for path restriction

Query Generation - VCAS
#66 TITLE = //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)] Submitted query = {intelligent transport systems})) !RESTRICT_FROM {on-board route planning navigation system for automobiles})) Target elements: sec|ss1|ss2|ss3 March 2, 2004 Ray R. Larson

VCAS Results Generalized Strict Avg Prec FUSVCAS = 0.0321
NEWVCAS = Avg Prec FUSVCAS = NEWVCAS = March 2, 2004 Ray R. Larson

Heterogeneous Track
The approach uses Cheshire’s Virtual Database options and is primarily a version of distributed IR.
Each collection is indexed separately.
Search is via Z39.50 distributed queries.
Z39.50 attribute mapping is used to map query indexes to the appropriate elements in a given collection.
Only LR is used, and collection results are merged using the probability of relevance for each collection result.
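Merging by probability of relevance can be sketched as follows. This is a minimal illustration, assuming each collection returns its hits already sorted by descending LR probability and that the probabilities are comparable across collections; `merge_by_probability` and the collection names are invented for the example.

```python
import heapq
from itertools import islice

def merge_by_probability(per_collection, k=10):
    """Interleave per-collection hit lists into one ranking of at most k hits.

    per_collection maps a collection name to [(record_id, probability), ...],
    each list sorted by descending probability.
    """
    streams = (
        [(prob, name, rec) for rec, prob in hits]
        for name, hits in per_collection.items()
    )
    merged = heapq.merge(*streams, key=lambda t: t[0], reverse=True)
    return [(name, rec, prob) for prob, name, rec in islice(merged, k)]

results = {
    "collA": [("a1", 0.9), ("a2", 0.2)],
    "collB": [("b1", 0.5)],
}
top = merge_by_probability(results, k=3)
print(top)
```

No score normalization is needed in this sketch precisely because only one ranking method (LR) is used on every collection.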

Heterogeneous Track Issues
Very large “documents”: our approach was to segment them.
Reporting XPath after segmenting large documents.

Database Storage
[Figure: database storage layout. Labels: Config File, Remote RDBMS, Index File, Page Data File, Postings File, History File, SGML/XML File, Associator File, DTD File, Prox Data File, Cluster File]

Client/Server Architecture
Server supports: database storage; indexing; Z39.50 access to local data; Boolean and probabilistic searching; relevance feedback; external SQL database support.
Client supports: programmability (Tcl/Tk, Python soon); graphical user interface; Z39.50 access to remote servers; SGML and MARC formatting; combined client/server CGI scripting via WebCheshire.

Z39.50 Overview
[Figure: Z39.50 overview. Labels: UI, Internet, Search Engine, Map Query, Map Results]

Two Protocols: HTTP & Z39.50

Server Z39.50 Support
Locally developed Z39.50 library.
Extended version 3 support: supports version 3 attributes in BIB-1, including “stem”, “relevance”, etc.
Support for “type 102” ranked queries (version 4) is also being added.
Can provide MARC, SUTRS and SGML records, with support for Explain and GRS-1 conversion of any SGML records.

Distributed Search

The Problem
The Digital Library vision: access for everyone to “all human knowledge”.
Lyman and Varian’s estimates of the “Dark Web”.
Hundreds or thousands of servers with databases ranging widely in content, topic and format.
Broadcast search is expensive, both in bandwidth and in processing too many irrelevant results.
How to select the “best” ones to search? Which resource to search first? Which to search next if more is wanted?
Topical/domain constraints on the search selections.
Variable contents of databases (metadata only, full text, multimedia…).

Distributed Search Tasks
Resource Description: how to collect metadata about digital libraries and their collections or databases.
Resource Selection: how to select relevant digital library collections or databases from a large number of databases.
Distributed Search: how to perform parallel or sequential searching over the selected digital library databases.
Data Fusion: how to merge query results from different digital libraries with their different search engines, differing record structures, etc.

An Approach for Distributed Resource Discovery
Distributed resource representation and discovery: a new approach to building resource descriptions based on Z39.50.
Instead of using broadcast search across resources, we use two Z39.50 services:
Identification of database metadata using Z39.50 Explain.
Extraction of distributed indexes using Z39.50 SCAN.
Evaluation:
How efficiently can we build distributed indexes?
How effectively can we choose databases using the index?
How effective is merging search results from multiple sources?
Can we build hierarchies of servers (general/meta-topical/individual)?

Z39.50 Explain
Explain supports searches for:
Server-level metadata: server name, IP addresses, ports.
Database-level metadata: database name, search attributes (indexes and combinations).
Support metadata (record syntaxes, etc.).

Z39.50 SCAN
Originally intended to support browsing.
Query for: database; attributes plus term (i.e., index and start point); step size; number of terms to retrieve; position in response set.
Results: number of terms returned; list of terms and their frequency in the database (for the given attribute combination).

Z39.50 SCAN Results
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos
% zscan title cat
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 27} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8} {catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3} {catch-up 2} …
% zscan topic cat
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} …
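A harvesting script has to turn a response like the ones above into (term, frequency) pairs. The Python sketch below does that with regular expressions; it assumes the brace-delimited layout shown on this slide and single-word terms, and `parse_zscan` is a hypothetical helper rather than part of the Cheshire client.

```python
import re

def parse_zscan(response):
    """Split a zscan response into header fields and (term, frequency) pairs."""
    header = {
        key: int(value)
        for key, value in re.findall(
            r"\{(Status|Terms|StepSize|Position) (\d+)\}", response
        )
    }
    # Everything after the closing "}}" of the SCAN header is the term list.
    body = response.split("}}", 1)[1]
    # Assumes terms contain no spaces; multi-word terms would need a real
    # Tcl list parser.
    terms = [(term, int(freq)) for term, freq in re.findall(r"\{(\S+) (\d+)\}", body)]
    return header, terms

sample = (
    "{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} "
    "{cat 27} {cat-fight 1} {catalan 19}"
)
header, terms = parse_zscan(sample)
print(header, terms)
```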

Resource Index Creation
For all servers, or a topical subset:
Get Explain information.
For each index, use SCAN to extract terms and frequency.
Add term + freq + source index + database metadata to the XML “Collection Document” for the resource.
Planned extensions: post-process indexes (especially geographic names, etc.) for special types of data, e.g. create “geographical coverage” indexes.
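The harvesting loop ends by writing one XML document per resource. A minimal Python sketch of that last step follows; the element and attribute names (`collection`, `index`, `term`, `freq`) are invented for illustration and are not the actual Collection Document schema, and the scan results are passed in as plain data rather than fetched over Z39.50.

```python
import xml.etree.ElementTree as ET

def build_collection_document(db_name, host, scans):
    """Build a collection document from harvested SCAN results.

    scans maps an index name to [(term, frequency), ...].
    All element names here are hypothetical placeholders.
    """
    doc = ET.Element("collection", {"database": db_name, "host": host})
    for index_name, terms in scans.items():
        index_el = ET.SubElement(doc, "index", {"name": index_name})
        for term, freq in terms:
            term_el = ET.SubElement(index_el, "term", {"freq": str(freq)})
            term_el.text = term
    return doc

scans = {
    "title": [("cat", 27), ("catalan", 19)],
    "topic": [("cat", 706)],
}
collection_doc = build_collection_document("bibfile", "sherlock.berkeley.edu", scans)
print(ET.tostring(collection_doc, encoding="unicode"))
```

Keeping the source index on every term is what later allows per-index estimates to be computed and then combined.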

MetaSearch Approach
[Figure: MetaSearch approach. A MetaSearch Server with a Distributed Index sends Explain and Scan queries over the Internet to search engines fronting DB 1 through DB 6, with Map Query and Map Results steps at each end]

Known Issues and Problems
Not all Z39.50 servers support SCAN or Explain.
Solutions that appear to work well:
Probing for attributes instead of Explain (e.g. DC attributes or analogs).
We also support OAI and can extract OAI metadata for servers that support OAI.
Query-based sampling (Callan).
Collection Documents are static and need to be replaced when the associated collection changes.

Evaluation Test Environment
TREC Tipster data (approx. 3 GB).
Partitioned into 236 smaller collections based on source and date by month (no DOE).
High size variability (from 1 to thousands of records).
Same database as used in other distributed search studies by J. French and J. Callan, among others.
Used TREC topics for evaluation (these are the only topics with relevance judgements for all 3 TIPSTER disks).

Harvesting Efficiency
Tested using the databases on the previous slide plus the full FT database (210,158 records, ~600 MB). Average of seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network. Average of seconds. Also tested larger databases (e.g., the TREC FT database, ~600 MB with 7 indexes, was harvested in 131 seconds).

Our Collection Ranking Approach
We attempt to estimate the probability of relevance for a given collection with respect to a query using the Logistic Regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen), with a new algorithm for weight calculation at retrieval time. Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (i.e., fusion of multiple query results).

Probabilistic Retrieval: Logistic Regression
Probability of relevance for a given index is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC). At retrieval the probability estimate is obtained by:
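In the standard logistic regression formulation, the estimate takes the form below. This is the textbook form rather than a transcription of the slide: $c_0, \dots, c_n$ are the fitted coefficients and $X_1, \dots, X_n$ are attribute values computed for the query $Q$ and collection $C$; the symbol names are chosen here for illustration.

```latex
\log O(R \mid Q, C) = c_0 + \sum_{i=1}^{n} c_i X_i,
\qquad
P(R \mid Q, C) = \frac{1}{1 + e^{-\log O(R \mid Q, C)}}
```

The first expression gives the log odds of relevance as a linear combination of the attributes; the second converts log odds into a probability between 0 and 1.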

Probabilistic Retrieval: Logistic Regression attributes
Average Absolute Query Frequency
Query Length
Average Absolute Collection Frequency
Collection size estimate
Average Inverse Collection Frequency
Inverse Document Frequency
(N = number of collections; M = number of terms in common between query and document)
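To make the role of these attributes concrete, here is a Python sketch of turning them into a probability. Everything specific in it is an assumption: the log and square-root transforms, the use of log(N / n) for the inverse collection frequency, the inclusion of log M as a feature, and above all the coefficient values, which are placeholders and not the trained ones.

```python
import math

def lr_attributes(query_tf, coll_tf, coll_df, n_collections, coll_size):
    """Compute an illustrative feature vector for one query/collection pair.

    query_tf: {term: frequency in the query}
    coll_tf:  {term: frequency in this collection}
    coll_df:  {term: number of collections containing the term}
    The transforms below are assumed for illustration only.
    """
    matched = [t for t in query_tf if coll_tf.get(t, 0) > 0]
    m = len(matched)  # M: terms in common between query and collection
    if m == 0:
        return None
    return [
        sum(math.log(query_tf[t]) for t in matched) / m,   # avg. query frequency
        math.sqrt(sum(query_tf.values())),                 # query length
        sum(math.log(coll_tf[t]) for t in matched) / m,    # avg. collection frequency
        math.sqrt(coll_size),                              # collection size estimate
        sum(math.log(n_collections / coll_df[t]) for t in matched) / m,  # avg. ICF
        math.log(m),                                       # matching-term count
    ]

def probability_of_relevance(x, coeffs):
    """Logistic function of the linear combination coeffs[0] + sum(c_i * x_i)."""
    log_odds = coeffs[0] + sum(c * xi for c, xi in zip(coeffs[1:], x))
    return 1.0 / (1.0 + math.exp(-log_odds))

placeholder_coeffs = [-3.0, 0.5, -0.1, 0.2, -0.01, 0.4, 0.3]  # NOT trained values
x = lr_attributes({"cat": 1, "fish": 2}, {"cat": 27, "dog": 4}, {"cat": 3}, 236, 1000)
p = probability_of_relevance(x, placeholder_coeffs)
print(x, p)
```

Collections with no query term in common return `None` and can simply be left out of the ranking.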

Evaluation Effectiveness
Tested using the collection representatives described above (as harvested over the network) and the TIPSTER relevance judgements.
Tested by comparing our approach to known algorithms for ranking collections.
Results were measured against reported results for the Ideal and CORI algorithms and against the optimal “Relevance Based Ranking” (MAX).
Recall analog: how many of the relevant documents occurred in the top n databases, averaged.
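One plain reading of the recall analog is sketched below in Python: for each query, the share of all relevant documents that sit in the top n ranked collections, averaged over queries. The exact normalization used in the reported evaluation is not given on the slide, so this is an assumption, and the function names are hypothetical.

```python
def recall_at_n(ranking, rel_counts, n):
    """Fraction of a query's relevant documents held by the top-n collections.

    ranking:    collection names, best first
    rel_counts: {collection name: number of relevant documents it holds}
    """
    total = sum(rel_counts.values())
    if total == 0:
        return 0.0
    found = sum(rel_counts.get(name, 0) for name in ranking[:n])
    return found / total

def mean_recall_at_n(rankings, rel_by_query, n):
    """Average recall_at_n over all queries that have a ranking."""
    values = [recall_at_n(rankings[q], rel_by_query[q], n) for q in rankings]
    return sum(values) / len(values) if values else 0.0

rel = {"q1": {"A": 6, "B": 3, "C": 1}}
system = {"q1": ["B", "A", "C"]}
optimal = {"q1": ["A", "B", "C"]}   # relevance-based ranking for comparison
r1_system = mean_recall_at_n(system, rel, 1)
r1_optimal = mean_recall_at_n(optimal, rel, 1)
print(r1_system, r1_optimal)
```

Ranking collections by their true number of relevant documents gives the upper bound that any selection algorithm can be compared against at each n.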

Titles only (short query)

Future
Logically clustering servers by topic
Meta-Meta Servers (treating the MetaSearch database as just another database)

Distributed Metadata Servers
Database Servers General Servers Meta-Topical Servers Replicated servers March 2, 2004 Ray R. Larson

Geographic Operators and Search Ranking

The GEO Operations
Operators established for the GEO Z39.50 profile.
Implemented using special operations on indexes.
Indexing allows extraction of geographic coordinates and dates from SGML/XML data in a variety of formats.
Normalized internal representation in indexes.
Search using geographic and time elements as primary or limiting search elements.

The GEO Operations
X-based interfaces permit (simple) map drawing and search.
Interface to MapServer for web-based map searching.

GEO Geographic operators
Operator   Name               Meaning
>=<        Overlap            Search region and data overlap
>#<        Fully Enclosed     Data fully enclosed in search region
<#>        Encloses           Data fully encloses search region
<>#        Fully Outside      Data outside of search region
++         Near               Data is near search region
:<:        Before             Data date is before search date
:<=:       Before or During   Data date is before or during search date
:>=:       During or After    Data date is during or after search date
:>:        After              Data date is after search date
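The four region operators have straightforward definitions on bounding boxes. The Python sketch below shows one way to write them; it assumes axis-aligned boxes that do not cross the date line and treats touching edges as overlapping, and it says nothing about how Cheshire evaluates these operators against its indexes. `Box` and the function names are illustrative.

```python
from typing import NamedTuple

class Box(NamedTuple):
    west: float
    south: float
    east: float
    north: float

def overlaps(data, search):
    """>=< : data and search region share at least one point."""
    return (data.west <= search.east and search.west <= data.east
            and data.south <= search.north and search.south <= data.north)

def fully_enclosed(data, search):
    """>#< : data lies entirely inside the search region."""
    return (search.west <= data.west and data.east <= search.east
            and search.south <= data.south and data.north <= search.north)

def encloses(data, search):
    """<#> : data entirely contains the search region."""
    return fully_enclosed(search, data)

def fully_outside(data, search):
    """<># : data and search region share no point."""
    return not overlaps(data, search)

OPERATORS = {">=<": overlaps, ">#<": fully_enclosed,
             "<#>": encloses, "<>#": fully_outside}

search = Box(0, 0, 10, 10)
inside = Box(2, 2, 4, 4)
straddling = Box(8, 8, 12, 12)
far_away = Box(20, 20, 30, 30)
print({op: fn(straddling, search) for op, fn in OPERATORS.items()})
```

The "Near" operator and the date operators are left out because they need a distance threshold and a date-range representation that the slide does not specify.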

Overlaps search

Fully Enclosed Search

Map-Based Search

GeoSearch Web Interface

MySQL and PostgreSQL

RDBMS Support
There are two reasons for RDBMS support:
IR systems are not meant for LOTS of update transactions.
Some applications need access to both relational data and text data via Z39.50.
MySQL and PostgreSQL are both popular open source RDBMSs, and either can now be used via Cheshire.
Z39.50 mappings to RDBMS columns.
“ZQL”: submission of SQL as a Z39.50 Type 0 query.

Protocol Support

Protocols
In Cheshire II most protocols (except Z39.50) are implemented using scripting.
Example scripts to support the following are included in the distribution: OAI, SRW (Python version), SOAP, SDLIP.

Cheshire III Design and Development

Cheshire III Goals
Retain or reproduce (and refine) all Cheshire II features
“Spring cleaning” of the code base
Add full Unicode support
Store most system and content data in the database
Permit easy and efficient integration in Web Services
Use a threaded server for economy of resource usage
Enhanced multiprotocol support
Support for distributed processing (i.e., GRID clusters)
Enhance expandability and “drop in” functionality
Interfaces and/or APIs for Java, Python, C/C++

Cheshire II Design Overview
[Figure: design overview. Labels: Z Server, Config, Index, Cluster, Cluster Extension, Build Assoc, XML Docs, Cheshire, Assoc, Index(es), Component Definition, XML Directory]

Cheshire III Server Overview
[Figure: Cheshire III server architecture. Legible labels: API, DB API, Remote Systems (any protocol), XML Config & Metadata Info, Indexes, Local DB, Staff UI, Result Sets, Native calls, Z39.50, SOAP, OAI, JDBC, Fetch ID, Put ID, OpenURL, Server Control, UDDI, WSRP, SRW, Normalization, Client User/Clients, OGIS]

Retain Features
The intent is to permit all of the types of indexing, searching and record formatting available now, while making it easier to add new capabilities.
The new system will also support full Unicode for content and for metadata.
Store metadata and content in the database (including config information, etc.).

Permit easy integration of Web Services
The assumption is that the web server will be the central server mechanism in the future. The new design relies on the session handling, threading and load management tools available in Apache. The Cheshire server is dynamically loaded as part of the Web Server.

Multiprotocol Support
The Web server handles the network issues and passes requests in various protocols along to the Cheshire Server. Individual protocol “plugins” and the Protocol Handler convert search, display, and metadata requests in a particular protocol to the internal Cheshire III control language, and convert outgoing messages and data to the appropriate protocol form.
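The plugin arrangement can be pictured as a registry of per-protocol converters feeding one internal request type. The Python sketch below is purely illustrative: the slides do not describe the Cheshire III control language, so `InternalRequest`, the `plugin` decorator, and the two toy protocols (a URL query string and a `zfind` command line) are all hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs

@dataclass
class InternalRequest:
    """Stand-in for a request in the server's internal control language."""
    operation: str
    query: str

PLUGINS = {}

def plugin(protocol):
    """Register a converter from one protocol's raw request to InternalRequest."""
    def register(func):
        PLUGINS[protocol] = func
        return func
    return register

@plugin("url")
def from_url_params(raw):
    # e.g. "operation=searchRetrieve&query=cats"
    params = parse_qs(raw)
    return InternalRequest("search", params["query"][0])

@plugin("zfind")
def from_zfind(raw):
    # e.g. "zfind topic cats"
    command, _, rest = raw.partition(" ")
    if command != "zfind":
        raise ValueError(f"not a zfind command: {raw!r}")
    return InternalRequest("search", rest)

def handle(protocol, raw):
    """Protocol handler: pick the plugin and hand back the internal request."""
    return PLUGINS[protocol](raw)

print(handle("zfind", "zfind topic cats"))
print(handle("url", "operation=searchRetrieve&query=cats"))
```

With this shape, supporting a new protocol means registering one more converter; the search engine behind `handle` never sees protocol-specific syntax.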

Distributed & GRID Processing
The server will support protocols for interchange of partial results and collection statistics, with a single “Master” controlling the actions of a large number of “Slave” servers. These will run in parallel in a GRID environment. This is still “research”, but will probably use “Storage Grid” technology from SDSC with our own applications. Non-Grid use of the same protocols, etc. will be possible (but definitely slower).

Enhanced Expandability
Clearly defined APIs for interacting with the server will permit easy addition of new functionality, and replacement or upgrading of existing functionality. There will be an interactive user interface for database configuration and setup. We want to make it easier for a user/administrator to create and manage the database.

Multilingual APIs
The system is being developed in a multilingual environment. We will include the ability to interface with (at a minimum) Java, Python and C/C++ applications. APIs for developing new functions will be available in these languages as well.

Development
Work is currently going on here (RRL) and (primarily) in the UK. We have incomplete (Alpha) versions of the system, but haven’t been distributing it in its current form (it is changing constantly). The first release version is expected in mid-’04.

Further Information
The full Cheshire II client and server is open source and available for academic and government use: ftp://cheshire.berkeley.edu/pub/cheshire/
Includes HTML documentation
Project Web Site
Archives Hub
