Published by Tomas Bäcker
1
Cheshire II: Features and Internals and Cheshire III overview
Ray R. Larson
School of Information Management and Systems
University of California, Berkeley
March 2, 2004 Ray R. Larson
2
Overview
Cheshire II feature overview
  Logistic Regression Ranking, Okapi BM-25 and Boolean Operations
  Fusion Operators
Additions from INEX ‘03
  Element/Index level re-estimation of LR coefficients
Adhoc and Heterogeneous Track Methodology
Evaluation Results - Adhoc
3
Overview of Cheshire II
It supports SGML and XML with components and component indexes
It is a client/server application
Uses the Z39.50 Information Retrieval Protocol; support for SRW, OAI, SOAP and SDLIP is also implemented
Server supports a Relational Database Gateway
Supports Boolean searching of all servers
Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
Search engine supports “nearest neighbor” searches and relevance feedback
GUI interface on X window displays and Windows NT
WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
Scriptable clients using Tcl and Python
Store SGML/XML as files or “Datastore” database
4
Cheshire II Searching
[Diagram labels: Z39.50; Internet; Images; Scanned Text; Local; Remote]
5
INEX Overview
[Diagram labels: UI or Scripts; Map Query; INEX Search Engine; Local; Net; Map Query Results; Map Results]
6
Boolean Search Capability
All Boolean operations are supported
  “zfind author x and (title y or subject z) not subject A”
Named sets are supported and stored on the server
Boolean operations between stored sets are supported
  “zfind SET1 and subject widgets or SET2”
Nested parentheses and truncation are supported
  “zfind xtitle Alice#”
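The first query above can be read as plain set algebra over per-index result sets. The following is a minimal sketch of that reading, not Cheshire's implementation: the `postings` dictionary, the `lookup` helper and the record ids are all hypothetical, and the server's actual set storage is not described in these slides.

```python
# Hypothetical in-memory postings: (index, term) -> set of record ids.
postings = {
    ("author", "x"): {1, 2, 3, 4},
    ("title", "y"): {2, 5},
    ("subject", "z"): {3, 4},
    ("subject", "a"): {4},
}


def lookup(index, term):
    """Return the set of record ids for one index/term pair (empty if unknown)."""
    return postings.get((index, term), set())


# zfind author x and (title y or subject z) not subject A
result = (
    lookup("author", "x") & (lookup("title", "y") | lookup("subject", "z"))
) - lookup("subject", "a")

# A named set kept for reuse in a later query, e.g. "SET1 and subject z".
named_sets = {"SET1": result}
reused = named_sets["SET1"] & lookup("subject", "z")
```

Here `and`, `or` and `not` map to set intersection, union and difference, which is why stored sets can be combined with new terms without re-running the original search.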
7
Probabilistic Retrieval
Uses the Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen), with a new algorithm for weight calculation at retrieval time
Z39.50 “relevance” operator used to indicate probabilistic search
Any index can have probabilistic searching performed:
  zfind “cheshire cats, looking glasses, march hares and other such things”
  zfind caucus races
Boolean and probabilistic elements can be combined:
  zfind government documents and title guidebooks
8
Probabilistic Retrieval: Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents to determine the values of the coefficients.
At retrieval time the probability estimate is obtained by:
$$\log O(R \mid Q, C) = b_0 + \sum_{i=1}^{6} b_i X_i, \qquad P(R \mid Q, C) = \frac{1}{1 + e^{-\log O(R \mid Q, C)}}$$
for the 6 X attribute measures shown on the next slide
9
Probabilistic Retrieval: Logistic Regression attributes
Average Absolute Query Frequency
Query Length
Average Absolute Component Frequency
Document Length
Average Inverse Component Frequency
Inverse Component Frequency
Number of Terms in common between query and Component -- logged
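As a rough illustration of how six attribute values turn into a probability estimate, the sketch below applies a logistic model using the "Base" coefficients b0..b6 from the deck's "New LR Coefficients" table. The function name and the attribute values in `example_x` are hypothetical, and the inputs are assumed to be already computed (including any logging or normalization), which the slides do not spell out.

```python
import math

# "Base" coefficients b0..b6 as listed in the deck's New LR Coefficients table.
BASE_COEFFS = [-3.700, 1.269, -0.310, 0.679, -0.021, 0.223, 4.010]


def relevance_probability(x, coeffs=BASE_COEFFS):
    """Logistic model: log-odds = b0 + sum(b_i * x_i); returns a probability."""
    if len(x) != len(coeffs) - 1:
        raise ValueError("expected one attribute value per coefficient b1..b6")
    log_odds = coeffs[0] + sum(b * xi for b, xi in zip(coeffs[1:], x))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical, already-computed attribute values X1..X6 for one component.
example_x = [0.5, 1.0, 0.8, 2.0, 1.5, 0.7]
p = relevance_probability(example_x)
```

Components are then ranked by `p`; because the logistic function is monotonic, ranking by the log-odds alone would give the same order.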
10
Combining Boolean and Probabilistic Search Elements
Two original approaches:
  Boolean Approach
  Non-probabilistic “Fusion Search”
Set merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries
11
Okapi BM25
$$\sum_{T \in Q} w^{(1)} \, \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$
Where:
Q is a query containing terms T
K is k1((1-b) + b.dl/avdl)
k1, b and k3 are parameters; k1 and b are usually 1.2 and 0.75
tf is the frequency of the term in a specific document
qtf is the frequency of the term in a topic from which Q was derived
dl and avdl are the document length and the average document length measured in some convenient unit
w(1) is the Robertson-Sparck Jones weight.
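The definitions above can be sketched as a small scoring function. This is an illustrative sketch rather than the Cheshire code: the function names are hypothetical, the Robertson-Sparck Jones weight is taken in its common form without relevance information, and the `k3` default of 1000 is an assumption because the slide does not give a value for it.

```python
import math


def rsj_weight(n_docs, doc_freq):
    """Robertson-Sparck Jones weight, assuming no relevance information."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query_tf, doc_tf, doc_freqs, n_docs, dl, avdl,
               k1=1.2, b=0.75, k3=1000.0):
    """Sum the BM25 contribution of every query term found in the document.

    k3=1000.0 is an assumed default; k1 and b follow the slide.
    """
    K = k1 * ((1.0 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        w1 = rsj_weight(n_docs, doc_freqs[term])
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score


# Hypothetical one-term query against a document of average length.
score = bm25_score({"matrix": 1}, {"matrix": 1}, {"matrix": 10},
                   n_docs=100, dl=50, avdl=50)
```

With `dl == avdl`, K reduces to k1, so for `tf = qtf = 1` both fractions equal 1 and the score is just the term weight.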
12
Merging and Ranking Operators
Extends the capabilities of merging to include merger operations in queries like Boolean operators
Fuzzy Logic Operators (not used for INEX)
  !FUZZY_AND
  !FUZZY_OR
  !FUZZY_NOT
Containment operators: Restrict components to or with a particular parent
  !RESTRICT_FROM
  !RESTRICT_TO
Merge Operators
  !MERGE_SUM
  !MERGE_MEAN
  !MERGE_NORM
  !MERGE_CMBZ
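The slides name the merge operators but do not define them, so the sketch below shows only one plausible reading of a sum-style and a mean-style merge over scored result sets. The function names, the treatment of missing documents as score 0 in the mean, and the sample scores are all assumptions, not Cheshire's definitions of `!MERGE_SUM` or `!MERGE_MEAN`.

```python
def merge_sum(*result_sets):
    """Add up the scores a document receives across all result sets."""
    merged = {}
    for result_set in result_sets:
        for doc_id, score in result_set.items():
            merged[doc_id] = merged.get(doc_id, 0.0) + score
    return merged


def merge_mean(*result_sets):
    """Average over all result sets; a missing document counts as score 0."""
    if not result_sets:
        return {}
    count = len(result_sets)
    return {doc_id: total / count
            for doc_id, total in merge_sum(*result_sets).items()}


def ranked(merged):
    """Order by descending score, breaking ties by document id."""
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores from two subqueries.
topic_hits = {"d1": 0.5, "d2": 0.25}
title_hits = {"d2": 0.5, "d3": 0.125}
final_list = ranked(merge_sum(topic_hits, title_hits))
```

A document found by several subqueries rises in the final list, which is the general effect a fusion step is after.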
13
INEX ‘04 Fusion Search
[Diagram labels: Subquery (x4); Comp. Query Results (x2); Fusion/Merge; Final Ranked List]
Merge multiple ranked and Boolean index searches within each query and multiple component search resultsets
Major components merged are Articles, Body, Sections, subsections, paragraphs
14
New LR Coefficients

Index         b0       b1       b2       b3       b4       b5       b6
Base         -3.700    1.269   -0.310    0.679   -0.021    0.223    4.010
topic        -7.758    5.670   -3.427    1.787   -0.030    1.952    5.880
topicshort   -6.364    2.739   -1.443    1.228   -0.020    1.280    3.837
abstract     -5.892    2.318   -1.364    0.860   -0.013    1.052    3.600
alltitles    -5.243    2.319   -1.361    1.415   -0.037    1.180    3.696
sec words    -6.392    2.125   -1.648    1.106   -0.075    1.174    3.632
para words   -8.632    1.258   -1.654    1.485   -0.084    1.143    4.004

Estimates using INEX ‘03 relevance assessments for
b1 = Average Absolute Query Frequency
b2 = Query Length
b3 = Average Absolute Component Frequency
b4 = Document Length
b5 = Average Inverse Component Frequency
b6 = Number of Terms in common between query and Component
15
SGML/XML Support
Underlying native format for all data is SGML or XML
The DTD defines the file format for each file
Full SGML/XML parsing
SGML/XML Format Configuration Files define the database
USMARC DTD and MARC to SGML conversion (and back again)
Access to full-text via special SGML/XML tags
16
Indexing
Any SGML/XML tagged field or attribute can be indexed:
  B-Tree and Hash access via Berkeley DB (Sleepycat)
  Stemming, keyword, exact keys and “special keys”
  Mapping from any Z39.50 Attribute combination to a specific index
  Underlying postings information includes term frequency for probabilistic searching
  Component extraction with separate component indexes
17
XML Element Extraction
A new search “ElementSetName” is XML_ELEMENT_
Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request
The matching elements are extracted from the records matching the search and delivered in a simple format.
18
XML Extraction
% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML }} {
<RESULT_DATA DOCID="1">
<ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
<Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245>
</ITEM>
</RESULT_DATA>
… etc…
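The effect of an element set name such as XML_ELEMENT_Fld245 can be imitated on the client side with Python's standard `xml.etree.ElementTree`. This is only a sketch of the idea of pulling named elements out of matching records: the function name and the sample record are hypothetical, and the server's own extraction (including XPath and regular-expression matching) is not reproduced here.

```python
import xml.etree.ElementTree as ET


def extract_elements(record_xml, element_name):
    """Return (tag, text) pairs for every element with the given name."""
    root = ET.fromstring(record_xml)
    return [
        (element.tag, "".join(element.itertext()).strip())
        for element in root.iter(element_name)
    ]


# Hypothetical, much-shortened MARC-style record.
record = (
    "<USMARC><VarFlds><VarDFlds><Titles>"
    '<Fld245 AddEnty="No" NFChars="0"><a>Information systems :</a></Fld245>'
    "</Titles></VarDFlds></VarFlds></USMARC>"
)
titles = extract_elements(record, "Fld245")
```

Returning only the requested elements keeps responses small when a client needs a single field, such as a title, from each hit.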
19
SGML/XML Support
Configuration files for the Server are SGML/XML:
  They include elements describing all of the data files and indexes for the database.
  They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
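The attribute-to-index mapping can be pictured as a lookup table keyed on the attribute combination. The sketch below uses a few USE/POSIT/struct combinations that appear in the index definitions later in this deck (docno, pauthor, title); the dictionary layout and the `resolve_index` function are hypothetical and are not how the server stores its configuration.

```python
# (use, position, structure) -> index tag; None means "attribute not given".
INDEX_MAP = {
    (12, None, 1): "docno",
    (12, None, 2): "docno",
    (12, None, 6): "docno",
    (1, 3, 6): "pauthor",
    (1004, 3, 6): "pauthor",
    (4, 3, 6): "title",
}


def resolve_index(use, position=None, structure=None):
    """Return the index tag for an attribute combination, or None if unmapped."""
    return INDEX_MAP.get((use, position, structure))


author_index = resolve_index(1004, position=3, structure=6)
```

Several attribute combinations can point at the same index, which is how one physical index serves more than one way of asking for it.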
20
SGML/XML Support
Example XML record for a DL document
<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0</BIB-VERSION>
<ID>756</ID>
<ENTRY>June 12, 1996</ENTRY>
<DATE>June 1996</DATE>
<TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE>
<ORGANIZATION>University of California</ORGANIZATION>
<TYPE>report</TYPE>
<AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
<AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
<PROJECT>SNEP</PROJECT>
<SERIES>Vol 3</SERIES>
<PAGES>40</PAGES>
<TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
<PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>
21
SGML Support Example SGML/MARC Record
<USMARC Material="BK" ID=" "><leader><LRL>00722</LRL><RecStat>n</RecStat> <RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount> <SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel> <DscCatFm></DscCatFm><LinkRec></LinkRec><EntryMap><FLength>4</Flength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader> <Directry> </Directry><VarFlds> <VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005> </Fld005> <Fld008> nyu eng u</Fld008></VarCFlds> <VarDFlds><NumbCode><Fld010 I1="Blank" I2="Blnk"><a> </a></Fld010> <Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm </a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode><MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty><Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a><b>theory and practice /</b><c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles><EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a><b>J. Wiley,</b><c>1983</c></Fld260></EdImprnt><PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a><b>ill. ;</b><c>24 cm</c></Fld300></PhysDesc><Series></Series><Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes><SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ... March 2, 2004 Ray R. Larson
22
SGML/XML Support TREC document… <DOC>
<DOCNO>FT </DOCNO> <PROFILE>_AN-DCPCCAA3FT</PROFILE> <DATE>930316 </DATE> <HEADLINE> FT 16 MAR 93 / Italy's Corruption Scandal: Magistrates hold key to unlocking Tangentopoli - They will set the investigation agenda </HEADLINE> <BYLINE> By ROBERT GRAHAM </BYLINE> <TEXT> OVER the weekend the Italian media felt obliged to comment on a non-event. No new arrests had taken place in any of the country's ever more numerous corruption scandals which centre on the illicit funding of political parties ... </TEXT> <XX> … March 2, 2004 Ray R. Larson
23
… Companies:- </XX> <CO>Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica. Ente Partecipazioni E Finanziamento Industria Manifatturiera. IRI Istituto per La Ricostruzione Industriale. </CO> <XX> Countries:- <CN>ITZ Italy, EC. </CN> Industries:- <IN>P9222 Legal Counsel and Prosecution. P91 Executive, Legislative and General Government. P13 Oil and Gas Extraction. P9631 Regulation, Administration of Utilities. P6719 Holding Companies, NEC. </IN> Types:- </XX> … March 2, 2004 Ray R. Larson
24
… <TP>CMMT Comment & Analysis. GOVT Legal issues. </TP> <PUB>The Financial Times </PUB> <PAGE> London Page 4 </PAGE> </DOC> March 2, 2004 Ray R. Larson
25
SGML/XML Support INEX Document <article>
<fno>C1050</fno> <doi> /C1050s-2000</doi> <fm> <hdr><hdr1><ti>COMPUTING IN SCIENCE & ENGINEERING</ti> <crt><issn> </issn>/00/$10.00 <cci><onm>© 2000 IEEE</onm></cci></crt></hdr1> <hdr2><obi><volno>Vol. 2</volno><issno>No. 1</issno></obi> <pdt><mo>JANUARY/FEBRUARY</mo><yr>2000</yr></pdt> <pp>pp </pp></hdr2> </hdr> <tig><atl>The Decompositional Approach to Matrix Computation</atl> <pn>pp </pn></tig> <au sequence="first"><fnm>G.W.</fnm><snm>Stewart</snm><aff><onm>University of Maryland</onm></aff></au> <fig><art file="c1050x1.gif" w="425" h="321" tw="150" th="113"/></fig> <abs><p>The introduction of matrix decomposition into numerical linear algebra revolutionized matrix computations. This article outlines the decompositional approach, comments on its history, and surveys the six most widely used decompositions.</p> </abs> </fm> <bdy> <sec><st></st> <ip1>In 1951, Paul S. Dwyer published <it>Linear Computations</it>, perhaps the first book devoted entirely to numerical linear algebra.<ref rid="bibc10501" type="bib">1</ref> Digital computing was in its infancy, and Dwyer focused on computation with mechanical calculators. Nonetheless, the book was state of the art. <ref rid="c10501" type="fig">Figure 1</ref> reproduces a page of the book dealing with Gaussian elimination. In 1954, Alston S. Householder published <it>Principles of Numerical Analysis</it>,<ref rid="bibc10502" type="bib">2</ref> one of the first modern treatments of high-speed digital computation. <ref rid="c10502" type="fig">Figure 2</ref> reproduces a page from this book, also dealing with Gaussian elimination.</ip1> <fig id="c10501"><art file="c10501.gif" w="600" h="970" tw="150" th="243"/><no>1</no><fgc>This page from <it>Linear Computations</it> shows that Paul Dwyer's approach begins with a system of scalar equations. 
Courtesy of John Wiley & Sons.</fgc></fig> <fig id="c10502"><art file="c10502.gif" w="500" h="807" tw="150" th="242"/><no>2</no><fgc>On this page from <it>Principles of Numerical Analysis</it>, Alston Householder uses partitioned matrices and LU decomposition. Courtesy of McGraw-Hill.</fgc></fig> <p>The contrast between these two excerpts is striking. The most obvious difference is that Dwyer used scalar equations whereas Householder used partitioned matrices. … March 2, 2004 Ray R. Larson
26
SGML/XML Support …<sec><st>CONCLUSION</st>
<ip1>The big six are not the only decompositions in use; in fact, there are many more. As mentioned earlier, certain intermediate forms—such as tridiagonal and Hessenberg forms—have come to be regarded as decompositions in their own right. Since the singular value decomposition is expensive to compute and not readily updated, rank-revealing alternatives have received considerable attention.<ref rid="bibc105054" type="bib">54</ref><super>,</super><ref rid="bibc105055" type="bib">55</ref> There are also generalizations of the singular value decomposition and the Schur decomposition for pairs of matrices. <ref rid="bibc105056" type="bib">56</ref><super>,</super><ref rid="bibc105057" type="bib">57</ref> All crystal balls become cloudy when they look to the future, but it seems safe to say that as long as new matrix problems arise, new decompositions will be devised to solve them.</ip1> </sec> </bdy> <bm> <ack><h>Acknowledgment</h> <ip1><it>This work was supported by the National Science Foundation under Grant No </it></ip1> </ack> <bib><bibl><h>References</h> <bb id="bibc10501"><au><fnm>P.S.</fnm><snm>Dwyer</snm></au><ti>Linear Computations,</ti> <obi>John Wiley & Sons,</obi><loc><cty>New York,</cty></loc><pdt><yr>1951.</yr></pdt></bb> <bb id="bibc10502"><au><fnm>A.S.</fnm><snm>Householder</snm></au><ti>Principles of Numerical Analysis,</ti> <obi>McGraw-Hill,</obi><loc><cty>New York,</cty></loc><pdt><yr>1953.</yr></pdt></bb> <bb id="bibc10503"><au><fnm>J.H.</fnm><snm>Wilkinson</snm></au><obi>and</obi> <au><fnm>C.</fnm><snm>Reinsch</snm></au><ti>Handbook for Automatic Computation, Vol. 
II, Linear Algebra,</ti> <obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt><yr>1971.</yr></pdt></bb> <bb id="bibc10504"><au><fnm>B.S.</fnm><snm>Garbow</snm></au> <obi>et al.,</obi><atl>"Matrix Eigensystem Routines—Eispack Guide Extension,"</atl> <ti>Lecture Notes in Computer Science,</ti><obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt> <yr>1977.</yr></pdt></bb> <bb id="bibc10505"><au><fnm>J.J.</fnm><snm>Dongarra</snm></au><obi>et al.,</obi> <ti>LINPACK User's Guide,</ti> <obi>SIAM,</obi><loc><cty>Philadelphia,</cty></loc><pdt><yr>1979.</yr></pdt></bb> … March 2, 2004 Ray R. Larson
27
SGML/XML Support INEX CAS Query
<?xml version="1.0" encoding="ISO "?> <!DOCTYPE inex_topic SYSTEM "topic.dtd"> <inex_topic topic_id="70" query_type="CAS" ct_no="49"> <title> /article[about(./fm/abs,'"information retrieval" "digital libraries"')]</title> <description>Retrieve articles with an abstract indicating the article is about information retrieval and/or digital libraries</description> <narrative>To be relevant the retrieved articles must be about information retrieval, digital libraries or, preferably both. Articles about information retrieval from digital libraries will receive the highest relevance judgements.</narrative> <keywords>information retrieval,digital libraries</keywords> </inex_topic> March 2, 2004 Ray R. Larson
28
SGML/XML Support
Configuration files for the Server are also SGML/XML:
  They include tags describing all of the data files and indexes for the database.
  They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
29
Cheshire Configuration Files
<!-- ******************************************************************* -->
<!-- ************************* TREC INTERACTIVE TEST DB **************** -->
<!-- This is the config file for the Cheshire II TREC interactive Database -->
<DBCONFIG>
<DBENV>/projects/is240/GroupX/indexes </DBENV>
<! >
<!-- TREC TEST DATABASE FILEDEF -->
<!-- The Interactive TREC Financial Times datafile -->
<FILEDEF TYPE=SGML>
<DEFAULTPATH>/projects/is240/GroupX </DEFAULTPATH>
<!-- filetag is the "shorthand" name of the file -->
<FILETAG> trec </FILETAG>
<!-- filename is the full path name of the main data directory -->
<FILENAME> /projects/is240/ft </FILENAME>
<CONTINCLUDE> /projects/is240/ft.CONT </CONTINCLUDE>
<!-- fileDTD is the full path name of the file's DTD -->
<FILEDTD> /projects/is240/TREC.FT.DTD </FILEDTD>
<!-- assocfil is the full path name of the file's Associator -->
<ASSOCFIL> ft.assoc </ASSOCFIL>
<!-- history is the full path name of the file's history file -->
<HISTORY> cheshire_index/TESTDATA.history </HISTORY>
…
30
Indexing
Any SGML/XML tagged field or attribute can be indexed:
  B-Tree and Hash access via Berkeley DB (Sleepycat)
  Stemming, keyword, exact keys and “special keys”
  Mapping from any Z39.50 Attribute combination to a specific index
  Underlying postings information includes term frequency for probabilistic searching.
  SGML may include address of full-text for indexing
  New indexes can be easily added, or old ones deleted
31
Bitmapped Indexes
Bitmap indexes can be used for Boolean operations where the data has only a few values and very large numbers of items with each value
Only one bit per record stored in the index
Processed on a demand basis so only blocks with the bits needed to resolve a query are fetched
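A toy version of the idea, one bit per record for each distinct value with Boolean operations done as bitwise operations, might look like the sketch below. It keeps whole bitmaps in memory as Python integers, so it does not model the on-demand block fetching mentioned above; the function names and the sample field values are hypothetical.

```python
def build_bitmaps(values):
    """values[i] is the field value of record i; returns value -> int bitmap."""
    bitmaps = {}
    for record_id, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << record_id)
    return bitmaps


def matching_records(bitmap):
    """List the record ids whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical low-cardinality fields for six records.
doc_types = ["report", "article", "report", "book", "article", "report"]
languages = ["eng", "eng", "fre", "eng", "eng", "eng"]

type_bitmaps = build_bitmaps(doc_types)
language_bitmaps = build_bitmaps(languages)

# "type report AND language eng" is a single bitwise AND.
english_reports = matching_records(type_bitmaps["report"] & language_bitmaps["eng"])
articles_or_books = matching_records(type_bitmaps["article"] | type_bitmaps["book"])
```

The approach pays off when a field has few distinct values, since each value needs only one bit per record regardless of how many records carry it.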
32
<!-- The following are the index definitions for the file -->
<INDEXES>
<!-- ******************************************************************* -->
<!-- ************************* DOC NO. ********************************* -->
<!-- ******************************************************************* -->
<!-- The following provides document number access -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE PRIMARYKEY=IGNORE>
<INDXNAME> cheshire_index/trec.docno.index </INDXNAME>
<INDXTAG> docno </INDXTAG>
<INDXMAP> <USE> 12 </USE><struct> 1 </struct> </INDXMAP>
<INDXMAP> <USE> 12 </USE><struct> 2 </struct> </INDXMAP>
<INDXMAP> <USE> 12 </USE><struct> 6 </struct> </INDXMAP>
<INDXKEY> <TAGSPEC> <FTAG>DOCNO </FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>
…
33
<!-- ******************************************************************* -->
<!-- ************************* TOPIC *********************************** -->
<!-- The following is the primary index for probabilistic searches -->
<!-- It includes headlines, datelines, bylines, and full text -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> cheshire_index/trec.topic.index </INDXNAME>
<INDXTAG> topic </INDXTAG>
<INDXMAP> <USE> 29 </USE><POSIT> 3 </POSIT> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 29 </USE><RELAT> 102 </RELAT><POSIT> 3 </POSIT> <struct> 6 </struct> </INDXMAP>
…
<STOPLIST> cheshire_index/topicstoplist </STOPLIST>
<INDXKEY> <TAGSPEC> <FTAG>HEADLINE </FTAG> <FTAG>DATELINE </FTAG> <FTAG>BYLINE </FTAG> <FTAG>TEXT </FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>
34
Cheshire II – EVI Generation
Entry Vocabulary Indexes can improve access to data with controlled index terms
Define basis for clustering records.
  Select field to form the basis of the cluster.
  Evidence Fields to use as contents of the pseudo-documents.
During indexing cluster keys are generated with basis and evidence from each record.
Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields.
Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
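The steps above can be sketched as a grouping operation: collect the evidence fields of every record under its basis value, then join them into one pseudo-document per basis. The record layout, field names and function name below are hypothetical, and the real system works from sorted cluster keys on disk rather than an in-memory dictionary.

```python
from collections import defaultdict


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Group evidence text by basis value; returns basis -> pseudo-document."""
    clusters = defaultdict(list)
    for record in records:
        basis = record.get(basis_field)
        if basis is None:
            continue  # records without a basis value form no cluster key
        for field in evidence_fields:
            clusters[basis].extend(record.get(field, []))
    # Sorting on the basis mirrors the "sorted and merged on basis" step.
    return {basis: " ".join(evidence)
            for basis, evidence in sorted(clusters.items())}


# Hypothetical records: a class number as basis, titles and subjects as evidence.
records = [
    {"class": "QA76", "titles": ["Information systems"],
     "subjects": ["Management information systems"]},
    {"class": "QA76", "titles": ["Database design"], "subjects": []},
    {"class": "Z699", "titles": ["Information retrieval"],
     "subjects": ["Digital libraries"]},
]
pseudo_docs = build_pseudo_documents(records, "class", ["titles", "subjects"])
```

Indexing the resulting pseudo-documents lets a free-text query be matched against the vocabulary associated with each controlled term.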
35
EVI/Cluster Definitions
<CLUSTER>
<clusname> classcluster </clusname>
<cluskey normal=CLASSCLUS> <tagspec> <FTAG>FLD950 </FTAG> <s> ^a </s> </tagspec> </cluskey>
<stoplist> /usr3/cheshire2/data2/clasclusstoplist </stoplist>
<clusmap>
<from> <tagspec> <ftag>FLD245</ftag><s>^[ab]</s> <ftag>FLD440</ftag><s>^a</s> <ftag>FLD490</ftag><s>^a</s> <ftag>FLD830</ftag><s>^a</s> <ftag>FLD740</ftag><s>^a</s> </tagspec></from>
<to> <tagspec> <ftag>titles</ftag> </tagspec></to>
<from> <tagspec> <ftag>FLD6..</ftag><s>^[abcdxyz]</s> </tagspec></from>
<to> <tagspec> <ftag>subjects</ftag> </tagspec></to>
<summarize> <maxnum> 5 </maxnum> <tagspec> <ftag>subjsum</ftag> </tagspec></summarize>
</clusmap>
</CLUSTER>
36
Component Extraction and Indexing
Any element (or range of SGML/XML data starting with one element and ending with another) can be defined as a ‘component’ and accessed and indexed as if it were an entire document.
Component indexes and document-level indexes can be combined in search operations (and special operators permit selection of document or components as the result)
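The single-element case can be illustrated with Python's standard `xml.etree.ElementTree`: each matching element becomes its own small "document" with an identifier and text that could be indexed separately. The function name, the output layout and the sample article are hypothetical, and components defined by a start tag and a different end tag are not covered by this sketch.

```python
import xml.etree.ElementTree as ET


def extract_components(article_xml, component_tag):
    """Treat every element named component_tag as a separate component."""
    root = ET.fromstring(article_xml)
    components = []
    for position, element in enumerate(root.iter(component_tag), start=1):
        # Join the text pieces and collapse runs of whitespace.
        text = " ".join(" ".join(element.itertext()).split())
        components.append(
            {"component_id": position, "tag": element.tag, "text": text}
        )
    return components


# Hypothetical article with two sections.
article = (
    "<article><bdy>"
    "<sec><st>Intro</st><p>First  section.</p></sec>"
    "<sec><st>CONCLUSION</st><p>Last section.</p></sec>"
    "</bdy></article>"
)
sections = extract_components(article, "sec")
```

Each component can then be scored on its own, which is what allows a search to return a section or paragraph instead of the whole article.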
37
Component Definitions
<COMPONENTS>
<COMPONENTDEF>
<COMPONENTNAME> TESTDATA/COMPONENT_DB1 </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG> <TAGSPEC> <FTAG>mainenty </FTAG> <FTAG>titles </FTAG> </TAGSPEC> </COMPSTARTTAG>
<COMPENDTAG> <TAGSPEC><FTAG>Fld300 </FTAG></TAGSPEC> </COMPENDTAG>
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> TESTDATA/comp1index1.author …
</INDEXDEF>
</COMPONENTINDEXES>
</COMPONENTDEF>
</COMPONENTS>
38
Result Formatting (Display)
<DISPOPTIONS> KEEP_ENTITIES </DISPOPTIONS>
<DISPLAY>
<FORMAT NAME="B" OID=" " DEFAULT>
<convert function="TAGSET-G">
<clusmap>
<from> <tagspec> <ftag>DOCNO</ftag> </tagspec></from>
<to> <tagspec> <ftag>28</ftag> </tagspec></to>
<from> <tagspec> <ftag>#DOCID#</ftag> </tagspec></from>
<to> <tagspec> <ftag>5</ftag> </tagspec></to>
</clusmap>
</convert>
</FORMAT>
</DISPLAY>
39
INEX Configuration Example
<!-- ******************************************************************* -->
<!-- ********************* Config for INEX evaluation ****************** -->
<!-- This is the config file for the Cheshire II TREC interactive Database -->
<!-- new version uses proximity indexes... -->
<DBCONFIG>
<DBENV>/projects/metadata/cheshire/TREC/cheshire_index </DBENV>
<! >
<!-- INEX TEST DATABASE FILEDEF -->
<FILEDEF TYPE=XML>
<DEFAULTPATH> /projects/metadata/cheshire/INEX </DEFAULTPATH>
<!-- filetag is the "shorthand" name of the file -->
<FILETAG> INEX </FILETAG>
<!-- filename is the full path name of the main data directory -->
<FILENAME> inex-1.3/xml </FILENAME>
<CONTINCLUDE> inex-1.3/xml_main.cont </CONTINCLUDE>
<!-- fileDTD is the full path name of the file's DTD -->
<FILEDTD> inex-1.3/dtd/wrapper.dtd </FILEDTD>
<SGMLCAT> inex-1.3/dtd/catalog </SGMLCAT>
<!-- assocfil is the full path name of the file's Associator -->
<ASSOCFIL> inex-1.3/xml_main.assoc </ASSOCFIL>
<!-- history is the full path name of the file's history file -->
<HISTORY> inex.history </HISTORY>
40
INEX Configuration Example
<!-- The following are the index definitions for the file -->
<INDEXES>
<!-- ******************************************************************* -->
<!-- ************************* DOC NO. ********************************* -->
<!-- The following provides document number access -->
<INDEXDEF ACCESS=BTREE EXTRACT=EXACTKEY NORMAL=DO_NOT_NORMALIZE PRIMARYKEY=IGNORE>
<INDXNAME> indexes/docno.index </INDXNAME>
<INDXTAG> docno </INDXTAG>
<INDXMAP> <USE> 12 </USE><struct> 1 </struct> </INDXMAP>
<INDXMAP> <USE> 12 </USE><struct> 2 </struct> </INDXMAP>
<INDXMAP> <USE> 12 </USE><struct> 6 </struct> </INDXMAP>
<INDXKEY> <TAGSPEC> <FTAG> doi </FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>
41
INEX Configuration Example
<!-- ******************************************************************* -->
<!-- ********************** PERSONAL AUTHOR/BYLINE ********************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/pauthor.index </INDXNAME>
<INDXTAG> pauthor </INDXTAG>
<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to -->
<!-- the appropriate Z39.50 BIB1 attribute numbers -->
<INDXMAP> <USE> 1 </USE><POSIT> 3 </POSIT> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 1004 </USE><POSIT> 3 </POSIT> <struct> 6 </struct> </INDXMAP>
<!-- The stoplist for this file -->
<STOPLIST> indexes/authorstoplist </STOPLIST>
<!-- The INDXKEY area contains the specifications of tags in the doc -->
<!-- that are to be extracted and indexed for this index -->
<INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><S>au</S><S>snm</S> <FTAG>fm</FTAG><S>au</S><S>fnm</S> </TAGSPEC> </INDXKEY>
</INDEXDEF>
42
INEX Configuration Example
<!-- ******************************************************************* -->
<!-- ************************* TITLE/HEADLINE ************************** -->
<!-- The following provides keyword title access -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM>
<INDXNAME> indexes/title.index </INDXNAME>
<INDXTAG> title </INDXTAG>
<INDXMAP> <USE> 4 </USE><POSIT> 3 </POSIT> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 5 </USE><POSIT> 3 </POSIT> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 6 </USE><POSIT> 3 </POSIT> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/titlestoplist </STOPLIST>
<INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><S>tig</S><S>atl</S> </TAGSPEC> </INDXKEY>
</INDEXDEF>
43
INEX Configuration Example
<!-- ******************************************************************* -->
<!-- ************************* TOPIC *********************************** -->
<!-- The following is the primary index for probabilistic searches -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM>
<INDXNAME> indexes/topic.index </INDXNAME>
<INDXTAG> topic </INDXTAG>
<INDXMAP> <USE> 29 </USE><POSIT> 3 </POSIT> <struct> 6 </struct> </INDXMAP>
…
<INDXMAP> <USE> 1017 </USE><RELAT> 102 </RELAT><POSIT> 3 </POSIT> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST>
<INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><S>tig</S><S>atl</S> <FTAG>abs</FTAG> <FTAG>bdy</FTAG> <FTAG>bibl</FTAG><S>bb</S><S>atl</S> <FTAG>app</FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>
44
INEX Configuration Example
<!-- ******************************************************************* -->
<!-- ************************** DATE *********************************** -->
<INDEXDEF ACCESS=BTREE EXTRACT=DATE NORMAL=YEAR>
<INDXNAME> indexes/date.index </INDXNAME>
<INDXTAG> date </INDXTAG>
<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to -->
<!-- the appropriate Z39.50 BIB1 attribute numbers -->
<INDXMAP> <USE> 30 </USE><POSIT> 3 </POSIT> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 30 </USE><POSIT> 3 </POSIT> <struct> 5 </struct> </INDXMAP>
<INDXKEY> <TAGSPEC> <FTAG>hdr2</FTAG><s>yr</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>
45
INEX Configuration Example
<!-- ******************************************************************* -->
<!-- ************************** JOURNAL ******************************* -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/journal.index </INDXNAME>
<INDXTAG> journal </INDXTAG>
<INDXMAP> <USE> 1022 </USE><POSIT> 3 </POSIT> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 1022 </USE><POSIT> 3 </POSIT> <struct> 5 </struct> </INDXMAP>
<INDXKEY> <TAGSPEC> <FTAG>hdr1</FTAG><s>ti</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>
46
INEX Configuration Example
<!-- ******************************************************************* --> <!-- ************************* KEYWORDS********************************* --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM> <INDXNAME> indexes/keywords.index </INDXNAME> <INDXTAG> kwd </INDXTAG> <INDXMAP> <USE> 3121 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <STOPLIST> indexes/topicstoplist </STOPLIST> <INDXKEY> <TAGSPEC> <FTAG>kwd</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
47
INEX Configuration Example
<!-- ******************************************************************* --> <!-- ************************* ABSTRACT********************************* --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM> <INDXNAME> indexes/abstract.index </INDXNAME> <INDXTAG> abstract </INDXTAG> <INDXMAP> <USE> 62 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <STOPLIST> indexes/topicstoplist </STOPLIST> <INDXKEY> <TAGSPEC> <FTAG>abs</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
48
INEX Configuration Example
<!-- The following index has contents of the SEQUENCE attribute of the --> <!-- au (author) tag: either "first" or "additional" --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE> <INDXNAME> indexes/author_seq.index </INDXNAME> <INDXTAG> author_seq </INDXTAG> <INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><S>au</S><ATTR>sequence</ATTR> </TAGSPEC> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
49
INEX Configuration Example
<!-- ******************************************************************* --> <!-- ************************* Bib author Forename ******************** --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE> <INDXNAME> indexes/bib_author_fnm.index </INDXNAME> <INDXTAG> bib_author_fnm </INDXTAG> <INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <INDXKEY> <TAGSPEC> <FTAG>bb</FTAG><s>au</s><s>fnm</s> </TAGSPEC> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
50
INEX Configuration Example
<!-- ******************************************************************* --> <!-- ************************* Bib author surname ******************** --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE> <INDXNAME> indexes/bib_author_snm.index </INDXNAME> <INDXTAG> bib_author_snm </INDXTAG> <INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <INDXKEY> <TAGSPEC> <FTAG>bb</FTAG><s>au</s><s>snm</s> </TAGSPEC> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
51
INEX Configuration Example
<!-- ******************************************************************* --> <!-- ************************* FIGURES ********************************* --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM> <INDXNAME> indexes/fig.index </INDXNAME> <INDXTAG> fig </INDXTAG> <INDXMAP> <USE> 3150 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <STOPLIST> indexes/topicstoplist </STOPLIST> <INDXKEY> <TAGSPEC> <FTAG>fig</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
52
INEX Configuration Example
<!-- ******************************************************************* --> <!-- ************************* acknowledgements ************************ --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM> <INDXNAME> indexes/ack.index </INDXNAME> <INDXTAG> ack </INDXTAG> <INDXMAP> <USE> 3188 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <STOPLIST> indexes/topicstoplist </STOPLIST> <INDXKEY> <TAGSPEC> <FTAG>ack</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
53
INEX Configuration Example
<!-- ******************************************************************* --> <!-- ************************* alltitles ******************************* --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM> <INDXNAME> indexes/alltitles.index </INDXNAME> <INDXTAG> alltitles </INDXTAG> <INDXMAP> <USE> 3188 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <STOPLIST> indexes/titlestoplist </STOPLIST> <INDXKEY> <TAGSPEC> <FTAG>atl</FTAG> <FTAG>st</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
54
INEX Configuration Example
<!-- ******************************************************************* --> <!-- ************************* Affiliation ***************************** --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE> <INDXNAME> indexes/affil.index </INDXNAME> <INDXTAG> affil </INDXTAG> <INDXMAP> <USE> 3189 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <STOPLIST> indexes/titlestoplist </STOPLIST> <INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><s>aff</s> </TAGSPEC> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
55
INEX Configuration Example
<!-- ******************************************************************* --> <!-- ************************* FNO ********************************* --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=none> <INDXNAME> indexes/fno.index </INDXNAME> <INDXTAG> fno </INDXTAG> <INDXMAP> <USE> 3192 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <INDXKEY> <TAGSPEC> <FTAG>fno</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
56
INEX Configuration Example
<!-- ******************************************************************* --> <!-- ************************* FIGNO ********************************* --> <INDEXDEF ACCESS=BTREE EXTRACT=INTEGER NORMAL=NONE> <INDXNAME> indexes/figno.index </INDXNAME> <INDXTAG> figno </INDXTAG> <INDXMAP> <USE> 3193 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <INDXKEY> <TAGSPEC> <FTAG>fig</FTAG><s>no</s> </TAGSPEC> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
57
INEX Configuration Example
<!-- ******************************************************************* --> <!-- ************************* topicshort ******************************** --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM> <INDXNAME> indexes/topicshort.index </INDXNAME> <INDXTAG> topicshort </INDXTAG> <INDXMAP> <USE> 3192 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><S>tig</S><S>atl</S> <FTAG>abs</FTAG> <FTAG>kwd</FTAG> <FTAG>st</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> </INDEXES> March 2, 2004 Ray R. Larson
58
INEX Configuration Example
<COMPONENTS> <COMPONENTDEF> <COMPONENTNAME> indexes/COMPONENT_SECTION </COMPONENTNAME> <COMPONENTNORM>NONE</COMPONENTNORM> <COMPSTARTTAG> <TAGSPEC> <FTAG>sec</FTAG> </TAGSPEC> </COMPSTARTTAG> <COMPONENTINDEXES> <!-- First index def --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE> <INDXNAME> indexes/sec_title2.index </INDXNAME> <INDXTAG> sec_title </INDXTAG> <!-- the appropriate Z39.50 BIB1 attribute numbers > <INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <!-- The stoplist for this file --> <STOPLIST> indexes/titlestoplist </STOPLIST> <!-- The INDXKEY area contains the specifications of tags in the doc --> <!-- that are to be extracted and indexed for this index --> <INDXKEY> <FTAG>sec</FTAG><s>st</s> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
59
INEX Configuration Example
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM> <INDXNAME> indexes/sec_words.index </INDXNAME> <INDXTAG> sec_words </INDXTAG> <!-- the appropriate Z39.50 BIB1 attribute numbers > <INDXMAP> <USE> 39 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <!-- The stoplist for this file --> <STOPLIST> indexes/topicstoplist </STOPLIST> <!-- The INDXKEY area contains the specifications of tags in the doc --> <!-- that are to be extracted and indexed for this index --> <INDXKEY> <TAGSPEC> <FTAG>sec</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> </COMPONENTINDEXES> </COMPONENTDEF> March 2, 2004 Ray R. Larson
60
INEX Configuration Example
<COMPONENTDEF> <COMPONENTNAME> indexes/COMPONENT_BIB </COMPONENTNAME> <COMPONENTNORM>NONE</COMPONENTNORM> <COMPSTARTTAG> <TAGSPEC> <FTAG>bm</FTAG><S>bib</S><s>bibl</s><s>bb</s> </TAGSPEC> </COMPSTARTTAG> <!-- /* no end tag */ --> <COMPONENTINDEXES> <!-- First index def --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE> <INDXNAME> indexes/bib_author.index </INDXNAME> <INDXTAG> bib_author </INDXTAG> <!-- The following INDXMAP items provide a mapping from the AUTHOR tag to --> <!-- the appropriate Z39.50 BIB1 attribute numbers > <INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <!-- The INDXKEY area contains the specifications of tags in the doc --> <!-- that are to be extracted and indexed for this index --> <INDXKEY> <FTAG>au</FTAG> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
61
INEX Configuration Example
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE> <INDXNAME> indexes/bib_title.index </INDXNAME> <INDXTAG> bib_title </INDXTAG> <!-- The following INDXMAP items provide a mapping from the AUTHOR tag to --> <!-- the appropriate Z39.50 BIB1 attribute numbers > <INDXMAP> <USE> 33 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <!-- The INDXKEY area contains the specifications of tags in the doc --> <!-- that are to be extracted and indexed for this index --> <INDXKEY> <TAGSPEC> <FTAG>atl</FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF> March 2, 2004 Ray R. Larson
62
INEX Configuration Example
<INDEXDEF ACCESS=BTREE EXTRACT=DATE NORMAL=YEAR> <INDXNAME> indexes/bib_date.index </INDXNAME> <INDXTAG> bib_date </INDXTAG> <!-- The following INDXMAP items provide a mapping from the AUTHOR tag to --> <!-- the appropriate Z39.50 BIB1 attribute numbers > <INDXMAP> <USE> 31 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <!-- The INDXKEY area contains the specifications of tags in the doc --> <!-- that are to be extracted and indexed for this index --> <INDXKEY> <TAGSPEC> <FTAG>pdt</FTAG><s>yr</s> </TAGSPEC> </INDXKEY> </INDEXDEF> </COMPONENTINDEXES> </COMPONENTDEF> March 2, 2004 Ray R. Larson
63
INEX Configuration Example
<COMPONENTDEF> <COMPONENTNAME> indexes/COMPONENT_PARAS </COMPONENTNAME> <COMPONENTNORM>NONE</COMPONENTNORM> <COMPSTARTTAG> <TAGSPEC> <FTAG>^ilrj$|^ip1$|^ip2$|^ip3$|^ip4$|^ip5$|^item-none$|^p$|^p1$|^p2$|^p3$|^tmath$|^tf$</FTAG> </TAGSPEC> </COMPSTARTTAG> <COMPONENTINDEXES> <!-- First index def --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM> <INDXNAME> indexes/para_words.index </INDXNAME> <INDXTAG> para_words </INDXTAG> <!-- the appropriate Z39.50 BIB1 attribute numbers > <INDXMAP> <USE> 39 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <!-- The stoplist for this file --> <STOPLIST> indexes/topicstoplist </STOPLIST> <!-- The INDXKEY area contains the specifications of tags in the doc --> <!-- that are to be extracted and indexed for this index --> <INDXKEY> <FTAG>.*</FTAG> </INDXKEY> </INDEXDEF> </COMPONENTINDEXES> </COMPONENTDEF> March 2, 2004 Ray R. Larson
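The start-tag specification above is an alternation of anchored element names. As a rough illustration of which element names such a pattern selects, the Python sketch below applies the same expression with the standard `re` module. Whether Cheshire II's own matcher behaves identically (for example with respect to case folding) is an assumption, and the helper name `starts_component` is hypothetical.

```python
import re

# Pattern copied from the COMPSTARTTAG of the paragraph component.
PARA_TAGS = re.compile(
    r"^ilrj$|^ip1$|^ip2$|^ip3$|^ip4$|^ip5$|^item-none$"
    r"|^p$|^p1$|^p2$|^p3$|^tmath$|^tf$"
)


def starts_component(tag_name: str) -> bool:
    """Return True if an element name would open a paragraph component.

    Names are lower-cased first; that case folding is an assumption.
    """
    return PARA_TAGS.search(tag_name.lower()) is not None


if __name__ == "__main__":
    for name in ["p", "ip1", "sec", "p4", "tmath"]:
        print(name, starts_component(name))
```

Because every alternative is anchored at both ends, a name such as `p4` or `sec` is rejected rather than matched as a substring.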
64
INEX Configuration Example
<COMPONENTDEF> <COMPONENTNAME> indexes/COMPONENT_FIG </COMPONENTNAME> <COMPONENTNORM>NONE</COMPONENTNORM> <COMPSTARTTAG> <TAGSPEC> <FTAG>fig</FTAG> </TAGSPEC> </COMPSTARTTAG> <COMPONENTINDEXES> <!-- First index def --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE> <INDXNAME> indexes/fig_caption.index </INDXNAME> <INDXTAG> fig_caption </INDXTAG> <!-- the appropriate Z39.50 BIB1 attribute numbers > <INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <!-- The stoplist for this file --> <STOPLIST> indexes/titlestoplist </STOPLIST> <!-- The INDXKEY area contains the specifications of tags in the doc --> <!-- that are to be extracted and indexed for this index --> <INDXKEY> <FTAG>fgc</FTAG> </INDXKEY> </INDEXDEF> </COMPONENTINDEXES> </COMPONENTDEF> March 2, 2004 Ray R. Larson
65
INEX Configuration Example
<COMPONENTDEF> <COMPONENTNAME> indexes/COMPONENT_VITAE </COMPONENTNAME> <COMPONENTNORM>NONE</COMPONENTNORM> <COMPSTARTTAG> <TAGSPEC> <FTAG>vt</FTAG> </TAGSPEC> </COMPSTARTTAG> <COMPONENTINDEXES> <!-- First index def --> <INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE> <INDXNAME> indexes/vitae_words.index </INDXNAME> <INDXTAG> vt_vitae </INDXTAG> <!-- the appropriate Z39.50 BIB1 attribute numbers > <INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP> <!-- The stoplist for this file --> <STOPLIST> indexes/titlestoplist </STOPLIST> <!-- The INDXKEY area contains the specifications of tags in the doc --> <!-- that are to be extracted and indexed for this index --> <INDXKEY> </INDXKEY> </INDEXDEF> </COMPONENTINDEXES> </COMPONENTDEF> </COMPONENTS> March 2, 2004 Ray R. Larson
66
INEX Configuration Example
<DISPOPTIONS> KEEP_ENTITIES </DISPOPTIONS> <DISPLAY> <DISPLAYDEF NAME="B" OID=" " DEFAULT> <convert function="MIXED"> <clusmap> <from> <tagspec> <ftag>doi</ftag> </tagspec></from> <to> <ftag>28</ftag> </tagspec></to> <ftag>#DOCID#</ftag> <ftag>5</ftag> <ftag>#DBNAME#</ftag> </tagspec></from>… March 2, 2004 Ray R. Larson
67
INEX Configuration Example
<DISPLAYDEF name="XML_ELEMENT_" OID=" "> <convert function="XML_ELEMENT"> <clusmap> <from> <tagspec> <ftag>#FILENAME#</ftag> </tagspec></from> <to> <ftag>FILENAME</ftag> </tagspec></to> <ftag>#RANK#</ftag> <ftag>RANK </ftag> … …<from> <tagspec> <ftag>#RAWSCORE#</ftag> </tagspec></from> <to> <ftag>RAWSCORE </ftag> </tagspec></to> <from> <ftag> SUBST_ELEMENT </ftag> </tagspec> </to> </clusmap> </convert> </DISPLAYDEF> </DISPLAY> </FILEDEF> </DBCONFIG> March 2, 2004 Ray R. Larson
71
XML Schemas and Element Retrieval
72
XML Schema Support
XML Schemas or DTDs can be used to define the data contents. Tested with a wide variety of schemas, including METS (with various supporting schemas).
73
XML Element Extraction
A new search “ElementSetName” is XML_ELEMENT_. Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request (note that only a subset of full XPath is available). The matching elements are extracted from the records matching the search and delivered in a simple format.
74
XML Extraction
% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML }}
{
<RESULT_DATA DOCID="1">
<ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
<Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245>
</ITEM>
<RESULT_DATA>
… etc…
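To make the idea of element extraction concrete, here is a minimal Python sketch that pulls every element with a given name out of a record and reports a positional path for it, loosely mirroring the ITEM/XPATH output in the session above. It uses only the standard library's `xml.etree.ElementTree`. The function name `extract_elements` and the sample record are illustrative assumptions, not Cheshire code, and the sketch handles plain element names only, not XPath or regular expressions.

```python
import xml.etree.ElementTree as ET


def extract_elements(record_xml: str, name: str):
    """Yield (positional_path, element) for each element called `name`."""
    root = ET.fromstring(record_xml)

    def walk(elem, path):
        if elem.tag == name:
            yield path, elem
        counts = {}  # per-parent sibling counters, used for the [n] suffix
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    yield from walk(root, f"/{root.tag}[1]")


# Hypothetical record shaped like the USMARC example above.
SAMPLE = (
    "<USMARC><VarFlds><VarDFlds><Titles>"
    "<Fld245><a>Example title</a></Fld245>"
    "</Titles></VarDFlds></VarFlds></USMARC>"
)

if __name__ == "__main__":
    for path, elem in extract_elements(SAMPLE, "Fld245"):
        print(path, ET.tostring(elem, encoding="unicode"))
```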
75
Database Storage
All data is stored as SGML/XML flat text files, plus optional linked full-text (non-XML) files. The file format is defined through an SGML/XML DTD (also a flat text file) or Schema. “Associator” files provide indexed direct access to each record in the SGML/XML files; they contain the offset and record length for each “record”. Associators can be built to index any conformant document in a directory sub-tree.
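The associator idea (one offset and one length per record, so a record can be read with a seek rather than a scan) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are taken to be non-nested `<REC>...</REC>` spans, the tag name is hypothetical, and the real associator file format is not shown in the slides.

```python
import io
import re


def build_associator(data: bytes, tag: bytes = b"REC"):
    """Return a list of (offset, length) pairs, one per record.

    Assumes each record is a non-nested <REC>...</REC> span (hypothetical tag).
    """
    pattern = re.compile(b"<" + tag + rb"\b.*?</" + tag + b">", re.DOTALL)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


def fetch_record(stream, associator, recno: int) -> bytes:
    """Read record number `recno` (1-based) by seeking to its stored offset."""
    offset, length = associator[recno - 1]
    stream.seek(offset)
    return stream.read(length)


if __name__ == "__main__":
    data = b"<REC><t>one</t></REC>\n<REC><t>two</t></REC>\n"
    assoc = build_associator(data)
    print(assoc)
    print(fetch_record(io.BytesIO(data), assoc, 2))
```

Keeping the offsets outside the data file means the SGML/XML text itself stays an ordinary flat file, which matches the storage model described above.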
76
INEX CO Runs
Three official runs and one later run, all Title-only:
Fusion: combines Okapi and LR using the MERGE_CMBZ operator
NewParms (LR): uses only LR with the new parameters
Feedback: an attempt at blind relevance feedback
PostFusion: fusion of the new LR coefficients and Okapi
77
Query Generation - CO # 162 TITLE = Text and Index Compression Algorithms QUERY: {Text and Index Compression Algorithms}) !MERGE_CMBZ {Text and Index Compression Algorithms}) !MERGE_CMBZ {Text and Index Compression Algorithms}) !MERGE_CMBZ {Text and Index Compression Algorithms}) @+ is is LR !MERGE_CMBZ is a normalized score summation and enhancement March 2, 2004 Ray R. Larson
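The slide describes !MERGE_CMBZ only as a normalized score summation and enhancement. A well-known fusion rule of that general shape is CombMNZ, which sums normalized scores and multiplies by the number of result lists that returned the item. The Python sketch below implements CombMNZ with min-max normalization as a stand-in; whether MERGE_CMBZ uses exactly this normalization or this boost is an assumption, and the function names are hypothetical.

```python
def min_max_normalize(scores):
    """Scale a {doc_id: score} mapping into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_mnz(*result_lists):
    """CombMNZ-style fusion: summed normalized scores times the hit count."""
    normalized = [min_max_normalize(r) for r in result_lists]
    fused = {}
    for doc in set().union(*normalized):
        hits = [n[doc] for n in normalized if doc in n]
        fused[doc] = sum(hits) * len(hits)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    lr_scores = {"a": 1.0, "b": 0.5, "c": 0.0}       # made-up LR probabilities
    okapi_scores = {"a": 12.0, "c": 4.0, "d": 8.0}   # made-up Okapi scores
    for doc, score in comb_mnz(lr_scores, okapi_scores):
        print(doc, score)
```

Normalizing first matters because LR yields probabilities while Okapi scores are unbounded, so a raw sum would be dominated by the Okapi list.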
78
INEX CO Runs Strict Generalized Avg Prec FUSION = 0.0642
NEWPARMS = FDBK = POSTFUS = Avg Prec FUSION = NEWPARMS = FDBK = POSTFUS = March 2, 2004 Ray R. Larson
79
INEX VCAS Runs
Two official runs:
FUSVCAS: element fusion using LR and various operators for path restriction
NEWVCAS: uses the new LR coefficients for each appropriate index and various operators for path restriction
80
Query Generation - VCAS
#66 TITLE = //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)] Submitted query = {intelligent transport systems})) !RESTRICT_FROM {on-board route planning navigation system for automobiles})) Target elements: sec|ss1|ss2|ss3 March 2, 2004 Ray R. Larson
81
VCAS Results Generalized Strict Avg Prec FUSVCAS = 0.0321
NEWVCAS = Avg Prec FUSVCAS = NEWVCAS = March 2, 2004 Ray R. Larson
82
Heterogeneous Track
The approach uses Cheshire’s Virtual Database options and is primarily a version of distributed IR. Each collection is indexed separately and searched via Z39.50 distributed queries. Z39.50 attribute mapping is used to map query indexes to the appropriate elements in a given collection. Only LR is used, and collection results are merged using the probability of relevance for each collection result.
83
Heterogeneous Track Issues
Very large “documents”: our approach was to segment them. Reporting XPath after segmenting large documents.
84
Database Storage
[Diagram labels: Config File, Remote RDBMS, Index File(s), Page Data File, Postings File, History File, SGML/XML File, Associator File(s), DTD File, Prox data File, Cluster File]
85
Client/Server Architecture
Server supports: database storage; indexing; Z39.50 access to local data; Boolean and probabilistic searching; relevance feedback; external SQL database support.
Client supports: programmability (Tcl/Tk, Python soon); graphical user interface; Z39.50 access to remote servers; SGML and MARC formatting.
Combined client/server CGI scripting via WebCheshire.
86
Z39.50 Overview
[Diagram labels: UI, Map Query, Internet, Search Engine, Map Results]
87
Two Protocols: HTTP & Z39.50
88
Server Z39.50 Support
Locally developed Z39.50 library. Extended version 3 support: supports version 3 attributes in BIB-1, including “stem”, “relevance”, etc. Also adding support for “type 102” ranked queries (version 4). Can provide MARC, SUTRS and SGML records, with support for Explain and GRS-1 conversion of any SGML records.
89
Distributed Search
90
The Problem The Digital Library vision -- Access to everyone for “all human knowledge” Lyman and Varian’s estimates of the “Dark Web” Hundreds or Thousands of servers with databases ranging widely in content, topic, format Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results How to select the “best” ones to search? Which resource to search first? Which to search next if more is wanted? Topical /domain constraints on the search selections Variable contents of database (metadata only, full text, multimedia…) March 2, 2004 Ray R. Larson
91
Distributed Search Tasks
Resource Description: how to collect metadata about digital libraries and their collections or databases
Resource Selection: how to select relevant digital library collections or databases from a large number of databases
Distributed Search: how to perform parallel or sequential searching over the selected digital library databases
Data Fusion: how to merge query results from different digital libraries with their different search engines, differing record structures, etc.
92
An Approach for Distributed Resource Discovery
Distributed resource representation and discovery. New approach to building resource descriptions based on Z39.50: instead of using broadcast search across resources, we are using two Z39.50 services: identification of database metadata using Z39.50 Explain, and extraction of distributed indexes using Z39.50 SCAN.
Evaluation: How efficiently can we build distributed indexes? How effectively can we choose databases using the index? How effective is merging search results from multiple sources? Can we build hierarchies of servers (general/meta-topical/individual)?
93
Z39.50 Explain
Explain supports searches for:
Server-level metadata: server name, IP addresses, ports
Database-level metadata: database name, search attributes (indexes and combinations)
Support metadata (record syntaxes, etc.)
94
Z39.50 SCAN
Originally intended to support browsing.
Query for: database; attributes plus term (i.e., index and start point); step size; number of terms to retrieve; position in response set
Results: number of terms returned; list of terms and their frequency in the database (for the given attribute combination)
95
Z39.50 SCAN Results
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos
% zscan title cat
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 27} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8} {catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3} {catch-up 2} …
% zscan topic cat
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} …
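A client that harvests SCAN output needs to turn a response like the ones above into term and frequency pairs. The Python sketch below parses that brace-delimited form with the standard `re` module. It assumes single-token terms and integer status values, as in the examples shown; the function name `parse_scan` is hypothetical and this is not the Cheshire client's own parser.

```python
import re


def parse_scan(response: str):
    """Split a zscan response into a status dict and a list of (term, freq)."""
    header = re.match(r"\{SCAN ((?:\{[^{}]*\}\s*)*)\}", response)
    if header is None:
        raise ValueError("not a SCAN response")
    status = {}
    for key, value in re.findall(r"\{(\w+) ([^{}]*)\}", header.group(1)):
        status[key] = int(value)
    rest = response[header.end():]
    # Assumes each entry is {term frequency} with no spaces inside the term.
    terms = [(t, int(f)) for t, f in re.findall(r"\{(\S+) (\d+)\}", rest)]
    return status, terms


if __name__ == "__main__":
    sample = (
        "{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} "
        "{cat 27} {cat-fight 1} {catalan 19}"
    )
    status, terms = parse_scan(sample)
    print(status)
    print(terms)
```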
96
Resource Index Creation
For all servers, or a topical subset:
Get Explain information
For each index, use SCAN to extract terms and frequency
Add term + freq + source index + database metadata to the XML “Collection Document” for the resource
Planned extensions: post-process indexes (especially Geo Names, etc.) for special types of data, e.g. create “geographical coverage” indexes
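The harvesting loop described here can be sketched as follows. The two callables `explain` and `scan` are hypothetical stand-ins for real Z39.50 Explain and SCAN requests, and the element and attribute names of the resulting document are invented for illustration; the actual “Collection Document” schema is not given in the slides.

```python
import xml.etree.ElementTree as ET


def build_collection_document(db_name, explain, scan):
    """Build an XML collection document from Explain and SCAN output.

    `explain(db_name)` must return a list of index names and
    `scan(db_name, index)` an iterable of (term, frequency) pairs.
    Both are hypothetical stand-ins for real Z39.50 requests.
    """
    doc = ET.Element("collection", {"database": db_name})
    for index in explain(db_name):
        index_elem = ET.SubElement(doc, "index", {"name": index})
        for term, freq in scan(db_name, index):
            term_elem = ET.SubElement(index_elem, "term", {"freq": str(freq)})
            term_elem.text = term
    return doc


if __name__ == "__main__":
    fake_indexes = {"title": [("cat", 27), ("catalan", 19)]}
    doc = build_collection_document(
        "demo",
        explain=lambda db: list(fake_indexes),
        scan=lambda db, index: fake_indexes[index],
    )
    print(ET.tostring(doc, encoding="unicode"))
```

Passing the two requests in as functions keeps the loop itself independent of the network layer, so the same code can be exercised against canned data as in the usage example.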
97
MetaSearch Approach
[Diagram labels: MetaSearch Server, Distributed Index, Map Query, Map Explain and Scan Queries, Map Results, Internet, Search Engine, DB 1 to DB 6]
98
Known Issues and Problems
Not all Z39.50 servers support SCAN or Explain.
Solutions that appear to work well: probing for attributes instead of Explain (e.g. DC attributes or analogs); we also support OAI and can extract OAI metadata for servers that support OAI; query-based sampling (Callan).
Collection Documents are static and need to be replaced when the associated collection changes.
99
Evaluation
Test environment: TREC Tipster data (approx. 3 GB), partitioned into 236 smaller collections based on source and date by month (no DOE). High size variability (from 1 to thousands of records). Same database as used in other distributed search studies by J. French and J. Callan, among others. Used TREC topics for evaluation (these are the only topics with relevance judgements for all 3 TIPSTER disks).
100
Harvesting Efficiency
Tested using the databases on the previous slide + the full FT database (210,158 records ~ 600 Mb) Average of seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network Average of seconds Also tested larger databases (E.g. TREC FT database ~600 Mb with 7 indexes was harvested in 131 seconds. March 2, 2004 Ray R. Larson
101
Our Collection Ranking Approach
We attempt to estimate the probability of relevance for a given collection with respect to a query using the Logistic Regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen), with a new algorithm for weight calculation at retrieval time. Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (i.e., fusion of multiple query results).
102
Probabilistic Retrieval: Logistic Regression
Probability of relevance for a given index is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC). At retrieval the probability estimate is obtained by:
103
Probabilistic Retrieval: Logistic Regression attributes
Average Absolute Query Frequency; Query Length; Average Absolute Collection Frequency; Collection size estimate; Average Inverse Collection Frequency; Inverse Document Frequency (N = number of collections, M = number of terms in common between query and document)
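For readers unfamiliar with the method, the general logistic regression form is a linear combination of attribute values passed through the logistic function. The Python sketch below shows only that generic form. The coefficient values in the usage example are made up for illustration, and the exact attribute formulas and fitted coefficients used in Cheshire are not reproduced here.

```python
import math


def probability_of_relevance(features, coefficients, intercept):
    """Generic logistic regression estimate.

    log_odds = c0 + sum(ci * xi); P = 1 / (1 + exp(-log_odds)).
    """
    if len(features) != len(coefficients):
        raise ValueError("need one coefficient per feature")
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, features))
    # Two branches avoid overflow in exp() for large-magnitude log odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Six attribute values and coefficients, all hypothetical.
    x = [0.4, 1.2, 0.7, 3.0, 0.9, 1.5]
    c = [1.1, -0.3, 0.8, -0.05, 0.6, 0.2]
    print(probability_of_relevance(x, c, intercept=-3.5))
```

Because the logistic function is monotonic, ranking by the log odds and ranking by the probability give the same order; the probability form is what makes scores from different indexes comparable for fusion.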
104
Evaluation Effectiveness
Tested using the collection representatives described above (as harvested over the network) and the TIPSTER relevance judgements. Testing compared our approach to known algorithms for ranking collections. Results were measured against reported results for the Ideal and CORI algorithms and against the optimal “Relevance Based Ranking” (MAX). The measure is a recall analog: how many of the relevant documents occurred in the top n databases, averaged.
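One way to read the recall analog is as the share of a query's relevant documents that live in the top n ranked collections, averaged over queries. The Python sketch below computes exactly that reading. Dividing by the total number of relevant documents is an assumption (published variants normalize against an optimal ranking instead), and the function names are hypothetical.

```python
def recall_at_n(ranked_collections, relevant_counts, n):
    """Fraction of all relevant documents held by the top-n ranked collections.

    `relevant_counts` maps collection name to its number of relevant documents.
    """
    total = sum(relevant_counts.values())
    if total == 0:
        return 0.0
    found = sum(relevant_counts.get(c, 0) for c in ranked_collections[:n])
    return found / total


def mean_recall_at_n(runs, n):
    """Average recall_at_n over (ranking, relevant_counts) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(recall_at_n(r, rel, n) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    ranking = ["c1", "c2", "c3"]        # made-up collection ranking
    relevant = {"c1": 3, "c3": 1}       # made-up relevance counts
    for n in (1, 2, 3):
        print(n, recall_at_n(ranking, relevant, n))
```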
105
Titles only (short query)
106
Future
Logically clustering servers by topic. Meta-Meta Servers (treating the MetaSearch database as just another database).
107
Distributed Metadata Servers
[Diagram labels: Database Servers, General Servers, Meta-Topical Servers, Replicated servers]
108
Geographic Operators and Search Ranking
109
The GEO Operations
Operators established for the GEO Z39.50 profile, implemented using special operations on indexes. Indexing allows extraction of geographic coordinates and dates from SGML/XML data in a variety of formats, with a normalized internal representation in the indexes. Search can use geographic and time elements as primary or limiting search elements.
110
The GEO Operations
X-based interfaces permit (simple) map drawing and search. Interface to MapServer for web-based map searching.
111
GEO Geographic operators
>=<   Overlap: search region and data overlap
>#<   Fully Enclosed: data fully enclosed in search region
<#>   Encloses: data fully encloses search region
<>#   Fully Outside: data outside of search region
++    Near: data is near search region
:<:   Before: data date is before search date
:<=:  Before or During: data date is before or during search date
:>=:  During or After: data date is during or after search date
:>:   After: data date is after search date
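The four region operators can be illustrated with plain bounding-box predicates. The Python sketch below assumes axis-aligned boxes in a flat coordinate space with no wrap-around at the date line, and it treats touching edges as overlap; those choices, the `Box` type, and the function names are assumptions for illustration rather than the Cheshire implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box; assumes west <= east and south <= north."""
    west: float
    south: float
    east: float
    north: float


def overlaps(search: Box, data: Box) -> bool:
    """Analog of >=<: the two boxes share any area or edge."""
    return (data.west <= search.east and data.east >= search.west
            and data.south <= search.north and data.north >= search.south)


def fully_enclosed(search: Box, data: Box) -> bool:
    """Analog of >#<: data lies entirely inside the search region."""
    return (search.west <= data.west and data.east <= search.east
            and search.south <= data.south and data.north <= search.north)


def encloses(search: Box, data: Box) -> bool:
    """Analog of <#>: data entirely contains the search region."""
    return fully_enclosed(data, search)


def fully_outside(search: Box, data: Box) -> bool:
    """Analog of <>#: data and search region do not touch at all."""
    return not overlaps(search, data)


if __name__ == "__main__":
    region = Box(-125.0, 32.0, -114.0, 42.0)   # made-up search region
    item = Box(-123.0, 37.0, -121.0, 39.0)     # made-up data footprint
    print(overlaps(region, item), fully_enclosed(region, item),
          encloses(region, item), fully_outside(region, item))
```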
112
Overlaps search
113
Fully Enclosed Search
114
Map-Based Search
115
GeoSearch Web Interface
116
MySQL and PostgreSQL
117
RDBMS Support
There are two reasons for RDBMS support: IR systems are not meant for LOTS of update transactions, and some applications need access to both relational data and text data via Z39.50. Both MySQL and PostgreSQL are popular open source RDBMSs, and either can now be used via Cheshire, through Z39.50 mappings to RDBMS columns or “ZQL” submission of SQL as a Z39.50 Type 0 query.
118
Protocol Support
119
Protocols
In Cheshire II most protocols (except Z39.50) are implemented using scripting. Example scripts to support the following are included in the distribution: OAI, SRW (Python version), SOAP, SDLIP.
120
Cheshire III Design and Development
121
Cheshire III Goals
Retain or reproduce (and refine) all Cheshire II features
“Spring cleaning” of code base
Add full Unicode support
Store most system and content data in the database
Permit easy and efficient integration in Web Services
Use threaded server for economy of resource usage
Enhanced multiprotocol support
Support for distributed processing (i.e., GRID clusters)
Enhance expandability and “drop in” functionality
Interfaces and/or APIs for Java, Python, C/C++
122
Cheshire II Design Overview
[Diagram labels: Z SERVER, CONFIG, CONT INDEX, CLUSTER, CLUSTER EXTENSION, BUILD ASSOC, XML DOCS, INDEX CHESHIRE, ASSOC, INDEX(S), COMPONENT DEFINITION, XML DIRECTORY]
123
Cheshire III Server Overview
[Diagram labels: Cheshire III SERVER, API, DB API, REMOTE SYSTEMS (any protocol), XML CONFIG & Metadata INFO, INDEXES, LOCAL DB, STAFF UI, RESULT SETS, SERVER CONTROL, Normalization, Native calls, Z39.50, SOAP, OAI, SRW, JDBC, OpenURL, UDDI, WSRP, OGIS, Fetch ID, Put ID, Client User/Clients]
125
Retain Features
The intent is to permit all of the types of indexing, searching and record formatting available now, while making it easier to add new capabilities. The new system will also support full Unicode for content and for metadata, and will store metadata and content in the database (including config information, etc.).
126
Permit easy integration of Web Services
The assumption is that the web server will be the central server mechanism in the future. The new design relies on the session handling, threading and load management tools available in Apache. The Cheshire server is dynamically loaded as part of the Web server.
127
Multiprotocol Support
The Web server handles the network issues and passes requests in various protocols along to the Cheshire server. Individual protocol “plugins” and the Protocol Handler convert search, display, and metadata requests in a particular protocol to the internal Cheshire III control language, and convert outgoing messages and data to the appropriate protocol form.
128
Distributed & GRID Processing
The server will support protocols for interchange of partial results and collection statistics with a single “Master” controlling the actions of a large number of “Slave” servers These will run in parallel in a GRID environment This is still “research” but will probably be using “Storage Grid” technology from SDSC with our own applications Non-Grid use of the same protocols, etc will be possible (but definitely slower) March 2, 2004 Ray R. Larson
129
Enhanced Expandability
Clearly defined APIs for interacting with the server will make it easy to add new functionality, or to replace or upgrade existing functionality. An interactive user interface for database configuration and setup: we want to make it easier for a user/administrator to create and manage the database.
130
Multilingual APIs
The system is being developed in a multilingual environment. We will include the ability to interface with (at a minimum) Java, Python and C/C++ applications. APIs for developing new functions will be available in these languages as well.
131
Development
Work is currently going on here (RRL) and (primarily) in the UK. We have incomplete (alpha) versions of the system, but have not been distributing it in its current form, which is changing constantly. The first release version is expected in mid-’04.
132
Further Information
The full Cheshire II client and server are open source and available for academic and government use: ftp://cheshire.berkeley.edu/pub/cheshire/ (includes HTML documentation). Project Web Site. Archives Hub.