
Cheshire II: Features and Internals and Cheshire III overview
Ray R. Larson
School of Information Management and Systems
University of California, Berkeley
March 2, 2004

Overview
Cheshire II feature overview
  – Logistic Regression Ranking, Okapi BM-25 and Boolean Operations
  – Fusion Operators
Additions from INEX ‘03
  – Element/Index level re-estimation of LR coefficients
Adhoc and Heterogeneous Track Methodology
Evaluation Results - Adhoc

Overview of Cheshire II
It supports SGML and XML with components and component indexes
It is a client/server application
Uses the Z39.50 Information Retrieval Protocol; support for SRW, OAI, SOAP and SDLIP is also implemented
Server supports a Relational Database Gateway
Supports Boolean searching of all servers
Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
Search engine supports “nearest neighbor” searches and relevance feedback
GUI interface on X window displays and Windows NT
WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
Scriptable clients using Tcl and Python
Store SGML/XML as files or “Datastore” database

Cheshire II Searching
[Diagram labels: Local, Remote, Z39.50, Internet, Images, Scanned Text]

INEX Overview
[Diagram labels: UI or Scripts, Local, Net, Map Query, Map Results, INEX Search Engine]

Boolean Search Capability
All Boolean operations are supported
  – “zfind author x and (title y or subject z) not subject A”
Named sets are supported and stored on the server
Boolean operations between stored sets are supported
  – “zfind SET1 and subject widgets or SET2”
Nested parentheses and truncation are supported
  – “zfind xtitle Alice#”
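The stored-set behaviour described on this slide can be pictured with ordinary set algebra. The following is a minimal Python sketch, not Cheshire II code: it assumes result sets are plain sets of record ids, and the names `named_sets`, `subject_index` and `search_subject`, as well as the left-to-right evaluation order, are illustrative assumptions.

```python
# Hypothetical illustration: named result sets kept on the "server" as Python sets.
named_sets = {
    "SET1": {1, 2, 3, 5},
    "SET2": {8, 9},
}

# Hypothetical inverted index for the subject field: term -> record ids.
subject_index = {"widgets": {2, 3, 7}}


def search_subject(term):
    """Return the ids of records whose subject contains the term."""
    return subject_index.get(term, set())


# "zfind SET1 and subject widgets or SET2", evaluated left to right (assumed order).
result = (named_sets["SET1"] & search_subject("widgets")) | named_sets["SET2"]

# Store the new result under a name so later queries can reuse it.
named_sets["SET3"] = result
print(sorted(result))
```

Keeping results as named sets means a follow-up query only has to combine existing sets instead of repeating the earlier searches.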

Probabilistic Retrieval
Uses the Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time
Z39.50 “relevance” operator used to indicate probabilistic search
Any index can have probabilistic searching performed:
  – zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
  – zfind title @ caucus races
Boolean and Probabilistic elements can be combined:
  – zfind topic @ government documents and title guidebooks

Probabilistic Retrieval: Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients. At retrieval the probability estimate is obtained by:
  \log O(R \mid Q, C) = b_0 + \sum_{i=1}^{6} b_i X_i, \qquad P(R \mid Q, C) = \frac{1}{1 + e^{-\log O(R \mid Q, C)}}
for the 6 X attribute measures shown on the next slide.

Probabilistic Retrieval: Logistic Regression attributes
Average Absolute Query Frequency
Query Length
Average Absolute Component Frequency
Document Length
Average Inverse Component Frequency
Inverse Component Frequency
Number of Terms in common between query and Component (logged)

Combining Boolean and Probabilistic Search Elements
Two original approaches:
  – Boolean Approach
  – Non-probabilistic “Fusion Search”
The set merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries

Okapi BM25
  \sum_{T \in Q} w^{(1)} \, \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}
Where:
Q is a query containing terms T
K is k_1 ((1 - b) + b \cdot dl / avdl)
k_1, b and k_3 are parameters, usually 1.2, 0.75 and 7-1000
tf is the frequency of the term in a specific document
qtf is the frequency of the term in a topic from which Q was derived
dl and avdl are the document length and the average document length measured in some convenient unit
w^{(1)} is the Robertson-Sparck Jones weight
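The definitions on this slide translate fairly directly into code. The sketch below is a plain Python rendering of the usual Okapi BM25 sum, offered as an illustration rather than the Cheshire II implementation. The function names and argument layout are hypothetical, and the Robertson-Sparck Jones weight is assumed to take its common no-relevance-information form, log((N - n + 0.5) / (n + 0.5)).

```python
import math


def rsj_weight(n_docs, doc_freq):
    """Robertson-Sparck Jones weight without relevance information (assumed form)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query_tf, doc_tf, doc_freq, n_docs, dl, avdl,
               k1=1.2, b=0.75, k3=7.0):
    """Score one document against a query.

    query_tf: term -> qtf (frequency in the topic)
    doc_tf:   term -> tf (frequency in this document)
    doc_freq: term -> number of documents containing the term
    """
    # K scales k1 by the document length relative to the average length.
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        w1 = rsj_weight(n_docs, doc_freq[term])
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score
```

With `dl` equal to `avdl`, K reduces to k1, which makes the length normalisation easy to check by hand.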

Merging and Ranking Operators
Extends the capabilities of merging to include merger operations in queries, used like Boolean operators
Fuzzy Logic Operators (not used for INEX)
  – !FUZZY_AND
  – !FUZZY_OR
  – !FUZZY_NOT
Containment operators: Restrict components to or with a particular parent
  – !RESTRICT_FROM
  – !RESTRICT_TO
Merge Operators
  – !MERGE_SUM
  – !MERGE_MEAN
  – !MERGE_NORM
  – !MERGE_CMBZ
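The slides name the merge operators but do not define them precisely, so the following Python sketch only illustrates the general family of score-fusion rules they suggest. Everything here is an assumption: result lists are modelled as `{doc_id: score}` dictionaries, normalisation is min-max, and `merge_cmbz` is written in the style of the well-known CombMNZ rule (normalised sum multiplied by the number of lists containing the document), which may differ from what Cheshire II actually does.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} map to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge_sum(a, b):
    """Add the scores of each document across both result lists."""
    return {doc: a.get(doc, 0.0) + b.get(doc, 0.0) for doc in set(a) | set(b)}


def merge_mean(a, b):
    """Average the two scores, counting a missing document as 0."""
    return {doc: s / 2.0 for doc, s in merge_sum(a, b).items()}


def merge_norm(a, b):
    """Normalise each list first, then sum."""
    return merge_sum(min_max_normalize(a), min_max_normalize(b))


def merge_cmbz(a, b):
    """CombMNZ-style: normalised sum boosted by how many lists contain the document."""
    return {doc: s * ((doc in a) + (doc in b)) for doc, s in merge_norm(a, b).items()}


def ranked(scores):
    """Sort by descending score, breaking ties on document id."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

The boost in the last rule rewards documents retrieved by both searches, which is the usual motivation for fusing an Okapi run with a logistic regression run.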

INEX ‘04 Fusion Search
Merge multiple ranked and Boolean index searches within each query and multiple component search resultsets
  – Major components merged are Articles, Body, Sections, subsections, paragraphs
[Diagram labels: Subquery, Subquery, Comp. Query Results, Comp. Query Results, Fusion/Merge, Final Ranked List]

New LR Coefficients
Index        b0       b1       b2       b3       b4       b5       b6
Base        -3.700    1.269   -0.310    0.679   -0.021    0.223    4.010
topic       -7.758    5.670   -3.427    1.787   -0.030    1.952    5.880
topicshort  -6.364    2.739   -1.443    1.228   -0.020    1.280    3.837
abstract    -5.892    2.318   -1.364    0.860   -0.013    1.052    3.600
alltitles   -5.243    2.319   -1.361    1.415   -0.037    1.180    3.696
sec_words   -6.392    2.125   -1.648    1.106   -0.075    1.174    3.632
para_words  -8.632    1.258   -1.654    1.485   -0.084    1.143    4.004
Estimates using INEX ‘03 relevance assessments for:
b1 = Average Absolute Query Frequency
b2 = Query Length
b3 = Average Absolute Component Frequency
b4 = Document Length
b5 = Average Inverse Component Frequency
b6 = Number of Terms in common between query and Component
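To show how such coefficients would be used at retrieval time, here is a small Python sketch that turns six attribute values into a probability estimate. It assumes the standard logistic regression form (log odds = b0 + sum of b_i times X_i, then the logistic function); the function names are hypothetical, and only the "Base" and "topic" rows from the listing above are copied in. How each X value is computed from raw frequencies is not shown on the slides and is left to the caller.

```python
import math

# Coefficients b0..b6 copied from the "Base" and "topic" rows of the listing above.
COEFFICIENTS = {
    "Base": (-3.700, 1.269, -0.310, 0.679, -0.021, 0.223, 4.010),
    "topic": (-7.758, 5.670, -3.427, 1.787, -0.030, 1.952, 5.880),
}


def log_odds(index_name, x):
    """Return b0 + sum(b_i * X_i) for the six attribute values X1..X6."""
    b = COEFFICIENTS[index_name]
    if len(x) != 6:
        raise ValueError("expected six attribute values")
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))


def probability_of_relevance(index_name, x):
    """Convert the log odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds(index_name, x)))
```

Because each index has its own row, the same query and component can receive different estimates depending on which index is searched.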

SGML/XML Support
Underlying native format for all data is SGML or XML
The DTD defines the file format for each file
Full SGML/XML parsing
SGML/XML Format Configuration Files define the database
USMARC DTD and MARC to SGML conversion (and back again)
Access to full-text via special SGML/XML tags

Indexing
Any SGML/XML tagged field or attribute can be indexed:
  – B-Tree and Hash access via Berkeley DB (Sleepycat)
  – Stemming, keyword, exact keys and “special keys”
  – Mapping from any Z39.50 Attribute combination to a specific index
  – Underlying postings information includes term frequency for probabilistic searching
Component extraction with separate component indexes



SGML/XML Support
Configuration files for the Server are SGML/XML:
  – They include elements describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

SGML/XML Support
Example XML record for a DL document
ELIB-v1.0 756 June 12, 1996 June 1996 Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada University of California report USDA Forest Service Neil H. Berg Ken B. Roby Bruce J. McGurk SNEP Vol 3 40 /elib/data/docs/0700/756/HYPEROCR/hyperocr.html /elib/data/docs/0700/756/OCR-ASCII-NOZONE

SGML Support
Example SGML/MARC Record
00722 n a m 2 2 00229 4 5 0 001001400000005001700014008004100031010001400072035002000086035001700106100001900123 2450105001422500011002472600032002583000033002905040050003236500036003737000022004097000022004 31950003200453998000700485 CUBGGLAD1282B 19940414143202.0 830810 1983 nyu eng u 82019962 (CU)ocm08866667 (CU)GLAD1282 Burch, John G. Information systems : theory and practice / John G. Burch, Jr., Felix R. Strater, Gary Grudnitski 3rd ed New York : J. Wiley, 1983 xvi, 632 p. : ill. ; 24 cm Includes bibliographical references and index Management information systems....

SGML/XML Support
TREC document…
FT931-3566 _AN-DCPCCAA3FT 930316 FT 16 MAR 93 / Italy's Corruption Scandal: Magistrates hold key to unlocking Tangentopoli - They will set the investigation agenda By ROBERT GRAHAM OVER the weekend the Italian media felt obliged to comment on a non-event. No new arrests had taken place in any of the country's ever more numerous corruption scandals which centre on the illicit funding of political parties... …
… Companies:- Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica. Ente Partecipazioni E Finanziamento Industria Manifatturiera. IRI Istituto per La Ricostruzione Industriale. Countries:- ITZ Italy, EC. Industries:- P9222 Legal Counsel and Prosecution. P91 Executive, Legislative and General Government. P13 Oil and Gas Extraction. P9631 Regulation, Administration of Utilities. P6719 Holding Companies, NEC. Types:- …
… CMMT Comment & Analysis. GOVT Legal issues. The Financial Times London Page 4

SGML/XML Support
INEX Document
C1050 10.1041/C1050s-2000 COMPUTING IN SCIENCE & ENGINEERING 1521-9615 /00/$10.00 © 2000 IEEE Vol. 2 No. 1 JANUARY/FEBRUARY 2000 pp. 50-59 The Decompositional Approach to Matrix Computation pp. 50-59 G.W. Stewart University of Maryland The introduction of matrix decomposition into numerical linear algebra revolutionized matrix computations. This article outlines the decompositional approach, comments on its history, and surveys the six most widely used decompositions. In 1951, Paul S. Dwyer published Linear Computations, perhaps the first book devoted entirely to numerical linear algebra. 1 Digital computing was in its infancy, and Dwyer focused on computation with mechanical calculators. Nonetheless, the book was state of the art. Figure 1 reproduces a page of the book dealing with Gaussian elimination. In 1954, Alston S. Householder published Principles of Numerical Analysis, 2 one of the first modern treatments of high-speed digital computation. Figure 2 reproduces a page from this book, also dealing with Gaussian elimination. 1 This page from Linear Computations shows that Paul Dwyer's approach begins with a system of scalar equations. Courtesy of John Wiley & Sons. 2 On this page from Principles of Numerical Analysis, Alston Householder uses partitioned matrices and LU decomposition. Courtesy of McGraw-Hill. The contrast between these two excerpts is striking. The most obvious difference is that Dwyer used scalar equations whereas Householder used partitioned matrices. …

March 2, 2004 Ray R. Larson SGML/XML Support … CONCLUSION The big six are not the only decompositions in use; in fact, there are many more. As mentioned earlier, certain intermediate forms—such as tridiagonal and Hessenberg forms—have come to be regarded as decompositions in their own right. Since the singular value decomposition is expensive to compute and not readily updated, rank-revealing alternatives have received considerable attention. 54, 55 There are also generalizations of the singular value decomposition and the Schur decomposition for pairs of matrices. 56, 57 All crystal balls become cloudy when they look to the future, but it seems safe to say that as long as new matrix problems arise, new decompositions will be devised to solve them. Acknowledgment This work was supported by the National Science Foundation under Grant No. 970909-8562. References P.S. Dwyer Linear Computations, John Wiley & Sons, New York, 1951. A.S. Householder Principles of Numerical Analysis, McGraw-Hill, New York, 1953. J.H. Wilkinson and C. Reinsch Handbook for Automatic Computation, Vol. II, Linear Algebra, Springer-Verlag, New York, 1971. B.S. Garbow et al., "Matrix Eigensystem Routines—Eispack Guide Extension," Lecture Notes in Computer Science, Springer-Verlag, New York, 1977. J.J. Dongarra et al., LINPACK User's Guide, SIAM, Philadelphia, 1979. …

SGML/XML Support
INEX CAS Query
/article[about(./fm/abs,'"information retrieval" "digital libraries"')]
Retrieve articles with an abstract indicating the article is about information retrieval and/or digital libraries
To be relevant the retrieved articles must be about information retrieval, digital libraries or, preferably, both. Articles about information retrieval from digital libraries will receive the highest relevance judgements.
information retrieval, digital libraries

SGML/XML Support
Configuration files for the Server are also SGML/XML:
  – They include tags describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

Cheshire Configuration Files
/projects/is240/GroupX/indexes /projects/is240/GroupX trec /projects/is240/ft /projects/is240/ft.CONT /projects/is240/TREC.FT.DTD ft.assoc cheshire_index/TESTDATA.history …

Indexing
Any SGML/XML tagged field or attribute can be indexed:
  – B-Tree and Hash access via Berkeley DB (Sleepycat)
  – Stemming, keyword, exact keys and “special keys”
  – Mapping from any Z39.50 Attribute combination to a specific index
  – Underlying postings information includes term frequency for probabilistic searching
  – SGML may include the address of full-text for indexing
New indexes can be easily added, or old ones deleted

Bitmapped Indexes
Bitmap indexes can be used for Boolean operations where the data has only a few values and very large numbers of items with each value
Only one bit per record is stored in the index
Processed on a demand basis, so only blocks with the bits needed to resolve a query are fetched
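The one-bit-per-record idea can be illustrated in a few lines of Python. This is only a sketch of the principle, using Python integers as bitmaps; the field values and function names are made up, and the block-wise, on-demand fetching mentioned on the slide is not modelled.

```python
def build_bitmap(record_values, wanted):
    """Return an int whose bit i is set when record i has the wanted value."""
    bitmap = 0
    for record_no, value in enumerate(record_values):
        if value == wanted:
            bitmap |= 1 << record_no
    return bitmap


def matching_records(bitmap):
    """List the record numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical low-cardinality fields, one value per record.
languages = ["eng", "fre", "eng", "ger", "eng", "fre"]
formats = ["book", "book", "map", "book", "book", "map"]

eng = build_bitmap(languages, "eng")
book = build_bitmap(formats, "book")
all_records = (1 << len(languages)) - 1  # mask so NOT stays within the record range

print(matching_records(eng & book))                 # language eng AND format book
print(matching_records(eng | book))                 # language eng OR format book
print(matching_records(eng & ~book & all_records))  # language eng NOT format book
```

Boolean operators become single bitwise operations, which is why this layout suits fields with few distinct values and many records per value.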

cheshire_index/trec.docno.index docno 12 1 12 2 12 6 DOCNO …
cheshire_index/trec.topic.index topic 29 3 6 29 102 3 6 … cheshire_index/topicstoplist HEADLINE DATELINE BYLINE TEXT

Cheshire II – EVI Generation
Entry Vocabulary Indexes can improve access to data with controlled index terms
Define basis for clustering records.
  – Select field to form the basis of the cluster.
  – Evidence Fields to use as contents of the pseudo-documents.
During indexing cluster keys are generated with basis and evidence from each record.
Cluster keys are sorted and merged on basis, and pseudo-documents are created for each unique basis element containing all evidence fields.
Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
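The cluster-building steps on this slide can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the Cheshire II indexer: records are plain dictionaries, the basis field holds a list of values, and the field names (`class`, `title`, `subject`) and the function name are hypothetical.

```python
from collections import defaultdict


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Group evidence text from every record under each value of the basis field."""
    # Step 1: emit one (basis, evidence) cluster key per basis value in each record.
    cluster_keys = []
    for record in records:
        evidence = " ".join(record.get(f, "") for f in evidence_fields).strip()
        for basis in record.get(basis_field, []):
            cluster_keys.append((basis, evidence))

    # Step 2: sort on basis so equal keys are adjacent, then merge them.
    cluster_keys.sort()
    merged = defaultdict(list)
    for basis, evidence in cluster_keys:
        merged[basis].append(evidence)

    # Step 3: one pseudo-document per unique basis value, ready to be indexed.
    return {basis: " ".join(parts) for basis, parts in merged.items()}


records = [
    {"class": ["QA76"], "title": "Information systems", "subject": "Management information systems"},
    {"class": ["QA76"], "title": "Database design", "subject": "Databases"},
    {"class": ["Z699"], "title": "Information retrieval", "subject": "Information storage"},
]
pseudo_docs = build_pseudo_documents(records, "class", ["title", "subject"])
```

Searching the pseudo-documents with ordinary words then suggests the controlled terms (here, class numbers) whose records use that vocabulary.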

EVI/Cluster Definitions
classcluster FLD950 ^a /usr3/cheshire2/data2/clasclusstoplist FLD245 ^[ab] FLD440 ^a FLD490 ^a FLD830 ^a FLD740 ^a titles FLD6.. ^[abcdxyz] subjects 5 subjsum

Component Extraction and Indexing
Any element (or range of SGML/XML data starting with one element and ending with another) can be defined as a ‘component’ and accessed and indexed as if it were an entire document.
Component indexes and document-level indexes can be combined in search operations (and special operators permit selection of documents or components as the result).

Component Definitions
TESTDATA/COMPONENT_DB1 NONE mainenty titles Fld300 TESTDATA/comp1index1.author …

Result Formatting (Display)
KEEP_ENTITIES DOCNO 28 #DOCID# 5

INEX Configuration Example
/projects/metadata/cheshire/TREC/cheshire_index /projects/metadata/cheshire/INEX INEX inex-1.3/xml inex-1.3/xml_main.cont inex-1.3/dtd/wrapper.dtd inex-1.3/dtd/catalog inex-1.3/xml_main.assoc inex.history
<INDEXDEF ACCESS=BTREE EXTRACT=EXACTKEY NORMAL=DO_NOT_NORMALIZE PRIMARYKEY=IGNORE> indexes/docno.index docno 12 1 12 2 12 6 doi
indexes/pauthor.index pauthor 1 3 6 1004 3 6 indexes/authorstoplist fm au snm fm au fnm
indexes/title.index title 4 3 6 5 3 6 6 3 6 indexes/titlestoplist fm tig atl
indexes/topic.index topic 29 3 6 … 1017 102 3 6 indexes/topicstoplist fm tig atl abs bdy bibl bb atl app
indexes/date.index date 30 3 6 30 3 5 hdr2 yr
indexes/journal.index journal 1022 3 6 1022 3 5 hdr1 ti
indexes/keywords.index kwd 3121 3 6 indexes/topicstoplist kwd
indexes/abstract.index abstract 62 3 6 indexes/topicstoplist abs
indexes/author_seq.index author_seq fm au sequence
indexes/bib_author_fnm.index bib_author_fnm 1000 3 6 bb au fnm
indexes/bib_author_snm.index bib_author_snm 1000 3 6 bb au snm
indexes/fig.index fig 3150 3 6 indexes/topicstoplist fig
indexes/ack.index ack 3188 3 6 indexes/topicstoplist ack
indexes/alltitles.index alltitles 3188 3 6 indexes/titlestoplist atl st
indexes/affil.index affil 3189 3 6 indexes/titlestoplist fm aff
indexes/fno.index fno 3192 3 6 fno
indexes/figno.index figno 3193 3 6 fig no
indexes/topicshort.index topicshort 3192 3 6 fm tig atl abs kwd st
indexes/COMPONENT_SECTION NONE sec indexes/sec_title2.index sec_title 38 3 6 indexes/titlestoplist sec st
indexes/sec_words.index sec_words 39 3 6 indexes/topicstoplist sec
indexes/COMPONENT_BIB NONE bm bib bibl bb indexes/bib_author.index bib_author 1000 3 6 au
indexes/bib_title.index bib_title 33 3 6 atl
indexes/bib_date.index bib_date 31 3 6 pdt yr
indexes/COMPONENT_PARAS NONE ^ilrj$|^ip1$|^ip2$|^ip3$|^ip4$|^ip5$|^item-none$|^p$|^p1$|^p2$|^p3$|^tmath$|^tf$ indexes/para_words.index para_words 39 3 6 indexes/topicstoplist.*
indexes/COMPONENT_FIG NONE fig indexes/fig_caption.index fig_caption 38 3 6 indexes/titlestoplist fgc
indexes/COMPONENT_VITAE NONE vt indexes/vitae_words.index vt_vitae 38 3 6 indexes/titlestoplist vt
KEEP_ENTITIES doi 28 #DOCID# 5 #DBNAME# …
#FILENAME# FILENAME #RANK# RANK … #RAWSCORE# RAWSCORE SUBST_ELEMENT SUBST_ELEMENT

XML Schemas and Element Retrieval

XML Schema Support
XML Schemas or DTDs can be used to define the data contents
Tested with a wide variety of schemas including METS (with various supporting schemas)

XML Element Extraction
A new search “ElementSetName” is XML_ELEMENT_
Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request (note that only a subset of full XPath is available)
The matching elements are extracted from the records matching the search and delivered in a simple format.

XML Extraction
% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}} { Singularités à Cargèse … etc…

Database Storage
All data is stored as SGML/XML flat text files plus optional linked full-text (non-XML) files
File format is defined through an SGML/XML DTD (also a flat text file) or Schema
“Associator” files provide indexed direct access to each record in SGML/XML files.
  – Contain offset and record length for each “record”
  – Associators can be built to index any conformant document in a directory sub-tree
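The associator idea (an offset and a length per record, giving direct access into a flat file) can be sketched in Python as below. This is a toy illustration, not the Cheshire II file format: the associator is an in-memory list rather than a file, records are located by a hypothetical closing tag, and `build_associator` and `fetch_record` are invented names.

```python
import io


def build_associator(data, end_tag):
    """Scan concatenated records (bytes) and return a list of (offset, length) pairs."""
    entries = []
    start = 0
    while True:
        end = data.find(end_tag, start)
        if end == -1:
            break
        end += len(end_tag)
        entries.append((start, end - start))
        # Skip whitespace between records so the next offset points at its first byte.
        start = end
        while start < len(data) and data[start:start + 1].isspace():
            start += 1
    return entries


def fetch_record(stream, associator, record_no):
    """Seek straight to a record using its associator entry, without scanning the file."""
    offset, length = associator[record_no]
    stream.seek(offset)
    return stream.read(length)


data = b"<DOC><DOCNO>1</DOCNO></DOC>\n<DOC><DOCNO>2</DOCNO></DOC>\n"
associator = build_associator(data, b"</DOC>")
second = fetch_record(io.BytesIO(data), associator, 1)
```

Because the index stores byte positions, fetching record n costs one seek and one read regardless of how large the data file is.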

INEX CO Runs
Three official, one later run - all Title-only
  – Fusion - Combines Okapi and LR using the MERGE_CMBZ operator
  – NewParms (LR) - Using only LR with the new parameters
  – Feedback - An attempt at blind relevance feedback
  – PostFusion - Fusion of the new LR coefficients and Okapi

Query Generation - CO
# 162 TITLE = Text and Index Compression Algorithms
QUERY: (topicshort @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (topicshort @ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @ {Text and Index Compression Algorithms})
@+ is Okapi, @ is LR
!MERGE_CMBZ is a normalized score summation and enhancement
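The query on this slide follows a regular pattern: an Okapi (`@+`) search and an LR (`@`) search on each of two indexes, all joined with `!MERGE_CMBZ`. A short Python sketch of how such a string could be generated from a topic title is shown below; the function name `build_co_query` is hypothetical, and the sketch only assembles the query text, it does not submit it to a server.

```python
def build_co_query(title, indexes=("topicshort", "alltitles")):
    """Build a fusion query: Okapi (@+) searches first, then LR (@), on each index."""
    parts = []
    for operator in ("@+", "@"):
        for index in indexes:
            parts.append("(%s %s {%s})" % (index, operator, title))
    return " !MERGE_CMBZ ".join(parts)


print(build_co_query("Text and Index Compression Algorithms"))
```

Generating the query from the title alone matches the Title-only character of the runs described here.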

INEX CO Runs
Generalized    Strict
Avg Prec: FUSION = 0.0642, NEWPARMS = 0.0582, FDBK = 0.0415, POSTFUS = 0.0690
Avg Prec: FUSION = 0.0923, NEWPARMS = 0.0853, FDBK = 0.0390, POSTFUS = 0.0952

INEX VCAS Runs
Two official runs
  – FUSVCAS - Element fusion using LR and various operators for path restriction
  – NEWVCAS - Using the new LR coefficients for each appropriate index and various operators for path restriction

Query Generation - VCAS
#66 TITLE = //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)]
Submitted query = ((topic @ {intelligent transport systems})) !RESTRICT_FROM ((sec_words @ {on-board route planning navigation system for automobiles}))
Target elements: sec|ss1|ss2|ss3

VCAS Results
Generalized    Strict
Avg Prec: FUSVCAS = 0.0321, NEWVCAS = 0.0270
Avg Prec: FUSVCAS = 0.0601, NEWVCAS = 0.0569

Heterogeneous Track
Approach using Cheshire’s Virtual Database options
  – Primarily a version of distributed IR
  – Each collection indexed separately
  – Search via Z39.50 distributed queries
  – Z39.50 Attribute mapping used to map query indexes to appropriate elements in a given collection
  – Only LR used, and collection results merged using probability of relevance for each collection result
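Merging per-collection results by probability of relevance amounts to interleaving already-ranked lists. A minimal Python sketch of that step follows, assuming each collection returns `(probability, doc_id)` pairs sorted from highest to lowest probability; the collection names, document ids and the function name are hypothetical, and the distributed Z39.50 searching itself is not shown.

```python
import heapq
from itertools import islice


def merge_collections(results_by_collection, limit=10):
    """Interleave ranked lists from several collections by probability of relevance.

    results_by_collection: {collection_name: [(probability, doc_id), ...]},
    with each list already sorted in descending order of probability.
    """
    streams = (
        [(prob, name, doc_id) for prob, doc_id in hits]
        for name, hits in results_by_collection.items()
    )
    # heapq.merge keeps the inputs lazy; reverse=True expects descending inputs.
    merged = heapq.merge(*streams, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))


results = {
    "inex": [(0.9, "a"), (0.4, "b")],
    "other": [(0.7, "x"), (0.1, "y")],
}
print(merge_collections(results, limit=3))
```

This only works as a merge criterion because every collection is scored with the same LR model, so the probabilities are treated as comparable across collections.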

Heterogeneous Track Issues
Very large “Documents”
  – Our approach was to segment
Reporting XPath after segmenting large documents

Database Storage
[Diagram labels: Config File, DTD File, SGML/XML File, Associator File, History File, Cluster File, Index File, Postings File, Prox data File, Page Data File, Remote RDBMS]

Client/Server Architecture
Server Supports:
  – Database storage
  – Indexing
  – Z39.50 access to local data
  – Boolean and Probabilistic Searching
  – Relevance Feedback
  – External SQL database support
Client Supports:
  – Programmable (Tcl/Tk – Python soon) Graphical User Interface
  – Z39.50 access to remote servers
  – SGML & MARC formatting
Combined Client/Server CGI scripting via WebCheshire

Z39.50 Overview
[Diagram labels: UI, Map Query, Map Results, Internet, Search Engine]

Two Protocols: HTTP & Z39.50

Server Z39.50 Support
Locally developed Z39.50 Library
Extended version 3 support
  – Supports version 3 attributes in BIB-1 including “stem”, “relevance”, etc. Also adding support for “type 102” ranked queries (version 4)
Can provide MARC, SUTRS and SGML records, with support for Explain and GRS-1 conversion of any SGML records

Distributed Search

The Problem
The Digital Library vision: access for everyone to “all human knowledge”
Lyman and Varian’s estimates of the “Dark Web”
Hundreds or thousands of servers with databases ranging widely in content, topic and format
  – Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
  – How to select the “best” ones to search?
      Which resource to search first?
      Which to search next if more is wanted?
  – Topical/domain constraints on the search selections
  – Variable contents of database (metadata only, full text, multimedia…)

Distributed Search Tasks
Resource Description
  – How to collect metadata about digital libraries and their collections or databases
Resource Selection
  – How to select relevant digital library collections or databases from a large number of databases
Distributed Search
  – How to perform parallel or sequential searching over the selected digital library databases
Data Fusion
  – How to merge query results from different digital libraries with their different search engines, differing record structures, etc.

March 2, 2004 Ray R. Larson An Approach for Distributed Resource Discovery
Distributed resource representation and discovery
–New approach to building resource descriptions based on Z39.50
–Instead of using broadcast search across resources we are using two Z39.50 services
    Identification of database metadata using Z39.50 Explain
    Extraction of distributed indexes using Z39.50 SCAN
Evaluation
–How efficiently can we build distributed indexes?
–How effectively can we choose databases using the index?
–How effective is merging search results from multiple sources?
–Can we build hierarchies of servers (general/meta-topical/individual)?

March 2, 2004 Ray R. Larson Z39.50 Explain
Explain supports searches for
–Server-level metadata
    Server name
    IP addresses
    Ports
–Database-level metadata
    Database name
    Search attributes (indexes and combinations)
–Support metadata (record syntaxes, etc.)

March 2, 2004 Ray R. Larson Z39.50 SCAN
Originally intended to support browsing
Query for
–Database
–Attributes plus term (i.e., index and start point)
–Step size
–Number of terms to retrieve
–Position in response set
Results
–Number of terms returned
–List of terms and their frequency in the database (for the given attribute combination)

March 2, 2004 Ray R. Larson Z39.50 SCAN Results
% zscan title cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 27} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8} {catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3} {catch-up 2} …
% zscan topic cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} …
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos
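A scripted client has to turn a response like the ones above into term/frequency pairs before it can use them. The following is a minimal Python sketch of that step, assuming the response arrives as a single string in the brace-delimited form shown on the slide and that terms contain no spaces. The function name `parse_scan` is hypothetical and is not part of the Cheshire distribution.

```python
import re

# Header block: {SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}}
HEADER_RE = re.compile(r"^\s*\{SCAN\s+((?:\{\w+\s+-?\d+\}\s*)+)\}")
# A single {name number} pair, used for both header fields and terms.
PAIR_RE = re.compile(r"\{(\S+)\s+(-?\d+)\}")


def parse_scan(response):
    """Split a SCAN response string into (header dict, [(term, freq), ...]).

    Assumption: terms are single tokens, as in the slide's examples.
    """
    match = HEADER_RE.match(response)
    if match is None:
        raise ValueError("not a SCAN response")
    header = {name: int(value) for name, value in PAIR_RE.findall(match.group(1))}
    body = response[match.end():]
    terms = [(term, int(freq)) for term, freq in PAIR_RE.findall(body)]
    return header, terms


if __name__ == "__main__":
    sample = ("{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} "
              "{cat 27} {cat-fight 1} {catalan 19}")
    print(parse_scan(sample))
```

Multi-word terms would need a real brace-aware parser rather than the regular expressions used here.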

March 2, 2004 Ray R. Larson Resource Index Creation
For all servers, or a topical subset…
–Get Explain information
–For each index
    Use SCAN to extract terms and frequency
    Add term + freq + source index + database metadata to the XML “Collection Document” for the resource
–Planned extensions:
    Post-process indexes (especially Geo Names, etc.) for special types of data
    –e.g. create “geographical coverage” indexes
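The harvesting loop above can be sketched as follows. This is only an illustration: the element and attribute names (`collection`, `index`, `term`, `freq`), the host name, and the function `build_collection_document` are assumptions, since the slides do not show the actual Collection Document schema. The input is taken to be terms already extracted with SCAN, grouped by source index.

```python
import xml.etree.ElementTree as ET


def build_collection_document(database, host, scanned):
    """Build an XML collection document from SCAN output.

    scanned maps an index name to a list of (term, frequency) pairs.
    All element and attribute names here are hypothetical.
    """
    root = ET.Element("collection", {"database": database, "host": host})
    for index_name, terms in scanned.items():
        index_el = ET.SubElement(root, "index", {"name": index_name})
        for term, freq in terms:
            term_el = ET.SubElement(index_el, "term", {"freq": str(freq)})
            term_el.text = term
    return root


if __name__ == "__main__":
    scanned = {
        "title": [("cat", 27), ("cat-fight", 1)],
        "topic": [("cat", 706), ("cat-and-mouse", 19)],
    }
    doc = build_collection_document("ft", "example.org", scanned)
    print(ET.tostring(doc, encoding="unicode"))
```

Keeping the source index on each term is what later allows estimates from several indexes to be combined per resource.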

March 2, 2004 Ray R. Larson MetaSearch Approach
[Diagram labels: MetaSearch Server; Distributed Index; Map Explain and Scan Queries; Map Query; Map Results; Internet; three Search Engines serving DB 1 and DB 2, DB 3 and DB 4, DB 5 and DB 6]

March 2, 2004 Ray R. Larson Known Issues and Problems
Not all Z39.50 servers support SCAN or Explain
Solutions that appear to work well:
–Probing for attributes instead of Explain (e.g. DC attributes or analogs)
–We also support OAI and can extract OAI metadata for servers that support OAI
–Query-based sampling (Callan)
Collection Documents are static and need to be replaced when the associated collection changes

March 2, 2004 Ray R. Larson Evaluation
Test Environment
–TREC Tipster data (approx. 3 GB)
–Partitioned into 236 smaller collections based on source and date by month (no DOE)
    High size variability (from 1 to thousands of records)
    Same database as used in other distributed search studies by J. French and J. Callan, among others
–Used TREC topics 51-150 for evaluation (these are the only topics with relevance judgements for all 3 TIPSTER disks)

March 2, 2004 Ray R. Larson Harvesting Efficiency
Tested using the databases on the previous slide + the full FT database (210,158 records, ~600 MB)
Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network
Average of 14.07 seconds
Also tested larger databases (e.g. the TREC FT database, ~600 MB with 7 indexes, was harvested in 131 seconds)

March 2, 2004 Ray R. Larson Our Collection Ranking Approach
We attempt to estimate the probability of relevance for a given collection with respect to a query using the Logistic Regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time
Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (i.e., fusion of multiple query results)

March 2, 2004 Ray R. Larson Probabilistic Retrieval: Logistic Regression
Probability of relevance for a given index is based on logistic regression over a sample set of documents (TREC) to determine the values of the coefficients. At retrieval time the probability estimate is obtained by:
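The equation itself did not survive in this transcript. As a sketch, the standard logistic regression form is given below; it is assumed here, not copied from the slide. $R$ denotes relevance, $Q$ the query, $C$ the collection, $c_i$ the fitted coefficients, and $X_i$ the attributes computed for the query and collection.

```latex
\log O(R \mid Q, C) = c_0 + \sum_{i=1}^{n} c_i X_i ,
\qquad
P(R \mid Q, C) = \frac{e^{\log O(R \mid Q, C)}}{1 + e^{\log O(R \mid Q, C)}}
```

Collections are then ranked by $P(R \mid Q, C)$; the specific coefficient values come from the training sample and are not given in the slides.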

March 2, 2004 Ray R. Larson Probabilistic Retrieval: Logistic Regression attributes
Average Absolute Query Frequency
Query Length
Average Absolute Collection Frequency
Collection size estimate
Average Inverse Collection Frequency
Inverse Document Frequency
(N = number of collections, M = number of terms in common between query and document)
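The attribute formulas were images on the original slide and are not recoverable from this transcript. The Python sketch below shows how attributes of this kind might be computed and fed into a logistic model. Every formula in `collection_features` (the log averages, the square roots, the `log(N / n_t)` inverse collection frequency) is an assumption made for illustration, and the function names are hypothetical. Only the logistic step itself is standard.

```python
import math


def lr_probability(features, coefficients, intercept):
    """Logistic model: P = 1 / (1 + exp(-(c0 + sum(c_i * x_i))))."""
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


def collection_features(query_tf, coll_tf, coll_size, n_collections, coll_df):
    """Compute illustrative attributes for one query/collection pair.

    query_tf: term -> frequency in the query
    coll_tf:  term -> frequency in the collection document
    coll_df:  term -> number of collections containing the term
    All formulas below are assumed forms, not taken from the slides.
    Returns None when the query and collection share no terms.
    """
    common = [t for t in query_tf if t in coll_tf]
    m = len(common)  # M: terms in common between query and collection
    if m == 0:
        return None
    avg_log_qf = sum(math.log(query_tf[t]) for t in common) / m
    query_length = sum(query_tf.values())
    avg_log_cf = sum(math.log(coll_tf[t]) for t in common) / m
    avg_icf = sum(math.log(n_collections / coll_df[t]) for t in common) / m
    return [avg_log_qf, math.sqrt(query_length), avg_log_cf,
            math.sqrt(coll_size), avg_icf, math.log(m)]


if __name__ == "__main__":
    feats = collection_features({"cat": 1, "dog": 2}, {"cat": 27},
                                coll_size=100, n_collections=236,
                                coll_df={"cat": 59})
    # Placeholder coefficients; real values would come from training data.
    print(lr_probability(feats, [0.1] * 6, intercept=-3.0))
```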

March 2, 2004 Ray R. Larson Evaluation
Effectiveness
–Tested using the collection representatives described above (as harvested from over the network) and the TIPSTER relevance judgements
–Testing by comparing our approach to known algorithms for ranking collections
–Results were measured against reported results for the Ideal and CORI algorithms and against the optimal “Relevance Based Ranking” (MAX)
–Recall analog (how many of the relevant documents occurred in the top n databases, averaged)
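One plain reading of the recall analog described above is the fraction of a topic's relevant documents that fall in the top n ranked databases, averaged over topics. The Python sketch below implements that reading. The function names are hypothetical, and the exact normalisation used in the published comparisons may differ.

```python
def recall_analog(ranked_dbs, relevant_per_db, n):
    """Fraction of all relevant documents held by the top n ranked databases.

    ranked_dbs: database names in ranked order for one topic
    relevant_per_db: database name -> number of relevant documents it holds
    """
    total = sum(relevant_per_db.values())
    if total == 0:
        return 0.0
    found = sum(relevant_per_db.get(db, 0) for db in ranked_dbs[:n])
    return found / total


def mean_recall_analog(rankings, judgements, n):
    """Average the per-topic recall analog over all topics in rankings."""
    scores = [recall_analog(rankings[topic], judgements[topic], n)
              for topic in rankings]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    rankings = {51: ["b", "a", "c"], 52: ["c", "a", "b"]}
    judgements = {51: {"a": 6, "b": 3, "c": 1}, 52: {"a": 2, "c": 2}}
    for n in (1, 2, 3):
        print(n, mean_recall_analog(rankings, judgements, n))
```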

March 2, 2004 Ray R. Larson Titles only (short query)

March 2, 2004 Ray R. Larson Future
Logically clustering servers by topic
Meta-Meta Servers (treating the MetaSearch database as just another database)

March 2, 2004 Ray R. Larson Distributed Metadata Servers
[Diagram labels: Replicated servers; Meta-Topical Servers; General Servers; Database Servers]

March 2, 2004 Ray R. Larson Geographic Operators and Search Ranking

March 2, 2004 Ray R. Larson The GEO Operations
Operators established for the GEO Z39.50 profile
Implemented using special operations on indexes
Indexing allows extraction of geographic coordinates and dates from SGML/XML data in a variety of formats
Normalized internal representation in indexes
Search using geographic and time elements as primary or limiting search elements

March 2, 2004 Ray R. Larson The GEO Operations
X-based interfaces permit (simple) map drawing and search
Interface to MapServer for web-based map searching

March 2, 2004 Ray R. Larson GEO Geographic operators
Operator  Name              Meaning
>=<       Overlap           Search region and data overlap
>#<       Fully Enclosed    Data fully enclosed in search region
<#>       Encloses          Data fully encloses search region
<>#       Fully Outside     Data outside of search region
++        Near              Data is near search region
:<:       Before            Data date is before search date
:<=:      Before or During  Data date is before or during search date
:>=:      During or After   Data date is during or after search date
:>:       After             Data date is after search date
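The four region operators in the table can be illustrated with simple bounding-box tests. The Python sketch below assumes both the data and the search region are axis-aligned boxes given as (west, south, east, north) in degrees and that no box crosses the antimeridian. The slides do not say how Cheshire represents regions internally, so the `Box` type and the function names are hypothetical. “Near” is left out because the slide does not define the distance involved.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in degrees (assumed representation)."""
    west: float
    south: float
    east: float
    north: float


def overlaps(data, search):
    """>=< : the data box and the search box share at least one point."""
    return (data.west <= search.east and data.east >= search.west
            and data.south <= search.north and data.north >= search.south)


def fully_enclosed(data, search):
    """>#< : the data box lies entirely inside the search box."""
    return (search.west <= data.west and data.east <= search.east
            and search.south <= data.south and data.north <= search.north)


def encloses(data, search):
    """<#> : the data box entirely contains the search box."""
    return fully_enclosed(search, data)


def fully_outside(data, search):
    """<># : the data box and the search box do not touch at all."""
    return not overlaps(data, search)


if __name__ == "__main__":
    # Approximate coordinates, for illustration only.
    california = Box(-124.4, 32.5, -114.1, 42.0)
    bay_area = Box(-123.0, 37.0, -121.5, 38.5)
    print(fully_enclosed(bay_area, california), encloses(california, bay_area))
```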

March 2, 2004 Ray R. Larson Overlaps search

March 2, 2004 Ray R. Larson Fully Enclosed Search

March 2, 2004 Ray R. Larson Map-Based Search

March 2, 2004 Ray R. Larson GeoSearch Web Interface

March 2, 2004 Ray R. Larson MySQL and PostgreSQL

March 2, 2004 Ray R. Larson RDBMS Support
There are two reasons for RDBMS support
–IR systems are not meant for LOTS of update transactions
–Some applications need to have access to both relational data and text data via Z39.50
Both MySQL and PostgreSQL are popular open source RDBMSs, and either can now be used via Cheshire
–Z39.50 mappings to RDBMS columns
–“ZQL” submission of SQL as Z39.50 Type 0 query
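The idea of mapping Z39.50 attributes to RDBMS columns can be sketched in Python as below. Several things here are assumptions: SQLite (from the standard library) stands in for MySQL or PostgreSQL, the `records` table and its columns are invented, and the mapping dictionary is hypothetical rather than Cheshire's actual configuration format. The BIB-1 Use attribute numbers 4 (title) and 1003 (author) are the standard ones.

```python
import sqlite3

# Hypothetical mapping from BIB-1 Use attributes to column names.
USE_ATTRIBUTE_TO_COLUMN = {4: "title", 1003: "author"}


def search(conn, use_attribute, term):
    """Translate an attribute/term pair into a parameterized SQL query.

    The column name comes only from the fixed mapping above, never from
    user input; the search term is passed as a bound parameter.
    """
    column = USE_ATTRIBUTE_TO_COLUMN[use_attribute]  # KeyError if unmapped
    sql = (f"SELECT id, title, author FROM records "
           f"WHERE {column} LIKE ? ORDER BY id")
    return conn.execute(sql, (f"%{term}%",)).fetchall()


def demo_connection():
    """Create a small in-memory database with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records "
                 "(id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
    conn.executemany(
        "INSERT INTO records (title, author) VALUES (?, ?)",
        [("Cheshire II overview", "Larson"),
         ("Distributed search", "Larson"),
         ("Query-based sampling", "Callan")])
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(search(conn, 4, "Cheshire"))
    print(search(conn, 1003, "Larson"))
```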

March 2, 2004 Ray R. Larson Protocol Support

March 2, 2004 Ray R. Larson Protocols
In Cheshire II most protocols (except Z39.50) are implemented using scripting
Example scripts to support the following are included in the distribution
–OAI
–SRW (Python version)
–SOAP
–SDLIP

March 2, 2004 Ray R. Larson Cheshire III Design and Development

March 2, 2004 Ray R. Larson Cheshire III Goals
Retain or reproduce (and refine) all Cheshire II features
–“Spring cleaning” of code base
–Add full Unicode support
–Store most system and content data in the database
Permit easy and efficient integration in Web Services
Use threaded server for economy of resource usage
Enhanced multiprotocol support
Support for distributed processing (i.e. GRID clusters)
Enhance expandability and “drop in” functionality
Interfaces and/or APIs for Java, Python, C/C++

March 2, 2004 Ray R. Larson Cheshire II Design Overview
[Diagram; labels as extracted: XML DOCS; XML DIRECTORY; INDEX; CLUSTER INDEX; CHESHIRE; CONT; BUILD ASSOC; Z SERVER; CONFIG; COMPONENT DEFINITION; INDEX(S); ASSOC; CLUSTER EXTENSION]

March 2, 2004 Ray R. Larson Cheshire III Server Overview
[Diagram of the Cheshire III SERVER. Labels: API; INDEXING; SEARCH; SCAN; CLUSTERING; XSLT RECORD TRANSFORMS; PROTOCOL HANDLER; DB API; NETWORK; CONFIG & CONTROL; AUTHENTICATION; APACHE INTERFACE; SERVER CONTROL; Normalization; LOCAL DB; INDEXES; RESULT SETS; USER INFO; ACCESS INFO; XML CONFIG & Metadata INFO; STAFF UI CONFIG; REMOTE SYSTEMS (any protocol); Client User/Clients; Native calls; Z39.50; SOAP; OAI; SRW; JDBC; OpenURL; UDDI; WSRP; OGIS; Fetch ID; Put ID]

March 2, 2004 Ray R. Larson Retain Features
The intent is to permit all of the types of indexing, searching and record formatting available now, while making it easier to add new capabilities
The new system will also support full Unicode for content and for metadata
Store metadata and content in the database (including config information, etc.)

March 2, 2004 Ray R. Larson Permit easy integration of Web Services
The assumption is that the web server will be the central server mechanism in the future.
The new design relies on the session handling, threading and load management tools available in Apache (2.0.40+)
The Cheshire server is dynamically loaded as part of the Web Server

March 2, 2004 Ray R. Larson Multiprotocol Support
The Web server handles the network issues and passes requests in various protocols along to the Cheshire server.
Individual protocol “plugins” and the Protocol Handler convert search, display, and metadata requests in a particular protocol to the internal Cheshire III control language, and convert outgoing messages and data to the appropriate protocol form
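The plugin arrangement described above can be sketched in Python as follows. This is not Cheshire III's actual API: `InternalRequest`, `ProtocolPlugin`, `UrlQueryPlugin` and `ProtocolHandler` are hypothetical names, the "internal control language" is reduced to a small dataclass, and the one plugin shown handles a toy URL-query-string protocol.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from urllib.parse import parse_qs


@dataclass
class InternalRequest:
    """Stand-in for a request in the server's internal control language."""
    operation: str
    query: str
    max_records: int = 10


class ProtocolPlugin(ABC):
    """Converts one wire protocol to and from the internal form."""

    @abstractmethod
    def parse_request(self, raw: str) -> InternalRequest: ...

    @abstractmethod
    def format_response(self, records: list) -> str: ...


class UrlQueryPlugin(ProtocolPlugin):
    """Toy plugin for requests encoded as a URL query string."""

    def parse_request(self, raw):
        params = parse_qs(raw)
        return InternalRequest(
            operation=params.get("operation", ["search"])[0],
            query=params["query"][0],
            max_records=int(params.get("maximumRecords", ["10"])[0]))

    def format_response(self, records):
        return "\n".join(records)


class ProtocolHandler:
    """Routes each request to the plugin registered for its protocol."""

    def __init__(self, engine):
        self._engine = engine  # callable: InternalRequest -> list of records
        self._plugins = {}

    def register(self, name, plugin):
        self._plugins[name] = plugin

    def handle(self, protocol, raw):
        plugin = self._plugins[protocol]
        request = plugin.parse_request(raw)
        return plugin.format_response(self._engine(request))


if __name__ == "__main__":
    def engine(req):
        return [f"{req.operation}:{req.query}:{i}" for i in range(req.max_records)]

    handler = ProtocolHandler(engine)
    handler.register("url", UrlQueryPlugin())
    print(handler.handle("url", "operation=search&query=cat&maximumRecords=2"))
```

With this shape, adding a protocol means registering one more plugin, and the search engine only ever sees `InternalRequest` objects.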

March 2, 2004 Ray R. Larson Distributed & GRID Processing
The server will support protocols for interchange of partial results and collection statistics with a single “Master” controlling the actions of a large number of “Slave” servers
These will run in parallel in a GRID environment
This is still “research” but will probably be using “Storage Grid” technology from SDSC with our own applications
Non-Grid use of the same protocols, etc. will be possible (but definitely slower)
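The pattern of one controlling server collecting partial results from many others can be sketched in Python as below. This is a toy under stated assumptions: threads stand in for Grid nodes, each partition is an in-memory dictionary, and the score is a raw term count. The function names are hypothetical, and nothing here reflects the actual Cheshire III or Storage Grid protocols.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_partition(partition, term):
    """Worker step: score each document in one partition by term count.

    partition maps a document id to its text. Returns (score, doc_id) pairs.
    """
    hits = []
    for doc_id, text in partition.items():
        score = text.lower().split().count(term.lower())
        if score:
            hits.append((score, doc_id))
    return hits


def master_search(partitions, term, k=3):
    """Controller step: fan the query out, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda p: search_partition(p, term), partitions))
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, merged)


if __name__ == "__main__":
    partitions = [
        {"d1": "cat cat dog", "d2": "dog"},
        {"d3": "cat"},
        {"d4": "cat cat cat"},
    ]
    print(master_search(partitions, "cat", k=2))
```

Raw counts are comparable across partitions only in a toy like this. With a real ranking function the partitions would also need to share collection statistics, which is why the slide lists them alongside partial results.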

March 2, 2004 Ray R. Larson Enhanced Expandability
Clearly defined APIs for interacting with the server will permit easy addition of new functionality, and replacement or upgrading of existing functionality
Interactive user interface for database configuration and setup
–We want to make it easier for a user/administrator to create and manage the database

March 2, 2004 Ray R. Larson Multilingual APIs
The system is being developed in a multilingual environment.
We will include the ability to interface with (at a minimum) Java, Python and C/C++ applications.
APIs for developing new functions will be available in these languages as well

March 2, 2004 Ray R. Larson Development
Currently work is going on here (RRL) and (primarily) in the UK
We have incomplete (alpha) versions of the system, but haven’t been distributing it in its current form (changing constantly)
First release version is expected in mid-’04

March 2, 2004 Ray R. Larson Further Information
Full Cheshire II client and server is open source and available for academic and government use: ftp://cheshire.berkeley.edu/pub/cheshire/
–Includes HTML documentation
Project Web Site: http://cheshire.berkeley.edu/
Archives Hub: http://www.archiveshub.ac.uk/
