
1 March 2, 2004 Ray R. Larson Cheshire II: Features and Internals and Cheshire III overview Ray R. Larson School of Information Management and Systems University of California, Berkeley

2 March 2, 2004 Ray R. Larson Overview
Cheshire II feature overview
  – Logistic Regression Ranking, Okapi BM-25 and Boolean Operations
  – Fusion Operators
Additions from INEX ‘03
  – Element/Index level re-estimation of LR coefficients
Adhoc and Heterogeneous Track Methodology
Evaluation Results - Adhoc

3 March 2, 2004 Ray R. Larson Overview of Cheshire II
It supports SGML and XML with components and component indexes
It is a client/server application
Uses the Z39.50 Information Retrieval Protocol; support for SRW, OAI, SOAP and SDLIP is also implemented
Server supports a Relational Database Gateway
Supports Boolean searching of all servers
Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
Search engine supports "nearest neighbor" searches and relevance feedback
GUI interface on X window displays and Windows NT
WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
Scriptable clients using Tcl and Python
Store SGML/XML as files or “Datastore” database

4 March 2, 2004 Ray R. Larson Cheshire II Searching
[Diagram: local and remote searching via Z39.50 over the Internet; labels: Images, Scanned Text, Local, Remote]

5 March 2, 2004 Ray R. Larson INEX Overview
[Diagram: UI or Scripts, Map Query / Map Results, Local Net, Map Query / Map Results, INEX Search Engine]

6 March 2, 2004 Ray R. Larson Boolean Search Capability
All Boolean operations are supported
  – “zfind author x and (title y or subject z) not subject A”
Named sets are supported and stored on the server
Boolean operations between stored sets are supported
  – “zfind SET1 and subject widgets or SET2”
Nested parentheses and truncation are supported
  – “zfind xtitle Alice#”
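The Boolean behaviour described on this slide can be illustrated with ordinary set algebra. The following is only a sketch under stated assumptions: the `index` contents, the record IDs, and the helper names `find` and `find_truncated` are invented for illustration and are not part of Cheshire II, and `#` truncation is assumed to mean simple prefix matching.

```python
# Hypothetical postings: (index name, term) -> set of record IDs.
index = {
    ("author", "x"): {1, 2, 3},
    ("title", "y"): {2, 5},
    ("subject", "z"): {3, 6},
    ("subject", "a"): {3},
    ("xtitle", "alice in wonderland"): {7},
}

def find(field, term):
    """Return the record IDs posted under (field, term), or an empty set."""
    return set(index.get((field, term), set()))

def find_truncated(field, prefix):
    """Assumed meaning of 'Alice#': match every term starting with the prefix."""
    hits = set()
    for (f, term), ids in index.items():
        if f == field and term.startswith(prefix):
            hits |= ids
    return hits

# zfind author x and (title y or subject z) not subject A
result = (find("author", "x") & (find("title", "y") | find("subject", "z"))) - find("subject", "a")

# Named result sets kept for reuse in later queries (e.g. "SET1 and ...").
named_sets = {"SET1": result}
```

AND, OR and NOT map onto set intersection, union and difference, and a stored set can be combined with a fresh search result in the same way.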

7 March 2, 2004 Ray R. Larson Probabilistic Retrieval
Uses the Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time
Z39.50 “relevance” operator used to indicate probabilistic search
Any index can have Probabilistic searching performed:
  – zfind “cheshire cats, looking glasses, march hares and other such things”
  – zfind caucus races
Boolean and Probabilistic elements can be combined:
  – zfind government documents and title guidebooks

8 March 2, 2004 Ray R. Larson Probabilistic Retrieval: Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents to determine the values of the coefficients. At retrieval the probability estimate is obtained by:
$$P(R \mid Q, C) = \frac{e^{\log O(R \mid Q, C)}}{1 + e^{\log O(R \mid Q, C)}}, \qquad \log O(R \mid Q, C) = b_0 + \sum_{i=1}^{6} b_i X_i$$
for the 6 X attribute measures shown on the next slide.

9 March 2, 2004 Ray R. Larson Probabilistic Retrieval: Logistic Regression attributes
Average Absolute Query Frequency
Query Length
Average Absolute Component Frequency
Document Length
Average Inverse Component Frequency
Inverse Component Frequency
Number of Terms in common between query and Component -- logged
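As a minimal sketch of how six attribute values and the coefficients b0 to b6 combine into a probability estimate, assuming the standard logistic form (log odds are a linear combination of the attributes), the calculation might look like this. The function name `lr_probability` is hypothetical, and no real coefficient values are implied.

```python
import math

def lr_probability(x, b):
    """Estimate probability of relevance from six attribute values.

    x: sequence of the six attribute measures X1..X6
    b: sequence of seven coefficients b0..b6 (b0 is the intercept)
    """
    if len(x) != 6 or len(b) != 7:
        raise ValueError("expected 6 attributes and 7 coefficients")
    # Log odds of relevance: b0 + b1*X1 + ... + b6*X6
    log_odds = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    # Convert log odds to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Because the transform from log odds to probability is monotonic, ranking by either quantity gives the same order.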

10 March 2, 2004 Ray R. Larson Combining Boolean and Probabilistic Search Elements
Two original approaches:
  – Boolean Approach
  – Non-probabilistic “Fusion Search”
Set merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries

11 March 2, 2004 Ray R. Larson Okapi BM25
$$\sum_{T \in Q} w^{(1)} \, \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$
Where:
Q is a query containing terms T
K is $k_1((1-b) + b \cdot dl/avdl)$
$k_1$, $b$ and $k_3$ are parameters, usually 1.2, 0.75 and …
tf is the frequency of the term in a specific document
qtf is the frequency of the term in a topic from which Q was derived
dl and avdl are the document length and the average document length measured in some convenient unit
$w^{(1)}$ is the Robertson-Sparck Jones weight.
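The definitions on this slide correspond to the usual Okapi BM25 term-weighting sum, which can be sketched as below. This is an illustration, not Cheshire II code. The function names are hypothetical, the Robertson-Sparck Jones weight is computed in its common no-relevance-information form (an assumption), and `k3` is left as a required argument because its usual value is not legible on the slide.

```python
import math

def rsj_weight(n_docs, doc_freq):
    """Robertson-Sparck Jones weight, assuming no relevance information."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25(query_tf, doc_tf, dl, avdl, n_docs, df, *, k3, k1=1.2, b=0.75):
    """Score one document against a query.

    query_tf: {term: qtf}, doc_tf: {term: tf}, df: {term: document frequency}
    dl / avdl: document length and average document length (same unit)
    """
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        w1 = rsj_weight(n_docs, df[term])
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score
```

A call such as `bm25({"cat": 1}, {"cat": 3}, 100, 100, 1000, {"cat": 10}, k3=7.0)` uses an arbitrary `k3` chosen only for the example.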

12 March 2, 2004 Ray R. Larson Merging and Ranking Operators
Extends the capabilities of merging to include merger operations in queries like Boolean operators
Fuzzy Logic Operators (not used for INEX)
  – !FUZZY_AND
  – !FUZZY_OR
  – !FUZZY_NOT
Containment operators: Restrict components to or with a particular parent
  – !RESTRICT_FROM
  – !RESTRICT_TO
Merge Operators
  – !MERGE_SUM
  – !MERGE_MEAN
  – !MERGE_NORM
  – !MERGE_CMBZ
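The slides do not define the merge operators precisely, so the following is a guess at their general shape rather than Cheshire II's actual behaviour. It assumes result sets are `{doc_id: score}` mappings, that normalization is min-max scaling, and that `!MERGE_CMBZ` (described later in the deck only as "a normalized score summation and enhancement") boosts items found in both lists in the style of CombMNZ. All function names are hypothetical.

```python
def normalize(scores):
    """Min-max scale a {doc_id: score} mapping into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def merge_sum(a, b):
    """Add raw scores; a missing score counts as 0."""
    return {d: a.get(d, 0.0) + b.get(d, 0.0) for d in set(a) | set(b)}

def merge_mean(a, b):
    """Average of the two raw scores."""
    return {d: s / 2.0 for d, s in merge_sum(a, b).items()}

def merge_norm(a, b):
    """Sum of normalized scores."""
    return merge_sum(normalize(a), normalize(b))

def merge_cmbz(a, b):
    """Normalized sum, multiplied by the number of lists containing the item."""
    na, nb = normalize(a), normalize(b)
    merged = {}
    for d in set(na) | set(nb):
        hits = (d in na) + (d in nb)
        merged[d] = (na.get(d, 0.0) + nb.get(d, 0.0)) * hits
    return merged
```

Normalizing before summation matters when the inputs come from different scoring functions, such as a logistic regression probability and an Okapi score, whose ranges are not comparable.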

13 March 2, 2004 Ray R. Larson INEX ‘04 Fusion Search
Merge multiple ranked and Boolean index searches within each query and multiple component search result sets
  – Major components merged are Articles, Body, Sections, subsections, paragraphs
[Diagram: Subquery, Subquery → Comp. Query Results, Comp. Query Results → Fusion/Merge → Final Ranked List]

14 March 2, 2004 Ray R. Larson New LR Coefficients
Index | b0 | b1 | b2 | b3 | b4 | b5 | b6
Rows: Base, topic, topicshort, abstract, alltitles, sec words, para words
Estimates using INEX ‘03 relevance assessments for:
b1 = Average Absolute Query Frequency
b2 = Query Length
b3 = Average Absolute Component Frequency
b4 = Document Length
b5 = Average Inverse Component Frequency
b6 = Number of Terms in common between query and Component

15 March 2, 2004 Ray R. Larson SGML/XML Support
Underlying native format for all data is SGML or XML
The DTD defines the file format for each file
Full SGML/XML parsing
SGML/XML Format Configuration Files define the database
USMARC DTD and MARC to SGML conversion (and back again)
Access to full-text via special SGML/XML tags

16 March 2, 2004 Ray R. Larson Indexing
Any SGML/XML tagged field or attribute can be indexed:
  – B-Tree and Hash access via Berkeley DB (Sleepycat)
  – Stemming, keyword, exact keys and “special keys”
  – Mapping from any Z39.50 Attribute combination to a specific index
  – Underlying postings information includes term frequency for probabilistic searching
Component extraction with separate component indexes
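To make the point about postings concrete, here is a small in-memory sketch of a keyword index whose postings carry term frequency. It stands in for the Berkeley DB B-Tree files mentioned on the slide and is not their format. The function names and the simple tokenizer are assumptions, and stemming is omitted.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(records, stoplist=frozenset()):
    """Build {term: {record_id: term_frequency}} from {record_id: field text}."""
    postings = defaultdict(dict)
    for rec_id, text in records.items():
        counts = Counter(t for t in tokenize(text) if t not in stoplist)
        for term, tf in counts.items():
            postings[term][rec_id] = tf
    return dict(postings)
```

Storing the frequency alongside each record ID is what lets the same index serve both Boolean lookups (which need only the IDs) and ranked retrieval (which also needs the counts).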

17 March 2, 2004 Ray R. Larson XML Element Extraction
A new search “ElementSetName” is XML_ELEMENT_
Any Xpath, element name, or regular expression can be included following the final underscore when submitting a present request
The matching elements are extracted from the records matching the search and delivered in a simple format.
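The idea of deriving an element selector from the text after `XML_ELEMENT_` can be sketched with Python's standard library. This is an illustration only: it handles plain element names, relies on the limited XPath subset of `xml.etree.ElementTree`, and does not attempt regular-expression selectors. The function name and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

PREFIX = "XML_ELEMENT_"

def extract_elements(record_xml, element_set_name):
    """Return serialized elements of a record that match the selector."""
    if not element_set_name.startswith(PREFIX):
        raise ValueError("not an XML_ELEMENT_ element set name")
    selector = element_set_name[len(PREFIX):]
    root = ET.fromstring(record_xml)
    # './/name' finds matching descendants at any depth
    matches = root.findall(".//" + selector)
    if root.tag == selector:
        matches.insert(0, root)
    return [ET.tostring(m, encoding="unicode") for m in matches]
```

For a record such as `<USMARC><Fld245><a>Title</a></Fld245></USMARC>`, the element set name `XML_ELEMENT_Fld245` selects just the `Fld245` element.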

18 March 2, 2004 Ray R. Larson XML Extraction
% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML }} { Singularitâes áa Cargáese … etc…

19 March 2, 2004 Ray R. Larson SGML/XML Support
Configuration files for the Server are SGML/XML:
  – They include elements describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

20 March 2, 2004 Ray R. Larson SGML/XML Support
Example XML record for a DL document
ELIB-v June 12, 1996 June 1996 Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada University of California report USDA Forest Service Neil H. Berg Ken B. Roby Bruce J. McGurk SNEP Vol 3 40 /elib/data/docs/0700/756/HYPEROCR/hyperocr.html /elib/data/docs/0700/756/OCR-ASCII-NOZONE

21 March 2, 2004 Ray R. Larson SGML Support
Example SGML/MARC Record
n a m CUBGGLAD1282B nyu eng u (CU)ocm (CU)GLAD1282 Burch, John G. Information systems : theory and practice / John G. Burch, Jr., Felix R. Strater, Gary Grudnitski 3rd ed New York : J. Wiley, 1983 xvi, 632 p. : ill. ; 24 cm Includes bibliographical references and index Management information systems....

22 March 2, 2004 Ray R. Larson SGML/XML Support TREC document… FT _AN-DCPCCAA3FT FT 16 MAR 93 / Italy's Corruption Scandal: Magistrates hold key to unlocking Tangentopoli - They will set the investigation agenda By ROBERT GRAHAM OVER the weekend the Italian media felt obliged to comment on a non-event. No new arrests had taken place in any of the country's ever more numerous corruption scandals which centre on the illicit funding of political parties... …

23 March 2, 2004 Ray R. Larson … Companies:- Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica. Ente Partecipazioni E Finanziamento Industria Manifatturiera. IRI Istituto per La Ricostruzione Industriale. Countries:- ITZ Italy, EC. Industries:- P9222 Legal Counsel and Prosecution. P91 Executive, Legislative and General Government. P13 Oil and Gas Extraction. P9631 Regulation, Administration of Utilities. P6719 Holding Companies, NEC. Types:- …

24 March 2, 2004 Ray R. Larson … CMMT Comment & Analysis. GOVT Legal issues. The Financial Times London Page 4

25 March 2, 2004 Ray R. Larson SGML/XML Support
INEX Document
C /C1050s-2000 COMPUTING IN SCIENCE & ENGINEERING /00/$10.00 © 2000 IEEE Vol. 2 No. 1 JANUARY/FEBRUARY 2000 pp The Decompositional Approach to Matrix Computation pp G.W. Stewart University of Maryland The introduction of matrix decomposition into numerical linear algebra revolutionized matrix computations. This article outlines the decompositional approach, comments on its history, and surveys the six most widely used decompositions. In 1951, Paul S. Dwyer published Linear Computations, perhaps the first book devoted entirely to numerical linear algebra. 1 Digital computing was in its infancy, and Dwyer focused on computation with mechanical calculators. Nonetheless, the book was state of the art. Figure 1 reproduces a page of the book dealing with Gaussian elimination. In 1954, Alston S. Householder published Principles of Numerical Analysis, 2 one of the first modern treatments of high-speed digital computation. Figure 2 reproduces a page from this book, also dealing with Gaussian elimination. 1 This page from Linear Computations shows that Paul Dwyer's approach begins with a system of scalar equations. Courtesy of John Wiley & Sons. 2 On this page from Principles of Numerical Analysis, Alston Householder uses partitioned matrices and LU decomposition. Courtesy of McGraw-Hill. The contrast between these two excerpts is striking. The most obvious difference is that Dwyer used scalar equations whereas Householder used partitioned matrices. …

26 March 2, 2004 Ray R. Larson SGML/XML Support … CONCLUSION The big six are not the only decompositions in use; in fact, there are many more. As mentioned earlier, certain intermediate forms—such as tridiagonal and Hessenberg forms—have come to be regarded as decompositions in their own right. Since the singular value decomposition is expensive to compute and not readily updated, rank-revealing alternatives have received considerable attention. 54, 55 There are also generalizations of the singular value decomposition and the Schur decomposition for pairs of matrices. 56, 57 All crystal balls become cloudy when they look to the future, but it seems safe to say that as long as new matrix problems arise, new decompositions will be devised to solve them. Acknowledgment This work was supported by the National Science Foundation under Grant No References P.S. Dwyer Linear Computations, John Wiley & Sons, New York, A.S. Householder Principles of Numerical Analysis, McGraw-Hill, New York, J.H. Wilkinson and C. Reinsch Handbook for Automatic Computation, Vol. II, Linear Algebra, Springer-Verlag, New York, B.S. Garbow et al., "Matrix Eigensystem Routines—Eispack Guide Extension," Lecture Notes in Computer Science, Springer-Verlag, New York, J.J. Dongarra et al., LINPACK User's Guide, SIAM, Philadelphia, …

27 March 2, 2004 Ray R. Larson SGML/XML Support
INEX CAS Query
/article[about(./fm/abs,'"information retrieval" "digital libraries"')]
Retrieve articles with an abstract indicating the article is about information retrieval and/or digital libraries
To be relevant the retrieved articles must be about information retrieval, digital libraries or, preferably, both. Articles about information retrieval from digital libraries will receive the highest relevance judgements.
information retrieval,digital libraries

28 March 2, 2004 Ray R. Larson SGML/XML Support
Configuration files for the Server are also SGML/XML:
  – They include tags describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

29 March 2, 2004 Ray R. Larson Cheshire Configuration Files /projects/is240/GroupX/indexes /projects/is240/GroupX trec /projects/is240/ft /projects/is240/ft.CONT /projects/is240/TREC.FT.DTD ft.assoc cheshire_index/TESTDATA.history …

30 March 2, 2004 Ray R. Larson Indexing
Any SGML/XML tagged field or attribute can be indexed:
  – B-Tree and Hash access via Berkeley DB (Sleepycat)
  – Stemming, keyword, exact keys and “special keys”
  – Mapping from any Z39.50 Attribute combination to a specific index
  – Underlying postings information includes term frequency for probabilistic searching.
  – SGML may include address of full-text for indexing
New indexes can be easily added, or old ones deleted

31 March 2, 2004 Ray R. Larson Bitmapped Indexes
Bitmap indexes can be used for Boolean operations where the data has only a few values and very large numbers of items with each value
Only one bit per record stored in the index
Processed on a demand basis so only blocks with the bits needed to resolve a query are fetched
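A bitmap index of the kind described here can be sketched with one integer per distinct value, where bit i is set when record i has that value. This is a simplification: the class name and methods are hypothetical, and the block-at-a-time, on-demand fetching mentioned on the slide is not modelled.

```python
class BitmapIndex:
    """One bitmap per distinct value; Python ints serve as bit arrays."""

    def __init__(self):
        self.bitmaps = {}
        self.n_records = 0

    def add(self, record_no, value):
        """Set the bit for record_no in the bitmap of the given value."""
        self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << record_no)
        self.n_records = max(self.n_records, record_no + 1)

    def lookup(self, value):
        """Return the bitmap for a value (0 if the value never occurs)."""
        return self.bitmaps.get(value, 0)

    def negate(self, bitmap):
        """Boolean NOT, limited to the records that exist."""
        return ~bitmap & ((1 << self.n_records) - 1)

    def records(self, bitmap):
        """List the record numbers whose bits are set."""
        return [i for i in range(self.n_records) if (bitmap >> i) & 1]
```

Boolean AND and OR then become the bitwise `&` and `|` of two bitmaps, which is why this layout suits fields with few distinct values and many records per value.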

32 March 2, 2004 Ray R. Larson cheshire_index/trec.docno.index docno DOCNO …

33 March 2, 2004 Ray R. Larson cheshire_index/trec.topic.index topic … cheshire_index/topicstoplist HEADLINE DATELINE BYLINE TEXT

34 March 2, 2004 Ray R. Larson Cheshire II – EVI Generation
Entry Vocabulary Indexes can improve access to data with controlled index terms
Define basis for clustering records.
  – Select field to form the basis of the cluster.
  – Evidence Fields to use as contents of the pseudo-documents.
During indexing cluster keys are generated with basis and evidence from each record.
Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields.
Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
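The cluster-key steps above (generate keys, sort and merge on the basis, build one pseudo-document per basis value) can be sketched as follows. Records are modelled as plain dictionaries, and the field names in the usage note are invented; none of this reflects Cheshire II's internal data structures.

```python
from collections import defaultdict

def build_pseudo_documents(records, basis_field, evidence_fields):
    """Return {basis value: combined evidence text} for a list of records."""
    # Step 1: one (basis, evidence) cluster key per record
    keys = []
    for rec in records:
        basis = rec.get(basis_field)
        if not basis:
            continue  # records without a basis value form no cluster key
        evidence = " ".join(rec[f] for f in evidence_fields if rec.get(f))
        keys.append((basis, evidence))
    # Step 2: sort, then merge all evidence sharing the same basis
    keys.sort()
    merged = defaultdict(list)
    for basis, evidence in keys:
        merged[basis].append(evidence)
    # Step 3: each unique basis value becomes one pseudo-document
    return {basis: " ".join(parts) for basis, parts in merged.items()}
```

With a classification field as the basis and title and subject fields as evidence, each class number ends up with a pseudo-document holding the vocabulary of every record assigned to it, and that text is what gets indexed.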

35 March 2, 2004 Ray R. Larson EVI/Cluster Definitions classcluster FLD950 ^a /usr3/cheshire2/data2/clasclusstoplist FLD245 ^[ab] FLD440 ^a FLD490 ^a FLD830 ^a FLD740 ^a titles FLD6.. ^[abcdxyz] subjects 5 subjsum

36 March 2, 2004 Ray R. Larson Component Extraction and Indexing
Any element (or range of SGML/XML data starting with one element and ending with another) can be defined as a ‘component’ and accessed and indexed as if it were an entire document.
Component indexes and document-level indexes can be combined in search operations (and special operators permit selection of document or components as the result)
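A minimal sketch of treating an element as a component: each matching element is pulled out of its parent document and given its own identifier and text, so it can be indexed like a document. The function name, the dictionary layout and the sample tags are assumptions for illustration, and only single-element components (not start/end ranges) are handled.

```python
import xml.etree.ElementTree as ET

def extract_components(doc_id, doc_xml, component_tag):
    """Return one entry per element named component_tag, in document order."""
    root = ET.fromstring(doc_xml)
    components = []
    for seq, elem in enumerate(root.iter(component_tag), start=1):
        # Collapse all text inside the element into one whitespace-normalized string
        text = " ".join(" ".join(elem.itertext()).split())
        components.append({"parent": doc_id, "seq": seq, "text": text})
    return components
```

Keeping the parent document ID with each component is what allows component-level hits to be mapped back to, or restricted by, document-level results.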

37 March 2, 2004 Ray R. Larson Component Definitions TESTDATA/COMPONENT_DB1 NONE mainenty titles Fld300 TESTDATA/comp1index1.author …

38 March 2, 2004 Ray R. Larson Result Formatting (Display) KEEP_ENTITIES DOCNO 28 #DOCID# 5

39 March 2, 2004 Ray R. Larson INEX Configuration Example /projects/metadata/cheshire/TREC/cheshire_index /projects/metadata/cheshire/INEX INEX inex-1.3/xml inex-1.3/xml_main.cont inex-1.3/dtd/wrapper.dtd inex-1.3/dtd/catalog inex-1.3/xml_main.assoc inex.history

40 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/docno.index docno doi

41 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/pauthor.index pauthor indexes/authorstoplist fm au snm fm au fnm

42 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/title.index title indexes/titlestoplist fm tig atl

43 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/topic.index topic … indexes/topicstoplist fm tig atl abs bdy bibl bb atl app

44 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/date.index date hdr2 yr

45 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/journal.index journal hdr1 ti

46 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/keywords.index kwd indexes/topicstoplist kwd

47 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/abstract.index abstract indexes/topicstoplist abs

48 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/author_seq.index author_seq fm au sequence

49 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/bib_author_fnm.index bib_author_fnm bb au fnm

50 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/bib_author_snm.index bib_author_snm bb au snm

51 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/fig.index fig indexes/topicstoplist fig

52 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/ack.index ack indexes/topicstoplist ack

53 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/alltitles.index alltitles indexes/titlestoplist atl st

54 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/affil.index affil indexes/titlestoplist fm aff

55 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/fno.index fno fno

56 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/figno.index figno fig no

57 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/topicshort.index topicshort fm tig atl abs kwd st

58 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/COMPONENT_SECTION NONE sec indexes/sec_title2.index sec_title indexes/titlestoplist sec st

59 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/sec_words.index sec_words indexes/topicstoplist sec

60 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/COMPONENT_BIB NONE bm bib bibl bb indexes/bib_author.index bib_author au

61 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/bib_title.index bib_title atl

62 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/bib_date.index bib_date pdt yr

63 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/COMPONENT_PARAS NONE ^ilrj$|^ip1$|^ip2$|^ip3$|^ip4$|^ip5$|^item-none$|^p$|^p1$|^p2$|^p3$|^tmath$|^tf$ indexes/para_words.index para_words indexes/topicstoplist.*

64 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/COMPONENT_FIG NONE fig indexes/fig_caption.index fig_caption indexes/titlestoplist fgc

65 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/COMPONENT_VITAE NONE vt indexes/vitae_words.index vt_vitae indexes/titlestoplist vt

66 March 2, 2004 Ray R. Larson INEX Configuration Example KEEP_ENTITIES doi 28 #DOCID# 5 #DBNAME# …

67 March 2, 2004 Ray R. Larson INEX Configuration Example #FILENAME# FILENAME #RANK# RANK … #RAWSCORE# RAWSCORE SUBST_ELEMENT SUBST_ELEMENT

68 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/COMPONENT_PARAS NONE ^ilrj$|^ip1$|^ip2$|^ip3$|^ip4$|^ip5$|^item-none$|^p$|^p1$|^p2$|^p3$|^tmath$|^tf$ indexes/para_words.index para_words indexes/topicstoplist.*

69 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/COMPONENT_PARAS NONE ^ilrj$|^ip1$|^ip2$|^ip3$|^ip4$|^ip5$|^item-none$|^p$|^p1$|^p2$|^p3$|^tmath$|^tf$ indexes/para_words.index para_words indexes/topicstoplist.*

70 March 2, 2004 Ray R. Larson INEX Configuration Example indexes/COMPONENT_PARAS NONE ^ilrj$|^ip1$|^ip2$|^ip3$|^ip4$|^ip5$|^item-none$|^p$|^p1$|^p2$|^p3$|^tmath$|^tf$ indexes/para_words.index para_words indexes/topicstoplist.*

71 March 2, 2004 Ray R. Larson XML Schemas and Element Retrieval

72 March 2, 2004 Ray R. Larson XML Schema Support
XML Schemas or DTDs can be used to define the data contents
Tested with a wide variety of schemas including METS (with various supporting schemas)

73 March 2, 2004 Ray R. Larson XML Element Extraction
A new search “ElementSetName” is XML_ELEMENT_
Any Xpath, element name, or regular expression can be included following the final underscore when submitting a present request (note that only a subset of full Xpath is available)
The matching elements are extracted from the records matching the search and delivered in a simple format.

74 March 2, 2004 Ray R. Larson XML Extraction
% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML }} { Singularitâes áa Cargáese … etc…

75 March 2, 2004 Ray R. Larson Database Storage
All data stored as SGML/XML flat text files plus optional linked full-text (non-XML) files
File format is defined through SGML/XML DTD (also flat text file) or Schema
“Associator” files provide indexed direct access to each record in SGML/XML files.
  – Contain offset and record length for each “record”
  – Associators can be built to index any conformant document in a directory sub-tree
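The associator idea (an offset and a length per record, giving direct access into a flat file) can be sketched as below. This is not the Cheshire II associator file format. It assumes records are delimited by a known closing tag and uses an in-memory list in place of an on-disk file, and the function names are hypothetical.

```python
import io

def build_associator(stream, end_tag):
    """Scan a binary stream of concatenated records; return [(offset, length)]."""
    data = stream.read()
    entries = []
    start = 0
    while True:
        end = data.find(end_tag, start)
        if end == -1:
            break
        end += len(end_tag)
        # Skip whitespace separating this record from the previous one
        while start < end and data[start:start + 1].isspace():
            start += 1
        entries.append((start, end - start))
        start = end
    return entries

def fetch(stream, entries, record_no):
    """Read one record directly, without scanning the records before it."""
    offset, length = entries[record_no]
    stream.seek(offset)
    return stream.read(length)
```

Once the entries exist, retrieving record n costs one seek and one read regardless of how large the data file is.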

76 March 2, 2004 Ray R. Larson INEX CO Runs
Three official, one later run - all Title-only
  – Fusion - Combines Okapi and LR using the MERGE_CMBZ operator
  – NewParms (LR) - Using only LR with the new parameters
  – Feedback - An attempt at blind relevance feedback
  – PostFusion - Fusion of the new LR coefficients and Okapi

77 March 2, 2004 Ray R. Larson Query Generation - CO
# 162 TITLE = Text and Index Compression Algorithms
QUERY: {Text and Index Compression Algorithms}) !MERGE_CMBZ {Text and Index Compression Algorithms}) !MERGE_CMBZ {Text and Index Compression Algorithms}) !MERGE_CMBZ {Text and Index Compression Algorithms})
!MERGE_CMBZ is a normalized score summation and enhancement

78 March 2, 2004 Ray R. Larson INEX CO Runs Generalized Strict Avg Prec FUSION = NEWPARMS = FDBK = POSTFUS = Avg Prec FUSION = NEWPARMS = FDBK = POSTFUS =

79 March 2, 2004 Ray R. Larson INEX VCAS Runs
Two official runs
  – FUSVCAS - Element fusion using LR and various operators for path restriction
  – NEWVCAS - Using the new LR coefficients for each appropriate index and various operators for path restriction

80 March 2, 2004 Ray R. Larson Query Generation - VCAS
#66 TITLE = //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)]
Submitted query = {intelligent transport systems})) !RESTRICT_FROM {on-board route planning navigation system for automobiles}))
Target elements: sec|ss1|ss2|ss3

81 March 2, 2004 Ray R. Larson VCAS Results GeneralizedStrict Avg Prec FUSVCAS = NEWVCAS = Avg Prec FUSVCAS = NEWVCAS =

82 March 2, 2004 Ray R. Larson Heterogeneous Track
Approach using Cheshire’s Virtual Database options
  – Primarily a version of distributed IR
  – Each collection indexed separately
  – Search via Z39.50 distributed queries
  – Z39.50 Attribute mapping used to map query indexes to appropriate elements in a given collection
  – Only LR used and collection results merged using probability of relevance for each collection result
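Merging per-collection results by probability of relevance can be sketched as a single sort across all collections. The sketch assumes each collection returns `(record_id, probability)` pairs whose probabilities are directly comparable. The function name and data layout are hypothetical.

```python
def merge_by_probability(results_by_collection, limit=None):
    """Merge {collection: [(record_id, probability), ...]} into one ranked list.

    Returns a list of (collection, record_id, probability), highest first.
    """
    merged = [
        (collection, record_id, prob)
        for collection, hits in results_by_collection.items()
        for record_id, prob in hits
    ]
    # Stable sort: ties keep the order in which collections were supplied
    merged.sort(key=lambda item: item[2], reverse=True)
    return merged if limit is None else merged[:limit]
```

This only works as a merge strategy because every collection is scored with the same logistic regression model, so the estimates share a scale.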

83 March 2, 2004 Ray R. Larson Heterogeneous Track Issues
Very large “Documents”
  – Our approach was to segment
Reporting Xpath after segmenting large documents

84 March 2, 2004 Ray R. Larson Database Storage
[Diagram of storage files: Config File, DTD File, SGML/XML File, Associator File(s), History File, Cluster File, Index File(s), Postings File, Page Data File, Prox data File, Remote RDBMS]

85 March 2, 2004 Ray R. Larson Client/Server Architecture
Server Supports:
  – Database storage
  – Indexing
  – Z39.50 access to local data
  – Boolean and Probabilistic Searching
  – Relevance Feedback
  – External SQL database support
Client Supports:
  – Programmable (Tcl/Tk – Python soon) Graphical User Interface
  – Z39.50 access to remote servers
  – SGML & MARC formatting
Combined Client/Server CGI scripting via WebCheshire

86 March 2, 2004 Ray R. Larson Z39.50 Overview
[Diagram: UI, Map Query / Map Results, Internet, Map Query / Map Results, Search Engine]

87 March 2, 2004 Ray R. Larson Two Protocols: HTTP & Z39.50

88 March 2, 2004 Ray R. Larson Server Z39.50 Support
Locally developed Z39.50 Library
Extended version 3 support
  – Supports version 3 attributes in BIB-1 including “stem”, “relevance”, etc. Also adding support for “type 102” ranked queries (version 4)
Can provide MARC, SUTRS and SGML records, with support for Explain and GRS-1 conversion of any SGML records

89 March 2, 2004 Ray R. Larson Distributed Search

90 March 2, 2004 Ray R. Larson The Problem
The Digital Library vision -- Access to everyone for “all human knowledge”
Lyman and Varian’s estimates of the “Dark Web”
Hundreds or Thousands of servers with databases ranging widely in content, topic, format
  – Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
  – How to select the “best” ones to search?
      Which resource to search first?
      Which to search next if more is wanted?
  – Topical/domain constraints on the search selections
  – Variable contents of database (metadata only, full text, multimedia…)

91 March 2, 2004 Ray R. Larson Distributed Search Tasks
Resource Description
  – How to collect metadata about digital libraries and their collections or databases
Resource Selection
  – How to select relevant digital library collections or databases from a large number of databases
Distributed Search
  – How to perform parallel or sequential searching over the selected digital library databases
Data Fusion
  – How to merge query results from different digital libraries with their different search engines, differing record structures, etc.

92 March 2, 2004 Ray R. Larson An Approach for Distributed Resource Discovery
Distributed resource representation and discovery
  – New approach to building resource descriptions based on Z39.50
  – Instead of using broadcast search across resources we are using two Z39.50 Services
      Identification of database metadata using Z39.50 Explain
      Extraction of distributed indexes using Z39.50 SCAN
Evaluation
  – How efficiently can we build distributed indexes?
  – How effectively can we choose databases using the index?
  – How effective is merging search results from multiple sources?
  – Can we build hierarchies of servers (general/meta-topical/individual)?

93 March 2, 2004 Ray R. Larson Z39.50 Explain
Explain supports searches for
– Server-Level metadata
  – Server Name
  – IP Addresses
  – Ports
– Database-Level metadata
  – Database name
  – Search attributes (indexes and combinations)
– Support metadata (record syntaxes, etc.)

94 March 2, 2004 Ray R. Larson Z39.50 SCAN
Originally intended to support Browsing
Query for
– Database
– Attributes plus Term (i.e., index and start point)
– Step Size
– Number of terms to retrieve
– Position in Response set
Results
– Number of terms returned
– List of Terms and their frequency in the database (for the given attribute combination)

95 March 2, 2004 Ray R. Larson Z39.50 SCAN Results
% zscan title cat
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 27} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8} {catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3} {catch-up 2} …
% zscan topic cat
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} …
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos
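A harvesting script has to turn responses like the ones above into term/frequency pairs. The following is a minimal sketch of one way to do that, assuming the response arrives as a single string in the brace-delimited form shown on the slide. The function name `parse_zscan` is hypothetical and is not part of the Cheshire client; terms containing spaces are not handled.

```python
import re

def parse_zscan(response):
    """Split a zscan response into its header fields and (term, frequency) pairs."""
    # The response starts with a {SCAN ...} group holding fields such as {Status 0}.
    header_match = re.match(r"\s*\{SCAN((?:\s*\{\w+ \d+\})*)\s*\}", response)
    if header_match is None:
        raise ValueError("response does not start with a {SCAN ...} header")
    header = {name: int(value)
              for name, value in re.findall(r"\{(\w+) (\d+)\}", header_match.group(1))}
    # Everything after the header is a run of {term frequency} groups.
    body = response[header_match.end():]
    terms = [(term, int(freq))
             for term, freq in re.findall(r"\{(\S+) (\d+)\}", body)]
    return header, terms

# Sample copied from the first few entries of the title-index scan on the slide.
sample = ("{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} "
          "{cat 27} {cat-fight 1} {catalan 19} {catalogu 37}")
header, terms = parse_zscan(sample)
```

Here `header` holds the four status fields and `terms` holds the pairs in scan order, ready to be added to a collection representative.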

96 March 2, 2004 Ray R. Larson Resource Index Creation
For all servers, or a topical subset…
– Get Explain information
– For each index
  – Use SCAN to extract terms and frequency
  – Add term + freq + source index + database metadata to the XML “Collection Document” for the resource
– Planned extensions:
  – Post-process indexes (especially Geo Names, etc.) for special types of data
    – e.g. create “geographical coverage” indexes
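The loop above can be sketched in a few lines. This is an illustration only: the slides do not give the schema of the XML “Collection Document”, so the element and attribute names below (`collection`, `index`, `term`, `freq`) are assumptions, as are the function name and the sample server and database names.

```python
import xml.etree.ElementTree as ET

def build_collection_document(server, database, scanned_indexes):
    """Build one XML document describing a remote database from its scanned indexes.

    scanned_indexes maps an index name to a list of (term, frequency) pairs,
    for example the output of SCAN over that index.
    """
    # Database-level metadata (here just server and database name) goes on the root.
    root = ET.Element("collection", {"server": server, "database": database})
    for index_name, term_freqs in scanned_indexes.items():
        # Keep the source index so terms can be searched per index later.
        index_el = ET.SubElement(root, "index", {"name": index_name})
        for term, freq in term_freqs:
            term_el = ET.SubElement(index_el, "term", {"freq": str(freq)})
            term_el.text = term
    return root

# Hypothetical input, using frequencies from the SCAN example slide.
doc = build_collection_document(
    "example.org", "demo",
    {"title": [("cat", 27), ("catalan", 19)], "topic": [("cat", 706)]},
)
xml_text = ET.tostring(doc, encoding="unicode")
```

Keeping one document per remote database means the metasearch server can index these documents with its ordinary XML indexing machinery.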

97 March 2, 2004 Ray R. Larson MetaSearch Approach
[Diagram: a MetaSearch Server with a Distributed Index and its own Search Engine (Db 5, Db 6) sends mapped Explain and Scan queries over the Internet to remote Search Engines (DB 1, DB 2 and DB 3, DB 4); each remote engine has Map Query and Map Results stages.]

98 March 2, 2004 Ray R. Larson Known Issues and Problems
Not all Z39.50 Servers support SCAN or Explain
Solutions that appear to work well:
– Probing for attributes instead of Explain (e.g. DC attributes or analogs)
– We also support OAI and can extract OAI metadata for servers that support OAI
– Query-based sampling (Callan)
Collection Documents are static and need to be replaced when the associated collection changes

99 March 2, 2004 Ray R. Larson Evaluation
Test Environment
– TREC Tipster data (approx. 3 GB)
– Partitioned into 236 smaller collections based on source and date by month (no DOE)
  – High size variability (from 1 to thousands of records)
  – Same database as used in other distributed search studies by J. French and J. Callan among others
– Used TREC topics for evaluation (these are the only topics with relevance judgements for all 3 TIPSTER disks)

100 March 2, 2004 Ray R. Larson Harvesting Efficiency
Tested using the databases on the previous slide + the full FT database (210,158 records ~ 600 MB)
Average of seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network
Average of seconds
Also tested larger databases (e.g. TREC FT database ~600 MB with 7 indexes was harvested in 131 seconds)

101 March 2, 2004 Ray R. Larson Our Collection Ranking Approach
We attempt to estimate the probability of relevance for a given collection with respect to a query using the Logistic Regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time
Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (i.e., fusion of multiple query results)

102 March 2, 2004 Ray R. Larson Probabilistic Retrieval: Logistic Regression
Probability of relevance for a given index is based on logistic regression from a sample set of documents (TREC) to determine values of the coefficients. At retrieval the probability estimate is obtained by:

103 March 2, 2004 Ray R. Larson Probabilistic Retrieval: Logistic Regression attributes
Average Absolute Query Frequency
Query Length
Average Absolute Collection Frequency
Collection size estimate
Average Inverse Collection Frequency
Inverse Document Frequency
(N = Number of collections, M = Number of Terms in common between query and document)
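In logistic regression generally, the log odds of relevance are a linear combination of the attribute values, log O(R) = c0 + c1·X1 + … + ck·Xk, and the probability follows as P(R) = 1 / (1 + e^(−log O(R))). The sketch below shows only that generic computation. The exact attribute definitions and the fitted coefficients are not in this transcript, so the function takes them as inputs and the numbers in the example are made up.

```python
import math

def relevance_probability(coefficients, attributes):
    """Return P(relevant) from an intercept plus one coefficient per attribute.

    coefficients: [c0, c1, ..., ck], where c0 is the intercept.
    attributes:   [X1, ..., Xk], statistics computed for one query/collection pair.
    """
    intercept, *weights = coefficients
    if len(weights) != len(attributes):
        raise ValueError("need exactly one coefficient per attribute, plus an intercept")
    log_odds = intercept + sum(w * x for w, x in zip(weights, attributes))
    # Numerically stable logistic function: avoid exp() of a large positive number.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

# Made-up coefficients and attribute values, for illustration only.
p = relevance_probability([-1.0, 0.5, 0.25], [2.0, 4.0])
```

Collections are then ranked by this probability; with the values above the log odds are 1.0.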

104 March 2, 2004 Ray R. Larson Evaluation
Effectiveness
– Tested using the collection representatives described above (as harvested from over the network) and the TIPSTER relevance judgements
– Testing by comparing our approach to known algorithms for ranking collections
– Results were measured against reported results for the Ideal and CORI algorithms and against the optimal “Relevance Based Ranking” (MAX)
– Recall analog (how many of the relevant documents occurred in the top n databases, averaged)
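The recall analog in the last bullet can be written down directly: for one query, it is the share of all relevant documents that sit in the top n ranked collections, and the reported figure averages that share over queries. A minimal sketch follows, with hypothetical function names and collection identifiers; the slides do not say how queries with no relevant documents were treated, so scoring them as 0 is an assumption.

```python
def recall_at_n(ranked_collections, relevant_counts, n):
    """Share of a query's relevant documents held by the top-n ranked collections.

    ranked_collections: collection ids, best first, as produced by a ranking method.
    relevant_counts:    collection id -> number of relevant documents it holds.
    """
    total = sum(relevant_counts.values())
    if total == 0:
        return 0.0  # assumption: a query with no relevant documents scores 0
    found = sum(relevant_counts.get(c, 0) for c in ranked_collections[:n])
    return found / total

def mean_recall_at_n(runs, n):
    """Average recall_at_n over (ranking, relevant_counts) pairs, one pair per query."""
    scores = [recall_at_n(ranking, counts, n) for ranking, counts in runs]
    return sum(scores) / len(scores)

# Hypothetical query: 10 relevant documents spread over three collections.
ranking = ["ap88-01", "wsj90-03", "ft91-05"]
counts = {"ap88-01": 6, "ft91-05": 3, "zf-02": 1}
top1 = recall_at_n(ranking, counts, 1)   # 6 of 10 relevant documents
top3 = recall_at_n(ranking, counts, 3)   # 9 of 10 relevant documents
```

An ideal ranking orders collections by their entry in `counts`, which gives the upper bound that other rankings are compared against.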

105 March 2, 2004 Ray R. Larson Titles only (short query)

106 March 2, 2004 Ray R. Larson Future
Logically clustering servers by topic
Meta-Meta Servers (treating the MetaSearch database as just another database)

107 March 2, 2004 Ray R. Larson Distributed Metadata Servers
[Diagram labels: Replicated servers; Meta-Topical Servers; General Servers; Database Servers]

108 March 2, 2004 Ray R. Larson Geographic Operators and Search Ranking

109 March 2, 2004 Ray R. Larson The GEO Operations
Operators established for the GEO Z39.50 profile
Implemented using special operations on indexes
Indexing allows extraction of geographic coordinates and dates from SGML/XML data in a variety of formats
Normalized internal representation in indexes
Search using geographic and time elements as primary or limiting search elements

110 March 2, 2004 Ray R. Larson The GEO Operations
X-based interfaces permit (simple) map drawing and search
Interface to MapServer for web-based map searching

111 March 2, 2004 Ray R. Larson GEO Geographic operators
Operator  Name              Meaning
>=#<      Fully Enclosed    Data fully enclosed in search region
<#>       Encloses          Data fully encloses search region
<>#       Fully Outside     Data outside of search region
++        Near              Data is near search region
:<:       Before            Data date is before search date
:<=:      Before or During  Data date is before or during search date
:>=:      During or After   Data date is during or after search date
:>:       After             Data date is after search date
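The three region operators have simple geometric meanings when both the data and the search region are bounding boxes. The sketch below illustrates those meanings only; it is not Cheshire's index-level implementation, the `Box` type and function names are hypothetical, and boxes that cross the 180° meridian are not handled.

```python
from typing import NamedTuple

class Box(NamedTuple):
    """Axis-aligned bounding box in decimal degrees."""
    west: float
    south: float
    east: float
    north: float

def contains(outer, inner):
    """True when inner lies entirely within outer (shared edges allowed)."""
    return (outer.west <= inner.west and inner.east <= outer.east and
            outer.south <= inner.south and inner.north <= outer.north)

def fully_enclosed(data, region):
    """'Fully Enclosed': the data box is inside the search region."""
    return contains(region, data)

def encloses(data, region):
    """'Encloses': the data box surrounds the whole search region."""
    return contains(data, region)

def fully_outside(data, region):
    """'Fully Outside': the data box and the search region do not touch."""
    return (data.east < region.west or data.west > region.east or
            data.north < region.south or data.south > region.north)

# Illustrative boxes: a rough box around California and two data footprints.
region = Box(west=-125.0, south=32.0, east=-114.0, north=42.0)
bay_area = Box(west=-123.0, south=37.0, east=-122.0, north=38.0)
new_york = Box(west=-75.0, south=40.0, east=-73.0, north=41.0)
```

With these values `fully_enclosed(bay_area, region)` holds and `fully_outside(new_york, region)` holds. The date operators in the table reduce to ordinary comparisons on normalized dates.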

112 March 2, 2004 Ray R. Larson Overlaps search

113 March 2, 2004 Ray R. Larson Fully Enclosed Search

114 March 2, 2004 Ray R. Larson Map-Based Search

115 March 2, 2004 Ray R. Larson GeoSearch Web Interface

116 March 2, 2004 Ray R. Larson MySQL and PostgreSQL

117 March 2, 2004 Ray R. Larson RDBMS Support
There are two reasons for RDBMS support
– IR systems are not meant for LOTS of update transactions
– Some applications need to have access to both relational data and text data via Z39.50
Both MySQL and PostgreSQL are popular open source RDBMSs, and either can now be used via Cheshire
– Z39.50 mappings to RDBMS columns
– “ZQL” submission of SQL as Z39.50 Type 0 query
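The idea of mapping Z39.50 search attributes to RDBMS columns can be illustrated with a lookup table from Bib-1 Use attribute numbers to column names. This is a sketch under stated assumptions, not Cheshire's gateway code: the table layout, the mapping dictionary, and the function names are hypothetical, and Python's built-in `sqlite3` stands in for MySQL or PostgreSQL so the example runs without a server.

```python
import sqlite3

# Hypothetical mapping from Bib-1 Use attributes to columns (4 = Title, 1003 = Author).
USE_ATTRIBUTE_TO_COLUMN = {4: "title", 1003: "author"}

def build_sql(use_attribute):
    """Translate a Use attribute into a parameterized SQL query on the mapped column."""
    # Column names come only from the fixed mapping, never from user input.
    column = USE_ATTRIBUTE_TO_COLUMN[use_attribute]
    return f"SELECT id, title, author FROM records WHERE {column} LIKE ? ORDER BY id"

def search(conn, use_attribute, term):
    """Run a substring search for term against the column mapped to use_attribute."""
    return conn.execute(build_sql(use_attribute), (f"%{term}%",)).fetchall()

# Small in-memory table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO records (title, author) VALUES (?, ?)",
    [("Cat Behavior", "Smith"), ("Catalan Grammar", "Jones"), ("Dog Training", "Catlin")],
)
title_hits = search(conn, 4, "cat")
author_hits = search(conn, 1003, "cat")
```

The same search term lands on different rows depending on the attribute: the title search matches the first two rows, the author search only the third.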

118 March 2, 2004 Ray R. Larson Protocol Support

119 March 2, 2004 Ray R. Larson Protocols
In Cheshire II most protocols (except Z39.50) are implemented using scripting
Example scripts to support the following are included in the distribution
– OAI
– SRW (Python version)
– SOAP
– SDLIP

120 March 2, 2004 Ray R. Larson Cheshire III Design and Development

121 March 2, 2004 Ray R. Larson Cheshire III Goals
Retain or reproduce (and refine) all Cheshire II features
– “Spring cleaning” of code base
– Add full Unicode support
– Store most system and content data in the database
Permit easy and efficient integration in Web Services
Use threaded server for economy of resource usage
Enhanced multiprotocol support
Support for distributed processing (i.e. GRID clusters)
Enhance expandability and “drop in” functionality
Interfaces and/or APIs for Java, Python, C/C++

122 March 2, 2004 Ray R. Larson Cheshire II Design Overview
[Diagram labels: XML DOCS; XML DIRECTORY; INDEX; CLUSTER INDEX; CHESHIRE; CONT; BUILD ASSOC; Z SERVER; CONFIG; COMPONENT DEFINITION; INDEX(S); ASSOC; CLUSTER; EXTENSION]

123 March 2, 2004 Ray R. Larson Cheshire III Server Overview
[Architecture diagram of the Cheshire III SERVER. Legible labels: API; Indexing; Transforms; Search; Protocol Handler; DB API; Remote Systems (any protocol); XML Config & Metadata Info; Indexes; Local DB; Staff UI; Config; Network; Result Sets; Scan; User Info; Config & Control; Access Info; Authentication; Clustering; Native calls; Z39.50; SOAP; OAI; JDBC; Fetch ID; Put ID; OpenURL; Apache Interface; Server Control; UDDI; WSRP; SRW; Normalization; Client; User/Clients; OGIS]

125 March 2, 2004 Ray R. Larson Retain Features
The intent is to permit all of the types of indexing, searching and record formatting available now, while making it easier to add new capabilities
The new system will also support full UNICODE for content and for metadata
Store metadata and content in the database (including config information, etc.)

126 March 2, 2004 Ray R. Larson Permit easy integration of Web Services
The assumption is that the web server will be the central server mechanism in the future.
The new design relies on the session handling, threading and load management tools available in Apache
The Cheshire server is dynamically loaded as part of the Web Server

127 March 2, 2004 Ray R. Larson Multiprotocol Support
The Web server handles the network issues and passes requests in various protocols along to the Cheshire Server.
Individual Protocol “plugins” and the Protocol Handler convert search, display, and metadata requests in a particular protocol to the internal Cheshire III control language, and convert outgoing messages and data to the appropriate protocol form
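The plugin arrangement described above can be pictured as a registry of converters, one per protocol, that all produce the same internal request type. The sketch below is an illustration of that pattern only. The slides do not describe the Cheshire III control language or its plugin API, so `InternalRequest`, the `plugin` decorator, and both toy protocols are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict
from urllib.parse import parse_qs

@dataclass
class InternalRequest:
    """Stand-in for a request in the server's internal control language."""
    operation: str
    index: str
    term: str

# Registry of protocol name -> converter from a raw request to InternalRequest.
PLUGINS: Dict[str, Callable[[object], InternalRequest]] = {}

def plugin(protocol):
    """Decorator that registers a converter for one protocol."""
    def register(func):
        PLUGINS[protocol] = func
        return func
    return register

@plugin("query_string")
def from_query_string(raw):
    # Toy protocol 1: a URL query string such as "index=title&term=cat".
    fields = parse_qs(raw)
    return InternalRequest("search", fields["index"][0], fields["term"][0])

@plugin("pair")
def from_pair(raw):
    # Toy protocol 2: an (index, term) tuple.
    index, term = raw
    return InternalRequest("search", index, term)

def handle(protocol, raw):
    """Protocol handler: pick the plugin and convert the request."""
    if protocol not in PLUGINS:
        raise ValueError(f"no plugin registered for protocol {protocol!r}")
    return PLUGINS[protocol](raw)

request_a = handle("query_string", "index=title&term=cat")
request_b = handle("pair", ("title", "cat"))
```

Both calls yield the same internal request, so the search code behind the handler never needs to know which protocol the client spoke. Supporting a new protocol means registering one more converter.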

128 March 2, 2004 Ray R. Larson Distributed & GRID Processing
The server will support protocols for interchange of partial results and collection statistics with a single “Master” controlling the actions of a large number of “Slave” servers
These will run in parallel in a GRID environment
This is still “research” but will probably be using “Storage Grid” technology from SDSC with our own applications
Non-Grid use of the same protocols, etc. will be possible (but definitely slower)
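One step the controlling server has to perform is combining the partial result lists returned by the worker servers into a single ranking. The sketch below shows that merge step only, assuming each worker returns (score, record id) pairs already sorted best first and that scores are comparable across workers, which is the reason collection statistics need to be exchanged. The function name and record ids are hypothetical, and nothing here reflects the actual Cheshire III protocols.

```python
import heapq
from itertools import islice

def merge_partial_results(partials, limit):
    """Merge per-worker result lists (each sorted by descending score) into the top hits."""
    # heapq.merge streams the inputs lazily, so only `limit` items are materialized.
    merged = heapq.merge(*partials, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))

# Hypothetical partial results from two workers.
worker_a = [(0.91, "a:12"), (0.40, "a:7")]
worker_b = [(0.75, "b:3"), (0.10, "b:9")]
top_hits = merge_partial_results([worker_a, worker_b], limit=3)
```

Because each input is already sorted, the merge costs time proportional to the number of hits returned rather than the total number of hits held by all workers.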

129 March 2, 2004 Ray R. Larson Enhanced Expandability
Clearly defined APIs for interacting with the server will permit easy addition of new functionality, and replacement or upgrading of existing functionality
Interactive user interface for database configuration and setup
– We want to make it easier for a user/administrator to create and manage the database

130 March 2, 2004 Ray R. Larson Multilingual APIs
The system is being developed in a multilingual environment.
We will include the ability to interface with (at a minimum) Java, Python and C/C++ applications.
APIs for developing new functions will be available in these languages as well

131 March 2, 2004 Ray R. Larson Development
Currently work is going on here (RRL) and (primarily) in the UK
We have incomplete (Alpha) versions of the system, but haven’t been distributing it in the current form (changing constantly)
First release version is expected in mid-’04

132 March 2, 2004 Ray R. Larson Further Information
Full Cheshire II client and server is open source and available for academic and government use: ftp://cheshire.berkeley.edu/pub/cheshire/
– Includes HTML documentation
Project Web Site
Archives Hub

