
Slide 1: Principles of Information Retrieval — Lecture 21: XML Retrieval
IS 240 – Spring 2007. Prof. Ray Larson, University of California, Berkeley, School of Information. Tuesday and Thursday, 10:30 am – 12:00 pm. http://courses.ischool.berkeley.edu/i240/s07

Slide 2: Mini-TREC Proposed Schedule
– February 15: Database and previous queries
– February 27: Report on system acquisition and setup
– March 8: New queries for testing
– April 19: Results due (next Thursday)
– April 24 or 26: Results and system rankings
– May 8: Group reports and discussion

Slide 3: Announcement
No class on Tuesday (April 17th).

Slide 4: Today
Review:
– Geographic Information Retrieval
– GIR algorithms and evaluation, based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K.
XML and structured element retrieval:
– INEX
– Approaches to XML retrieval
Credit for some of the slides in this lecture goes to Marti Hearst.

Slide 5: Today
Review:
– Geographic Information Retrieval
– GIR algorithms and evaluation, based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K.
Web crawling and search issues:
– Web crawling
– Web search engines and algorithms
Credit for some of the slides in this lecture goes to Marti Hearst.

Slide 6: Introduction
What is Geographic Information Retrieval?
– GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research, with the addition of spatially and geographically oriented indexing and retrieval.
– It combines aspects of DBMS research, user interface research, GIS research, and information retrieval research.

Slide 7: Example: Results display from CheshireGeo
http://calsip.regis.berkeley.edu/pattyf/mapserver/cheshire2/cheshire_init.html

Slide 8: Other Convex, Conservative Approximations
Presented in order of increasing quality; the number in parentheses is the number of parameters needed to store the representation (after Brinkhoff et al., 1993b):
1) Minimum bounding circle (3)
2) MBR: minimum aligned bounding rectangle (4)
3) Minimum bounding ellipse (5)
4) Rotated minimum bounding rectangle (5)
5) 4-corner convex polygon (8)
6) Convex hull (varies)

Slide 9: Our Research Questions
Spatial ranking:
– How effectively can the spatial similarity between a query region and a document region be evaluated and ranked based on the overlap of the geometric approximations for these regions?
Geometric approximations and spatial ranking:
– How do different geometric approximations affect the rankings?
– MBRs: the most popular approximation
– Convex hulls: the highest-quality convex approximation

Slide 10: Spatial Ranking
Methods for computing spatial similarity.

Slide 11: Probabilistic Models: Logistic Regression Attributes
X1 = area of overlap(query region, candidate GIO) / area of query region
X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
X3 = 1 – abs(fraction of overlap region that is onshore – fraction of candidate GIO that is onshore)
The range for all variables is 0 (not similar) to 1 (same).
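The three attributes are straightforward to compute from polygon geometry. A minimal sketch using the shapely library (not the actual Cheshire implementation; the onshore fractions are assumed to be precomputed from a land/water overlay):

```python
from shapely.geometry import Polygon

def spatial_features(query: Polygon, gio: Polygon,
                     overlap_onshore: float, gio_onshore: float):
    """Compute the three LR attributes for a query region / candidate GIO pair.

    overlap_onshore, gio_onshore: precomputed fractions (0..1) of the overlap
    region and the candidate GIO that lie onshore (assumed available).
    """
    overlap = query.intersection(gio).area
    x1 = overlap / query.area               # overlap relative to query region
    x2 = overlap / gio.area                 # overlap relative to candidate GIO
    x3 = 1.0 - abs(overlap_onshore - gio_onshore)   # shore agreement
    return x1, x2, x3
```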

Slide 12: CA Named Places in the Test Collection – Complex Polygons
Counties, cities, national parks, national forests, water QCB regions, bioregions.

Slide 13: CA Counties – Geometric Approximations (MBRs and Convex Hulls)
Average false area of approximation: MBRs: 94.61%; convex hulls: 26.73%.

Slide 14: CA User-Defined Areas (UDAs) in the Test Collection

Slide 15: Test Collection Query Regions: CA Counties
– 42 of 58 counties are referenced in the test collection metadata
– 10 counties randomly selected as query regions to train the LR model
– 32 counties used as query regions to test the model

Slide 16: LR Model
X1 = area of overlap(query region, candidate GIO) / area of query region
X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
The range for all variables is 0 (not similar) to 1 (same).

Slide 17: Some of our Results
Mean average query precision: the average of the precision values after each new relevant document is observed in a ranked list. (Results shown for metadata indexed by CA named place regions, and for all metadata in the test collection.)
These results suggest:
– Convex hulls perform better than MBRs — an expected result, given that the convex hull is a higher-quality approximation.
– A probabilistic ranking based on MBRs can perform as well as, if not better than, a non-probabilistic ranking method based on convex hulls. This is interesting: since any approximation other than the MBR requires greater expense, it suggests that exploring new ranking methods based on the MBR is a good way to go.
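The "mean average query precision" measure described here can be sketched directly from its definition; a minimal version for one query (normalizing by the total number of relevant documents, as in standard average precision):

```python
def average_precision(ranked_rels, total_relevant):
    """Average the precision observed at each relevant document in the ranking.

    ranked_rels   : 0/1 relevance flags, in rank order
    total_relevant: number of relevant documents for this query
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            total += hits / rank           # precision at this rank
    return total / total_relevant if total_relevant else 0.0

# The mean of this value over all query regions gives mean average query precision.
```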

Slide 18: Some of our Results
Mean average query precision, for metadata indexed by CA named place regions and for all metadata in the test collection.
BUT: the inclusion of UDA-indexed metadata reduces precision. This is because coarse approximations of onshore or coastal geographic regions will necessarily include much irrelevant offshore area, and vice versa.

Slide 19: Shorefactor Model
X1 = area of overlap(query region, candidate GIO) / area of query region
X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
X3 = 1 – abs(fraction of query region approximation that is onshore – fraction of candidate GIO approximation that is onshore)
The range for all variables is 0 (not similar) to 1 (same).

Slide 20: Some of our Results, with Shorefactor
Mean average query precision for all metadata in the test collection. These results suggest:
– Adding the shorefactor variable improves the model (LR 2), especially for MBRs.
– The improvement is not so dramatic for convex hull approximations, because the problem that shorefactor addresses is not that significant when areas are represented by convex hulls.

Slide 21: Precision-Recall Results for All Data – MBRs

Slide 22: Precision-Recall Results for All Data – Convex Hulls

Slide 23: XML Retrieval
The following slides are adapted from presentations at INEX 2003–2005 and at the INEX Element Retrieval Workshop in Glasgow (2005), with some new additions for general context.

Slide 24: INEX Organization
Organized by:
– University of Duisburg-Essen, Germany (Norbert Fuhr, Saadia Malik, and others)
– Queen Mary University of London, UK (Mounia Lalmas, Gabriella Kazai, and others)
Supported by:
– DELOS Network of Excellence in Digital Libraries (EU)
– IEEE Computer Society
– University of Duisburg-Essen

Slide 25: XML Retrieval Issues
– Using structure?
– Specification of queries
– How to evaluate?

Slide 26: Cheshire SGML/XML Support
– The underlying native format for all data is SGML or XML; the DTD defines the database contents.
– Full SGML/XML parsing.
– SGML/XML-format configuration files define the database location and indexes.
– Various format conversions and utilities are available for Z39.50 support (MARC, GRS-1).

Slide 27: SGML/XML Support
Configuration files for the server are SGML/XML:
– They include elements describing all of the data files and indexes for the database.
– They also include instructions on how data is to be extracted for indexing, and how Z39.50 attributes map to the indexes for a given database.

Slide 28: Indexing
Any SGML/XML-tagged field or attribute can be indexed:
– B-tree and hash access via Berkeley DB (Sleepycat)
– Stemming, keyword, exact keys, and "special keys"
– Mapping from any Z39.50 attribute combination to a specific index
– The underlying postings information includes term frequency for probabilistic searching
– Component extraction, with separate component indexes

Slide 29: XML Element Extraction
– A new search "ElementSetName" is XML_ELEMENT_.
– Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request.
– The matching elements are extracted from the records matching the search and delivered in a simple format.

Slide 30: XML Extraction
% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}}
{ Singularités à Cargèse … etc.

Slide 31: TREC3 Logistic Regression
The probability of relevance is based on logistic regression, using a sample set of documents to determine the values of the coefficients. At retrieval, the probability estimate is obtained from the six X attribute measures shown on the next slide.
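The estimate itself appears only as an image on the original slide; assuming the standard logistic form used in the Berkeley TREC work, it is:

```latex
\log O(R \mid Q, C) \;=\; b_0 + \sum_{i=1}^{6} b_i X_i,
\qquad
P(R \mid Q, C) \;=\; \frac{1}{1 + e^{-\left(b_0 + \sum_{i=1}^{6} b_i X_i\right)}}
```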

Slide 32: TREC3 Logistic Regression
– Average absolute query frequency
– Query length
– Average absolute component frequency
– Document length
– Average inverse component frequency
– Number of terms in both query and component

Slide 33: Okapi BM25
Where:
– Q is a query containing terms T
– K is k1((1 − b) + b · dl/avdl)
– k1, b, and k3 are parameters, usually 1.2, 0.75, and 7–1000
– tf is the frequency of the term in a specific document
– qtf is the frequency of the term in the topic from which Q was derived
– dl and avdl are the document length and the average document length, measured in some convenient unit
– w(1) is the Robertson-Sparck Jones weight
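The scoring function itself is an image on the original slide, but the definitions above match the classic BM25 ranking formula:

```latex
\sum_{T \in Q} w^{(1)} \,
\frac{(k_1 + 1)\, tf}{K + tf} \cdot
\frac{(k_3 + 1)\, qtf}{k_3 + qtf},
\qquad
K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right)
```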

Slide 34: Combining Boolean and Probabilistic Search Elements
Two original approaches:
– Boolean approach
– Non-probabilistic "fusion search": a set-merger approach using a weighted merger of document scores from separate Boolean and probabilistic queries

Slide 35: INEX '04 Fusion Search
Merge multiple ranked and Boolean index searches within each query, and multiple component search result sets.
– The major components merged are articles, body, sections, subsections, and paragraphs.
[Diagram: subqueries produce component query results, which are fused/merged into a final ranked list]

Slide 36: Merging and Ranking Operators
Extends the capabilities of merging to include merger operations in queries, like Boolean operators.
Fuzzy logic operators (not used for INEX):
– !FUZZY_AND
– !FUZZY_OR
– !FUZZY_NOT
Containment operators (restrict components to or from a particular parent):
– !RESTRICT_FROM
– !RESTRICT_TO
Merge operators:
– !MERGE_SUM
– !MERGE_MEAN
– !MERGE_NORM
– !MERGE_CMBZ
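The slides do not show the operators' definitions, but their names suggest standard score-fusion arithmetic over {document: score} result sets. A hedged sketch, treating !MERGE_CMBZ as CombMNZ-style fusion (consistent with the later description "normalized score summation and enhancement", though not taken from Cheshire's source):

```python
def _normalize(run):
    """Scale a {doc_id: score} result set into [0, 1] by its top score."""
    top = max(run.values(), default=0.0) or 1.0
    return {doc: s / top for doc, s in run.items()}

def merge_sum(a, b):                       # !MERGE_SUM: plain score sum
    return {d: a.get(d, 0.0) + b.get(d, 0.0) for d in a.keys() | b.keys()}

def merge_mean(a, b):                      # !MERGE_MEAN: average of the scores
    return {d: (a.get(d, 0.0) + b.get(d, 0.0)) / 2 for d in a.keys() | b.keys()}

def merge_norm(a, b):                      # !MERGE_NORM: normalize, then sum
    return merge_sum(_normalize(a), _normalize(b))

def merge_cmbz(a, b):                      # !MERGE_CMBZ: boost docs in both runs
    a, b = _normalize(a), _normalize(b)
    return {d: (a.get(d, 0.0) + b.get(d, 0.0)) * ((d in a) + (d in b))
            for d in a.keys() | b.keys()}
```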

Slide 37: New LR Coefficients
Estimates using INEX '03 relevance assessments:

Index        b0       b1      b2       b3      b4       b5      b6
Base        -3.700    1.269   -0.310   0.679   -0.021   0.223   4.010
topic       -7.758    5.670   -3.427   1.787   -0.030   1.952   5.880
topicshort  -6.364    2.739   -1.443   1.228   -0.020   1.280   3.837
abstract    -5.892    2.318   -1.364   0.860   -0.013   1.052   3.600
alltitles   -5.243    2.319   -1.361   1.415   -0.037   1.180   3.696
sec words   -6.392    2.125   -1.648   1.106   -0.075   1.174   3.632
para words  -8.632    1.258   -1.654   1.485   -0.084   1.143   4.004

b1 = average absolute query frequency
b2 = query length
b3 = average absolute component frequency
b4 = document length
b5 = average inverse component frequency
b6 = number of terms in common between query and component
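Applying a row of this table at retrieval time is a dot product followed by the logistic transform; for example, with the "topic" index coefficients:

```python
import math

# b0..b6 for the "topic" index, copied from the table above.
TOPIC_COEFFS = [-7.758, 5.670, -3.427, 1.787, -0.030, 1.952, 5.880]

def relevance_probability(x, coeffs=TOPIC_COEFFS):
    """x: the six attribute measures X1..X6 for a query/component pair."""
    log_odds = coeffs[0] + sum(b * xi for b, xi in zip(coeffs[1:], x))
    return 1.0 / (1.0 + math.exp(-log_odds))
```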

Slide 38: INEX CO Runs
Three official runs and one later run, all title-only:
– Fusion: combines Okapi and LR using the MERGE_CMBZ operator
– NewParms (LR): uses only LR, with the new parameters
– Feedback: an attempt at blind relevance feedback
– PostFusion: fusion of the new LR coefficients and Okapi

Slide 39: Query Generation – CO #162
TITLE = Text and Index Compression Algorithms
Generated query:
(topicshort @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (topicshort @ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @ {Text and Index Compression Algorithms})
Here @+ is Okapi, @ is LR, and !MERGE_CMBZ is a normalized score summation and enhancement.

Slide 40: INEX CO Runs
Mean average precision, generalized quantization:
– FUSION = 0.0642
– NEWPARMS = 0.0582
– FDBK = 0.0415
– POSTFUS = 0.0690
Mean average precision, strict quantization:
– FUSION = 0.0923
– NEWPARMS = 0.0853
– FDBK = 0.0390
– POSTFUS = 0.0952

Slide 41: INEX VCAS Runs
Two official runs:
– FUSVCAS: element fusion using LR, with various operators for path restriction
– NEWVCAS: uses the new LR coefficients for each appropriate index, with various operators for path restriction

Slide 42: Query Generation – VCAS #66
TITLE = //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)]
Submitted query:
((topic @ {intelligent transport systems})) !RESTRICT_FROM ((sec_words @ {on-board route planning navigation system for automobiles}))
Target elements: sec|ss1|ss2|ss3

Slide 43: VCAS Results
Mean average precision, generalized: FUSVCAS = 0.0321, NEWVCAS = 0.0270
Mean average precision, strict: FUSVCAS = 0.0601, NEWVCAS = 0.0569

Slide 44: Heterogeneous Track
An approach using Cheshire's virtual database options — primarily a version of distributed IR:
– Each collection is indexed separately
– Search is via Z39.50 distributed queries
– Z39.50 attribute mapping is used to map query indexes to appropriate elements in a given collection
– Only LR is used, and collection results are merged using the probability of relevance for each collection result

Slide 45: INEX 2005 Approach
Used only logistic regression methods:
– "TREC3" with pivot
– "TREC2" with pivot
– "TREC2" with blind feedback
Used post-processing for specific tasks.

Slide 46: Logistic Regression
The probability of relevance is based on logistic regression, using a sample set of documents to determine the values of the coefficients. At retrieval, the probability estimate is obtained from some set of m statistical measures, Xi, derived from the collection and query, via the same logistic form shown earlier (Slide 31).

Slide 47: TREC2 Algorithm
[Formula shown as an image: term frequency statistics for the matching terms, computed over the query, the document, and the collection]

Slide 48: Blind Feedback
Term selection from top-ranked documents is based on the classic Robertson/Sparck Jones probabilistic model. For each term t, the document indexing vs. document relevance contingency table is:

                Relevant   Not relevant      Total
t present       Rt         Nt - Rt           Nt
t absent        R - Rt     N - Nt - R + Rt   N - Nt
Total           R          N - R             N

Slide 49: Blind Feedback
The top x new terms are taken from the top y documents:
– For each term in the top y assumed-relevant set, compute its relevance weight.
– Terms are ranked by term weight, and the top x are selected for inclusion in the query.
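Putting the two slides together, term selection can be sketched as follows; the 0.5 smoothing is the usual Robertson/Sparck Jones formulation and is an assumption here, since the slide shows the weight only as an image:

```python
import math

def rsj_weight(Rt, Nt, R, N):
    """Robertson/Sparck Jones relevance weight with the usual 0.5 smoothing.

    Rt: assumed-relevant docs containing t    Nt: all docs containing t
    R : docs assumed relevant (the top y)     N : docs in the collection
    """
    return math.log(((Rt + 0.5) * (N - Nt - R + Rt + 0.5)) /
                    ((Nt - Rt + 0.5) * (R - Rt + 0.5)))

def select_feedback_terms(term_stats, R, N, x):
    """term_stats: {term: (Rt, Nt)}. Return the top x terms by weight."""
    ranked = sorted(term_stats,
                    key=lambda t: rsj_weight(*term_stats[t], R, N),
                    reverse=True)
    return ranked[:x]
```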

Slide 50: Pivot Method
– Based on the pivot weighting used by IBM Haifa in INEX 2004 (Mass & Mandelbrod)
– Used 0.50 as the pivot for all cases
– For the TREC3 and TREC2 runs, all component results are weighted by the article-level results for the matching article
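The slide gives no formula; one plausible reading of the Mass & Mandelbrod pivot weighting, with the 0.50 pivot mentioned above, is a linear interpolation between the component score and the score of its enclosing article:

```latex
score'(c) \;=\; \lambda \cdot score(\mathit{article}) \;+\; (1 - \lambda) \cdot score(c),
\qquad \lambda = 0.5
```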

Slide 51: Adhoc Component Fusion Search
Merge multiple ranked component types:
– The major components merged are article body, sections, paragraphs, and figures.
[Diagram: subqueries produce component query results, which are fused/merged into a raw ranked list]

Slide 52: TREC3 Logistic Regression
The probability of relevance is based on logistic regression, using a sample set of documents to determine the values of the coefficients; at retrieval, the probability estimate is obtained as shown earlier.

Slide 53: TREC3 Logistic Regression Attributes
– Average absolute query frequency
– Query length
– Average absolute component frequency
– Document length
– Average inverse component frequency
– Number of terms in common between query and component (logged)

Slide 54: TREC3 LR Coefficients
(The same coefficient table as Slide 37, estimated using INEX '03 relevance assessments; b1–b6 are the attributes listed on the previous slide.)

Slide 55: CO.Focused, Generalized & Strict

Slide 56: COS.Focused, Generalized & Strict

Slide 57: CO.Thorough, Generalized & Strict

Slide 58: COS.Thorough, Generalized & Strict

Slide 59: CAS, Generalized & Strict

Slide 60: Heterogeneous Element Retrieval Overview
– The problem
– Issues with element retrieval and heterogeneous retrieval
– Possible approaches:
  – XPointer
  – Generic metadata systems (e.g., Dublin Core)
  – Other metadata systems

Slide 61: The Problem
– The Adhoc track in INEX has dealt with a single DTD for one type of data (computer science journal articles).
– In real-world environments, XML retrieval must deal with different DTDs, different genres of data, and widely varying topical content.

Slide 62: The Heterogeneous Track Research Questions (2004)
– For content-oriented queries, what methods are possible for determining which elements contain reasonable answers? Are pure statistical methods appropriate, or are ontology-based approaches also helpful?
– What methods can be used to map structural criteria onto other DTDs?
– Should mappings focus on element names only, or also deal with element content or semantics?
– What are appropriate evaluation criteria for heterogeneous collections?

Slide 63: INEX 2004 Het Collection Tags

Collection     Author tag         Title tag         Abstract tag
INEX (IEEE)    fm/au              fm/tig/atl        fm/abs
Berkeley       Fld100, Fld700     Fld245            Fld500 (rarely)
compuscience   author             title             abstract
bibdbpub       author, altauthor  title             abstract
dblp           author, editor     title, booktitle  none
hcibib         author             title             abstract
qmulcspub      AUTHOR, EDITOR     TITLE             ABSTRACT

Slide 64: Issues with Element Retrieval for Heterogeneous Retrieval
Conceptual issues (the user's view):
– Actually specifying structural elements for retrieval requires that the user know the structure of the items to be retrieved.
– As the number of DTDs or schemas increases, this task becomes more complex, both for specification and for understanding.
– For real-world XML retrieval, specifying structure effectively requires omniscience on the part of the user.
– The collection itself must be specified in some way (can the user know all of the collections?).
– Users of INEX can't produce correct specifications for even one DTD.

Slide 65: Issues with Element Retrieval for Heterogeneous Retrieval
Practical issues (the programmer's view):
– Most of the same problems as the user view.
– As seen in earlier papers today, the system must provide an interface that the user can understand, but that maps to the complexities of the DTD(s).
– Once again, as the number of DTDs or schemas increases, specifying the mappings becomes increasingly complex.
– For real-world XML retrieval, specifying structure effectively requires omniscience on the part of the programmer, to provide exhaustive mappings of the document elements to be retrieved. As Roelof noted earlier today, this can rapidly become a system with too many options for a user to understand or use.

Slide 66: Postulate of Impotence
In summation, we might suggest another "Postulate of Impotence" like those suggested by Swanson: you can have either heterogeneous retrieval or precise element specifications in queries, but you cannot have both simultaneously.

Slide 67: Possible Approaches
Generalized structure:
– Parent/child as in XPath/XPointer
– What about flat structures (like most collections in the Het track)?
Abstract query elements:
– Use semantic representations in queries rather than structural representations, e.g., "Title" instead of //fm/tig/atl
– What semantic representations can/should be used?

Slide 68: XPointer
– Can specify collection-level identification: basically a URN attached to an XPath.
– Can also specify various string-matching constraints on the XPath.
– Might be useful in the INEX Het track for specifying relevance judgements.
– But it doesn't address (or it worsens) the larger problem of dealing with large numbers of heterogeneous structures.

Slide 69: Abstract Data Elements
The idea is to remove the requirement of precise and explicit specification of structural elements and replace it with abstract and implied specifications. This is used in other heterogeneous retrieval systems:
– Z39.50/SRW (attribute sets and element sets)
– Dublin Core (a limited set of elements for search or retrieval)
A sketch of this idea follows.
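Concretely, an abstract element reduces to a per-collection lookup table. A toy sketch using tag names from the Slide 63 table (the collection keys are illustrative, not Cheshire configuration syntax):

```python
# Abstract search element -> concrete tags, per collection.
ABSTRACT_TO_TAGS = {
    "author": {"inex_ieee": ["fm/au"],
               "berkeley":  ["Fld100", "Fld700"],
               "dblp":      ["author", "editor"]},
    "title":  {"inex_ieee": ["fm/tig/atl"],
               "berkeley":  ["Fld245"],
               "dblp":      ["title", "booktitle"]},
}

def concrete_tags(abstract_element, collection):
    """Resolve an abstract element for one collection ([] if unmapped)."""
    return ABSTRACT_TO_TAGS.get(abstract_element, {}).get(collection, [])
```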

Slide 70: Dublin Core
– Simple metadata for describing internet resources
– For "document-like objects"
– 15 elements (in base DC)

Slide 71: Dublin Core Elements
Title, Creator, Subject, Description, Publisher, Other Contributors, Date, Resource Type, Format, Resource Identifier, Source, Language, Relation, Coverage, Rights Management

Slide 72: Issues in Dublin Core
– Lack of guidance on what to put into each element
– How to structure or organize at the element level?
– How to ensure consistency across descriptions of the same persons, places, things, etc.?

