CS276A Text Information Retrieval, Mining, and Exploitation Lecture 16 3 Dec 2002.


1 CS276A Text Information Retrieval, Mining, and Exploitation Lecture 16 3 Dec 2002

2 Today’s Topics
- Recap: XQuery
- Course evaluations
- XML indexing and search II
  - Systems supporting relevance ranking
  - UC Davis system (XQuery extension)
  - XIRQL: relevance ranking / XML search
- Summary, discussion
- Metadata

3 Recap: XQuery Møller and Schwartzbach

4 Queries Supported by XQuery
- Location/position ("chapter no. 3")
- Attribute/value: /play/title contains "hamlet"
- Path queries: title contains "hamlet"; /play//title contains "hamlet"
- Complex graphs: ingredients occurring in two recipes
- Subsumes: hyperlinks
- What about relevance ranking?

5 XQuery 1.0 Standard on Order Document order defines a total ordering among all the nodes seen by the language processor. Informally, document order corresponds to a depth-first, left-to-right traversal of the nodes in the Data Model. … if a node in document A is before a node in document B, then every node in document A is before every node in document B.
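Document order is easy to see concretely; a minimal sketch using Python's standard library (the XML snippet is invented for illustration):

```python
import xml.etree.ElementTree as ET

# iter() walks the tree depth-first, left-to-right, which matches
# XQuery's informal definition of document order.
doc = ET.fromstring(
    "<play><title>Hamlet</title>"
    "<act><title>Act I</title><scene/></act></play>"
)
order = [elem.tag for elem in doc.iter()]
print(order)  # ['play', 'title', 'act', 'title', 'scene']
```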

6 Document Collection = 1 XML Doc.
<!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet"
  "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd">
91060009 1991 01 10 some content

7 Document Collection = 1 XML Doc.
<!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet"
  "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd">
(content) …

8 How XQuery makes ranking difficult
- All documents in collection A must be ranked before all documents in collection B.
- Fragments must be ordered in depth-first, left-to-right order.

9 XQuery: Order By Clause (FLWOR)
for $d in document("depts.xml")//deptno
let $e := document("emps.xml")//emp[deptno = $d]
where count($e) >= 10
order by avg($e/salary) descending
return
  <big_dept>
    { $d,
      <headcount>{count($e)}</headcount>,
      <avgsal>{avg($e/salary)}</avgsal> }
  </big_dept>

10 XQuery Order By Clause
The order by clause only allows ordering by an "overt" criterion. Relevance ranking:
- is often proprietary
- can't be expressed easily as a function of the set to be ranked
- is better abstracted out of query formulation (cf. the web)

11 Next: XML IR System with Relevance Ranking

12 Thursday (12/5): Presentation of Projects
- Copy slides into your subdirectory in the cs276a submit/ directory
- Slides must be named slides.{ppt,pdf}
- Slides must be submitted by noon tomorrow (12/4)
- You have 3 minutes to present on Thursday
  - Use all of it, or leave 1 minute for 1 question
- Projects are due today

13 Evaluations

14 UC Davis System
- Extension of XQuery for relevance ranking
- Term counts are stored with only one related node (usually a leaf)
- Term weight computation at query time
- Solves the "granularity problem"
- Reasonable index size: 30%-60% of collection

15 Integrated Query Processing

16 Abbreviations
- DF = document fragment
- DFS = document fragment sequence
- SSD = sequence of sets of DFs

17 Integrated Query Processing

18 Element vs Attribute
- Search for element: /play[title="hamlet"]
- Search for attribute: /play[@title="hamlet"]
- Both are valid encodings -> need to know the schema
- UC Davis system treats elements/attributes identically for text search
- XIRQL supports /play[~title="hamlet"]
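One way the "treat elements and attributes identically" policy could be realized at indexing time, sketched with Python's standard library (the document and whitespace tokenization are invented for illustration):

```python
import xml.etree.ElementTree as ET

def index_terms(xml_text):
    """Collect searchable terms from both element text and attribute
    values, so /play[title="hamlet"] and /play[@title="hamlet"]
    hit the same index."""
    terms = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.text and elem.text.strip():
            terms.extend(elem.text.lower().split())
        for value in elem.attrib.values():
            terms.extend(value.lower().split())
    return terms

print(index_terms('<play title="Hamlet"><act>To be</act></play>'))
# ['hamlet', 'to', 'be']
```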

19 Proposed XQuery Extension
RankBy ::= Expr "rankby" "(" QuerySpecList ")"
           ["basedon" "(" TargetSpecList ")"]
           ["using" MethodFunctCall]
QuerySpecList ::= Expr ("," QuerySpecList)?
TargetSpecList ::= PathExpr ("," TargetSpecList)?

20 Proposed XQuery Extension: Example
document("news.xml")//article//paragraph rankby ("New York")

for $c in document("newsmeta.xml")//category
return {$c/name/text()}
  for $a in document("news.xml")//article[@icd=$c/@id]
  return $a/title/text() rankby ($c/keywords)

21 Proposed XQuery Extension: Example
for $a in document("bib.xml")//article
return {$a/title}
rankby ("albert", "einstein") basedon (references)

22 Observations
- Possible: ranked text != returned text
  - Rank on references, return titles
- Query can be user supplied or a path expression
  - $c/keywords example

23 Implementation
- Term weighting: term weights are computed on the fly
- Index construction

24 Term Weighting
Parameters we need for computing term weights:
- tf - term frequency in DF
- df - number of DFs containing the term
- dlen - length of DF in terms
- slen - number of terms in DFS
Premise: the first step of query evaluation is the path query ("p query"), so term weights can be computed on the fly while evaluating it.
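A plausible tf.idf-style weight over these parameters can be sketched as follows; the formula itself is an illustrative assumption, not the UC Davis system's actual weighting function:

```python
import math

def term_weight(tf, df, dlen, slen, n_dfs):
    """Hypothetical on-the-fly term weight from the statistics above:
      tf    - term frequency in the document fragment (DF)
      df    - number of DFs containing the term
      dlen  - length of the DF in terms
      slen  - number of terms in the DFS (listed for completeness;
              unused in this particular sketch)
      n_dfs - number of DFs in the sequence
    Length-normalized tf times idf - an assumed, illustrative formula."""
    if tf == 0 or df == 0:
        return 0.0
    return (tf / dlen) * math.log(n_dfs / df)

w = term_weight(tf=3, df=2, dlen=100, slen=10000, n_dfs=50)
```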

25 Integrated Query Processing

26 Index Requirements
- Inverted index
- Non-replication of statistics
- DF inclusion
- DF intersection
- Worst case complexity: quadratic - number of DFs in current result set x number of DFs contributing to term statistics

27 Document Guide

28 Node Encoding
- Use the document guide
- Encode each node with a path identifier (PID): PID = (DG-index, [path number, path number, ...])
- Example: /db/article[1]/text/para[3] is encoded as (5,(1,3))
- Encode the DG-index with log(n) bits; encode each path number with log(k_i) bits
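A sketch of the PID encoding and its bit cost; the example document-guide size and fanouts are assumptions for illustration:

```python
import math

def encode_pid(dg_index, path_numbers, n_paths, fanouts):
    """Encode a node as (DG-index, (path numbers...)) and count the
    bits needed: log2(n) bits for the DG-index over n distinct guide
    paths, and log2(k_i) bits for the i-th path number, where k_i is
    the maximum fanout at step i."""
    bits = math.ceil(math.log2(n_paths))
    bits += sum(math.ceil(math.log2(k)) for k in fanouts)
    return (dg_index, tuple(path_numbers)), bits

# /db/article[1]/text/para[3] -> (5, (1, 3)), assuming a document
# guide with 8 distinct paths and fanouts of at most 16 at each step.
pid, bits = encode_pid(5, [1, 3], n_paths=8, fanouts=[16, 16])
print(pid, bits)  # (5, (1, 3)) 11
```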

29 Node Encoding

30 Term Counter Accumulation
- Worst case: quadratic
- Trick: order PIDs to reduce the set of DFs to check
  - (n1,N1) < (n2,N2) iff n1 < n2, or (n1 = n2 and N1 < N2)
- To find all descendants of (n1,N1):
  - start with (n1,N1)
  - proceed to the first node (n2,N2) that is not a descendant
  - stop
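The ordering trick can be sketched as follows; PIDs are simplified here to bare path-number tuples (an assumption — the real PIDs also carry a DG-index), so descent becomes prefix extension:

```python
import bisect

def is_descendant(pid, ancestor):
    # Simplified model: a PID is the tuple of path numbers from the
    # root, so a descendant's PID extends its ancestor's as a prefix.
    return len(pid) > len(ancestor) and pid[:len(ancestor)] == ancestor

def descendants(sorted_pids, ancestor):
    # With PIDs sorted, all descendants of a node form a contiguous
    # run right after it: scan forward and stop at the first
    # non-descendant instead of testing every PID.
    out = []
    i = bisect.bisect_right(sorted_pids, ancestor)
    while i < len(sorted_pids) and is_descendant(sorted_pids[i], ancestor):
        out.append(sorted_pids[i])
        i += 1
    return out

pids = sorted([(1,), (1, 1), (1, 2), (1, 2, 1), (2,), (2, 1)])
print(descendants(pids, (1,)))  # [(1, 1), (1, 2), (1, 2, 1)]
```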

31 Index Parameters

32 Performance

33 Limitations
- Need to start with a "p query" (a path), which is often inconvenient or not possible
- User must know the schema
- No element/attribute distinction for r queries
- XQuery allows construction of new fragments from subfragments
- XQuery allows construction of completely new documents

34 XIRQL
- University of Dortmund
- Goal: open-source XML search engine
- Motivation:
  - "Returnable" fragments are special "atomic units" (e.g., don't return an arbitrary text fragment)
  - Structured Document Retrieval Principle
  - Empower users who don't know the schema: enable search for any person_name no matter how the schema refers to it

35 Atomic Units
- Specified in the schema
- Only atomic units can be returned as the result of a search (unless a unit is specified)
- Tf.idf weighting is applied to atomic units
- Probabilistic combination of "evidence" from atomic units

36 XIRQL Indexing

37 Structured Document Retrieval Principle
A system should always retrieve the most specific part of a document answering a query.
Example query: xql
Document: a chapter (term weights 0.3 XQL, 0.5 example) containing a section (term weights 0.8 XQL, 0.7 syntax)
-> Return the section, not the chapter

38 Augmentation weights
- Ensure that the Structured Document Retrieval Principle is respected.
- Assume different query conditions are independent events.
- P(chapter,XQL) = P(XQL|chapter) + P(sec|chapter)*P(XQL|sec) - P(XQL|chapter)*P(sec|chapter)*P(XQL|sec)
  = 0.3 + 0.6*0.8 - 0.3*0.6*0.8 = 0.636
- Section ranked ahead of chapter
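The combination on this slide can be checked directly; a small sketch of the inclusion-exclusion computation under the independence assumption (the function name is ours):

```python
def combine(p_term_chapter, p_sec_chapter, p_term_sec):
    """P(chapter, term): direct evidence from the chapter, plus
    evidence propagated from the section weighted by the augmentation
    weight P(sec|chapter), combined by inclusion-exclusion assuming
    independence."""
    direct = p_term_chapter
    propagated = p_sec_chapter * p_term_sec
    return direct + propagated - direct * propagated

score = combine(0.3, 0.6, 0.8)
print(round(score, 3))  # 0.636 < 0.8, so the section outranks the chapter
```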

39 Datatypes
- Example: person_name
- Assign all elements and attributes with person semantics to this datatype
- Allow the user to search for "person" without specifying a path

40 XIRQL: Summary
- Relevance ranking
- Fragment/context selection
- Datatypes (person_name)
- Probabilistic combination of evidence

41 XML Summary
- Why you don't want to use a DB: relevance ranking
- Why you can't use a standard IR engine:
  - term statistics / indexing granularity
  - issues with fragments (granularity, coherence, ...)

42 XML Summary: Schemas
- Ideally: there is one schema and the user understands it
- In practice this is rare:
  - many schemas
  - schemas not known in advance
  - schemas change
  - users don't understand schemas
- Need to identify similar elements in different schemas (example: employee)

43 XML Summary: UI Challenges
- Help the user find relevant nodes in the schema (author, editor, contributor, "from:"/sender)
- What query language do you expose to the user?
  - XQuery? No.
  - Forms? Parametric search? A text box?
- In general: design a layer between XML and the user

44 Metadata

45 Dublin Core Element Set
- Title (e.g., Dublin Core Element Set)
- Creator (e.g., Hinrich Schuetze)
- Subject (e.g., keywords)
- Description (e.g., an abstract)
- Publisher (e.g., Stanford University)
- Contributor (e.g., Chris Manning)
- Date (e.g., 2002.12.03)
- Type (e.g., presentation)
- Format (e.g., ppt)
- Identifier (e.g., http://www.stanford.edu/class/cs276a/syllabus.html)
- Source (e.g., http://dublincore.org/documents/dces/)
- Language (e.g., English)
- Coverage (e.g., San Francisco Bay Area)
- Rights (e.g., Copyright Stanford University)
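A minimal machine-readable sketch of a record using a few of the elements above; the dc namespace URI is the standard Dublin Core elements namespace, while the record layout itself is illustrative:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record = ET.Element("record")
for name, value in [
    ("title", "Dublin Core Element Set"),
    ("creator", "Hinrich Schuetze"),
    ("date", "2002-12-03"),
    ("format", "ppt"),
    ("language", "English"),
]:
    elem = ET.SubElement(record, f"{{{DC_NS}}}{name}")
    elem.text = value

xml = ET.tostring(record, encoding="unicode")
print(xml)
```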

46 Why metadata?
- "Normalized" semantics
- Enables searches otherwise not possible:
  - time
  - author
  - URL / filename
  - non-text content (images, audio, video)

47 For Effective Metadata We Need:
- Semantics: commonly understood terms to describe information resources
- Syntax: a standard grammar for connecting terms into meaningful "sentences"
- Exchange framework: so we can recombine and exchange metadata across applications and subjects
(Weibel and Miller)

48 Dublin Core: Goals
- Metadata standard
- Framework for interoperability
- Facilitate development of subject-specific metadata
More general goals of the DC community:
- Foster development of metadata tools (creating, editing, managing, navigating metadata)
- Semantic registry for metadata: search declared meanings and their relationships
- Registry for metadata vocabularies: maybe somebody has already done the work

49 RDF = Resource Description Framework
- Engineering standard for metadata
- W3C standard, part of W3C's metadata framework
- Specialized for the WWW
- Desiderata:
  - combine different metadata modules (e.g., different subject areas)
  - syndication, aggregation, threading

50 Metadata: Who creates them?
- Authors
- Editors
- Automatic creation
- Automatic context capture

51 Automatic Context Capture

52 Metadata Pros and Cons
Cons:
- Most authors are unwilling to spend time and energy on learning a metadata standard and annotating the documents they author.
- Authors are unable to foresee all the reasons why a document may be interesting.
- Authors may be motivated to sabotage metadata (patents).
Pros:
- Information retrieval often does not work; words poorly approximate meaning.
- For truly valuable content, it pays to add metadata.
Synthesis:
- In reality, most documents have some valuable metadata.
- If metadata is available, it improves relevance and user experience.
- But the most interesting content will always have inconsistent and spotty metadata coverage.

53 Advanced Research Issues
- Cross-lingual IR
- Topic detection & tracking
- Information filtering and collaborative filtering
- Data fusion
- Automatic classification
- Merging database and IR technologies
- Digital libraries
- Multimedia information retrieval

54 Resources
- Jan-Marco Bremer's publications on XML and IR: http://www.db.cs.ucdavis.edu/~bremer
- XML resources at W3C: www.w3.org/XML
- Ronald Bourret on native XML databases: http://www.rpbourret.com/xml/ProdsNative.htm
- Norbert Fuhr and Kai Grossjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th International ACM SIGIR Conference, New Orleans, Louisiana, September 2001.
- http://www.sciam.com/2001/0501issue/0501berners-lee.html
- http://www.xml.com/pub/a/2001/01/24/rdf.html

55 Additional Resources
- http://www.hyperorg.com/backissues/joho-jun26-02.html#semantic
- http://www.xml.com/pub/a/2001/05/23/jena.html
- http://www710.univ-lyon1.fr/%7Echampin/rdf-tutorial/
- http://www-106.ibm.com/developerworks/library/w-rdf/

