CS276A Text Information Retrieval, Mining, and Exploitation Lecture 16 3 Dec 2002.


1 CS276A Text Information Retrieval, Mining, and Exploitation Lecture 16 3 Dec 2002

2 Today’s Topics
- Recap: XQuery
- Course evaluations
- XML indexing and search II
  - Systems supporting relevance ranking
  - UC Davis system (XQuery extension)
  - XIRQL: relevance ranking / XML search
- Summary, discussion
- Metadata

3 Recap: XQuery Møller and Schwartzbach

4 Queries Supported by XQuery
- Location/position ("chapter no. 3")
- Attribute/value: /play/title contains "hamlet"
- Path queries: title contains "hamlet"; /play//title contains "hamlet"
- Complex graphs: ingredients occurring in two recipes
- Subsumes: hyperlinks
- What about relevance ranking?

5 XQuery 1.0 Standard on Order Document order defines a total ordering among all the nodes seen by the language processor. Informally, document order corresponds to a depth-first, left-to-right traversal of the nodes in the Data Model. … if a node in document A is before a node in document B, then every node in document A is before every node in document B.
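Document order is easy to see concretely; a minimal sketch using Python's standard library (the XML snippet is invented for illustration):

```python
import xml.etree.ElementTree as ET

# iter() walks the tree depth-first, left-to-right, which matches
# XQuery's informal definition of document order.
doc = ET.fromstring(
    "<play><title>Hamlet</title>"
    "<act><title>Act I</title><scene/></act></play>"
)
order = [elem.tag for elem in doc.iter()]
print(order)  # ['play', 'title', 'act', 'title', 'scene']
```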

6 Document Collection = 1 XML Doc.
<!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet"
  "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd">
91060009 1991 01 10 some content

7 Document Collection = 1 XML Doc.
<!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet"
  "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd">
(content) …

8 How XQuery makes ranking difficult
- All documents in collection A must be ranked before all documents in collection B.
- Fragments must be ordered in depth-first, left-to-right order.

9 XQuery: Order By Clause (FLWOR)
for $d in document("depts.xml")//deptno
let $e := document("emps.xml")//emp[deptno = $d]
where count($e) >= 10
order by avg($e/salary) descending
return
  <big_dept>
    { $d,
      <headcount>{count($e)}</headcount>,
      <avgsal>{avg($e/salary)}</avgsal> }
  </big_dept>

10 XQuery Order By Clause
The order by clause only allows ordering by an "overt" criterion. Relevance ranking:
- is often proprietary
- can't be expressed easily as a function of the set to be ranked
- is better abstracted out of query formulation (cf. the web)

11 Next: XML IR System with Relevance Ranking

12 Thursday (12/5): Presentation of Projects
- Copy slides into your subdirectory in the cs276a submit/ directory
- Slides must be named slides.{ppt,pdf}
- Slides must be submitted by noon tomorrow (12/4)
- You have 3 minutes to present on Thursday
  - Use all of it, or leave 1 minute for 1 question
- Projects are due today

13 Evaluations

14 UC Davis System
- Extension of XQuery for relevance ranking
- Term counts are stored with only one related node (usually a leaf)
- Term weight computation at query time
- Solves the "granularity problem"
- Reasonable index size: 30%-60% of collection

15 Integrated Query Processing

16 Abbreviations
- DF = document fragment
- DFS = document fragment sequence
- SSD = sequence of sets of DFs

17 Integrated Query Processing

18 Element vs Attribute
- Search for element: /play[title="hamlet"]
- Search for attribute: /play[@title="hamlet"]
- Both are valid encodings -> need to know the schema
- UC Davis system treats elements/attributes identically for text search
- XIRQL supports /play[~title="hamlet"]
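One way the "treat elements and attributes identically" policy could be realized at indexing time, sketched with Python's standard library (the document and whitespace tokenization are invented for illustration):

```python
import xml.etree.ElementTree as ET

def index_terms(xml_text):
    """Collect searchable terms from both element text and attribute
    values, so /play[title="hamlet"] and /play[@title="hamlet"]
    hit the same index."""
    terms = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.text and elem.text.strip():
            terms.extend(elem.text.lower().split())
        for value in elem.attrib.values():
            terms.extend(value.lower().split())
    return terms

print(index_terms('<play title="Hamlet"><act>To be</act></play>'))
# ['hamlet', 'to', 'be']
```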

19 Proposed XQuery Extension
RankBy ::= Expr "rankby" "(" QuerySpecList ")"
           ["basedon" "(" TargetSpecList ")"]
           ["using" MethodFunctCall]
QuerySpecList ::= Expr ("," QuerySpecList)?
TargetSpecList ::= PathExpr ("," TargetSpecList)?

20 Proposed XQuery Extension: Example
document("news.xml")//article//paragraph rankby ("New York")

for $c in document("newsmeta.xml")//category
return {$c/name/text()}
  for $a in document("news.xml")//article[@icd=$c/@id]
  return $a/title/text() rankby ($c/keywords)

21 Proposed XQuery Extension: Example
for $a in document("bib.xml")//article
return {$a/title}
rankby ("albert", "einstein") basedon (references)

22 Observations
- Possible: ranked text != returned text
  - Rank on references, return titles
- Query can be user supplied or a path expression
  - $c/keywords example

23 Implementation
- Term weighting: term weights are computed on the fly
- Index construction

24 Term Weighting
Parameters we need for computing term weights:
- tf - term frequency in DF
- df - number of DFs containing the term
- dlen - length of DF in terms
- slen - number of terms in DFS
Premise: the first step of query evaluation is the path query ("p query"), so term weights can be computed on the fly while evaluating it.
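A plausible tf.idf-style weight over these parameters can be sketched as follows; the formula itself is an illustrative assumption, not the UC Davis system's actual weighting function:

```python
import math

def term_weight(tf, df, dlen, slen, n_dfs):
    """Hypothetical on-the-fly term weight from the statistics above:
      tf    - term frequency in the document fragment (DF)
      df    - number of DFs containing the term
      dlen  - length of the DF in terms
      slen  - number of terms in the DFS (listed for completeness;
              unused in this particular sketch)
      n_dfs - number of DFs in the sequence
    Length-normalized tf times idf - an assumed, illustrative formula."""
    if tf == 0 or df == 0:
        return 0.0
    return (tf / dlen) * math.log(n_dfs / df)

w = term_weight(tf=3, df=2, dlen=100, slen=10000, n_dfs=50)
```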

25 Integrated Query Processing

26 Index Requirements
- Inverted index
- Non-replication of statistics
- DF inclusion
- DF intersection
- Worst case complexity: quadratic - number of DFs in current result set x number of DFs contributing to term statistics

27 Document Guide

28 Node Encoding
- Use the document guide
- Encode each node with a path identifier (PID): PID = (DG-index, [path number, path number, ...])
- Example: /db/article[1]/text/para[3] is encoded as (5,(1,3))
- Encode the DG-index with log(n) bits; encode each path number with log(k_i) bits
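A sketch of the PID encoding and its bit cost; the example document-guide size and fanouts are assumptions for illustration:

```python
import math

def encode_pid(dg_index, path_numbers, n_paths, fanouts):
    """Encode a node as (DG-index, (path numbers...)) and count the
    bits needed: log2(n) bits for the DG-index over n distinct guide
    paths, and log2(k_i) bits for the i-th path number, where k_i is
    the maximum fanout at step i."""
    bits = math.ceil(math.log2(n_paths))
    bits += sum(math.ceil(math.log2(k)) for k in fanouts)
    return (dg_index, tuple(path_numbers)), bits

# /db/article[1]/text/para[3] -> (5, (1, 3)), assuming a document
# guide with 8 distinct paths and fanouts of at most 16 at each step.
pid, bits = encode_pid(5, [1, 3], n_paths=8, fanouts=[16, 16])
print(pid, bits)  # (5, (1, 3)) 11
```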

29 Node Encoding

30 Term Counter Accumulation
- Worst case: quadratic
- Trick: order PIDs to reduce the set of DFs to check
  - (n1,N1) < (n2,N2) iff n1 < n2, or (n1 = n2 and N1 < N2)
- To find all descendants of (n1,N1):
  - start with (n1,N1)
  - proceed to the first node (n2,N2) that is not a descendant
  - stop
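The ordering trick can be sketched as follows; PIDs are simplified here to bare path-number tuples (an assumption — the real PIDs also carry a DG-index), so descent becomes prefix extension:

```python
import bisect

def is_descendant(pid, ancestor):
    # Simplified model: a PID is the tuple of path numbers from the
    # root, so a descendant's PID extends its ancestor's as a prefix.
    return len(pid) > len(ancestor) and pid[:len(ancestor)] == ancestor

def descendants(sorted_pids, ancestor):
    # With PIDs sorted, all descendants of a node form a contiguous
    # run right after it: scan forward and stop at the first
    # non-descendant instead of testing every PID.
    out = []
    i = bisect.bisect_right(sorted_pids, ancestor)
    while i < len(sorted_pids) and is_descendant(sorted_pids[i], ancestor):
        out.append(sorted_pids[i])
        i += 1
    return out

pids = sorted([(1,), (1, 1), (1, 2), (1, 2, 1), (2,), (2, 1)])
print(descendants(pids, (1,)))  # [(1, 1), (1, 2), (1, 2, 1)]
```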

31 Index Parameters

32 Performance

33 Limitations
- Need to start with a "p query" (a path), which is often inconvenient or not possible
- User must know the schema
- No element/attribute distinction for r queries
- XQuery allows construction of new fragments from subfragments
- XQuery allows construction of completely new documents

34 XIRQL
- University of Dortmund
- Goal: open-source XML search engine
- Motivation:
  - "Returnable" fragments are special "atomic units" (e.g., don't return an arbitrary text fragment)
  - Structured Document Retrieval Principle
  - Empower users who don't know the schema: enable search for any person_name no matter how the schema refers to it

35 Atomic Units
- Specified in the schema
- Only atomic units can be returned as the result of a search (unless a unit is specified)
- Tf.idf weighting is applied to atomic units
- Probabilistic combination of "evidence" from atomic units

36 XIRQL Indexing

37 Structured Document Retrieval Principle
A system should always retrieve the most specific part of a document answering a query.
Example query: xql
Document: a chapter (term weights 0.3 XQL, 0.5 example) containing a section (term weights 0.8 XQL, 0.7 syntax)
-> Return the section, not the chapter

38 Augmentation weights
- Ensure that the Structured Document Retrieval Principle is respected.
- Assume different query conditions are independent events.
- P(chapter,XQL) = P(XQL|chapter) + P(sec|chapter)*P(XQL|sec) - P(XQL|chapter)*P(sec|chapter)*P(XQL|sec)
  = 0.3 + 0.6*0.8 - 0.3*0.6*0.8 = 0.636
- Section ranked ahead of chapter
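The combination on this slide can be checked directly; a small sketch of the inclusion-exclusion computation under the independence assumption (the function name is ours):

```python
def combine(p_term_chapter, p_sec_chapter, p_term_sec):
    """P(chapter, term): direct evidence from the chapter, plus
    evidence propagated from the section weighted by the augmentation
    weight P(sec|chapter), combined by inclusion-exclusion assuming
    independence."""
    direct = p_term_chapter
    propagated = p_sec_chapter * p_term_sec
    return direct + propagated - direct * propagated

score = combine(0.3, 0.6, 0.8)
print(round(score, 3))  # 0.636 < 0.8, so the section outranks the chapter
```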

39 Datatypes
- Example: person_name
- Assign all elements and attributes with person semantics to this datatype
- Allow the user to search for "person" without specifying a path

40 XIRQL: Summary
- Relevance ranking
- Fragment/context selection
- Datatypes (person_name)
- Probabilistic combination of evidence

41 XML Summary
- Why you don't want to use a DB: relevance ranking
- Why you can't use a standard IR engine:
  - term statistics / indexing granularity
  - issues with fragments (granularity, coherence, ...)

42 XML Summary: Schemas
- Ideally: there is one schema and the user understands it
- In practice this is rare:
  - many schemas
  - schemas not known in advance
  - schemas change
  - users don't understand schemas
- Need to identify similar elements in different schemas (example: employee)

43 XML Summary: UI Challenges
- Help the user find relevant nodes in the schema (author, editor, contributor, "from:"/sender)
- What query language do you expose to the user?
  - XQuery? No.
  - Forms? Parametric search? A text box?
- In general: design a layer between XML and the user

44 Metadata

45 Dublin Core Element Set
- Title (e.g., Dublin Core Element Set)
- Creator (e.g., Hinrich Schuetze)
- Subject (e.g., keywords)
- Description (e.g., an abstract)
- Publisher (e.g., Stanford University)
- Contributor (e.g., Chris Manning)
- Date (e.g., 2002.12.03)
- Type (e.g., presentation)
- Format (e.g., ppt)
- Identifier (e.g., http://www.stanford.edu/class/cs276a/syllabus.html)
- Source (e.g., http://dublincore.org/documents/dces/)
- Language (e.g., English)
- Coverage (e.g., San Francisco Bay Area)
- Rights (e.g., Copyright Stanford University)
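A minimal machine-readable sketch of a record using a few of the elements above; the dc namespace URI is the standard Dublin Core elements namespace, while the record layout itself is illustrative:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record = ET.Element("record")
for name, value in [
    ("title", "Dublin Core Element Set"),
    ("creator", "Hinrich Schuetze"),
    ("date", "2002-12-03"),
    ("format", "ppt"),
    ("language", "English"),
]:
    elem = ET.SubElement(record, f"{{{DC_NS}}}{name}")
    elem.text = value

xml = ET.tostring(record, encoding="unicode")
print(xml)
```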

46 Why metadata?
- "Normalized" semantics
- Enables searches otherwise not possible:
  - time
  - author
  - URL / filename
  - non-text content (images, audio, video)

47 For Effective Metadata We Need:
- Semantics: commonly understood terms to describe information resources
- Syntax: a standard grammar for connecting terms into meaningful "sentences"
- Exchange framework: so we can recombine and exchange metadata across applications and subjects
(Weibel and Miller)

48 Dublin Core: Goals
- Metadata standard
- Framework for interoperability
- Facilitate development of subject-specific metadata
More general goals of the DC community:
- Foster development of metadata tools (creating, editing, managing, navigating metadata)
- Semantic registry for metadata: search declared meanings and their relationships
- Registry for metadata vocabularies: maybe somebody has already done the work

49 RDF = Resource Description Framework
- Engineering standard for metadata
- W3C standard, part of W3C's metadata framework
- Specialized for the WWW
- Desiderata:
  - combine different metadata modules (e.g., different subject areas)
  - syndication, aggregation, threading

50 Metadata: Who creates them?
- Authors
- Editors
- Automatic creation
- Automatic context capture

51 Automatic Context Capture

52 Metadata Pros and Cons
Cons:
- Most authors are unwilling to spend time and energy on learning a metadata standard and annotating the documents they author.
- Authors are unable to foresee all the reasons why a document may be interesting.
- Authors may be motivated to sabotage metadata (patents).
Pros:
- Information retrieval often does not work; words poorly approximate meaning.
- For truly valuable content, it pays to add metadata.
Synthesis:
- In reality, most documents have some valuable metadata.
- If metadata is available, it improves relevance and user experience.
- But the most interesting content will always have inconsistent and spotty metadata coverage.

53 Advanced Research Issues
- Cross-lingual IR
- Topic detection & tracking
- Information filtering and collaborative filtering
- Data fusion
- Automatic classification
- Merging database and IR technologies
- Digital libraries
- Multimedia information retrieval

54 Resources
- Jan-Marco Bremer's publications on XML and IR: http://www.db.cs.ucdavis.edu/~bremer
- XML resources at W3C: www.w3.org/XML
- Ronald Bourret on native XML databases: http://www.rpbourret.com/xml/ProdsNative.htm
- Norbert Fuhr and Kai Grossjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th International ACM SIGIR Conference, New Orleans, Louisiana, September 2001.
- http://www.sciam.com/2001/0501issue/0501berners-lee.html
- http://www.xml.com/pub/a/2001/01/24/rdf.html

55 Additional Resources
- http://www.hyperorg.com/backissues/joho-jun26-02.html#semantic
- http://www.xml.com/pub/a/2001/05/23/jena.html
- http://www710.univ-lyon1.fr/%7Echampin/rdf-tutorial/
- http://www-106.ibm.com/developerworks/library/w-rdf/

