Searching XML Documents via XML Fragments. D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer. Presented by Hui Fang.


1 Searching XML Documents via XML Fragments
D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer
Presented by Hui Fang

2 Background (1) --- Data
Database: well-structured data
–Schema: Papers (Title, Authors, Conf., Journal)

Title   | Authors                | Conf. | Journal
XIRQL   | N. Fuhr, K. Großjohann | SIGIR |
Protein | John Smith             | ---   | Bioinformatics

IR: un-structured data
–An example document: "Intel: New chip, new price war. February 1, 2004: 6:32 PM EST. Intel Corp. on Sunday said it had refreshed its line of microchips for desktop computers with a new version of the Pentium 4 processor, designed to run increasingly power-hungry office and home entertainment software faster. In 1998, …"
Limitations listed on the slide:
–Lack of flexibility
–Lack of extensibility
–Lack of the logical structure of a document
DB+IR: semi-structured data
–The same paper record as XML: XIRQL; N. Fuhr; K. Großjohann; SIGIR (element tags not preserved)
Why is semi-structured data important?

3 XML in a nutshell
Hierarchical data format
Nested element structure having a root
Self-describing data (tags): the schema is attached to the data itself
[Figure: an example book element with attribute id="25" and child elements author (Karen Sparck Jones; Peter Willett), publisher (Morgan Kaufmann), title (Readings in Information Retrieval …) and year (1997), shown both as markup and as a tree, with labels for start tag, content, end tag, attribute and element.]
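The example markup on this slide did not survive extraction. As an illustration only, the sketch below rebuilds a plausible version from the labels that did survive (book, id="25", author, publisher, title, year) and walks it with Python's standard `xml.etree.ElementTree`. The element order and the full title text are assumptions, not the slide's exact markup.

```python
import xml.etree.ElementTree as ET

# Reconstructed example (assumption: nesting and order follow the labels that
# survived on the slide; the original markup itself was not preserved).
BOOK_XML = """
<book id="25">
  <author>Karen Sparck Jones</author>
  <author>Peter Willett</author>
  <publisher>Morgan Kaufmann</publisher>
  <title>Readings in Information Retrieval</title>
  <year>1997</year>
</book>
"""

root = ET.fromstring(BOOK_XML)


def walk(element, path=""):
    """Yield (path, text) for every element that carries text content."""
    current = f"{path}/{element.tag}"
    text = (element.text or "").strip()
    if text:
        yield current, text
    for child in element:
        yield from walk(child, current)


pairs = list(walk(root))
for path, text in pairs:
    print(path, "->", text)
print("attribute id =", root.get("id"))
```

Each printed path (for example `/book/title`) shows the nesting the slide refers to: the tags describe the data, so no separate schema is needed to read the record.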

4 Background (2) --- Query
Database: Boolean query
–SQL (Structured Query Language): SELECT title FROM papers WHERE conf = 'SIGIR'
–Returns the unranked tuples satisfying the query.
IR: ranked query
–Keywords: paper SIGIR
–Returns the documents ranked according to their relevance.
How to query semi-structured data (e.g. XML data)?
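To make the contrast concrete, here is a small sketch using Python's built-in `sqlite3`. The first two rows echo the Papers example from the Background (1) slide, and the third row is made up. The keyword scorer is a deliberately crude term count standing in for a real relevance model.

```python
import sqlite3

# Toy data: two rows from the Papers example, one hypothetical row.
rows = [
    ("XIRQL", "N. Fuhr, K. Großjohann", "SIGIR", None),
    ("Protein", "John Smith", None, "Bioinformatics"),
    ("A SIGIR paper about XML", "A. Author", "SIGIR", None),  # hypothetical
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE papers (title TEXT, authors TEXT, conf TEXT, journal TEXT)"
)
conn.executemany("INSERT INTO papers VALUES (?, ?, ?, ?)", rows)

# Database style: a Boolean condition; every matching tuple, no ranking.
boolean_result = [
    r[0] for r in conn.execute("SELECT title FROM papers WHERE conf = 'SIGIR'")
]


def keyword_score(record, keywords):
    """Crude relevance stand-in: count keyword occurrences in all fields."""
    words = " ".join(field for field in record if field).lower().split()
    return sum(words.count(k.lower()) for k in keywords)


# IR style: score every record against the keywords and sort by score.
query = ["paper", "SIGIR"]
ranked = sorted(rows, key=lambda r: keyword_score(r, query), reverse=True)
ranked_titles = [r[0] for r in ranked if keyword_score(r, query) > 0]

print(boolean_result)
print(ranked_titles)
```

The SQL query treats both SIGIR rows as equally good answers, while the keyword query orders them by how strongly they match.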

5 Related Work
DB-oriented approaches
–E.g. XML-QL, XQL, XQuery …
–Example (element tags not preserved): WHERE Harry Potter $a, $y in "books.xml", $y > 2002 CONSTRUCT $t
DB+IR approaches
–E.g. XIRQL
IR-oriented approaches
–E.g. this paper

6 Problem Refinement --- CAS Search
Document collection:
–XML documents
–Each document is a hierarchical structure of nested elements
–Markup in the document mainly serves to expose the logical structure of the document.
Query:
–Content plus explicit references to the XML structure
–Specifies the target element that needs to be returned
An example: Retrieve all articles from the years [years not preserved] that deal with works on nonmonotonic reasoning. Do not retrieve articles that are calendars or calls for papers.

7 Approach
Compare apples with apples
Recall vector space models
–Both documents and queries are expressed in free text.
–Unstructured data is compared to unstructured data.
This paper:
–Search XML documents via XML fragments

8 Query --- XML Fragments (1)
Topic 1: Find all books about fishing
–XML fragment: fishing (enclosing tags not preserved)
Topic 2: Find all books having a title about search
–XML fragment: fishing (enclosing tags not preserved)
–XQuery: { for $t in document("library.xml")//book/title where contains($t/text(), "search") return $t }
Compared with XQuery, XML fragments are:
–More intuitive
–More flexible
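One way to read an XML fragment query is as a bag of (term, context) pairs, where the context is the path of enclosing tags. The sketch below does this with `xml.etree.ElementTree`. The fragment strings are assumptions built from the topic descriptions, because the slide's own markup was not preserved, and `fragment_to_pairs` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def fragment_to_pairs(fragment):
    """Split an XML fragment query into (term, context-path) pairs."""
    root = ET.fromstring(fragment)
    pairs = []

    def visit(element, path):
        current = f"{path}/{element.tag}"
        # Text directly inside this element, before any child element.
        for term in (element.text or "").split():
            pairs.append((term.lower(), current))
        for child in element:
            visit(child, current)
            # Text that follows a child still belongs to the current element.
            for term in (child.tail or "").split():
                pairs.append((term.lower(), current))

    visit(root, "")
    return pairs


# Assumed fragments for the two topics (not the slide's exact markup).
topic1 = fragment_to_pairs("<book>fishing</book>")
topic2 = fragment_to_pairs("<book><title>search</title></book>")
print(topic1)
print(topic2)
```

Written this way, the query has the same shape as the documents, so both sides can be indexed and compared with the same machinery.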

9 Query --- XML Fragments (2)
Limited expressiveness
–E.g. "Find figures that describe the CORBA architecture and the paragraphs that refer to those figures."
–This requires a "join" operation between two elements, "figures" and "paragraphs".

10 Recall: Text Retrieval Task
Given a query:
–Compute the relevance score of each document according to the retrieval formula;
–Rank the documents according to their relevance scores.
Vector Space Model
–Represent each doc/query by a vector of terms
–Relevance between doc and query: distance between the two vectors d and q
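A minimal sketch of this task follows, assuming raw term-frequency vectors and cosine similarity as the measure of closeness between d and q. The slide only says "distance between two vectors", so the choice of cosine is an assumption, and the three documents are made up.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a text as a sparse vector of raw term frequencies."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Score every document against the query and sort by score."""
    q = tf_vector(query)
    scored = [(name, cosine(q, tf_vector(text))) for name, text in docs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical mini-collection.
docs = {
    "d1": "xml retrieval with xml fragments",
    "d2": "relational database query processing",
    "d3": "information retrieval evaluation",
}

ranking = rank("xml retrieval", docs)
for name, score in ranking:
    print(name, round(score, 3))
```

Real systems weight terms by tf-idf rather than raw counts, but the two steps from the slide (score each document, then sort) are the same.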

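11 Extending the Vector Space Model (1)
Indexing unit: a term together with its context
–E.g. ("Harry Potter", /book/title)
–Can be matched with ("Harry Potter", /book) and ("Harry Potter", /book/sec/title)
Retrieval formula: [formula not preserved]
Context resemblance measure between contexts c_i and c_k:
–Perfect match: 1 when c_i = c_k; 0 otherwise
–Partial match: [formula not preserved] when c_i is a subsequence of c_k; 0 otherwise
–Fuzzy match: [formula not preserved]
–Flat (ignore context): [formula not preserved]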
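The formulas on this slide were lost, so the following is only a sketch of how context resemblance could plug into scoring. The partial-match value (1 + |c_i|) / (1 + |c_k|) is an assumed form, the flat measure is assumed to return 1 for every pair, all term weights are fixed at 1, and the fuzzy variant is omitted. None of the function names come from the source.

```python
def split_path(path):
    """Turn '/book/sec/title' into ['book', 'sec', 'title']."""
    return [part for part in path.split("/") if part]


def is_subsequence(short, long):
    """True if the tags of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(tag in remaining for tag in short)


def cr_perfect(cq, cd):
    """Contexts must be identical."""
    return 1.0 if split_path(cq) == split_path(cd) else 0.0


def cr_partial(cq, cd):
    """Assumed form: (1 + |cq|) / (1 + |cd|) when cq is a subsequence of cd."""
    q, d = split_path(cq), split_path(cd)
    return (1 + len(q)) / (1 + len(d)) if is_subsequence(q, d) else 0.0


def cr_flat(cq, cd):
    """Ignore context entirely (assumed to mean every pair resembles fully)."""
    return 1.0


def score(query_pairs, doc_pairs, cr):
    """Sum cr over query/document pairs that share a term (weights fixed at 1)."""
    return sum(
        cr(cq, cd)
        for tq, cq in query_pairs
        for td, cd in doc_pairs
        if tq == td
    )


query = [("potter", "/book/title")]
doc = [("potter", "/book/sec/title"), ("potter", "/book")]

for name, cr in [("perfect", cr_perfect), ("partial", cr_partial), ("flat", cr_flat)]:
    print(name, score(query, doc, cr))
```

Swapping the `cr` function is the only change between the variants, which is what lets the evaluation later compare them as separate runs.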

12 Extending the Vector Space Model (2)
Term weights: [formulas not preserved]
–If c is rare, idf(t,c) would be high in spite of t being very common.
"Merge-idf" variant: [formulas not preserved]
"Merge" variant: [formula not preserved]
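The weight formulas are missing from the transcript, but the rare-context problem the slide describes can be shown with a toy calculation. The sketch assumes the plain form idf = log(N / df) and assumes a merge-idf style statistic pools document frequencies over all contexts of a term. The paths, counts and function names are hypothetical.

```python
import math

N = 1000  # hypothetical collection size

# Hypothetical postings: which documents contain the term in which context.
postings = {
    ("retrieval", "/article/bdy/p"): set(range(800)),
    ("retrieval", "/article/fm/kwd"): {3, 7},
}


def idf_context(term, context):
    """idf computed per (term, context) pair (assumed form log(N / df))."""
    return math.log(N / len(postings[(term, context)]))


def idf_merged(term):
    """idf computed per term, pooling documents over all of its contexts."""
    docs = set()
    for (t, _context), doc_ids in postings.items():
        if t == term:
            docs |= doc_ids
    return math.log(N / len(docs))


rare = idf_context("retrieval", "/article/fm/kwd")
common = idf_context("retrieval", "/article/bdy/p")
merged = idf_merged("retrieval")
print(round(rare, 3), round(common, 3), round(merged, 3))
```

Here the term occurs in 800 of 1000 documents, yet its idf in the rare context is far higher than the pooled value, which is the distortion that merging is meant to remove.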

13 Evaluation
Runs:
–Partial-match
–Partial-match.merge-idf
–Partial-match.merge
–Fuzzy-match.merge-idf
–Flat (ignore context)

14 Result (1)
Result for "free-text-oriented" topics
–An example topic (element tags not preserved): 1995, 1996, 1997, 1998, 1999; XML; Electronic commerce

15 Result (2)
Result for "context-oriented" topics
–An example topic (element tags not preserved): Content-Based retrieval of video databases

16 Summary
Using XML fragments with an extended vector space model is promising.
Use different solutions for different types of applications.
Something wrong?

17 Another Problem --- CO Search
Document collection:
–XML documents
Query:
–A set of keywords
Task: Find the smallest element satisfying the query
Challenge: rank components instead of documents

18 Possible Solutions
Possible method (1): treat each component as a document.
[Figure: nested components containing terms t1 and t2; scoring formula not preserved]
Problem with this method: XML components are nested.

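19 Possible Solutions (Cont.)
Possible method (2): count TF at the component level; compute N and DF at the document level.
[Figure: nested components containing terms t1 and t2; scoring formula not preserved]
Problem with this method: it is impossible to differentiate between the rankings of the three sections.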

20 Proposed Solution
Create an index for each component type
–Elements in each index are regarded as documents
–Keep N, DF, TF for the specific component type
–Can apply the regular vector space model on each index
Given a query
–Run the query in parallel on each index
–Return one ranked list of results from each index
Normalize the scores in each index into the range (0,1)
–Achieved by computing [formula not preserved]
Merge the normalized results into one ranked list of all components
Assumptions
–The set of potential components to be returned must be known in advance.
–No nesting of the same component.
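The steps above can be sketched as follows, under loud assumptions. The tf-idf weighting is a simple stand-in, and because the slide's normalization formula was not preserved, the sketch divides by the best score in each index, which yields values in (0, 1] rather than the slide's (0,1). The class, function names and toy elements are all hypothetical.

```python
import math
from collections import Counter


class ComponentIndex:
    """A tiny vector-space index over the elements of one component type."""

    def __init__(self):
        self.elements = {}  # element id -> term counts

    def add(self, element_id, text):
        self.elements[element_id] = Counter(text.lower().split())

    def search(self, query):
        """Score elements with a simple tf * idf sum; N and DF are per index."""
        n = len(self.elements)
        terms = query.lower().split()
        df = {t: sum(1 for tf in self.elements.values() if t in tf) for t in terms}
        scores = {}
        for element_id, tf in self.elements.items():
            s = sum(tf[t] * math.log(1 + n / df[t]) for t in terms if df[t])
            if s > 0:
                scores[element_id] = s
        return scores


def merged_ranking(indexes, query):
    """Query every index, normalize per index, merge into one ranked list."""
    merged = []
    for component_type, index in indexes.items():
        scores = index.search(query)
        if not scores:
            continue
        top = max(scores.values())
        # Stand-in normalization (assumption): divide by the index's best score.
        merged.extend((s / top, component_type, eid) for eid, s in scores.items())
    return sorted(merged, reverse=True)


# Hypothetical collection with three component types.
indexes = {"article": ComponentIndex(), "sec": ComponentIndex(), "p": ComponentIndex()}
indexes["article"].add("a1", "xml retrieval with fragments and ranking of xml components")
indexes["sec"].add("s1", "xml retrieval with fragments")
indexes["sec"].add("s2", "xml ranking")
indexes["p"].add("p1", "xml retrieval")
indexes["p"].add("p2", "fragments")
indexes["p"].add("p3", "ranking xml")

ranking = merged_ranking(indexes, "xml retrieval")
for score, component_type, element_id in ranking:
    print(round(score, 3), component_type, element_id)
```

Keeping N and DF separate per component type avoids the nesting problem of scoring all components in one index, and normalizing before the merge is what makes scores from different indexes comparable.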

21 Conclusion
Possible solutions to the following challenges:
–Challenge 1 (Information/Doc Unit): What is an appropriate information unit?
  The document may no longer be the most natural unit
  Components in a document may be more appropriate
–Challenge 2 (Query): What is an appropriate query language?
  A keyword (free-text) query is no longer the only choice
  Constraints on the structure can be posed

22 References
–Y. Mass and M. Mandelbrod. Retrieving the most relevant XML components. INEX '03 workshop.
–D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer. Searching XML Documents via XML Fragments. SIGIR '03.
–N. Fuhr and K. Großjohann. XIRQL: A Query Language for Information Retrieval in XML Documents. SIGIR '02.