XSEarch: XML Search Engine
Jonathan MAMOU, October 2002



Motivation

XML
Getting popular.
Allows metadata to be embedded into documents.
Data-centric view: exchange format for structured data (metadata).
Document-centric view: content (text and metadata).
Querying data and metadata.

amazing.com: Buy our Classic Children's books.
One Fish Two Fish by John Meyer & Peter Smith. Costs Only: $7.95
Goodnight Moon by Margaret Brown. Costs Only: $10.55
Brown Bear by Bill Martin Jr. Costs Only: $6.00

One Fish Two Fish John Meyer Peter Smith 7.95 Goodnight Moon Margaret Brown

A query
Find titles and prices of books by 'Meyer' or 'Smith'.

IR Approach
How to deal with tags?
Discard all tags: simple, but the structure is lost, which lowers retrieval performance.
Keep tags as keywords: how to write the query? "Title price book author Meyer Smith"

IR Approach (cont'd)
Can't specify that Meyer and Smith are the authors.
Can't specify that title, price and author belong to the same book.
Can't specify the desired output (i.e., titles and prices).

Database approach
FOR $b IN document("bib.xml")//book
WHERE $b/author contains 'Meyer' OR $b/author contains 'Smith'
RETURN $b/title $b/price
Difficult for a naive user.
Requires knowledge of the document structure.
Dependent on the document structure.
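To make the structure dependence concrete, here is a minimal Python sketch of the same query using the standard library's xml.etree.ElementTree. The tag names (bookinfo, book, title, author, price) and the sample data are assumptions based on the book example earlier in the slides; the function name titles_and_prices is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed document layout: tag names are illustrative, not taken from the slides.
BIB_XML = """<bookinfo>
  <book><title>One Fish Two Fish</title><author>John Meyer</author><author>Peter Smith</author><price>7.95</price></book>
  <book><title>Goodnight Moon</title><author>Margaret Brown</author><price>10.55</price></book>
</bookinfo>"""

# Same data, but the author name now sits one level deeper (author/name).
NESTED_XML = BIB_XML.replace("<author>", "<author><name>").replace("</author>", "</name></author>")


def titles_and_prices(xml_text, surnames):
    """Return (title, price) for each book with an author containing a surname."""
    root = ET.fromstring(xml_text)
    results = []
    for book in root.iter("book"):
        # Only looks at text directly inside <author>, like $b/author in the query.
        authors = [author.text or "" for author in book.findall("author")]
        if any(surname in author for author in authors for surname in surnames):
            results.append((book.findtext("title"), book.findtext("price")))
    return results


flat_result = titles_and_prices(BIB_XML, ["Meyer", "Smith"])
nested_result = titles_and_prices(NESTED_XML, ["Meyer", "Smith"])
```

With the flat layout the query finds the book; once the author text moves under a nested element, the same code returns nothing, which is the kind of structure dependence the slide points at.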

Our Goal
Combine IR and database techniques: tags + text.
Simple language.
Logical structure, not physical.
Require knowledge of tag names, not structure.
Queries should work even if the structure changes.
Rank results.

Framework

Tree Representation
[Figure: tree rooted at bookinfo with book, title, author and price nodes; leaf values include Just Lost, Mercy Meyer, Gina Meyer, $5.75, Brown Bear, $13.95.]
We need to find tuples of related title and price nodes.

Another Tree Representation
[Figure: tree rooted at bookinfo with author, name, book, title and price nodes; leaf values include Dr. Meyer, M. Brown, Goodnight Moon, One Fish Two Fish, $12.50, Cat in the Hat, $14.95.]
Similar document, but with a different hierarchical structure from the previous one.
We need to find tuples of related title, author and price nodes.

Interconnection
Consider a title node and a price node.
[Figure: the bookinfo tree with two nodes circled and their lowest common ancestor marked.]
Intuition: the nodes belong to different book entities.

Interconnection (cont'd)
[Figure: the bookinfo tree with two nodes circled and their lowest common ancestor marked.]
Intuition: the nodes belong to the same book entity.

Interconnection (cont'd)
[Figure: the bookinfo tree.]
Intuition: the nodes belong to the same book entity.

Relationship tree
Let n1, n2 be nodes, n their lowest common ancestor, and Tn the subtree rooted at n.
The relationship tree of n1, n2 is the tree obtained by pruning from Tn all nodes, other than n1 and n2, that are not ancestors of n1 or n2.

Interconnection
We say that n1, n2 are interconnected if either:
the relationship tree does not contain 2 distinct nodes with the same label, or
the relationship tree contains exactly one pair of distinct nodes with the same label, and this pair consists of n1 and n2.
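The two definitions above can be sketched directly in Python. This is a minimal illustration, assuming the tree is stored as a parent array (parent[i] is the parent of node i, -1 for the root); that representation, the function names, and the example tree (modeled loosely on the bookinfo figure) are assumptions made for this sketch.

```python
# Assumed example tree, nodes numbered in document order:
# 0 bookinfo
#   1 book: 2 title, 3 author(4 name), 5 author(6 name), 7 price
#   8 book: 9 title, 10 price
parent = [-1, 0, 1, 1, 3, 1, 5, 1, 0, 8, 8]
labels = ["bookinfo", "book", "title", "author", "name",
          "author", "name", "price", "book", "title", "price"]


def path_to_root(parent, node):
    """Return node followed by all of its ancestors, ending at the root."""
    path = [node]
    while parent[path[-1]] != -1:
        path.append(parent[path[-1]])
    return path


def relationship_tree(parent, n1, n2):
    """Nodes kept after pruning: n1, n2 and their ancestors up to the LCA."""
    up1 = path_to_root(parent, n1)
    up2 = path_to_root(parent, n2)
    on_up1 = set(up1)
    lca = next(node for node in up2 if node in on_up1)
    return set(up1[:up1.index(lca) + 1]) | set(up2[:up2.index(lca) + 1])


def interconnected(parent, labels, n1, n2):
    """True unless two distinct nodes other than the pair (n1, n2) share a label."""
    nodes = sorted(relationship_tree(parent, n1, n2))
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if labels[a] == labels[b] and {a, b} != {n1, n2}:
                return False
    return True
```

In this assumed tree, the title and price of the first book (nodes 2 and 7) are interconnected, while the title of the first book and the price of the second (nodes 2 and 10) are not, because their relationship tree contains two distinct book nodes.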

All-Pairs Interconnection
A set of nodes is all-pairs interconnected if every pair of nodes in the set is interconnected.

Star interconnection
[Figure: the bookinfo tree with book, title, author, name and price nodes.]
The 2 names are not interconnected.

Star Interconnection (cont'd)
A set of nodes is star interconnected if all the nodes in the set are interconnected with the same node.
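The difference between the two notions can be sketched as follows. This is an illustration only: the pairwise test is passed in as a function, the tiny relation below is hand-built, and the sketch assumes that the common "star center" must itself be a member of the set, which the slide does not state explicitly.

```python
from itertools import combinations


def all_pairs_interconnected(nodes, connected):
    """Every pair of nodes in the set must be interconnected."""
    return all(connected(a, b) for a, b in combinations(nodes, 2))


def star_interconnected(nodes, connected):
    """Some node of the set (assumed center) is interconnected with all others."""
    return any(
        all(connected(center, node) for node in nodes if node != center)
        for center in nodes
    )


# Hand-built relation: the title is interconnected with each name,
# but the two names are not interconnected with each other.
related = {frozenset({"title", "name1"}), frozenset({"title", "name2"})}


def connected(a, b):
    return frozenset({a, b}) in related


answer = ["title", "name1", "name2"]
is_star = star_interconnected(answer, connected)
is_all_pairs = all_pairs_interconnected(answer, connected)
```

Here the set is star interconnected (through the title) but not all-pairs interconnected, which matches the figure's remark about the two names.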

Search terms, Search query
Search term: (l,k), where l is a label (context) and k is a keyword.
Search query: AND:L1 OR:L2, where L1, L2 are lists of search terms.
Example: AND:(title,)(price,) OR:(author,Meyer)(author,Smith)
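As an illustration of the query shape, the following sketch parses a query written in the slide's AND:/OR: notation into two lists of (label, keyword) terms. The parser, the SearchTerm type and the use of None for a missing part are assumptions made for this example; the slides do not describe how queries are actually parsed.

```python
import re
from typing import NamedTuple, Optional


class SearchTerm(NamedTuple):
    label: Optional[str]    # tag name (context), None if omitted
    keyword: Optional[str]  # keyword, None if omitted


def parse_terms(text):
    """Parse a run of '(label,keyword)' groups into SearchTerm tuples."""
    return [
        SearchTerm(label.strip() or None, keyword.strip() or None)
        for label, keyword in re.findall(r"\(([^,()]*),([^()]*)\)", text)
    ]


def parse_query(query):
    """Split 'AND:... OR:...' into the two term lists L1 and L2."""
    match = re.fullmatch(r"\s*AND:(.*?)\s*OR:(.*)", query)
    if match is None:
        raise ValueError("expected a query of the form 'AND:... OR:...'")
    return parse_terms(match.group(1)), parse_terms(match.group(2))


and_terms, or_terms = parse_query(
    "AND:(title,)(price,) OR:(author,Meyer)(author,Smith)"
)
```

The AND list then holds the terms every answer must match, and the OR list the optional ones.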

Answer
AND:N1 OR:N2, where N1, N2 are lists of nodes.
Matching between N1, N2 and L1, L2.
N1 and N2 are interconnected.
All all-pairs answers are star answers.
Maximal answer.

Example
(title,) (price,) (author,Meyer)
[Figure: the bookinfo tree.]
Find matchings of title, author and price to the nodes in the tree.
[Table columns: title, author, price; an entry may be null.]

Computing answers
All-pairs: determining whether the set of answers is empty is NP-complete. If L1 is empty, computing the set of answers is polynomial in the size of the input and output.
Star: computing the set of answers is polynomial in the size of the input and output.

Ranking results
Unstructured: keyword weight (tfilf), tag weight, result size.
Structured: node distance, ancestor-descendant relationship.

Keyword Weight
Compute the weight of a keyword k within a given node n.
A variation of tfidf, one of the metrics of the Vector Space Model (a classical model in IR).

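Keyword Weight (cont'd)
Term Frequency (tf): number of appearances of k within n, relative to the most frequent keyword k' in n.
tf(k,n) = occ(k,n) / max over k' of occ(k',n)
Inverse Leaf Frequency (ilf): inverse frequency of k among all the leaves in the corpus.
ilf(k) = log(1 + N/N_k), where N is the number of leaves in the corpus and N_k is the number of leaves containing k.
W(k,n) = tf(k,n) * ilf(k)
Normalized per leaf.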
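The formulas above can be written out as a short Python sketch. It assumes each leaf is given as a list of already tokenized words, and it leaves out the final per-leaf normalization because the slide does not spell out its form; the function names are illustrative.

```python
import math
from collections import Counter


def tf(keyword, leaf_words):
    """occ(k, n) divided by the count of the most frequent word in the leaf."""
    counts = Counter(leaf_words)
    if not counts:
        return 0.0
    return counts[keyword] / max(counts.values())


def ilf(keyword, leaves):
    """log(1 + N / N_k): N leaves in total, N_k of them containing the keyword."""
    containing = sum(1 for words in leaves if keyword in words)
    if containing == 0:
        return 0.0  # assumption: a keyword absent from the corpus gets weight 0
    return math.log(1 + len(leaves) / containing)


def weight(keyword, leaf_words, leaves):
    """W(k, n) = tf(k, n) * ilf(k), before any normalization."""
    return tf(keyword, leaf_words) * ilf(keyword, leaves)


# Toy corpus of three leaves (illustrative data).
leaves = [["meyer", "meyer", "gina"], ["brown", "bear"], ["meyer"]]
gina_weight = weight("gina", leaves[0], leaves)
```

In this toy corpus "gina" appears once in a leaf whose most frequent word appears twice, and in one leaf out of three, so its weight is 0.5 * log(1 + 3/1).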

Tag Weight Give weight to tags according to their importance E.g. give more weight to than to

Result Size
Number of search terms appearing in the result (OR part).

Ranking: Structured
Node distance: size of the relationship tree.
Ancestor-descendant relationship: "more" interconnected.

System overview

XSEarch overview
[Diagram labels: XML corpus with logical hierarchy, Indexer, Search query, Results; Offline and Online phases.]

Document Location Array
Generate a unique id, did, for each document.
Associate each did with the physical location of the corresponding document.
Logical structure of the corpus.

Node Encoding Array
Generate an id, nid, for each interior node.
Node encoding, defined recursively: the node encoding of its parent, followed by the index of the node among its siblings.
Associate each nid with its node encoding.
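A minimal sketch of such a recursive encoding, assuming the tree is given as a dictionary from a node id to the ordered list of its children, and representing an encoding as a tuple of sibling indexes. The prefix-based ancestor test is an illustration of what an encoding of this kind makes easy, not something the slides state; all names here are hypothetical.

```python
def encode_nodes(children, root=0):
    """Map each nid to its encoding: parent's encoding + index among siblings."""
    encoding = {root: ()}
    stack = [root]
    while stack:
        node = stack.pop()
        for index, child in enumerate(children.get(node, [])):
            encoding[child] = encoding[node] + (index,)
            stack.append(child)
    return encoding


def is_ancestor(enc_a, enc_b):
    """With this encoding, a is a proper ancestor of b iff enc_a is a proper prefix."""
    return len(enc_a) < len(enc_b) and enc_b[:len(enc_a)] == enc_a


# Assumed example tree: node 0 is the root with children 1 and 8, and so on.
children = {0: [1, 8], 1: [2, 3, 5, 7], 3: [4], 5: [6], 8: [9, 10]}
encoding = encode_nodes(children)
```

For example, node 4 (first child of node 3, which is the second child of node 1, the first child of the root) gets the encoding (0, 1, 0).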

Node Label Array
Associate each nid with its label.

Inverted Tag Index
For each tag, keep a posting list: the list of nodes labeled with this tag.
[Diagram labels: tag, weight, Nid1, Nid2, Nid3.]

Inverted Keyword Index
For each keyword (kw), keep a posting list: the list of leaves containing this keyword, each with the weight of the keyword within the leaf (tfilf).
[Diagram: kw with postings (Nid1,w1), (Nid2,w2), (Nid3,w3).]
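A posting-list structure of this shape could be built as in the sketch below. It assumes the leaves are given as a dictionary from nid to a list of tokenized words, uses occ / max occ times log(1 + N/N_k) as the weight without any further normalization, and the name build_keyword_index is hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_keyword_index(leaves):
    """Return {keyword: [(nid, weight), ...]} for a dict {nid: [words]}."""
    leaf_count = len(leaves)
    # Number of leaves containing each keyword (N_k).
    containing = Counter()
    for words in leaves.values():
        containing.update(set(words))

    index = defaultdict(list)
    for nid, words in leaves.items():
        counts = Counter(words)
        if not counts:
            continue  # empty leaf: nothing to index
        top = max(counts.values())
        for keyword, occurrences in counts.items():
            ilf = math.log(1 + leaf_count / containing[keyword])
            index[keyword].append((nid, (occurrences / top) * ilf))
    return dict(index)


# Illustrative leaves keyed by nid.
leaves = {4: ["mercy", "meyer"], 6: ["gina", "meyer"], 9: ["brown", "bear"]}
index = build_keyword_index(leaves)
meyer_postings = index["meyer"]
```

Looking up a keyword then yields the leaves that contain it together with their weights, without scanning the corpus.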

Node Interconnection Matrix
Element ij contains 1 if ni and nj are interconnected, and 0 otherwise.
n*n symmetric sparse matrix.
Dynamic programming.

Alternative
Hash set: keep only interconnected nodes.
Key: pair (ni, nj).
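A small sketch of the hash-set alternative, assuming node ids are integers and storing each pair in a canonical (smaller, larger) order so that the symmetric relation needs only one entry per pair. The class name and its methods are hypothetical.

```python
class InterconnectionSet:
    """Stores only interconnected pairs; any absent pair is not interconnected."""

    def __init__(self):
        self._pairs = set()

    @staticmethod
    def _key(ni, nj):
        # Canonical order, since interconnection is symmetric.
        return (ni, nj) if ni <= nj else (nj, ni)

    def add(self, ni, nj):
        self._pairs.add(self._key(ni, nj))

    def connected(self, ni, nj):
        return self._key(ni, nj) in self._pairs

    def __len__(self):
        return len(self._pairs)


pairs = InterconnectionSet()
pairs.add(7, 2)
pairs.add(2, 7)  # same pair, stored once
```

Compared with the n*n matrix, memory use is proportional to the number of interconnected pairs rather than to the square of the number of nodes.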

Interconnection
Let n be the number of nodes.
It is possible to determine whether n1 and n2 are interconnected in O(n) time.
It is possible to determine interconnection of all pairs in O(n^2) time.
Offline/Online computation.

Interconnection
for (i = size-1; i >= 0; i--)
  for (j = i+1; j < size; j++)
    if i ancestor of j
      connected(iChild,j) AND connected(i,jFather) AND labelIChild != labelJ AND labelI != labelJFather
  for (j = i+1; j < size; j++)
    if i not ancestor of j
      connected(i,jFather) AND connected(iFather,j) AND labelI != labelJFather AND labelIFather != labelJ
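The recurrences above can be tried out with the following Python sketch. It keeps the two cases from the slide (i is an ancestor of j, or not) but evaluates them with memoized recursion rather than the slide's loop order, so that every sub-result is available when needed. The base cases (a node with itself, and a parent with its child, are always interconnected), the parent-array representation and the example tree are assumptions of this sketch, and the simple ancestor walk means it does not achieve the O(n^2) bound.

```python
from functools import lru_cache

# Assumed example tree (parent[i] is the parent of node i, -1 for the root).
parent = [-1, 0, 1, 1, 3, 1, 5, 1, 0, 8, 8]
labels = ["bookinfo", "book", "title", "author", "name",
          "author", "name", "price", "book", "title", "price"]


def interconnection_table(parent, labels):
    """Return the set of interconnected pairs (i, j) with i < j."""

    def is_ancestor(a, b):
        # True if a is a proper ancestor of b.
        b = parent[b]
        while b != -1:
            if b == a:
                return True
            b = parent[b]
        return False

    def child_towards(a, b):
        # The child of a that lies on the path down to its descendant b.
        while parent[b] != a:
            b = parent[b]
        return b

    @lru_cache(maxsize=None)
    def connected(a, b):
        if a == b or parent[a] == b or parent[b] == a:
            return True  # assumed base cases
        if is_ancestor(b, a):
            a, b = b, a
        if is_ancestor(a, b):
            a_child, b_father = child_towards(a, b), parent[b]
            return (connected(a_child, b) and connected(a, b_father)
                    and labels[a_child] != labels[b]
                    and labels[a] != labels[b_father])
        a_father, b_father = parent[a], parent[b]
        return (connected(a, b_father) and connected(a_father, b)
                and labels[a] != labels[b_father]
                and labels[a_father] != labels[b])

    size = len(parent)
    return {(i, j) for i in range(size) for j in range(i + 1, size)
            if connected(i, j)}


table = interconnection_table(parent, labels)
```

On the assumed tree, the pair (2, 7), a title and a price of the same book, is in the table, while (2, 10) and (4, 6) are not.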

Demo