
Searching in an XML Corpus Using Content and Structure
INEX 2003, Germany
Yiftah Ben-Aharon, Sara Cohen, Yael Grumbach, Yaron Kanza, Jonathan Mamou, Yehoshua Sagiv, Benjamin Sznajder, Efrat Twito
The Hebrew University of Jerusalem

Approach
IR techniques were extended to the context of an XML corpus:
– The granularity of retrieval is refined: fragments of a document (and not necessarily the whole document) are considered as potential results
– The additional information provided by the structure of the document, and of the query, is exploited when retrieving results

Approach (cont'd)
An extensible system was built
– E.g., new ranking techniques can be added easily
The system was implemented in a short time
– E.g., topics are translated into XSL stylesheets
Programming language: Java
Operating system: Windows XP

Topic
Only the title of the topic, denoted T, is used for retrieval. We denote:
– T+: the list of terms in T that are preceded by a + sign
– T-: the list of terms in T that are preceded by a - sign
– To: the list of optional terms
We have implemented our retrieval system only for CO and SCAS topics (not VCAS)
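The T+/T-/To split of a topic title can be illustrated with a small sketch (the function name and the simple whitespace tokenization are assumptions for illustration, not the system's actual parser):

```python
def parse_topic_title(title):
    """Split a topic title into required (T+), forbidden (T-),
    and optional (To) term lists, based on +/- prefixes.
    Illustrative sketch; the real INEX topic format is richer."""
    t_plus, t_minus, t_opt = [], [], []
    for token in title.split():
        if token.startswith('+'):
            t_plus.append(token[1:])
        elif token.startswith('-'):
            t_minus.append(token[1:])
        else:
            t_opt.append(token)
    return t_plus, t_minus, t_opt

# parse_topic_title('+XML retrieval -database')
# -> (['XML'], ['database'], ['retrieval'])
```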

[System architecture diagram: a Topic enters the Processor; the Indexer builds Indices over the IEEE Digital Library; the Filter selects relevant documents; the Extractor extracts relevant fragments; Rankers 1–5 augment the fragments with ranking scores; the Merger combines them into the Result.]

Preprocess
XML documents and topics are preprocessed:
– Terms are stemmed (using the Porter stemmer)
– Stopwords are eliminated
Indices are built
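A minimal sketch of this preprocessing step. The crude suffix stripper below merely stands in for the real Porter stemmer, and the stopword list is illustrative:

```python
import re

# Tiny illustrative stopword list (real systems use a few hundred words)
STOPWORDS = {'a', 'an', 'the', 'of', 'in', 'and', 'is', 'to'}

def simple_stem(term):
    """Crude suffix stripping; a stand-in for the Porter stemmer."""
    for suffix in ('ing', 'ed', 's'):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[:-len(suffix)]
    return term

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, stem the rest."""
    tokens = re.findall(r'[a-z0-9]+', text.lower())
    return [simple_stem(t) for t in tokens if t not in STOPWORDS]

# preprocess('Indexing the XML documents') -> ['index', 'xml', 'document']
```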


Index
Inverted Keyword Index
– Associates each term with the list of ids of the documents containing it
Keyword-Distance Index
– Stores information about the distance between two terms over all the sentences in all the documents of the corpus

Index
Tag Index
– Associates a weight with each tag, according to the "importance" of its content
– E.g., the information provided by the front matter is more important than the information provided by a subsection
Inverse Document Frequency Index
– Associates each term with its IDF, classical in IR
– IDF is computed from the fraction of documents in the corpus containing the term; the smaller that fraction, the higher the IDF
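The inverted keyword index and the IDF index can be sketched together. This is illustrative only: the function name and the log-based IDF formula are assumptions, since the slides do not give the exact formula used:

```python
import math
from collections import defaultdict

def build_indices(docs):
    """Build an inverted keyword index (term -> set of doc ids)
    and an IDF table (term -> log(N / df)) from {doc_id: [terms]}."""
    inverted = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            inverted[term].add(doc_id)
    n = len(docs)
    idf = {term: math.log(n / len(ids)) for term, ids in inverted.items()}
    return inverted, idf

docs = {1: ['xml', 'search'], 2: ['xml', 'index'], 3: ['vector', 'model']}
inverted, idf = build_indices(docs)
# inverted['xml'] == {1, 2}; 'vector' is rarer, so idf['vector'] > idf['xml']
```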


Filter
Documents that do not contain all the terms of T+ are considered irrelevant
Documents containing all the terms of T+ are extracted from the corpus
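With an inverted index, this filtering step reduces to a set intersection over the posting sets of the T+ terms; a minimal sketch (names are hypothetical):

```python
def filter_documents(inverted, t_plus):
    """Keep only documents that contain every term of T+;
    with an inverted index this is a set intersection."""
    if not t_plus:
        # No required terms: every indexed document passes
        return set().union(*inverted.values()) if inverted else set()
    result = set(inverted.get(t_plus[0], set()))
    for term in t_plus[1:]:
        result &= inverted.get(term, set())
    return result

inverted = {'xml': {1, 2, 4}, 'search': {2, 3, 4}, 'java': {3}}
# filter_documents(inverted, ['xml', 'search']) -> {2, 4}
```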


Relevant Fragments
Relevant fragments are extracted from each document that passed the filtering:
– CAS: determined by the topic title
– CO: the system determines potentially relevant fragments, namely the whole document, the front matter, the abstract, any section, and any subsection

Extracting Relevant Fragments from a Document
An XPath processor is not suitable, since the syntax of CAS topics is more general than that of XPath
The relevant fragments are extracted by means of an XSL stylesheet that is generated from T
– For CAS topics, the stylesheet also checks that the returned fragments satisfy the predicates of the title
The translator of topics to XSL stylesheets is fast


An Overview of the Ranking Process
n different rankers give scores based on the structure and the content of the fragments
– In our implementation, n = 5
– For some rankers, the weights of tags are incorporated into the score
– Each ranker gives scores to all the fragments returned by the extractor
– For each result, the scores of all the relevant fragments are aggregated

Word-Number Ranker
This ranker counts the number of terms from T- and To that appear in the fragment
The score
– increases with the number of terms from To
– decreases with the number of terms from T-
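A minimal sketch of such a counting ranker (the function name and signature are hypothetical):

```python
def word_number_score(fragment_terms, t_opt, t_minus):
    """Score = (# occurrences of optional terms)
             - (# occurrences of forbidden terms).
    Illustrative sketch of the word-number ranker."""
    opt, neg = set(t_opt), set(t_minus)
    plus = sum(1 for t in fragment_terms if t in opt)
    minus = sum(1 for t in fragment_terms if t in neg)
    return plus - minus

# word_number_score(['xml', 'search', 'database', 'xml'],
#                   ['xml', 'search'], ['database'])  -> 2
```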

IDF Ranker
We measure the "rarity" of a term using the classical IDF formula
The score
– increases with the number of rare terms from To
– decreases with the number of rare terms from T-
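The same idea as the word-number ranker, but each matching term contributes its IDF, so rare terms weigh more (a sketch; the slides do not give the exact aggregation):

```python
def idf_score(fragment_terms, t_opt, t_minus, idf):
    """Each optional term adds its IDF to the score; each
    forbidden term subtracts its IDF. Rare terms count for more."""
    opt, neg = set(t_opt), set(t_minus)
    score = 0.0
    for t in fragment_terms:
        if t in opt:
            score += idf.get(t, 0.0)
        elif t in neg:
            score -= idf.get(t, 0.0)
    return score
```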

TFIDF Ranker
This ranker extends the Vector Space Model to XML documents
TF counts the number of occurrences of a term in the fragment (and not in the whole document)
– Each occurrence is multiplied by the weight of its tag
TFIDF = TF * IDF
The score of a fragment is computed by
– adding the TFIDF of terms from T+ and To
– subtracting the TFIDF of terms from T-
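One way to sketch the tag-weighted TFIDF computation. Representing a fragment as a list of `(term, tag)` pairs is an assumption for illustration:

```python
def tfidf_score(fragment, tag_weights, idf, t_plus, t_opt, t_minus):
    """fragment: list of (term, enclosing_tag) pairs.
    Each occurrence is weighted by its tag, then multiplied by
    the term's IDF; T+ and To terms add, T- terms subtract."""
    wanted = set(t_plus) | set(t_opt)
    forbidden = set(t_minus)
    score = 0.0
    for term, tag in fragment:
        w = tag_weights.get(tag, 1.0) * idf.get(term, 0.0)
        if term in wanted:
            score += w
        elif term in forbidden:
            score -= w
    return score

# Front matter ('fm') weighted higher than a section ('sec'):
score = tfidf_score([('xml', 'fm'), ('xml', 'sec'), ('db', 'sec')],
                    {'fm': 2.0, 'sec': 1.0}, {'xml': 0.5, 'db': 1.0},
                    ['xml'], [], ['db'])
# score == 2.0*0.5 + 1.0*0.5 - 1.0*1.0 == 0.5
```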

Proximity Ranker
This ranker is based on the correlation between pairs of words from T+ and To that appear in a single phrase within a sliding window of 5 terms
Such a pair is called a lexical affinity (LA)
The score of a fragment is computed by counting the number of LAs
The score is increased when an LA appears under "important" tags
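A sketch of counting lexical affinities with a 5-term sliding window (the tag-weight boost is omitted for brevity; names are hypothetical):

```python
def count_lexical_affinities(terms, query_terms, window=5):
    """Count pairs of query terms that co-occur within a sliding
    window of `window` terms (each such pair is one LA)."""
    qset = set(query_terms)
    count = 0
    for i, t in enumerate(terms):
        if t not in qset:
            continue
        # Look at the next window-1 terms, i.e. a window of 5 total
        for u in terms[i + 1:i + window]:
            if u in qset and u != t:
                count += 1
    return count

# count_lexical_affinities(
#     ['xml', 'fast', 'search', 'a', 'b', 'c', 'xml'],
#     ['xml', 'search'])  -> 2
```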

Similarity Ranker
Idea: if two terms appear frequently in the same sentence in the corpus, they should be considered related
This is a sort of blind query refinement
The score of a fragment is based on
– the distance between the terms of the query and the terms of the fragment
– and increases when such a pair appears under "important" tags
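The sentence-co-occurrence signal behind this ranker can be sketched as follows. This is a crude illustration only; the actual keyword-distance index also records distances, not just co-occurrence counts:

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count how often each unordered term pair shares a sentence,
    as a crude relatedness signal (the blind query-refinement idea)."""
    counts = defaultdict(int)
    for sent in sentences:
        for a, b in combinations(sorted(set(sent)), 2):
            counts[(a, b)] += 1
    return counts

sents = [['xml', 'query'], ['xml', 'query', 'index'], ['java']]
# cooccurrence_counts(sents)[('query', 'xml')] == 2
```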


Merger
The scores of the various rankers are merged into a single rank
The main problem is how to determine the relative weight of each ranker
The scores of the 5 rankers are combined lexicographically as follows
– An order among the rankers is determined
– A tuple of the 5 scores is created for each result
– The tuples are lexicographically sorted
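The lexicographic merge falls out directly from tuple comparison; a minimal sketch (names and the score layout are hypothetical):

```python
def merge_rankings(results, ranker_order):
    """results: {fragment_id: {ranker_name: score}}.
    Build one score tuple per fragment in the chosen ranker order,
    then sort fragments lexicographically, best first."""
    def key(frag_id):
        return tuple(results[frag_id][r] for r in ranker_order)
    return sorted(results, key=key, reverse=True)

results = {
    'f1': {'wordnum': 2, 'idf': 0.9},
    'f2': {'wordnum': 2, 'idf': 1.4},
    'f3': {'wordnum': 1, 'idf': 3.0},
}
# The first ranker dominates; later rankers only break ties:
# merge_rankings(results, ['wordnum', 'idf']) -> ['f2', 'f1', 'f3']
```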

Merger (cont'd)
Our submitted results use different orderings of the rankers, e.g.,
– Word-Number
– IDF
– Similarity
– Proximity
– TFIDF

Conclusion
Our system builds and uses indices
It combines different rankers
The rankers use both the content and the structure
The system is extensible
– The implementation uses configuration files
– New rankers can be added easily
– The system can be easily adapted to changes in the formal syntax of queries

Future Work
We still need to experiment thoroughly with the system
– Modify the merger to use a single formula that combines the scores of the different rankers
– Determine the relative weight of each ranker
– Add and modify rankers

Thank You. Questions?