JYVÄSKYLÄN YLIOPISTO / UNIVERSITY OF JYVÄSKYLÄ
ExtMiner: Combining Multiple Ranking and Clustering Algorithms for Structured Document Retrieval

Miika Nurminen, Anne Honkaranta, Tommi Kärkkäinen
Faculty of Information Technology, University of Jyväskylä, Finland

Motivation
Organizations face an overwhelming amount of digital information
  – New ways of retrieving, filtering and managing information are needed
People find it difficult to express their information needs as index terms and keywords
  – Even when they do, the retrieved sets of documents do not necessarily match the information needs
Heterogeneous document collections cannot be searched adequately with index terms alone
  – (e.g. plain text vs. HTML vs. Word DOC vs. general XML)
Potential solutions: integrating text mining techniques, providing different views of documents, taking document structure into account

Related work
Extended Vector Model (Fox et al., 1988) combines various document features (such as index terms and links) in ranking
Scatter/Gather (Cutting et al., 1992) introduced a continuous search process based on clustering
LightHouse (Leuski & Allan, 2000) featured tight integration between a ranked list and a visualization of clusters
Crouch et al. (2003) applied the extended vector model to XML retrieval
MSEEC (Hannappel et al., 1999) presented an architecture for combining multiple clustering algorithms
Ben-Aharon et al. (2003) combined various rankers for content- and structure-based XML search

Our approach: ExtMiner
A platform and a proof of concept for combining
  – Different document features (e.g. text, structure, links, metadata)
  – Ranking algorithms (e.g. cosine measure, PageRank)
  – Clustering algorithms (e.g. DBSCAN, hierarchical clustering)
  – Visualization algorithms (e.g. FastMap projection)
Integrates many of the features previously implemented in separate systems
Continuous search process based on ranked lists and a cluster model

[Figure: ExtMiner architecture]

ExtMiner architecture (decomposed)
Three layers: UI, application logic and document index
The document index consists of similarity matrices and a field-based term/link index
Application logic includes pluggable ranking, clustering and visualization algorithms and an extensible mechanism for creating indexes from various document repositories
The UI provides customizable views for documents, the ranked search result list and the cluster model tree
Implemented in Java and published as open source. Third-party open source components (e.g. Jakarta Lucene, JOpenChart) are used.

Conversion approaches
[Figure: conversion paths among source formats (RTF, HTML, PDF, OpenOffice, LaTeX, Word DOC, DB, TXT), converters (tidy, writer2html, pdftohtml, tex4ht, custom converter), XML formats (XHTML, DocBook, TEI, …) and ExtMiner]

Indexing and configuration
Documents must be available in a local filesystem
Stemming, stopword removal and tf*idf weighting are performed by Lucene
Digester handles rule-based XML parsing
Documents are represented as a field-based index (e.g. tuples of vectors)
Fields can be index terms, links, headers, or document type-specific external metadata or structural information encoded as vectors
Document-to-document similarities are precalculated for clustering
Different index formers and field definitions can be used, depending on document type and application domain
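The field-based index described on this slide can be illustrated with a minimal Python sketch. It is not ExtMiner's Java code and does not reproduce Lucene's scoring: the function names, the sample documents and the plain `tf * log(N / df)` weighting are assumptions made for illustration only, and stemming and stopword removal are left out.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Return one sparse tf*idf vector (dict: term -> weight) per document.

    Assumption: plain tf * log(N / df) weighting, not Lucene's formula.
    """
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def build_field_index(docs, fields):
    """Represent every document as a tuple of vectors, one per field.

    docs: list of dicts mapping a field name to a list of tokens.
    """
    per_field = [tfidf_vectors([d.get(f, []) for d in docs]) for f in fields]
    return [tuple(column[i] for column in per_field) for i in range(len(docs))]


# Hypothetical two-document collection with two fields.
docs = [
    {"terms": ["xml", "index"], "headers": ["intro"]},
    {"terms": ["xml", "cluster"], "headers": ["intro"]},
]
index = build_field_index(docs, ["terms", "headers"])
```

Each entry of `index` is a tuple with one sparse vector per field. A term that occurs in every document (here `xml`) gets weight zero under this weighting, which is one reason real systems use smoothed variants.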

Searching and clustering
The extended vector model is applied to similarity calculation in both ranking and clustering
Let d be a document and q a query, both represented as tuples of n vectors (fields). The relevance estimate R is calculated as the weighted sum
  R(d, q) = \sum_{k=1}^{n} w_k \, \mathrm{sim}(r_k(d), r_k(q))
where r_k denotes the restriction that extracts the k-th vector from the tuple, sim is the similarity measure (such as boolean matching, cosine measure or co-citation) and w_k denotes a field-specific weight supplied by the user (or weighted evenly by default)
Substituting another document for q gives a document-to-document similarity measure for clustering
Any metric clustering algorithm can be used, provided that an implementation is available
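The relevance estimate can be sketched in Python as follows, assuming it is a weighted sum of field-wise similarities as in the extended vector model. The function names, the sparse-dict vector representation and the choice of equal weights summing to one as the default are assumptions of this sketch, not details confirmed by the slides.

```python
import math


def cosine(u, v):
    """Cosine measure between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relevance(d, q, weights=None, sims=None):
    """Weighted sum of field-wise similarities between tuples d and q.

    weights: one weight per field; assumed default is 1/n for each field.
    sims: one similarity function per field; assumed default is cosine.
    """
    n = len(d)
    if weights is None:
        weights = [1.0 / n] * n
    if sims is None:
        sims = [cosine] * n
    return sum(w * sim(dk, qk) for w, sim, dk, qk in zip(weights, sims, d, q))


# Hypothetical document and query with two fields: index terms and headers.
doc = ({"xml": 1.0, "index": 1.0}, {"intro": 1.0})
query = ({"xml": 1.0}, {})
score = relevance(doc, query)
```

Passing a second document instead of `query` yields the document-to-document similarity used for clustering, and a different function per field (for example a boolean match for metadata) can be supplied through `sims`.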

User interface and visualization
Iterative search and clustering process
  – Search and clustering can be performed iteratively and focused on an appropriate subset of the collection
Interactive cluster model
  – The user can select documents from any of the views provided by the application: ranked list, cluster tree or visual projection. The cluster tree is interactive: a cluster can be marked as noise, or subclusters of a single cluster can be merged (useful with hierarchical clustering)
Simultaneous views for lists and clusters
  – Both views are needed since lists and clusters support different search objectives. Clusters are easy to understand and help to cope with ambiguous terms, although they do not improve search quality as such.
Any MDS (multidimensional scaling) style projection algorithm can be used for visualization (currently FastMap)
Documents can be opened in a web browser or a custom viewer (e.g. text editor, XML tree view)
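To give an idea of how a FastMap-style projection turns pairwise distances into plot coordinates, here is a compact Python sketch of the basic FastMap procedure operating on a precomputed distance matrix. It is an illustration only, not ExtMiner's implementation. The simple pivot heuristic and the assumption that distances are derived from similarities (for example as 1 - similarity) are choices made for this sketch.

```python
import math


def fastmap(dist, k=2):
    """Project n objects to k dimensions from a symmetric distance matrix.

    dist: n x n list of lists with dist[i][i] == 0.
    Returns a list of n coordinate lists, each of length k.
    """
    n = len(dist)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, axis):
        # Squared distance that remains after the first `axis` coordinates
        # have been accounted for (clamped at zero for non-Euclidean input).
        rest = dist[i][j] ** 2 - sum(
            (coords[i][c] - coords[j][c]) ** 2 for c in range(axis)
        )
        return max(rest, 0.0)

    for axis in range(k):
        # Pivot heuristic: start anywhere, jump to the farthest object,
        # then to the object farthest from that one.
        a = 0
        b = max(range(n), key=lambda j: d2(a, j, axis))
        a = max(range(n), key=lambda j: d2(b, j, axis))
        dab2 = d2(a, b, axis)
        if dab2 == 0.0:
            break  # nothing left to project; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine-law projection of object i onto the line through a and b.
            coords[i][axis] = (d2(a, i, axis) + dab2 - d2(b, i, axis)) / (2 * dab)
    return coords


# Three points lying on a line at positions 0, 3 and 7.
line = [[0, 3, 7], [3, 0, 4], [7, 4, 0]]
projected = fastmap(line, k=1)
```

For these collinear points the single recovered coordinate reproduces the original positions, which is a quick sanity check that the projection preserves distances when they are exactly embeddable.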

Case 1: Course essays
The Introduction to Software Engineering course was held in Fall 2004 at the University of Jyväskylä
Each student was assigned 13 essays, one for each lecture. Over 200 students signed up for the course, producing over 1000 essays in the end.
ExtMiner was used for checking and comparing the essays
Fields used: index terms and headers were extracted directly from the documents. Author(s), major subject(s) and lecture number were provided as metadata.
The lecturer could retrieve essays from the collection using any of the fields as a search key
Clustering allowed the clusters pertaining to a particular lecture or subject matter to be examined in relation to each other

[Figure: Case 1: Course essays]

Case 2: KnowPap
KnowPap is an e-learning application for paper production technologies, containing a collection of HTML documents, pictures, video clips and other educational material
A subset of 300 documents was imported from the KnowPap web site and indexed in ExtMiner
Index terms, headers and links to multimedia material (including target type) were extracted from the HTML files
With the multimedia link index, ExtMiner was used as a proof-of-concept interface for browsing a simple multimedia "database". The user could retrieve web pages or multimedia material directly, depending on the query.
Paper technology trainers could use ExtMiner as a tool for organizing, browsing and retrieving training material components for new training content

[Figure: Case 2: KnowPap]

Case 3: References collection
ExtMiner was used for organizing the collection of references for the thesis of one of the authors (title in English: Data Mining from Structured Documents)
The collection consisted of 145 HTML and PDF documents; the PDF documents were converted to HTML as well. Documents were preprocessed and converted to XML with HTML Tidy.
Over 50% of the documents did not pass the preprocessing stage (malformed HTML, PDF files that were essentially scanned pictures, etc.), resulting in 69 indexable documents
Only the index term and header fields were used
Documents were clustered with both DBSCAN and group average hierarchical clustering, resulting in roughly similar cluster models with comparable subject areas

Case 3: DBSCAN results
5 subject areas + 24 "noise" documents:
  – Generic XML cluster
  – "Main" cluster (IR and document clustering articles)
  – LSI cluster
  – Data mining cluster
  – XML indexing cluster
DBSCAN parameters were adjusted manually
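The two parameters that typically have to be tuned by hand in DBSCAN are the neighbourhood radius and the minimum number of neighbours. The following Python sketch of textbook DBSCAN on a precomputed distance matrix shows where they enter and how "noise" documents arise. It is not ExtMiner's code, and the names `eps` and `min_pts` as well as the toy data are assumptions for illustration.

```python
def dbscan(dist, eps, min_pts):
    """Cluster objects given a symmetric distance matrix.

    Returns one label per object: 0, 1, ... for clusters, -1 for noise.
    A point counts itself among its neighbours, as in the usual definition.
    """
    n = len(dist)
    unvisited, noise = None, -1
    labels = [unvisited] * n

    def neighbours(i):
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not unvisited:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Toy one-dimensional data: two dense groups and one isolated point.
points = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(p - q) for q in points] for p in points]
labels = dbscan(dist, eps=1.5, min_pts=2)
```

With these settings the isolated point ends up labelled as noise. Raising `eps` or lowering `min_pts` absorbs more points into clusters, which is the kind of trade-off behind manual parameter adjustment.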

Case 3: Group average results
2 new subject areas, one dropped, no "noise"
  – Link cluster
  – "General" nontechnical articles (classified as noise by DBSCAN)
  – No LSI cluster
Hierarchical tree pruning was done manually

Further research
ExtMiner shows potential to become a supporting tool for information management in SMEs or organizational workgroups
Can be used as a platform for further IR or data mining research
The user interface needs further development; it is currently not suitable for novice users
Use of ExtMiner requires manual work for preprocessing heterogeneous source documents
The system should be enhanced with validation functionality for evaluating search and clustering quality with standard test collections
Manual selection of clustering parameters, hierarchical tree pruning or field weights requires expertise
Clustering performance was not adequate with large (>1000) document collections because of the O(n^2) time complexity of computing document-to-document similarities
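The quadratic growth mentioned in the last point can be made concrete with a short Python calculation. It assumes a symmetric similarity measure, so that each unordered pair of documents is compared once; the function name is hypothetical.

```python
def similarity_pairs(n):
    """Number of unordered document pairs in a collection of n documents."""
    return n * (n - 1) // 2


pairs_1000 = similarity_pairs(1000)  # 499500 comparisons
pairs_2000 = similarity_pairs(2000)  # 1999000 comparisons
```

Doubling the collection from 1000 to 2000 documents roughly quadruples the number of similarity computations, and each comparison itself involves every field of the two documents.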

Thank you!