INEX 2009 XML Mining Track James Reed Jonathan McElroy Brian Clevenger.



Introduction
INEX is an initiative that investigates XML retrieval.
The clustering task draws on the fields of Information Retrieval, Data Mining, Machine Learning, and XML.
Goal: to measure how well clustering methods work for retrieving collections from large sets of documents, and to measure performance specifically for XML IR.

Problem
Task: to test the Jardine hypothesis (the cluster hypothesis of Jardine and van Rijsbergen), which states: “documents that cluster together have a similar relevance to a given query.”
If the hypothesis holds, only a small fraction of clusters needs to be searched, which increases the throughput of an IR system.
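The throughput argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the track's evaluation code: documents and queries are assumed to be sparse term-count dictionaries, clusters are ranked by the cosine similarity of their summed term counts to the query, and the names `cosine`, `centroid`, `cluster_search` and the toy data are all hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(docs):
    """Sum the term counts of a cluster's documents (unnormalised centroid)."""
    total = Counter()
    for doc in docs:
        total.update(doc)
    return total

def cluster_search(query, clusters, top_clusters=1):
    """Score only the documents in the clusters closest to the query."""
    ranked = sorted(
        clusters,
        key=lambda name: cosine(query, centroid(clusters[name].values())),
        reverse=True,
    )
    hits = []
    for name in ranked[:top_clusters]:
        for doc_id, vec in clusters[name].items():
            hits.append((cosine(query, vec), doc_id))
    return sorted(hits, reverse=True)

# Hypothetical toy collection: two clusters of two documents each.
clusters = {
    "jewellery": {"d1": {"bracelet": 2, "gold": 1},
                  "d2": {"bracelet": 1, "silver": 2}},
    "software": {"d3": {"depend": 3, "library": 1},
                 "d4": {"library": 2, "build": 1}},
}
results = cluster_search({"bracelet": 1}, clusters)
```

With `top_clusters=1`, only the two documents in the best-matching cluster are scored against the query. If the hypothesis holds, the relevant documents are concentrated in that cluster and the rest of the collection can be skipped.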

Data
Wikipedia is the source: 60 gigabytes, with about 2.7 million documents in XML format.
Both the complete meta-data and subsets of it are provided.

Data Files
Tags and trees: :... :
Links:...
Entities: :... :
Bag-of-Words (BOW...Wow!):
–BOW File: :... :
–Term Index File:
1472,bracelet
547,depend
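As a minimal sketch, the term index entries shown above could be loaded as follows. It assumes that each line has the form `id,term`, with a numeric term id before the first comma; the function name `parse_term_index` is hypothetical, and the formats of the other files are elided in the slide, so they are not parsed here.

```python
def parse_term_index(lines):
    """Map integer term ids to terms, assuming one 'id,term' pair per line."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first comma only, in case a term itself contains one.
        term_id, term = line.split(",", 1)
        index[int(term_id)] = term
    return index

# The two sample entries from the slide.
term_index = parse_term_index(["1472,bracelet", "547,depend"])
```

In practice the same function can be called with an open file object, since iterating over a file yields its lines.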

Solution: A Two-Pronged Approach
First prong:
–Analyze links to discover maximum-flow communities
–Use the Ford-Fulkerson algorithm
Second prong:
–Use information from the BOW and entity files to develop similarity measures between documents within clusters
–Attempt to refine the clusters and develop better ones
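The first prong can be illustrated with a small sketch. It is not the authors' implementation: it assumes that each link is an undirected edge of capacity 1, that the caller picks a source and a sink document, and that the community is the set of nodes still reachable from the source once no augmenting path remains (the source side of a minimum cut). Augmenting paths are found by breadth-first search, which is the Edmonds-Karp variant of Ford-Fulkerson. The function name `max_flow_community` and the toy graph are hypothetical.

```python
from collections import defaultdict, deque

def max_flow_community(edges, source, sink):
    """Return (max flow value, nodes on the source side of a minimum cut)."""
    # Residual capacities; each undirected link gets capacity 1 both ways.
    residual = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        residual[u][v] += 1
        residual[v][u] += 1

    def bfs_parents():
        """Breadth-first search over edges with spare residual capacity."""
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    flow = 0
    while True:
        parents = bfs_parents()
        if sink not in parents:
            # No augmenting path left: reachable nodes form the community.
            return flow, set(parents)
        # Find the bottleneck capacity along the path from sink to source.
        bottleneck = float("inf")
        v = sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        # Push the bottleneck flow along the path and update the residuals.
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Hypothetical link graph: two triangles joined by a single bridge c-d.
links = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
flow, community = max_flow_community(links, "a", "f")
```

In this toy graph the only link between the two triangles is the bridge, so the maximum flow is 1 and the community around `a` is the first triangle.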