Tantan Liu, Fan Wang, Gagan Agrawal The Ohio State University

Presentation transcript:

Applying Data Mining Techniques for Schema Matching across Biological Deep Web Data Sources
Tantan Liu, Fan Wang, Gagan Agrawal
The Ohio State University

Motivation
- Scientific deep web data sources
  - Query interface: submitting queries
    - Input schema: describes the input attributes
  - Output web page: query results
    - Output schema: describes the output attributes
- Inter-dependence
  - A data source provides input for another data source
- Matching input and output attributes across multiple data sources
  - Exploring inter-dependence between data sources
  - Automatically generating query plans for users' queries
  - Integrating multiple data sources

Model for Schemas
- Input schema
  - Input attributes: label and instances
- Output schema
  - Hierarchical model
  - Siblings: related attributes in a table or a separate block
  - Output attributes: label, instances, parent and siblings
- Similarity function
  - Similarity between corresponding properties
  - Linguistic similarity is used to compute the similarity between strings

System Design
- Aim: identifying semantic matchings between input attributes and output attributes across multiple data sources
- Approaches
  - Discovering instances for input attributes
    - Help web pages: query interfaces and their linked web pages
    - Output web pages of other data sources
  - Schema matching via clustering

Schema Matching
- Discover semantic correspondences between attributes
- Matching attributes are grouped together
- Hierarchical clustering
  - The similarity between attributes is calculated
  - At each step, the groups with the largest similarity are merged into one group
- Input and output attributes in the same group indicate inter-dependence between their data sources
- Bridge effect: two attributes are similar if they are both similar to a third attribute

Discovering Instances for Input Attributes
- From output web pages
  - Discovering instances from output web pages
  - Iteratively borrowing instances from related output attributes
  - More output attributes and instances provide more instances for input attributes
- From help web pages
  - Potential help pages are the web pages linked from the query interface
  - Identifying help web pages by the anchor text of links, e.g. ‘search hint’
  - Locating potential instances by meaningful keywords, e.g. ‘for instance’
  - Discovering potential domain-specific instances, which are less frequently used in other domains
  - Validating potential instances through the query interface

Conclusion
- We applied a series of data mining techniques for schema matching.
- We show that instances for deep web data sources can be discovered from the query interfaces themselves.
- We also show that instances can be obtained from the output pages of related data sources.
- Our approach has been effective on a number of biological data sources.

Experiment Results
- Data sources
  - 11 data sources with 24 query interfaces
  - The data sources provide SNP, gene, protein and related information
- Results reported (figures not included in this transcript)
  - Instances discovered from the interface
  - Accuracy on different subsets
  - Impact of the size of input instance sets
  - Accuracy of all types of schema matching

Contact
Tantan Liu: liut@cse.ohio-state.edu
Fan Wang: wangfa@cse.ohio-state.edu
Gagan Agrawal: agrawal@cse.ohio-state.edu
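The Schema Matching step (compute attribute similarities, then repeatedly merge the most similar groups) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration and not taken from the presentation: `difflib.SequenceMatcher` stands in for the linguistic similarity, instance sets are compared with Jaccard overlap, the two scores are mixed with equal weights, groups are compared with single linkage, and merging stops at a threshold of 0.5. The attribute names and instances are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations


def label_similarity(a, b):
    """String similarity in [0, 1]; a stand-in for linguistic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def instance_similarity(xs, ys):
    """Jaccard overlap of two instance sets (0.0 when both are empty)."""
    xs, ys = set(xs), set(ys)
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0


def attribute_similarity(a, b, w_label=0.5):
    """Weighted mix of label and instance similarity (the weight is assumed)."""
    return (w_label * label_similarity(a["label"], b["label"])
            + (1 - w_label) * instance_similarity(a["instances"], b["instances"]))


def cluster_attributes(attrs, threshold=0.5):
    """Agglomerative clustering: repeatedly merge the two most similar groups."""
    groups = [[i] for i in range(len(attrs))]

    def group_sim(g, h):
        # Single linkage: the best similarity over any cross-group pair.
        return max(attribute_similarity(attrs[i], attrs[j]) for i in g for j in h)

    while len(groups) > 1:
        (p, q), best = max(
            (((p, q), group_sim(groups[p], groups[q]))
             for p, q in combinations(range(len(groups)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # no remaining pair of groups is similar enough to merge
        groups[p] = groups[p] + groups[q]  # p < q, so deleting q is safe
        del groups[q]
    return [[attrs[i]["name"] for i in g] for g in groups]


# Hypothetical attributes: one input attribute and two output attributes.
attrs = [
    {"name": "A.in.gene", "label": "Gene Name", "instances": {"BRCA1", "TP53"}},
    {"name": "B.out.gene", "label": "Gene Symbol", "instances": {"BRCA1", "TP53", "EGFR"}},
    {"name": "C.out.snp", "label": "SNP ID", "instances": {"rs334", "rs429358"}},
]
groups = cluster_attributes(attrs)
print(groups)
```

In this toy data the input attribute of source A and the output attribute of source B end up in the same group, which is the situation the poster reads as an inter-dependence between the two sources. Single linkage is chosen here because it lets two attributes join a group through a shared neighbour, which loosely mirrors the bridge effect; other linkage rules would behave differently.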
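The help-page route to instance discovery (locate candidates near cue keywords, keep domain-specific ones, validate them through the query interface) can be sketched roughly as follows. This is only an illustration under stated assumptions: the cue phrase list, the small stop list used as a crude proxy for "less frequently used in other domains", and the `probe` callback that stands in for submitting a value to a real query interface are all hypothetical.

```python
import re

# Assumed cue phrases; the presentation only gives 'for instance' as an example.
CUE_PHRASES = ("for instance", "for example", "e.g.", "such as")
# Assumed stop list: a crude proxy for words that are common across domains.
COMMON_WORDS = {"the", "a", "an", "or", "and", "gene", "name", "search"}


def candidate_instances(help_text):
    """Collect tokens that follow a cue phrase, up to the end of the sentence."""
    cue_pattern = "|".join(re.escape(cue) for cue in CUE_PHRASES)
    pattern = re.compile(rf"(?:{cue_pattern})[,:]?\s*([^.;\n]*)", re.IGNORECASE)
    candidates = []
    for match in pattern.finditer(help_text):
        for token in re.findall(r"[A-Za-z0-9_-]+", match.group(1)):
            if token.lower() not in COMMON_WORDS and token not in candidates:
                candidates.append(token)
    return candidates


def validate_instances(candidates, probe):
    """Keep candidates for which the (hypothetical) interface probe succeeds."""
    return [value for value in candidates if probe(value)]


help_text = (
    "Enter a gene symbol, for instance BRCA1 or TP53. "
    "You can also search by SNP, e.g. rs334."
)
found = candidate_instances(help_text)

# Stand-in for querying a real interface: pretend only these values return results.
known_values = {"BRCA1", "rs334"}
validated = validate_instances(found, lambda value: value in known_values)
print(found, validated)
```

In practice the probe would submit each candidate through the query interface and check whether a non-empty result page comes back; the lambda above only imitates that check so the sketch stays self-contained.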