Tantan Liu, Fan Wang, Gagan Agrawal The Ohio State University

Applying Data Mining Techniques for Schema Matching across Biological Deep Web Data Sources

Motivation
  Scientific deep web data sources
    Querying interface: used to submit a query
    Input schema: describes the input attributes
    Output web page: presents the query results
    Output schema: describes the output attributes
  Inter-dependence
    A data source provides input for another data source
  Input-output attribute matching across multiple data sources
    Exploring inter-dependence between data sources
    Automatically generating query plans given users’ queries
    Integrating multiple data sources

Model for Schemas
  Input schema
    Input attributes: label and instances
  Output schema
    Hierarchical model
    Siblings: related attributes in a table or a separate block
    Output attributes: label, instances, parent and siblings

Similarity Function
  Similarity is computed between corresponding properties of two attributes
  Linguistic similarity is used to compute the similarity between strings

System Design
  Aim: identifying semantic matches between input attributes and output attributes across multiple data sources
  Approaches
    Discovering instances for input attributes
      Help web pages: querying interfaces and their linked web pages
      Output web pages of other data sources
    Schema matching via clustering

Schema Matching
  Discover semantic correspondences between attributes
  Mapping attributes are grouped together
  Hierarchical clustering
    The similarity between attributes is calculated
    At each step, the groups with the largest similarity are merged into one group
  Input and output attributes in the same group indicate inter-dependence between their data sources
  Bridge effect: two attributes are similar if they are both similar to a third attribute
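The hierarchical clustering step described under Schema Matching can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Attribute` class, the equal weighting of label and instance similarity, the use of `difflib.SequenceMatcher` as a stand-in for linguistic similarity, single-linkage merging, and the stopping `threshold` are all assumptions introduced here.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class Attribute:
    """Hypothetical attribute record: label plus known instances."""
    source: str                      # data source the attribute belongs to
    kind: str                        # "input" or "output"
    label: str
    instances: frozenset = frozenset()


def string_sim(a: str, b: str) -> float:
    # Assumption: character-level ratio stands in for linguistic similarity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def attribute_sim(x: Attribute, y: Attribute, w_label: float = 0.5) -> float:
    # Compare corresponding properties: labels, then instance overlap (Jaccard).
    label_sim = string_sim(x.label, y.label)
    union = x.instances | y.instances
    inst_sim = len(x.instances & y.instances) / len(union) if union else 0.0
    return w_label * label_sim + (1 - w_label) * inst_sim


def group_sim(g1, g2) -> float:
    # Single linkage (assumed): the most similar pair decides, so two
    # attributes can end up together via a shared third attribute.
    return max(attribute_sim(a, b) for a in g1 for b in g2)


def cluster(attrs, threshold=0.5):
    """Repeatedly merge the two most similar groups until none reach threshold."""
    groups = [[a] for a in attrs]
    while len(groups) > 1:
        (i, j), best = max(
            (((i, j), group_sim(groups[i], groups[j]))
             for i, j in combinations(range(len(groups)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        groups[i] = groups[i] + groups[j]   # i < j, so deleting j is safe
        del groups[j]
    return groups


def inter_dependences(groups):
    """(producer, consumer) pairs: an output and an input attribute share a group."""
    deps = set()
    for group in groups:
        for out in group:
            for inp in group:
                if (out.kind == "output" and inp.kind == "input"
                        and out.source != inp.source):
                    deps.add((out.source, inp.source))
    return deps
```

For example, clustering an output attribute labelled "Gene Name" from one source with an input attribute labelled "gene" from another (sharing most instances) places them in one group, and `inter_dependences` then reports that the first source can feed the second. Single linkage was chosen here because it reproduces the bridge effect; average or complete linkage would suppress it.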
Discovering Instances for Input Attributes
  From output web pages
    Iteratively borrowing instances from related output attributes
    More output attributes and instances provide more instances for input attributes
  From help web pages
    Candidate help pages are the web pages linked from the query interface
    Identifying help web pages by the anchor text of links, e.g. ‘search hint’
    Locating potential instances by meaningful keywords, e.g. ‘for instance’
    Discovering potential domain-specific instances, which are less frequently used in other domains
    Validating potential instances through the querying interface

Experiment Results
  Data sources
    11 data sources with 24 query interfaces
    Data sources provide SNP, gene, protein and related information
  Instances discovered from interface
  Accuracy on different subsets
  Impact of size of input instance sets
  Accuracy of all types of schema matching

Conclusion
  We applied a series of data mining techniques for schema matching.
  We show that instances for deep web data sources can be discovered from the query interfaces themselves.
  We also show that instances can be obtained from the output pages of related data sources.
  Our approach has been effective on a number of biological data sources.
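The help-page route for discovering instances (find help links by anchor text, locate candidates after cue phrases, validate through the query interface) might look roughly like the sketch below. It uses only the Python standard library. The hint list `HELP_ANCHOR_HINTS`, the cue-phrase pattern `CUE`, and the `submit_query` callback are hypothetical names introduced for illustration; the slides do not specify how each step is implemented, and the domain-specificity filter is not shown.

```python
import re
from html.parser import HTMLParser

# Assumed keyword lists; the slides only give 'search hint' and 'for instance'.
HELP_ANCHOR_HINTS = ("hint", "help", "example")
CUE = re.compile(r"(?:for instance|for example|e\.g\.)[,:]?\s*([^.;\n]+)",
                 re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from a query interface page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def help_links(interface_html):
    """Return hrefs whose anchor text suggests a help page."""
    parser = LinkCollector()
    parser.feed(interface_html)
    return [href for href, text in parser.links
            if any(hint in text.lower() for hint in HELP_ANCHOR_HINTS)]


def candidate_instances(help_text):
    """Extract tokens that follow a cue phrase such as 'for instance'."""
    found = []
    for match in CUE.finditer(help_text):
        # Split an enumeration like "BRCA1, TP53 or EGFR" into tokens.
        for piece in re.split(r",|\bor\b|\band\b", match.group(1)):
            token = piece.strip(" '\"()")
            if token:
                found.append(token)
    return found


def validate(candidates, submit_query):
    """Keep candidates for which the (hypothetical) query callback succeeds."""
    return [c for c in candidates if submit_query(c)]
```

In practice `submit_query` would wrap a real form submission and decide whether the result page contains data. The pattern stops at the first period, so identifiers that contain dots would need a more careful tokenizer.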
