Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach Bin He Joint work with: Kevin Chen-Chuan Chang, Jiawei Han Univ. Illinois at Urbana-Champaign

Presentation transcript:

Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach Bin He Joint work with: Kevin Chen-Chuan Chang, Jiawei Han Univ. Illinois at Urbana-Champaign

MetaQuerier 2 Context: MetaQuerier
Large-scale integration of the deep Web
[Figure labels: MetaQuerier; Query; Result; The Deep Web]

MetaQuerier 3 Challenge: Matching query interfaces (QIs)
[Figure labels: Book Domain; Music Domain; m:n complex matching; 1:1 simple matching]

MetaQuerier 4 Demo.

MetaQuerier 5 Traditional approaches of schema matching: pairwise attribute correspondence
Pairwise matching:
- S1: author, title, subject, ISBN
- S2: name, title, category, format
Pairwise attribute correspondence:
- S1.author ↔ S2.name
- S1.subject ↔ S2.category
But scale is a challenge: how do we address the challenge of large scale?
And scale is an opportunity: how do we leverage the opportunity of large scale?

MetaQuerier 6 A holistic schema matching paradigm
Input: a set of schemas
- S1: author, title, subject, ISBN
- S2: writer, title, category, format
- S3: name, title, keyword, binding
Output: a semantic model for all attribute matchings
- author = name = writer
- subject = category
- format = binding

MetaQuerier 7 Holistic matching is, in essence, data mining to discover semantics for information integration
[Figure labels: Semantics (semantic correspondences); Observations (attribute occurrences); Hidden Regularities; Generation; Statistical Analysis for Model Discovery; Our Hypothesis; Our Approach]

MetaQuerier 8 Regularity: Co-occurrence patterns
Author = {Last Name, First Name}, where "=" relates synonym attributes and "," relates grouping attributes
[Figure panels: (a) amazon.com, (b), (c) bn.com, (d) 1bookstreet.com]

MetaQuerier 9 Schema matching as correlation mining
Across many sources:
- Synonym attributes are negatively correlated: they are semantic alternatives, and thus rarely co-occur in query interfaces
- Grouping attributes are positively correlated: they are semantic complements, and thus often co-occur in query interfaces

MetaQuerier 10 Data preparation: Prepare schema transactions to be mined
Interface Extraction [SIGMOD'04]
Type Recognition
- Type is not declared in Web interfaces
- Identify types from instance values, e.g., integer, datetime
- Used for constraining merging and matching
Syntactic Merging
- Merge attributes with syntactically similar names, e.g., "title of book" to "title", "author's name" to "author"
- Merge attributes with syntactically similar instance values
[Figure labels: attribute; operator; value]
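The type recognition step can be illustrated with a small sketch. This is not the authors' implementation: the function name `recognize_type`, the two date formats, and the choice to return "any" when an attribute has no instance values are all assumptions made for illustration.

```python
from datetime import datetime

# Assumed date formats; a real system would need a much broader set.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def _is_integer(value):
    try:
        int(value)
        return True
    except ValueError:
        return False


def _is_datetime(value):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def recognize_type(instance_values):
    """Guess an attribute type from its instance values (hypothetical helper)."""
    values = [v.strip() for v in instance_values if v.strip()]
    if not values:
        return "any"  # assumption: nothing to infer from, so leave the type open
    if all(_is_integer(v) for v in values):
        return "integer"
    if all(_is_datetime(v) for v in values):
        return "datetime"
    return "string"


print(recognize_type(["1", "2", "3"]))
print(recognize_type(["2004-08-22", "08/25/2004"]))
print(recognize_type(["hardcover", "paperback"]))
```

The recognized type can then act as a constraint, for example by refusing to merge or match an integer attribute with a datetime attribute.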

MetaQuerier 11 DCM: Dual Correlation Mining framework
1. Positive correlation mining as potential groups
   Mining positive correlations:
   - Last Name (any), First Name (any)
2. Negative correlation mining as potential matchings
   Mining negative correlations:
   - Author (any) = {Last Name (any), First Name (any)}
   - ISBN (any) = {Last Name (any), First Name (any)}
3. Matching selection as model construction
   - Author (any) = {Last Name (any), First Name (any)}
   - Subject (string) = Category (string)
   - Format (string) = Binding (string)

MetaQuerier 12 Correlation measure for qualification
Goal: find groups and matchings that pass the correlation threshold
Observation: pairwise correlations
- e.g., in the Airfares domain, to = arrival city = destination
- to and arrival city are negatively correlated
- to and destination are negatively correlated
- arrival city and destination are negatively correlated
Measure: $C_{\min} = \min m(A_i, A_j)$ for all $i \neq j$, where $m$ is some correlation measure for two items
- supports downward closure, which enables the Apriori algorithm
- accommodates different choices of the measure $m$
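A minimal sketch of the qualification measure, assuming the pairwise scores are already known. The numeric scores below are invented for illustration and are not taken from the experiments; `c_min` and `toy_m` are hypothetical names.

```python
from itertools import combinations


def c_min(attributes, m):
    """C_min: the minimum pairwise correlation m(A_i, A_j) over all i != j."""
    return min(m(a, b) for a, b in combinations(sorted(attributes), 2))


# Hypothetical pairwise negative-correlation scores (illustrative values only).
SCORES = {
    frozenset({"to", "arrival city"}): 0.9,
    frozenset({"to", "destination"}): 0.8,
    frozenset({"arrival city", "destination"}): 0.7,
}


def toy_m(a, b):
    return SCORES[frozenset({a, b})]


triple = {"to", "arrival city", "destination"}
print(c_min(triple, toy_m))

# Downward closure: every subset scores at least as high as the full set,
# so a set can only pass the threshold if all of its subsets pass too.
for pair in combinations(sorted(triple), 2):
    assert c_min(pair, toy_m) >= c_min(triple, toy_m)
```

Because the minimum over a subset's pairs can never be smaller than the minimum over the full set's pairs, pruning a failing pair safely prunes every larger set that contains it.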

MetaQuerier 13 The mining process: a standard Apriori algorithm
Schema transactions:
- Departure City, Destination, …
- From, To, …
- Departure City, Arrival City, …
Correlated items with length 2:
- Destination = To
- Destination = Arrival City
- To = Arrival City
- Departure City = From
- …
Correlated items with length 3:
- Destination = To = Arrival City
- …
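The level-wise search can be sketched as follows. This is a simplified illustration rather than the authors' code: it uses the H-measure from the later slide as the pairwise measure m, treats each schema as a plain set of attribute names, and picks an arbitrary threshold of 0.9. The function names are hypothetical.

```python
from itertools import combinations


def h_measure(schemas, a, b):
    """H = f01 * f10 / (f+1 * f1+), counted over the schema transactions."""
    f11 = sum(1 for s in schemas if a in s and b in s)
    f10 = sum(1 for s in schemas if a in s and b not in s)
    f01 = sum(1 for s in schemas if a not in s and b in s)
    f1p, fp1 = f11 + f10, f11 + f01
    return 0.0 if f1p == 0 or fp1 == 0 else (f01 * f10) / (fp1 * f1p)


def mine_negative_sets(schemas, threshold):
    """Apriori-style search for attribute sets whose C_min passes the threshold."""
    items = sorted(set().union(*schemas))
    pair_score = {frozenset(p): h_measure(schemas, *p)
                  for p in combinations(items, 2)}

    def qualifies(group):
        # C_min >= threshold is the same as every pair passing the threshold.
        return all(pair_score[frozenset(p)] >= threshold
                   for p in combinations(group, 2))

    level = {pair for pair in pair_score if qualifies(pair)}
    found = []
    while level:
        found.extend(level)
        # Grow each qualified set by one attribute; keep only qualified results.
        level = {g | {item} for g in level for item in items
                 if item not in g and qualifies(g | {item})}
    return found


schemas = [
    {"departure city", "destination"},
    {"from", "to"},
    {"departure city", "arrival city"},
]
found = mine_negative_sets(schemas, threshold=0.9)
for group in sorted(found, key=lambda g: (len(g), sorted(g))):
    print(" = ".join(sorted(group)))
```

On three toy transactions the search also reports spurious candidates such as arrival city = from, since any two attributes that never co-occur look negatively correlated. Weeding out such candidates is the job of the matching selection step.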

MetaQuerier 14 Correlation measures for ranking
Goal: rank and select matchings in model construction
The qualification measure is not good for ranking:
- a set cannot beat its subset because of downward closure, e.g., min({1, 2, 3}) < min({2, 3})
- yet a superset contains more matchings and should be preferred
Ranking measure: $C_{\max} = \max m(A_i, A_j)$ for all $i \neq j$
- a set does not beat its superset
- a tie is broken by semantic richness:
  - $A_1 = A_2 = A_3$ is semantically richer than $A_1 = A_2$
  - $A_1 = \{A_2, A_3\}$ is semantically richer than $A_1 = A_2$
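The contrast between the two measures can be shown on a toy example. The pairwise scores are invented, and using the number of attributes as a stand-in for semantic richness is an assumption of this sketch, not the definition used in the talk.

```python
from itertools import combinations

# Hypothetical pairwise scores m(A_i, A_j); values are illustrative only.
SCORES = {
    frozenset({"A2", "A3"}): 0.9,
    frozenset({"A1", "A2"}): 0.6,
    frozenset({"A1", "A3"}): 0.7,
}


def pair_scores(group):
    return [SCORES[frozenset(p)] for p in combinations(sorted(group), 2)]


def c_min(group):
    return min(pair_scores(group))


def c_max(group):
    return max(pair_scores(group))


def rank(candidates):
    """Rank by C_max; break ties by attribute count (assumed richness proxy)."""
    return sorted(candidates, key=lambda g: (c_max(g), len(g)), reverse=True)


subset = frozenset({"A2", "A3"})
superset = frozenset({"A1", "A2", "A3"})

# Under C_min the superset can never outrank its subset.
print(c_min(superset), c_min(subset))
# Under C_max the two tie, and the richer superset is ranked first.
print(c_max(superset), c_max(subset))
print(sorted(rank([subset, superset])[0]))
```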

MetaQuerier 15 Choosing the m: Measuring the correlation of two items
Contingency table
We explore 22 measures, e.g.,
- $\text{Lift} = f_{00} f_{11} / (f_{01} f_{10})$
- $\text{Jaccard} = f_{11} / (f_{11} + f_{01} + f_{10})$
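As a rough illustration, the contingency counts and the two example measures can be computed from schema transactions as below. The sketch assumes the usual convention that f11 counts schemas containing both attributes, f00 those containing neither, and f10 / f01 those containing only the first / only the second. Lift is coded exactly as written on the slide, and the toy schemas are invented.

```python
import math


def contingency(schemas, a, b):
    """Return (f11, f10, f01, f00) for attributes a and b."""
    f11 = f10 = f01 = f00 = 0
    for schema in schemas:
        has_a, has_b = a in schema, b in schema
        if has_a and has_b:
            f11 += 1
        elif has_a:
            f10 += 1
        elif has_b:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00


def lift(f11, f10, f01, f00):
    # Formula as given on the slide: f00 * f11 / (f01 * f10).
    denom = f01 * f10
    return math.inf if denom == 0 else f00 * f11 / denom


def jaccard(f11, f10, f01, f00):
    denom = f11 + f01 + f10
    return 0.0 if denom == 0 else f11 / denom


schemas = [
    {"author", "title"},
    {"author", "title", "isbn"},
    {"name", "title"},
    {"title"},
]
counts = contingency(schemas, "author", "name")
print(counts, lift(*counts), jaccard(*counts))
```

Returning infinity when the Lift denominator is zero is a choice made for this sketch. The slides do not say how that case is handled.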

MetaQuerier 16 Choosing the m: The problems of existing measures
Co-presence ($f_{11}$) is more important than co-absence ($f_{00}$):
- less positive correlation but a higher Lift = 17
- more positive correlation but a lower Lift = 0.69
Rare attributes are not statistically convincing:
- $A_p$ as a rare attribute: Jaccard = 0.02
- no rare attributes: Jaccard = 0.02

MetaQuerier 17 Choosing the m: H-measure
H-measure: $H = f_{01} f_{10} / (f_{+1} f_{1+})$
Ignores the co-absence:
- less positive correlation: H = 0.25
- more positive correlation: H = 0.07
Differentiates the subtlety of negative correlations:
- $A_p$ as a rare attribute: H = 0.49
- no rare attributes: H = 0.92
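A small sketch of the H-measure, assuming the usual marginals f1+ = f11 + f10 and f+1 = f11 + f01. The function names and the toy schemas are hypothetical. The example checks that adding schemas containing neither attribute (pure co-absence) leaves H unchanged.

```python
def h_measure(f11, f10, f01):
    """H = f01 * f10 / (f+1 * f1+); the co-absence count f00 is never used."""
    f1p = f11 + f10  # schemas containing the first attribute
    fp1 = f11 + f01  # schemas containing the second attribute
    if f1p == 0 or fp1 == 0:
        return 0.0   # assumption: treat an attribute that never occurs as uncorrelated
    return (f01 * f10) / (fp1 * f1p)


def h_from_schemas(schemas, a, b):
    f11 = sum(1 for s in schemas if a in s and b in s)
    f10 = sum(1 for s in schemas if a in s and b not in s)
    f01 = sum(1 for s in schemas if a not in s and b in s)
    return h_measure(f11, f10, f01)


schemas = [{"author"}, {"name"}, {"author", "name"}, {"author"}]
before = h_from_schemas(schemas, "author", "name")

# Add 100 schemas that mention neither attribute: only f00 grows.
padded = schemas + [{"title"}] * 100
after = h_from_schemas(padded, "author", "name")

print(before, after)
```

Attributes that never co-occur reach the maximum H = 1, while attributes that always appear together give H = 0.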

MetaQuerier 18 Experimental setup
447 deep Web sources in 8 domains
Domains:
- Travel: Airfares, Hotels, Car Rentals
- Entertainment: Books, Movies, Music Records
- Living: Jobs, Automobiles
Available as the TEL-8 dataset in the UIUC Web Integration Repository

MetaQuerier 19 Results in Books and Airfares domains
Books:
- author (any) = {last name (any), first name (any)}
- subject (string) = category (string)
- format (string) = binding (string)
Airfares:
- passenger (integer) = {adult (integer), child (integer), infant (integer)}
- from (string) = departure city (string) = depart (string)
- departure date (datetime) = depart (datetime)
- return date (datetime) = return (datetime)
- class (string) = cabin (string)
- destination (string) = to (string) = {departure city (string), arrival city (string)}

MetaQuerier 20 Contributions
Insight: we build a conceptually novel connection between data integration and correlation mining
- schema matching as a new application of correlation mining
- correlation mining as a new approach for schema matching
Techniques:
- the dual correlation mining framework
- measures for qualification and ranking
- the H-measure, robust for negative correlations

MetaQuerier 21 Thank You!