Semantic Interoperability and Data Warehouse Design

Sudha Ram, Andersen Consulting Professor
Huimin Zhao
Department of MIS, 430J McClelland Hall
Eller College of Business and Public Administration
University of Arizona, Tucson, AZ 85721
Phone: (520)
URL:

Need for Integration

Detecting Correspondences
Objective
Detecting schema-level correspondences is the first step in schema integration.
Detecting data-level correspondences is the first step in data integration and cleansing.
These are the most critical steps in data warehousing.
Objective: automate these steps as much as possible.
Potential Benefits
Real-world data is dirty! Don’t warehouse dirty data!
Avoid “garbage in, garbage out”!
Cleaner data, lower cost, better decisions.

Understanding Correspondences
MITRE has spent several years, largely on human interaction, to integrate the database systems of the U.S. Air Force. Consider one scenario. Suppose the integrator wants to know whether the mission start time in database A means the same as the mission take off time in database B. He contacts the local DBA of database B by letter, phone, or fax.

Integrator (letter, phone, or fax): “Does their mission start time mean the same as your mission take off time?”
Local DBA (the next day, if the integrator is lucky): “I maintain the database. But how to interpret the data is up to the domain experts.”
The integrator now has to ask the domain experts the same question.
Domain experts (two weeks later): “That depends, you know.”

This kind of communication about correspondences between attributes, entities, and relationships often takes weeks or even months. When the databases are large, e.g., hundreds of tables and thousands of attributes, understanding the correspondences becomes very time-consuming, and a lot of time and effort is wasted in human interaction.

This scenario covers only schema-level correspondences. Detecting data-level correspondences is even harder, because the data are far larger than the schemas. Many organizations have millions of customers, and manually detecting duplicated data in such huge databases is infeasible.

Proposed Approach
[Diagram. Labels: DB1, DB2, ..., DBn; Schema Integration; Data Integration; Warehouse; Statistical Clustering; SOM; Expert Rules; Machine Learning; Schema-Level Correspondences; Integrated Schema; Data-Level Correspondences]

Schema-Level Correspondences
Cluster Analysis
Statistical techniques: K-means and hierarchical clustering.
Neural nets: Self-Organizing Map (SOM).
Cluster similar schematic constructs, i.e., attributes, entities, and relationships.
Combine multiple types of input features, e.g., names, documentation, structure, statistics.
Apply multiple clustering methods to cross-validate results.
Provide an interactive tool for incremental analysis.
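The clustering step can be illustrated with a minimal sketch of hierarchical (agglomerative) clustering over attribute names. Everything concrete here is an assumption made for illustration: the attribute names are hypothetical, and the distance (one minus the Dice coefficient over character bigrams) uses names only, whereas the slides call for combining several feature types.

```python
from itertools import combinations

def bigrams(name):
    """Character bigrams of the attribute name (schema prefix and separators dropped)."""
    s = "".join(ch for ch in name.split(".")[-1].lower() if ch.isalnum())
    return {s[i:i + 2] for i in range(len(s) - 1)}

def name_distance(a, b):
    """1 - Dice coefficient over character bigrams (0.0 means identical bigram sets)."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        return 1.0
    return 1.0 - 2.0 * len(ga & gb) / (len(ga) + len(gb))

def hierarchical_clusters(items, distance, threshold):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters and stops once the closest
    remaining pair is farther apart than `threshold`.
    """
    clusters = [[item] for item in items]

    def linkage(c1, c2):
        return sum(distance(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Hypothetical attributes from two schemas, A and B.
attributes = [
    "A.mission_start_time", "B.mission_takeoff_time",
    "A.pilot_name", "B.pilot_nm",
    "A.aircraft_id", "B.aircraft_ident",
]
clusters = hierarchical_clusters(attributes, name_distance, threshold=0.6)
for cluster in clusters:
    print(cluster)
```

Each resulting cluster is a candidate set of corresponding attributes for a human to review; the threshold of 0.6 is arbitrary and would need tuning on real schemas.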

Input Features
Classification of Input Features
Database object names
Documentation
Schematic information
Data content
Usage patterns
Business rules and integrity constraints
Users’ minds and business processes
Observations
No single optimal set of input features exists.
Direct semantic features are more important than indirect ones.

Data-Level Correspondences
Given two relations r1 and r2 with the same schema, and a pair of tuples t1 from r1 and t2 from r2, we want to decide whether they correspond to the same real-world object.
Difficulties
Missing information.
Wrong data: data entry errors; names are routinely misspelled.
Nicknames.
Addresses and salaries change over time.
Abbreviations: “Caltech” for “California Institute of Technology”.
Many different ways to spell McDonald’s.

Techniques
Comparing Individual Attributes:
Exact match (true/false), e.g., gender.
Edit distance, phonetic distance (e.g., Soundex), and "typewriter" distance between two names.
Special lookup tables (e.g., names in different languages) and distance functions.
Comparing Records:
Rule-based technique: generate (fuzzy) rules via knowledge acquisition, e.g., if same_name AND similar_address, then same_person.
Machine learning techniques: learn matching rules from training data, using C4.5, back-propagation neural nets, etc.
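As a rough illustration of the attribute comparisons and the sample rule above, the sketch below implements Levenshtein edit distance, American Soundex, and a crisp (non-fuzzy) version of "if same_name AND similar_address, then same_person". The record layout, the thresholds 0.85 and 0.8, and applying Soundex to the whole name string are assumptions for the example, not values from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1.0 means identical strings."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

_SOUNDEX = {ch: digit
            for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                   ("l", "4"), ("mn", "5"), ("r", "6"))
            for ch in letters}

def soundex(name):
    """American Soundex code (first letter plus three digits); '' for empty input."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _SOUNDEX.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                      # h and w do not separate equal codes
        digit = _SOUNDEX.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit                      # vowels reset prev, so repeats are kept
    return (code + "000")[:4]

def same_name(a, b):
    if not a.strip() or not b.strip():    # missing information: do not guess
        return False
    return soundex(a) == soundex(b) or similarity(a, b) >= 0.85

def similar_address(a, b):
    return bool(a.strip()) and bool(b.strip()) and similarity(a, b) >= 0.8

def same_person(t1, t2):
    """Crisp version of: if same_name AND similar_address, then same_person."""
    return same_name(t1["name"], t2["name"]) and similar_address(t1["address"], t2["address"])

# Hypothetical tuples from relations r1 and r2.
rec_a = {"name": "Jon Smith", "address": "12 Main St"}
rec_b = {"name": "John Smith", "address": "12 Main St."}
rec_c = {"name": "Mary Jones", "address": "99 Oak Ave"}
print(same_person(rec_a, rec_b), same_person(rec_a, rec_c))
```

Soundex absorbs the "Jon"/"John" spelling variation, while the normalized edit distance tolerates the trailing period in the address. A fuzzy variant would return degrees of match instead of booleans.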

Why Both Rule-based and Machine Learning
Rule-based techniques: it is hard to specify a comprehensive set of rules.
Machine learning: needs a large amount of training data.
Different requirements at different stages:
DW development phase: domain expert rules + human evaluation => training data for machine learning.
Subsequent regular operation: learned rules can be used to reduce the amount of human evaluation.
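The two-stage workflow can be sketched as follows. During development, candidate pairs scored by expert rules are confirmed or rejected by a human, and those labels become training data. A learner then derives a matching rule for regular operation. The learner below is a one-feature decision stump, a deliberately simplified stand-in for C4.5 or a neural net, and the scores and labels are invented for illustration.

```python
def learn_threshold(examples):
    """Fit a decision stump: predict 'match' when score >= threshold.

    `examples` is a list of (similarity_score, is_match) pairs, where
    is_match is the human reviewer's verdict. Returns the cut point with
    the fewest training errors.
    """
    scores = sorted({score for score, _ in examples})
    if len(scores) < 2:
        raise ValueError("need at least two distinct scores to place a cut point")
    # Candidate cut points are midpoints between adjacent distinct scores.
    candidates = [(lo + hi) / 2 for lo, hi in zip(scores, scores[1:])]

    def errors(threshold):
        return sum((score >= threshold) != label for score, label in examples)

    return min(candidates, key=errors)

# Development phase: hypothetical pairs proposed by expert rules,
# each reviewed by a human (True = same real-world object).
reviewed = [
    (0.95, True), (0.90, True), (0.82, True),
    (0.60, False), (0.55, False), (0.30, False),
]
threshold = learn_threshold(reviewed)

# Regular operation: pairs scoring at or above the learned threshold are
# accepted automatically, which reduces the amount of human evaluation.
def is_match(score):
    return score >= threshold
```

In practice the learner would see many features per pair (name similarity, address similarity, and so on), and borderline scores could still be routed to a human reviewer.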

Experimental Analysis
Database A Database B

K-Means

Hierarchical Clustering

Self-Organizing Map (SOM)
Attribute Map Combining multiple types of input features
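For readers unfamiliar with how such an attribute map is produced, here is a minimal self-organizing map sketch in plain Python. It is not the authors' tool: the grid size, decay schedules, and the four-element feature vectors (imagined as normalized indicators such as "is numeric", "is date", relative length, and fraction of distinct values) are all assumptions for illustration.

```python
import math
import random

def best_matching_unit(weights, v):
    """Grid coordinates (row, col) of the node whose prototype is closest to v."""
    return min(
        ((r, c) for r in range(len(weights)) for c in range(len(weights[0]))),
        key=lambda rc: sum((a - b) ** 2 for a, b in zip(weights[rc[0]][rc[1]], v)),
    )

def train_som(vectors, rows, cols, epochs=200, seed=0):
    """Train a small rectangular SOM; weights[r][c] is a prototype vector."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1 - frac), 0.5)    # neighbourhood shrinks over time
        for v in rng.sample(vectors, len(vectors)):  # shuffled pass over the inputs
            br, bc = best_matching_unit(weights, v)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    influence = math.exp(-grid_d2 / (2 * radius ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        # Pull the prototype toward the input, scaled by grid proximity.
                        w[k] += rate * influence * (v[k] - w[k])
    return weights

# Hypothetical feature vectors for attributes from schemas A and B.
features = {
    "A.mission_start_time":   [0.0, 1.0, 0.2, 0.9],
    "B.mission_takeoff_time": [0.0, 1.0, 0.2, 0.9],
    "A.pilot_name":           [0.0, 0.0, 0.6, 0.8],
    "B.pilot_nm":             [0.0, 0.0, 0.5, 0.8],
    "A.aircraft_id":          [1.0, 0.0, 0.1, 1.0],
}
weights = train_som(list(features.values()), rows=3, cols=3)
placement = {name: best_matching_unit(weights, vec) for name, vec in features.items()}
for name, cell in placement.items():
    print(name, "->", cell)
```

Attributes that land on the same or neighbouring cells are candidates for correspondence. Plotting the distances between neighbouring prototypes gives the kind of shaded map the following slides show.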

SOM Black-White High similarity

SOM Black-White Intermediate similarity

SOM Black-White Low similarity

SOM using structural features only: big clusters.

SOM Black-White Intermediate similarity

SOM Black-White Low similarity

Entity Map

Entity Map (Black-White)

Conclusion
Multi-technique approach for detecting both schema-level and data-level correspondences.
SOM tool for clustering schema objects.
Experimental analysis:
Combining multiple input features improves the accuracy of semantic clustering.
Using only indirect semantic features may not generate tight clusters.
The SOM tool visualizes clustering results and enables incremental analysis.

Future Work
Integrate multiple techniques into a complete integration and cleansing tool.
Evaluate the utility of the tool in large real-world data warehousing projects.
Commercial Tools
Data standardization in a single source, e.g., Hotdata: addresses and phone numbers.
Identify duplicates from multiple sources:
Enterprise/Integrator from Apertus: expert-specified rules.
Integrity from Vality: customized probabilistic matching rules.
Detect both schema-level and data-level correspondences using various techniques.
