Automating the Integration of Heterogeneous Databases


1 Automating the Integration of Heterogeneous Databases
Eduard Hovy, Andrew Philpot, Patrick Pantel Digital Government Research Center University of Southern California

2 DGRC research projects
Major research focus since 1999: integration of data from disparate sources.
- EIA project: energy price data (EIA, Census, BLS, etc.); integrated over 50,000 data tables under a central data model, accessible to users via a natural-language interface in English and Spanish
- ARGOS project: transportation / freight-flow data integration, with different government partners
- Air Quality Integration project: air quality (emissions) data from districts in California, compared to the central state database
We have learned several lessons:
1. Don't waste the government partner's time!
2. Don't make overblown promises about system functionality: this is research!
3. Respect the government partner's need for data privacy!

3 Problem: data consolidation
Massive data heterogeneity in government (as in any large organization):
- Data is collected at different times by different people
- Data definitions, formats, and types slowly evolve
Problems:
- Staff and the public cannot effectively locate, share, or compare data
- Computational data interoperability cannot be achieved
We investigate (semi-)automatic methods for integrating data sources, with two approaches:
1. Automatically link each database into a metadata structure
2. Automatically find cross-database mappings

4 Our work on database integration
Our goal: develop (semi-)automated methods to speed up the data alignment and database integration process.
Data: EPA's Air Quality Management Districts, 35 in California, each collecting its own emissions inventory databases its own way, using its own standards. CARB in Sacramento integrates this data annually and passes it to US-EPA, which integrates it with other states' data.
Approach:
- Focus on the human-centered process of identifying semantic equivalence between source and target data sets
- Develop a suite of tools to map, transform, and re-aggregate data from one schema to another:
  - Leverage statistical techniques from human language technology
  - Address issues of scalability and maintenance
  - Work with others to develop sophisticated visualization tools
- Eventually integrate the toolkit into a larger data management framework

5 Techniques studied
Context: two sets of EPA air quality databases with partly overlapping, slightly altered data.
First tried the Expectation Maximization (EM) algorithm (cell-by-cell, column-by-column, and column-within-row alignment): not much success.
Best technique: Mutual Information-based clustering, using the CBC system developed by Patrick Pantel (mostly used for text clustering and automated ontology building through clustering):
- Define a set of features (orthography, datatype, ...)
- Given N and M columns of values: for each column, find sufficiently high Mutual Information scores among the values' features
- Compare across columns and overall
- Evaluate: compare against a hand-built 'gold standard' mapping across the databases
Why MI? Individual data values don't help, but pairs of them are informative:

MI(w1 w2) = C(w1 w2) / ( C(w1) · C(w2) )

(Example word pairs: 'Humpty Dumpty', 'White House', 'green house', 'sock sausage'.)

6 Magic: Pointwise Mutual Information
Phone calls originating from a SoCal city:
(Table: a database of telecommunications, one table per Southern California city, listing counts of calls to destinations including Hamburg, Leipzig, Mosul, Kabul, L.A., D.C., Toronto, Boston, St. Louis, and other SoCal cities such as Culver City, Anaheim, Ventura, Covina, Long Beach, Carson, Compton, and Hollywood.)

7 Example of MI
MI: the high NO2 producers
(Table: measured NO2 per facility, e.g. XYZ = 39.2, F&J Brothers = 40.1, Standard Oil Refinery = 996.6; dividing each measured value by the relevant marginal counts gives an MI score that re-ranks the facilities as F&J Brothers, Standard Oil Refinery, Archie Welding Co, Superba, Ozonia, XYZ, Green Cleaner.)

8 Pointwise Mutual Information
Calls originating from a SoCal city, sorted by MI:
(Ranked list: Kabul, Mosul, Leipzig, and Hamburg at the top, with surviving scores of 7.05, 5.78, and 5.58; Culver City, L.A., D.C., Anaheim, Ventura, Toronto, Boston, Covina, Compton, St. Louis, Long Beach, Carson, and Hollywood lower, around 2.03.)
Pointwise Mutual Information measures the strength of association between the feature and the element: it is the reduction in uncertainty of one event due to knowing about another (Pantel 2003; Church and Hanks 1989):

PMI(x, y) = log [ P(x, y) / ( P(x) · P(y) ) ]

where P(x, y) is the joint probability and P(x) · P(y) is the probability assuming independence.

9 Pointwise Mutual Information
Pointwise Mutual Information:
- measures the strength of association between the feature and the element
- reflects the reduction in uncertainty of one event due to knowing about another
- may include a factor to adjust for low values
(Pantel 2003; Church and Hanks 1989)
It compares the joint probability against the probability assuming independence; from counts, for the NO2/F&J example:

PMI(NO2, F&J) = log [ ( C(NO2, F&J) / C(NO2, any) ) / ( C(any, F&J) / C(any, any) ) ]
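The count-ratio form of PMI can be sketched directly; a minimal illustration with hypothetical counts (the function name and the numbers are ours, not from the slides):

```python
import math

def pmi(c_xy, c_x_any, c_any_y, c_any_any):
    # PMI from co-occurrence counts, in the ratio form of the slide:
    # ("NO2 to F&J" / "NO2 to any") over ("any to F&J" / "any to any").
    return math.log((c_xy / c_x_any) / (c_any_y / c_any_any))

# Hypothetical counts: the pair occurs 10 times, each side occurs 20 times,
# out of 100 observations in total.
score = pmi(10, 20, 20, 100)  # log(2.5): a positive association
```

A score of zero means the pair occurs exactly as often as chance predicts; positive scores mark pairs that co-occur more often than chance.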

10 Matching columns with MI
(Tables: one database has columns Facility, NO2, O2, and Cat, listing facilities such as XYZ, F&J Brothers, Superba, Standard Oil Refinery, Ozonia, Archie Welding Co, and Green Cleaner; the other has columns Comp_Name and X_val, listing XYZ Corp, Superba Limpiar (182), We Burn It Corp (511), Frank and Joe, Acme, Standard Oil Refinery (997), Ozonia Dry Cleaners (236), Steve's Auto Body (139), The Green Cleaners (23), Thrifty Recycling, Archie's Welding (307), Nonesuch Building (56), FooBar Inc. (496), and ZZZZ Best Inc. Converting each column to its MI vector re-ranks both sides, so that Facility/MI-NO2 in one database can be matched against Comp_Name/MI-X_val in the other.)

11 Similarity measure
Place all the feature vectors into a 'vector space'.
Cosine coefficient: a commonly used metric in spaces where matching observations are more indicative of similarity than missing observations are indicative of dissimilarity (popular in Information Retrieval, etc.). It measures the cosine of the angle between the feature vectors ei and ej of two columns (Salton and McGill 1983):

cos(ei, ej) = (ei · ej) / ( ||ei|| · ||ej|| )
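The cosine coefficient is easy to state in code; a minimal sketch over sparse feature vectors represented as dicts from feature to MI weight (the dict representation is our assumption, not the paper's):

```python
import math

def cosine(u, v):
    # Cosine of the angle between two sparse feature vectors.
    # Only features present in both vectors contribute to the dot product,
    # which is why shared observations dominate the score.
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Identical vectors score 1.0; vectors with no shared features score 0.0.
same = cosine({"val:90292": 1.0, "type:zip": 2.0},
              {"val:90292": 1.0, "type:zip": 2.0})
none = cosine({"val:90292": 1.0}, {"val:Acme": 1.0})
```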

12 Aligning columns from databases A and B
1. Prepare the data collections
2. Define features, each some aspect of the data. Features can be:
   1) cell values (e.g. '90292', ...)
   2) preprocessed domain classifiers that append the class of data to the cell value (e.g. 'zip5:90292', 'tel+area:(310) ...')
   3) orthographic/type aspects (e.g. 'capitalized:IBM', 'integer:12', ...)
3. For each data cell, create its feature set
4. For each column, create a feature vector to characterize it
5. Compute pointwise mutual information vectors for each column: find unusual combinations of feature values
6. Compute the similarity of each vector in A to each in B
7. Align each column from A with its most similar column in B
Evaluation: compare against a hand-built gold standard
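The steps above can be sketched end to end. This is a toy illustration under our own simplifications, not the actual SiFT implementation: the feature choices are a small subset, and the databases and column names are hypothetical.

```python
import math
from collections import Counter

def features(cell):
    # Steps 2-3: feature set for one cell -- its raw value plus a type tag.
    s = str(cell)
    return {f"val:{s}", "type:integer" if s.isdigit() else "type:string"}

def column_vectors(db):
    # Steps 4-5: count features per column, then reweight each count by
    # pointwise mutual information against the whole-database feature totals.
    col_counts = {name: Counter(f for cell in col for f in features(cell))
                  for name, col in db.items()}
    totals = Counter()
    for c in col_counts.values():
        totals.update(c)
    grand = sum(totals.values())
    vecs = {}
    for name, c in col_counts.items():
        n = sum(c.values())
        vecs[name] = {f: math.log((k / grand) / ((n / grand) * (totals[f] / grand)))
                      for f, k in c.items()}
    return vecs

def cosine(u, v):
    # Step 6: cosine similarity between two sparse PMI vectors.
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def align(db_a, db_b):
    # Step 7: map each column of A to its most similar column of B.
    va, vb = column_vectors(db_a), column_vectors(db_b)
    return {a: max(vb, key=lambda b: cosine(va[a], vb[b])) for a in va}

# With two toy databases, the name columns and the zip columns pair up:
# align({"Facility": ["Acme", "Superba"], "Zip": ["90292", "93101"]},
#       {"Comp_Name": ["Acme", "Superba", "Ozonia"], "X_zip": ["90292", "93101", "93001"]})
# -> {"Facility": "Comp_Name", "Zip": "X_zip"}
```

The alignment works because the name columns share value features (and the string type tag) while sharing nothing with the zip columns, so cosine over the PMI vectors separates them cleanly.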

13 Data and SiFT procedure
Source databases used to date: provided by the Santa Barbara County Air Pollution Control District (SBCAPCD) and the Ventura County Air Pollution Control District (VCAPCD). Both provided emissions inventories for 2001 and 2002, covering facilities, devices, processes, permitting history, and criteria and toxic emissions.
Target database to match: provided by the California Air Resources Board (CARB); statewide emissions inventories for 2001 and 2002.
Steps we follow:
1. Prepare the two data collections
2. Run the system to find candidate alignments
3. Manually inspect: accept or reject candidates
4. Cycle until the system exhausts its Mutual Information scores

14 SiFT alignment interface
A correct alignment found by SiFT, from the Process Description column in the SBCAPCD database to the CARB database:
1. User asks: show me this SB column
2. SB column features and match values
3. Best match: CARB column and values
4. Why did these align?

15 Details: feature values, matches, scores
Details of the mapping: out of all the gases, the system automatically found the matching column containing only the organic ones.

16 Example alignment
Features for the Process Description column in the SBCAPCD db and the matching column in CARB, sorted in descending order of Mutual Information. Red features are shared by both columns; blue features appear only in SBCAPCD; green features only in CARB. The more red at the top, the better the alignment.

17 Experimental architecture
We experimented with:
- alignment
- projection

18 Experiment 1: Alignment
Gold standard: we manually aligned the 2001 SBCAPCD and CARB emissions inventories.
Difficult to do: only partial matching was achieved by hand.

19 Experiment 1 results: correctness
Surprise: simple features outperform the others.
Lesson: unless feature extraction is very accurate, let Mutual Information automatically discover the key values of a particular data domain.

20 Experiment 1 results: top-N
Surprise: if the system can find a correct alignment for a given column, the alignment is found within the first two returned candidates.
Impact: greatly reduces human search time, since only two candidate alignments per column need inspection.

21 Experiment 2: Projection

22 Experiment 2 setup
Question: given aligned historical (2001) data and each dataset's changes for 2002, can we automatically 'predict' the future (2002) data?
Experiment:
- We had the 2001 VCAPCD and 2001 SBCAPCD databases aligned with CARB's 2001 database
- We also used projection schemas from 2001 to 2002 for VCAPCD, SBCAPCD, and CARB independently
- From these we created values for the projected CARB 2002 database
Gold standard: CARB provided us with their final 2002 database as ground truth.
Evaluation: we randomly sampled 50 columns in the automatically integrated CARB 2002 database; a human judge classified each aligned column as correct, partially correct, or incorrect.

23 Experiment 2 results: correctness
Particularly bad at aligning binary (Yes/No or 0/1) columns: MI is not useful there, since the same binary values are shared by many columns.
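A small illustration of why binary columns defeat the method: every Yes/No column is described by the same two value features, so the MI-weighted vectors of unrelated flag columns are nearly parallel and cosine cannot separate them. The weights below are hypothetical:

```python
import math

def cosine(u, v):
    # Cosine between sparse feature vectors (dict: feature -> weight).
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Three unrelated flag columns: all are described by the same two value
# features, differing only slightly in weight.
flag_a = {"val:Y": 0.7, "val:N": 0.3}
flag_b = {"val:Y": 0.6, "val:N": 0.4}
flag_c = {"val:Y": 0.5, "val:N": 0.5}
# Every pairwise similarity is close to 1, so the ranking of candidate
# matches is essentially uninformative for binary columns.
```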

24 Experiment 2 results: top-N
The system's similarity measure gives a confidence score for each alignment decision.
Hypothesis: the higher the confidence, the higher the chance that the alignment is correct.
Verification: the table shows this holds.

25 Experiment 2 results summary
Automatically generated the California Air Resources Board (CARB) database from the Santa Barbara (SBCAPCD) and Ventura (VCAPCD) County databases for 2002.
A human judge compared system output to the gold standard (the true 2002 database from CARB).
(Tables: number of correct elements from the test sample of 50, and accuracy of the top-ranked N alignments against the gold standard.)

26 Conclusion
We propose an information-theoretic model for performing data-driven column alignments.
Past work: we aligned the 2001 SBCAPCD and VCAPCD data sets with the CARB data set and projected the data to 2002:
- 75% precision and 72.2% recall for column alignment
- 55%–59% accuracy on the projection task
Current work: we use the model to find aliases/duplicates (i.e., perform record linkage) within a single database.
SiFT has the potential to significantly reduce the amount of human labor needed to manage multiple heterogeneous databases.

27 Thank you!

