
eTuner: Tuning Schema Matching Software using Synthetic Scenarios
Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan (University of Illinois, USA)
Arnon Rosenthal (MITRE Corp., USA)

2 Main Points
Tuning matching systems: long-standing problem
  – becomes increasingly worse
We propose a principled solution
  – exploits synthetic input/output pairs
  – promising, though much work remains
Idea applicable to other contexts

3 Schema Matching
Schema 1: price, agent-name, address
  120,000 | George Bush | Crawford, TX
  239,900 | Hillary Clinton | New York City, NY
Schema 2: listed-price, contact-name, city, state
  320K | Jane Brown | Seattle | WA
  240K | Mike Smith | Miami | FL
[Figure: arrows between the two schemas, labeled "1-1 match" and "complex match"]

4 Schema Matching is Ubiquitous
Databases
  – data integration
  – model management
  – data translation
  – collaborative data sharing
  – keyword querying, schema/view integration
  – data warehousing, peer data management, …
AI
  – knowledge bases, ontology merging, information gathering agents, ...
Web
  – e-commerce, Deep Web, Semantic Web
eGovernment, bio-informatics, scientific data management

5 Current State of Affairs
Finding semantic mappings is now a key bottleneck!
  – largely done by hand, labor intensive & error prone
Numerous matching techniques have been developed
  – Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U Leipzig, U Wisconsin, NCSU, U Illinois, Washington, Humboldt-Universität zu Berlin, ...
  – AI: Stanford, Karlsruhe University, NEC Japan, ...
Techniques are often synergistic, leading to multi-component matching architectures
  – each component employs a particular technique
  – final predictions combine those of the components

6 An Example: LSD [SIGMOD-01]
Schema 1: address, agent-name
  Urbana, IL | James Smith
  Seattle, WA | Mike Doan
Schema 2: area, contact-agent
  Peoria, IL | (206) 634 9435
  Kent, WA | (617) 335 4243
Name Matcher and Naive Bayes Matcher feed the Combiner
  – scores shown in the figure for "agent name" and "contact agent": 0.3, 0.5, 0.1
Candidate matches with scores
  – area => (address, 0.7), (description, 0.3)
  – contact-agent => (agent-phone, 0.7), (agent-name, 0.3)
  – comments => (address, 0.6), (desc, 0.4)
Constraint Enforcer
  – constraint: only one attribute of Schema 2 matches address
Match Selector
  – area = address
  – contact-agent = agent-phone
  – ...
  – comments = desc

7 Multi-Component Matching Solutions
Such systems are very powerful...
  – maximize accuracy; highly customizable to individual domains
... but place a serious tuning burden on domain users
Developed in many recent works
  – e.g., Doan et al., WebDB-00, SIGMOD-01; Do & Rahm, VLDB-02; Embley et al.-02; Bernstein et al., SIGMOD Record-04; Madhavan et al. 05
Now commonly adopted, with industrial-strength systems
  – e.g., Protoplasm [MSR], COMA++ [Univ. of Leipzig]
[Figure: execution graphs of LSD, COMA, SF, and LSD-SF, each assembled from a subset of: matchers (Matcher 1 … Matcher n), combiner, match selector, constraint enforcer]

8 Tuning Schema Matching Systems
Library of matching components
  – Matchers: q-gram name matcher, decision tree matcher, Naïve Bayes matcher, TF/IDF name matcher, SVM matcher
  – Combiners: average combiner, min combiner, max combiner, weighted sum combiner
  – Constraint enforcers: A* search enforcer, relaxation labeler, ILP
  – Match selectors: threshold selector, bipartite graph selector
Execution graph
  – built from Matcher 1 … Matcher n, a combiner, a match selector, and a constraint enforcer
Knobs of decision tree matcher
  – characteristics of attr.
  – post-prune?
  – size of validation set
  – split measure
Given a particular matching situation
  – how to select the right components?
  – how to adjust the multitude of knobs?
Untuned versions produce inferior accuracy, however...
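To make the component library concrete, the following is a minimal sketch of how combiners and a threshold selector might be plugged together. The score values, the function names `combine` and `threshold_select`, and the dictionary layout are assumptions made for illustration; they are not taken from the slides or from any of the matching systems named above.

```python
from statistics import mean

# Hypothetical similarity scores produced by two matchers for candidate
# pairs (source attribute, target attribute). Values are illustrative only.
name_matcher = {("area", "address"): 0.5, ("area", "description"): 0.25}
bayes_matcher = {("area", "address"): 0.75, ("area", "description"): 0.5}

# Three of the combiner types listed on the slide.
COMBINERS = {"average": mean, "min": min, "max": max}


def combine(matcher_outputs, how="average"):
    """Merge the per-matcher scores of each candidate pair into one score."""
    fn = COMBINERS[how]
    pairs = set().union(*matcher_outputs)
    # A matcher that did not score a pair contributes 0.0 for it.
    return {pair: fn([m.get(pair, 0.0) for m in matcher_outputs]) for pair in pairs}


def threshold_select(scores, threshold):
    """Keep the candidate pairs whose combined score reaches the threshold knob."""
    return {pair for pair, score in scores.items() if score >= threshold}


combined = combine([name_matcher, bayes_matcher], how="average")
selected = threshold_select(combined, threshold=0.5)
print(combined)
print(selected)
```

Both the choice of combiner (`how`) and the selector's `threshold` are knobs in the sense of this slide: changing either one changes which matches are returned.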

9 Tuning is Extremely Difficult
Large number of knobs
  – e.g., 8-29 in our experiments
Wide variety of techniques
  – database, machine learning, IR, information theory, etc.
Complex interaction among components
Not clear how to compare the quality of knob configs
Matching systems are still tuned manually, by trial and error
Multiple component systems make tuning even worse...
Developing efficient tuning techniques is crucial to making matching systems attractive in practice

10 The eTuner Solution
Given schema S & matching system M
  – tunes M to maximize average accuracy of matching S with future schemas
  – incurs virtually no cost to user
Key challenge 1: Evaluation
  – must search for best knob config
  – how to compute the quality of any knob config C?
  – if we know the ground-truth matches for a representative workload W = {(S,T1), ..., (S,Tn)}, then we can use W to evaluate C
  – but often have no such W
Key challenge 2: Search
  – how to efficiently evaluate the huge space of knob configs?
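The evaluation step can be sketched as follows, assuming that the quality of a knob config C is taken to be the average F1 score over a workload with known matches. The toy matcher, its single `prefix_len` knob, and the tiny workload are hypothetical stand-ins for a real matching system M and are not part of eTuner.

```python
def f1(predicted, truth):
    """F1 of a predicted match set against the ground-truth match set."""
    correct = len(predicted & truth)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(truth)
    return 2 * precision * recall / (precision + recall)


def config_quality(run_matcher, config, workload):
    """Average F1 of `config` over workload items (S, T, true_matches)."""
    scores = [f1(run_matcher(config, s, t), truth) for s, t, truth in workload]
    return sum(scores) / len(scores)


def toy_matcher(config, s, t):
    """Hypothetical matcher: pair attributes that share a name prefix."""
    n = config["prefix_len"]
    return {(a, b) for a in s for b in t if a[:n].lower() == b[:n].lower()}


# Hypothetical workload with one schema pair and its ground-truth matches.
workload = [
    (
        ["salary", "state", "last"],
        ["sal", "st", "last_name"],
        {("salary", "sal"), ("state", "st"), ("last", "last_name")},
    ),
]

candidates = [{"prefix_len": n} for n in (1, 2, 3)]
best = max(candidates, key=lambda c: config_quality(toy_matcher, c, workload))
print(best)
```

With ground truth available, comparing knob configs reduces to comparing numbers; the difficulty named on the slide is that such a workload usually does not exist.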

11 Key Idea: Generate Synthetic Input/Output Pairs
Need workload W = {(S,T1), (S,T2), …, (S,Tn)}
To generate W
  – start with S
  – perturb S to generate T1
  – perturb S to generate T2
  – etc.
Know the perturbation => know matches between S & Ti

12 Key Idea: Generate Synthetic Input/Output Pairs
Schema S: table EMPLOYEES(id, first, last, salary ($))
  1 | Bill | Laup | 40,000 $
  2 | Mike | Brown | 60,000 $
  3 | Jean | Ann | 30,000 $
  4 | Roy | Bond | 70,000 $
Split S into V and U with disjoint data tuples
  – V: EMPLOYEES with tuples 1 and 2
  – U: EMPLOYEES with tuples 3 and 4
Perturb V to obtain V1
  – perturb # of tables
  – perturb # of columns in each table: EMPLOYEES(last, id, salary($)) with (Laup, 1, 40,000$), (Brown, 2, 60,000$)
  – perturb column and table names: EMPS(emp-last, id, wage)
  – perturb data tuples in each table: (Laup, 1, 45200), (Brown, 2, 59328)
Ω1: a set of semantic matches
  – EMPS.emp-last = EMPLOYEES.last
  – EMPS.id = EMPLOYEES.id
  – EMPS.wage = EMPLOYEES.salary($)
Repeat to obtain V1, ..., Vn

13 Examples of Perturbation Rules
Number of tables
  – merge two tables based on a join path
  – split a table into two
Structure of table
  – merge two columns, e.g., neighboring columns, or columns sharing a prefix/suffix (last-name, first-name)
  – drop a column
  – swap location of two columns
Names of tables/columns
  – rules capture common name transformations
  – abbreviation to the first 3-4 characters, dropping all vowels, synonyms, dropping prefixes, adding table name to column name, etc.
Data values
  – rules capture common format transformations: 12/4 => Dec 4
  – values are changed based on some distributions (e.g., Gaussian)
See paper for details
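A few of the rules above can be sketched in code. This is a minimal illustration, not eTuner's rule engine: the function names, the choice to keep the first character when dropping vowels, and the way rules are picked at random are all assumptions. The key point it demonstrates is that the generator records which original column each perturbed column came from, so the ground-truth matches come for free.

```python
import random

VOWELS = set("aeiouAEIOU")


def abbreviate(name, length=3):
    """Name rule: abbreviate to the first few characters."""
    return name[:length]


def drop_vowels(name):
    """Name rule: drop vowels (assumption: the first character is kept)."""
    return name[0] + "".join(c for c in name[1:] if c not in VOWELS)


def add_table_prefix(name, table):
    """Name rule: add the table name to the column name."""
    return f"{table}-{name}"


def perturb_columns(table, columns, rng):
    """Drop one column, reorder the rest, rename each; return the new
    column list and the known matches {new_name: original_name}."""
    kept = list(columns)
    if len(kept) > 1:
        kept.remove(rng.choice(kept))  # structure rule: drop a column
    rng.shuffle(kept)                  # structure rule: move columns around
    rules = [abbreviate, drop_vowels, lambda n: add_table_prefix(n, table)]
    matches = {}
    for col in kept:
        # Assumes the rules yield distinct names for the given columns.
        matches[rng.choice(rules)(col)] = col
    return list(matches), matches


rng = random.Random(0)  # fixed seed so the sketch is reproducible
new_cols, matches = perturb_columns(
    "EMPLOYEES", ["id", "first", "last", "salary"], rng
)
print(new_cols)
print(matches)
```

A fuller generator would also apply the table-level and data-value rules, and would need to resolve name collisions that this sketch only assumes away.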

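14 The eTuner Architecture
[Figure: architecture diagram. Components shown: Schema S, Perturbation Rules, Workload Generator, Synthetic Workload consisting of triples (U, Ω1, V1), (U, Ω2, V2), ..., (U, Ωn, Vn), Matching Tool M, Tuning Procedures, Staged Tuner, and the output Tuned Matching Tool M; one element is marked "(Optional)"]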

15 The Staged Tuner
Tune sequentially, starting with the lowest-level components
[Figure: execution graph arranged in Level 1 to Level 4, containing Matcher 1 … Matcher n, combiner, match selector, and constraint enforcer, with an arrow marking the tuning direction]
Assume
  – execution graph has k levels, m nodes per level
  – each node can be assigned one of n components
  – each component has p knobs, each of which has q values
Tuning examines (npqkm) out of (npq)^(km) knob configs
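The gap between the two counts is easier to appreciate with numbers. The sketch below first evaluates both expressions for hypothetical values of n, p, q, k, and m (reading npqkm as the product n·p·q·k·m), then shows a toy level-by-level search. The component names, the integer scores, and the additive quality function are assumptions for illustration; with an additive quality function the greedy result equals the exhaustive optimum, which need not hold when components interact.

```python
from itertools import product


def config_counts(n, p, q, k, m):
    """Counts from the slide: staged tuning vs. the full config space."""
    return n * p * q * k * m, (n * p * q) ** (k * m)


def staged_tune(levels, quality):
    """Tune one level at a time, bottom-up, freezing each level's best option."""
    chosen, evaluations = [], 0
    for options in levels:
        best, best_q = None, float("-inf")
        for option in options:
            q = quality(chosen + [option])  # quality of the partial graph
            evaluations += 1
            if q > best_q:
                best, best_q = option, q
        chosen.append(best)
    return chosen, evaluations


# Hypothetical scores; a real tuner would run M on the synthetic workload.
SCORE = {"q-gram": 2, "tf-idf": 3, "average": 5, "max": 1, "min": 4,
         "threshold": 2, "bipartite": 6}


def quality(partial_config):
    return sum(SCORE[c] for c in partial_config)


levels = [["q-gram", "tf-idf"], ["average", "max", "min"],
          ["threshold", "bipartite"]]

staged, exhaustive = config_counts(n=3, p=2, q=4, k=4, m=2)
print(staged, exhaustive)

tuned, evaluations = staged_tune(levels, quality)
best_exhaustive = max(product(*levels), key=lambda c: quality(list(c)))
print(tuned, evaluations, len(list(product(*levels))))
```

In this toy setting the staged search makes 7 quality evaluations against 12 for exhaustive enumeration; the savings grow quickly with the number of levels and options.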

16 Empirical Evaluation
Domains
  Domain       # schemas   # tables per schema   # attributes per schema   # tuples per table   reference paper
  Real Estate  5           2                     30                        1000                 LSD (SIGMOD01)
  Courses      5           3                     13                        50                   LSD
  Inventory    10          4                                               20                   Corpus (ICDE05)
  Product      2           2                     50                        120                  iMAP (SIGMOD04)
Matching systems
  – LSD: 6 Matchers, 6 Combiners, 1 Constraint enforcer, 2 Match selectors, 21 Knobs
  – iCOMA: 10 Matchers, 4 Combiners, 2 Match selectors, 20 Knobs
  – SF: 3 Matchers, 1 Constraint enforcer, 2 Match selectors, 8 Knobs
  – LSD-SF: 7 Matchers, 7 Combiners, 1 Constraint enforcer, 2 Match selectors, 29 Knobs

17 Matching Accuracy
[Figure: four bar charts, one per matching system (LSD, COMA, SF, LSD-SF); x-axis: domain (Course, Inventory, Product, Real Estate); y-axis: 0 to 0.9; bars: Off-the-shelf, Domain-independent, Domain-dependent, Source-dependent, eTuner: Automatic, eTuner: Human-assisted]
eTuner achieves higher accuracy than current best methods, at virtually no cost to the user

18 Cost of Using eTuner
You have a schema S and a matching system M
Vendor supplies eTuner
  – will hook it up with matching system M
Vendor supplies a matching system M
  – bundles eTuner inside

19 Sensitivity Analysis
Adding perturbation rules
Exploiting prior match results (enriching the workload)
[Figure: line plot; x-axis: Schemas in Synthetic Workload (#), ticks 1, 10, 20, 25, 40, 50; y-axis: Accuracy (F1), 0.0 to 0.7; series: Average, Inventory Domain, Real Estate Domain]
[Figure: plot for Tuned LSD; x-axis: Previous matches in collection (%), ticks 0, 22, 44, 66, 88; y-axis: 0.0 to 0.9]

20 Summary: The eTuner Project @ Illinois
Tuning matching systems is crucial
  – long-standing problem, is getting worse
  – a next logical step in schema matching research
Provides an automatic & principled solution
  – generates a synthetic workload, employs it to tune efficiently
  – incurs virtually no cost to human users
  – exploits user assistance whenever available
Extensive experiments over 4 domains with 4 systems
Future directions
  – find optimal synthetic workload
  – apply to other matching scenarios
  – adapt ideas to scenarios beyond schema matching (see 3rd speaker)

21 Backup: User Assistance
S(phone1, phone2, …)
Generate V by dropping phone2: V(phone1, …)
Rename phone1 in V: V(x, …)
Problem:
  – x matches phone1, x does not match phone2
User:
  – groups phone1 and phone2
  – so if x matches phone1, it will also match phone2
Intuition: tell the system not to bother trying to distinguish phone1 from phone2
