Top-K Generation of Integrated Schemas Based on Directed and Weighted Correspondences by Ahmed Radwan, Lucian Popa, Ioana R. Stanoi, Akmal Younis


Top-K Generation of Integrated Schemas Based on Directed and Weighted Correspondences by Ahmed Radwan, Lucian Popa, Ioana R. Stanoi, Akmal Younis. Presented by Prasanna Kunchavaram ( ) ITCS rd November 2009

Introduction
Schema integration is the process of unifying heterogeneous data sources to obtain a single non-redundant, consistent data source. Put differently, it is the process of combining local schemas into a global, integrated schema.
Examples:
- Combining databases or tables after a merger or acquisition of companies.
- Combining two products into one, together with the resulting combination of historic sales data.
- Creating a new table for employees from employment data and medical records.

Introduction (continued)
A correspondence is a match between elements of heterogeneous schemas. Each correspondence has a weight and a direction, based on the confidence of the match. For example, the weight of the correspondence from A to B may differ from the weight of the correspondence from B to A. Previous schema integration tools provide an interactive option that lets users select a desired integrated schema using the surviving correspondences.

Problem
Previous schema integration techniques:
- Do not consider the direction and weight of correspondences.
- Need user interaction for the final integration decision.
- Require a laborious selection process for easy as well as difficult integrations.
- Are time consuming.
- Are resource intensive.

Problem (continued)
Example: weighted and directed correspondences, and the options for schema integration.

Solution
1) Relationships in the integrated schema are defined using the direction and weight of the correspondences between elements.
2) These relationships are ranked by similarity and coverage to produce the top k schemas.
3) Easy integrations are adopted without user interaction.
4) For difficult integrations, the user is given the option to select constraints on the schemas involved.
5) The system generates revised top k schemas that satisfy the constraints.
6) Steps 4 and 5 are repeated until the final schema is obtained.

Example of easy integration (integration without user interaction)

Concept and concept graph
A concept is a relation name associated with a set of attributes in a schema. Correspondences between schemas are expressed using concept graphs. A concept graph is a pair (V, has), where V is a set of concepts and has is a set of directed, labeled edges between concepts.
A correspondence between concepts across schemas is defined by a pair of weights, one for each direction. Consider a pair (C1, C2) of concepts, where C1 is from schema S1 and C2 is from schema S2:
- The weight of the directed correspondence C1 → C2 is denoted ŝ(C1, C2).
- The weight of the directed correspondence C2 → C1 is denoted ŝ(C2, C1).
- The correspondence between the concepts across S1 and S2 is defined by the pair [ŝ(C1, C2), ŝ(C2, C1)].
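The definitions above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names (Concept, ConceptGraph, Correspondence), the example concepts (dept, emp, employee), and the weights 0.9 and 0.4 are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Concept:
    """A relation name plus its set of attributes (hypothetical layout)."""
    schema: str
    name: str
    attributes: tuple

@dataclass
class ConceptGraph:
    """A pair (V, has): concepts and directed, labeled 'has' edges."""
    concepts: set = field(default_factory=set)
    has_edges: set = field(default_factory=set)  # (parent, child, label)

    def add_has(self, parent, child, label):
        self.concepts.update((parent, child))
        self.has_edges.add((parent, child, label))

@dataclass(frozen=True)
class Correspondence:
    """A pair of directed weights between C1 (from S1) and C2 (from S2)."""
    c1: Concept
    c2: Concept
    s12: float  # weight of C1 -> C2, i.e. s^(C1, C2)
    s21: float  # weight of C2 -> C1, i.e. s^(C2, C1)

    def weights(self):
        return (self.s12, self.s21)

# Made-up example: schema S1 has dept --has--> emp; S2 has employee.
dept = Concept("S1", "dept", ("dno", "dname"))
emp = Concept("S1", "emp", ("eno", "ename"))
employee = Concept("S2", "employee", ("id", "name"))

g1 = ConceptGraph()
g1.add_has(dept, emp, "emps")

# The two directions may carry different weights.
corr = Correspondence(emp, employee, s12=0.9, s21=0.4)
```

Keeping the two weights as separate fields keeps the direction explicit, which matters because the weight from C1 to C2 need not equal the weight from C2 to C1.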

Example- Concept graph

Assignment
An assignment A is a fixed-size, ordered vector of bits, where each bit X represents the state of one correspondence: value 1 means the correspondence is selected and value 0 means it is absent. The set of assignments is ranked to obtain the top K assignments.
For each bit with value 1, the concepts involved in the respective correspondence are combined. There are two ways to combine concepts, depending on the weight and direction of the similarity: merge and has. A threshold λ decides which of the two methods is used for the combination.

Example of λ effect on integration decision

Algorithm

Cost function (used to rank assignments), where n is the total number of non-zero correspondences.
Example: assignment with weights.

Top K algorithm
1) Calculate Ŝ_i and D̂_i, and assign 1 to correspondences where Ŝ_i > D̂_i and 0 where Ŝ_i < D̂_i.
2) The result is the optimal assignment A_1 (k = 1). The next k-1 best assignments are obtained by deciding which bits of the assignment vector to flip.
3) Let the vector Δf be the difference between Ŝ_i and D̂_i. Δf quantifies the cost impact of flipping bit i from its current value in A_1: for each i, Δf_i is the increase in cost relative to cost(A_1) if bit i in A_1 were flipped.
4) Sort Δf in increasing order and denote the result Δf_s. The next assignment is the one that minimizes the increase in cost.
5) The 2nd best assignment is obtained by flipping the bit X_i that gives the least cost increase.
6) To compute the 3rd best assignment, flip the variable with the next smallest cost increase and leave X_i unflipped. If there are two choices, select the one that gives the smaller cost increase.
7) The remaining assignments are calculated likewise.
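The steps above can be sketched in Python as follows. The cost formula itself is not reproduced in this transcript, so the sketch assumes cost(A) = Σ_i (X_i · D̂_i + (1 - X_i) · Ŝ_i), which is consistent with steps 1 and 3 (flipping bit i then costs |Ŝ_i - D̂_i| more). The function name top_k_assignments and the heap-based enumeration of flip sets are illustrative choices, not necessarily the authors' implementation.

```python
import heapq

def top_k_assignments(s_hat, d_hat, k):
    """Return up to k (cost, assignment) pairs in non-decreasing cost order.

    Assumed cost: sum of d_hat[i] where bit i is 1 and s_hat[i] where it is 0.
    """
    n = len(s_hat)
    # Step 1: the optimal assignment picks the cheaper value for each bit.
    best = [1 if s > d else 0 for s, d in zip(s_hat, d_hat)]
    base_cost = sum(min(s, d) for s, d in zip(s_hat, d_hat))
    results = [(base_cost, best)]
    if n == 0 or k <= 1:
        return results[:max(k, 0)]

    # Steps 3-4: cost increase of flipping each bit, sorted increasingly.
    order = sorted(range(n), key=lambda i: abs(s_hat[i] - d_hat[i]))
    delta = [abs(s_hat[i] - d_hat[i]) for i in order]

    # Heap entries: (extra cost, last flipped sorted position, flipped positions).
    heap = [(delta[0], 0, (0,))]
    while heap and len(results) < k:
        extra, last, flipped = heapq.heappop(heap)
        assignment = best[:]
        for pos in flipped:
            assignment[order[pos]] ^= 1  # flip the bit relative to A_1
        results.append((base_cost + extra, assignment))
        if last + 1 < n:
            nxt = last + 1
            # Choice 1: keep the current flips and also flip the next bit.
            heapq.heappush(heap, (extra + delta[nxt], nxt, flipped + (nxt,)))
            # Choice 2: leave the last bit unflipped and flip the next instead.
            heapq.heappush(
                heap,
                (extra - delta[last] + delta[nxt], nxt, flipped[:-1] + (nxt,)),
            )
    return results

# Made-up weights for three correspondences.
s_hat = [0.875, 0.625, 0.25]
d_hat = [0.125, 0.375, 0.75]
top3 = top_k_assignments(s_hat, d_hat, 3)
```

With these made-up weights the best assignment is [1, 1, 0]. The second best flips the bit with the smallest Δf (the second correspondence), and the third leaves that bit unflipped and flips the bit with the next smallest Δf, matching steps 5 and 6.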

Top K Algorithm- Example

Tuning λ
As stated before, λ is the threshold used to decide how concepts are combined in an integration (merge or has).
Steps to calculate λ:
1. Iteratively scan all the correspondences in E, where E is the set of correspondences selected by at least one of the top k assignments.
2. For each such correspondence, record max(ŝ1, ŝ2) and add this value to a list L.
3. Set λ to the minimum of the values in L.
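The three steps for calculating λ translate almost directly into code. The sketch below assumes each correspondence is given as a pair (ŝ1, ŝ2) and each assignment as a bit vector over the same correspondence order. The function name tune_lambda and the sample values are hypothetical.

```python
def tune_lambda(weights, assignments):
    """Compute the threshold as the minimum over E of max(s1, s2).

    weights:     list of (s1, s2) pairs, one per correspondence
    assignments: the top k assignments, each a list of 0/1 bits
    Returns None when no correspondence is selected by any assignment.
    """
    # Step 1: E = correspondences selected by at least one top-k assignment.
    selected = [
        i for i in range(len(weights))
        if any(a[i] == 1 for a in assignments)
    ]
    # Step 2: record the larger of the two directed weights for each.
    larger = [max(weights[i]) for i in selected]
    # Step 3: the threshold is the smallest recorded value.
    return min(larger) if larger else None

# Made-up example: the third correspondence is never selected, so it is ignored.
weights = [(0.9, 0.4), (0.3, 0.7), (0.2, 0.1)]
assignments = [[1, 1, 0], [1, 0, 0]]
lam = tune_lambda(weights, assignments)  # min(0.9, 0.7) = 0.7
```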

Example of Schema Integration with different λ values

Results

Conclusion
- A top-K algorithm for schema integration that executes in polynomial time was developed.
- Important information such as the weight and direction of correspondences is used to reduce user interaction.
- Easy integrations are performed by the system without any user interaction while keeping the data consistent.
- The results indicate that the algorithm can reduce user interaction and thus the time taken to achieve a complex schema integration.
- Future work includes automation (integration without user interaction) and enhancements to the algorithm so that it can handle a couple of hundred schemas.

Questions?