1 Yishai Beeri, Similarity Flooding, SDBI – Winter 2001. Similarity Flooding: A Versatile Graph Matching Algorithm, by Sergey Melnik, Hector Garcia-Molina, Erhard Rahm

2 Introduction & Motivation. Goal: matching elements of related, complex objects, for example matching elements of two data schemas or matching elements of two data instances. Many conceivable uses for object matching. Looking for a generic algorithm with wide applicability.

3 Applications. Comparing data schemas: items from different shopping sites; merger between two corporations; preparation of data for data warehousing and analysis processes. Comparing data instances: bio-informatics; collaboration (allowing multiple users to edit a program or system).

4 Existing Approaches. Comparing SQL: can use type information. Comparing XML: can use hierarchy. Both require domain-specific knowledge and coding. Solution: a generic algorithm that is agnostic to the domain. Structural model: relies on structural similarities to find a matching.

5 Part I: Algorithm Framework. General Discussion of Algorithm Input, Output, and Main Components.

6 Algorithm Framework. Input: two objects to match. Representation of objects as graphs: G1 = (V1, E1), G2 = (V2, E2). Matching between graphs gives a mapping σ: V1×V2 → ℝ. Filtering of the mapping to obtain a meaningful match. Output: mapping between elements of the input objects. Human verification sometimes required.

7 Input → Graph → Mapping → Filtering. The input is two objects to be matched. The match will be between sub-elements of the two objects. Each match of sub-elements will be scored; high scores indicate a strong similarity. Assumption: objects can be represented as graphs.

8 Input → Graph → Mapping → Filtering. Represent objects as directed, labeled graphs G1 = (V1, E1), G2 = (V2, E2). Choose any sensible graph representation (this is domain-specific) that maintains structural information. Structural information in the graphs will be used for mapping. Intuition: similar elements have similar neighbors.

9 Input → Graph → Mapping → Filtering. We want a mapping σ: V1×V2 → ℝ. It is convenient to normalize such that 0 ≤ σ(v,u) ≤ 1. Begin with an initial mapping function: the null function, σ(v,u) := 1 for all v in V1, u in V2; a string matching function; or another domain-specific function. Perform an iterative fixpoint calculation: each iteration floods the similarity value σ(v,u) to the neighbors of v and u.
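A minimal sketch of the null initial mapping and of the normalization into [0, 1] mentioned on this slide. The function names `null_mapping` and `normalize` are hypothetical, and dividing by the largest value is only one possible way to normalize (an assumption here, not something the slide specifies).

```python
from itertools import product

def null_mapping(v1, v2):
    """Null initial mapping: every pair (v, u) starts with similarity 1.0."""
    return {(v, u): 1.0 for v, u in product(v1, v2)}

def normalize(sigma):
    """Scale similarities into [0, 1] by dividing by the largest value (assumed scheme)."""
    top = max(sigma.values(), default=0.0)
    if top <= 0:
        return dict(sigma)
    return {pair: s / top for pair, s in sigma.items()}

sigma0 = null_mapping(["a", "b"], ["x"])
scaled = normalize({("a", "x"): 2.0, ("b", "x"): 0.5})
```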

10 Input → Graph → Mapping → Filtering. We have a mapping σ: V1×V2 → ℝ. We are usually not interested in all pairs in V1×V2. Applying filtering functions yields a partial mapping: threshold (keep a pair only when σ(v,u) > some constant); wedding (each v mapped to only one u and vice versa). The result is a useful mapping that matches elements of V1 with elements of V2.
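The threshold filter described above can be sketched in a few lines. The name `select_threshold`, the cut-off value, and the similarity numbers below are illustrative assumptions, not values taken from the presentation.

```python
def select_threshold(sigma, t):
    """Keep only the pairs whose similarity is strictly above the constant t."""
    return {pair: s for pair, s in sigma.items() if s > t}

# Illustrative similarities (made up for this sketch).
sigma = {
    ("Pno", "EmpNo"): 0.25,
    ("Pno", "Salary"): 0.02,
    ("Born", "Birthdate"): 0.17,
}
filtered = select_threshold(sigma, 0.1)
```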

11 Part II: An Example - Relational Schemas. An example employing the algorithm to match two simple relational schemas.

12 Example: Relational Schemas. Scenario: two relational schemas that describe similar or the same data. Goal: match elements of the two given relational schemas. Input: the SQL statements that create each schema. Desired output: a meaningful mapping between the elements of the two schemas.

13 Example: Relational Schemas. Input → Graph → Mapping → Filtering.
Schema S1:
CREATE TABLE Personnel (
  Pno int,
  Pname string,
  Dept string,
  Born date,
  UNIQUE perskey(Pno)
)
Schema S2:
CREATE TABLE Employee (
  EmpNo int PRIMARY KEY,
  EmpName varchar(50),
  DeptNo int REFERENCES Department,
  Salary dec(15,2),
  Birthdate date
)
CREATE TABLE Department (
  DeptNo int PRIMARY KEY,
  DeptName varchar(70)
)

14 Example: Relational Schemas. Algorithm script:
G1 = SQLDDL2Graph(S1);
G2 = SQLDDL2Graph(S2);
initialMap = StringMatch(G1, G2);
product = SFJoin(G1, G2, initialMap);
result = SelectThreshold(product);

15 Example: Relational Schemas. Input → Graph → Mapping → Filtering. Any graph representation of schemas can be chosen. The representation should maintain as much information as possible, in particular structural information. The example uses a graph representation based on the Open Information Model (OIM).

16 Example: Relational Schemas. Input → Graph → Mapping → Filtering.

17 Example: Relational Schemas. Input → Graph → Mapping → Filtering. Calculate an initial mapping to improve performance. The initial mapping can apply domain knowledge. In this example StringMatch is used: it compares common prefixes and suffixes of literals; it assumes elements with similar names have similar meaning; it applies to all elements, including elements that are created by the graph representation (e.g. ‘type’). The initial mapping is still far from satisfactory.
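The slide says only that StringMatch compares common prefixes and suffixes of literals; it does not give the scoring rule. The sketch below is therefore a hypothetical stand-in (the function names and the scoring formula are assumptions), and its scores will generally differ from the values shown in the presentation.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def string_match(a, b):
    """Hypothetical name similarity in [0, 1] from shared prefix and suffix length."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    prefix = common_prefix_len(a, b)
    suffix = common_prefix_len(a[::-1], b[::-1])
    # Cap the overlap so that identical strings score exactly 1.0.
    overlap = min(prefix + suffix, len(a), len(b))
    return overlap / max(len(a), len(b))

score = string_match("Dept", "DeptNo")
```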

18 Example: Relational Schemas. Input → Graph → Mapping → Filtering. Top values of the similarity mapping σ after StringMatch:
σ     Node in G1    Node in G2
1.0   Column        Column
0.66  ColumnType    Column
0.66  ‘Dept’        ‘DeptNo’
0.66  ‘Dept’        ‘DeptName’
0.5   UniqueKey     PrimaryKey
0.26  ‘Pname’       ‘DeptName’
0.26  ‘Pname’       ‘EmpName’
0.22  ‘date’        ‘BirthDate’
0.11  ‘Dept’        ‘Department’
0.06  ‘int’         ‘Department’

19 Example: Relational Schemas. Input → Graph → Mapping → Filtering. Next step: similarity flooding (SFJoin). Initial similarity values are taken from the initial mapping. In each iteration, the similarity of two elements affects the similarity of their respective neighbors (e.g. similarity of type names such as ‘string’ adds to the similarity of columns of the same type). Iterate until the similarity values are stable.

20 Example: Relational Schemas. Input → Graph → Mapping → Filtering. After the fixpoint calculation, the mapping σ is filtered to provide a meaningful mapping. The filter operator SelectThreshold removes node pairs for which σ(u,v) < some constant. In this example, the mapping product contained 211 node pairs with positive similarities, which were filtered to a total of 12 node pairs.

21 Example: Relational Schemas. Similarity mapping σ after SelectThreshold:
σ     Node in G1            Node in G2
1.0   Column                Column
0.81  Personnel *           Employee *
0.66  ColType               ColType
0.44  int **                int **
0.43  Table                 Table
0.35  date **               date **
0.29  UniqueKey: perskey    PrimaryKey: on EmpNo
0.28  Personnel / Dept +    Department / DeptName +
0.25  Personnel / Pno +     Employee / EmpNo +
0.19  UniqueKey             PrimaryKey
0.18  Personnel / Pname +   Employee / EmpName +
0.17  Personnel / Born +    Employee / Birthdate +
Legend: * Table, ** SQL column type, + Column.

22 Example: Relational Schemas. Summary of example: good results without domain-specific knowledge; graph representation may vary; similarity flooding results need to be filtered.

23 Part III: Similarity Flooding Calculation. Details of the Similarity Flooding Calculation Algorithm.

24 Similarity Flooding Calculation. Start with directed, labeled graphs A, B. Every edge e in a graph is represented by a triplet (s,p,o): an edge labeled p from s to o. Define the pairwise connectivity graph PCG(A, B):
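The defining formula on this slide did not survive in the transcript. As described in the published Similarity Flooding paper, the pairwise connectivity graph contains an edge ((x,y), p, (x',y')) exactly when (x,p,x') is an edge of A and (y,p,y') is an edge of B. The sketch below builds it under that reading; the function name and the two small input graphs are illustrative assumptions.

```python
def pairwise_connectivity_graph(a_edges, b_edges):
    """Pair up edges of A and B that carry the same label.

    Edges are (source, label, target) triples; each PCG node is a pair (x, y)
    with x taken from A and y taken from B.
    """
    pcg = set()
    for x, p, x2 in a_edges:
        for y, q, y2 in b_edges:
            if p == q:
                pcg.add(((x, y), p, (x2, y2)))
    return pcg

# Two small illustrative graphs.
A = [("a", "l1", "a1"), ("a", "l1", "a2"), ("a1", "l2", "a2")]
B = [("b", "l1", "b1"), ("b", "l2", "b2"), ("b2", "l2", "b1")]
pcg = pairwise_connectivity_graph(A, B)
```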

25 Similarity Flooding Calculation. Pairwise Connectivity Graph: Example.

26 Similarity Flooding Calculation. Induced propagation graph: for each edge of the pairwise connectivity graph, add an edge in the opposite direction. Edge weights are propagation coefficients; they measure how the similarity propagates to neighbors. One way to calculate weights: each edge type (label) contributes a total of 1.0 outgoing propagation.
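A sketch of one way to derive the induced propagation graph, assuming the weighting rule stated on the slide: for every node, the edges of a given label leaving it share a total weight of 1.0, and the same rule is applied to the added reverse edges. The function name and the input edges are illustrative assumptions.

```python
from collections import Counter

def induced_propagation_graph(pcg_edges):
    """Return {(source, target): coefficient} including the added reverse edges."""
    pcg_edges = list(pcg_edges)
    out_count = Counter((src, label) for src, label, dst in pcg_edges)
    in_count = Counter((dst, label) for src, label, dst in pcg_edges)
    weights = {}
    for src, label, dst in pcg_edges:
        # Forward edge: edges with this label leaving src split 1.0 equally.
        weights[(src, dst)] = weights.get((src, dst), 0.0) + 1.0 / out_count[(src, label)]
        # Reverse edge: edges with this label entering dst split 1.0 equally.
        weights[(dst, src)] = weights.get((dst, src), 0.0) + 1.0 / in_count[(dst, label)]
    return weights

pcg = [
    (("a", "b"), "l1", ("a1", "b1")),
    (("a", "b"), "l1", ("a2", "b1")),
    (("a1", "b"), "l2", ("a2", "b2")),
    (("a1", "b2"), "l2", ("a2", "b1")),
]
weights = induced_propagation_graph(pcg)
```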

27 Similarity Flooding Calculation. Induced Propagation Graph: Example.

28 Similarity Flooding Calculation. Similarity measure: σ(x,y) ≥ 0 for all x ∈ A and y ∈ B. We also call σ a “mapping”. Iterative computation of σ, with propagation in each iteration. σ^i is the mapping after the i-th iteration; σ^0 is the initial mapping. Each iteration computes σ^i based on σ^(i-1) and the propagation graph. Stop when a stable mapping is reached.

29 Similarity Flooding Calculation. The propagation from σ^i into the similarity of x and y is the sum of the similarities of all neighboring pairs, each multiplied by its propagation coefficient.
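The propagation step described above can be sketched as a single pass over the weighted edges of the propagation graph. The function name `propagate`, the string node names, and the toy weights are assumptions for illustration.

```python
def propagate(sigma, weights):
    """For each pair, sum the neighbors' similarities times the edge coefficients.

    sigma maps a node pair to its current similarity; weights maps
    (source_pair, target_pair) to the propagation coefficient of that edge.
    """
    incoming = {pair: 0.0 for pair in sigma}
    for (src, dst), w in weights.items():
        incoming[dst] = incoming.get(dst, 0.0) + sigma.get(src, 0.0) * w
    return incoming

# Toy propagation graph with two pairs (illustrative values).
weights = {("ab", "a1b1"): 0.5, ("a1b1", "ab"): 1.0}
sigma = {"ab": 1.0, "a1b1": 1.0}
flow = propagate(sigma, weights)
```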

30 Similarity Flooding Calculation. There are many ways to iterate. The choice of formula aims to achieve high quality and fast convergence.

31 Similarity Flooding Calculation. Basic: each iteration propagates from neighbors; the initial mapping has a diminishing effect. A: the initial mapping has high importance; propagation has a diminishing effect.

32 Similarity Flooding Calculation. B: the initial mapping has high importance, recurring in propagation. C: the initial mapping and the current mapping have identical importance.
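The formulas for the Basic, A, B and C variants were images and are not preserved in this transcript. The sketch below uses the forms given in the published Similarity Flooding paper as best recalled; treat them as an assumption. Here φ is the propagation step, normalization divides by the largest value, and all function names, the stopping rule and the toy propagation graph are illustrative.

```python
def add(m1, m2):
    """Pointwise sum of two mappings defined over the same node pairs."""
    return {k: m1[k] + m2[k] for k in m1}

def normalize(m):
    """Divide by the largest value so that the top similarity is 1.0."""
    top = max(m.values())
    return {k: v / top for k, v in m.items()} if top > 0 else dict(m)

def phi(sigma, weights):
    """Propagation: each pair receives neighbor similarity times coefficient."""
    out = {k: 0.0 for k in sigma}
    for (src, dst), w in weights.items():
        out[dst] += sigma[src] * w
    return out

# Assumed fixpoint formulas: s0 is the initial mapping, si the current one.
FORMULAS = {
    "basic": lambda s0, si, w: normalize(add(si, phi(si, w))),
    "A": lambda s0, si, w: normalize(add(s0, phi(si, w))),
    "B": lambda s0, si, w: normalize(phi(add(s0, si), w)),
    "C": lambda s0, si, w: normalize(add(add(s0, si), phi(add(s0, si), w))),
}

def flood(sigma0, weights, formula="basic", max_iter=100, eps=1e-6):
    """Iterate the chosen formula until the change is below eps or max_iter is hit."""
    step = FORMULAS[formula]
    sigma = dict(sigma0)
    for _ in range(max_iter):
        nxt = step(sigma0, sigma, weights)
        delta = sum((nxt[k] - sigma[k]) ** 2 for k in sigma) ** 0.5
        sigma = nxt
        if delta < eps:
            break
    return sigma

# Toy propagation graph over six node pairs (illustrative).
weights = {
    ("a,b", "a1,b1"): 0.5, ("a,b", "a2,b1"): 0.5,
    ("a1,b1", "a,b"): 1.0, ("a2,b1", "a,b"): 1.0,
    ("a1,b2", "a2,b1"): 1.0, ("a2,b1", "a1,b2"): 1.0,
    ("a1,b", "a2,b2"): 1.0, ("a2,b2", "a1,b"): 1.0,
}
pairs = ["a,b", "a1,b1", "a2,b1", "a1,b2", "a1,b", "a2,b2"]
sigma0 = {p: 1.0 for p in pairs}
results = {name: flood(sigma0, weights, name) for name in FORMULAS}
```

Swapping the `formula` argument changes how much weight the initial mapping keeps across iterations, which is the trade-off the slides describe.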

33 Part IV: Filtering. Overview of Various Approaches to Filtering of SF Mapping.

34 Filtering. The result of the iterations is a mapping σ between all pairs in V1 and V2. We usually want much less information! Filtering removes pairs, leaving us with only the interesting ones. There are many ways to filter; the choice of filter is domain-specific.

35 Filtering. Possible filtering directions: remove uninteresting pairs according to domain-specific knowledge (e.g. ‘column’, ‘table’, ‘string’ from SQL matches) and typing information. Cardinality considerations: do we want a 1:1 mapping? An n:m mapping? Threshold: remove matches with low scores.

36 Filtering: Cardinality. Cardinality-based filters can use techniques from bipartite graph (“marriage”) problems. Stable marriage. Assignment problem: maximize the total similarity Σ σ(x,y). Maximum mapping: maximum number of 1:1 matches. Maximal mapping: not contained in any other mapping. Perfect/complete: all are “married”. All the above give [0,1]:[0,1] (monogamous) matches, and can be found in polynomial time.
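As a small illustration of a monogamous, [0,1]:[0,1] filter, the sketch below uses a greedy heuristic: take pairs in order of decreasing similarity and skip any pair whose element is already matched. This is a swapped-in simplification, not one of the specific algorithms named on the slide; it produces a maximal 1:1 mapping but does not in general solve the assignment problem. The function name and the numbers are assumptions.

```python
def greedy_one_to_one(sigma):
    """Greedy 1:1 selection: best-scoring pairs first, each element used at most once."""
    used_left, used_right, result = set(), set(), {}
    for (x, y), s in sorted(sigma.items(), key=lambda kv: kv[1], reverse=True):
        if x not in used_left and y not in used_right:
            result[(x, y)] = s
            used_left.add(x)
            used_right.add(y)
    return result

sigma = {("a", "x"): 0.9, ("a", "y"): 0.8, ("b", "x"): 0.7, ("b", "y"): 0.1}
matching = greedy_one_to_one(sigma)
```

In this example the greedy result has total similarity 1.0, while pairing a with y and b with x would give 1.5, which is what an assignment-problem solver would prefer.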

37 Filtering: Relative Similarity. σ(x,y) is the absolute similarity of x and y. We can also define a relative similarity: Relative similarity is directed. The reverse direction is defined in an analogous manner. Bipartite graph methods can also handle directed graphs.
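The formula for relative similarity was an image and is not preserved in this transcript. One definition consistent with the surrounding slides, stated here as an assumption, divides the absolute similarity of a pair by the best absolute similarity available to one of its elements, so that a value of 1.0 means "best match for that element":

```latex
\sigma_{\mathrm{rel}}^{\rightarrow}(x, y) = \frac{\sigma(x, y)}{\max_{y' \in B} \sigma(x, y')},
\qquad
\sigma_{\mathrm{rel}}^{\leftarrow}(x, y) = \frac{\sigma(x, y)}{\max_{x' \in A} \sigma(x', y)}
```

The two directions generally differ, since y may be the best match for x while x is not the best match for y.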

38 Filtering: Threshold. A threshold can be applied to absolute or relative similarities. A useful example: a threshold of t_rel = 1.0 gives a perfectionist egalitarian polygamy, i.e. no man/woman is willing to accept any but the best match.
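A sketch of the t_rel = 1.0 case, assuming relative similarity is the absolute similarity divided by the element's best absolute similarity (an assumption, since the slide's formula is not preserved). A pair then survives only if it is the best match in both directions; ties can leave one element with several partners. The function name and values are illustrative.

```python
def perfectionist_matches(sigma):
    """Keep pairs that are the top-scoring choice for both of their elements."""
    best_left, best_right = {}, {}
    for (x, y), s in sigma.items():
        best_left[x] = max(best_left.get(x, 0.0), s)
        best_right[y] = max(best_right.get(y, 0.0), s)
    return {
        (x, y): s
        for (x, y), s in sigma.items()
        if s > 0 and s == best_left[x] and s == best_right[y]
    }

sigma = {("a", "x"): 0.8, ("a", "y"): 0.4, ("b", "x"): 0.6, ("b", "y"): 0.4}
kept = perfectionist_matches(sigma)
```

Here b stays unmatched: its best candidate x prefers a, and b is not willing to settle for y.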

39 Part V: Examples. Examples of Algorithm Application to Various Problems.

40 Example: Change Detection. Goal: change detection in two labeled trees. The original tree T1 was changed to give T2: node names were replaced; subtrees were copied and moved; a new node was inserted. We want the best match for every node of T2. Cardinality constraint: [0,n] – [1,1].

41 Example: Change Detection. Algorithm script:
product = SFJoin(T2, T1);
result = SelectLeft(product);

42 Example: Change Detection. No initial mapping is used. The SelectLeft operator selects the best absolute match for each element in the left argument. The results can also provide hints on the type of change that was performed!
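A minimal sketch of what a SelectLeft-style filter could look like, based only on the description above (best absolute match per left element). The function name `select_left`, the tie-breaking behaviour and the sample values are assumptions and do not reflect the authors' implementation.

```python
def select_left(sigma):
    """For each left element keep its single highest-scoring partner.

    Several left elements may share the same right element, which fits the
    [0,n] - [1,1] cardinality of the change-detection example.
    """
    best = {}
    for (left, right), s in sigma.items():
        if left not in best or s > best[left][1]:
            best[left] = (right, s)
    return {(left, right): s for left, (right, s) in best.items()}

# Keys are (node of T2, node of T1); values are illustrative.
sigma = {("n1", "m1"): 0.9, ("n1", "m2"): 0.3, ("n2", "m1"): 0.5, ("n2", "m2"): 0.4}
result = select_left(sigma)
```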

43 Example: Change Detection.

44 Example: Matching Schemas Using Instance Data. Goal: match two XML schemas using instance data. Input: two XML product descriptions from two shopping websites. We want to use the instance data to match the XML schemas.

45 Example: Matching Schemas Using Instance Data.

46 Example: Matching Schemas Using Instance Data. Algorithm script:
G1 = XML2DOMGraph(db1);
G2 = XML2DOMGraph(db2);
initialMap = StringMatch(G1, G2);
product = SFJoin(G1, G2, initialMap);
result = XMLMapFilter(product, G1, G2);
The only new piece of code is the XMLMapFilter operator.

47 Example: Schemas, Instance Data.

48 Part VI: Analysis. Match Quality, Algorithm Complexity, Convergence and Limitations.

49 Match Quality. Assessing match quality is difficult. Human verification and tuning of the matching is often required. A useful metric would be the amount of human work required to reach the perfect match. Recall: how many of the good matches did we show? Precision: how many of the matches we show are good?
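The two questions above correspond to the standard precision and recall measures. The sketch below computes them for a proposed mapping against a hand-made reference; the function name and the sample pairs are illustrative assumptions.

```python
def precision_recall(proposed, intended):
    """Precision: share of proposed pairs that are good. Recall: share of good pairs proposed."""
    proposed, intended = set(proposed), set(intended)
    correct = proposed & intended
    precision = len(correct) / len(proposed) if proposed else 0.0
    recall = len(correct) / len(intended) if intended else 0.0
    return precision, recall

proposed = {("Pno", "EmpNo"), ("Pname", "EmpName"), ("Dept", "DeptName"), ("Born", "Salary")}
intended = {("Pno", "EmpNo"), ("Pname", "EmpName"), ("Dept", "DeptName"), ("Born", "Birthdate")}
precision, recall = precision_recall(proposed, intended)
```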

50 Convergence. The fixpoint iterations are an eigenvector computation for the matrix that corresponds to the propagation graph. The computation converges iff the graph is strongly connected. To achieve this we use dampening: use σ^0 in the fixpoint formula, where σ^0(x,y) > 0 for all x, y. The convergence rate depends on the spectral radius of the matrix, and can be improved by high dampening values.

51 Convergence. In many cases we are only interested in the order of the map pairs, and not in the absolute values of σ. The order usually stabilizes before the actual values do.
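If only the order of the pairs matters, the iteration could stop as soon as the ranking stops changing. The sketch below shows such a check; the function names are hypothetical and the early-stopping idea is an illustration, not something the slides say the authors implemented.

```python
def ranking(sigma):
    """Pairs sorted by decreasing similarity; ties are broken by the pair itself."""
    return [pair for pair, _ in sorted(sigma.items(), key=lambda kv: (-kv[1], kv[0]))]

def order_stable(previous, current):
    """True when two successive mappings rank the pairs identically."""
    return ranking(previous) == ranking(current)

step_4 = {("a", "x"): 1.0, ("a", "y"): 0.60, ("b", "y"): 0.30}
step_5 = {("a", "x"): 1.0, ("a", "y"): 0.55, ("b", "y"): 0.35}
stable = order_stable(step_4, step_5)
```

In this example the values still move between the two steps, but the ranking is already fixed.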

52 Complexity. Usually 5-30 iterations. Each iteration is O(|E|), where E is the set of edges in the propagation graph. |E| = O(|E1||E2|). |E1| = O(|V1|²) if G1 is highly connected; |E2| = O(|V2|²) if G2 is highly connected. The worst case of each iteration is O(|V1|²|V2|²). The average case of each iteration is O(|V1||V2|).

53 Limitations. The algorithm requires a representation as a directed, labeled graph: it degrades when edges are unlabeled or undirected, and it degrades when labeling is more uniform. It assumes structural adjacency contributes to similarity, so it will not work for matching HTML. It requires the matched objects to be of the same type and with the same graph representation.

54 Limitations. The algorithm cannot utilize order and aggregation information (e.g. for XML). Order: the order of sub-elements within an element. Aggregation: an element containing an “array” of sub-elements.

55 Part VII: Variability and Applications. Discussion of Algorithm Variability Areas and Possible Applications.

56 Variability in Algorithm. Graph representation of input objects. Calculation of propagation coefficients. Initial mapping function. Iteration formula. Filtering function.

57 Graph Representation. The graph representation of input objects is arbitrary; sub-elements can be modeled as nodes, edges, or both. On one hand: a richer graph captures more structural information, and type information about sub-elements can be modeled. On the other hand: larger graphs mean longer computation, and a rich graph often implies more uniform labeling.

58 Propagation Coefficients. Propagation coefficients can be calculated in many ways: the sum of all outgoing edges is 1.0; equal weight (1.0) for all edges; the sum of all outgoing edges of label ‘p’ is 1.0; the sum of all incoming edges is 1.0; label-specific weight allocation; etc.

59 Initial Mapping Function. The initial mapping can improve performance and help convergence. The initial mapping function can be naïve, or it can employ domain-specific knowledge.

60 Iteration Formula. Each iteration calculates σ^(i+1) from σ^i, σ^0, and φ(σ^i). The iteration formula can vary, giving different weight and effect to these components. Example: if the initial mapping is good, give higher weight to σ^0. The formula affects convergence speed as well as the resultant mapping.

61 Filtering Function. The results of the iterations require filtering to become a meaningful mapping. Many approaches to filtering are possible, as discussed. The choice usually stems from the graph representation and the specific goal. For example: if the graphs contain many type-related nodes, they can be pruned from the results; if the goal is to detect changes, we want a match for each element of the newer object.

62 Applications. There are many possible applications besides the ones described. Comparing websites: old vs. new versions of a website; two websites with information about the same subject; structural information gained from containment and links.

63 Applications. Natural language processing and speech recognition: match a given sentence to an XML template; match two text segments that refer to the same subject. Finding self-similarities and related data items by running SFJoin(G,G). Preparation of data and schemas for data warehousing and data mining: canonization of data and meta-data.

64 Semantic Interpretation - Example. For example (1st approach), the user utterance "I would like a medium coca cola and a large pizza with pepperoni and mushrooms." could be converted to the following semantic result:
{
  drink: {
    beverage: "coke"
    drinksize: "medium"
  }
  pizza: {
    pizzasize: "large"
    topping: [ "pepperoni", "mushrooms" ]
  }
}

65 Applications. More…

66 Summary. Generic algorithm with many applications. Relies on structural information captured in the graph representation. Domain-specific customizations can improve performance and match quality. Useful, but does not deliver 100% exact results; human verification is often required.

