INFORMATION INTEGRATION


1 INFORMATION INTEGRATION
Presenters: Namrata Buddhadev (104) Deepti Bhardwaj (103)

2 Index
21.6 Local-as-View Mediators
21.6.1 Motivation for LAV Mediators
Terminology for LAV Mediators
Expanding Solutions
Containment of Conjunctive Queries
Why the Containment-Mapping Test Works
Finding Solutions to a Mediator Query
Why the LMSS Theorem Holds

3 21.7 Entity Resolution
Deciding Whether Records Represent a Common Entity
Merging Similar Records
Useful Properties of Similarity and Merge Functions
The R-Swoosh Algorithm for ICAR Records
Other Approaches to Entity Resolution

4 Local-as-View Mediators
GAV (global-as-view): the mediator's data behaves like a view. It does not exist physically; the mediator constructs pieces of it by querying the sources.
LAV (local-as-view): global predicates are defined at the mediator, but they are not defined as views of the source data. Instead, each source is described by one or more expressions over the global predicates that characterize the tuples the source can produce. The mediator answers a query by discovering all possible ways to construct it from the views provided by the sources.

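5 Motivation for LAV Mediators
Describing each source in terms of the global predicates lets the mediator discover how and when to use that source in a given query.
Example: let the global predicate be Par(c,p), meaning that p is a parent of c. Under GAV, Par(c,p) is defined as a view over the sources, so a source that supplies only child-grandparent facts cannot contribute to it. Under LAV, the same source is described in terms of Par, so the mediator can use it for child-parent queries where possible and also for queries about grandparents and more distant ancestors.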

6 Terminology for LAV Mediation
Views are defined in a form of logic. Datalog is used for both mediator queries and source descriptions; each is written as a single Datalog rule, known as a conjunctive query.
An LAV mediator has a set of global predicates, which serve as the subgoals of mediator queries.
Each view is defined by a conjunctive query whose head has a unique view predicate and whose body consists of global predicates. Every view is associated with a particular source.

7 Example
Global predicate: Par(c,p)
A view defined by a conjunctive query: V1(c,p) <- Par(c,p)
Another source produces: V2(c,g) <- Par(c,p) AND Par(p,g)
A query at the mediator asks for great-grandparent facts:
Q(w,z) <- Par(w,x) AND Par(x,y) AND Par(y,z)
It can be answered from the views in either of two ways:
Q(w,z) <- V1(w,x) AND V2(x,z)
Q(w,z) <- V2(w,y) AND V1(y,z)

8 Expanding Solutions
Let Q be a query and S a solution with a subgoal V(a1,a2,..,an), where the ai's may be variables or constants and need not be distinct. Let the view V be defined by V(b1,b2,..,bn) <- B, where B is the entire body and the bi's are distinct variables. We can replace V(a1,..,an) in S by a version of B that has all the subgoals of B, with variables possibly altered.
Rules for altering the variables of B:
1. Identify the local variables of B, that is, the variables that appear in the body but not in the head. Within a conjunctive query, a local variable can be replaced by any other variable that does not appear elsewhere in the conjunctive query.

9 Expanding Solutions (continued)
2. If any local variable of B also appears in S, replace it by a distinct new variable that appears nowhere in the rule for V or in S.
3. In the body B, replace each bi by ai, for i = 1,2,..,n.
Example: V(a,b,c,d) <- E(a,b,x,y) AND F(x,y,c,d), and S contains the subgoal V(x,y,1,x).
Here x and y are local to V and also appear in S, so rename x -> e and y -> f:
V(a,b,c,d) <- E(a,b,e,f) AND F(e,f,c,d)
Then substitute a -> x, b -> y, c -> 1 and d -> x. The subgoal V(x,y,1,x) expands to the two subgoals E(x,y,e,f) and F(e,f,1,x).
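The three rules can be sketched in Python as follows. This is an illustrative sketch, not code from the presentation: the function name expand_subgoal, the convention that variables are strings and constants are non-strings, and the fresh names _1, _2, ... are all assumptions. For simplicity it renames every local variable, not only the clashing ones, which is always safe.

```python
from itertools import count


def is_var(term):
    # Assumed convention: variables are strings, constants are anything else.
    return isinstance(term, str)


def expand_subgoal(args, view_head, view_body, used_vars):
    """Expand the subgoal V(args) using the definition V(view_head) <- view_body.

    view_body is a list of (predicate, argument_tuple) atoms.
    used_vars is the set of variable names that already appear in the solution S.
    """
    head_vars = set(view_head)

    # Rule 1: local variables appear in the body but not in the head.
    local_vars = []
    for _, atom_args in view_body:
        for term in atom_args:
            if is_var(term) and term not in head_vars and term not in local_vars:
                local_vars.append(term)

    # Rule 2 (simplified): give every local variable a fresh name that
    # appears neither in the rule for V nor in S.
    taken = set(used_vars) | head_vars | set(local_vars)
    taken |= {a for a in args if is_var(a)}
    fresh = (f"_{i}" for i in count(1))
    subst = {}
    for var in local_vars:
        name = next(n for n in fresh if n not in taken)
        taken.add(name)
        subst[var] = name

    # Rule 3: replace each head variable b_i by the actual argument a_i.
    subst.update(zip(view_head, args))

    # Apply the substitution to every atom of the body in a single pass.
    return [
        (pred, tuple(subst.get(term, term) for term in atom_args))
        for pred, atom_args in view_body
    ]
```

Applied to the example above, with the subgoal V(x,y,1,x) and the body E(a,b,x,y) AND F(x,y,c,d), the sketch yields E(x,y,_1,_2) and F(_1,_2,1,x), which matches the slide's result up to the choice of fresh names.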

10 Containment of Conjunctive Queries
For a conjunctive query S to be a solution to the mediator query Q, the expansion E of S must produce only answers that Q produces, that is, E must be contained in Q.
A containment mapping from Q to E is a function Γ from the variables of Q to the variables and constants of E such that:
1. If x is the ith argument of the head of Q, then Γ(x) is the ith argument of the head of E.
2. With Γ extended by the rule Γ(c) = c for any constant c, if P(x1,x2,..,xn) is a subgoal of Q, then P(Γ(x1),Γ(x2),..,Γ(xn)) is a subgoal of E.

11 Example
Queries:
P1: H(x,y) <- A(x,z) AND A(z,y)
P2: H(a,b) <- A(a,c) AND A(c,d) AND A(d,b)
Is there a containment mapping from P1 to P2? The heads force Γ(x) = a and Γ(y) = b.
1. The subgoal A(x,z) of P1 can only map to A(a,c) of P2, so Γ(z) must be c.
2. Since Γ(y) = b, the subgoal A(z,y) of P1 can only map to A(d,b) of P2, so Γ(z) must be d.
Γ(z) cannot be both c and d, so no containment mapping from P1 to P2 exists.
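The reasoning in this example can be mechanized with a small backtracking search. The sketch below is an assumption-laden illustration, not part of the presentation: the query representation (a head tuple plus a list of (predicate, args) atoms), the function name find_containment_mapping, and the convention that variables are strings are all chosen here for convenience. The search may take exponential time in the worst case.

```python
def is_var(term):
    # Assumed convention: variables are strings, constants are anything else.
    return isinstance(term, str)


def find_containment_mapping(q_from, q_to):
    """Return a containment mapping from q_from to q_to as a dict, or None.

    Each query is a pair (head_args, body), where body is a list of
    (predicate, argument_tuple) atoms.
    """
    head_from, body_from = q_from
    head_to, body_to = q_to
    if len(head_from) != len(head_to):
        return None

    def bind(mapping, term, target):
        # Constants must map to themselves; variables must map consistently.
        if not is_var(term):
            return mapping if term == target else None
        if term in mapping:
            return mapping if mapping[term] == target else None
        return {**mapping, term: target}

    def bind_all(mapping, terms, targets):
        for term, target in zip(terms, targets):
            mapping = bind(mapping, term, target)
            if mapping is None:
                return None
        return mapping

    def search(i, mapping):
        # Try to send the i-th subgoal of q_from to some subgoal of q_to.
        if i == len(body_from):
            return mapping
        pred, args = body_from[i]
        for pred_to, args_to in body_to:
            if pred_to == pred and len(args_to) == len(args):
                extended = bind_all(mapping, args, args_to)
                if extended is not None:
                    result = search(i + 1, extended)
                    if result is not None:
                        return result
        return None

    # The i-th head argument must map to the i-th head argument.
    start = bind_all({}, head_from, head_to)
    return None if start is None else search(0, start)
```

For P1 and P2 as defined above, find_containment_mapping returns None in both directions, which agrees with the hand derivation for the direction from P1 to P2.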

12 Complexity of the Containment-Mapping Test
It is NP-complete to decide whether there is a containment mapping from one conjunctive query to another.
The importance of containment mappings is expressed by the theorem: if Q1 and Q2 are conjunctive queries, then Q2 is contained in Q1 if and only if there is a containment mapping from Q1 to Q2.

13 Why the Containment-Mapping Test Works
Two questions must be answered:
1. If there is a containment mapping, why must there be a containment of conjunctive queries?
2. If there is containment, why must there be a containment mapping?

14 Finding Solutions to a Mediator Query
Given a query Q, we look for solutions S such that the expansion E of S is contained in Q.
"If a query Q has n subgoals, then any answer produced by any solution is also produced by a solution that has at most n subgoals." This is known as the LMSS Theorem.

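15 Example
Q1: Q(w,z) <- Par(w,x) AND Par(x,y) AND Par(y,z)
S1: Q(w,z) <- V1(w,x) AND V2(x,z)
S2: Q(w,z) <- V1(w,x) AND V2(x,z) AND V1(t,u) AND V2(u,v)
By the LMSS Theorem, S2 is not needed, since it has more subgoals than Q1. Its expansion is
E2: Q(w,z) <- Par(w,x) AND Par(x,p) AND Par(p,z) AND Par(t,u) AND Par(u,q) AND Par(q,v)
while the expansion of S1 is
E1: Q(w,z) <- Par(w,x) AND Par(x,p) AND Par(p,z)
E2 is contained in E1, using the containment mapping that sends each variable of E1 to the same variable in E2.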

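16 Why the LMSS Theorem Holds
Suppose query Q has n subgoals and a solution S has more than n subgoals. The expansion E of S must be contained in Q, so there is a containment mapping from Q to E. That mapping sends the n subgoals of Q to at most n subgoals of E, so some subgoals of S contribute nothing to E that is the target of a subgoal of Q.
Let S' be the solution obtained by removing all such subgoals from S, and let E' be the expansion of S'. Then E' is contained in Q by the same containment mapping, and S is contained in S' by the identity mapping. Thus there is no need for solution S among the solutions to query Q.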

17 Entity Resolution
Determining whether two records or tuples do or do not represent the same person, organization, place or other entity is called entity resolution.

18 Deciding Whether Records Represent a Common Entity
Two records represent the same individual if they have similar values for each of the fields associated with those records. It is not sufficient to require that the values of corresponding fields be identical, for the following reasons:
1. Misspellings
2. Variant Names
3. Misunderstanding of Names
4. Evolution of Values
5. Abbreviations
Thus, when deciding whether two records represent the same entity, we need to look carefully at the kinds of discrepancies that occur and use a test that measures the similarity of records.

20 Deciding Whether Records Represent a Common Entity - Edit Distance
The first approach to measuring the similarity of records is edit distance. Values that are strings can be compared by counting the number of insertions and deletions of characters it takes to turn one string into the other. Two records are then taken to represent the same entity if the edit distance between them is below a given threshold.
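As a minimal sketch of this measure, the function below counts only insertions and deletions, as the slide describes (no substitutions). The function names and the threshold value in the usage note are assumptions made for illustration.

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions and deletions
    needed to turn string s into string t."""
    # prev[j] holds the distance between the current prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            if cs == ct:
                # Matching characters cost nothing.
                curr.append(prev[j - 1])
            else:
                # Either delete cs from s or insert ct into s.
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def values_similar(s, t, threshold):
    # Assumed rule: values are similar if the distance is below the threshold.
    return edit_distance(s, t) < threshold
```

For instance, turning "Smythe" into "Smith" takes three steps (delete "y", delete "e", insert "i"), so with a hypothetical threshold of 4 the two spellings would be treated as similar.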

21 Deciding Whether Records Represent a Common Entity - Normalization
A second approach is to normalize records by replacing certain substrings by others. For instance, we can use a table of abbreviations and replace each abbreviation by what it normally stands for. Once the records are normalized, we can use edit distance to measure the difference between the normalized values in the fields.
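A small sketch of table-driven normalization follows. The abbreviation table and the choices to lowercase the value and drop punctuation are assumptions for illustration; a real table would be specific to the data being cleaned.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def normalize(value):
    """Lowercase the value, drop punctuation, and expand known abbreviations."""
    words = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)
```

With this table, "123 Oak St." and "123 Oak Street" both normalize to "123 oak street", so their edit distance after normalization is zero.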

22 Merging Similar Records
Merging means replacing two records that are similar enough by a single record that contains the information of both. There are many possible merge rules:
1. Set any field in which the records disagree to the empty string.
2. (i) Merge by taking the union of the values in each field.
   (ii) Declare two records similar if at least two of the three fields have a nonempty intersection.
Example:
Record   Name   Address         Phone
1        Susan  123 Oak St.
2        Susan  456 Maple St.
3        Susan  456 Maple St.
After merging:
Record   Name   Address                         Phone
(1-2-3)  Susan  {123 Oak St., 456 Maple St.}    { , }
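Rule 2 can be sketched with set-valued fields. The record layout (a tuple of three frozensets) and the function names are assumptions made here, and the phone values in the usage note are placeholders, not data from the presentation.

```python
def make_record(name, address, phone):
    # Each field holds a set of values so that merged records can keep all of them.
    return (frozenset([name]), frozenset([address]), frozenset([phone]))


def similar(r, s):
    """Rule 2(ii): similar if at least two of the three fields share a value."""
    shared_fields = sum(1 for a, b in zip(r, s) if a & b)
    return shared_fields >= 2


def merge(r, s):
    """Rule 2(i): merge by taking the union of the values in each field."""
    return tuple(a | b for a, b in zip(r, s))
```

With placeholder phones "phone-A" and "phone-B", a record (Susan, 123 Oak St., phone-A) is similar to (Susan, 456 Maple St., phone-A) but not to (Susan, 456 Maple St., phone-B). After the first two are merged, the result shares a name and an address with the third record, so all three can end up in one merged record.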

23 Useful Properties of Similarity and Merge Functions
The following properties say that the merge operation forms a semilattice:
1. Idempotence: the merge of a record with itself should be that record.
2. Commutativity: if we merge two records, the order in which we list them should not matter.
3. Associativity: the order in which we group records for a merger should not matter.
There are some other properties that we expect the similarity relationship to have:
1. Idempotence for similarity: a record is always similar to itself.
2. Commutativity of similarity: in deciding whether two records are similar, it does not matter in which order we list them.
3. Representability: if r is similar to some other record s, but s is instead merged with some other record t, then r remains similar to the merger of s and t and can be merged with that record.
Together, idempotence, commutativity, associativity and representability are the ICAR properties.

24 R-Swoosh Algorithm for ICAR Records
Input: a set of records I, a similarity function and a merge function.
Output: a set of merged records O.
Method:
O := emptyset;
WHILE I is not empty DO BEGIN
    Let r be any record in I;
    Find, if possible, some record s in O that is similar to r;
    IF no record s exists THEN
        move r from I to O
    ELSE BEGIN
        delete r from I;
        delete s from O;
        add the merger of r and s to I;
    END;
END;
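The method above translates almost line for line into Python. In this sketch the similarity and merge functions are passed in as parameters, and the function name r_swoosh is chosen here. The toy functions in the usage note (records as sets of tokens, similar when they share a token, merged by union) are assumptions picked only because they satisfy the ICAR properties.

```python
def r_swoosh(records, similar, merge):
    """Merge records until no two records in the output are similar.

    similar(r, s) returns True or False; merge(r, s) returns the merged record.
    Both are assumed to satisfy the ICAR properties.
    """
    pending = list(records)  # the set I
    output = []              # the set O
    while pending:
        r = pending.pop()    # let r be any record in I (and take it out of I)
        # Find, if possible, some record s in O that is similar to r.
        match = next((s for s in output if similar(r, s)), None)
        if match is None:
            output.append(r)                 # move r from I to O
        else:
            output.remove(match)             # delete s from O
            pending.append(merge(r, match))  # add the merger of r and s to I
    return output
```

For example, calling r_swoosh on the token sets {a,b}, {b,c}, {d} and {c,e}, with similar defined as "the intersection is nonempty" and merge defined as set union, leaves two records: {a,b,c,e} and {d}.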

25 Other Approaches to Entity Resolution
The other approaches to entity resolution are:
1. Non-ICAR Datasets
2. Clustering
3. Partitioning

26 Non-ICAR Datasets, Clustering, Partitioning
Non-ICAR Datasets: we can define a dominance relation r <= s, meaning that record s contains all the information contained in record r. If so, we can eliminate record r from further consideration.
Clustering: sometimes we group the records into clusters such that members of a cluster are in some sense similar to each other and members of different clusters are not similar.
Partitioning: we can group the records, perhaps several times, into groups that are likely to contain similar records, and look only within each group for pairs of similar records.

27 Thank You

