Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian.


1 Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian

2 Motivation
• Information theory is relevant to all of humanity... (comic: Abstruse Goose, no. 177)

3 Background
• Many problems in data management need precise reasoning about information content, transfer and loss:
  – Structure extraction
  – Privacy preservation
  – Schema design
  – Probabilistic data?

4 Information Theory
• First developed by Shannon as a way of quantifying the capacity of signal channels.
• Entropy, relative entropy and mutual information capture intrinsic informational aspects of a signal.
• Today:
  – Information theory provides a domain-independent way to reason about structure in data
  – More information = interesting structure
  – Less information linkage = decoupling of structures

5 Tutorial Thesis Information theory provides a mathematical framework for the quantification of information content, linkage and loss. This framework can be used in the design of data management strategies that rely on probing the structure of information in data.

6 Tutorial Goals
• Introduce information-theoretic concepts to the VLDB audience
• Give a 'data-centric' perspective on information theory
• Connect these to applications in data management
• Describe underlying computational primitives
Illuminate when and how information theory might be of use in new areas of data management.

7 Outline
Part 1
• Introduction to Information Theory
• Application: Data Anonymization
• Application: Data Integration
Part 2
• Review of Information Theory Basics
• Application: Database Design
• Computing Information Theoretic Primitives
• Open Problems

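8 Histograms And Discrete Distributions
• Column of data X (values as listed on the slide): x1, x2, x1, x4, x2, x3, x1
• Aggregate counts to get the histogram f(X): x1 = 4, x2 = 2, x3 = 1, x4 = 1
• Normalize to get the probability distribution p(X): x1 = 0.5, x2 = 0.25, x3 = 0.125, x4 = 0.125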

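9 Histograms And Discrete Distributions
• Column of data X (values as listed on the slide): x1, x2, x1, x4, x2, x3, x1
• Aggregate counts to get the histogram f(X): x1 = 4, x2 = 2, x3 = 1, x4 = 1
• Reweight, f(x)*w(x): x1: 4*5 = 20, x2: 2*3 = 6, x3: 1*2 = 2, x4: 1*2 = 2
• Normalize to get the probability distribution p(X): x1 = 20/30 ≈ 0.67, x2 = 6/30 = 0.2, x3 = 2/30 ≈ 0.067, x4 = 2/30 ≈ 0.067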
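
The aggregate, reweight and normalize steps can be sketched in a few lines of Python. This is only an illustration, not code from the tutorial: the function name `empirical_distribution` is made up here, and the column is assumed to hold eight values (four x1, two x2, one x3, one x4) so that it agrees with the histogram counts on the slide.

```python
from collections import Counter

def empirical_distribution(column, weights=None):
    """Aggregate counts, optionally reweight them, then normalize to probabilities."""
    counts = Counter(column)                      # histogram f(X)
    if weights is not None:                       # reweighted histogram f(x) * w(x)
        counts = {v: c * weights[v] for v, c in counts.items()}
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

# Assumed column: eight values consistent with the counts 4, 2, 1, 1.
column = ["x1", "x2", "x1", "x4", "x2", "x3", "x1", "x1"]

p = empirical_distribution(column)
w = {"x1": 5, "x2": 3, "x3": 2, "x4": 2}          # weights from the slide
p_w = empirical_distribution(column, w)

print(p)
print(p_w)
```

The unweighted call yields the empirical distribution; the weighted call changes only the counts before normalization.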

10 From Columns To Random Variables
• We can think of a column of data as "represented" by a random variable:
  – X is a random variable
  – p(X) is the column of probabilities p(X = x1), p(X = x2), and so on
  – Also known (in the unweighted case) as the empirical distribution induced by the column X
• Notation:
  – X (upper case) denotes a random variable (column)
  – x (lower case) denotes a value taken by X (field in a tuple)
  – p(x) is the probability p(X = x)

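11 Joint Distributions
• Discrete distribution: probability p(X,Y,Z)
• $p(Y) = \sum_x p(X=x, Y) = \sum_x \sum_z p(X=x, Y, Z=z)$
Tables on the slide:
  p(X,Y,Z): eight rows with (X,Y) = (x1,y1), (x1,y2), (x1,y1), (x1,y2), (x2,y3), (x2,y3), (x3,y3), (x4,y3); the Z values and probabilities are not preserved in this transcript
  p(X): x1 = 0.5, x2 = 0.25, x3 = 0.125, x4 = 0.125
  p(Y): y1 = 0.25, y2 = 0.25, y3 = 0.5
  p(X,Y): (x1,y1) = 0.25, (x1,y2) = 0.25, (x2,y3) = 0.25, (x3,y3) = 0.125, (x4,y3) = 0.125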

12 Entropy Of A Column
• Let $h(x) = \log_2 \frac{1}{p(x)}$; h(X) is the column of h(x) values.
• $H(X) = E_X[h(x)] = \sum_x p(x) \log_2 \frac{1}{p(x)}$
Two views of entropy:
• It captures uncertainty in data: higher entropy, more unpredictability.
• It captures information content: higher entropy, more information.
  X    p(X)    h(X)
  x1   0.5     1
  x2   0.25    2
  x3   0.125   3
  x4   0.125   3
• $H(X) = 1.75 < \log_2 |X| = 2$
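
As a quick check of the H(X) = 1.75 figure, the following sketch computes the per-value surprisal h(x) and the entropy for the distribution on this slide. The helper name `entropy` is chosen here for illustration.

```python
import math

def entropy(p):
    """H = sum over x of p(x) * log2(1 / p(x)), skipping zero-probability values."""
    return sum(q * math.log2(1 / q) for q in p.values() if q > 0)

p_x = {"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}

h_x = {x: math.log2(1 / q) for x, q in p_x.items()}   # column h(X)
H_x = entropy(p_x)

print(h_x)
print(H_x, "<", math.log2(len(p_x)))
```

The uniform distribution over the same four values would reach the upper bound log2 |X| = 2.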

13 Examples
• X uniform over {1, ..., 4}: H(X) = 2
• Y is 1 with probability 0.5, and otherwise uniform over {2, 3, 4}:
  – $H(Y) = 0.5 \log_2 2 + 0.5 \log_2 6 \approx 1.8 < 2$
  – Y is more sharply defined, and so has less uncertainty.
• Z uniform over {1, ..., 8}: H(Z) = 3 > 2
  – Z spans a larger range, and captures more information.

14 Comparing Distributions
• How do we measure the difference between two distributions?
• Kullback-Leibler divergence:
  – $d_{KL}(p, q) = E_p[h(q) - h(p)] = \sum_i p_i \log \frac{p_i}{q_i}$
(Figure: an inference mechanism maps a prior belief to a resulting belief.)

15 Comparing Distributions
• Kullback-Leibler divergence:
  – $d_{KL}(p, q) = E_p[h(q) - h(p)] = \sum_i p_i \log \frac{p_i}{q_i}$
  – $d_{KL}(p, q) \ge 0$
  – Captures the extra information needed to describe p when q is assumed
  – Is asymmetric: in general $d_{KL}(p, q) \ne d_{KL}(q, p)$
  – Is not a metric (does not satisfy the triangle inequality)
• There are other measures:
  – $\chi^2$-distance, variational distance, f-divergences, …
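
The non-negativity and asymmetry of the KL divergence are easy to see numerically. The sketch below uses two made-up distributions p and q over {a, b}; they are illustrative and not taken from the slides, and the function assumes q is nonzero wherever p is.

```python
import math

def kl_divergence(p, q):
    """d_KL(p, q) = sum_i p_i * log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / q[k]) for k, pi in p.items() if pi > 0)

# Hypothetical distributions chosen only to illustrate asymmetry.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(d_pq, d_qp)   # two different non-negative values
```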

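16 Conditional Probability
• Given a joint distribution on random variables X, Y, how much information about X can we glean from Y?
• Conditional probability: p(X|Y)
  – p(X = x1 | Y = y1) = p(X = x1, Y = y1) / p(Y = y1)
Tables on the slide (conditional values recomputed from the joint and the marginals):
  X    Y    p(X,Y)   p(X|Y)   p(Y|X)
  x1   y1   0.25     1.0      0.5
  x1   y2   0.25     1.0      0.5
  x2   y3   0.25     0.5      1.0
  x3   y3   0.125    0.25     1.0
  x4   y3   0.125    0.25     1.0
  p(X): x1 = 0.5, x2 = 0.25, x3 = 0.125, x4 = 0.125
  p(Y): y1 = 0.25, y2 = 0.25, y3 = 0.5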

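17 Conditional Entropy
• Let $h(x|y) = \log_2 \frac{1}{p(x|y)}$
• $H(X|Y) = E_{x,y}[h(x|y)] = \sum_x \sum_y p(x,y) \log_2 \frac{1}{p(x|y)}$
• H(X|Y) = H(X,Y) – H(Y) = 2.25 – 1.5 = 0.75
• If X, Y are independent, H(X|Y) = H(X)
(Table on the slide: columns X, Y, p(X,Y), p(X|Y), h(X|Y) for the rows (x1,y1), (x1,y2), (x2,y3), (x3,y3), (x4,y3); the numeric entries are not preserved in this transcript.)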
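
The value H(X|Y) = 0.75 can be reproduced from the joint distribution p(X,Y) used throughout these slides, both from the definition and from the identity H(X|Y) = H(X,Y) – H(Y). The helper names below are illustrative.

```python
import math
from collections import defaultdict

p_xy = {("x1", "y1"): 0.25, ("x1", "y2"): 0.25, ("x2", "y3"): 0.25,
        ("x3", "y3"): 0.125, ("x4", "y3"): 0.125}

def marginal(p_joint, axis):
    """Sum the joint over the other variable (axis 0 gives p(X), axis 1 gives p(Y))."""
    m = defaultdict(float)
    for key, p in p_joint.items():
        m[key[axis]] += p
    return dict(m)

def entropy(p):
    return sum(q * math.log2(1 / q) for q in p.values() if q > 0)

p_y = marginal(p_xy, 1)

# Definition: sum of p(x,y) * log2(1 / p(x|y)), with p(x|y) = p(x,y) / p(y).
h_direct = sum(p * math.log2(p_y[y] / p) for (x, y), p in p_xy.items())

# Identity: H(X|Y) = H(X,Y) - H(Y).
h_identity = entropy(p_xy) - entropy(p_y)

print(h_direct, h_identity)
```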

18 Mutual Information
• Mutual information captures the difference between the joint distribution on X and Y, and the product of the marginal distributions on X and Y.
• Let $i(x;y) = \log \frac{p(x,y)}{p(x)\,p(y)}$
• $I(X;Y) = E_{x,y}[i(x;y)] = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$
(Tables on the slide: X, Y, p(X,Y), h(X,Y), i(X;Y); X, p(X), h(X); Y, p(Y), h(Y). The numeric entries are not preserved in this transcript.)

19 Mutual Information: Strength of linkage
• $I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$
• If X, Y are independent, then I(X;Y) = 0:
  – H(X,Y) = H(X) + H(Y), so I(X;Y) = H(X) + H(Y) – H(X,Y) = 0
• $I(X;Y) \le \min(H(X), H(Y))$
  – Suppose Y = f(X) (deterministically)
  – Then H(Y|X) = 0, and so I(X;Y) = H(Y) – H(Y|X) = H(Y)
• Mutual information captures higher-order interactions:
  – Covariance captures "linear" interactions only
  – Two variables can be uncorrelated (covariance = 0) and have nonzero mutual information:
  – X drawn uniformly at random from [-1, 1], Y = X². Then Cov(X,Y) = 0, but Y is a function of X, so I(X;Y) = H(Y) > 0
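
The uncorrelated-but-dependent example can be checked on a small discrete stand-in. The sketch below assumes X is uniform on {-1, 0, 1} (a simplification of the interval [-1, 1], chosen here so that entropies are finite) and Y = X². Because Y is a deterministic function of X, the mutual information comes out equal to H(Y).

```python
import math
from collections import defaultdict

def marginal(p_joint, axis):
    m = defaultdict(float)
    for key, p in p_joint.items():
        m[key[axis]] += p
    return dict(m)

def entropy(p):
    return sum(q * math.log2(1 / q) for q in p.values() if q > 0)

def mutual_information(p_xy):
    """I(X;Y) = sum of p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    p_x, p_y = marginal(p_xy, 0), marginal(p_xy, 1)
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

# Assumed discrete example: X uniform on {-1, 0, 1}, Y = X^2.
p_xy = {(x, x * x): 1 / 3 for x in (-1, 0, 1)}

mean_x = sum(p * x for (x, _), p in p_xy.items())
mean_y = sum(p * y for (_, y), p in p_xy.items())
covariance = sum(p * (x - mean_x) * (y - mean_y) for (x, y), p in p_xy.items())

mi = mutual_information(p_xy)
h_y = entropy(marginal(p_xy, 1))
print(covariance, mi, h_y)
```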

20 Information-Theoretic Clustering
• Clustering takes a collection of objects and groups them.
  – Given a distance function between objects
  – Choice of measure of complexity of clustering
  – Choice of measure of cost for a cluster
• Usually,
  – Distance function is Euclidean distance
  – Number of clusters is the measure of complexity
  – Cost measure for a cluster is sum-of-squared-distance to center
• Goal: minimize complexity and cost
  – Inherent tradeoff between the two

21 Feature Representation
• Column of data X (values as listed on the slide): v1, v2, v1, v4, v2, v3, v1
• Aggregate counts to get the histogram f(X): v1 = 4, v2 = 2, v3 = 1, v4 = 1
• Normalize to get the probability distribution p(X): v1 = 0.5, v2 = 0.25, v3 = 0.125, v4 = 0.125
• Let V = {v1, v2, v3, v4}. X is "explained" by a distribution over V.
• The "feature vector" of X is [0.5, 0.25, 0.125, 0.125]

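22 Feature Representation
(Figure: a matrix with one row per object X and one column per feature v1, v2, v3, v4 in V; each row is a feature vector, for example p(v2|X2) = 0.2.)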

23 Information-Theoretic Clustering
• Clustering takes a collection of objects and groups them.
  – Given a distance function between objects
  – Choice of measure of complexity of clustering
  – Choice of measure of cost for a cluster
• In the information-theoretic setting:
  – What is the distance function?
  – How do we measure complexity?
  – What is a notion of cost/quality?
• Goal: minimize complexity and maximize quality
  – Inherent tradeoff between the two

24 Measuring complexity of clustering
• Take 1: complexity of a clustering = number of clusters
  – The standard model of complexity
• This doesn't capture the fact that clusters have different sizes.

25 Measuring complexity of clustering
• Take 2: complexity of a clustering = number of bits needed to describe it.
• Writing down "k" needs log k bits.
• In general, let cluster $t \in T$ have |t| elements.
  – Set p(t) = |t|/n
  – Number of bits to write down cluster sizes: $H(T) = \sum_t p_t \log \frac{1}{p_t}$
(Figure: two example clusterings compared by H(T), one with lower entropy than the other; the images are not preserved.)
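
A small sketch of this "take 2" measure: H(T) computed from cluster sizes alone. The two example clusterings (sizes 4 and 4 versus 7 and 1) are made up for illustration; both have two clusters, yet the skewed one needs fewer bits.

```python
import math

def clustering_complexity(cluster_sizes):
    """H(T) with p(t) = |t| / n, in bits."""
    n = sum(cluster_sizes)
    return sum((s / n) * math.log2(n / s) for s in cluster_sizes if s > 0)

balanced = clustering_complexity([4, 4])   # hypothetical clustering of 8 points
skewed = clustering_complexity([7, 1])     # same number of clusters, uneven sizes

print(balanced, skewed)
```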

26 Information-theoretic Clustering (take I)  Given data X = x1,..., xn explained by variable V, partition X into clusters (represented by T) such that H(T) is minimized and quality is maximized

27 Soft clusterings
• In a "hard" clustering, each point is assigned to exactly one cluster.
• Characteristic function:
  – p(t|x) = 1 if $x \in t$, 0 if not.
• Suppose we allow points to partially belong to clusters:
  – p(T|x) is a distribution.
  – p(t|x) is the "probability" of assigning x to t
How do we describe the complexity of a clustering?

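28 Measuring complexity of clustering
• Take 1:
  – $p(t) = \sum_x p(x)\, p(t|x)$
  – Compute H(T) as before.
• Problem: H(T1) = H(T2) !!
(Tables on the slide: membership probabilities p(t|x) for two soft clusterings T1 and T2 of points x1, x2 into clusters t1, t2. Most entries are not preserved in this transcript; in both clusterings each cluster ends up with p(t) = 0.5.)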

29 Measuring complexity of clustering
• By averaging the memberships, we've lost useful information.
• Take II: compute I(T;X)!
• Even better: if T is a hard clustering of X, then I(T;X) = H(T)
(Tables on the slide: X, T, p(X,T), i(X;T) for clusterings T1 and T2; the entries are not preserved in this transcript.)
• I(T1;X) = 0
• I(T2;X) = 0.46
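
The difference between H(T) and I(T;X) for soft clusterings can be sketched as follows. The membership tables below are assumptions made for illustration (the slide's actual T2 values are not available here, so the resulting I(T2;X) differs from the 0.46 quoted above): T1 splits every point evenly, while T2 assigns each point mostly to one cluster.

```python
import math

def cluster_marginal(p_x, p_t_given_x):
    """p(t) = sum over x of p(x) * p(t|x)."""
    p_t = {}
    for x, px in p_x.items():
        for t, ptx in p_t_given_x[x].items():
            p_t[t] = p_t.get(t, 0.0) + px * ptx
    return p_t

def entropy(p):
    return sum(q * math.log2(1 / q) for q in p.values() if q > 0)

def soft_mutual_information(p_x, p_t_given_x):
    """I(T;X) = sum of p(x) p(t|x) * log2(p(t|x) / p(t))."""
    p_t = cluster_marginal(p_x, p_t_given_x)
    return sum(px * ptx * math.log2(ptx / p_t[t])
               for x, px in p_x.items()
               for t, ptx in p_t_given_x[x].items() if ptx > 0)

p_x = {"x1": 0.5, "x2": 0.5}
# Hypothetical membership probabilities p(t|x).
T1 = {"x1": {"t1": 0.5, "t2": 0.5}, "x2": {"t1": 0.5, "t2": 0.5}}
T2 = {"x1": {"t1": 0.9, "t2": 0.1}, "x2": {"t1": 0.1, "t2": 0.9}}

for name, T in (("T1", T1), ("T2", T2)):
    print(name, entropy(cluster_marginal(p_x, T)), soft_mutual_information(p_x, T))
```

Both clusterings have the same H(T), but only T2 carries information about which point is which.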

30 Information-theoretic Clustering (take II)
• Given data X = x1, ..., xn explained by variable V, partition X into clusters (represented by T) such that I(T;X) is minimized and quality is maximized

31 Measuring cost of a cluster
Given objects $X_t = \{X_1, X_2, \ldots, X_m\}$ in cluster t,
$\mathrm{Cost}(t) = \frac{1}{m} \sum_i d(X_i, C) = \sum_i p(X_i)\, d_{KL}(p(V|X_i), C)$
where $C = \frac{1}{m} \sum_i p(V|X_i) = \sum_i p(X_i)\, p(V|X_i) = p(V)$

32 Mutual Information = Cost of Cluster
$\mathrm{Cost}(t) = \frac{1}{m} \sum_i d(X_i, C) = \sum_i p(X_i)\, d_{KL}(p(V|X_i), p(V))$
$\sum_i p(X_i)\, KL(p(V|X_i), p(V)) = \sum_i p(X_i) \sum_j p(v_j|X_i) \log \frac{p(v_j|X_i)}{p(v_j)} = \sum_{i,j} p(X_i, v_j) \log \frac{p(v_j, X_i)}{p(v_j)\, p(X_i)} = I(X_t; V)$ !!
Cost of a cluster = $I(X_t; V)$

33 Cost of a clustering
• If we partition X into k clusters $X_1, \ldots, X_k$:
$\mathrm{Cost}(\text{clustering}) = \sum_i p_i\, I(X_i; V)$, where $p_i = |X_i| / |X|$

34 Cost of a clustering
• Each cluster center t can be "explained" in terms of V:
  – $p(V|t) = \sum_i p(X_i)\, p(V|X_i)$
• Suppose we treat each cluster center itself as a point:

35 Cost of a clustering
• We can write down the "cost" of this "cluster":
  – Cost(T) = I(T;V)
• Key result [BMDG05]: $\mathrm{Cost}(\text{clustering}) = I(X;V) - I(T;V)$
• Since I(X;V) is fixed by the data, minimizing Cost(clustering) is the same as maximizing I(T;V)
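
The decomposition Cost(clustering) = I(X;V) – I(T;V) can be verified numerically for a hard clustering. The feature vectors p(V|x), the uniform p(x) and the two clusters below are all made-up values used only to exercise the identity; mutual information is computed as the weighted KL divergence of each conditional distribution to its mixture.

```python
import math

def kl(p, q):
    """KL divergence in bits between two probability vectors (q > 0 wherever p > 0)."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def mixture(weights, dists):
    return [sum(w * d[j] for w, d in zip(weights, dists)) for j in range(len(dists[0]))]

def mutual_information(weights, dists):
    """I = sum_i w_i * KL(dist_i || mixture of all dists)."""
    center = mixture(weights, dists)
    return sum(w * kl(d, center) for w, d in zip(weights, dists))

# Hypothetical data: four objects with feature vectors p(V|x), uniform p(x).
p_v_given_x = [[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1],
               [0.1, 0.1, 0.4, 0.4], [0.1, 0.1, 0.2, 0.6]]
p_x = [0.25] * 4
clusters = [[0, 1], [2, 3]]            # hard clustering T

p_t = [sum(p_x[i] for i in c) for c in clusters]
centers = [mixture([p_x[i] / pt for i in c], [p_v_given_x[i] for i in c])
           for c, pt in zip(clusters, p_t)]          # p(V|t)

i_xv = mutual_information(p_x, p_v_given_x)
i_tv = mutual_information(p_t, centers)
cost = sum(p_x[i] * kl(p_v_given_x[i], centers[k])
           for k, c in enumerate(clusters) for i in c)

print(cost, i_xv - i_tv)
```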

36 Information-theoretic Clustering (take III)
• Given data X = x1, ..., xn explained by variable V, partition X into clusters (represented by T) such that $I(T;X) - \beta I(T;V)$ is minimized
• This is the Information Bottleneck Method [TPB98]
• Agglomerative techniques exist for the case of 'hard' clusterings
• β is the tradeoff parameter between complexity and cost
• I(T;X) and I(T;V) are in the same units.

37 Information Theory: Summary
• We can represent data as discrete distributions (normalized histograms)
• Entropy captures uncertainty or information content in a distribution
• The Kullback-Leibler divergence captures the difference between distributions
• Mutual information and conditional entropy capture linkage between variables in a joint distribution
• We can formulate information-theoretic clustering problems

38 Outline
Part 1
• Introduction to Information Theory
• Application: Data Anonymization
• Application: Data Integration
Part 2
• Review of Information Theory Basics
• Application: Database Design
• Computing Information Theoretic Primitives
• Open Problems

39 Data Anonymization Using Randomization
• Goal: publish anonymized microdata to enable accurate ad hoc analyses, but ensure privacy of individuals' sensitive attributes
• Key ideas:
  – Randomize numerical data: add noise from a known distribution
  – Reconstruct the original data distribution using published noisy data
• Issues:
  – How can the original data distribution be reconstructed?
  – What kinds of randomization preserve privacy of individuals?
Information Theory for Data Management - Divesh & Suresh

40 Data Anonymization Using Randomization
• Many randomization strategies proposed [AS00, AA01, EGS03]
• Example randomization strategies: X in [0, 10]
  – R = X + μ (mod 11), μ is uniform in {-1, 0, 1}
  – R = X + μ (mod 11), μ is in {-1 (p = 0.25), 0 (p = 0.5), 1 (p = 0.25)}
  – R = X (p = 0.6), R = μ, μ is uniform in [0, 10] (p = 0.4)
• Question:
  – Which randomization strategy has higher privacy preservation?
  – Quantify loss of privacy due to publication of randomized data

41 Data Anonymization Using Randomization
• X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
  X by Id: s1 = 0, s2 = 3, s3 = 5, s4 = 0, s5 = 8, s6 = 0, s7 = 6, s8 = 0

42 Data Anonymization Using Randomization
• X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
  Id: (X, μ) → R1 (μ = -1 entries recovered from R1 – X)
  s1: (0, -1) → 10
  s2: (3, 0) → 3
  s3: (5, 1) → 6
  s4: (0, 0) → 0
  s5: (8, 1) → 9
  s6: (0, -1) → 10
  s7: (6, 1) → 7
  s8: (0, 0) → 0

43 Data Anonymization Using Randomization
• X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
  Id: (X, μ) → R1 (a second random draw; μ = -1 entries recovered from R1 – X)
  s1: (0, 0) → 0
  s2: (3, -1) → 2
  s3: (5, 0) → 5
  s4: (0, 1) → 1
  s5: (8, 1) → 9
  s6: (0, -1) → 10
  s7: (6, -1) → 5
  s8: (0, 1) → 1

44 Reconstruction of Original Data Distribution
• X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
  – Reconstruct distribution of X using knowledge of R1 and μ
  – EM algorithm converges to MLE of original distribution [AA01]
  Id: (X, μ) → R1 → X | R1
  s1: (0, 0) → 0 → {10, 0, 1}
  s2: (3, -1) → 2 → {1, 2, 3}
  s3: (5, 0) → 5 → {4, 5, 6}
  s4: (0, 1) → 1 → {0, 1, 2}
  s5: (8, 1) → 9 → {8, 9, 10}
  s6: (0, -1) → 10 → {9, 10, 0}
  s7: (6, -1) → 5 → {4, 5, 6}
  s8: (0, 1) → 1 → {0, 1, 2}

45 Analysis of Privacy [AS00]
• X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
  – If X is uniform in [0, 10], privacy is determined by the range of μ
(Same Id, X, μ, R1 and X | R1 table as on the previous slide.)

46 Analysis of Privacy [AA01]
• X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
  – If X is uniform in [0, 1] ∪ [5, 6], privacy is smaller than the range of μ
  Id: (X, μ) → R1 → X | R1
  s1: (0, 0) → 0 → {10, 0, 1}
  s2: (1, -1) → 0 → {10, 0, 1}
  s3: (5, 0) → 5 → {4, 5, 6}
  s4: (6, 1) → 7 → {6, 7, 8}
  s5: (0, 1) → 1 → {0, 1, 2}
  s6: (1, -1) → 0 → {10, 0, 1}
  s7: (5, -1) → 4 → {3, 4, 5}
  s8: (6, 1) → 7 → {6, 7, 8}

47 Analysis of Privacy [AA01]
• X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
  – If X is uniform in [0, 1] ∪ [5, 6], privacy is smaller than the range of μ
  – In some cases, the sensitive value is revealed
  Id: (X, μ) → R1 → X | R1 (restricted to the support of X)
  s1: (0, 0) → 0 → {0, 1}
  s2: (1, -1) → 0 → {0, 1}
  s3: (5, 0) → 5 → {5, 6}
  s4: (6, 1) → 7 → {6}
  s5: (0, 1) → 1 → {0, 1}
  s6: (1, -1) → 0 → {0, 1}
  s7: (5, -1) → 4 → {5}
  s8: (6, 1) → 7 → {6}
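
The X | R1 candidate sets on these slides follow mechanically from the noise model. The sketch below is an illustration under stated assumptions: `randomize` and `candidates` are names invented here, and the `support` argument models an attacker who knows that X only takes values in {0, 1, 5, 6}.

```python
import random

NOISE = (-1, 0, 1)   # mu is uniform over these values
MODULUS = 11         # X ranges over 0..10

def randomize(x, rng):
    """Publish R1 = X + mu (mod 11)."""
    return (x + rng.choice(NOISE)) % MODULUS

def candidates(r, support=None):
    """Values of X consistent with a published r, optionally intersected with a known support."""
    values = {(r - mu) % MODULUS for mu in NOISE}
    if support is not None:
        values &= set(support)
    return values

rng = random.Random(0)
xs = [0, 1, 5, 6, 0, 1, 5, 6]
published = [randomize(x, rng) for x in xs]

support = {0, 1, 5, 6}
for x, r in zip(xs, published):
    print(x, r, sorted(candidates(r)), sorted(candidates(r, support)))
```

With the support known, a published value of 7 or 4 leaves a single candidate, which is the "sensitive value revealed" case.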

48 Quantify Loss of Privacy [AA01]
• Goal: quantify loss of privacy based on mutual information I(X;R)
  – Smaller H(X|R) ⇒ more loss of privacy in X by knowledge of R
  – Larger I(X;R) ⇒ more loss of privacy in X by knowledge of R
  – I(X;R) = H(X) – H(X|R)
• I(X;R) used to capture correlation between X and R
  – p(X) is the prior knowledge of sensitive attribute X
  – p(X, R) is the joint distribution of X and R

49 Quantify Loss of Privacy [AA01]
• Goal: quantify loss of privacy based on mutual information I(X;R)
  – X is uniform in [5, 6], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
(Tables on the slide: X, R1, p(X,R1), h(X,R1), i(X;R1); X, p(X), h(X) for X = 5, 6; R1, p(R1), h(R1). The numeric entries are not preserved in this transcript.)


52 Quantify Loss of Privacy [AA01]
• Goal: quantify loss of privacy based on mutual information I(X;R)
  – X is uniform in [5, 6], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
  – I(X;R) = 0.33
(Tables of p, h and i values for X and R1 are not preserved in this transcript.)

53 Quantify Loss of Privacy [AA01]
• Goal: quantify loss of privacy based on mutual information I(X;R)
  – X is uniform in [5, 6], R2 = X + μ (mod 11), μ is uniform in {0, 1}
  – I(X;R1) = 0.33, I(X;R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
(Tables of p, h and i values for X and R2 are not preserved in this transcript.)

54 Quantify Loss of Privacy [AA01]
• Equivalent goal: quantify loss of privacy based on H(X|R)
  – X is uniform in [5, 6], R2 = X + μ (mod 11), μ is uniform in {0, 1}
  – Intuition: we know more about X given R2, than about X given R1
  – H(X|R1) = 0.67, H(X|R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
(Tables of p(X,R), p(X|R) and h(X|R) for R1 and R2 are not preserved in this transcript.)
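
The quoted values I(X;R1) = 0.33, I(X;R2) = 0.5, H(X|R1) = 0.67 and H(X|R2) = 0.5 can be reproduced by enumerating the joint distribution of X and R. This is a sketch with illustrative helper names; it assumes X is uniform over the two values {5, 6} and that the noise is independent of X.

```python
import math
from collections import defaultdict

def joint_distribution(x_values, noise, modulus=11):
    """p(x, r) for X uniform over x_values and R = X + mu (mod modulus), mu uniform over noise."""
    p = defaultdict(float)
    weight = 1 / (len(x_values) * len(noise))
    for x in x_values:
        for mu in noise:
            p[(x, (x + mu) % modulus)] += weight
    return dict(p)

def marginal(p_joint, axis):
    m = defaultdict(float)
    for key, p in p_joint.items():
        m[key[axis]] += p
    return dict(m)

def entropy(p):
    return sum(q * math.log2(1 / q) for q in p.values() if q > 0)

def mutual_information(p_xr):
    return entropy(marginal(p_xr, 0)) + entropy(marginal(p_xr, 1)) - entropy(p_xr)

def conditional_entropy_x_given_r(p_xr):
    return entropy(p_xr) - entropy(marginal(p_xr, 1))

p_xr1 = joint_distribution([5, 6], [-1, 0, 1])
p_xr2 = joint_distribution([5, 6], [0, 1])

for name, p in (("R1", p_xr1), ("R2", p_xr2)):
    print(name, round(mutual_information(p), 2), round(conditional_entropy_x_given_r(p), 2))
```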

55 Quantify Loss of Privacy
• Example: X is uniform in [0, 1]
  – R3 = e (p = ), R3 = X (p = )
  – R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)
• Is R3 or R4 a bigger privacy risk?

56 Worst Case Loss of Privacy [EGS03]
• Example: X is uniform in [0, 1]
  – R3 = e (p = ), R3 = X (p = )
  – R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)
• I(X;R3) = << I(X;R4) =
(Tables of p, h and i values for (X, R3) and (X, R4) are not preserved in this transcript.)

57 Worst Case Loss of Privacy [EGS03]
• Example: X is uniform in [0, 1]
  – R3 = e (p = ), R3 = X (p = )
  – R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)
• I(X;R3) = << I(X;R4) =
  – But R3 has a larger worst case risk
(Tables of p, h and i values for (X, R3) and (X, R4) are not preserved in this transcript.)

58 Worst Case Loss of Privacy [EGS03]
• Goal: quantify worst case loss of privacy in X by knowledge of R
  – Use max KL divergence, instead of mutual information
• Mutual information can be formulated as expected KL divergence:
  – $I(X;R) = \sum_x \sum_r p(x,r) \log_2 \frac{p(x,r)}{p(x)\,p(r)} = KL(p(X,R) \,\|\, p(X)\,p(R))$
  – $I(X;R) = \sum_r p(r) \sum_x p(x|r) \log_2 \frac{p(x|r)}{p(x)} = E_R[KL(p(X|r) \,\|\, p(X))]$
  – The [AA01] measure quantifies expected loss of privacy over R
• [EGS03] propose a measure based on worst case loss of privacy:
  – $I_W(X;R) = \max_r KL(p(X|r) \,\|\, p(X))$

59 Worst Case Loss of Privacy [EGS03]
• Example: X is uniform in [0, 1]
  – R3 = e (p = ), R3 = X (p = )
  – R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)
• $I_W(X;R3) = \max\{0.0, 1.0, 1.0\} > I_W(X;R4) = \max\{0.028, 0.028\}$
(Tables of p(X,R), p(X|R) and i(X;R) for R3 and R4 are not preserved in this transcript.)

60 Worst Case Loss of Privacy [EGS03]
• Example: X is uniform in [5, 6]
  – R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
  – R2 = X + μ (mod 11), μ is uniform in {0, 1}
• $I_W(X;R1) = \max\{1.0, 0.0, 0.0, 1.0\} = I_W(X;R2) = \max\{1.0, 0.0, 1.0\}$
  – Unable to capture that R2 is a bigger privacy risk than R1
(Tables of p(X,R), p(X|R) and i(X;R) for R1 and R2 are not preserved in this transcript.)
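
The expected and worst-case measures differ only in how the per-output divergences KL(p(X|r) || p(X)) are combined. The sketch below computes both for R1 and R2 (X uniform over {5, 6}) and for R4 (X uniform over {0, 1}); R3 is left out because its probabilities are not given in this transcript. Function names are illustrative. For R4 the computed worst case is about 0.029, in line with the roughly 0.028 shown on the earlier slide.

```python
import math
from collections import defaultdict

def marginal(p_joint, axis):
    m = defaultdict(float)
    for key, p in p_joint.items():
        m[key[axis]] += p
    return dict(m)

def per_output_kl(p_xr):
    """Map each published value r to KL(p(X|r) || p(X)) in bits."""
    p_x, p_r = marginal(p_xr, 0), marginal(p_xr, 1)
    kl = defaultdict(float)
    for (x, r), p in p_xr.items():
        if p > 0:
            p_x_given_r = p / p_r[r]
            kl[r] += p_x_given_r * math.log2(p_x_given_r / p_x[x])
    return dict(kl)

def expected_loss(p_xr):      # equals I(X;R), the [AA01] measure
    p_r = marginal(p_xr, 1)
    return sum(p_r[r] * d for r, d in per_output_kl(p_xr).items())

def worst_case_loss(p_xr):    # I_W(X;R), the [EGS03] measure
    return max(per_output_kl(p_xr).values())

def additive_noise_joint(x_values, noise, modulus=11):
    p = defaultdict(float)
    for x in x_values:
        for mu in noise:
            p[(x, (x + mu) % modulus)] += 1 / (len(x_values) * len(noise))
    return dict(p)

p_xr1 = additive_noise_joint([5, 6], [-1, 0, 1])
p_xr2 = additive_noise_joint([5, 6], [0, 1])
# R4 = X with probability 0.6 and 1 - X with probability 0.4, X uniform over {0, 1}.
p_xr4 = {(0, 0): 0.3, (0, 1): 0.2, (1, 1): 0.3, (1, 0): 0.2}

for name, p in (("R1", p_xr1), ("R2", p_xr2), ("R4", p_xr4)):
    print(name, round(expected_loss(p), 3), round(worst_case_loss(p), 3))
```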

61 Data Anonymization: Summary
• Randomization techniques useful for microdata anonymization
  – Randomization techniques differ in their loss of privacy
• Information theoretic measures useful to capture loss of privacy
  – Expected KL divergence captures expected loss of privacy [AA01]
  – Maximum KL divergence captures worst case loss of privacy [EGS03]
  – Both are useful in practice

62 Outline
Part 1
• Introduction to Information Theory
• Application: Data Anonymization
• Application: Data Integration
Part 2
• Review of Information Theory Basics
• Application: Database Design
• Computing Information Theoretic Primitives
• Open Problems

63 Schema Matching
• Goal: align columns across database tables to be integrated
  – Fundamental problem in database integration
• Early useful approach: textual similarity of column names
  – False positives: Address ≠ IP_Address
  – False negatives: Customer_Id = Client_Number
• Early useful approach: overlap of values in columns, e.g., Jaccard
  – False positives: Emp_Id ≠ Project_Id
  – False negatives: Emp_Id = Personnel_Number

64 Opaque Schema Matching [KN03]
• Goal: align columns when column names, data values are opaque
  – Databases belong to different government bureaucracies
  – Treat column names and data values as uninterpreted (generic)
• Example: EMP_PROJ(Emp_Id, Proj_Id, Task_Id, Status_Id)
  – Likely that all Id fields are from the same domain
  – Different databases may have different column names
  Table (W, X, Y, Z): (w2, x1, y1, z2), (w4, x2, y3, z3), (w3, x3, y3, z1), (w1, x2, y1, z2)
  Table (A, B, C, D): (a1, b2, c1, d1), (a3, b4, c2, d2), (a1, b1, c1, d2), (a4, b3, c2, d3)

65 Opaque Schema Matching [KN03]
• Approach: build a complete, labeled graph $G_D$ for each database D
  – Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
  – Perform graph matching between $G_{D1}$ and $G_{D2}$, minimizing distance
• Intuition:
  – Entropy H(X) captures distribution of values in database column X
  – Mutual information I(X;Y) captures correlations between X, Y
  – Efficiency: graph matching between schema-sized graphs

66 Opaque Schema Matching [KN03]
• Approach: build a complete, labeled graph $G_D$ for each database D
  – Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
  Table (A, B, C, D): (a1, b2, c1, d1), (a3, b4, c2, d2), (a1, b1, c1, d2), (a4, b3, c2, d3)
  p(A): a1 = 0.5, a3 = 0.25, a4 = 0.25
  p(B): b1 = 0.25, b2 = 0.25, b3 = 0.25, b4 = 0.25
  p(C): c1 = 0.5, c2 = 0.5
  p(D): d1 = 0.25, d2 = 0.5, d3 = 0.25

67 Opaque Schema Matching [KN03]
• Approach: build a complete, labeled graph $G_D$ for each database D
  – Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
• H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5
  Table (A, B, C, D): (a1, b2, c1, d1), (a3, b4, c2, d2), (a1, b1, c1, d2), (a4, b3, c2, d3)
  h(A): a1 = 1.0, a3 = 2.0, a4 = 2.0
  h(B): b1 = 2.0, b2 = 2.0, b3 = 2.0, b4 = 2.0
  h(C): c1 = 1.0, c2 = 1.0
  h(D): d1 = 2.0, d2 = 1.0, d3 = 2.0

68 Opaque Schema Matching [KN03]
• Approach: build a complete, labeled graph $G_D$ for each database D
  – Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
• H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5, I(A;B) = 1.5
  Table (A, B, C, D): (a1, b2, c1, d1), (a3, b4, c2, d2), (a1, b1, c1, d2), (a4, b3, c2, d3)
  h(A): a1 = 1.0, a3 = 2.0, a4 = 2.0
  h(B): b1 = 2.0, b2 = 2.0, b3 = 2.0, b4 = 2.0
  A    B    h(A,B)   i(A;B)   (entries lost in extraction recomputed from the table)
  a1   b2   2.0      1.0
  a3   b4   2.0      2.0
  a1   b1   2.0      1.0
  a4   b3   2.0      2.0
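
The node and edge labels of the graph for the (A, B, C, D) table can be computed directly from the four rows. This is a sketch of the labeling step only, not of the graph matching in [KN03]; `entropy_of`, `node_label` and `edge_label` are names chosen here.

```python
import math
from collections import Counter
from itertools import combinations

rows = [("a1", "b2", "c1", "d1"), ("a3", "b4", "c2", "d2"),
        ("a1", "b1", "c1", "d2"), ("a4", "b3", "c2", "d3")]

def entropy_of(values):
    """Entropy (bits) of the empirical distribution of a list of hashable values."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

columns = {name: [row[i] for row in rows] for i, name in enumerate("ABCD")}

# Node labels: H(X) per column.
node_label = {name: entropy_of(col) for name, col in columns.items()}

# Edge labels: I(X;Y) = H(X) + H(Y) - H(X,Y) per pair of columns.
edge_label = {
    (a, b): node_label[a] + node_label[b] - entropy_of(list(zip(columns[a], columns[b])))
    for a, b in combinations(columns, 2)
}

print(node_label)
print(edge_label)
```

Only value co-occurrence patterns are used, so renaming columns or values leaves every label unchanged, which is what makes the approach usable on opaque data.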

69 Opaque Schema Matching [KN03]
• Approach: build a complete, labeled graph $G_D$ for each database D
  – Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
  Table (A, B, C, D): (a1, b2, c1, d1), (a3, b4, c2, d2), (a1, b1, c1, d2), (a4, b3, c2, d3)
(Figure: complete graph on the nodes A, B, C, D.)

70 Opaque Schema Matching [KN03]
• Approach: build a complete, labeled graph $G_D$ for each database D
  – Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
  – Perform graph matching between $G_{D1}$ and $G_{D2}$, minimizing distance
• [KN03] uses Euclidean and normal distance metrics
(Figure: the graph on W, X, Y, Z matched against the graph on A, B, C, D.)


73 Heterogeneity Identification [DKOSV06]
• Goal: identify columns with semantically heterogeneous values
  – Can arise due to opaque schema matching [KN03]
• Key ideas:
  – Heterogeneity based on distribution, distinguishability of values
  – Use Information Bottleneck to compute soft clustering of values
• Issues:
  – Which information theoretic measure characterizes heterogeneity?
  – How to set parameters in the Information Bottleneck method?

74 Heterogeneity Identification [DKOSV06]
• Example: semantically homogeneous, heterogeneous columns
(Two example Customer_Id columns; the values are not preserved in this transcript apart from phone-number fragments beginning (908) and (877).)


76 Heterogeneity Identification [DKOSV06]
• Example: semantically homogeneous, heterogeneous columns
• More semantic types in column ⇒ greater heterogeneity
  – Only […] versus […] + phone
(Two example Customer_Id columns; the values are not preserved in this transcript apart from phone-number fragments beginning (908) and (877).)

77 Heterogeneity Identification [DKOSV06]
• Example: semantically homogeneous, heterogeneous columns
(Two example Customer_Id columns; the values are not preserved in this transcript apart from phone-number fragments beginning (908) and (877).)

78 Heterogeneity Identification [DKOSV06]
• Example: semantically homogeneous, heterogeneous columns
• Relative distribution of semantic types impacts heterogeneity
  – Mainly […] + few phone versus balanced […] + phone
(Two example Customer_Id columns; the values are not preserved in this transcript apart from phone-number fragments beginning (908) and (877).)

79 Heterogeneity Identification [DKOSV06]
• Example: semantically homogeneous, heterogeneous columns
(Two example Customer_Id columns; the values are not preserved in this transcript apart from phone-number fragments beginning (908) and (877).)


81 Heterogeneity Identification [DKOSV06]
• Example: semantically homogeneous, heterogeneous columns
• More easily distinguished types ⇒ greater heterogeneity
  – Phone + (possibly) SSN versus balanced […] + phone
(Two example Customer_Id columns; the values are not preserved in this transcript apart from phone-number fragments beginning (908) and (877).)

82 Heterogeneity Identification [DKOSV06]
• Heterogeneity = space complexity of soft clustering of the data
  – More, balanced clusters ⇒ greater heterogeneity
  – More distinguishable clusters ⇒ greater heterogeneity
• Soft clustering
  – Soft ⇒ assign probabilities to membership of values in clusters
  – How many clusters: tradeoff between space and quality
  – Use Information Bottleneck to compute soft clustering of values

83 Heterogeneity Identification [DKOSV06]
• Hard clustering
(Table: each value of X = Customer_Id is assigned to exactly one T = Cluster_Id among t1, t2, t3; the Customer_Id values are not preserved in this transcript.)

84 Heterogeneity Identification [DKOSV06]
• Soft clustering: cluster membership probabilities
• How to compute a good soft clustering?
(Table: X = Customer_Id, T = Cluster_Id, p(T|X), with a value possibly split across clusters, for example 0.5 to t1 and 0.5 to t2; most entries are not preserved in this transcript.)

85 Heterogeneity Identification [DKOSV06]
• Represent strings as q-gram distributions
(Table: X = Customer_Id, V = 4-grams, p(X,V); the entries are not preserved in this transcript.)
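
One simple way to turn a string into a q-gram distribution is to count its overlapping substrings of length q and normalize. The sketch below is an assumption about the representation (the exact construction in [DKOSV06] may differ, for instance in padding or weighting), and the input string is a toy value, not one from the slides.

```python
from collections import Counter

def qgram_distribution(s, q=4):
    """Normalized counts of the overlapping length-q substrings of s (empty if len(s) < q)."""
    grams = [s[i:i + q] for i in range(len(s) - q + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

# Toy string used only for illustration.
dist = qgram_distribution("abcabcab", q=4)
print(dist)
```

Each column value then plays the role of an object X with feature distribution p(V|X) over 4-grams, which is the input form the Information Bottleneck clustering needs.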

86 Heterogeneity Identification [DKOSV06]
• iIB: find soft clustering T of X that minimizes $I(T;X) - \beta\, I(T;V)$
• Allow iIB to use arbitrarily many clusters, use $\beta^* = H(X) / I(X;V)$
  – Closest to the point with minimum space and maximum quality
(Table: X = Customer_Id, V = 4-grams, p(X,V); the entries are not preserved in this transcript.)

87 Heterogeneity Identification [DKOSV06]
• Rate distortion curve: I(T;V)/I(X;V) vs I(T;X)/H(X)
(Figure: the curve with the point corresponding to β* marked.)

88 Heterogeneity Identification [DKOSV06]
• Heterogeneity = mutual information I(T;X) of iIB clustering T at β*
• 0 ≤ I(T;X) (= 0.126) ≤ H(X) (= 2.0), H(T) (= 1.0)
  – Ideally use iIB with an arbitrarily large number of clusters in T
(Table: X = Customer_Id, T = Cluster_Id, p(T|X), i(T;X); the entries are not preserved in this transcript.)

89 Heterogeneity Identification [DKOSV06]
• Heterogeneity = mutual information I(T;X) of iIB clustering T at β*

90 Data Integration: Summary
• Analyzing database instance critical for effective data integration
  – Matching and quality assessments are key components
• Information theoretic measures useful for schema matching
  – Align columns when column names, data values are opaque
  – Mutual information I(X;Y) captures correlations between columns X, Y
• Information theoretic measures useful for heterogeneity testing
  – Identify columns with semantically heterogeneous values
  – I(T;X) of iIB clustering T at β* captures column heterogeneity

91 Outline
Part 1
• Introduction to Information Theory
• Application: Data Anonymization
• Application: Data Integration
Part 2
• Review of Information Theory Basics
• Application: Database Design
• Computing Information Theoretic Primitives
• Open Problems

92 Review of Information Theory Basics
• Discrete distribution: probability p(X)
• $p(X,Y) = \sum_z p(X, Y, Z=z)$
  p(X,Y,Z): eight rows with (X,Y) = (x1,y1), (x1,y2), (x1,y1), (x1,y2), (x2,y3), (x2,y3), (x3,y3), (x4,y3); the Z values and probabilities are not preserved in this transcript
  p(X): x1 = 0.5, x2 = 0.25, x3 = 0.125, x4 = 0.125
  p(Y): y1 = 0.25, y2 = 0.25, y3 = 0.5
  p(X,Y): (x1,y1) = 0.25, (x1,y2) = 0.25, (x2,y3) = 0.25, (x3,y3) = 0.125, (x4,y3) = 0.125

93 Review of Information Theory Basics
• Discrete distribution: probability p(X)
• $p(Y) = \sum_x p(X=x, Y) = \sum_x \sum_z p(X=x, Y, Z=z)$
  p(X,Y,Z): eight rows with (X,Y) = (x1,y1), (x1,y2), (x1,y1), (x1,y2), (x2,y3), (x2,y3), (x3,y3), (x4,y3); the Z values and probabilities are not preserved in this transcript
  p(X): x1 = 0.5, x2 = 0.25, x3 = 0.125, x4 = 0.125
  p(Y): y1 = 0.25, y2 = 0.25, y3 = 0.5
  p(X,Y): (x1,y1) = 0.25, (x1,y2) = 0.25, (x2,y3) = 0.25, (x3,y3) = 0.125, (x4,y3) = 0.125

94 Review of Information Theory Basics
• Discrete distribution: conditional probability p(X|Y)
• p(X,Y) = p(X|Y)*p(Y) = p(Y|X)*p(X)
(Table: X, Y, p(X,Y), p(X|Y), p(Y|X) for the rows (x1,y1), (x1,y2), (x2,y3), (x3,y3), (x4,y3); the numeric entries are not preserved in this transcript.)
  p(X): x1 = 0.5, x2 = 0.25, x3 = 0.125, x4 = 0.125
  p(Y): y1 = 0.25, y2 = 0.25, y3 = 0.5

95 Review of Information Theory Basics
• Discrete distribution: entropy H(X)
• $h(x) = \log_2 \frac{1}{p(x)}$
  – $H(X) = \sum_x p(x)\, h(x) = 1.75$
  – $H(Y) = \sum_y p(y)\, h(y) = 1.5$ ($\le \log_2 |Y| = 1.58$)
  – $H(X,Y) = \sum_x \sum_y p(x,y)\, h(x,y) = 2.25$ ($\le \log_2 |X,Y| = 2.32$)
(Tables of p and h values for (X,Y), X and Y are not preserved in this transcript.)

96 Review of Information Theory Basics
• Discrete distribution: conditional entropy H(X|Y)
• $h(x|y) = \log_2 \frac{1}{p(x|y)}$
  – $H(X|Y) = \sum_x \sum_y p(x,y)\, h(x|y) = 0.75$
  – H(X|Y) = H(X,Y) – H(Y) = 2.25 – 1.5 = 0.75
(Tables of p(X,Y), p(X|Y), h(X|Y) and of p and h values for X and Y are not preserved in this transcript.)

97 Review of Information Theory Basics
• Discrete distribution: mutual information I(X;Y)
• $i(x;y) = \log_2 \frac{p(x,y)}{p(x)\,p(y)}$
  – $I(X;Y) = \sum_x \sum_y p(x,y)\, i(x;y) = 1.0$
  – I(X;Y) = H(X) + H(Y) – H(X,Y) = 1.75 + 1.5 – 2.25 = 1.0
(Tables of p(X,Y), h(X,Y), i(X;Y) and of p and h values for X and Y are not preserved in this transcript.)

98 Outline
Part 1
• Introduction to Information Theory
• Application: Data Anonymization
• Application: Data Integration
Part 2
• Review of Information Theory Basics
• Application: Database Design
• Computing Information Theoretic Primitives
• Open Problems

99 Information Dependencies [DR00]
• Goal: use information theory to examine and reason about information content of the attributes in a relation instance
• Key ideas:
  – Novel InD measure between attribute sets X, Y based on H(Y|X)
  – Identify numeric inequalities between InD measures
• Results:
  – InD measures are a broader class than FDs and MVDs
  – Armstrong axioms for FDs derivable from InD inequalities
  – MVD inference rules derivable from InD inequalities

100 Information Dependencies [DR00]
Functional dependency: X → Y
  – FD X → Y holds iff ∀ t1, t2 ((t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y]))
  X   Y   Z
  x1  y1  z1
  x1  y2  z2
  x1  y1  z2
  x1  y2  z1
  x2  y3  z3
  x2  y3  z4
  x3  y3  z5
  x4  y3  z6

102 Information Dependencies [DR00]
Result: FD X → Y holds iff H(Y|X) = 0
  – Intuition: once X is known, no remaining uncertainty in Y
In this instance H(Y|X) = 0.5 > 0, so X → Y does not hold (x1 appears with both y1 and y2)
  X   Y   p(X,Y)  p(Y|X)  h(Y|X)      X   p(X)       Y   p(Y)
  x1  y1  0.25    0.5     1.0         x1  0.5        y1  0.25
  x1  y2  0.25    0.5     1.0         x2  0.25       y2  0.25
  x2  y3  0.25    1.0     0.0         x3  0.125      y3  0.5
  x3  y3  0.125   1.0     0.0         x4  0.125
  x4  y3  0.125   1.0     0.0
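
The FD criterion translates into a short check: compute H(Y|X) under the empirical distribution of the instance and compare it with zero. This is a sketch under that reading; `cond_entropy` and `fd_holds` are illustrative names, and `S` is the second example instance from the normal-forms slides.

```python
from collections import Counter
from math import log2

def cond_entropy(rows, given, target):
    """H(target | given) in bits; rows are tuples, given/target are column indices."""
    n = len(rows)
    joint = Counter((tuple(r[i] for i in given), tuple(r[i] for i in target))
                    for r in rows)
    marginal = Counter(tuple(r[i] for i in given) for r in rows)
    # sum over (g, t) of p(g, t) * log2(p(g) / p(g, t))
    return sum((c / n) * log2(marginal[g] / c) for (g, _t), c in joint.items())

def fd_holds(rows, lhs, rhs, tol=1e-12):
    """FD lhs -> rhs holds iff H(rhs | lhs) = 0 (up to floating-point tolerance)."""
    return cond_entropy(rows, lhs, rhs) <= tol

# Instance from the Information Dependencies slides (columns X, Y, Z).
R = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x1", "y1", "z2"), ("x1", "y2", "z1"),
     ("x2", "y3", "z3"), ("x2", "y3", "z4"), ("x3", "y3", "z5"), ("x4", "y3", "z6")]
# Instance from the Database Normal Forms slides, where X -> Y is intended to hold.
S = [("x1", "y1", "z1"), ("x1", "y1", "z2"), ("x2", "y2", "z3"),
     ("x2", "y2", "z4"), ("x3", "y3", "z5"), ("x4", "y4", "z6")]

print(cond_entropy(R, [0], [1]), fd_holds(R, [0], [1]), fd_holds(S, [0], [1]))
```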

103 Information Dependencies [DR00]
Multi-valued dependency: X →→ Y
  – MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
  X   Y   Z
  x1  y1  z1
  x1  y2  z2
  x1  y1  z2
  x1  y2  z1
  x2  y3  z3
  x2  y3  z4
  x3  y3  z5
  x4  y3  z6

104 Information Dependencies [DR00]
Multi-valued dependency: X →→ Y
  – MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
  X   Y   Z        X   Y        X   Z
  x1  y1  z1       x1  y1       x1  z1
  x1  y2  z2       x1  y2       x1  z2
  x1  y1  z2   =   x2  y3   ⋈   x2  z3
  x1  y2  z1       x3  y3       x2  z4
  x2  y3  z3       x4  y3       x3  z5
  x2  y3  z4                    x4  z6
  x3  y3  z5
  x4  y3  z6

106 Information Dependencies [DR00]
Result: MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
  – Intuition: once X known, uncertainties in Y and Z are independent
H(Y|X) = 0.5, H(Z|X) = 0.75, H(Y,Z|X) = 1.25
  X   Y   Z   h(Y,Z|X)      X   Y   h(Y|X)      X   Z   h(Z|X)
  x1  y1  z1  2.0           x1  y1  1.0         x1  z1  1.0
  x1  y2  z2  2.0           x1  y2  1.0         x1  z2  1.0
  x1  y1  z2  2.0           x2  y3  0.0         x2  z3  1.0
  x1  y2  z1  2.0           x3  y3  0.0         x2  z4  1.0
  x2  y3  z3  1.0           x4  y3  0.0         x3  z5  0.0
  x2  y3  z4  1.0                               x4  z6  0.0
  x3  y3  z5  0.0
  x4  y3  z6  0.0
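
The MVD criterion can be tested the same way, using H(A|X) = H(X,A) – H(X) for each conditional term. A minimal sketch, assuming the instance is a set of tuples with the uniform (empirical) distribution; `H` and `mvd_holds` are illustrative names.

```python
from collections import Counter
from math import log2

def H(rows, cols):
    """Entropy (bits) of the projection of rows onto the column indices cols."""
    n = len(rows)
    counts = Counter(tuple(r[i] for i in cols) for r in rows)
    return sum((c / n) * log2(n / c) for c in counts.values())

def mvd_holds(rows, x, y, z, tol=1e-9):
    """X ->> Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)."""
    hx = H(rows, x)
    h_yz_given_x = H(rows, x + y + z) - hx
    h_y_given_x = H(rows, x + y) - hx
    h_z_given_x = H(rows, x + z) - hx
    return abs(h_yz_given_x - (h_y_given_x + h_z_given_x)) <= tol

# Instance from the slides (columns X, Y, Z).
R = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x1", "y1", "z2"), ("x1", "y2", "z1"),
     ("x2", "y3", "z3"), ("x2", "y3", "z4"), ("x3", "y3", "z5"), ("x4", "y3", "z6")]
# Dropping one tuple breaks the cross product within the x1 group.
R_broken = [t for t in R if t != ("x1", "y2", "z1")]

print(mvd_holds(R, [0], [1], [2]), mvd_holds(R_broken, [0], [1], [2]))
```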

107 Information Dependencies [DR00]
Result: Armstrong axioms for FDs derivable from InD inequalities
Reflexivity: if Y ⊆ X, then X → Y
  – H(Y|X) = 0 for Y ⊆ X
Augmentation: X → Y ⇒ X,Z → Y,Z
  – 0 ≤ H(Y,Z|X,Z) = H(Y|X,Z) ≤ H(Y|X) = 0
Transitivity: X → Y & Y → Z ⇒ X → Z
  – 0 = H(Y|X) + H(Z|Y) ≥ H(Z|X) ≥ 0

108 Database Normal Forms
Goal: eliminate update anomalies by good database design
  – Need to know the integrity constraints on all database instances
Boyce-Codd normal form:
  – Input: a set Σ of functional dependencies
  – For every (non-trivial) FD R.X → R.Y ∈ Σ+, R.X is a key of R
4NF:
  – Input: a set Σ of functional and multi-valued dependencies
  – For every (non-trivial) MVD R.X →→ R.Y ∈ Σ+, R.X is a key of R

109 Database Normal Forms
Functional dependency: X → Y
  – Which design is better?
  X   Y   Z        X   Y        X   Z
  x1  y1  z1       x1  y1       x1  z1
  x1  y1  z2       x2  y2       x1  z2
  x2  y2  z3   =   x3  y3   ⋈   x2  z3
  x2  y2  z4       x4  y4       x2  z4
  x3  y3  z5                    x3  z5
  x4  y4  z6                    x4  z6

110 Database Normal Forms
Functional dependency: X → Y
  – Which design is better?
Decomposition is in BCNF
  X   Y   Z        X   Y        X   Z
  x1  y1  z1       x1  y1       x1  z1
  x1  y1  z2       x2  y2       x1  z2
  x2  y2  z3   =   x3  y3   ⋈   x2  z3
  x2  y2  z4       x4  y4       x2  z4
  x3  y3  z5                    x3  z5
  x4  y4  z6                    x4  z6

111 Database Normal Forms
Multi-valued dependency: X →→ Y
  – Which design is better?
  X   Y   Z        X   Y        X   Z
  x1  y1  z1       x1  y1       x1  z1
  x1  y2  z2       x1  y2       x1  z2
  x1  y1  z2   =   x2  y3   ⋈   x2  z3
  x1  y2  z1       x3  y3       x2  z4
  x2  y3  z3       x4  y3       x3  z5
  x2  y3  z4                    x4  z6
  x3  y3  z5
  x4  y3  z6

112 Database Normal Forms
Multi-valued dependency: X →→ Y
  – Which design is better?
Decomposition is in 4NF
  X   Y   Z        X   Y        X   Z
  x1  y1  z1       x1  y1       x1  z1
  x1  y2  z2       x1  y2       x1  z2
  x1  y1  z2   =   x2  y3   ⋈   x2  z3
  x1  y2  z1       x3  y3       x2  z4
  x2  y3  z3       x4  y3       x3  z5
  x2  y3  z4                    x4  z6
  x3  y3  z5
  x4  y3  z6

113 Well-Designed Databases [AL03]
Goal: use information theory to characterize “goodness” of a database design and reason about normalization algorithms
Key idea:
  – Information content measure of cell in a DB instance w.r.t. ICs
  – Redundancy reduces information content measure of cells
Results:
  – Well-designed DB ⇔ each cell has information content > 0
  – Normalization algorithms never decrease information content

114 Well-Designed Databases [AL03]
Information content of cell c in database D satisfying FD X → Y
  – Uniform distribution p(V) on values for c consistent with D\c and FD
  – Information content of cell c is entropy H(V)
H(V62) = 2.0
  X   Y   Z        V62  p(V62)  h(V62)
  x1  y1  z1       y1   0.25    2.0
  x1  y1  z2       y2   0.25    2.0
  x2  y2  z3       y3   0.25    2.0
  x2  y2  z4       y4   0.25    2.0
  x3  y3  z5
  x4  y4  z6

115 Well-Designed Databases [AL03]
Information content of cell c in database D satisfying FD X → Y
  – Uniform distribution p(V) on values for c consistent with D\c and FD
  – Information content of cell c is entropy H(V)
H(V22) = 0.0
  X   Y   Z        V22  p(V22)  h(V22)
  x1  y1  z1       y1   1.0     0.0
  x1  y1  z2       y2   0.0
  x2  y2  z3       y3   0.0
  x2  y2  z4       y4   0.0
  x3  y3  z5
  x4  y4  z6
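
The two cell examples can be reproduced with a brute-force sketch of the definition as stated on these slides: try every candidate value in the cell, keep those for which the instance still satisfies the FD, and take the entropy of the uniform distribution over them (log2 of their number). The formal measure in [AL03] is more involved; the function names and the candidate domain {y1, ..., y4} are assumptions for illustration.

```python
from math import log2

def fd_ok(rows, lhs, rhs):
    """True if the FD lhs -> rhs holds in rows (column indices)."""
    seen = {}
    for r in rows:
        key = tuple(r[i] for i in lhs)
        val = tuple(r[i] for i in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

def cell_information(rows, row, col, domain, lhs, rhs):
    """Entropy (bits) of the uniform distribution over values of cell (row, col)
    that keep the FD satisfied; simplified version of the slides' definition."""
    consistent = 0
    for v in domain:
        candidate = [list(r) for r in rows]
        candidate[row][col] = v
        if fd_ok(candidate, lhs, rhs):
            consistent += 1
    return log2(consistent) if consistent else float("nan")

D = [("x1", "y1", "z1"), ("x1", "y1", "z2"), ("x2", "y2", "z3"),
     ("x2", "y2", "z4"), ("x3", "y3", "z5"), ("x4", "y4", "z6")]
Y_DOMAIN = ["y1", "y2", "y3", "y4"]  # assumed candidate values for Y cells

h_v62 = cell_information(D, 5, 1, Y_DOMAIN, [0], [1])  # row 6, column 2
h_v22 = cell_information(D, 1, 1, Y_DOMAIN, [0], [1])  # row 2, column 2
print(h_v62, h_v22)
```

Cells are indexed from zero in the code, so V62 is `(5, 1)` and V22 is `(1, 1)`.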

116 Well-Designed Databases [AL03]
Information content of cell c in database D satisfying FD X → Y
  – Information content of cell c is entropy H(V)
Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0, for all cells c in D
  – Technicalities w.r.t. size of active domain
  X   Y   Z
  x1  y1  z1
  x1  y1  z2
  x2  y2  z3
  x2  y2  z4
  x3  y3  z5
  x4  y4  z6
  [Table of H(V) per cell c; entries not recoverable]

117 Well-Designed Databases [AL03]
Information content of cell c in database D satisfying FD X → Y
  – Information content of cell c is entropy H(V)
H(V12) = 2.0, H(V42) = 2.0
  X   Y        X   Z        V12  p(V12)  h(V12)      V42  p(V42)  h(V42)
  x1  y1       x1  z1       y1   0.25    2.0         y1   0.25    2.0
  x2  y2       x1  z2       y2   0.25    2.0         y2   0.25    2.0
  x3  y3       x2  z3       y3   0.25    2.0         y3   0.25    2.0
  x4  y4       x2  z4       y4   0.25    2.0         y4   0.25    2.0
               x3  z5
               x4  z6

118 Well-Designed Databases [AL03]
Information content of cell c in database D satisfying FD X → Y
  – Information content of cell c is entropy H(V)
Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0, for all cells c in D
  X   Y        X   Z
  x1  y1       x1  z1
  x2  y2       x1  z2
  x3  y3       x2  z3
  x4  y4       x2  z4
               x3  z5
               x4  z6
  [Table of H(V) per cell c; entries not recoverable]

119 Well-Designed Databases [AL03]
Information content of cell c in DB D satisfying MVD X →→ Y
  – Information content of cell c is entropy H(V)
H(V52) = 0.0, H(V53) = 2.32
  X   Y   Z        V52  p(V52)  h(V52)      V53  p(V53)  h(V53)
  x1  y1  z1       y3   1.0     0.0         z1   0.2     2.32
  x1  y2  z2                                z2   0.2     2.32
  x1  y1  z2                                z3   0.2     2.32
  x1  y2  z1                                z4   0.0
  x2  y3  z3                                z5   0.2     2.32
  x2  y3  z4                                z6   0.2     2.32
  x3  y3  z5
  x4  y3  z6

120 Well-Designed Databases [AL03]
Information content of cell c in DB D satisfying MVD X →→ Y
  – Information content of cell c is entropy H(V)
Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0, for all cells c in D
  X   Y   Z
  x1  y1  z1
  x1  y2  z2
  x1  y1  z2
  x1  y2  z1
  x2  y3  z3
  x2  y3  z4
  x3  y3  z5
  x4  y3  z6
  [Two tables of H(V) per cell c; entries not recoverable]

121 Well-Designed Databases [AL03]
Information content of cell c in DB D satisfying MVD X →→ Y
  – Information content of cell c is entropy H(V)
H(V32) = 1.58, H(V34) = 2.32
  X   Y        X   Z        V32  p(V32)  h(V32)      V34  p(V34)  h(V34)
  x1  y1       x1  z1       y1   0.33    1.58        z1   0.2     2.32
  x1  y2       x1  z2       y2   0.33    1.58        z2   0.2     2.32
  x2  y3       x2  z3       y3   0.33    1.58        z3   0.2     2.32
  x3  y3       x2  z4                                z4   0.0
  x4  y3       x3  z5                                z5   0.2     2.32
               x4  z6                                z6   0.2     2.32

122 Well-Designed Databases [AL03]
Information content of cell c in DB D satisfying MVD X →→ Y
  – Information content of cell c is entropy H(V)
Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0, for all cells c in D
  X   Y        X   Z
  x1  y1       x1  z1
  x1  y2       x1  z2
  x2  y3       x2  z3
  x3  y3       x2  z4
  x4  y3       x3  z5
               x4  z6
  [Two tables of H(V) per cell c; entries not recoverable]

123 Well-Designed Databases [AL03]
Normalization algorithms never decrease information content
  – Information content of cell c is entropy H(V)
  X   Y   Z
  x1  y1  z1
  x1  y2  z2
  x1  y1  z2
  x1  y2  z1
  x2  y3  z3
  x2  y3  z4
  x3  y3  z5
  x4  y3  z6
  [Table of H(V) per cell c; entries not recoverable]

124 Well-Designed Databases [AL03]
Normalization algorithms never decrease information content
  – Information content of cell c is entropy H(V)
  X   Y   Z        X   Y        X   Z
  x1  y1  z1       x1  y1       x1  z1
  x1  y2  z2       x1  y2       x1  z2
  x1  y1  z2   =   x2  y3   ⋈   x2  z3
  x1  y2  z1       x3  y3       x2  z4
  x2  y3  z3       x4  y3       x3  z5
  x2  y3  z4                    x4  z6
  x3  y3  z5
  x4  y3  z6
  [Tables of H(V) per cell c, before and after decomposition; entries not recoverable]

126 Database Design: Summary
Good database design essential for preserving data integrity
Information theoretic measures useful for integrity constraints
  – FD X → Y holds iff InD measure H(Y|X) = 0
  – MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
  – Information theory to model correlations in specific database
Information theoretic measures useful for normal forms
  – Schema S is in BCNF/4NF iff ∀ D ∈ S, H(V) > 0, for all cells c in D
  – Information theory to model distributions over possible databases

127 Outline
Part 1
  Introduction to Information Theory
  Application: Data Anonymization
  Application: Data Integration
Part 2
  Review of Information Theory Basics
  Application: Database Design
  Computing Information Theoretic Primitives
  Open Problems

128 Domain size matters
For random variable X, domain size = |supp(X)|, where supp(X) = {xi | p(X = xi) > 0}
Different solutions exist depending on whether domain size is “small” or “large”
Probability vectors usually very sparse

129 Entropy: Case I - Small domain size
Suppose the number of unique values for a random variable X is small (i.e., fits in memory)
Maximum likelihood estimator:
  – p(x) = (number of times x is encountered) / (total number of items in set)

130 Entropy: Case I - Small domain size
H_MLE = ∑_x p(x) log(1/p(x))
This is a biased estimate:
  – E[H_MLE] ≤ H
Miller-Madow correction:
  – H' = H_MLE + (m' – 1)/(2n), where m' is an estimate of the number of non-empty bins and n is the number of samples
Bad news: ALL estimators for H are biased.
Good news: we can quantify bias and variance of MLE (m is the number of bins):
  – |Bias| ≤ log(1 + m/n)
  – Var(H_MLE) ≤ (log n)²/n
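
A minimal sketch of the plug-in (MLE) estimator and the Miller-Madow correction described on this slide. It assumes natural logarithms, so the result is in nats, and takes m' to be the number of distinct values observed in the sample; both choices are assumptions made for illustration.

```python
from collections import Counter
from math import log

def entropy_mle(samples):
    """Plug-in entropy estimate in nats: sum of p(x) * log(1/p(x)), p(x) = count/n."""
    counts = Counter(samples)
    n = sum(counts.values())
    return sum((c / n) * log(n / c) for c in counts.values())

def entropy_miller_madow(samples):
    """MLE estimate plus the (m' - 1) / (2n) bias correction."""
    counts = Counter(samples)
    n = sum(counts.values())
    m_nonempty = len(counts)  # observed non-empty bins, used as m'
    return entropy_mle(samples) + (m_nonempty - 1) / (2 * n)

samples = ["a", "a", "b", "b"]
h_mle = entropy_mle(samples)
h_mm = entropy_miller_madow(samples)
print(h_mle, h_mm)
```

The correction is always non-negative, which matches the direction of the bias: the plug-in estimate tends to undershoot the true entropy.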

131 Entropy: Case II - Large domain size
|X| is too large to fit in main memory, so we can’t maintain explicit counts.
Streaming algorithms for H(X):
  – Long history of work on this problem
  – Bottom line: (1+ε)-relative-approximation for H(X) that allows for updates to frequencies, and requires “almost constant”, and optimal, space [HNO08].

132 Streaming Entropy [CCM07]
High level idea: sample randomly from the stream, and track counts of elements picked [AMS]
PROBLEM: skewed distribution prevents us from sampling lower-frequency elements (and entropy is small)
Idea: estimate largest frequency, and distribution of what’s left (higher entropy)

133 Streaming Entropy [CCM07]
Maintain set of samples from original distribution and distribution without most frequent element.
In parallel, maintain estimator for frequency of most frequent element
  – normally this is hard
  – but if frequency is very large, then simple estimator exists [MG81] (Google interview puzzle!)
At the end, compute function of these two estimates
Memory usage: roughly (1/ε²) log(1/ε) (ε is the error)
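
The “simple estimator” for a very frequent element is commonly identified with the Misra-Gries frequent-items summary, which generalizes the majority-element puzzle. The sketch below assumes that reading of [MG81]; it is not taken from [CCM07], and the function name is illustrative. With k – 1 counters, any item occurring more than n/k times in a stream of length n is guaranteed to survive, and its counter undercounts by at most n/k.

```python
def misra_gries(stream, k):
    """Misra-Gries summary with at most k - 1 counters.

    Every item with frequency > n/k (n = stream length) is kept; each
    stored count is a lower bound on the true frequency, off by at most n/k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = list("aaabacada")          # 'a' occurs 6 times out of 9
summary = misra_gries(stream, k=2)  # k = 2 is the majority-element case
print(summary)
```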

134 Entropy and MI are related
I(X;Y) = H(X) + H(Y) – H(X,Y)
Suppose we can c-approximate H(X) for any c > 0: find H'(X) s.t. |H(X) – H'(X)| ≤ c
Then we can 3c-approximate I(X;Y):
  – I(X;Y) = H(X) + H(Y) – H(X,Y)
           ≤ (H'(X) + c) + (H'(Y) + c) – (H'(X,Y) – c)
           = H'(X) + H'(Y) – H'(X,Y) + 3c
           = I'(X;Y) + 3c
  – The symmetric argument gives I(X;Y) ≥ I'(X;Y) – 3c
Similarly, we can 2c-approximate H(Y|X) = H(X,Y) – H(X)
Estimating entropy allows us to estimate I(X;Y) and H(Y|X)

135 Computing KL-divergence: Small Domains
“Easy algorithm”: maintain counts for each of p and q, normalize, and compute KL-divergence.
PROBLEM! Suppose q_i = 0:
  – p_i log(p_i/q_i) is undefined!
General problem with ML estimators: all events not seen have probability zero!
  – Laplace correction: add one to the count of every element of the domain, seen or not
  – Slightly better: add 0.5 to each count [KT81]
  – Even better, more involved: use Good-Turing estimator [GT53]
These yield non-zero probability for “things not seen”.
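
The additive corrections above can be sketched as follows, assuming the domain is known in advance so that unseen elements can receive a pseudo-count. `alpha=1.0` corresponds to the Laplace correction and `alpha=0.5` to the [KT81] variant; Good-Turing is not covered. Function names are illustrative.

```python
from collections import Counter
from math import log2

def smoothed_dist(samples, domain, alpha=1.0):
    """Add alpha to the count of every domain element, then normalize."""
    counts = Counter(samples)
    total = len(samples) + alpha * len(domain)
    return {v: (counts[v] + alpha) / total for v in domain}

def kl_divergence(p, q):
    """d_KL(p, q) in bits; assumes q[v] > 0 wherever p[v] > 0."""
    return sum(pv * log2(pv / q[v]) for v, pv in p.items() if pv > 0)

domain = ["a", "b", "c"]
p = smoothed_dist(["a", "a", "b"], domain, alpha=1.0)
q = smoothed_dist(["a", "c"], domain, alpha=1.0)  # "b" never seen in q's sample
print(kl_divergence(p, q))
```

Without smoothing, q("b") would be zero and the divergence undefined; with it, every term is finite.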

136 Computing KL-divergence: Large Domains
Bad news: no good relative-approximations exist in small space.
(Partial) good news: additive approximations in small space under certain technical conditions (no p_i is too small).
(Partial) good news: additive approximations for symmetric variant of KL-divergence, via sampling.
For details, see [GMV08, GIM08]

137 Information-theoretic Clustering
Given a collection of random variables X, each “explained” by a random variable Y, we wish to find a (hard or soft) clustering T such that I(T;X) – β I(T;Y) is minimized.
Features of solutions thus far:
  – heuristic (general problem is NP-hard)
  – address both small-domain and large-domain scenarios.

138 Agglomerative Clustering (aIB) [ST00]
Fix number of clusters k
  1. While number of clusters > k (starting from one cluster per element):
       1. Determine the two clusters whose merge loses the least information
       2. Combine these two clusters
  2. Output clustering
Merge criterion:
  – merge the two clusters so that the change in I(T;V) is minimized
Note: no consideration of β (number of clusters is fixed)

139 Agglomerative Clustering (aIB) [S]
Elegant way of finding the two clusters to be merged:
Let d_JS(p,q) = (1/2)(d_KL(p,m) + d_KL(q,m)), m = (p+q)/2
d_JS(p,q) is a symmetric distance between p, q (Jensen-Shannon distance)
We merge clusters that have smallest d_JS(p,q) (weighted by cluster mass)
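
A greedy sketch of the agglomerative procedure with a Jensen-Shannon merge cost. It assumes the mass-weighted form of the cost: the mixture m weights each cluster by its share of the combined mass, and the divergence is scaled by that combined mass. The slide's m = (p+q)/2 is the equal-mass special case. Each cluster is a hypothetical `(mass, distribution over V)` pair, with strictly positive mass.

```python
from itertools import combinations
from math import log2

def kl(p, q):
    """d_KL(p, q) in bits for two distributions given as equal-length lists."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)

def merge_cost(mass_p, p, mass_q, q):
    """Mass-weighted JS cost of merging two clusters, plus the merged center."""
    total = mass_p + mass_q
    wp, wq = mass_p / total, mass_q / total
    m = [wp * a + wq * b for a, b in zip(p, q)]
    return total * (wp * kl(p, m) + wq * kl(q, m)), m

def agglomerate(clusters, k):
    """Greedily merge the cheapest pair until only k clusters remain."""
    clusters = list(clusters)
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: merge_cost(*clusters[ij[0]], *clusters[ij[1]])[0])
        _cost, center = merge_cost(*clusters[i], *clusters[j])
        merged = (clusters[i][0] + clusters[j][0], center)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

start = [(0.25, [1.0, 0.0]), (0.25, [0.9, 0.1]), (0.5, [0.0, 1.0])]
result = agglomerate(start, k=2)
print(result)
```

The pairwise search is quadratic per merge, which is fine for a sketch but is exactly the cost that summary-based methods for massive data try to avoid.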

140 Iterative Information Bottleneck (iIB) [S]
aIB yields a hard clustering with k clusters.
If you want a soft clustering, use iIB (variant of EM)
  – Step 1: p(t|x) ∝ exp(–β d_KL(p(V|x), p(V|t))): assign elements to clusters in proportion (exponentially) to distance from cluster center!
  – Step 2: compute new cluster centers by computing weighted centroids:
      p(t) = ∑_x p(t|x) p(x)
      p(V|t) = ∑_x p(V|x) p(t|x) p(x) / p(t)
  – Choose β according to [DKOSV06]
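
One iIB iteration, sketched directly from the two steps on this slide. It starts from a soft assignment p(t|x), recomputes the centers (Step 2), then reassigns (Step 1). Note that the standard information bottleneck update also multiplies the exponential by the prior p(t); this sketch follows the slide and omits it. The function name, argument layout and the value of β are assumptions for illustration.

```python
from math import exp, log2

def kl(p, q):
    """d_KL(p, q) in bits; assumes q > 0 wherever p > 0."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)

def iib_step(p_x, p_v_given_x, p_t_given_x, beta):
    """One iteration: weighted centroids, then exponential reassignment."""
    n_x, n_t, n_v = len(p_x), len(p_t_given_x[0]), len(p_v_given_x[0])
    # Step 2: p(t) = sum_x p(t|x) p(x); p(V|t) = sum_x p(V|x) p(t|x) p(x) / p(t)
    p_t = [sum(p_t_given_x[x][t] * p_x[x] for x in range(n_x)) for t in range(n_t)]
    p_v_given_t = [[sum(p_v_given_x[x][v] * p_t_given_x[x][t] * p_x[x]
                        for x in range(n_x)) / p_t[t]
                    for v in range(n_v)] for t in range(n_t)]
    # Step 1: p(t|x) proportional to exp(-beta * d_KL(p(V|x), p(V|t)))
    new_assign = []
    for x in range(n_x):
        w = [exp(-beta * kl(p_v_given_x[x], p_v_given_t[t])) for t in range(n_t)]
        z = sum(w)
        new_assign.append([wi / z for wi in w])
    return new_assign, p_t, p_v_given_t

p_x = [0.25, 0.25, 0.25, 0.25]
p_v_given_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
init = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]]
assign, p_t, centers = iib_step(p_x, p_v_given_x, init, beta=5.0)
print(assign)
```

Because every element contributes to every center with non-zero weight, the centers have full support and the KL terms stay finite.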

141 Dealing with massive data sets
Clustering on massive data sets is a problem
Two main heuristics:
  – Sampling [DKOSV06]: pick a small sample of the data, cluster it, and (if necessary) assign remaining points to clusters using soft assignment. How many points to sample to get good bounds?
  – Streaming: scan the data in one pass, performing clustering on the fly. How much memory is needed to get a reasonable-quality solution?

142 LIMBO (for aIB) [ATMS04]
BIRCH-like idea:
  – Maintain (sparse) summary for each cluster (p(t), p(V|t))
  – As data streams in, build clusters on groups of objects
  – Build next-level clusters on cluster summaries from lower level

143 Outline
Part 1
  Introduction to Information Theory
  Application: Data Anonymization
  Application: Data Integration
Part 2
  Review of Information Theory Basics
  Application: Database Design
  Computing Information Theoretic Primitives
  Open Problems

144 Open Problems
Data exploration and mining: information theory as first-pass filter
Relation to nonparametric generative models in machine learning (LDA, PPCA, ...)
Engineering and stability: finding right knobs to make systems reliable and scalable
Other information-theoretic concepts? (rate distortion, higher-order entropy, ...)
THANK YOU!

145 References: Information Theory
[CT] Tom Cover and Joy Thomas: Information Theory.
[BMDG05] Arindam Banerjee, Srujana Merugu, Inderjit Dhillon, Joydeep Ghosh: Learning with Bregman Divergences. JMLR
[TPB98] Naftali Tishby, Fernando Pereira, William Bialek: The Information Bottleneck Method. Proc. 37th Annual Allerton Conference, 1998

146 References: Data Anonymization
[AA01] Dakshi Agrawal, Charu C. Aggarwal: On the design and quantification of privacy preserving data mining algorithms. PODS
[AS00] Rakesh Agrawal, Ramakrishnan Srikant: Privacy preserving data mining. SIGMOD
[EGS03] Alexandre Evfimievski, Johannes Gehrke, Ramakrishnan Srikant: Limiting privacy breaches in privacy preserving data mining. PODS

147 References: Data Integration
[AMT04] Periklis Andritsos, Renee J. Miller, Panayiotis Tsaparas: Information-theoretic tools for mining database structure from large data sets. SIGMOD
[DKOSV06] Bing Tian Dai, Nick Koudas, Beng Chin Ooi, Divesh Srivastava, Suresh Venkatasubramanian: Rapid identification of column heterogeneity. ICDM
[DKSTV08] Bing Tian Dai, Nick Koudas, Divesh Srivastava, Anthony K. H. Tung, Suresh Venkatasubramanian: Validating multi-column schema matchings by type. ICDE
[KN03] Jaewoo Kang, Jeffrey F. Naughton: On schema matching with opaque column names and data values. SIGMOD
[PPH05] Patrick Pantel, Andrew Philpot, Eduard Hovy: An information theoretic model for database alignment. SSDBM

148 References: Database Design
[AL03] Marcelo Arenas, Leonid Libkin: An information theoretic approach to normal forms for relational and XML data. PODS
[AL05] Marcelo Arenas, Leonid Libkin: An information theoretic approach to normal forms for relational and XML data. JACM 52(2)
[DR00] Mehmet M. Dalkilic, Edward L. Robertson: Information dependencies. PODS
[KL06] Solmaz Kolahi, Leonid Libkin: On redundancy vs dependency preservation in normalization: an information-theoretic study of XML. PODS

149 References: Computing IT quantities
[P03] Liam Paninski: Estimation of entropy and mutual information. Neural Computation 15
[GT53] I. J. Good: Turing’s anticipation of Empirical Bayes in connection with the cryptanalysis of the Naval Enigma. Journal of Statistical Computation and Simulation, 66(2)
[KT81] R. E. Krichevsky and V. K. Trofimov: The performance of universal encoding. IEEE Trans. Inform. Th. 27 (1981)
[CCM07] Amit Chakrabarti, Graham Cormode and Andrew McGregor: A near-optimal algorithm for computing the entropy of a stream. Proc. SODA
[HNO08] Nick Harvey, Jelani Nelson, Krzysztof Onak: Sketching and Streaming Entropy via Approximation Theory. FOCS 2008
[ATMS04] Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller and Kenneth C. Sevcik: LIMBO: Scalable Clustering of Categorical Data. EDBT 2004

150 References: Computing IT quantities
[S] Noam Slonim: The Information Bottleneck: theory and applications. Ph.D. Thesis, Hebrew University
[GMV08] Sudipto Guha, Andrew McGregor, Suresh Venkatasubramanian: Streaming and sublinear approximations for information distances. ACM Trans Alg
[GIM08] Sudipto Guha, Piotr Indyk, Andrew McGregor: Sketching Information Distances. JMLR

