Population Stratification with Limited Data By Kamalika Chaudhuri, Eran Halperin, Satish Rao and Shuheng Zhou.


1 Population Stratification with Limited Data By Kamalika Chaudhuri, Eran Halperin, Satish Rao and Shuheng Zhou

2 The Problem
Given:
- Samples from two hidden distributions P1 and P2
- Unknown labels
Each sample/individual has k features with 0/1 values:
- Population P1: feature f is 1 w.p. p1f
- Population P2: feature f is 1 w.p. p2f
- Unknown feature probabilities

3 The Problem
Given:
- 2n samples from two hidden distributions P1 and P2
- Unknown labels
Goal: classify each individual correctly, for most inputs
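To make the data model concrete, here is a minimal sketch (not from the talk) of a sampler for the setting above, written in Python/NumPy; the function name `sample_mixture` and its interface are illustrative assumptions.

```python
import numpy as np

def sample_mixture(n, k, p1, p2, rng=None):
    """Draw n individuals from each of two product distributions over {0,1}^k.

    p1[f] (resp. p2[f]) is the probability that feature f equals 1 in
    population P1 (resp. P2).  Returns the shuffled 2n-by-k data matrix
    together with the hidden labels, which the clustering algorithm never sees.
    """
    rng = np.random.default_rng() if rng is None else rng
    X1 = (rng.random((n, k)) < np.asarray(p1)).astype(int)  # samples from P1
    X2 = (rng.random((n, k)) < np.asarray(p2)).astype(int)  # samples from P2
    X = np.vstack([X1, X2])
    labels = np.array([0] * n + [1] * n)
    perm = rng.permutation(2 * n)  # hide the labels by shuffling the rows
    return X[perm], labels[perm]
```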

4 Applications
- Preprocessing step in statistical analysis:
  - Analyze the factors that cause a complex disease, such as cancer
  - Cluster the samples into populations, then apply statistical analysis
- Collaborative filtering:
  - A feature can be "likes Star Wars or not"
  - Cluster users into types using the features

5 The Problem
Given:
- Samples from two hidden distributions P1 and P2
- Unknown labels
Need: some separation between the distributions

6 Our Results
Need some separation between the distributions!
Measure of separation: distance between means
- γ = (L1 distance between means) / k
- σ = (L2² distance between means) / k
Our results:
- Optimization function and poly-time algorithm: γk = Ω(√k log n)
- Optimization function: σk = Ω(log n)
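A small sketch of the two separation measures above (using the reconstructed symbols γ and σ); in practice the mean vectors are unknown, so this is only a way to state the quantities, not something the algorithm can compute.

```python
import numpy as np

def separations(p1, p2):
    """Return (gamma, sigma) for two mean vectors p1, p2 of length k:
    gamma = (L1 distance between means) / k,
    sigma = (squared L2 distance between means) / k."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    k = p1.size
    gamma = np.abs(p1 - p2).sum() / k
    sigma = ((p1 - p2) ** 2).sum() / k
    return gamma, sigma
```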

7 Our Results
This talk: optimization function and poly-time algorithm for γk = Ω(√k log n)
Example:
- P1: for each feature f, p1f = ½
- P2: for each feature f, p2f = ½ + √(log n)/√k
Information-theoretically optimal:
- There exist two distributions with this separation and constant overlap in probability mass
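Plugging in the example: with p1f = ½ and p2f = ½ + √(log n / k) for every feature, the L1 distance between the means is k·√(log n / k) = √(k log n), i.e. γk = √(k log n). A quick check (the values of n and k are illustrative, not from the talk):

```python
import numpy as np

n, k = 1000, 500                             # illustrative values
per_feature_gap = np.sqrt(np.log(n) / k)     # p2f - p1f for every feature
l1_distance = k * per_feature_gap            # = gamma * k
print(l1_distance, np.sqrt(k * np.log(n)))   # the two numbers coincide
```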

8 Optimization Function
Which measure should we optimize to get the correct clustering?
We need a robust measure that works even for small separations.

9 A Robust Measure
Find the best balanced partition (S, S') such that
  Σf |Nf(S) − Nf(S')|
is maximized.
Nf(S), Nf(S'): number of individuals with feature f in S, S'

10 A Robust Measure
Find the best balanced partition (S, S') such that
  Σf |Nf(S) − Nf(S')|
is maximized.
Nf(S), Nf(S'): number of individuals with feature f in S, S'
Theorem: Optimizing this measure provides the correct partition w.h.p. if γk = Ω(√k log n).
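A sketch of the objective evaluated on one candidate balanced partition (the hard part, searching over all balanced partitions, is of course not shown); `partition_score` is an illustrative name.

```python
import numpy as np

def partition_score(X, side):
    """Sum over features f of |N_f(S) - N_f(S')|.

    X:    2n-by-k 0/1 data matrix.
    side: boolean array of length 2n; True marks membership in S.
    """
    N_S = X[side].sum(axis=0)    # N_f(S): individuals in S that have feature f
    N_Sp = X[~side].sum(axis=0)  # N_f(S'): individuals in S' that have feature f
    return int(np.abs(N_S - N_Sp).sum())
```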

11 Proof Sketch
How does the optimal partition behave?
(I) The correct partition P: E[f(P)] = γkn + k√n, and Pr[|f(P) − E[f(P)]| > n√k] ≤ 2^(−n)
(II) Any other balanced partition: E[f] = k√n, and Pr[|f − E[f]| > n√k] ≤ 2^(−n)
The partition with the optimal value of f in (I) dominates all partitions in (II) w.h.p. under the separation conditions.

12 An Algorithm
How can we find the partition that optimizes this measure?
Theorem: There exists an algorithm which finds the correct partition when γk = Ω(√k log² n).
Running time: O(nk log² n)

13 An Algorithm
Algorithm:
1. Divide the individuals into two sets, A and B.
2. Start with a random partition of A.
3. Iterate log n times:
   a. Classify B using the current partition of A and a proximity score.
   b. Do the same for A, using the current partition of B.
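A minimal, self-contained sketch of this iterative scheme, assuming the score-and-median classification rule described on slide 15 below; it reuses all features in every round rather than taking a fresh set per round (a simplification), and all names are illustrative.

```python
import numpy as np

def classify_by_score(X_ref, side_ref, X_new):
    """Score-based classification (slide 15): for each feature, +1 if the new
    individual agrees with the side of the reference partition that has more
    of that feature, -1 otherwise; then split the new set at the median score."""
    majority_has_f = X_ref[side_ref].sum(axis=0) > X_ref[~side_ref].sum(axis=0)
    agree = np.where(X_new.astype(bool) == majority_has_f, 1, -1)
    scores = agree.sum(axis=1)
    return scores > np.median(scores)        # True -> S, False -> S'

def iterative_partition(X, rounds=None, rng=None):
    """Split the individuals into A and B, start from a random partition of A,
    and alternately reclassify B from A and A from B."""
    rng = np.random.default_rng() if rng is None else rng
    m = X.shape[0]
    A, B = np.arange(m // 2), np.arange(m // 2, m)
    rounds = int(np.ceil(np.log2(m))) if rounds is None else rounds
    side_A = rng.random(A.size) < 0.5        # random initial partition of A
    side_B = np.zeros(B.size, dtype=bool)
    for _ in range(rounds):
        side_B = classify_by_score(X[A], side_A, X[B])   # classify B from A
        side_A = classify_by_score(X[B], side_B, X[A])   # ... and A from B
    side = np.empty(m, dtype=bool)
    side[A], side[B] = side_A, side_B
    return side
```

On data drawn as in the earlier sampler sketch, `iterative_partition(X)` returns a boolean split that can be compared against the hidden labels up to relabeling of the two sides.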

14 An Algorithm
Iterate:
- Classify B using the current partition of A and a score
- And vice versa
Random partition: ≈ (1/2 + 1/√n) imbalance
Each iteration produces a partition with more imbalance.

15 Classification Score
Our score: for each feature f,
- If Nf(S) > Nf(S'): add 1 to the score if f is present, else subtract 1
- If Nf(S) < Nf(S'): add 1 to the score if f is absent, else subtract 1
Classify:
- Individuals above the median score: S
- Individuals below the median score: S'
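A tiny standalone illustration of the score rule with made-up numbers: a reference partition (S, S') of six individuals over four features, and the score of one new individual.

```python
import numpy as np

S  = np.array([[1, 1, 0, 1],      # individuals currently placed in S
               [1, 0, 0, 1],
               [1, 1, 0, 0]])
Sp = np.array([[0, 0, 1, 0],      # individuals currently placed in S'
               [0, 1, 1, 0],
               [0, 0, 1, 1]])
majority_has_f = S.sum(axis=0) > Sp.sum(axis=0)   # is N_f(S) > N_f(S') per feature?
x = np.array([1, 1, 0, 0])                        # a new individual to score
score = np.where(x.astype(bool) == majority_has_f, 1, -1).sum()
print(majority_has_f, score)   # [ True  True False  True], score = 2
```

Individuals whose scores fall above the median are assigned to S, the rest to S'.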

16 Classification
Lemma: If the current partition has (1/2 + ε)-imbalance, the next iteration produces a partition with (1/2 + 2ε)-imbalance [for ε < c].
Lemma: If the current partition has (1/2 + c)-imbalance, the next iteration produces the correct partition under our separation conditions.
- O(log n) rounds are needed to reach the correct partition
- Use a fresh set of features in each round to get independence

17 Proof Sketch
Lemma: If the current partition has (1/2 + ε)-imbalance, the next iteration produces a partition with (1/2 + 2ε)-imbalance [for ε < c].
[Figure: the score distributions of Population 1 and Population 2, initially X, Y ≈ Bin(k, ½), separated by a gap G ≈ Θ(log n) between their means; the slide lower-bounds the gap as G = Ω(εγ²k√n).]

18 Proof Sketch
Lemma: If the current partition has (1/2 + ε)-imbalance, the next iteration produces a partition with (1/2 + 2ε)-imbalance [for ε < c].
[Figure: the two score distributions, separated by a gap G = Ω(εγ²k√n).]
Pr[correct classification] ≥ ½ + Ω(G/√k) > ½ + 2ε [from the separation conditions]

19 Proof Sketch
Lemma: If the current partition has (1/2 + c)-imbalance, the next iteration produces the correct partition under our separation conditions.
[Figure: the two score distributions, now separated by the constant-imbalance gap G.]
All but a 1/poly(n) fraction of the individuals are correctly classified.

20 Related Work
Learning mixtures of Gaussians [D99]:
- Best performance by spectral algorithms [VW02, AM05, KSV05]
Our algorithm:
- Matches the bounds in [VW02] for two clusters
- Is not a spectral algorithm!

21 Open Questions
- How do we extend our algorithm to work for multiple clusters?
- What is the relationship between our algorithm and spectral algorithms?
  - It matches the spectral algorithms of [M01] for two-way graph partitioning
  - Can our algorithm do better?

22 Thank You!

