Distributions cont.: Continuous and Multivariate


1 Distributions cont.: Continuous and Multivariate

2 Distribution, numeric attribute
Continuous data potentially has an infinite domain: the probability of any specific value is zero, so probabilities are taken over intervals, e.g. (-∞, x].
Cumulative distribution function (CDF): F_X(x) = P(X ≤ x).
Probability density function (PDF): the first derivative of the CDF. It gives the relative density of points at each value; a density is not a probability.
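A minimal sketch of the CDF/PDF distinction, using `statistics.NormalDist` from the Python standard library. The normal distribution and its parameters (mean 0, standard deviation 0.1) are assumptions chosen for illustration and do not come from the slides; `interval_probability` is a hypothetical helper name.

```python
from statistics import NormalDist

# Assumed example distribution: normal with mean 0 and standard deviation 0.1.
X = NormalDist(mu=0.0, sigma=0.1)

def interval_probability(dist, lo, hi):
    """P(lo < X <= hi), computed from the CDF as F(hi) - F(lo)."""
    return dist.cdf(hi) - dist.cdf(lo)

p_left = X.cdf(0.0)                             # P(X <= 0), i.e. the interval (-inf, 0]
p_band = interval_probability(X, -0.1, 0.1)     # about 0.68 (one sigma on each side)
p_point = interval_probability(X, 0.05, 0.05)   # a single value has probability 0
density_at_0 = X.pdf(0.0)                       # about 3.99: a density may exceed 1
```

The last value shows why a density is not a probability: it is larger than 1, while every probability computed from the CDF stays within [0, 1].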

3 Histograms
Estimate density in a discrete way.
Define cut points and count occurrences within bins.
How to choose cut points:
equal width: cut the domain (min to max) into k intervals of equal size
equal height: select cut points such that each of the k bins contains (approximately) n/k data points
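The two ways of choosing cut points can be sketched as follows. The function names, the sample data, and the convention that a value equal to a cut point falls into the bin on its right are all assumptions made for this illustration.

```python
def equal_width_cuts(values, k):
    """k-1 interior cut points splitting [min, max] into k equal-size intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_height_cuts(values, k):
    """k-1 interior cut points so that each bin holds roughly n/k data points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def bin_counts(values, cuts):
    """Count values per bin; a value equal to a cut point goes to the right-hand bin."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        idx = sum(1 for c in cuts if v >= c)
        counts[idx] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 10, 20, 50, 90, 100]   # made-up, skewed sample
width_cuts = equal_width_cuts(data, 4)
height_cuts = equal_height_cuts(data, 4)
```

On this skewed sample, `bin_counts(data, width_cuts)` puts most points into the first bin, while `bin_counts(data, height_cuts)` spreads them evenly. With many tied values the equal-height bins are only approximately equal.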

4 Kernel Density Estimation
Estimating the density (of the population) from the sample.
Observed data is smoothed over the numeric domain by means of a kernel (often Gaussian).
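A minimal sketch of kernel density estimation with a Gaussian kernel. The sample values and the bandwidth of 0.5 are made up for illustration; the slides do not say how the bandwidth should be chosen, so it is simply picked by hand here.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(sample, bandwidth):
    """Return a function x -> estimated density at x."""
    n = len(sample)

    def density(x):
        # Each observation contributes one kernel "bump" centred on itself.
        total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in sample)
        return total / (n * bandwidth)

    return density

sample = [1.0, 1.2, 1.9, 2.1, 2.3, 5.0]   # made-up observations
f_hat = kde(sample, bandwidth=0.5)         # bandwidth chosen by hand (assumption)
```

`f_hat(2.0)` is larger than `f_hat(3.5)` because more observations lie near 2.0. A larger bandwidth gives a smoother estimate, a smaller one follows the individual points more closely.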

5 Entropy of continuous attribute
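Differential entropy: the generalisation of entropy to the continuous case, which is somewhat problematic.
Uniform distribution over [0, a]: H(X) = lg(a).
For a = ½: H(X) = lg(½) = -1, so differential entropy can be negative, unlike entropy in the discrete case.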
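The uniform example can be checked numerically. The sketch below assumes the usual definition of differential entropy, h(X) = -∫ f(x) lg f(x) dx, which the slide does not spell out, and approximates the integral with a simple sum; the function name is hypothetical.

```python
import math

def uniform_differential_entropy(a, steps=10_000):
    """Approximate h(X) = -integral of f(x) * lg f(x) dx for X uniform on [0, a]."""
    f = 1.0 / a          # constant density on [0, a]
    dx = a / steps       # width of each slice of the integration range
    return -sum(f * math.log2(f) * dx for _ in range(steps))

h_two = uniform_differential_entropy(2.0)    # close to lg(2) = 1
h_one = uniform_differential_entropy(1.0)    # close to lg(1) = 0
h_half = uniform_differential_entropy(0.5)   # close to lg(1/2) = -1
```

For a < 1 the density 1/a exceeds 1, so lg f(x) is positive and the entropy comes out negative.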

6 Multivariate Distributions

7 Joint distributions
How frequent are combinations of values?
Confusion matrix (contingency table, cross table): counts each combination of values, and so holds complete information.
With 2 attributes: how informative is one attribute about the other?
Quantifying information between attributes: joint entropy, mutual information, information gain, …
Example joint distribution of X and Y (both T/F). The totals give the univariate (marginal) distributions, e.g. of X:

        T     F     total
T       0.42  0.13  0.55
F       0.12  0.33  0.45
total   0.54  0.46  1.0
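As a rough illustration of quantifying information between two attributes, the sketch below computes the marginal entropies, the joint entropy, and the mutual information for the joint probabilities 0.42, 0.13, 0.12, 0.33 from this slide. Which attribute sits on the rows is an assumption; mutual information is symmetric, so that choice does not affect the result.

```python
import math

# Joint probabilities from the slide; the row/column orientation is assumed.
joint = [[0.42, 0.13],
         [0.12, 0.33]]

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

row_marginal = [sum(row) for row in joint]          # row totals: 0.55, 0.45
col_marginal = [sum(col) for col in zip(*joint)]    # column totals: 0.54, 0.46

h_rows = entropy(row_marginal)
h_cols = entropy(col_marginal)
h_joint = entropy([p for row in joint for p in row])

# Mutual information: I(X; Y) = H(X) + H(Y) - H(X, Y)
mutual_information = h_rows + h_cols - h_joint
```

The mutual information is positive here, which means that knowing one attribute reduces the uncertainty about the other.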

8 Some joint distributions
X and Y are independent: every cell is the product of its marginals
0.48 = 0.6 · 0.8    0.12 = 0.6 · 0.2
0.32 = 0.4 · 0.8    0.08 = 0.4 · 0.2

        T     F     total
T       0.48  0.32  0.8
F       0.12  0.08  0.2
total   0.6   0.4   1.0

Y depends on X: higher counts along the diagonal (both diagonals possible)

        T     F     total
T       0.42  0.13  0.55
F       0.12  0.33  0.45
total   0.54  0.46  1.0

X fully determines Y
T F 0.4 0.6 1.0
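The independence condition (each cell equals the product of its two marginal totals) can be checked mechanically. This is a small sketch using the two complete tables from this slide; the function name and the tolerance are assumptions.

```python
def is_independent(joint, tol=1e-9):
    """True if every cell equals the product of its row total and column total."""
    row_totals = [sum(row) for row in joint]
    col_totals = [sum(col) for col in zip(*joint)]
    return all(
        abs(joint[i][j] - row_totals[i] * col_totals[j]) <= tol
        for i in range(len(joint))
        for j in range(len(joint[0]))
    )

independent_table = [[0.48, 0.32],
                     [0.12, 0.08]]
dependent_table = [[0.42, 0.13],
                   [0.12, 0.33]]
```

`is_independent(independent_table)` holds, while `is_independent(dependent_table)` does not: 0.55 · 0.54 is roughly 0.30, not 0.42. The tolerance only absorbs floating-point rounding; with counts estimated from real data a statistical test would be needed instead.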

9 Capturing multivariate continuous distributions
2 dimensions
Problematic in higher dimensions

10 Joint distribution over numeric x binary
Of specific relevance in Data Mining (classification).
How does the class (T/F) depend on a numeric attribute?

