# Entropy Estimation and Applications to Decision Trees



Estimation

Distribution over K=8 classes. Repeat 50,000 times:

1. Generate N samples
2. Estimate entropy from samples

Figure panels: N=10, N=100, N=50000, with H=1.289.
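The experiment above can be sketched in a few lines. This is a minimal illustration, not the code behind the slides: the distribution `P` is a made-up example over 8 classes (the slides do not give theirs), the estimator is the plain plug-in estimate, and the trial counts are kept small so the script runs quickly.

```python
import math
import random
from collections import Counter

# Hypothetical distribution over K=8 classes (NOT the one used in the slides).
P = [0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]


def true_entropy(p):
    """Shannon entropy in nats of a known distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def plugin_entropy(samples):
    """Plug-in estimate: entropy of the empirical class frequencies."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def simulate(n_samples, n_trials, seed=0):
    """Return (mean, standard deviation) of the plug-in estimate over n_trials."""
    rng = random.Random(seed)
    classes = range(len(P))
    estimates = [
        plugin_entropy(rng.choices(classes, weights=P, k=n_samples))
        for _ in range(n_trials)
    ]
    mean = sum(estimates) / n_trials
    var = sum((e - mean) ** 2 for e in estimates) / (n_trials - 1)
    return mean, math.sqrt(var)


if __name__ == "__main__":
    print("true entropy:", round(true_entropy(P), 4))
    for n in (10, 100, 1000):
        mean, std = simulate(n, n_trials=500)
        print(f"N={n}: mean={mean:.4f}, std={std:.4f}")
```

Comparing the mean of the estimates with the true entropy shows the bias, and the standard deviation shows the spread, for each sample size N.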

Estimation

Estimating the true entropy. Goals:

1. Consistency: large N guarantees correct result
2. Low variance: variation of estimates small
3. Low bias: expected estimate should be correct

Discrete Entropy Estimators

Experimental Results

- UCI classification data sets
- Accuracy on test set
- Plugin vs. Grassberger
- Better trees

Source: [Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012]
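The two estimators compared in these experiments can be sketched as follows. The slides do not reproduce the formulas, so this is an assumption based on the commonly cited form of Grassberger's estimator, H_G = ln N - (1/N) Σ n_k G(n_k), with G computed by its integer recursion. The function names are illustrative only.

```python
import math

EULER_GAMMA = 0.5772156649015329


def plugin_entropy(counts):
    """Plug-in estimate in nats from per-class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def grassberger_g(n):
    """G(n) for integer n >= 1 via the recursion
    G(1) = -gamma - ln 2, G(2) = 2 - gamma - ln 2,
    G(2m+1) = G(2m), G(2m+2) = G(2m) + 2/(2m+1)."""
    if n == 1:
        return -EULER_GAMMA - math.log(2)
    g = 2.0 - EULER_GAMMA - math.log(2)  # G(2)
    for m in range(1, n // 2):
        g += 2.0 / (2 * m + 1)
    return g


def grassberger_entropy(counts):
    """Assumed form: H_G = ln N - (1/N) * sum_k n_k * G(n_k), in nats."""
    n = sum(counts)
    return math.log(n) - sum(c * grassberger_g(c) for c in counts if c > 0) / n


if __name__ == "__main__":
    counts = [3, 1, 1]  # a small, made-up sample
    print("plug-in:    ", plugin_entropy(counts))
    print("Grassberger:", grassberger_entropy(counts))
```

For large counts G(n) approaches ln n, so the two estimates agree; they differ mainly when some classes have only a handful of samples, which is the typical situation deep in a decision tree.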

Differential Entropy Estimation

In regression, differential entropy

- measures remaining uncertainty about y
- is a function of a distribution

Problem: q is not from a parametric family.

- Solution 1: project onto a parametric family
- Solution 2: non-parametric entropy estimation

Solution 1: parametric family

- Uniform minimum variance unbiased estimator (UMVUE)

[Ahmed, Gokhale, “Entropy expressions and their estimators for multivariate distributions”, IEEE Trans. Inf. Theory, 1989]

Solution 1: parametric family
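As a rough illustration of the parametric route, the sketch below fits a one-dimensional Gaussian to the regression targets and reports the entropy of the fitted model, 0.5 ln(2πe σ²). Note that this is the simple maximum-likelihood plug-in, chosen here for brevity; it is not the UMVUE of Ahmed and Gokhale, whose correction terms are not given in the slides.

```python
import math
import random


def gaussian_entropy_plugin(ys):
    """Fit a 1-D Gaussian by maximum likelihood and return its
    differential entropy in nats (plug-in, NOT the UMVUE)."""
    n = len(ys)
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n  # ML variance estimate
    return 0.5 * math.log(2.0 * math.pi * math.e * var)


if __name__ == "__main__":
    rng = random.Random(0)
    ys = [rng.gauss(0.0, 2.0) for _ in range(5000)]  # synthetic targets
    print("estimate:", gaussian_entropy_plugin(ys))
    print("true:    ", 0.5 * math.log(2.0 * math.pi * math.e * 4.0))
```

If the targets are far from Gaussian, the projection step can misstate the uncertainty, which is the motivation for the non-parametric alternative.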

Solution 2: Non-parametric entropy estimation

- [Kozachenko, Leonenko, “Sample estimate of the entropy of a random vector”, Probl. Peredachi Inf., 1987]
- [Beirlant, Dudewicz, Győrfi, van der Meulen, “Nonparametric entropy estimation: An overview”, 2001]
- [Wang, Kulkarni, Verdú, “Universal estimation of information measures for analog sources”, FnT Comm. Inf. Th., 2009]

Solution 2: Non-parametric estimation
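A minimal sketch of a nearest-neighbour estimator in the spirit of Kozachenko and Leonenko, restricted to one dimension for simplicity. It assumes the classic 1-nearest-neighbour form, the mean of ln ρ_i plus ln(2(N-1)) plus Euler's constant, where ρ_i is the distance from sample i to its nearest neighbour and 2 is the volume of the unit ball in 1-D. It also assumes all samples are distinct.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def kl_entropy_1d(xs):
    """1-nearest-neighbour entropy estimate in nats for 1-D samples.
    Assumes no duplicate values (a zero distance would make log fail)."""
    n = len(xs)
    s = sorted(xs)
    total = 0.0
    for i, x in enumerate(s):
        # In sorted order the nearest neighbour is one of the two adjacent points.
        left = x - s[i - 1] if i > 0 else math.inf
        right = s[i + 1] - x if i < n - 1 else math.inf
        total += math.log(min(left, right))
    return total / n + math.log(2.0 * (n - 1)) + EULER_GAMMA


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    print("estimate:", kl_entropy_1d(xs))
    print("true:    ", 0.5 * math.log(2.0 * math.pi * math.e))
```

No distributional family is assumed here; the estimate relies only on local sample spacing, at the cost of higher variance for small N.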

Experimental Results [Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012]

Streaming Decision Trees

Streaming Data

- “Infinite data” setting
- 10 possible splits and their scores
- When to stop and make a decision?

Streaming Decision Trees

Score splits on a subset of samples only.

Domingos/Hulten (Hoeffding Trees), 2000:

- Compute sample count n for given precision
- Streaming decision tree induction
- Incorrect confidence intervals, but work well in practice

Jin/Agrawal, 2003:

- Tighter confidence interval, asymptotic derivation using delta method

Loh/Nowozin, 2013:

- Racing algorithm (bad splits are removed early)
- Finite sample confidence intervals for entropy and Gini

References:

- [Domingos, Hulten, “Mining High-Speed Data Streams”, KDD 2000]
- [Jin, Agrawal, “Efficient Decision Tree Construction on Streaming Data”, KDD 2003]
- [Loh, Nowozin, “Faster Hoeffding racing: Bernstein races via jackknife estimates”, ALT 2013]
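The Hoeffding-tree stopping rule can be sketched as below, assuming the standard Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)) for a score with range R. The function names and the example scores are hypothetical. As the slide notes, applying this bound to split scores does not give strictly correct confidence intervals, so this should be read as the practical heuristic rather than a guarantee.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Half-width of the Hoeffding interval after n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def samples_needed(value_range, delta, epsilon):
    """Smallest n for which the Hoeffding half-width is at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


def ready_to_split(scores, value_range, delta, n):
    """True if the best split beats the runner-up by more than epsilon."""
    best, second = sorted(scores, reverse=True)[:2]
    return (best - second) > hoeffding_epsilon(value_range, delta, n)


if __name__ == "__main__":
    scores = [0.30, 0.10, 0.05]  # made-up split scores
    for n in (10, 100):
        print(n, ready_to_split(scores, value_range=1.0, delta=0.05, n=n))
```

With few samples the gap between the two best splits is smaller than ε and the decision is deferred; once n grows, ε shrinks and the leading split can be committed.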

Multivariate Delta Method [DasGupta, “Asymptotic Theory of Statistics and Probability”, Springer, 2008]
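For reference, the standard statement of the multivariate delta method is given here in its textbook form; the notation is an assumption, since the slide's own formula did not survive. If an estimator $T_n \in \mathbb{R}^k$ satisfies a central limit theorem and $g$ is differentiable at $\theta$ with nonzero gradient, then $g(T_n)$ is asymptotically normal:

$$
\sqrt{n}\,(T_n - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma)
\quad\Longrightarrow\quad
\sqrt{n}\,\bigl(g(T_n) - g(\theta)\bigr) \xrightarrow{d} \mathcal{N}\!\bigl(0,\ \nabla g(\theta)^{\top} \Sigma\, \nabla g(\theta)\bigr).
$$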

Delta Method for the Information Gain [Small, “Expansions and Asymptotics for Statistics”, CRC, 2010] [DasGupta, “Asymptotic Theory of Statistics and Probability”, Springer, 2008]

Delta Method Example
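One way to make the delta method concrete is to apply it to the plug-in entropy of a multinomial sample. This is a simpler target than the information gain treated in the slides, and it is chosen here only for illustration. With g(p) = -Σ p_k ln p_k and Σ = diag(p) - p pᵀ, the quadratic form ∇gᵀ Σ ∇g simplifies to Σ p_k (ln p_k)² - H². The helper names are hypothetical, and the interval ignores the bias of the plug-in estimate.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)


def entropy_asymptotic_variance(p):
    """Delta-method variance of sqrt(n) * (H_hat - H):
    sum_k p_k (ln p_k)^2 - H^2."""
    h = entropy(p)
    return sum(q * math.log(q) ** 2 for q in p if q > 0) - h ** 2


def plugin_entropy_interval(counts, z=1.96):
    """Approximate normal interval around the plug-in entropy estimate."""
    n = sum(counts)
    p_hat = [c / n for c in counts]
    h = entropy(p_hat)
    # max(..., 0.0) guards against tiny negative values from rounding.
    half_width = z * math.sqrt(max(entropy_asymptotic_variance(p_hat), 0.0) / n)
    return h - half_width, h + half_width


if __name__ == "__main__":
    print(plugin_entropy_interval([90, 10]))
```

The variance vanishes for a uniform distribution, where the first-order expansion is degenerate, so the interval is uninformative in that case and a higher-order analysis would be needed.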

Conclusion on Entropy Estimation

- Statistical problem
- Large body of literature exists on entropy estimation
- Better estimators yield better decision trees
- Distribution of estimate relevant in the streaming setting