Byron Roe 1: Some Current Statistical Considerations in Particle Physics. Byron P. Roe, Department of Physics, University of Michigan, Ann Arbor, MI 48109.

Byron Roe 2: Outline. Preliminaries; Nuisance Variables; Modern Classification Methods (especially boosting and related methods).

Byron Roe 3: Preliminaries. Try to determine a parameter λ given a measurement x. For each λ, draw a line so that the probability that x falls within the limits is 90%; the probability of a result falling in the region is then 90%. Given an x, for 90% of experiments λ is in that region. This is the Neyman construction.

Byron Roe 4: Frequentist and Bayesian. This construction is frequentist; no probability is assigned to λ, a physical quantity. From the Bayesian point of view, probability refers to the state of knowledge of the parameter, and λ can have a probability. Formerly there was a war between the two views; people are starting to realize that each side has some merits and uses, and the war is abating.

Byron Roe 5: Ambiguities. At a given λ, 90% of the time x will fall in the region, but do you want 5% on each side, or 8% on the lower side and 2% on the upper? A useful ordering principle was introduced into physics by Feldman and Cousins: choose the region to have the largest values of R = (likelihood of this λ given x) / (best likelihood of any physical λ given x). This always gives a region and goes automatically from limits to two-sided regions.
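
A minimal numerical sketch of this ordering for the common case of a Poisson count with known background (the helper names, the scipy-based implementation, and the grid choices are illustrative assumptions, not from the talk):

```python
# Feldman-Cousins ordering sketch for n ~ Poisson(lam + b) with known b.
import numpy as np
from scipy.stats import poisson

def fc_acceptance(lam, b, cl=0.90, n_max=200):
    """Counts n accepted at confidence level cl for signal mean lam, ranked by
    R = P(n | lam + b) / P(n | lam_best + b), with lam_best = max(0, n - b)."""
    n = np.arange(n_max)
    p = poisson.pmf(n, lam + b)
    lam_best = np.maximum(0.0, n - b)
    r = p / poisson.pmf(n, lam_best + b)
    accepted, total = [], 0.0
    for i in np.argsort(-r):              # add counts in order of decreasing R
        accepted.append(int(n[i]))
        total += p[i]
        if total >= cl:
            break
    return accepted

def fc_interval(n_obs, b, cl=0.90, lam_grid=np.linspace(0.0, 20.0, 401)):
    """Confidence interval: all lam whose acceptance region contains n_obs."""
    inside = [lam for lam in lam_grid if n_obs in fc_acceptance(lam, b, cl)]
    return (min(inside), max(inside))

# Roughly reproduces the ~1.1 upper limit for 0 observed events on a
# background of 2.83 discussed on the next slide.
print(fc_interval(n_obs=0, b=2.83))
```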

Byron Roe 6: But is it new? Feldman and Cousins soon realized that this was a standard statistical technique described in a text (Kendall and Stuart). Physicists have, in the past, often ignored the statistical literature, to the detriment of both physicists and statisticians. In recent years, helped by conferences on statistics in physics since 2000, there has been more and more cooperation.

Byron Roe 7: Nuisance Parameters. Experiments may depend on background, efficiency, etc., which are not the targets of the experiment but are needed to get to the physical parameter λ. These are called nuisance parameters. Their expectation values may be well known or may have an appreciable uncertainty.

Byron Roe 8: Problem with Feldman-Cousins. The KARMEN experiment reported results in 1999; they were checking the LSND experiment. The background was known to be 2.83 ± 0.13 events. They observed 0 events and, using FC, set a 90% CL limit on λ of 1.1. Common sense says that if 0 events are seen, 2.3 is the 90% CL limit. The FC ordering is a probability given the data, BUT the 90% CL is an overall probability, not a probability given the data.

Byron Roe 9: Attempts to Improve Estimate. With a statistician, Michael Woodroofe, I tried different methods. Suppose we try a Bayesian method, taking a uniform prior (initial) probability for λ, and obtain a credible limit (the Bayesian equivalent of a CL). We then go back to the frequentist view and look at the coverage. It is quite close to frequentist, and with a small modification very close in almost all regions, yet it gives 2.5 for the KARMEN limit, close to the desired 2.3.
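
For reference, a sketch of the plain uniform-prior Bayesian upper limit that this approach starts from (the Roe-Woodroofe modification itself is not reproduced here; the function names and the integration cutoff are illustrative):

```python
# Bayesian credible upper limit for n ~ Poisson(lam + b), known b,
# with a uniform prior on lam >= 0.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def posterior(lam, n, b):
    # Unnormalized posterior: Poisson likelihood times a flat prior.
    return np.exp(-(lam + b)) * (lam + b) ** n

def credible_upper_limit(n, b, cl=0.90, lam_max=50.0):
    norm, _ = quad(posterior, 0.0, lam_max, args=(n, b))
    def cdf_minus_cl(lam_up):
        integral, _ = quad(posterior, 0.0, lam_up, args=(n, b))
        return integral / norm - cl
    return brentq(cdf_minus_cl, 0.0, lam_max)

# For n = 0 the background cancels in the posterior and the limit is
# -ln(0.1) ~ 2.3, the "common sense" number quoted on slide 8.
print(credible_upper_limit(n=0, b=2.83))
```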

Byron Roe 10: Nuisance Variables with Significant Uncertainty. One can draw a 90% CL region for the joint probability of λ and b (the nuisance parameter), project onto the λ axis, and take the extreme values for the CL. This is safe, but it often grossly over-covers.

Byron Roe 11: Nuisance Parameters 2. 1. Integrate over the nuisance parameter b using the measured probability of b. This introduces a Bayesian concept for b; it tends to over-cover, and there are claims of under-coverage. 2. Suppose the maximum likelihood solution has values L, B, and that, for a given λ, the maximum likelihood solution for b is b_λ. Consider R = likelihood(x | λ, b_λ) / likelihood(x | L, B), which is essentially the FC ratio. Let G_{λ,b} = Prob_{λ,b}(R > C) = CL, and approximate G_{λ,b} ≈ G_{λ,b_λ}.

Byron Roe 12: Nuisance Parameters 3. Use this and make a full Neyman construction; the coverage is good for a number of examples. OR assume that -2 ln R follows approximately a χ² distribution (it does asymptotically); this is the method of MINOS in MINUIT. Coverage is good in some recent cases, though clipping is required for nuisance parameters far from their expected values.
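
A minimal sketch of the profile-likelihood-ratio treatment described on slides 11-12, using the χ² approximation rather than the full Neyman construction; the Gaussian constraint on the background and all numerical values are illustrative assumptions:

```python
# Profile likelihood ratio for n ~ Poisson(lam + b), with an auxiliary
# measurement b0 +/- sigma_b constraining the nuisance parameter b.
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.stats import chi2

def nll(lam, b, n, b0, sigma_b):
    # Negative log likelihood: Poisson term plus Gaussian constraint on b.
    mu = lam + b
    return mu - n * np.log(mu) + 0.5 * ((b - b0) / sigma_b) ** 2

def profile_nll(lam, n, b0, sigma_b):
    # Minimize over b at fixed lam: this gives b_lam of slide 11.
    res = minimize_scalar(lambda b: nll(lam, b, n, b0, sigma_b),
                          bounds=(1e-6, b0 + 10 * sigma_b), method="bounded")
    return res.fun

def interval(n, b0, sigma_b, cl=0.90, lam_grid=np.linspace(1e-6, 30.0, 601)):
    # Global best fit (L, B of slide 11), then keep all lam with
    # -2 ln R below the chi-square quantile (MINOS-style approximation).
    best = minimize(lambda p: nll(p[0], p[1], n, b0, sigma_b),
                    x0=[max(n - b0, 0.1), b0],
                    bounds=[(1e-6, None), (1e-6, None)]).fun
    q = np.array([2.0 * (profile_nll(l, n, b0, sigma_b) - best) for l in lam_grid])
    inside = lam_grid[q <= chi2.ppf(cl, df=1)]
    return inside.min(), inside.max()

print(interval(n=5, b0=2.83, sigma_b=0.13))
```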

Byron Roe 13: Data Classification. We are given a set of events to separate into signal and background, together with some partial knowledge in the form of a set of particle identification (PID) variables. Making a series of cuts on the PID variables is often inefficient. Neural nets were invented by John Hopfield as a method the brain might use to learn. Newer methods include boosting and related techniques.

Byron Roe 14: Neural Nets and Modern Methods. Use a training sample of events for which you know which are signal and which are background. Train an algorithm on this set, updating it and trying to find the best discrimination. A second, unbiased set is needed to test the result on: the test sample. If the test set was used to determine parameters or the stopping point of the algorithm, a third set is needed: the verification sample. The results here are for test samples; verification samples in our tests gave essentially the same results.
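
A minimal sketch of this three-sample bookkeeping (the 60/20/20 split, the random stand-in data, and all names are illustrative choices, not the actual MiniBooNE fractions):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_events = 10_000
x = rng.normal(size=(n_events, 20))        # stand-in PID variables
y = rng.choice([-1, 1], size=n_events)     # +1 = signal, -1 = background

idx = rng.permutation(n_events)
n_train, n_test = int(0.6 * n_events), int(0.2 * n_events)
train = idx[:n_train]                      # used to build the classifier
test = idx[n_train:n_train + n_test]       # used to tune / stop the algorithm
verif = idx[n_train + n_test:]             # untouched until the final quote

x_train, y_train = x[train], y[train]
x_test, y_test = x[test], y[test]
x_verif, y_verif = x[verif], y[verif]
```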

Byron Roe 15: Neural Network Structure. Combine the features in a non-linear way into a hidden layer and then into a final layer. Use a training set to find the best weights w_ik to distinguish signal from background.

Byron Roe 16: Intuition. Neural nets and most modern methods use the PID variables in complicated non-linear ways, so intuition is somewhat difficult. However, they are often much more efficient than cuts and are used more and more. I will not discuss neural nets further, but will discuss the modern methods: boosting, etc.

Byron Roe 17: Boosted Decision Trees. What is a decision tree? What is boosting of decision trees? Two algorithms for boosting.

Byron Roe 18: Decision Tree. Go through all PID variables and find the best variable and value at which to split the events. For each of the two subsets, repeat the process. Proceeding in this way, a tree is built. The ending nodes are called leaves. (Figure: example tree, with leaves labeled background or signal.)

Byron Roe 19: Select Signal and Background Leaves. Assume an equal weight of signal and background training events. If more than half of the weight of a leaf corresponds to signal, it is a signal leaf; otherwise it is a background leaf. Signal events on a background leaf or background events on a signal leaf are misclassified.

Byron Roe 20: Criterion for Best Split. The purity, P, is the fraction of the weight of a leaf due to signal events. Gini index: Gini = W × P × (1 − P), where W is the total weight of the node. Note that the Gini index is 0 for an all-signal or all-background node. The criterion is to minimize Gini_left + Gini_right for the two children of a parent node.

Byron Roe 21: Criterion for Next Branch to Split. Pick the branch to split so as to maximize the change in Gini: Criterion = Gini_parent − Gini_right-child − Gini_left-child.
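
A minimal sketch of the split bookkeeping on slides 20-21, for weighted events labeled +1 (signal) and -1 (background); the helper names and the brute-force scan over cut values are illustrative:

```python
import numpy as np

def gini(weights, labels):
    """Gini index of one node: W * P * (1 - P), with P the signal purity
    by weight.  Zero for a pure signal or pure background node."""
    w_tot = weights.sum()
    if w_tot == 0.0:
        return 0.0
    purity = weights[labels == 1].sum() / w_tot
    return w_tot * purity * (1.0 - purity)

def best_split(x_col, weights, labels):
    """Scan candidate cuts on one PID variable; return the cut minimizing
    gini(left child) + gini(right child)."""
    best_cut, best_sum = None, np.inf
    for cut in np.unique(x_col):
        left = x_col < cut
        s = gini(weights[left], labels[left]) + gini(weights[~left], labels[~left])
        if s < best_sum:
            best_cut, best_sum = cut, s
    return best_cut, best_sum

def split_gain(x_col, weights, labels):
    """Slide 21 criterion: gini(parent) minus the best achievable sum of the
    children's gini; the node with the largest gain is split next."""
    _, child_sum = best_split(x_col, weights, labels)
    return gini(weights, labels) - child_sum
```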

Byron Roe 22: Decision Trees. This is a decision tree. Decision trees have been known for some time, but they are often unstable: a small change in the training sample can produce a large difference in the result.

Byron Roe 23: Boosting the Decision Tree. Give the training events misclassified under this procedure a higher weight. Continuing, build perhaps 1000 trees and take a weighted average of the results (+1 if an event lands on a signal leaf, -1 if on a background leaf).

Byron Roe 24: Two Commonly Used Algorithms for Changing Weights. 1. AdaBoost. 2. Epsilon-boost (shrinkage).

Byron Roe 25: Definitions. x_i = the set of particle ID variables for event i. y_i = 1 if event i is signal, -1 if it is background. T_m(x_i) = 1 if event i lands on a signal leaf of tree m, and -1 if it lands on a background leaf.

Byron Roe 26: AdaBoost. Define err_m = (weight of wrongly classified events) / (total weight). Increase the weight of the misidentified events.

Byron Roe 27: Scoring Events with AdaBoost. Renormalize the weights. Score an event by summing over the trees.
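
The slide equations did not survive the transcript, so the following sketch uses the standard AdaBoost update, alpha_m = beta × ln((1 − err_m)/err_m) applied to the misclassified events with beta ≈ 0.5 (the value quoted on slide 31), as described in the Roe et al. / Yang et al. references; treat the exact formulas as assumptions. scikit-learn trees stand in for the weak learners.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(x, y, n_trees=1000, n_leaves=45, beta=0.5):
    """y must be +/-1.  Returns the trained trees and their alpha weights."""
    w = np.full(len(y), 1.0 / len(y))
    trees, alphas = [], []
    for _ in range(n_trees):
        tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves)
        tree.fit(x, y, sample_weight=w)
        miss = tree.predict(x) != y            # T_m(x_i) != y_i
        err = w[miss].sum() / w.sum()          # err_m of slide 26
        if err <= 0.0 or err >= 0.5:
            break
        alpha = beta * np.log((1.0 - err) / err)
        w[miss] *= np.exp(alpha)               # boost the misclassified events
        w /= w.sum()                           # renormalize (slide 27)
        trees.append(tree)
        alphas.append(alpha)
    return trees, np.array(alphas)

def adaboost_score(trees, alphas, x):
    """Slide 27: score by the alpha-weighted sum of the +/-1 tree outputs."""
    return sum(a * t.predict(x) for a, t in zip(alphas, trees))

# Example (using the x_train, y_train arrays from the earlier sketch):
# trees, alphas = adaboost(x_train, y_train)
# scores = adaboost_score(trees, alphas, x_test)
```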

Byron Roe 28: ε-Boost (Shrinkage). After tree m, change the weight of the misclassified events; a typical ε is about 0.01 (0.03 in our case). Renormalize the weights. Score by summing over the trees.
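
A corresponding sketch for ε-boost; the weight factor exp(2ε) for misclassified events follows the description in the cited NIM papers and should be read as an assumption, since the slide's own formula is missing from the transcript.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def epsilon_boost(x, y, n_trees=1000, n_leaves=45, eps=0.03):
    """y must be +/-1.  Misclassified events get their weight multiplied by
    exp(2*eps) after each tree; the weights are then renormalized."""
    w = np.full(len(y), 1.0 / len(y))
    trees = []
    for _ in range(n_trees):
        tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves)
        tree.fit(x, y, sample_weight=w)
        miss = tree.predict(x) != y
        w[miss] *= np.exp(2.0 * eps)       # small, fixed boost per tree
        w /= w.sum()
        trees.append(tree)
    return trees

def epsilon_boost_score(trees, x):
    # Slide 28: score by summing the +/-1 tree outputs over all trees.
    return sum(t.predict(x) for t in trees)
```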

Byron Roe 29: Unweighted and Weighted Misclassified Event Rate vs. Number of Trees. (Figure.)

Byron Roe 30: Comparison of Methods. ε-boost changes the weights a little at a time. Let y = 1 for signal and -1 for background, and let T be the score summed over trees. AdaBoost can be shown to optimize each change of weights: the expectation of exp(-yT) is minimized, and the optimum value is T = ½ × the log-odds of the probability that y is 1 given x.
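
Written out, the pointwise minimization behind that statement is the standard exponential-loss argument (see the Friedman, Hastie, and Tibshirani reference on slide 49):

```latex
\begin{aligned}
E\!\left[e^{-yT(x)} \mid x\right]
  &= P(y=1 \mid x)\, e^{-T(x)} + P(y=-1 \mid x)\, e^{T(x)}, \\
\frac{\partial}{\partial T(x)} E\!\left[e^{-yT(x)} \mid x\right]
  &= -P(y=1 \mid x)\, e^{-T(x)} + P(y=-1 \mid x)\, e^{T(x)} = 0, \\
\Rightarrow\quad
T^{*}(x) &= \tfrac{1}{2}\,\ln\frac{P(y=1 \mid x)}{P(y=-1 \mid x)}.
\end{aligned}
```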

Byron Roe 31: Tests of Boosting Parameters. 45 leaves seemed to work well for our application, and 1000 trees was sufficient (or more than sufficient). AdaBoost with β about 0.5 and ε-boost with ε about 0.03 worked well, although small changes made little difference; for other applications these numbers may need adjustment. For MiniBooNE, around … variables are needed for best results; too many variables degrades performance. Relative ratio = const. × (fraction of background kept) / (fraction of signal kept). Smaller is better!

Byron Roe 32: Effects of Number of Leaves and Number of Trees. (Figure.) Smaller is better! R = const. × (fraction of background kept) / (fraction of signal kept).

Byron Roe 33: Number of Feature Variables in Boosting. In recent trials we used 182 variables, and boosting worked well. However, by looking at the frequency with which each variable was used as a splitting variable, it was possible to reduce the number to 86 without loss of sensitivity. Several methods for choosing variables were tried, but this one worked as well as any. After using the frequency of use as a splitting variable, some further improvement may be obtained by looking at the correlations between variables.
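
A sketch of that variable-weeding idea, counting splits across the trees produced by the boosting sketches above (keeping the top 86 is the slide's number; everything else here is an illustrative choice):

```python
import numpy as np

def split_counts(trees, n_vars):
    """Count how often each variable is used as a splitting variable."""
    counts = np.zeros(n_vars, dtype=int)
    for tree in trees:
        feats = tree.tree_.feature        # per-node feature index; negative = leaf
        np.add.at(counts, feats[feats >= 0], 1)
    return counts

def keep_top(counts, n_keep=86):
    """Indices of the n_keep most frequently used variables."""
    return np.argsort(counts)[::-1][:n_keep]
```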

Byron Roe 34: Effect of Number of PID Variables. (Figure.)

Byron Roe 35: Comparison of Boosting and ANN. The relative ratio here is (ANN background kept) / (boosting background kept); greater than one means boosting wins! Panel A: all types of background events; red is 21 and black is 52 training variables. Panel B: the background is π0 events; red is 22 and black is 52 training variables. (Figures; horizontal axis: percent of νe CCQE events kept.)

Byron Roe 36: Robustness. For either boosting or ANN, it is important to know how robust the method is, i.e., whether small changes in the model produce large changes in the output. In MiniBooNE this is handled by generating many sets of events with parameters varied by about 1σ and checking the differences. This is not complete but, so far, the selections look quite robust for boosting.

Byron Roe 37: How did the sensitivities change with a new optical model? In November 2004, a new, much-changed optical model of the detector was introduced for making MC events, and the reconstruction tunings needed to be changed to optimize the fits for this model. Using the SAME PID variables as for the old model, and for a fixed background contamination of π0 events, the fraction of signal kept dropped by 8.3% for boosting and by 21.4% for ANN.

Byron Roe 38: For ANN. For an ANN one needs to set the temperature, the hidden-layer size, the learning rate, and so on; there are lots of parameters to tune. For an ANN, if one (a) multiplies a variable by a constant, var(17) -> 2*var(17), (b) switches two variables, var(17) <-> var(18), or (c) puts a variable in twice, the result is very likely to change.

Byron Roe 39: For Boosting. Boosting has only a few parameters, and once set they have been stable for all calculations within our experiment. If a variable x is replaced by y = f(x) with f monotonic (x_1 > x_2 implies y_1 > y_2), the results are identical, since they depend only on the ordering of the values. Putting variables in twice or changing the order of the variables has no effect.

Byron Roe 40: Tests of Boosting Variants. None was clearly better than AdaBoost or epsilon-boost. I will not go over most of them, except Random Forests. For a Random Forest, one uses only a random fraction of the events (WITH replacement) per tree and only a random fraction of the variables per node. NO boosting is used, just many trees. Each tree should go to completion: every node is very small, or pure signal, or pure background. Our UM programs weren't designed well for this many leaves, and better results (Narsky) have been obtained, but not better than boosting.
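
For concreteness, the Random Forest variant described here can be sketched with the scikit-learn implementation (parameter values are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=1000,      # many independent trees, no boosting
    bootstrap=True,         # random fraction of the events, WITH replacement
    max_features="sqrt",    # random fraction of the variables at each node
    max_depth=None,         # let each tree go to completion
)
# forest.fit(x_train, y_train)
# scores = forest.predict_proba(x_test)[:, 1]
```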


Byron Roe 42: Can Convergence Speed be Improved? Removing correlations between variables helps. A Random Forest helps WHEN it is combined with boosting. Softening the step-function scoring also helps: with y = 2 × purity − 1, score = sign(y) × sqrt(|y|).
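
The softened leaf score in one line (purely illustrative helper name):

```python
import numpy as np

def soft_leaf_score(purity):
    """Slide 42: replace the +/-1 step by sign(y) * sqrt(|y|), y = 2*purity - 1."""
    y = 2.0 * purity - 1.0
    return np.sign(y) * np.sqrt(np.abs(y))

# A pure signal leaf still scores +1; a 60%-signal leaf scores about +0.45
# instead of the full +1 given by the step function.
print(soft_leaf_score(1.0), soft_leaf_score(0.6))
```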

Byron Roe 43: Soft Scoring and Step Function. (Figure.)

Byron Roe 44: Performance of AdaBoost with Step Function and Soft Scoring Function. (Figure.)

Byron Roe 45: Conclusions for Nuisance Variables. Likelihood ratio methods seem very useful as an organizing principle, with or without nuisance variables. There are some problems in extreme cases where the data is much smaller than expected. Several tools for handling nuisance variables were described; the method using the approximate likelihood to construct the Neyman region seems to have good performance.

Byron Roe 46: References for Nuisance Variables 1.
J. Neyman, Phil. Trans. Royal Soc. London A 236, 333 (1937).
G.J. Feldman and R.D. Cousins, Phys. Rev. D57, 3873 (1998).
A. Stuart, K. Ord, and S. Arnold, Kendall's Advanced Theory of Statistics, Vol. 2A, 6th ed. (London, 1999).
R. Eitel and B. Zeitnitz, hep-ex/…
The LSND Collaboration, C. Athanassopoulos et al., Phys. Rev. Lett. 75, 2650 (1995); Phys. Rev. Lett. 77, 3082 (1996); Phys. Rev. C54, 2685 (1996); Phys. Rev. D64 (2001).
B.P. Roe and M.B. Woodroofe, Phys. Rev. D60 (1999).
B.P. Roe and M.B. Woodroofe, Phys. Rev. D63 (2001).

Byron Roe 47: References for Nuisance Variables 2.
R. Cousins and V.L. Highland, Nucl. Instrum. Meth. A320, 331 (1992).
J. Conrad, O. Botner, A. Hallgren and C. Perez de los Heros, Phys. Rev. D67 (2003).
R.D. Cousins, to appear in Proceedings of PHYSTAT2005: Statistical Problems in Particle Physics, Astrophysics, and Cosmology (2005).
K.S. Cranmer, Proceedings of PHYSTAT2003: Statistical Problems in Particle Physics, Astrophysics and Cosmology, 261 (2003).
G. Punzi, to appear in Proceedings of PHYSTAT2005.
F. James and M. Roos, Nucl. Phys. B172, 475 (1980).
W.A. Rolke and A.M. Lopez, Nucl. Instrum. Meth. A458, 745 (2001).
W.A. Rolke, A.M. Lopez and J. Conrad, Nucl. Instrum. Meth. A551, 493 (2005).

Byron Roe 48: Conclusions for Classification. Boosting is very robust. Given a sufficient number of leaves and trees, AdaBoost or ε-boost reaches an optimum level, which was not bettered by any variant tried. Boosting was better than ANN in our tests. There are ways (such as the smooth scoring function) to increase the convergence speed in some cases. Several techniques can be used for weeding variables; examining the frequency with which a given variable is used works reasonably well. Downloads in FORTRAN or C++ are available at:

Byron Roe 49: References for Boosting.
R.E. Schapire, "The strength of weak learnability," Machine Learning 5 (2) (1990). First suggested the boosting approach, with three trees taking a majority vote.
Y. Freund, "Boosting a weak learning algorithm by majority," Information and Computation 121 (2) (1995). Introduced using many trees.
Y. Freund and R.E. Schapire, "Experiments with a new boosting algorithm," Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, San Francisco (1996). Introduced AdaBoost.
J. Friedman, "Recent Advances in Predictive (Machine) Learning," Proceedings of PHYSTAT2003: Statistical Problems in Particle Physics, Astrophysics and Cosmology, 196 (2003).
J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting," Annals of Statistics 28 (2) (2000). Showed that AdaBoost can be viewed as successive approximations to a maximum likelihood solution.
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer (2001). A good reference for decision trees and boosting.
B.P. Roe et al., "Boosted decision trees as an alternative to artificial neural networks for particle identification," NIM A543 (2005).
Hai-Jun Yang, Byron P. Roe, and Ji Zhu, "Studies of Boosted Decision Trees for MiniBooNE Particle Identification," physics/…, NIM A555 (2005).

Byron Roe 50: AdaBoost Output for Training and Test Samples. (Figure.)

Byron Roe 51: The MiniBooNE Collaboration. (Figure.)

Byron Roe 52: A 40-foot-diameter tank of mineral oil, surrounded by about 1280 photomultipliers. Both Cherenkov and scintillation light are detected. Geometrical shape and timing distinguish event types.

Byron Roe 53: Numerical Results from sfitter (a second reconstruction program). An extensive attempt was made to find the best variables for ANN and for boosting, starting from about 3000 candidates. Training against π0 and related backgrounds used 22 ANN variables and 50 boosting variables. For the region near 50% of signal kept, the ratio of ANN background to boosting background was about 1.2.

Byron Roe 54: Post-Fitting. Post-fitting is an attempt to reweight the trees when summing the tree scores, after all the trees have been made. Two attempts produced only a very modest (few percent) gain, if any.
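
The slide does not say which two reweighting schemes were tried, so the following is only one plausible form of post-fitting: refit a coefficient per tree by logistic regression on the individual ±1 tree outputs, using a held-out sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def post_fit(trees, x_heldout, y_heldout):
    """Refit per-tree weights on a held-out sample; y_heldout must be +/-1."""
    outputs = np.column_stack([t.predict(x_heldout) for t in trees])
    refit = LogisticRegression(max_iter=1000).fit(outputs, y_heldout)
    return refit.coef_.ravel()              # new tree weights replacing alpha_m

def post_fit_score(trees, weights, x):
    outputs = np.column_stack([t.predict(x) for t in trees])
    return outputs @ weights
```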
