S.Towers TerraFerMA TerraFerMA: A Suite of Multivariate Analysis Tools. Sherry Towers, SUNY-SB. Version 1.0 has been released! Usable by anyone with access to the CLHEP and ROOT libraries. www-d0.fnal.gov/~smjt/multiv.html
TerraFerMA = Fermilab Multivariate Analysis (aka FerMA). A convenient interface to various multivariate analysis packages (e.g. MLPfit, Jetnet, PDE/GEM, Fisher discriminant, binned likelihood). The user fills signal and background (and data) Samples, which are then used as input to FerMA methods. Includes a method to sort variables to determine which are the best discriminators between signal and background.
Also includes useful statistics tools (correlations, means, RMSs) and a method to detect outliers. Using a multivariate package chosen by the user, FerMA yields the probability that a data event is signal or background. TerraFerMA makes it trivial to compare the performance of different multivariate techniques! It also makes it easy to reduce the number of discriminators used in an analysis!
Overview of common multivariate analysis techniques: Simple techniques. These ignore all correlations between discriminators. Examples: techniques based on square cuts, or likelihood techniques that obtain the multi-D likelihood from the product of 1-D likelihoods. Advantage: fast, easy to understand; easy to tell if modelling of data is sound. Disadvantage: useful discriminating information is lost if correlations are ignored. FerMA includes a method to determine optimal square cuts in a multidimensional parameter space.
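The "product of 1-D likelihoods" idea can be illustrated with a minimal Python sketch. This is not the FerMA implementation: it assumes each variable is modelled by a 1-D Gaussian fitted to the MC sample, and the function names and the tiny hand-made samples are hypothetical, chosen only for illustration.

```python
import math

def fit_1d(sample):
    """Return one (mean, std) pair per variable for a list of equal-length events."""
    n, dims = len(sample), len(sample[0])
    params = []
    for d in range(dims):
        col = [ev[d] for ev in sample]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        params.append((mean, math.sqrt(var)))
    return params

def gauss(x, mean, std):
    """1-D Gaussian PDF (an assumed model for each variable)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def product_likelihood(event, params):
    """Multi-D likelihood approximated as a product of 1-D likelihoods.
    All correlations between the variables are ignored."""
    like = 1.0
    for x, (mean, std) in zip(event, params):
        like *= gauss(x, mean, std)
    return like

def signal_probability(event, sig_params, bkg_params):
    """Probability that an event is signal, assuming equal signal/background priors."""
    ls = product_likelihood(event, sig_params)
    lb = product_likelihood(event, bkg_params)
    return ls / (ls + lb)

# Tiny hand-made "MC" samples (illustrative values only).
signal = [(1.0, 2.0), (1.5, 2.5), (2.0, 1.5), (1.2, 2.2)]
background = [(-1.0, 0.0), (-0.5, 0.5), (0.0, -0.5), (-0.8, 0.2)]
sig_params = fit_1d(signal)
bkg_params = fit_1d(background)

p_sig_like = signal_probability((1.4, 2.0), sig_params, bkg_params)
p_bkg_like = signal_probability((-0.6, 0.1), sig_params, bkg_params)
print(p_sig_like, p_bkg_like)
```

An event near the signal sample gets a probability close to 1, and one near the background sample a probability close to 0. Because each variable is treated independently, any information carried by correlations is lost, which is the disadvantage named on the slide.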
More powerful... More complicated techniques take into account simple (linear) correlations between discriminators:
- ANOVA/MANCOVA
  - H-Matrix
  - Fisher discriminant *
  - Principal component analysis *
- Projection correlation transformations *
- Optimal Observables
- and many, many more…
Advantage: fast, more powerful. Disadvantage: can be a bit harder to understand, and systematics can be harder to assess. Harder to tell if modelling of data is sound.
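As an illustration of how a linear technique uses correlations, here is a minimal two-variable Fisher discriminant sketch in Python. It uses the textbook form w = W^-1 (mu_s - mu_b), with W the sum of the signal and background covariance matrices; it is not taken from TerraFerMA, and the sample values and function names are hypothetical.

```python
def mean2(sample):
    """Mean of a list of 2-D points."""
    n = len(sample)
    return (sum(p[0] for p in sample) / n, sum(p[1] for p in sample) / n)

def cov2(sample, mu):
    """2x2 sample covariance, returned as (sxx, sxy, syy)."""
    n = len(sample)
    sxx = sum((p[0] - mu[0]) ** 2 for p in sample) / (n - 1)
    syy = sum((p[1] - mu[1]) ** 2 for p in sample) / (n - 1)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in sample) / (n - 1)
    return sxx, sxy, syy

def fisher_weights(signal, background):
    """Fisher direction w = W^-1 (mu_s - mu_b), with W = C_s + C_b."""
    mu_s, mu_b = mean2(signal), mean2(background)
    cs, cb = cov2(signal, mu_s), cov2(background, mu_b)
    a, b, d = cs[0] + cb[0], cs[1] + cb[1], cs[2] + cb[2]
    det = a * d - b * b  # must be non-zero for W to be invertible
    dx, dy = mu_s[0] - mu_b[0], mu_s[1] - mu_b[1]
    # Inverse of [[a, b], [b, d]] is (1/det) * [[d, -b], [-b, a]].
    return ((d * dx - b * dy) / det, (a * dy - b * dx) / det)

def fisher_score(event, w):
    """Project an event onto the Fisher direction."""
    return w[0] * event[0] + w[1] * event[1]

# Strongly correlated samples that overlap in each 1-D projection
# but separate along a diagonal direction (illustrative values only).
signal = [(1.0, 1.0), (2.0, 2.2), (3.0, 2.9), (4.0, 4.1)]
background = [(1.0, 0.0), (2.0, 1.1), (3.0, 2.0), (4.0, 2.9)]

w = fisher_weights(signal, background)
sig_scores = [fisher_score(p, w) for p in signal]
bkg_scores = [fisher_score(p, w) for p in background]
print(w, min(sig_scores), max(bkg_scores))
```

In this toy sample neither variable separates the classes on its own, but the single Fisher score does, because the weights exploit the linear correlation between the two variables.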
Probability correlation transformations (ProCor). ProCor is the default multivariate package in TerraFerMA.
- Very fast
- (Relatively) easy to understand
Essentially, ProCor maps every point in the signal (or background) MC onto a multi-dimensional Gaussian PDF.
- Mapping is optimal for MC sets with linear correlations between variables
- If the mapping is not optimal, ProCor tells you!
Most powerful...
Analytic/binned likelihood
- Advantage: easy to understand
- Disadvantage: difficult to implement with many variables
Neural networks
- Advantage: powerful, reasonably fast
- Disadvantage: black box! Many parameters of the method, and systematics can be difficult to assess
Kernel estimation
- (Gaussian Expansion Method = GEM)
- (Static-Kernel Probability Density Estimation = PDE)
- Advantage: powerful, easy to understand. Unbinned estimate of the original PDF. Few parameters of the method.
- Disadvantage: a bit slow.
Gaussian Expansion Method / Probability Density Estimation. All kernel PDF estimation methods are developed from a very simple idea: if a data point lies in a region where the clustering of signal MC points is relatively tight, and that of background MC points is relatively loose, then that point is more likely to be signal.
GEM: Whether the clustering is relatively tight can be determined from the local covariance matrix, calculated from the nearest neighbours of a point.
GEM/PDE: But we also want an estimate of the probability density. GEM/PDE uses the idea that any continuous function can be modelled as a sum of kernel functions (similar to the idea behind Fourier series). GEM/PDE use multi-dimensional Gaussian kernels. Each Gaussian kernel is centred on an MC point; the widths of the Gaussian come from the local covariance matrix at that point.
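A one-dimensional Python sketch of this kernel idea follows. It is a simplification and not the GEM or PDE code: in 1-D the "local covariance matrix" reduces to a single local width, taken here (as an assumption) to be the RMS distance to the k nearest neighbours. The function names, the choice k=3, and the sample values are hypothetical, and the MC points are assumed to be distinct so that no width is zero.

```python
import math

def local_width(points, i, k):
    """RMS distance from points[i] to its k nearest neighbours.
    Stands in for the local covariance matrix in this 1-D sketch."""
    dists = sorted(abs(points[i] - p) for j, p in enumerate(points) if j != i)
    nearest = dists[:k]
    return math.sqrt(sum(d * d for d in nearest) / len(nearest))

def kernel_density(x, points, k=3):
    """Density estimate at x: the average of Gaussian kernels,
    one centred on each MC point, each with its own local width."""
    total = 0.0
    for i, p in enumerate(points):
        h = local_width(points, i, k)
        total += math.exp(-0.5 * ((x - p) / h) ** 2) / (h * math.sqrt(2.0 * math.pi))
    return total / len(points)

def discriminant(x, signal, background, k=3):
    """Probability-like discriminant p_s / (p_s + p_b), assuming equal priors."""
    ps = kernel_density(x, signal, k)
    pb = kernel_density(x, background, k)
    return ps / (ps + pb)

# Signal MC clusters tightly near 1; background MC is spread loosely
# (illustrative values only).
signal = [0.8, 0.9, 1.0, 1.05, 1.1, 1.2]
background = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

d_in_cluster = discriminant(1.0, signal, background)
d_far_away = discriminant(-1.5, signal, background)
print(d_in_cluster, d_far_away)
```

A point inside the tight signal cluster gets a discriminant close to 1 even though a background MC point sits at the same position, because the loosely clustered background kernels are wide and therefore low.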
GEM: 1-D Gaussian
GEM/PDE: 1-D Gaussian
Boring details...
The case for fewer discriminators… Using a large number of variables indiscriminately can indicate a lack of forethought in the design and conceptualization of an analysis.
The case for fewer discriminators… Also, each added variable makes it more difficult to determine whether the modelling of data is sound, and makes the analysis more difficult to understand. And each added variable adds statistical noise, which can degrade overall discrimination power!
The curse of too many variables. Signal: 5D Gaussian with μ = (1,0,0,0,0), σ = (1,1,1,1,1). Bkgnd: 5D Gaussian with μ = (0,0,0,0,0), σ = (1,1,1,1,1). The only difference between signal and background is in the first dimension. The other four dimensions are "useless" discriminators.
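Why the other four dimensions carry no information can be checked directly with a short Python sketch. It assumes the two vectors given on the slide for each sample are the means and the widths of uncorrelated Gaussians; the function and constant names are hypothetical.

```python
import math

# Assumed reading of the slide: per-dimension means and widths.
MU_SIGNAL = (1.0, 0.0, 0.0, 0.0, 0.0)
MU_BKGND = (0.0, 0.0, 0.0, 0.0, 0.0)
SIGMA = (1.0, 1.0, 1.0, 1.0, 1.0)

def log_gaussian(x, mu, sigma):
    """Log of an uncorrelated multi-dimensional Gaussian PDF."""
    return sum(
        -0.5 * ((xi - mi) / si) ** 2 - math.log(si * math.sqrt(2.0 * math.pi))
        for xi, mi, si in zip(x, mu, sigma)
    )

def log_likelihood_ratio(x):
    """ln(p_signal / p_bkgnd) for the exact (not MC-estimated) PDFs."""
    return log_gaussian(x, MU_SIGNAL, SIGMA) - log_gaussian(x, MU_BKGND, SIGMA)

# Two events with the same first coordinate but very different
# values in the four "useless" dimensions.
llr_a = log_likelihood_ratio((0.7, 0.0, 0.0, 0.0, 0.0))
llr_b = log_likelihood_ratio((0.7, 2.0, -1.5, 0.3, 4.0))
print(llr_a, llr_b)
```

Both events give the same log-likelihood ratio, which analytically is x1 - 1/2: the terms from dimensions 2 to 5 cancel exactly. With the true PDFs the extra variables are harmless, but a method that must estimate the PDFs from a finite MC sample only picks up statistical noise from them.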
A real-world example… A Tevatron Run I analysis used a 7-variable NN to discriminate between signal and background. Were all 7 needed? The signal and background n-tuples were run through the TerraFerMA interface to the sorting method…
Another real-world example… A Tevatron physics-object-ID method uses 9 variables in the analysis. How many are actually needed?
Summary: Careful examination of the discriminators used in a multivariate analysis is always a good idea! Reducing the number of variables can simplify an analysis considerably, and can even increase discrimination power!
TerraFerMA Version 1.0
- TerraFerMA documentation: www-d0.fnal.gov/~smjt/ferma.ps
- TerraFerMA users guide: www-d0.fnal.gov/~smjt/guide.ps
- TerraFerMA package: …/ferma.tar.gz (includes an example program in examples/simple/simple.cpp)
TerraFerMA Version 1.0. Soon to be included:
- Support Vector Machines
- Methods to fit for the fraction of signal and background in a data sample
- Ensembles (many samples grouped together)
- Enhanced ability to write out/read in NN weights
Want more? Let me know!
Summary: TerraFerMA is a platform of very powerful multivariate analysis tools. In all test applications to date, TerraFerMA has significantly improved existing analyses!