SFT group meeting: Phystat05 Trip Report. A. Kreshuk, L. Moneta.

1 Phystat05: Trip Report (SFT group meeting, 14th October)
A. Kreshuk, L. Moneta

2 Phystat History
- Started in Jan. 2000 at CERN
  - Workshop on Confidence Limits
  - Organized by F. James and L. Lyons
  - Only particle physicists
- Fermilab (2000)
  - Still focused on limits
- Durham (2002)
  - Wider range of statistical topics in HEP (also parton distributions)
- SLAC (2003)
  - Participation from astronomers and many statisticians

3 Phystat05
- Oxford, 12-15 September 2005
- Treated various topics related to statistics (including software)
- Contributions from the high energy physics, astronomy and statistics communities
- About 80 people

4 Conference Program
- Plenary sessions: half physicists, half statisticians

5 Conference Program (2)
Parallel sessions (Monday + Wednesday afternoons):
- Software
- Event classification
- Limits

6 Conference Topics
- Frequentist vs Bayesian
- Confidence limits
- Nuisance parameters problem
- Multivariate analysis (event classification)
- Statistical software and tools
- Astrophysics
- Goodness of fit
- Unfolding, time series, ...

7 Frequentist vs Bayesian
- Nice review from Sir David Cox (Oxford)
  - Frequentist and Bayesian approaches to statistical inference
- Problems working with Bayesian analysis
  - Le Diberder (BaBar)
  - Analysis of B → … decays
  - Problems with the choice of prior
[Figure: frequentist and Bayesian results compared]

8 Nuisance Parameters
- Problem with the statistical treatment of uncertainties in nuisance parameters
- Typical problem:
  - $N_{\mathrm{obs}} = \sigma \cdot L \cdot A + b$
  - Uncertainties in the background b and the acceptance A affect the estimate of the physical parameter σ
- Statistical uncertainties
  - Number of events in side bands
- Systematic uncertainties
  - Shape of background
- Coverage of these parameters is required in a frequentist analysis
- Important for the LHC (see Kyle Cranmer's talk)

9 Kyle Cranmer: Statistical Challenges of the LHC

10 Why 5σ? [slide from Gary Feldman, PHYSTAT 05, 15 September 2005]
- LHC searches: 500 searches, each of which has 100 resolution elements (mass, angle bins, etc.) → $5 \times 10^4$ chances to find something.
- One experiment: false positive rate at 5σ → $(5 \times 10^4)(3 \times 10^{-7}) = 0.015$. OK.
- Two experiments:
  - Allowable false positive rate: 10. $2\,(5 \times 10^4)(1 \times 10^{-4}) = 10$ → 3.7σ required.
  - Required verification by the other experiment: $(1 \times 10^{-3})(10) = 0.01$ → 3.1σ required.
- Caveats: Is the significance real? Are there common systematic errors?
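
The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: it assumes that the quoted significances are one-sided Gaussian tail probabilities, and the function and variable names are illustrative rather than taken from the talk.

```python
import math
from statistics import NormalDist

def one_sided_p(n_sigma):
    """Upper-tail probability of a standard normal beyond n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

def sigma_for_p(p):
    """Number of sigma whose one-sided tail probability equals p."""
    return NormalDist().inv_cdf(1.0 - p)

# 500 searches x 100 resolution elements, as on the slide
chances = 500 * 100

# One experiment: expected number of false positives at 5 sigma
false_pos_5sigma = chances * one_sided_p(5.0)

# Two experiments sharing an allowance of 10 false positives
p_two_experiments = 10 / (2 * chances)
sigma_two_experiments = sigma_for_p(p_two_experiments)

# Verification by the other experiment: 10 candidates, 0.01 overall
p_verification = 0.01 / 10
sigma_verification = sigma_for_p(p_verification)

print(f"false positives at 5 sigma: {false_pos_5sigma:.4f}")
print(f"two experiments: p = {p_two_experiments:.0e}, {sigma_two_experiments:.2f} sigma")
print(f"verification:    p = {p_verification:.0e}, {sigma_verification:.2f} sigma")
```

The slide rounds the 5σ tail probability up to 3 × 10⁻⁷, which is why it quotes 0.015 rather than the slightly smaller value computed here.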

11 Setting Limits with Nuisance Parameters
- Various techniques were presented to set limits with nuisance parameters:
  - Bayesian methods (used by CDF)
  - Profile likelihood (Rolke)
    - Method used in MINUIT (MINOS)
  - Full Neyman construction (Punzi, Cranmer)
- Coverage must be checked whatever method is chosen
  - Important for claiming 5σ discoveries at the LHC
- Comparison with the Cousins-Highland technique used at LEP

16 Bayesian with Coverage [slide from Gary Feldman, PHYSTAT 05]
- Joel Heinrich presented a decision by CDF to do Bayesian analyses with priors that cover. The advantage is Bayesian conditioning with frequentist coverage. Possibly the maximum amount of work for the experimenter.
- Example of coverage with a single Poisson with normalization and background nuisance parameters: flat priors

17 Profile Likelihood Method
- Rolke:
  - Eliminates the nuisance parameters via the profile likelihood
  - Neyman construction replaced by the -Δ ln L hill-climbing approximation
  - The same method is present in MINUIT (MINOS)
- The coverage is good, with some minor undercoverage
- Also available in ROOT in the class TRolke
[Figure: axes labelled background rate and signal rate]
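
A minimal sketch of the profile likelihood idea, under stated assumptions: a counting experiment with n events in the signal region, n ~ Poisson(s + b), and m events in a sideband, m ~ Poisson(tau · b). The model, the function names and the numbers are illustrative choices, not the TRolke implementation, and the sketch ignores the boundary cases (for example n < m/tau) that a real tool has to handle.

```python
import math

def log_likelihood(s, b, n, m, tau):
    """Poisson log-likelihood for n ~ Pois(s+b), m ~ Pois(tau*b), constants dropped."""
    return n * math.log(s + b) - (s + b) + m * math.log(tau * b) - tau * b

def profiled_background(s, n, m, tau):
    """Background rate that maximizes the likelihood at fixed signal rate s.

    Setting d(lnL)/db = 0 gives a quadratic in b; this is its positive root.
    """
    k = 1.0 + tau
    lin = k * s - n - m
    return (-lin + math.sqrt(lin * lin + 4.0 * k * m * s)) / (2.0 * k)

def q(s, n, m, tau):
    """-2 ln(lambda(s)): twice the drop in profiled log-likelihood from its maximum."""
    b_hat = m / tau          # unconditional maximum (assumes n > m/tau, m > 0)
    s_hat = n - b_hat
    best = log_likelihood(s_hat, b_hat, n, m, tau)
    return 2.0 * (best - log_likelihood(s, profiled_background(s, n, m, tau), n, m, tau))

# Toy numbers (hypothetical): 10 events observed, 10 in a sideband twice as large
n, m, tau = 10, 10, 2.0
grid = [0.01 * i for i in range(3001)]            # scan s from 0 to 30
inside = [s for s in grid if q(s, n, m, tau) <= 1.0]
lo, hi = min(inside), max(inside)                  # approximate 68% interval
print(f"s_hat = {n - m / tau:.2f}, interval = [{lo:.2f}, {hi:.2f}]")
```

The interval is read off where -2 ln λ rises by 1 (Δ ln L = 0.5), the same criterion MINOS uses for a one-standard-deviation error. Whether such an interval actually covers has to be checked separately, for example with toy experiments.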

19 Full Neyman Constructions [slide from Gary Feldman, PHYSTAT 05]
- Both Giovanni Punzi and Kyle Cranmer attempted full Neyman constructions for both signal and nuisance parameters.
- I don't recommend you try this at home, for the following reasons:
  - The ordering principle is not unique. Both Punzi and Cranmer ran into some problems.
  - The technique is not feasible for more than a few nuisance parameters.
  - It is unnecessary, since removing the nuisance parameters through profile likelihood works quite well.

21 Event Classification [slide from Gary Feldman, PHYSTAT 05]
- The problem: given a measurement of an event $X = (x_1, x_2, \ldots, x_n)$, find the function F(X) which returns 1 if the event is signal (s) and 0 if the event is background (b), to optimize a figure of merit, say signal.

22 Theoretical Solution [slide from Gary Feldman, PHYSTAT 05]
- In principle the solution is straightforward: use a Monte Carlo simulation to calculate the likelihood ratio $L_s(X)/L_b(X)$ and derive F(X) from it. By the Neyman-Pearson theorem, this is the optimum solution.
- Unfortunately, this does not work, due to the "curse of dimensionality": in a high-dimensional space even the largest data set is sparse, with the distance between neighboring events comparable to the radius of the space.

23 Practical Solution
- Use brute force from computers
  - One gives the computer samples of signal and background events and lets the computer figure out what F(X) is
- Artificial neural networks
- Decision trees
  - Interest sparked by J. Friedman's talk at Phystat03
  - Recent techniques increase the decision power by effectively combining many trees, e.g. boosted decision trees

24 Decision Tree
- Go through all PID variables and find the best variable and value at which to split the events.
- For each of the two subsets, repeat the process.
- Proceeding in this way, a tree is built.
- Ending nodes are called leaves.
[Figure: tree with background/signal nodes]
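
The splitting procedure described on this slide can be sketched in plain Python. The slide does not say which criterion defines the "best" split, so the Gini impurity used here is an assumption, as are all function names and the toy events.

```python
def gini(labels):
    """Gini impurity of a node; labels are 1 (signal) or 0 (background)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y):
    """Scan every variable and every midpoint cut; return (impurity, var, cut)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            cut = 0.5 * (lo + hi)
            left = [yi for row, yi in zip(X, y) if row[j] < cut]
            right = [yi for row, yi in zip(X, y) if row[j] >= cut]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, cut)
    return best  # None if no variable has two distinct values

def build_tree(X, y, max_depth=3):
    """Split recursively until a node is pure or max_depth is reached."""
    purity = sum(y) / len(y)  # signal fraction in this node
    split = best_split(X, y) if max_depth > 0 and 0.0 < purity < 1.0 else None
    if split is None:
        return {"leaf": True, "signal_fraction": purity}
    _, j, cut = split
    left = [(row, yi) for row, yi in zip(X, y) if row[j] < cut]
    right = [(row, yi) for row, yi in zip(X, y) if row[j] >= cut]
    return {"leaf": False, "var": j, "cut": cut,
            "left": build_tree([r for r, _ in left], [l for _, l in left], max_depth - 1),
            "right": build_tree([r for r, _ in right], [l for _, l in right], max_depth - 1)}

def classify(tree, row):
    """Follow the cuts down to a leaf and return its signal fraction."""
    while not tree["leaf"]:
        tree = tree["left"] if row[tree["var"]] < tree["cut"] else tree["right"]
    return tree["signal_fraction"]

# Toy events with two variables; only the second one separates the classes
X = [[0.1, 0.2], [0.9, 0.3], [0.4, 0.8], [0.6, 0.9], [0.2, 0.1], [0.8, 0.7]]
y = [0, 0, 1, 1, 0, 1]
split = best_split(X, y)
tree = build_tree(X, y)
print(split, classify(tree, [0.5, 0.95]))
```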

25 Rules and Bagging Trees [slide from Gary Feldman, PHYSTAT 05]
- Jerry Friedman gave a talk on rules, which effectively combines a series of trees.
- Harrison Prosper gave a talk (for Ilya Narsky) on bagging (Bootstrap AGGregatING) trees. In this technique, one builds a collection of trees by selecting a sample of the training data and, optionally, a subset of the variables.
- Results on the significance of B → … e … at BaBar:
  - Single decision tree: 2.16σ
  - Boosted decision trees: 2.62σ (not optimized)
  - Bagging decision trees: 2.99σ
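
Bagging can be illustrated with a deliberately small sketch. To keep it short, a one-variable, single-cut "stump" that assumes signal sits at high values stands in for a full decision tree, and the optional random selection of variables is left out. All names and the toy data are hypothetical.

```python
import random

def train_stump(xs, ys):
    """Best single cut on one variable: predict 1 (signal) if x >= cut."""
    values = sorted(set(xs))
    if len(values) < 2 or len(set(ys)) < 2:
        # Degenerate bootstrap sample: fall back to the majority label
        majority = int(2 * sum(ys) >= len(ys))
        return lambda x: majority
    cuts = [0.5 * (a + b) for a, b in zip(values, values[1:])]

    def errors(cut):
        return sum(int(x >= cut) != y for x, y in zip(xs, ys))

    best = min(cuts, key=errors)
    return lambda x: int(x >= best)

def bagged_classifier(xs, ys, n_trees=25, seed=0):
    """Train n_trees stumps on bootstrap samples and combine them by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        votes = sum(stump(x) for stump in stumps)
        return int(2 * votes > len(stumps))

    return predict

# Toy training sample: background (0) at low x, signal (1) at high x
xs = [1.0, 1.2, 1.5, 1.8, 2.0, 8.0, 8.3, 8.6, 8.9, 9.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predict = bagged_classifier(xs, ys)
print(predict(0.5), predict(9.5))
```

Because each stump sees a different bootstrap sample, the individual cuts fluctuate; the vote averages those fluctuations out, which is the point of the technique.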

26 Boosted Decision Trees
- Use of boosted trees in MiniBooNE (B. Roe)
  - Events misclassified by one tree are given a higher weight, and a new tree is generated
  - Repeat to generate 1000 trees
  - The final classifier is a weighted sum of all of the trees
- Comparison with artificial neural networks (ANN):
  - Boosting better than ANN by a factor of 1.2-1.8
  - More robust
[Figure: ratio of ANN to boosted-tree background events vs. % of signal retained, for 21 and 52 variables]
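
The reweighting loop described on this slide can be sketched with discrete AdaBoost. Two assumptions: the slide does not name the exact boosting variant used in MiniBooNE, and single-cut stumps on one variable stand in for full trees to keep the example short. Labels are +1 (signal) and -1 (background); the data set is a toy one that no single cut can separate.

```python
import math

def stump_predict(x, cut, sign):
    """Return sign below the cut and -sign above it."""
    return sign if x < cut else -sign

def best_stump(xs, ys, w):
    """Cut and orientation with the smallest weighted misclassification."""
    values = sorted(set(xs))
    best = None
    for cut in (0.5 * (a + b) for a, b in zip(values, values[1:])):
        for sign in (+1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if stump_predict(x, cut, sign) != y)
            if best is None or err < best[0]:
                best = (err, cut, sign)
    return best

def adaboost(xs, ys, n_rounds):
    """Return a list of (alpha, cut, sign) weak classifiers."""
    w = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(n_rounds):
        err, cut, sign = best_stump(xs, ys, w)
        err = min(max(err, 1e-12), 1.0 - 1e-12)   # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, cut, sign))
        # Misclassified events (y * h = -1) get a higher weight
        w = [wi * math.exp(-alpha * y * stump_predict(x, cut, sign))
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def classify(ensemble, x):
    """Sign of the alpha-weighted sum of the weak classifiers."""
    score = sum(a * stump_predict(x, cut, sign) for a, cut, sign in ensemble)
    return 1 if score >= 0 else -1

def training_errors(ensemble, xs, ys):
    return sum(classify(ensemble, x) != y for x, y in zip(xs, ys))

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
errors_one = training_errors(adaboost(xs, ys, 1), xs, ys)
errors_three = training_errors(adaboost(xs, ys, 3), xs, ys)
print(errors_one, errors_three)
```

One stump alone misclassifies three of the ten events; the weighted sum of three stumps classifies all of them correctly.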

27 StatPatternRecognition: A C++ package for multivariate classification [slide from Harrison Prosper, Phystat 2005]
Implemented classifiers and algorithms:
- binary split
- linear and quadratic discriminant analysis
- decision trees
- bump hunting algorithm (PRIM, Friedman & Fisher)
- AdaBoost
- bagging and random forest algorithms
- AdaBoost and Bagger are capable of boosting/bagging any classifier implemented in the package
Described in: I. Narsky, physics/0507143 and physics/0507157

28 More on classification
- Gray: how to do Bayesian optimal classification with massive datasets
  - Nonparametric Bayesian classifiers
[Figure: star density and quasar density f(x), with the optimal decision boundary]

29 Trip report (Part 2), SFT group meeting, 14th October
PHYSTAT 05, Oxford, 12th-15th September 2005
Statistical Problems in Particle Physics, Astrophysics and Cosmology

30 Outline
- Statistical software for physics
- Some new algorithms for physics
- Astronomy

31 Software for Statistics (for Physics) by Jim Linnemann (1)
- "R is a language and environment for statistical computing and graphics"
- R is the standard tool of professional research statisticians:
  - Elegant data manipulation language
  - Command prompt and macros, interpreted, no GUI yet
  - Very broad package library, trivial download and extension
- An interface between R and ROOT:
  - ROOT TTrees can be read from the R prompt
  - The reverse does not work yet

32 Software for Statistics (for Physics) by Jim Linnemann (2)
Web page of statistical resources: http://www.pa.msu.edu/people/linnemann/stat_resources.html
Contains links to:
- High energy physics analysis software
- Astrophysics analysis software
- General purpose statistical resources
- Multivariate analysis and statistical learning

33 Software for Statistics (for Physics) by Jim Linnemann (3)
- Proposed to create a physics-oriented repository of statistical software
- Now under discussion with the Fermilab Computing Division
- Hierarchy of purposes:
  - Archive for software associated with papers
  - Small packages: calculation of significance, limits, goodness-of-fit tests
  - Packages
  - Medium-sized packages: MC, TerraFerma, StatPatternRecognition
  - Component library

34 R
"Easy data analysis using R" by Marc Paterno
- R is "an implementation of the S language"
- John Chambers, the author of S, received the 1998 ACM Software System Award for "the S system, which has forever altered the way people analyze, visualize, and manipulate data... S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers."
- Available as Free Software under the GNU General Public License

35 R - statistical plots
R provides a variety of useful plot types which are not widely known in the physics community, including:
- dot plot: replacement for pie charts and bar charts
- splom: scatter plot matrix, showing all pairwise correlations for a set of variables
- box-and-whisker plot: for summary comparison of a large number of 1D distributions
- quantile and QQ plot: for sensitive comparison of two distributions
There are more special-purpose plots, and many statistical tools come with dedicated plot styles.

36 R - a boxplot
Multiple boxplots are more informative than profile histograms in the case of asymmetric distributions or outliers in the data.

37 R - a scatter plot matrix
- The scatter plot matrix is a tool for quickly identifying pairs of quantities with interesting relationships
- Interesting correlations are easily visible
- Unbinned: no features lost
[Figure: toy jet resolution simulation]

38 R - QQ plots
- Studies show that human perception is poor at evaluating similar histograms
- Quantile-quantile plots are simpler to analyze
- Even a small difference is clearly visible: the second jet's NoCs distribution has a larger high-end tail
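
The idea behind a QQ plot can be shown without any plotting: pair up the same quantiles of two samples and look for departures from the diagonal. In R the plot itself comes from `qqplot`; the Python sketch below only illustrates the underlying computation, and the two "jet" samples are made-up toy numbers, not the data from the talk.

```python
from statistics import quantiles

def qq_pairs(sample_a, sample_b, n=10):
    """Pair up the n-1 cut points (deciles for n=10) of two samples."""
    return list(zip(quantiles(sample_a, n=n), quantiles(sample_b, n=n)))

# Toy samples: identical except that the top 10% of jet2 is stretched
jet1 = [float(i) for i in range(1, 101)]
jet2 = [x if x <= 90 else 2.0 * x for x in jet1]

pairs = qq_pairs(jet1, jet2)
for q1, q2 in pairs:
    flag = "" if abs(q1 - q2) < 1e-9 else "  <- off the diagonal"
    print(f"{q1:7.1f} {q2:7.1f}{flag}")
```

Only the highest decile departs from the diagonal, which is the signature of a larger high-end tail in the second sample.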

39 R
- An R session can be saved to disk and the application state recovered at a later time
  - The saved session is platform neutral
- R can read many data formats:
  - Text files, common spreadsheet formats
  - Oracle, MySQL, SQLite or any ODBC database
  - DCOM and CORBA
  - Formats of other statistical packages
  - Even ROOT TTrees now: a local development at Fermilab allows reading "simple" trees

40 R
- Additional functionality comes in packages
- Users have all the tools to create and distribute their own packages
- Discovery and installation of new packages is easy
- A uniform documentation model is observed
- At this moment there are 590 add-on packages available in the main repository, CRAN
- Many of these packages provide not just one tool, but a large family of tools

41 Goodness-of-Fit toolkit
Maria Grazia Pia presented an update on the Goodness-of-Fit toolkit.
- Algorithms for binned distributions:
  - Anderson-Darling test
  - Chi-squared test
  - Fisz-Cramer-von Mises test
  - Tiku test (Cramer-von Mises test in chi-squared approximation)
- Algorithms for unbinned distributions:
  - Anderson-Darling test
  - Cramer-von Mises test
  - Goodman test (Kolmogorov-Smirnov test in chi-squared approximation)
  - Kolmogorov-Smirnov test
  - Kuiper test
  - Tiku test (Cramer-von Mises test in chi-squared approximation)
- Goal: provide all 2-sample GoF tests existing in the statistical literature
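
As an illustration of what a 2-sample unbinned test computes, here is a sketch of the Kolmogorov-Smirnov statistic: the largest distance between the two empirical cumulative distributions. This is not the toolkit's API; the function name is hypothetical, and the conversion of the distance into a p-value is left out.

```python
from bisect import bisect_right

def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    # The maximum is always reached at one of the observed points
    return max(abs(ecdf(a_sorted, x) - ecdf(b_sorted, x))
               for x in a_sorted + b_sorted)

print(ks_distance([1, 2, 3, 4], [3, 4, 5, 6]))
```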

42 sPlot (1)
- A statistical tool to unfold data distributions
- To be added to ROOT soon
- Several publications from BaBar use sPlot
- physics/0402083, to be published in NIM

43 sPlot (2)

44 Data sifting
- A new algorithm for outlier detection and fitting, presented by Martin Block
- To be used in the case of a Gaussian signal with Gaussian errors, with outliers "far away" from the good points (no "swamping")
- First, a Lorentzian minimization is performed for all data points
- This Lorentzian fit is then used as the initial estimate of the theoretical curve, and the chi-square of each point with respect to this curve is computed
- A cut is applied to reject the points too far from the curve, and a chi-square fit of the remaining points is performed
- Parameters and parameter errors are estimated, with a renormalization to take into account that the dataset has been truncated
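
The steps above can be sketched for the simplest possible "theoretical curve", a constant. Several things here are assumptions and not taken from the slides: the robust objective is written as the sum of ln(1 + scale · Δχ²) with scale = 0.179, the cut is placed at Δχ² = 6, the minimization is a plain grid scan, and the final renormalization of the errors for the truncated dataset is omitted. All names are hypothetical.

```python
import math

def lorentzian_objective(c, ys, sigmas, scale):
    """Robust objective: sum of ln(1 + scale * chi2_i) for a constant model c."""
    return sum(math.log(1.0 + scale * ((y - c) / s) ** 2)
               for y, s in zip(ys, sigmas))

def sift_and_fit(ys, sigmas, scale=0.179, chi2_cut=6.0, n_grid=2001):
    """Robust fit, outlier cut, then weighted chi-square fit of a constant."""
    # Step 1: Lorentzian minimization by a grid scan (assumed minimizer)
    lo, hi = min(ys), max(ys)
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    robust = min(grid, key=lambda c: lorentzian_objective(c, ys, sigmas, scale))

    # Step 2: chi-square of each point w.r.t. the robust fit, then the cut
    kept = [(y, s) for y, s in zip(ys, sigmas)
            if ((y - robust) / s) ** 2 <= chi2_cut]

    # Step 3: ordinary chi-square fit (weighted mean) of the surviving points
    weights = [1.0 / s ** 2 for _, s in kept]
    mean = sum(w * y for w, (y, _) in zip(weights, kept)) / sum(weights)
    error = 1.0 / math.sqrt(sum(weights))  # not renormalized for the truncation
    return robust, kept, mean, error

# Toy data: six good points near 10 and two far-away outliers
ys = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 14.0, 5.0]
sigmas = [0.2] * len(ys)
robust, kept, mean, error = sift_and_fit(ys, sigmas)
print(f"robust = {robust:.3f}, kept {len(kept)} points, fit = {mean:.3f} +/- {error:.3f}")
```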

45 Astrophysics

46 Similarity to HEP
HEP                      Optical Astronomy
Van de Graaff            2.5m telescopes
Cyclotrons               4m telescopes
National Labs            8m class telescopes
International (CERN)     Surveys/Time Domain
SSC vs LHC               30-100m telescopes
- Similar trends with a 20 year delay: fewer and ever bigger projects, an increasing fraction of the cost is in software, more conservative engineering
- Can the exponential continue, or will it be logistic?
- What can astronomy learn from High Energy Physics?
Alex Szalay, Johns Hopkins University

47 Why is astronomy different?
- Especially attractive to the general public
- It has no commercial value
  - No privacy concerns, results are freely shared with others
  - Great for experimenting with algorithms
- Data has more dimensions
  - Spatial, temporal, cross-correlations
- Diverse and distributed
  - Many different instruments from many different places and many different times
- Many different interesting questions
Alex Szalay, Johns Hopkins University

48 Data in astronomy
- Astronomers have a few hundred TB now
- Data doubles every year
- Data is public after 1 year
- Same access for everyone

49 Today's questions
- Discoveries need fast outlier detection
- Spatial statistics
  - Fast correlation and power spectrum codes (CMB + galaxies)
  - Cross-correlations among different surveys (sky pixelization + fast harmonic transforms on the sphere)
- Time domain:
  - Transients, supernovae, periodic variables
  - Moving objects, "killer" asteroids, Kuiper-belt objects, ...

50 Other challenges
- Statistical noise is smaller and smaller
  - Error matrix larger and larger (Planck...)
- Systematic errors becoming dominant
  - De-sensitize against known systematic errors
  - Optimal subspace filtering (...SDSS stripes...)
- Comparisons of spectra to models
  - $10^6$ spectra vs $10^8$ models (Charlot...)
- Detection of faint sources in multi-spectral images
  - How to use all information optimally (QUEST...)
- Efficient visualization of ensembles of 100M+ data points

51 Virtual Observatory
- International Virtual Observatory Alliance: formed in June 2002 with a mission to facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.
- Aim: all astronomical data accessible from a desktop

52 Virtual observatory

53 Summary for astronomy
- Databases have become an essential part of astronomy: most data access will soon be via digital archives
- Data is at separate locations, distributed worldwide and evolving in time: move the analysis, not the data!
- Good scaling of statistical algorithms is essential
- Many outstanding problems in astronomy are statistical and current techniques are inadequate: we need help!
- The Virtual Observatory is a new paradigm for doing science: the science of Data Exploration!

54 Conclusions
- The conference gave a good picture of the general trends in statistics for HEP and astronomy
- There are more interesting algorithms that their authors would like to see in ROOT; discussions are going on
- There are things that people find useful in other systems which we do not have in ROOT yet and should add in the near future
- A very interesting conference!

55 References
- Conference web site: talk slides are attached to the program
  http://www.physics.ox.ac.uk/phystat05/programme.htm
- For more information on the subjects, see the recommended readings page
  http://www.physics.ox.ac.uk/phystat05/reading.htm
- The conference proceedings are expected to be available online soon

