1 Maria Grazia Pia, INFN Genova
A Toolkit for Statistical Data Analysis
M.G. Pia
S. Donadio, F. Fabozzi, L. Lista, S. Guatelli, B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon, P. Viarengo
http://www.ge.infn.it/geant4/analysis/HEPstatistics
LCG Application Area Meeting, CERN, 5 May 2004

2 History and background

3 The motivation from Geant4
Validation of Geant4 physics models through comparison of simulation vs. experimental data or reference databases
[Figure: fluorescence spectrum from Icelandic basalt (Mars-like rock), experimental data and simulation; ESA Bepi Colombo mission to Mercury, test beam at Bessy]
[Figure: photon attenuation coefficient in Al; electromagnetic models in Geant4 (Geant4 Standard, Geant4 LowE) w.r.t. NIST reference]

4 Historical introduction to EDF tests
In 1933 Kolmogorov published a short but landmark paper in the Italian Giornale dell’Istituto degli Attuari. He formally defined the empirical distribution function (EDF) and then enquired how close this would be to the true distribution F(x), when this is continuous.
It must be noted that Kolmogorov himself regarded his paper as the solution of an interesting probability problem, following the general interest of the time, rather than a paper on statistical methodology.
After Kolmogorov's article, over a period of about 10 years, a number of distinguished mathematicians (Smirnov, Cramer, Von Mises, Anderson, Darling, …) laid the foundations of methods of testing fit to a distribution based on the EDF.
The ideas in this paper have formed a platform for a vast literature, both on interesting and important probability problems and on methods of using the Kolmogorov statistics for testing fit to a distribution. This literature continues to grow strongly today and shows no sign of decreasing.

5 Typical use cases in HEP
Regression testing
– Throughout the software life-cycle
Online DAQ
– Monitoring detector behaviour w.r.t. a reference
Simulation validation
– Comparison with experimental data
Reconstruction
– Comparison of reconstructed vs. expected distributions
Physics analysis
– Comparisons of experimental distributions (ATLAS vs. CMS Higgs?)
– Comparison with theoretical distributions (data vs. Standard Model)

6 Software tools
Commercial products used by “professional” statisticians
– SPSS, NCSS...
In HEP: a lot of activity
– workshops/conferences (CERN, Durham, SLAC etc.)
– books (F. James et al., L. Lyons, R. Barlow etc.)
– sophisticated statistical algorithms applied in various data analyses
...but, in spite of the relevant role played by statistics in HEP, very limited availability of software tools for statistics in our field
– and in open-source software in general

7 Let’s do it ourselves...
A project to develop an open-source software system for statistical analysis
Provide tools for the statistical comparison of distributions
Create a hub to aggregate expertise and collaborative contributions from scientists interested in statistical methods
See presentation at LCG-AA meeting, 27 November 2002

8 Vision: the basics
Rigorous software process
Have a vision for the project
– General purpose tool for statistical analysis
– Toolkit approach (choice open to users)
– Open source product
Build on a solid architecture
Clearly define scope, objectives
Flexible, extensible, maintainable system
Software quality

9 Architectural guidelines
The project adopts a solid architectural approach
– to offer the functionality and the quality needed by the users
– to be maintainable over a long time scale
– to be extensible, to accommodate future evolutions of the requirements
Component-based architecture
– to facilitate re-use and integration in diverse frameworks
Dependencies
– adopt a standard (AIDA) for the user layer
– no dependence on any specific analysis tool
Python
– the “glue” for interactivity
The approach adopted is compatible with the recommendations of the LCG Architecture Blueprint Report

10 Software process
Unified Software Development Process, specifically tailored to the project
– practical guidance and tools from the RUP
– both rigorous and lightweight
– mapping onto ISO 15504
– significant experience gained in the group from other projects
Incremental and iterative life-cycle model

11 The Goodness-of-Fit component

12 User Requirements
User requirements elicited, analysed and formally specified
– Functional (capability) and non-functional (constraint) requirements
– User Requirements Document available from the web site
Requirement traceability: Requirements, Design, Implementation, Test & test results, Documentation

15 Simple user layer
Shields the user from the complexity of the underlying algorithms and design
Only deal with AIDA objects and choice of comparison algorithm

16 GoF algorithms
Algorithms for binned distributions
– Anderson-Darling test
– Chi-squared test
– Fisz-Cramer-von Mises test
– Tiku test (Cramer-von Mises test in chi-squared approximation)
Algorithms for unbinned distributions
– Anderson-Darling test
– Fisz-Cramer-von Mises test
– Goodman test (Kolmogorov-Smirnov test in chi-squared approximation)
– Kolmogorov-Smirnov test
– Kuiper test
– Tiku test (Cramer-von Mises test in chi-squared approximation)

17 Chi-squared test
Applies to binned distributions
It can also be useful for unbinned distributions, but the data must be grouped into classes
Cannot be applied if the theoretical frequency in any class is < 5
– When this happens, one could try to merge contiguous classes until the minimum theoretical frequency is reached
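
The merging of contiguous classes described on this slide can be sketched as follows. This is only an illustration under stated assumptions, not the toolkit's implementation: the function names `merge_small_classes` and `chi2_statistic` are hypothetical, the merge runs left to right, and any under-populated tail is folded into the last merged class.

```python
def merge_small_classes(observed, expected, min_expected=5.0):
    """Merge contiguous classes until each has an expected count >= min_expected.

    Hypothetical helper: classes are accumulated left to right; a leftover
    tail that stays below the threshold is folded into the last merged class.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    merged_obs, merged_exp = [], []
    acc_obs = acc_exp = 0.0
    for o, e in zip(observed, expected):
        acc_obs += o
        acc_exp += e
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs = acc_exp = 0.0
    if acc_obs > 0 or acc_exp > 0:
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            # Not even the whole sample reaches the threshold.
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp


def chi2_statistic(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


obs, exp = merge_small_classes([1, 3, 10, 12, 2, 1], [2, 4, 9, 11, 3, 1])
print(obs, exp, chi2_statistic(obs, exp))
```

Merging reduces the number of classes, so the degrees of freedom used for the p-value have to be recomputed from the merged classes.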

18 More sophisticated algorithms
Unbinned distributions: supremum statistics
– Kolmogorov-Smirnov test
– Goodman approximation of KS test
– Kuiper test
[Figure: original distributions, empirical distribution function, D_mn]
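
A minimal sketch of the supremum statistic named on this slide, assuming the usual definition D_mn = sup |F_m(x) - G_n(x)| over two samples of sizes m and n. The Goodman step uses the commonly quoted form chi2 = 4 D_mn^2 mn/(m+n) with 2 degrees of freedom; that form and all function names here are assumptions for illustration, not the toolkit's API.

```python
import math


def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov distance: max |F_m(x) - G_n(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    m, n = len(a), len(b)
    if m == 0 or n == 0:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    d = 0.0
    while i < m and j < n:
        x = min(a[i], b[j])
        # Step both empirical distribution functions past x (handles ties).
        while i < m and a[i] <= x:
            i += 1
        while j < n and b[j] <= x:
            j += 1
        d = max(d, abs(i / m - j / n))
    return d


def goodman_chi2(d, m, n):
    """Goodman approximation (assumed form): 4 D^2 mn / (m + n)."""
    return 4.0 * d * d * m * n / (m + n)


def chi2_pvalue_2dof(chi2):
    """Upper-tail probability of a chi-squared variable with 2 dof: exp(-x/2)."""
    return math.exp(-chi2 / 2.0)


d = ks_distance([1, 2, 3, 4], [3, 4, 5, 6])
print(d, goodman_chi2(d, 4, 4), chi2_pvalue_2dof(goodman_chi2(d, 4, 4)))
```

With 2 degrees of freedom the p-value has the closed form exp(-chi2/2); for chi2 = 1.5 this gives about 0.472367, consistent with the value quoted in the second K-S Goodman unit test of this talk.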

19 More powerful algorithms
Tests containing a weighting function
– Cramer-von Mises test
– Anderson-Darling test
– Fisz-Cramer-von Mises test
– k-sample Anderson-Darling test
[Diagram labels: unbinned distributions, binned distributions]
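
Unlike a supremum statistic, these tests accumulate the squared EDF difference over the whole range. A hedged sketch of one common textbook form of the two-sample Cramer-von Mises statistic, T = mn/(m+n)^2 times the sum over the pooled sample of (F_m(x) - G_n(x))^2, is given below. The function name is hypothetical and the toolkit's own normalisation may differ.

```python
from bisect import bisect_right


def cvm_two_sample(sample_a, sample_b):
    """Two-sample Cramer-von Mises statistic (one common textbook form)."""
    a, b = sorted(sample_a), sorted(sample_b)
    m, n = len(a), len(b)
    if m == 0 or n == 0:
        raise ValueError("both samples must be non-empty")
    total = 0.0
    for x in a + b:  # every point of the pooled sample contributes
        f = bisect_right(a, x) / m  # EDF of the first sample at x
        g = bisect_right(b, x) / n  # EDF of the second sample at x
        total += (f - g) ** 2
    return m * n / (m + n) ** 2 * total


print(cvm_two_sample([1, 2], [3, 4]))
```

The Anderson-Darling statistic follows the same pattern but divides each squared difference by a weight that grows small in the tails, which is what makes it sensitive to tail discrepancies.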

20 Comparative documentation of tests
Test                     Power    Characteristics
Anderson-Darling         High     Sensitive to tails
χ²                       Low      General
Fisz-Cramer-von Mises    High     Symmetric, right-skewed distributions
Goodman                  Medium   Approximation of K-S to χ² test statistics
Kolmogorov-Smirnov       Medium   Derives from Kolmogorov statistics
Kuiper                   Medium   Sensitive to tails and median
Tiku                     High     Converts CvM statistics to a χ²
More about a comparative evaluation of tests in the User Documentation on our web
Topic still subject to research activity in the domain of statistics

21 Power of tests
The power of a test is the probability of rejecting the null hypothesis correctly
χ² loses information in a test for unbinned distributions by grouping the data into cells
– Kac, Kiefer and Wolfowitz (1955) showed that the Kolmogorov-Smirnov test requires n^(4/5) observations compared to n observations for χ² to attain the same power
Cramer-von Mises and Anderson-Darling statistics are expected to be superior to Kolmogorov-Smirnov’s, since they compare the two distributions all along the range of x, rather than looking for a marked difference at one point
In terms of power: χ² < supremum statistics tests < tests containing a weight function
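
The definition of power above can be made concrete with a small Monte Carlo estimate. The sketch below is illustrative only: it assumes Gaussian samples that differ by a shift, a two-sample Kolmogorov-Smirnov test at the 5% level using the asymptotic critical value 1.358 sqrt((m+n)/(mn)), and hypothetical function names.

```python
import math
import random
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov distance evaluated on the pooled sample."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_rejects(sample_a, sample_b, c_alpha=1.358):
    """Reject at about the 5% level using the asymptotic critical value."""
    m, n = len(sample_a), len(sample_b)
    return ks_distance(sample_a, sample_b) > c_alpha * math.sqrt((m + n) / (m * n))


def estimate_power(shift, size=30, trials=500, seed=12345):
    """Fraction of toy experiments in which the null hypothesis is rejected.

    For shift == 0 the null hypothesis is true, so the returned fraction
    estimates the false-rejection rate rather than the power.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(size)]
        b = [rng.gauss(shift, 1.0) for _ in range(size)]
        rejections += ks_rejects(a, b)
    return rejections / trials


for shift in (0.0, 0.5, 1.5):
    print(shift, estimate_power(shift))
```

Replacing `ks_rejects` with another test and repeating the scan over `shift` is one way to compare the power of different tests on the same alternative.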

23 Unit test: χ² (1)
Example from Piccolo book (Statistics, page 711)
The study concerns monthly birth and death distributions (binned data)
χ² test statistic = 15.8, expected χ² = 15.8
Exact p-value = 0.200758, expected p-value = 0.200757
[Figure axis: Months]

24 Unit test: χ² (2)
Example from Cramer book (Mathematical Methods of Statistics, page 447)
The study concerns the sex distribution of children born in Sweden in 1935
χ² test statistic = 123.203, expected χ² = 123.203
Exact p-value = 0, expected p-value = 0

25 Unit test: K-S Goodman (1)
Example from Piccolo book (Statistics, page 711)
The study concerns monthly birth and death distributions (unbinned data)
χ² test statistic = 3.9, expected χ² = 3.9
Exact p-value = 0.140974, expected p-value = 0.140991
[Figure: cumulative function vs. months]

26 Unit test: K-S Goodman (2)
Example from Landenna book (Nonparametric Tests Based on Frequencies, page 287)
We consider body lengths of two independent groups of anopheles
χ² test statistic = 1.5, expected χ² = 1.5
Exact p-value = 0.472367, expected p-value = 0.472367
[Figure axis: Body lengths]

27 Unit test: Kolmogorov-Smirnov (1)
Example from http://www.physics.csbsju.edu/stats/KS-test.html
The study concerns how long a bee stays near a particular tree (Redwell/Whitney)
D test statistic = 0.2204, expected D = 0.2204
Exact p-value = 0.0354675, expected p-value = 0.035
[Figure: cumulative distributions]

28 Unit test: Kolmogorov-Smirnov (2)
Example from Landenna book (Nonparametric Statistical Methods, pages 318-325)
We consider one clinical parameter of two independent groups of patients
D test statistic = 0.65, expected D = 0.65
Exact p-value = 2 × 10^-19, expected p-value = 8 × 10^-19
[Figure: cumulative distributions]

29 Example of application results
[Figure: fluorescence spectrum from Icelandic basalt (Mars-like rock), experimental data and simulation; ESA Bepi Colombo mission to Mercury, test beam at Bessy]
Anderson-Darling: A_c (95%) = 0.752
[Figure: photon attenuation coefficient in Al; electromagnetic models in Geant4 (Geant4 Standard, Geant4 LowE) w.r.t. NIST reference]
χ²(N-L) = 13.1, ν = 20, p = 0.87
χ²(N-S) = 23.2, ν = 15, p = 0.08

30 Latest release: 30 March 2004
GPL License

31 User Documentation
– Download
– Installation
– User Guide
– Statistics Reference Guide

32 A toolkit for modeling multi-parametric fit problems
F. Fabozzi, L. Lista (INFN Napoli)
Initially developed while rewriting a Fortran fitter for a BaBar analysis
– Simultaneous estimate of B(B± → J/ψ π±) / B(B± → J/ψ K±) and of the direct CP asymmetry
– More control over the code was needed to explain a bias that had appeared in the original fitter

33 Requirements
Provide tools for modeling parametric fit problems
Unbinned Maximum Likelihood (UML [*]) fit of:
– PDF parameters
– Yields of different sub-samples
– Both, mixed
χ² fits
Toy Monte Carlo to study the fit properties
– Fitted parameter distributions
– Pulls, bias, confidence level of fit results
New components included in the Statistical Toolkit
Architecture open to extension and evolution
[*] not Unified Modeling Language…
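
The toy Monte Carlo study of pulls and bias listed in these requirements can be sketched as follows. This is a deliberately simple stand-in, not the fitting toolkit itself: it assumes a Gaussian PDF with known width, for which the unbinned maximum likelihood estimate of the mean is the sample mean, and all names and default values are hypothetical.

```python
import math
import random
import statistics


def toy_pulls(true_mean=2.0, sigma=1.5, n_events=100, n_toys=2000, seed=2004):
    """Generate toy experiments and return the pull of each fitted mean.

    pull = (fitted value - true value) / uncertainty on the fitted value.
    With known sigma the uncertainty on the mean is sigma / sqrt(n_events).
    """
    rng = random.Random(seed)
    sigma_mean = sigma / math.sqrt(n_events)
    pulls = []
    for _ in range(n_toys):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n_events)]
        fitted_mean = sum(sample) / len(sample)  # ML estimate of a Gaussian mean
        pulls.append((fitted_mean - true_mean) / sigma_mean)
    return pulls


pulls = toy_pulls()
print("pull mean :", statistics.mean(pulls))
print("pull width:", statistics.pstdev(pulls))
```

For an unbiased fit with correctly estimated uncertainties the pull distribution is centred on 0 with unit width; a shifted mean signals a bias, and a width different from 1 signals mis-estimated errors.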

34 For LCG users
The Statistical Toolkit is distributed with PI as an external product
– Currently the previous release, not yet the latest, is distributed
– Update foreseen
Integration in the Savannah system for problem reporting foreseen
Open to collaboration to facilitate the usage in the LCG community
– feedback, user requirements, suggestions are welcome, of course!
Please contact Andreas.Pfeiffer@cern.ch for further information about the Statistical Toolkit in the PI distribution

35 References
Conference Proceedings:
– PhyStat Conference, SLAC, 2003
– IEEE Nuclear Science Symposium, Portland, 2003
Papers:
– S. Donadio et al., A toolkit for statistical data comparison, to be published in IEEE Trans. Nucl. Sci. (August 2004)
More papers in preparation
References kept up-to-date on the web site

36 http://www.ge.infn.it/geant4/analysis/HEPstatistics/
Will be moved to a new area out of Geant4-INFN web (automatic re-direction)

37 Acknowledgments
Work supported and partially funded by the European Space Agency (ESA) under Contract No. 16339/02/NL/FM
Geant4 beta testing
– P. Cirrone (INFN-LNS), S. Guatelli (INFN Genova), S. Parlati (INFN-LNGS)
Fred James (CERN) and Louis Lyons (Oxford)
– many useful suggestions, discussions, encouragement...

38 Conclusions
A project to develop an open source, general purpose software toolkit for statistical data analysis is in progress
– to provide a product of common interest to user communities
Rigorous software process
– to contribute to the quality of the product
Component-based architecture, OO methods + generic programming
– to ensure openness to evolution, maintainability, ease of use
GoF component
Component for modeling multi-parametric fit problems
Software released and results available
– toolkit in use for Geant4 physics validation
– incremental and iterative life-cycle

