RooFit A tool kit for data modeling in ROOT

Slides:



Advertisements
Similar presentations
Statistical Methods for Data Analysis Modeling PDF’s with RooFit
Advertisements

Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
S.Towers TerraFerMA TerraFerMA A Suite of Multivariate Analysis tools Sherry Towers SUNY-SB Version 1.0 has been released! useable by anyone with access.
Programming Paradigms and languages
Structured Design. 2 Design Quality – Simplicity “There are two ways of constructing a software design: One is to make it so simple that there are obviously.
Wouter Verkerke, NIKHEF RooFit A tool kit for data modeling in ROOT (W. Verkerke, D. Kirkby) RooStats A tool kit for statistical analysis (K. Cranmer,
MCTS GUIDE TO MICROSOFT WINDOWS 7 Chapter 10 Performance Tuning.
FIN 685: Risk Management Topic 5: Simulation Larry Schrenk, Instructor.
1/15 Sensitivity to  with B  D(KK  )K Decays CP Working Group Meeting - Thursday, 19 th April 2007 Introduction B  DK  Dalitz Analysis Summary.
Slide 1 Statistics for HEP Roger Barlow Manchester University Lecture 3: Estimation.
G. Cowan Lectures on Statistical Data Analysis 1 Statistical Data Analysis: Lecture 8 1Probability, Bayes’ theorem, random variables, pdfs 2Functions of.
G. Cowan Lectures on Statistical Data Analysis Lecture 10 page 1 Statistical Data Analysis: Lecture 10 1Probability, Bayes’ theorem 2Random variables and.
Open and save files directly from Word, Excel, and PowerPoint No more flash drives or sending yourself documents via Stop manually merging versions.
Statistical aspects of Higgs analyses W. Verkerke (NIKHEF)
1 CE 530 Molecular Simulation Lecture 7 David A. Kofke Department of Chemical Engineering SUNY Buffalo
1 Statistical Mechanics and Multi- Scale Simulation Methods ChBE Prof. C. Heath Turner Lecture 11 Some materials adapted from Prof. Keith E. Gubbins:
Overview G. Jogesh Babu. Probability theory Probability is all about flip of a coin Conditional probability & Bayes theorem (Bayesian analysis) Expectation,
Statistics software for the LHC
MCTS Guide to Microsoft Windows 7
Statistical Methods for Data Analysis Parameter estimates with RooFit Luca Lista INFN Napoli.
880.P20 Winter 2006 Richard Kass 1 Confidence Intervals and Upper Limits Confidence intervals (CI) are related to confidence limits (CL). To calculate.
RooFit/RooStats Tutorial CAT Meeting, June 2009
RooFit – Open issues W. Verkerke. Datasets Current class structure Data representation –RooAbsData (abstract base class) –RooDataSet (unbinned [weighted]
G. Cowan Lectures on Statistical Data Analysis Lecture 3 page 1 Lecture 3 1 Probability (90 min.) Definition, Bayes’ theorem, probability densities and.
RooFit A tool kit for data modeling in ROOT
1 The Software Development Process  Systems analysis  Systems design  Implementation  Testing  Documentation  Evaluation  Maintenance.
Analysis of E852 Data at Indiana University Ryan Mitchell PWA Workshop Pittsburgh, PA February 2006.
Wouter Verkerke, UCSB Limits on the Lifetime Difference  of Neutral B Mesons and CP, T, and CPT Violation in B 0 B 0 mixing Wouter Verkerke (UC Santa.
1 Computer Programming (ECGD2102 ) Using MATLAB Instructor: Eng. Eman Al.Swaity Lecture (1): Introduction.
Copyright © 2006 The McGraw-Hill Companies, Inc. Permission required for reproduction or display. 1 Chapter 2.
1 Lesson 8: Basic Monte Carlo integration We begin the 2 nd phase of our course: Study of general mathematics of MC We begin the 2 nd phase of our course:
Wouter Verkerke, NIKHEF RooFit A tool kit for data modeling in ROOT Wouter Verkerke (NIKHEF) David Kirkby (UC Irvine)
Nick Draper 05/11/2008 Mantid Manipulation and Analysis Toolkit for ISIS data.
RooFit – tools for ML fits Authors: Wouter Verkerke and David Kirkby.
SE: CHAPTER 7 Writing The Program
Outline 3  PWA overview Computational challenges in Partial Wave Analysis Comparison of new and old PWA software design - performance issues Maciej Swat.
Chapter 8 Object Design Reuse and Patterns. Object Design Object design is the process of adding details to the requirements analysis and making implementation.
_______________________________________________________________CMAQ Libraries and Utilities ___________________________________________________Community.
RooUnfold unfolding framework and algorithms Tim Adye Rutherford Appleton Laboratory ATLAS RAL Physics Meeting 20 th May 2008.
RooUnfold unfolding framework and algorithms Tim Adye Rutherford Appleton Laboratory Oxford ATLAS Group Meeting 13 th May 2008.
Statistical Methods for Data Analysis Introduction to the course Luca Lista INFN Napoli.
The Software Development Process
Parallelization of likelihood functions for data analysis Alfio Lazzaro CERN openlab Forum on Concurrent Programming Models and Frameworks.
1 CSCD 326 Data Structures I Software Design. 2 The Software Life Cycle 1. Specification 2. Design 3. Risk Analysis 4. Verification 5. Coding 6. Testing.
Chapter 3 System Performance and Models Introduction A system is the part of the real world under study. Composed of a set of entities interacting.
Mantid Stakeholder Review Nick Draper 01/11/2007.
OPERATING SYSTEMS CS 3530 Summer 2014 Systems and Models Chapter 03.
Wouter Verkerke, NIKHEF RooFit – status & plans Wouter Verkerke (NIKHEF)
1 Methods of Experimental Particle Physics Alexei Safonov Lecture #24.
Statistical feature extraction, calibration and numerical debugging Marian Ivanov.
Banaras Hindu University. A Course on Software Reuse by Design Patterns and Frameworks.
1 The Software Development Process ► Systems analysis ► Systems design ► Implementation ► Testing ► Documentation ► Evaluation ► Maintenance.
8th December 2004Tim Adye1 Proposal for a general-purpose unfolding framework in ROOT Tim Adye Rutherford Appleton Laboratory BaBar Statistics Working.
G. Cowan Lectures on Statistical Data Analysis Lecture 9 page 1 Statistical Data Analysis: Lecture 9 1Probability, Bayes’ theorem 2Random variables and.
A different cc/nc oscillation analysis Peter Litchfield  The Idea:  Translate near detector events to the far detector event-by-event, incorporating.
Wouter Verkerke, UCSB RooFitTools A general purpose tool kit for data modeling, developed in BaBar Wouter Verkerke (UC Santa Barbara) David Kirkby (Stanford.
G. Cowan Lectures on Statistical Data Analysis Lecture 10 page 1 Statistical Data Analysis: Lecture 10 1Probability, Bayes’ theorem 2Random variables and.
1 G4UIRoot Isidro González ALICE ROOT /10/2002.
Conditional Observables Joe Tuggle BaBar ROOT/RooFit Workshop December 2007.
Computational Intelligence: Methods and Applications Lecture 26 Density estimation, Expectation Maximization. Włodzisław Duch Dept. of Informatics, UMK.
RooFit Tutorial – Topical Lectures June 2007
Hands-on exercises *. Getting started – ROOT 5.25/02 setup Start a ROOT 5.25/02 session –On your local laptop installation, or –On lxplus (SLC4) or lx64slc5.
Software - RooFit/RooStats W. Verkerke Wouter Verkerke, NIKHEF What is it Where is it used Experience, Lessons and Issues.
Wouter Verkerke, UCSB Data Analysis Exercises - Day 2 Wouter Verkerke (NIKHEF)
Analysis Tools interface - configuration Wouter Verkerke Wouter Verkerke, NIKHEF 1.
RooFit A tool kit for data modeling in ROOT
(Day 3).
OPERATING SYSTEMS CS 3502 Fall 2017
RooFit A general purpose tool kit for data modeling
Statistical Methods for Data Analysis Modeling PDF’s with RooFit
Presentation transcript:

RooFit A tool kit for data modeling in ROOT Wouter Verkerke (NIKHEF) David Kirkby (UC Irvine) Wouter Verkerke, NIKHEF

Focus: coding a probability density function Focus on one practical aspect of many data analysis in HEP: How do you formulate your p.d.f. in ROOT For ‘simple’ problems (gauss, polynomial), ROOT built-in models well sufficient But if you want to do unbinned ML fits, use non-trivial functions, or work with multidimensional functions you are quickly running into trouble Wouter Verkerke, NIKHEF

A recent example BaBar experiment at SLAC: Extract sin(2b) from time dependent CP violation of B decay: e+e-  Y(4s)  BB Reconstruct both Bs, measure decay time difference Physics of interest is in decay time dependent oscillation Many issues arise Standard ROOT function framework clearly insufficient to handle such complicated functions  must develop new framework Normalization of p.d.f. not always trivial to calculate  may need numeric integration techniques Unbinned fit, >2 dimensions, many events  computation performance important  must try optimize code for acceptable performance Simultaneous fit to control samples to account for detector performance Wouter Verkerke, NIKHEF

A recent example Initial approach BaBar: write it from scratch (in FORTRAN!) Does its designated job quite well, but took a long time to develop Possible because sin(2b) effort supported by O(50) people. Optimization of ML calculations hand-coded  error prone and not easy to maintain Difficult to transfer knowledge/code from one analysis to another. A better solution: A modeling language in C++ that integrates seamlessly into ROOT Recycle code and knowledge Development of RooFit package Started 5 years ago. Very successful  virtually everybody in BaBar uses it  now in standard ROOT distribution Wouter Verkerke, NIKHEF

RooFit – a data modeling language for ROOT Extension to ROOT – (Almost) no overlap with existing functionality C++ command line interface & macros Data management & histogramming Graphics interface I/O support MINUIT ToyMC data Generation Data/Model Fitting Data Modeling Model Visualization Wouter Verkerke, NIKHEF

Data modeling - Desired functionality Building/Adjusting Models Easy to write basic PDFs ( normalization) Easy to compose complex models (modular design) Reuse of existing functions Flexibility – No arbitrary implementation-related restrictions A n a l y s i s w o r k c y c l e Using Models Fitting : Binned/Unbinned (extended) MLL fits, Chi2 fits Toy MC generation: Generate MC datasets from any model Visualization: Slice/project model & data in any possible way Speed – Should be as fast or faster than hand-coded model Wouter Verkerke, NIKHEF

Data modeling – OO representation Idea: represent math symbols as C++ objects Result: 1 line of code per symbol in a function (the C++ constructor) rather than 1 line of code per function Mathematical concept RooFit class variable RooRealVar function RooAbsReal PDF RooAbsPdf space point RooArgSet integral RooRealIntegral list of space points RooAbsData Wouter Verkerke, NIKHEF

Data modeling – Constructing composite objects Straightforward correlation between mathematical representation of formula and RooFit code Math  RooGaussian g RooFit diagram  RooRealVar x RooRealVar m RooFormulaVar sqrts    RooRealVar s RooFit code  RooRealVar x(“x”,”x”,-10,10) ;  RooRealVar m(“m”,”mean”,0) ;  RooRealVar s(“s”,”sigma”,2,0,10) ;  RooFormulaVar sqrts(“sqrts”,”sqrt(s)”,s) ;  RooGaussian g(“g”,”gauss”,x,m,sqrts) ; Wouter Verkerke, NIKHEF

Model building – (Re)using standard components RooFit provides a collection of compiled standard PDF classes RooBMixDecay Physics inspired ARGUS,Crystal Ball, Breit-Wigner, Voigtian, B/D-Decay,…. RooPolynomial RooHistPdf Non-parametric Histogram, KEYS RooArgusBG RooGaussian Basic Gaussian, Exponential, Polynomial,… Chebychev polynomial Easy to extend the library: each p.d.f. is a separate C++ class Wouter Verkerke, NIKHEF

Importance of a good and extendable model library If ‘new’ p.d.f.s are made available in a easy-to-use form, more people will use them Example 1: non-parametric kernel estimation technique See article by Kyle Cranmer People like it, but are put off by the prospect of thoroughly reading the entire article, coding the concept from scratch, and then using it. In RooFit switching from a histogram-based shape to a KEYS shape is a one-line code change (RooHistPdf  RooKeysPdf) Example 2: Chebychev polynomials Its easy to convince people to try them if all they need to do is a one-line code change (RooPolynomial  RooChebychev) Wouter Verkerke, NIKHEF

Model building – (Re)using standard components Library p.d.f.s can be adjusted on the fly. Just plug in any function expression you like as input variable Works universally, even for classes you write yourself Maximum flexibility of library shapes keeps library small g(x;m,s) m(y;a0,a1) g(x,y;a0,a1,s) RooPolyVar m(“m”,y,RooArgList(a0,a1)) ; RooGaussian g(“g”,”gauss”,x,m,s) ; Wouter Verkerke, NIKHEF

Handling of p.d.f normalization Normalization of (component) p.d.f.s to unity is often a good part of the work of writing a p.d.f. RooFit handles most normalization issues transparently to the user P.d.f can advertise (multiple) analytical expressions for integrals When no analytical expression is provided, RooFit will automatically perform numeric integration to obtain normalization More complicated that it seems: even if gauss(x,m,s) can be integrated analytically over x, gauss(f(x),m,s) cannot. Such use cases are automatically recognized. Multi-dimensional integrals can be combination of numeric and p.d.f-provided analytical partial integrals Variety of numeric integration techniques is interfaced Adaptive trapezoid, Gauss-Kronrod, VEGAS MC… User can override parameters globally or per p.d.f. as necessary Wouter Verkerke, NIKHEF

Model building – (Re)using standard components Most physics models can be composed from ‘basic’ shapes RooBMixDecay RooPolynomial RooHistPdf RooArgusBG RooGaussian + RooAddPdf Wouter Verkerke, NIKHEF

Model building – (Re)using standard components Most physics models can be composed from ‘basic’ shapes RooBMixDecay RooProdPdf h(“h”,”h”, RooArgSet(f,g)) RooPolynomial RooHistPdf RooArgusBG RooProdPdf k(“k”,”k”,g, Conditional(f,x)) RooGaussian * RooProdPdf Wouter Verkerke, NIKHEF

Using models - Overview All RooFit models provide universal and complete fitting and Toy Monte Carlo generating functionality Model complexity only limited by available memory and CPU power Fitting/plotting a 5-D model as easy as using a 1-D model Most operations are one-liners Fitting Generating data = gauss.generate(x,1000) RooAbsPdf gauss.fitTo(data) RooDataSet RooAbsData Wouter Verkerke, NIKHEF

Fitting functionality pdf.fitTo(data) Binned or unbinned ML fit? For RooFit this is the essentially the same! Nice example of C++ abstraction through inheritance Fitting interface takes abstract RooAbsData object Supply unbinned data object (created from TTree)  unbinned fit Supply histogram object (created from THx)  binned fit Binned Unbinned x y z 1 3 5 2 4 6 RooDataSet RooDataHist RooAbsData Wouter Verkerke, NIKHEF

Automated vs. hand-coded optimization of p.d.f. Automatic pre-fit PDF optimization Prior to each fit, the PDF is analyzed for possible optimizations Optimization algorithms: Detection and precalculation of constant terms in any PDF expression Function caching and lazy evaluation Factorization of multi-dimensional problems where ever possible Optimizations are always tailored to the specific use in each fit. Possible because OO structure of p.d.f. allows automated analysis of structure No need for users to hard-code optimizations Keeps your code understandable, maintainable and flexible without sacrificing performance Optimization concepts implemented by RooFit are applied consistently and completely to all PDFs Speedup of factor 3-10 reported in realistic complex fits Fit parallelization on multi-CPU hosts Option for automatic parallelization of fit function on multi-CPU hosts (no explicit or implicit support from user PDFs needed) Wouter Verkerke, NIKHEF

MC Event generation pdf.generate(vars,nevt) Sample “toy Monte Carlo” samples from any PDF Accept/reject sampling method used by default MC generation method automatically optimized PDF can advertise internal generator if more efficient methods exists (e.g. Gaussian) Each generator request will use the most efficient accept-reject / internal generator combination available Operator PDFs (sum,product,…) distribute generation over components whenever possible Generation order for products with conditional PDFs is sorted out automatically Possible because OO structure of p.d.f. allows automated analysis of structure Can also specify template data as input to generator Look more carefully at accept/reject side bar pdf.generate(vars,data) Wouter Verkerke, NIKHEF

Using models – Plotting Model visualization geared towards ‘publication plots’ not interactive browsing  emphasis on 1-dimensional plots Simplest case: plotting a 1-D model over data Modular structure of composite p.d.f.s allows easy access to components for plotting Can show Poisson confidence intervals instead of sqrt(N) errors RooPlot* frame = mes.frame() ; data->plotOn(frame) ; pdf->plotOn(frame) ; pdf->plotOn(frame,Components(“bkg”)) frame->Draw() ; Adaptive spacing of curve points to achieve 1‰ precision Can store plot with data and all curves as single object Wouter Verkerke, NIKHEF

Visualizing multi-dimensional models IMHO: Visualization of multi-dim. models much more complicated than visualization of multi-dim. data Simplest scenario: projection plots Simple example M(m,t) = f[S(m)*T(t)]+(1-f)[B(m)*U(t)] Provide intuitive behavior: projection of p.d.f. automatically follows projection of data: 1-D plot frame remembers ‘hidden’ variables in dataset. Subsequent p.d.f.s in same frame are automatically projected Works identically for non-factorizable p.d.f.s! Necessary integrals are calculated automatically either analytical or numerically mB distribution Dt distribution RooPlot* frame = dt.frame() ; data.plotOn(frame) ; model.plotOn(frame) ; model.plotOn(frame,Components(“bkg”)) ;  Wouter Verkerke, NIKHEF

Using models – plotting multi-dimensional models Projection plot usually doesn’t capture full information content of your p.d.f.  equally easy to plot a ‘slice’ mB distribution Dt distribution RooPlot* frame = dt.frame() ; data.plotOn(frame) ; model.plotOn(frame) ; model.plotOn(frame,Components(“bkg”)) ;  RooPlot* frame = dt.frame() ; dt.setRange(“sel”,5.27,5.30) ; data.plotOn(frame,CutRange(“sel”)) ; model.plotOn(frame,ProjectionRange(“sel”)); model.plotOn(frame,ProjectionRange(“sel”), Components(“bkg”)) ; More complicated ‘slice’ views such as a likelihood ratio plot only take O(3) lines of extra code Wouter Verkerke, NIKHEF

Using models – plotting Can also use template data to show ‘average’ curve Convenient for visualization of conditional p.d.f.s Example plot t distribution of following model over data, averaging over dt values of data model.plotOn(frame,ProjWData(dtData)) ; model.plotOn(frame,Components("bkg"), ProjWData(dtData),LineStyle(kDashed)) ; Wouter Verkerke, NIKHEF

Advanced features – Task automation Support for routine task automation, e.g. useful in finding confidence intervals Accumulate fit statistics Input model Generate toy MC Fit model Distribution of - parameter values - parameter errors - parameter pulls Repeat N times // Instantiate MC study manager RooMCStudy mgr(inputModel) ; // Generate and fit 100 samples of 1000 events mgr.generateAndFit(100,1000) ; // Plot distribution of sigma parameter mgr.plotParam(sigma)->Draw() Wouter Verkerke, NIKHEF

RooFit at SourceForge - roofit.sourceforge.net RooFit moved to SourceForge to facilitate access and communication with non-BaBar users Code access CVS repository via pserver File distribution sets for production versions Wouter Verkerke, NIKHEF

RooFit at SourceForge - Documentation Five separate tutorials More than 250 slides and 20 macros in total Documentation Comprehensive set of tutorials (PPT slide show + example macros) Also available by end of 2005: Pedagogical User Guide + Reference Guide (~100 pages document) Class reference in THtml style Wouter Verkerke, NIKHEF

Selection of BaBar publications in 2004 using RooFit Improved Measurement of the CKM angle alpha using B0r+r- Measurement of branching fractions and charge asymmetries in B+ decays to hp+, hK+, hr+ and h'p+, and search for B0 decays to hK0 and hw Branching Fraction and CP Asymmetries in B0KSKSKS Measurement of CP Asymmetries in B0fK0 and B0K+K-K0S Measurement of Branching Fractions and Time-Dependent CP-Violating Asymmetries in Bh' K Decays Improved Measurement of Time-Dependent CP Violation in B0 to (ccbar)K0 Decays (‘sin2b’) Measurements of the Branching Fraction and CP-Violating Asymmetries in B0f0(980)KS Decays Measurement of Time-dependent CP-Violating Asymmetries in B0K*g, K*KSp0 Decays Study of the decay B0r+r- and constraints on the CKM angle alpha. Measurement of CP-violating Asymmetries in B0K0Sp0 Decays Measurement of Time-Dependent CP Asymmetries in B0fK0 Search for B+/-  [K-/+ p+/-]D K+/- and upper limit on the bu amplitude in B+/-  DK+/- Limits on the Decay-Rate Difference of Neutral B Mesons and on CP, T, and CPT Violation in B0B0bar Oscillations Wouter Verkerke, NIKHEF

Summary RooFit adds a powerful data modeling language to ROOT Mathematical objects (variables, functions…) represented as C++ objects Complex models are easily composed from library of standard components New fundamental components can be written easily Powerful tools for fitting, Toy MC generation and visualization are easy to use Compact syntax Universal functionality (almost no arbitrary / implementation related restrictions) Automated function optimization analysis and implementation delivers industrial strength performance RooFit makes the road from a ‘simple’ to a really elaborate/complex p.d.f. less steep: RooFit is distributed with ROOT starting with ROOT version v5.02-00 Wouter Verkerke, NIKHEF