RooFit A tool kit for data modeling in ROOT

Slides:



Advertisements
Similar presentations
Testing Relational Database
Advertisements

Statistical Methods for Data Analysis Modeling PDF’s with RooFit
Construction process lasts until coding and testing is completed consists of design and implementation reasons for this phase –analysis model is not sufficiently.
S.Towers TerraFerMA TerraFerMA A Suite of Multivariate Analysis tools Sherry Towers SUNY-SB Version 1.0 has been released! useable by anyone with access.
Wouter Verkerke, NIKHEF RooFit A tool kit for data modeling in ROOT (W. Verkerke, D. Kirkby) RooStats A tool kit for statistical analysis (K. Cranmer,
Maximum Likelihood And Expectation Maximization Lecture Notes for CMPUT 466/551 Nilanjan Ray.
Copyright 2000, Stephan Kelley1 Estimating User Interface Effort Using A Formal Method By Stephan Kelley 16 November 2000.
MATLAB Presented By: Nathalie Tacconi Presented By: Nathalie Tacconi Originally Prepared By: Sheridan Saint-Michel Originally Prepared By: Sheridan Saint-Michel.
DA/SAT Training Course, March 2006 Variational Quality Control Erik Andersson Room: 302 Extension: 2627
Introduction To System Analysis and Design
David Adams ATLAS DIAL Distributed Interactive Analysis of Large datasets David Adams BNL March 25, 2003 CHEP 2003 Data Analysis Environment and Visualization.
Part 4 b Forward-Backward Algorithm & Viterbi Algorithm CSE717, SPRING 2008 CUBS, Univ at Buffalo.
G. Cowan Lectures on Statistical Data Analysis 1 Statistical Data Analysis: Lecture 8 1Probability, Bayes’ theorem, random variables, pdfs 2Functions of.
RooFit A tool kit for data modeling in ROOT
July 3, Department of Computer and Information Science (IDA) Linköpings universitet, Sweden Minimal sufficient statistic.
Multiple Data Structuring I. N. Skopin Possibilities of working with data possessing several structures are discussed. Such work is shown to require development.
Statistical analysis tools for the Higgs discovery and beyond
C++ fundamentals.
Statistical aspects of Higgs analyses W. Verkerke (NIKHEF)
Introduction 01_intro.ppt
Overview G. Jogesh Babu. Probability theory Probability is all about flip of a coin Conditional probability & Bayes theorem (Bayesian analysis) Expectation,
Statistics software for the LHC
Statistical Methods for Data Analysis Parameter estimates with RooFit Luca Lista INFN Napoli.
880.P20 Winter 2006 Richard Kass 1 Confidence Intervals and Upper Limits Confidence intervals (CI) are related to confidence limits (CL). To calculate.
RooFit/RooStats Tutorial CAT Meeting, June 2009
RooFit – Open issues W. Verkerke. Datasets Current class structure Data representation –RooAbsData (abstract base class) –RooDataSet (unbinned [weighted]
G. Cowan Lectures on Statistical Data Analysis Lecture 3 page 1 Lecture 3 1 Probability (90 min.) Definition, Bayes’ theorem, probability densities and.
Introduction To System Analysis and Design
N ATIONAL E NERGY R ESEARCH S CIENTIFIC C OMPUTING C ENTER Charles Leggett A Lightweight Histogram Interface Layer CHEP 2000 Session F (F320) Thursday.
Wouter Verkerke, NIKHEF RooFit A tool kit for data modeling in ROOT Wouter Verkerke (NIKHEF) David Kirkby (UC Irvine)
Nick Draper 05/11/2008 Mantid Manipulation and Analysis Toolkit for ISIS data.
RooFit – tools for ML fits Authors: Wouter Verkerke and David Kirkby.
SE: CHAPTER 7 Writing The Program
RooUnfold unfolding framework and algorithms Tim Adye Rutherford Appleton Laboratory ATLAS RAL Physics Meeting 20 th May 2008.
RooUnfold unfolding framework and algorithms Tim Adye Rutherford Appleton Laboratory Oxford ATLAS Group Meeting 13 th May 2008.
Statistical Methods for Data Analysis Introduction to the course Luca Lista INFN Napoli.
Parallelization of likelihood functions for data analysis Alfio Lazzaro CERN openlab Forum on Concurrent Programming Models and Frameworks.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University IWPSE 2003 Program.
OPERATING SYSTEMS CS 3530 Summer 2014 Systems and Models Chapter 03.
Written by Changhyun, SON Chapter 5. Introduction to Design Optimization - 1 PART II Design Optimization.
Introduction to RooFit W. Verkerke (NIKHEF) 1.Introduction and overview 2.Creation and basic use of models 3.Composing models 4.Working with (profile)
Wouter Verkerke, NIKHEF RooFit – status & plans Wouter Verkerke (NIKHEF)
1 Methods of Experimental Particle Physics Alexei Safonov Lecture #24.
Banaras Hindu University. A Course on Software Reuse by Design Patterns and Frameworks.
G. Cowan Lectures on Statistical Data Analysis Lecture 9 page 1 Statistical Data Analysis: Lecture 9 1Probability, Bayes’ theorem 2Random variables and.
Wouter Verkerke, UCSB RooFitTools A general purpose tool kit for data modeling, developed in BaBar Wouter Verkerke (UC Santa Barbara) David Kirkby (Stanford.
G. Cowan Lectures on Statistical Data Analysis Lecture 10 page 1 Statistical Data Analysis: Lecture 10 1Probability, Bayes’ theorem 2Random variables and.
Conditional Observables Joe Tuggle BaBar ROOT/RooFit Workshop December 2007.
Computational Intelligence: Methods and Applications Lecture 26 Density estimation, Expectation Maximization. Włodzisław Duch Dept. of Informatics, UMK.
Getting started – ROOT setup Start a ROOT 5.34/17 or higher session Load the roofit libraries If you see a message that RooFit v3.60 is loaded you are.
RooFit Tutorial – Topical Lectures June 2007
Hands-on exercises *. Getting started – ROOT 5.25/02 setup Start a ROOT 5.25/02 session –On your local laptop installation, or –On lxplus (SLC4) or lx64slc5.
Introduction to RooFit
Software - RooFit/RooStats W. Verkerke Wouter Verkerke, NIKHEF What is it Where is it used Experience, Lessons and Issues.
Density Estimation in R Ha Le and Nikolaos Sarafianos COSC 7362 – Advanced Machine Learning Professor: Dr. Christoph F. Eick 1.
Wouter Verkerke, UCSB Data Analysis Exercises - Day 2 Wouter Verkerke (NIKHEF)
Analysis Tools interface - configuration Wouter Verkerke Wouter Verkerke, NIKHEF 1.
Max Baak (CERN) 1 Summary of experiences with HistFactory and RooStats Max Baak (CERN) (on behalf of list of people) RooFit / RooStats meeting January.
RooFit A tool kit for data modeling in ROOT
Hands-on Session RooStats and TMVA Exercises
(Day 3).
Chapter 7. Classification and Prediction
Distributed Shared Memory
CMS RooStats Higgs Combination Package
RooFit A general purpose tool kit for data modeling
Chapter 2: Operating-System Structures
Statistical Methods for Data Analysis Parameter estimates with RooFit
Statistical Methods for Data Analysis Modeling PDF’s with RooFit
Chapter 2: Operating-System Structures
Presentation transcript:

RooFit A tool kit for data modeling in ROOT Wouter Verkerke (NIKHEF) 1 Wouter Verkerke, NIKHEF

Introduction What is RooFit ? 1 2 Wouter Verkerke, NIKHEF

Focus: coding a probability density function Focus on one practical aspect of many data analysis in HEP: How do you formulate your p.d.f. in ROOT For ‘simple’ problems (gauss, polynomial), ROOT built-in models well sufficient But if you want to do unbinned ML fits, use non-trivial functions, or work with multidimensional functions you are quickly running into trouble 3 Wouter Verkerke, NIKHEF

Why RooFit was developed BaBar experiment at SLAC: Extract sin(2b) from time dependent CP violation of B decay: e+e-  Y(4s)  BB Reconstruct both Bs, measure decay time difference Physics of interest is in decay time dependent oscillation Many issues arise Standard ROOT function framework clearly insufficient to handle such complicated functions  must develop new framework Normalization of p.d.f. not always trivial to calculate  may need numeric integration techniques Unbinned fit, >2 dimensions, many events  computation performance important  must try optimize code for acceptable performance Simultaneous fit to control samples to account for detector performance 4 Wouter Verkerke, NIKHEF

RooFit – a data modeling language for ROOT Extension to ROOT – (Almost) no overlap with existing functionality C++ command line interface & macros Data management & histogramming Graphics interface I/O support MINUIT ToyMC data Generation Data/Model Fitting Data Modeling Model Visualization 5 Wouter Verkerke, NIKHEF

Project timeline 4 6 1999 : Project started First application: ‘sin2b’ measurement of BaBar (model with 5 observables, 37 floating parameters, simultaneous fit to multiple CP and control channels) 2000 : Complete overhaul of design based on experience with sin2b fit Very useful exercise: new design is still current design 2003 : Public release of RooFit with ROOT 2004 : Over 50 BaBar physics publications using RooFit 2007 : Integration of RooFit in ROOT CVS source 2008 : Upgrade in functionality as part of RooStats project Improved analytical and numeric integration handling, improved toy MC generation, addition of workspace 2009 : Now ~100K lines of code (For comparison RooStats proper is ~5000 lines of code) 2012 : Higgs discovery models formulated in RooFit using workspace concept lines of code 6 last modification before date

Data modeling - Desired functionality Building/Adjusting Models Easy to write basic PDFs ( normalization) Easy to compose complex models (modular design) Reuse of existing functions Flexibility – No arbitrary implementation-related restrictions A n a l y s i s w o r k c y c l e Using Models Fitting : Binned/Unbinned (extended) MLL fits, Chi2 fits Toy MC generation: Generate MC datasets from any model Visualization: Slice/project model & data in any possible way Speed – Should be as fast or faster than hand-coded model 7 Wouter Verkerke, NIKHEF

Data modeling – OO representation Idea: represent math symbols as C++ objects Result: 1 line of code per symbol in a function (the C++ constructor) rather than 1 line of code per function Mathematical concept RooFit class variable RooRealVar function RooAbsReal PDF RooAbsPdf space point RooArgSet integral RooRealIntegral list of space points RooAbsData 8 Wouter Verkerke, NIKHEF

Data modeling – Constructing composite objects Straightforward correlation between mathematical representation of formula and RooFit code Math  RooGaussian g RooFit diagram  RooRealVar x RooRealVar m RooFormulaVar sqrts    RooRealVar s RooFit code  RooRealVar x(“x”,”x”,-10,10) ;  RooRealVar m(“m”,”mean”,0) ;  RooRealVar s(“s”,”sigma”,2,0,10) ;  RooFormulaVar sqrts(“sqrts”,”sqrt(s)”,s) ;  RooGaussian g(“g”,”gauss”,x,m,sqrts) ; 9 Wouter Verkerke, NIKHEF

Model building – (Re)using standard components RooFit provides a collection of compiled standard PDF classes RooBMixDecay Physics inspired ARGUS,Crystal Ball, Breit-Wigner, Voigtian, B/D-Decay,…. RooPolynomial RooHistPdf Non-parametric Histogram, KEYS RooArgusBG RooGaussian Basic Gaussian, Exponential, Polynomial,… Chebychev polynomial Easy to extend the library: each p.d.f. is a separate C++ class 10 Wouter Verkerke, NIKHEF

Model building – (Re)using standard components Library p.d.f.s can be adjusted on the fly. Just plug in any function expression you like as input variable Works universally, even for classes you write yourself Maximum flexibility of library shapes keeps library small g(x;m,s) m(y;a0,a1) g(x,y;a0,a1,s) RooPolyVar m(“m”,y,RooArgList(a0,a1)) ; RooGaussian g(“g”,”gauss”,x,m,s) ; 11 Wouter Verkerke, NIKHEF

Special pdfs – Kernel estimation model 38 Special pdfs – Kernel estimation model Kernel estimation model Construct smooth pdf from unbinned data, using kernel estimation technique Example Also available for n-D data Adaptive Kernel: width of Gaussian depends on local event density Gaussian pdf for each event Summed pdf for all events Sample of events w.import(myData,Rename(“myData”)) ; w.factory(“KeysPdf::k(x,myData)”) ; 12

Special pdfs – Morphing interpolation 39 Special pdfs – Morphing interpolation Special operator pdfs can interpolate existing pdf shapes Ex: interpolation between Gaussian and Polynomial Two morphing algorithms available IntegralMorph (Alex Read algorithm). CPU intensive, but good with discontinuities MomentMorph (Max Baak). Fast, can handling multiple observables (and soon multiple interpolation parameters), but doesn’t work well for all pdfs w.factory(“Gaussian::g(x[-20,20],-10,2)”) ; w.factory(“Polynomial::p(x,{-0.03,-0.001})”) ; w.factory(“IntegralMorph::gp(g,p,x,alpha[0,1])”) ; a = 0.812 ± 0.008 Fit to data 13

Handling of p.d.f normalization Normalization of (component) p.d.f.s to unity is often a good part of the work of writing a p.d.f. RooFit handles most normalization issues transparently to the user P.d.f can advertise (multiple) analytical expressions for integrals When no analytical expression is provided, RooFit will automatically perform numeric integration to obtain normalization More complicated that it seems: even if gauss(x,m,s) can be integrated analytically over x, gauss(f(x),m,s) cannot. Such use cases are automatically recognized. Multi-dimensional integrals can be combination of numeric and p.d.f-provided analytical partial integrals Variety of numeric integration techniques is interfaced Adaptive trapezoid, Gauss-Kronrod, VEGAS MC… User can override parameters globally or per p.d.f. as necessary 14 Wouter Verkerke, NIKHEF

Model building – (Re)using standard components Most physics models can be composed from ‘basic’ shapes RooBMixDecay RooPolynomial RooHistPdf RooArgusBG RooGaussian + RooAddPdf 15 Wouter Verkerke, NIKHEF

Model building – (Re)using standard components Most physics models can be composed from ‘basic’ shapes RooBMixDecay RooProdPdf h(“h”,”h”, RooArgSet(f,g)) RooPolynomial RooHistPdf RooArgusBG RooProdPdf k(“k”,”k”,g, Conditional(f,x)) RooGaussian * RooProdPdf 16 Wouter Verkerke, NIKHEF

(FFT) Convolution – works for all models 30 (FFT) Convolution – works for all models Example FFT usually best Fast: unbinned ML fit to 10K events take ~5 seconds w.factory(“Landau::L(x[-10,30],5,1)”) : w.factory(“Gaussian::G(x,0,2)”) ; w::x.setBins(“cache”,10000) ; // FFT sampling density w.factory(“FCONV::LGf(x,L,G)”) ; // FFT convolution w.factory(“NCONV::LGb(x,L,G)”) ; // Numeric convolution (x) RooFFTConvPdf 17

Automated vs. hand-coded optimization of p.d.f. Automatic pre-fit PDF optimization Prior to each fit, the PDF is analyzed for possible optimizations Optimization algorithms: Detection and precalculation of constant terms in any PDF expression Function caching and lazy evaluation Factorization of multi-dimensional problems where ever possible Optimizations are always tailored to the specific use in each fit. Possible because OO structure of p.d.f. allows automated analysis of structure No need for users to hard-code optimizations Keeps your code understandable, maintainable and flexible without sacrificing performance Optimization concepts implemented by RooFit are applied consistently and completely to all PDFs Speedup of factor 3-10 reported in realistic complex fits Fit parallelization on multi-CPU hosts Option for automatic parallelization of fit function on multi-CPU hosts (no explicit or implicit support from user PDFs needed) 18 Wouter Verkerke, NIKHEF

Introduction Sharing models 2 19 Wouter Verkerke, NIKHEF

Sharing data and functions Sharing data is in HEP is relatively easy – ROOT TTrees, THx histograms almost universal standard Sharing functions (likelihood / probability density) much more difficult No standard protocol for defining p.d.f.s and likelihood functions Structurally functions are much more complicated than data Essentially all methods start with the basic probability density function or likelihood function L(x|qr,qs) Building a good model is the hard part! want to re-use it for multiple methods Language to common tools Common language for probability density functions and likelihood functions very desirable for easy exchange of information  RooFit 20 Wouter Verkerke, NIKHEF

RooFit core design philosophy - Workspace 6 RooFit core design philosophy - Workspace The workspace serves a container class for all objects created f(x,y,z) Math RooWorkspace RooAbsReal f RooFit diagram RooRealVar x RooRealVar y RooRealVar z RooFit code RooRealVar x(“x”,”x”,5) ; RooRealVar y(“y”,”y”,5) ; RooRealVar z(“z”,”z”,5) ; RooBogusFunction f(“f”,”f”,x,y,z) ; RooWorkspace w(“w”) ; w.import(f) ; 21

Using the workspace Workspace Creating a workspace A generic container class for all RooFit objects of your project Helps to organize analysis projects Creating a workspace Putting variables and function into a workspace When importing a function or pdf, all its components (variables) are automatically imported too RooWorkspace w(“w”) ; RooRealVar x(“x”,”x”,-10,10) ; RooRealVar mean(“mean”,”mean”,5) ; RooRealVar sigma(“sigma”,”sigma”,3) ; RooGaussian f(“f”,”f”,x,mean,sigma) ; // imports f,x,mean and sigma w.import(myFunction) ; 22

Using the workspace Looking into a workspace Getting variables and functions out of a workspace w.Print() ; variables --------- (mean,sigma,x) p.d.f.s ------- RooGaussian::f[ x=x mean=mean sigma=sigma ] = 0.249352 // Variety of accessors available RooPlot* frame = w.var(“x”)->frame() ; w.pdf(“f”)->plotOn(frame) ; 23

Persisting workspaces Workspaces can be trivially written to file And be read back into another ROOT session // Write workspace contents to file w.writeToFile(“myworkspace.root”) ; // Open ROOT file TFile* f = TFile::Open(“myworkspace.root”) ; // Retrieve workspace RooWorkspace* w = f->Get(“w”) ; 24 Wouter Verkerke, NIKHEF

Sharing models using RooFit workspaces Internal sharing of likelihood models between analysis groups has been common within Higgs effort Aided by RooFit workspace concept What you share is not only a description of model, but an actual implementation (a callable C++ function) Virtually zero overhead in getting things to work Works for any model // Setup [4 lines ] TFile* f = TFile::Open(“myfile.root”) ; RooWorkspace* w = f->Get(“mywspace”) ; RooAbsPdf* model = w->pdf(“mymodel”) ; RooAbsData* data = w->data(“obsData”) ; // Core business model->fitTo(*data) ; RooAbsReal* nll = model->createNLL(*data) ; 25

Scalability – an extreme example A Higgs model (23.000 functions, 1600 parameters) F(x,p) x p 26 Wouter Verkerke, NIKHEF

A workspace provides you with a model implementation All RooFit models provide universal and complete fitting and Toy Monte Carlo generating functionality Model complexity only limited by available memory and CPU power Fitting/plotting a 5-D model as easy as using a 1-D model Most operations are one-liners Fitting Generating data = gauss.generate(x,1000) RooAbsPdf gauss.fitTo(data) RooDataSet RooAbsData 27 Wouter Verkerke, NIKHEF

Probability density function  Likelihood 42 Probability density function  Likelihood Easy to create a likelihood from a model and a dataset Easy to manipulate with ROOT minimizers Can also insert likelihood function in a workspace // Create likelihood (calculation parallelized on 8 cores) RooAbsReal* nll = w::model.createNLL(data,NumCPU(8)) ; RooMinuit m(*nll) ; // Create MINUIT session m.migrad() ; // Call MIGRAD m.hesse() ; // Call HESSE m.minos(w::param) ; // Call MINOS for ‘param’ RooFitResult* r = m.save() ; // Save status (cov matrix etc) w.import(*nll) ; // for direct use by others 28

Working with profile likelihood 47 Working with profile likelihood Best L for given p A profile likelihood ratio can be represent by a regular RooFit function (albeit an expensive one to evaluate) Best L RooAbsReal* ll = model.createNLL(data,NumCPU(8)) ; RooAbsReal* pll = ll->createProfile(params) ; RooPlot* frame = w::frac.frame() ; nll->plotOn(frame,ShiftToZero()) ; pll->plotOn(frame,LineColor(kRed)) ; 29

(Not) sharing the (unbinned) data Potential discussion item when sharing workspaces is that you share not only the model, but also the (unbinned) data – which a collaboration for various reason may not want to make public No easy iron-clad solutions to this issue – likelihood must have access to the data One simple solution currently provided are ‘sealed’ likelihood functions in workspace  These refuse access to internal data. Not iron-clad since a good programmer with a debugger can still extract this But sealed likelihoods also offer opportunity to include ‘copyright’ message – printed whenever workspace with sealed likelihoods is loaded into memory nll->seal(“your copyright message goes here”) ; w->import(*nll) ; 30 Wouter Verkerke, NIKHEF

Interfacing RooFit functions to other code Binding exist to represent RooFit likelihood and probability density functions as ‘simple’ C++ functions // RooFit pdf object RooAbsPdf* pdf ; // Binding object RooFunctor lfunc(*pdf,observables,parameters) ; // Evaluate pdf through binding // takes variables as array of doubles double obs[n] ; double par[m] ; double val = lfunc.eval(obs,par) ; 31 Wouter Verkerke, NIKHEF

Summary RooFit is a object-oriented data modeling language for HEP, part of ROOT distribution Key concept is representing mathematical entities as C++ objects Extensively used since nearly 13 years, highly scalable with good performance Ability to persist these models into ‘workspaces’ in ROOT files allows to trivially share implementations of models You read and use parametric likelihoods from other scientists with almost zero effort Very effectively used in Higgs discovery effort Long-term retention ability of workspaces explicit goal ROOT schema evolution framework provides tools to guarantee backward compatibility for reading existing workspaces 32 Wouter Verkerke, NIKHEF