UCSB Data Analysis – Wouter Verkerke (NIKHEF)



HEP and data analysis — General introduction

Particle physics: studying nature at the smallest distance scales (atom → nucleus → …). Looking at the smallest constituents of matter → building a consistent theory that describes matter and the elementary forces.

Newton, Maxwell, Einstein (Theory of Relativity), Bohr (Quantum Mechanics). Gravity and electromagnetism: “same forces … new models”.

High Energy Physics. Working model: the ‘Standard Model’ (a Quantum Field Theory) –Describes the constituents of matter and 3 out of the 4 fundamental forces

The Standard Model has many open issues

What is dark matter? Link with astrophysics: temperature fluctuations in the Cosmic Microwave Background, rotation curves, gravitational lensing.

Particle physics today – Large machines

Detail of the Large Hadron Collider

And large experiments underground

One of the 4 LHC experiments – ATLAS

View of ATLAS during construction

Collecting data at the LHC

Data reduction, processing and storage are big issues

Analyzing the data – The goal: go from what we see in the detector back to the fundamental physics picture of the proton collision. Extremely difficult (and not possible on an event-by-event basis anyway, due to QM).

Analyzing the data – In practice: physics simulation → detector simulation → event reconstruction → data analysis of events from the LHC ATLAS detector.

‘Easy stuff’

‘Difficult stuff’

Quantify what we observe and what we expect to see. Methods and details are important – for certain physics we expect only a handful of events after years of data taking.

Tools for data analysis in HEP
Nearly all HEP data analysis happens on a single platform
–ROOT (1995–now)
–And before that, PAW
A large project with many developers, contributors and workshops

Choice of working environment: R vs. ROOT
ROOT has become the de facto HEP standard analysis environment
–Available and actively used for analyses in running experiments (Tevatron, B factories etc.)
–ROOT is integrated in the LHC experiments’ software releases
–The data format of the LHC experiments is (indirectly) based on ROOT → several experiments have, or are working on, a summary data format directly usable in ROOT
–Able to handle very large amounts of data
ROOT brings together many of the ingredients needed for (statistical) data analysis
–C++ command line, publication-quality graphics
–Many standard mathematics and physics classes: vectors, matrices, Lorentz vectors, physics constants…
The line between ‘ROOT’ and ‘external’ software is not very sharp
–A lot of software developed elsewhere is distributed with ROOT (TMVA, RooFit)
–Or a thin interface layer is provided to work with an external library (GSL, FFTW)
–Still not quite as nice & automated as the ‘R’ package concept

(Statistical) software repositories
ROOT functions as a moderated repository for statistical & data analysis tools
–Examples: TMVA, RooFit
Several HEP repository initiatives, some containing statistical software
–PhyStat.org (StatPatternRecognition, TMVA, LepStats4LHC)
–HepForge (mostly physics MC generators)
–FreeHep
An excellent summary of non-HEP statistical repositories can be found on Jim Linnemann’s statistical resources web page (from PhyStat 2005)

Roadmap for this course
In this course I aim to follow the ‘HEP workflow’ to organize the discussion of statistics issues – my experience is that this is most intuitive for HEP PhD students
Introduction & basics
–A short overview of HEP
–Distributions, the Central Limit Theorem
Event classification
–Hypothesis testing
–Machine learning
Parameter estimation
–Estimators: maximum likelihood and chi-squared
–Mathematical tools for model building
–Practical issues arising with minimization
Confidence intervals, limits, significance
–Hypothesis testing (again), Bayes’ theorem
–Frequentist statistics, likelihood-based intervals
–The likelihood principle and conditioning
Discovery, exclusion and nuisance parameters
–Formulating discovery and exclusion
–Systematic uncertainties as nuisance parameters
–Treatment of nuisance parameters in statistical inference

Basic Statistics — Mean, Variance, Standard Deviation — Gaussian Standard Deviation — Covariance, correlations — Basic distributions – Binomial, Poisson, Gaussian — Central Limit Theorem — Error propagation

Describing your data – the Average
Given a set of unbinned data (measurements) $\{x_1, x_2, \ldots, x_N\}$, the mean value of $x$ is
$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
For binned data, with bin count $n_i$ and bin center $x_i$:
$\bar{x} = \frac{\sum_i n_i x_i}{\sum_i n_i}$
–The unbinned average is more accurate, since binning rounds each measurement to its bin center
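The rounding effect of binning can be seen in a few lines of plain Python (a sketch, not from the course material; function names and the toy data are my own):

```python
def unbinned_mean(xs):
    """Mean of the raw measurements: xbar = (1/N) * sum(x_i)."""
    return sum(xs) / len(xs)

def binned_mean(xs, bin_width=1.0):
    """Mean recomputed from bin counts n_i and bin centres x_i:
    xbar = sum(n_i * x_i) / sum(n_i)."""
    counts = {}
    for x in xs:
        # Round each measurement to the centre of its bin
        centre = (int(x // bin_width) + 0.5) * bin_width
        counts[centre] = counts.get(centre, 0) + 1
    return sum(n * c for c, n in counts.items()) / sum(counts.values())

data = [0.2, 0.7, 1.1, 1.8, 2.4, 2.6]
print(unbinned_mean(data))   # exact: 8.8/6 ≈ 1.4667
print(binned_mean(data))     # shifted to 1.5 by rounding to bin centres
```

The binned result differs from the exact one only through the rounding to bin centres, which is why the unbinned average is preferred when the raw measurements are available.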

Describing your data – Spread
The variance $V(x)$ expresses how much $x$ is liable to vary from its mean value $\bar{x}$:
$V(x) = \frac{1}{N}\sum_i (x_i - \bar{x})^2$
The standard deviation is $\sigma = \sqrt{V(x)}$

Different definitions of the Standard Deviation
Presumably our data were drawn from a parent distribution which has mean $\mu$ and S.D. $\sigma$. Beware notational confusion:
–$\bar{x}$ – mean of our sample; $\hat{\sigma}$ – S.D. of our sample (the data sample)
–$\mu$ – mean of the parent distribution; $\sigma$ – S.D. of the parent distribution (from which the data sample was drawn)

Different definitions of the Standard Deviation
Which definition of $\sigma$ you use, $\sigma_{\text{data}}$ or $\sigma_{\text{parent}}$, is a matter of preference, but be clear which one you mean! In addition, you can get an unbiased estimate of $\sigma_{\text{parent}}$ from a given data sample using
$\hat{\sigma}_{\text{parent}}^2 = \frac{1}{N-1}\sum_i (x_i - \bar{x})^2$
(note the $N-1$ rather than $N$ in the denominator)
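The two definitions can be sketched side by side in Python (an illustration with made-up data, not from the slides):

```python
import math

def sd_data(xs):
    """S.D. of the sample itself: divide by N."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def sd_parent_estimate(xs):
    """Unbiased estimate of the parent S.D.: divide by N - 1."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

data = [1.0, 2.0, 3.0, 4.0]
# mean = 2.5, sum of squared deviations = 2.25 + 0.25 + 0.25 + 2.25 = 5
print(sd_data(data))             # sqrt(5/4) ≈ 1.118
print(sd_parent_estimate(data))  # sqrt(5/3) ≈ 1.291
```

For small samples the difference is sizeable; it vanishes as $N$ grows.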

More than one variable
Given two variables $x, y$ and a dataset consisting of pairs of numbers $\{(x_1,y_1), (x_2,y_2), \ldots, (x_N,y_N)\}$, define $\bar{x}, \bar{y}, \sigma_x, \sigma_y$ as usual. In addition, any dependence between $x$ and $y$ is described by the covariance
$\operatorname{cov}(x,y) = \frac{1}{N}\sum_i (x_i - \bar{x})(y_i - \bar{y})$
(which has dimension $D(x)D(y)$). The dimensionless correlation coefficient is defined as
$\rho = \frac{\operatorname{cov}(x,y)}{\sigma_x \sigma_y}$
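These definitions translate directly into code (a minimal sketch; the helper names are mine, not from the course):

```python
import math

def covariance(xs, ys):
    """cov(x, y) = <(x - xbar)(y - ybar)> over the sample."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Dimensionless rho = cov(x, y) / (sigma_x * sigma_y), in [-1, +1]."""
    sx = math.sqrt(covariance(xs, xs))   # cov(x, x) = V(x)
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

xs = [1.0, 2.0, 3.0, 4.0]
print(correlation(xs, [2 * x + 1 for x in xs]))   # exactly linear: rho = +1
print(correlation(xs, [-x for x in xs]))          # rho = -1
```

Note that $\operatorname{cov}(x,x)=V(x)$, so the variance is just the diagonal case of the covariance.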

Visualization of correlation for $\rho$ = 0, 0.1, 0.5, −0.7, −0.9 and 0.99

Correlation & covariance in >2 variables
The concepts of covariance and correlation are easily extended to an arbitrary number of variables, so that
$V_{ij} = \operatorname{cov}(x_i, x_j)$
takes the form of an $n \times n$ symmetric matrix. This is called the covariance matrix, or error matrix. Similarly the correlation matrix becomes
$\rho_{ij} = \frac{V_{ij}}{\sigma_i \sigma_j}$

Linear vs non-linear correlations
The correlation coefficients used here are (linear) Pearson product-moment correlation coefficients
–Data can have more subtle (non-linear) correlations that are not expressed in these coefficients

Basic Distributions – The binomial distribution
Simple experiment – drawing marbles from a bowl
–Bowl with marbles, a fraction $p$ of which are black, the others white
–Draw $N$ marbles from the bowl, putting each marble back after drawing it
–The distribution of the number $R$ of black marbles in the drawn sample is the binomial distribution:
$P(R; p, N) = \binom{N}{R} p^R (1-p)^{N-R}$
–$p^R(1-p)^{N-R}$ is the probability of one specific outcome, e.g. ‘BBBWBWW’; $\binom{N}{R}$ counts the equivalent permutations of that outcome
(illustrated for $p = 0.5$, $N = 4$)

Properties of the binomial distribution
Mean: $\langle R \rangle = Np$
Variance: $V(R) = Np(1-p)$
(illustrated for $p$ = 0.1, 0.5, 0.9 with $N$ = 4 and $N$ = 1000)
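A quick numerical check of the pmf and its moments (a Python sketch, not the ROOT code used in the course):

```python
from math import comb

def binomial_pmf(r, p, n):
    """P(R = r; p, N) = C(N, r) * p^r * (1-p)^(N-r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

p, n = 0.5, 4
probs = [binomial_pmf(r, p, n) for r in range(n + 1)]
mean = sum(r * P for r, P in enumerate(probs))
var = sum((r - mean) ** 2 * P for r, P in enumerate(probs))
print(probs)        # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(mean, var)    # Np = 2.0 and Np(1-p) = 1.0
```

The probabilities sum to 1, and the directly computed mean and variance reproduce $Np$ and $Np(1-p)$.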

HEP example – Efficiency measurement
Example: trigger efficiency
–Usually measured on simulated data, so that untriggered events are also available

Basic Distributions – The Poisson distribution
Sometimes we don’t know the equivalent of the number of drawings
–Example: a Geiger counter – sharp events occurring in a (time) continuum
What distribution do we expect in a measurement over a fixed amount of time?
–Divide the time interval into $n$ finite chunks, take the binomial formula with $p = \lambda/n$ and let $n \to \infty$:
$P(r; \lambda) = \frac{e^{-\lambda}\lambda^r}{r!}$
– the Poisson distribution

Properties of the Poisson distribution (illustrated for $\lambda$ = 0.1, 0.5, 1, 2, 5, 10, 20, 50 and 200)

More properties of the Poisson distribution
Mean and variance: $\langle r \rangle = \lambda$, $V(r) = \lambda$
The convolution of two Poisson distributions is also a Poisson distribution, with $\lambda_{ab} = \lambda_a + \lambda_b$
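Both properties can be verified numerically (a sketch in plain Python; the truncation of the sums at r = 60 is my own choice, harmless for these small λ):

```python
from math import exp, factorial

def poisson_pmf(r, lam):
    """P(r; lambda) = exp(-lambda) * lambda^r / r!."""
    return exp(-lam) * lam**r / factorial(r)

lam = 3.0
rs = range(60)  # the tail beyond r = 60 is negligible for lambda = 3
mean = sum(r * poisson_pmf(r, lam) for r in rs)
var = sum((r - mean) ** 2 * poisson_pmf(r, lam) for r in rs)
print(mean, var)  # both equal lambda = 3

# Convolution: P(r) = sum_k P_a(k) * P_b(r - k) is again Poisson,
# with mean lambda_a + lambda_b
la, lb = 1.0, 2.0
r = 4
conv = sum(poisson_pmf(k, la) * poisson_pmf(r - k, lb) for k in range(r + 1))
print(conv, poisson_pmf(r, la + lb))  # the two agree
```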

HEP example – counting experiment
The bin contents of any distribution plotted from data fluctuate according to Poisson statistics (particularly relevant in the case of low statistics)

Basic Distributions – The Gaussian distribution
Look at the Poisson distribution in the limit of large $\lambda$: take the log, substitute $r = \lambda + x$, apply Stirling’s approximation, and take the exp again. The result is the familiar Gaussian distribution
$P(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}$
with $\mu = \lambda$ and $\sigma^2 = \lambda$ (the approximation is reasonable for $\lambda > 10$; illustrated for $\lambda$ = 1, 10, 200)

Properties of the Gaussian distribution
Mean and variance: $\langle x \rangle = \mu$, $V(x) = \sigma^2$
Integrals of the Gaussian:
–68.27% within $\pm1\sigma$; 90% within $\pm1.645\sigma$
–95.45% within $\pm2\sigma$; 95% within $\pm1.96\sigma$
–99.73% within $\pm3\sigma$; 99% within $\pm2.58\sigma$; 99.9% within $\pm3.29\sigma$

HEP example – high-statistics counting experiment: high-statistics distributions from data

Errors
Doing an experiment → making measurements. Measurements are not perfect → the imperfection is quantified in a resolution or error
Common language to quote errors
–Gaussian standard deviation = $\sqrt{V(x)}$
–68% probability that the true value is within the quoted errors [NB: the 68% interpretation relies strictly on a Gaussian sampling distribution, which is not always the case; more on this later]
Errors are usually Gaussian if they quantify a result that is based on many independent measurements

The Gaussian as ‘Normal distribution’
Why are errors usually Gaussian? The Central Limit Theorem says: if you take the sum $X$ of $N$ independent measurements $x_i$, each taken from a distribution with mean $\mu_i$ and variance $V_i = \sigma_i^2$, the distribution of $X$
(a) has expectation value $\langle X \rangle = \sum_i \mu_i$
(b) has variance $V(X) = \sum_i \sigma_i^2$
(c) becomes Gaussian as $N \to \infty$
–Small print: the tails converge very slowly in the CLT; be careful in assuming a Gaussian shape beyond $\sim 2\sigma$

Demonstration of the Central Limit Theorem
–N=1: 5000 numbers taken at random from a uniform distribution on [0,1] (mean = 1/2, variance = 1/12)
–N=2: 5000 numbers, each the sum of 2 random numbers, i.e. $X = x_1 + x_2$ – triangular shape
–N=3: the same for 3 numbers, $X = x_1 + x_2 + x_3$
–N=12: the same for 12 numbers; the overlaid curve is the exact Gaussian distribution
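The same demonstration can be run numerically (a sketch; the seed and sample size are arbitrary choices of mine):

```python
import random

random.seed(42)

def sample_mean_var(n_terms, n_samples=5000):
    """Draw n_samples values of X = x_1 + ... + x_n with x_i ~ Uniform[0,1],
    and return the empirical mean and variance of X."""
    xs = [sum(random.random() for _ in range(n_terms))
          for _ in range(n_samples)]
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

for n in (1, 2, 12):
    m, v = sample_mean_var(n)
    # CLT expectation: mean = n/2, variance = n/12
    print(n, m, v)
```

The empirical means and variances track $n/2$ and $n/12$, and a histogram of the $n=12$ sample would already look convincingly Gaussian.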

Central Limit Theorem – repeated measurements
Common case 1: repeated identical measurements, i.e. $\mu_i = \mu$, $\sigma_i = \sigma$ for all $i$. The C.L.T. then gives for the average $\bar{x} = X/N$:
$\langle \bar{x} \rangle = \mu, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$
– the famous $\sqrt{N}$ law

Central Limit Theorem – repeated measurements
Common case 2: repeated measurements with identical means but different errors (i.e. weighted measurements, $\mu_i = \mu$). The weighted average is
$\bar{x} = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}$
with the ‘sum-of-weights’ formula for the error on the weighted measurement:
$\sigma_{\bar{x}} = \left(\sum_i 1/\sigma_i^2\right)^{-1/2}$
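Both formulas fit in one small helper (a Python sketch; the example numbers are invented):

```python
import math

def weighted_average(values, sigmas):
    """Combine measurements of the same quantity with different errors:
    xbar = sum(x_i / sigma_i^2) / sum(1 / sigma_i^2),
    error = 1 / sqrt(sum(1 / sigma_i^2))  (the 'sum of weights')."""
    weights = [1.0 / s**2 for s in sigmas]
    mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    err = 1.0 / math.sqrt(sum(weights))
    return mean, err

# Equal errors: reduces to the plain average and the sqrt(N) law
m, e = weighted_average([1.0, 2.0, 3.0, 4.0], [0.2] * 4)
print(m, e)  # 2.5 and 0.2 / sqrt(4) = 0.1

# A precise measurement dominates an imprecise one
print(weighted_average([10.0, 12.0], [0.1, 1.0]))
```

With equal errors the weighted average reduces to common case 1, which is a useful consistency check on the formula.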

Error propagation – one variable
Suppose we have $f(x) = ax + b$. How do you calculate $V(f)$ from $V(x)$?
$V(f) = a^2 V(x)$, i.e. $\sigma_f = |a|\,\sigma_x$
More generally,
$\sigma_f^2 = \left(\frac{df}{dx}\right)^2 \sigma_x^2$
–But this is only valid if the linear approximation is good over the range of the error

Error propagation – summing 2 variables
Consider $f = x + y$:
$V(f) = V(x) + V(y) + 2\operatorname{cov}(x,y)$
More generally, for $f = ax + by$:
$\sigma_f^2 = a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\operatorname{cov}(x,y)$
The correlation coefficient $\rho \in [-1,+1]$ is 0 if $x, y$ are uncorrelated. The familiar ‘add errors in quadrature’ rule is only valid in the absence of correlations, i.e. $\operatorname{cov}(x,y)=0$
–And only valid if the linear approximation is good over the range of the error
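The quadrature rule for uncorrelated variables is easy to confirm with a toy Monte Carlo (a sketch; sample size and widths are arbitrary):

```python
import math
import random

random.seed(1)

# Monte Carlo check that V(x + y) = V(x) + V(y) for independent x and y,
# i.e. that errors add in quadrature when cov(x, y) = 0.
N = 200000
xs = [random.gauss(0.0, 0.3) for _ in range(N)]
ys = [random.gauss(0.0, 0.4) for _ in range(N)]
fs = [x + y for x, y in zip(xs, ys)]

def variance(zs):
    m = sum(zs) / len(zs)
    return sum((z - m) ** 2 for z in zs) / len(zs)

sigma_f = math.sqrt(variance(fs))
print(sigma_f)  # ~ sqrt(0.3**2 + 0.4**2) = 0.5
```

Introducing a correlation between `xs` and `ys` (e.g. reusing part of one sample in the other) would shift the result away from 0.5, in line with the $2\operatorname{cov}(x,y)$ term.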

Error propagation – multiplying, dividing 2 variables
Now consider $f = xy$ (for uncorrelated $x, y$):
$\left(\frac{\sigma_f}{f}\right)^2 = \left(\frac{\sigma_x}{x}\right)^2 + \left(\frac{\sigma_y}{y}\right)^2$
–The result is similar for $f = x/y$
Other useful formulas (math omitted):
–The relative error on $x$ and on $1/x$ is the same
–The error on $\ln x$ is just the fractional error: $\sigma(\ln x) = \sigma_x/x$
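The relative-error version of the quadrature rule can be checked the same way (a toy Monte Carlo sketch with invented numbers):

```python
import math
import random

random.seed(2)

# Monte Carlo check that for f = x * y with independent x and y the
# *relative* errors add in quadrature:
# (sigma_f / f)^2 ~ (sigma_x / x)^2 + (sigma_y / y)^2
N = 200000
xs = [random.gauss(10.0, 0.1) for _ in range(N)]   # 1% relative error
ys = [random.gauss(5.0, 0.1) for _ in range(N)]    # 2% relative error
fs = [x * y for x, y in zip(xs, ys)]

mean_f = sum(fs) / N
var_f = sum((f - mean_f) ** 2 for f in fs) / N
rel_f = math.sqrt(var_f) / mean_f
print(rel_f)  # ~ sqrt(0.01**2 + 0.02**2) ≈ 0.0224
```

This works because the relative errors are small; for large relative errors the linear approximation behind the propagation formula breaks down.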

Coffee break