Application of Data Mining Algorithms in Atmospheric Neutrino Analyses with IceCube Tim Ruhe, TU Dortmund.


1 Application of Data Mining Algorithms in Atmospheric Neutrino Analyses with IceCube Tim Ruhe, TU Dortmund

2 Outline • Data mining is more... • Why is IceCube interesting (from a machine learning point of view)? • Data preprocessing and dimensionality reduction • Training and validation of a learning algorithm • Results • Other detector configurations? • Summary & Outlook

3 Data Mining is more... (Diagram: annotated examples, i.e. historical data and simulations, feed a learning algorithm, which produces a model; applying the model to new, non-annotated data yields information, knowledge, Nobel prize(s).)

4 Data Mining is more... (Same diagram, now with a preprocessing step in front of the learning algorithm: garbage in, garbage out.)

5 Data Mining is more... (Same diagram, additionally with a validation step for the trained model.)

6 Why is IceCube interesting from a machine learning point of view? • Huge amount of data • Highly imbalanced distribution of event classes (signal and background) • Huge amount of data to be processed by the learner (Big Data) • Real-life problem

7 Preprocessing (1): Reducing the Data Volume Through Cuts Background rejection: 91.4% Signal efficiency: 57.1% BUT: the remaining background is significantly harder to reject!
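To make these two figures concrete, here is a minimal sketch of how background rejection and signal efficiency are computed from event counts before and after the cuts; the counts below are placeholders chosen to reproduce the quoted percentages, not the actual IceCube numbers.

```python
# Minimal sketch: background rejection and signal efficiency from event counts.
# The counts are placeholders, not the actual IceCube event numbers.
n_bkg_before, n_bkg_after = 1_000_000, 86_000   # hypothetical background events
n_sig_before, n_sig_after = 100_000, 57_100     # hypothetical signal events

background_rejection = 1.0 - n_bkg_after / n_bkg_before  # fraction of background removed
signal_efficiency = n_sig_after / n_sig_before           # fraction of signal kept

print(f"Background rejection: {background_rejection:.1%}")  # -> 91.4%
print(f"Signal efficiency:    {signal_efficiency:.1%}")     # -> 57.1%
```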

8 Preprocessing (2): Variable Selection Check for missing values. Check for potential bias. Check for correlations. Exclude variables whose fraction of missing values exceeds 30%. Exclude everything that is useless, redundant, or a source of potential bias. Exclude everything that has a correlation of 1.0 with another variable. These checks reduce the initial 2600 variables to 477, which then enter the automated feature selection.
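The manual preselection described above can be sketched in a few lines of pandas; this is only an illustration, assuming the candidate variables live in a numeric DataFrame (the 30% missing-value threshold is taken from the slide, the function name and the exact correlation check are assumptions).

```python
import pandas as pd

def preselect_variables(df: pd.DataFrame, max_missing: float = 0.30) -> pd.DataFrame:
    """Sketch of the manual preselection: drop variables with too many
    missing values or a perfect correlation with another variable."""
    # 1) Exclude variables whose fraction of missing values exceeds the threshold.
    keep = [col for col in df.columns if df[col].isna().mean() <= max_missing]
    reduced = df[keep]

    # 2) Exclude one variable of each perfectly correlated pair (|r| = 1).
    corr = reduced.corr(numeric_only=True).abs()
    drop = set()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if b not in drop and corr.loc[a, b] > 1.0 - 1e-9:
                drop.add(b)
    return reduced.drop(columns=sorted(drop))
```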

9 Relevance vs. Redundancy: MRMR (continuous case) The slide defines the relevance and the redundancy of a feature set and combines them into the MRMR criterion, either as a difference or as a quotient.
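The formulas themselves did not survive in this transcript. For reference, the standard MRMR formulation for continuous variables (Ding & Peng), which the slide presumably follows, measures relevance with the F-statistic between each feature and the class, and redundancy with the absolute correlation between features:

```latex
% Relevance of a feature set S with respect to the class c:
V_F = \frac{1}{|S|} \sum_{x_i \in S} F(x_i, c)

% Redundancy within S (mean absolute correlation between features):
W_c = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} \left| \operatorname{corr}(x_i, x_j) \right|

% MRMR criterion, as a difference or as a quotient:
\max_S \left( V_F - W_c \right)
\qquad \text{or} \qquad
\max_S \left( V_F / W_c \right)
```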

10 Feature Selection Stability The slide defines the Jaccard index of two selected feature sets and its average over many sets of variables.
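The definitions are again missing from the transcript; the standard form, which the slide presumably uses, is the Jaccard index of two selected feature sets, averaged over all pairs of selections (e.g. obtained from repeated subsampling of the training data):

```latex
% Jaccard similarity of two selected feature sets F_i and F_j:
J(F_i, F_j) = \frac{|F_i \cap F_j|}{|F_i \cup F_j|}

% Stability: average Jaccard index over all pairs of n selections:
\hat{J} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} J(F_i, F_j)
```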

11 Comparing Forward Selection and MRMR

12 Training and Validation of a Random Forest • Use an ensemble of simple decision trees • Obtain the final classification as an average over all trees

13 Training and Validation of a Random Forest • Use an ensemble of simple decision trees • Obtain the final classification as an average over all trees • 5-fold cross validation is used to validate the performance of the forest.
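As a minimal sketch of this scheme in scikit-learn (the stand-in data and most settings are assumptions; only the 500 trees and the 5 folds are taken from these slides):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data; in the analysis these would be the preselected event features
# (signal = neutrinos, background = atmospheric muons).
X, y = make_classification(n_samples=10_000, n_features=25, weights=[0.9, 0.1],
                           random_state=0)

# An ensemble of simple decision trees; the final score is an average over all trees.
forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)

# 5-fold cross validation to validate the performance of the forest.
scores = cross_val_score(forest, X, y, cv=5, scoring="roc_auc")
print("AUC per fold:", scores.round(3), "mean AUC:", scores.mean().round(3))
```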

14 Random Forest and Cross Validation in Detail (1) Background muons: 750,000 simulated events in total (CORSIKA, Polygonato), 600,000 of them available for training. Neutrinos: 70,000 simulated events in total (NuGen, E⁻² spectrum), 56,000 of them available for training. From each class, 27,000 events are sampled for training.

15 Random Forest and Cross Validation in Detail (2) 150,000 background events and 14,000 neutrino events remain available for testing. In each iteration a forest of 500 trees is trained on the 27,000 sampled events per class and then applied to the test events; the procedure is repeated five times.
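The per-iteration sampling can be sketched as follows; the event counts are taken from the two slides above, everything else (array names, random seed) is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class = 27_000  # events sampled from each class per iteration

for iteration in range(5):
    # Draw an equal number of background and signal events for training.
    bkg_train = rng.choice(600_000, size=n_per_class, replace=False)  # out of 600,000 muons
    sig_train = rng.choice(56_000, size=n_per_class, replace=False)   # out of 56,000 neutrinos
    # A forest of 500 trees would be trained on these events and then applied
    # to the independent test sets (150,000 muons, 14,000 neutrinos).
```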

16 Random Forest Output

17 Random Forest Output We need an additional cut on the output of the Random Forest!

18 Random Forest Output: Cut at 500 trees • 28830 ± 480 expected neutrino candidates • Background rejection: 99.9999% • Signal efficiency: 18.2% • Estimated purity: (99.59 ± 0.37)% Applying the cut to experimental data yields 27,771 neutrino candidates.
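For illustration, the quoted efficiency, rejection and purity follow from counting simulated signal and background events above a score threshold; the sketch below uses toy scores and a generic threshold, whereas the slide's cut corresponds to requiring (essentially) all 500 trees to vote for signal.

```python
import numpy as np

def cut_performance(scores_sig, scores_bkg, threshold):
    """Signal efficiency, background rejection and purity for a cut on the forest score."""
    sig_pass = np.count_nonzero(scores_sig >= threshold)
    bkg_pass = np.count_nonzero(scores_bkg >= threshold)
    efficiency = sig_pass / scores_sig.size
    rejection = 1.0 - bkg_pass / scores_bkg.size
    purity = sig_pass / (sig_pass + bkg_pass)
    return efficiency, rejection, purity

# Toy scores for illustration only; real scores come from the simulated test events.
rng = np.random.default_rng(1)
eff, rej, pur = cut_performance(rng.beta(8, 2, 10_000), rng.beta(2, 8, 1_000_000), 0.98)
print(f"efficiency {eff:.1%}, rejection {rej:.4%}, purity {pur:.2%}")
```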

19 Unfolding the spectrum with TRUEE This is not data mining... but it isn't magic either.

20 Moving on... IC79 • 212 neutrino candidates per day • 66885 neutrino candidates in total • 330 ± 200 background muons • The entire analysis chain can be applied to other detector configurations... with minor changes (e.g. the ice model)

21 Summary and Outlook • 99.9999% background rejection (MRMR, Random Forest) • Purities above 99% are routinely achieved • Future improvements? By starting at an earlier analysis level...

22 Backup Slides

23 RapidMiner in a Nutshell • Developed at the Department of Computer Science at TU Dortmund (formerly known as YALE) • Operator-based, written in Java • It used to be open source • Many, many plugins due to a rather active community • One of the most widely used data mining tools

24 What I like about it • The data flow is nicely visualized and can easily be followed and comprehended • Rather easy to learn, even without programming experience • Large community (updates, bugfixes, plugins) • Professional tool (they actually make money with it!) • Good support • Many tutorials can be found online, even specialized ones • Most operators work like a charm • Extendable

25 Relevance vs. Redundancy: MRMR (discrete case) The slide gives the same relevance/redundancy construction as before, now based on the mutual information between variables.
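As on slide 9, the formulas are missing from the transcript; the standard discrete MRMR formulation replaces the F-statistic and the correlation by the mutual information between variables:

```latex
% Mutual information between two discrete variables x and y:
I(x, y) = \sum_{i,j} p(x_i, y_j) \, \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}

% Relevance of a feature set S with respect to the class c:
V_I = \frac{1}{|S|} \sum_{x_i \in S} I(x_i, c)

% Redundancy within S:
W_I = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i, x_j)

% MRMR criterion, as a difference or as a quotient:
\max_S \left( V_I - W_I \right) \qquad \text{or} \qquad \max_S \left( V_I / W_I \right)
```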

26 Feature Selection Stability The slide shows the Jaccard index together with Kuncheva's consistency index.
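Kuncheva's consistency index, which the slide presumably shows alongside the Jaccard index, corrects the overlap of two equally sized feature subsets for the overlap expected by chance:

```latex
% Two selected subsets F_i, F_j of equal size k, drawn from n features in total,
% with overlap r = |F_i \cap F_j|; k^2/n is the overlap expected by chance:
I_C(F_i, F_j) = \frac{r\,n - k^2}{k\,(n - k)}
```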

27 Ensemble methods: with weights (e.g. Boosting) or without weights (e.g. Random Forest)

28 Random Forest: What is randomized? Randomness 1: the events a tree is trained on (bagging). Randomness 2: the variables that are available for a split.
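For illustration, these two sources of randomness correspond directly to two settings of scikit-learn's RandomForestClassifier (the concrete values below are assumptions, not the ones used in the analysis):

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,
    bootstrap=True,       # randomness 1: each tree sees a bootstrap sample of the events (bagging)
    max_features="sqrt",  # randomness 2: only a random subset of variables is considered per split
    random_state=0,
)
```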

29 Are we actually better than simpler methods?

