Application of Data Mining Algorithms in Atmospheric Neutrino Analyses with IceCube
Tim Ruhe, TU Dortmund

2 Outline
- Data mining is more...
- Why is IceCube interesting (from a machine learning point of view)?
- Data preprocessing and dimensionality reduction
- Training and validation of a learning algorithm
- Results
- Other detector configurations?
- Summary & Outlook

3 Data Mining is more...
[Diagram: annotated examples (historical data, simulations) feed a learning algorithm, which produces a model; the model is applied to new, un-annotated data to extract information, knowledge, and, ideally, Nobel prize(s).]

4 Data Mining is more...
[Same diagram, now with a preprocessing step in front of the learning algorithm: garbage in, garbage out.]

5 Data Mining is more...
[Same diagram again, now also with a validation step attached to the model.]

6 Why is IceCube interesting from a machine learning point of view?
- Huge amount of data
- Highly imbalanced distribution of event classes (signal and background)
- Huge amount of data to be processed by the learner (Big Data)
- A real-life problem

7 Preprocessing (1): Reducing the Data Volume Through Cuts
- Background rejection: 91.4%
- Signal efficiency: 57.1%
- BUT: the remaining background is significantly harder to reject!
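As a minimal illustration of how such cut performance figures are computed (a sketch assuming boolean per-event cut decisions and 0/1 truth labels; the toy pass rates below merely mimic the quoted numbers and are not IceCube data):

```python
import numpy as np

# Sketch: background rejection and signal efficiency for a set of cuts.
# `passes_cuts` is the per-event cut decision, `is_signal` the true label
# (1 = neutrino, 0 = atmospheric muon). All names are illustrative.
def cut_performance(passes_cuts, is_signal):
    signal = is_signal == 1
    efficiency = passes_cuts[signal].mean()        # fraction of signal kept
    rejection = 1.0 - passes_cuts[~signal].mean()  # fraction of background removed
    return rejection, efficiency

rng = np.random.default_rng(0)
is_signal = rng.integers(0, 2, 100_000)
# Toy pass probabilities chosen to mimic the quoted 91.4% / 57.1%:
passes_cuts = rng.random(100_000) < np.where(is_signal == 1, 0.571, 0.086)
print(cut_performance(passes_cuts, is_signal))
```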

8 Preprocessing (2): Variable Selection
- Check for missing values: exclude a variable if more than 30% of its values are missing.
- Check for potential bias: exclude everything that is useless, redundant, or a source of potential bias.
- Check for correlations: exclude one of every pair of variables with a correlation of 1.0.
- These checks reduce 2,600 variables to 477, which are passed on to automated feature selection.
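The two mechanical checks translate into a few lines of pandas; the bias check remains a manual, physics-driven step. A sketch, assuming the variables live in a numeric DataFrame (all names hypothetical):

```python
import pandas as pd

# Sketch of the mechanical part of the preselection: drop variables with
# too many missing values, then drop one of each perfectly correlated pair.
def preselect(df: pd.DataFrame, max_missing: float = 0.30) -> pd.DataFrame:
    df = df.loc[:, df.isna().mean() <= max_missing]  # > 30% missing -> out
    corr = df.corr().abs()
    drop = set()
    cols = list(corr.columns)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= 1.0:  # perfectly correlated pair
                drop.add(cols[j])       # keep the first, drop the second
    return df.drop(columns=sorted(drop))
```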

9 Relevance vs. Redundancy: MRMR (continuous case)
Relevance: $V = \frac{1}{|S|} \sum_{x_i \in S} F(x_i, c)$ (F-statistic between feature $x_i$ and class $c$)
Redundancy: $W = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} |\mathrm{corr}(x_i, x_j)|$
MRMR: $\max_S (V - W)$ or $\max_S (V / W)$
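A greedy implementation of this criterion is straightforward. The following sketch uses scikit-learn's F-statistic for the relevance term, the absolute Pearson correlation for the redundancy term, and the difference criterion $\max(V - W)$; it illustrates the method and is not the code used in the analysis:

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Greedy MRMR (continuous case): start with the most relevant feature,
# then repeatedly add the feature maximizing relevance minus redundancy.
def mrmr(X: np.ndarray, y: np.ndarray, k: int) -> list:
    F, _ = f_classif(X, y)                       # relevance of each feature
    corr = np.abs(np.corrcoef(X, rowvar=False))  # pairwise redundancy
    selected = [int(np.argmax(F))]
    while len(selected) < k:
        rest = [i for i in range(X.shape[1]) if i not in selected]
        scores = [F[i] - corr[i, selected].mean() for i in rest]
        selected.append(rest[int(np.argmax(scores))])
    return selected
```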

10 Feature Selection Stability
Jaccard: $J(S_a, S_b) = \frac{|S_a \cap S_b|}{|S_a \cup S_b|}$
Average over many sets of variables: $\hat{J} = \frac{2}{l(l-1)} \sum_{a < b} J(S_a, S_b)$ for $l$ selection runs
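In code, the stability measure is a few lines (a sketch; `feature_sets` would hold the variable sets selected in repeated runs, e.g. on different subsamples, and the example sets are hypothetical):

```python
from itertools import combinations

# Jaccard similarity of two feature sets, averaged over all pairs of runs.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def average_stability(feature_sets: list) -> float:
    pairs = list(combinations(feature_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Example: three hypothetical selection runs
print(average_stability([{"x1", "x2", "x3"}, {"x1", "x2", "x4"}, {"x1", "x3", "x4"}]))
```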

11 Comparing Forward Selection and MRMR

12 Training and Validation of a Random Forest
- Use an ensemble of simple decision trees.
- Obtain the final classification as an average over all trees.

13 Training and Validation of a Random Forest
- Use an ensemble of simple decision trees.
- Obtain the final classification as an average over all trees.
- Use 5-fold cross validation to validate the performance of the forest (see the sketch below).
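A minimal sketch of this setup with scikit-learn (toy data stands in for the real feature matrix; 500 trees as quoted on the following slides):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for the real features and labels (1 = neutrino, 0 = muon).
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Forest of simple trees; the final score is the average over all trees.
forest = RandomForestClassifier(n_estimators=500, random_state=0)

# 5-fold cross validation of the forest's performance.
scores = cross_val_score(forest, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```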

14 Random Forest and Cross Validation in Detail (1)
- Background muons: 750,000 in total (CORSIKA, Polygonato model); 600,000 available for training.
- Neutrinos: 70,000 in total (NuGen, $E^{-2}$ spectrum); 56,000 available for training.
- Sampling: 27,000 events drawn from each class for training.

15 Random Forest and Cross Validation in Detail (2)
- 150,000 background muons and 14,000 neutrinos available for testing.
- Train a forest of 500 trees on the sampled 27,000 events per class, apply it to the test events, and repeat five times.
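The per-fold balanced sampling can be sketched as follows (class sizes from the slides; function and variable names are illustrative, and the train/test split shown is a simplification of the scheme above):

```python
import numpy as np

# Draw a balanced training sample: n_per_class events from each class.
def balanced_training_indices(y, n_per_class, rng):
    sig = rng.choice(np.flatnonzero(y == 1), n_per_class, replace=False)
    bkg = rng.choice(np.flatnonzero(y == 0), n_per_class, replace=False)
    return np.concatenate([sig, bkg])

rng = np.random.default_rng(0)
y = np.concatenate([np.ones(56_000), np.zeros(600_000)])  # training pools
for fold in range(5):
    train_idx = balanced_training_indices(y, 27_000, rng)
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    # train the 500-tree forest on train_idx, evaluate on test_idx
```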

16 Random Forest Output

17 Random Forest Output
We need an additional cut on the output of the Random Forest!

18 Random Forest Output: Cut at 500 trees
We need an additional cut on the output of the Random Forest! Applying this cut to experimental data yields 27,771 neutrino candidates.
- ± 480 expected neutrino candidates
- ± 480 expected background muons
- Background rejection: %
- Signal efficiency: 18.2%
- Estimated purity: (99.59 ± 0.37)%
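The cut itself is a simple threshold on the forest's signalness score; purity, efficiency, and rejection are then read off the selected simulated events. A sketch with illustrative names, not the analysis code:

```python
import numpy as np

# Apply a threshold on the forest output and summarize the selection.
def selection_summary(scores, y_true, cut):
    selected = scores >= cut
    purity = y_true[selected].mean()              # signal fraction after the cut
    efficiency = selected[y_true == 1].mean()     # fraction of signal kept
    rejection = 1 - selected[y_true == 0].mean()  # fraction of background removed
    return purity, efficiency, rejection
```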

19 Unfolding the spectrum with TRUEE
This is no data mining... but it ain't magic either.

20 Moving on... IC79
- 212 neutrino candidates per day
- neutrino candidates in total
- 330 ± 200 background muons
- The entire analysis chain can be applied to other detector configurations, with minor changes (e.g. the ice model).

21 Summary and Outlook
- % background rejection
- Purities above 99% are routinely achieved (MRMR + Random Forest).
- Future improvements? By starting at an earlier analysis level...

22 Backup Slides

23 RapidMiner in a Nutshell
- Developed at the Department of Computer Science at TU Dortmund (formerly YALE)
- Operator-based, written in Java
- It used to be open source
- Many, many plugins thanks to a rather active community
- One of the most widely used data mining tools

24 What I like about it
- The data flow is nicely visualized and can easily be followed and comprehended
- Rather easy to learn, even without programming experience
- Large community (updates, bugfixes, plugins)
- A professional tool (they actually make money with it!)
- Good support
- Many tutorials can be found online, even specialized ones
- Most operators work like a charm
- Extendable

25 Relevance vs. Redundancy: MRMR (discrete case)
Relevance: $V = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c)$
Redundancy: $W = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j)$
MRMR: $\max_S (V - W)$ or $\max_S (V / W)$
Here $I(\cdot\,;\cdot)$ denotes the mutual information.
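A sketch of the discrete criterion, with scikit-learn's mutual information score standing in for $I(\cdot\,;\cdot)$ (features are assumed to be discretized; names are illustrative):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Greedy MRMR (discrete case): mutual information with the class as
# relevance, mean mutual information with the already selected features
# as redundancy, difference criterion max(V - W).
def mrmr_discrete(X, y, k):
    n = X.shape[1]
    V = np.array([mutual_info_score(X[:, i], y) for i in range(n)])
    selected = [int(np.argmax(V))]
    while len(selected) < k:
        rest = [i for i in range(n) if i not in selected]
        scores = [
            V[i] - np.mean([mutual_info_score(X[:, i], X[:, j]) for j in selected])
            for i in rest
        ]
        selected.append(rest[int(np.argmax(scores))])
    return selected
```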

26 Feature Selection Stability
Jaccard: $J(S_a, S_b) = \frac{|S_a \cap S_b|}{|S_a \cup S_b|}$
Kuncheva: $I_C(S_a, S_b) = \frac{rn - k^2}{k(n - k)}$ with $r = |S_a \cap S_b|$, $k = |S_a| = |S_b|$, and $n$ the total number of features
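Kuncheva's index corrects the raw overlap for the overlap expected by chance. A sketch for two equally sized sets (the example sets are hypothetical; 477 is the variable count from the preselection slide):

```python
# Kuncheva consistency index for two feature sets of equal size k,
# drawn from n_total features; 1 = identical sets, ~0 = chance overlap.
def kuncheva(a: set, b: set, n_total: int) -> float:
    k = len(a)      # assumes len(a) == len(b)
    r = len(a & b)
    return (r * n_total - k ** 2) / (k * (n_total - k))

print(kuncheva({"x1", "x2", "x3"}, {"x1", "x2", "x4"}, n_total=477))
```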

27 Ensemble Methods
- With weights (e.g. boosting)
- Without weights (e.g. Random Forest)

28 Random Forest: What is randomized?
- Randomness 1: the events each tree is trained on (bagging)
- Randomness 2: the variables that are available for a split
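Both sources of randomness correspond directly to standard Random Forest parameters, shown here in scikit-learn as an illustration (the analysis itself used RapidMiner, see the earlier backup slides):

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,
    bootstrap=True,       # randomness 1: each tree trains on a bootstrap sample (bagging)
    max_features="sqrt",  # randomness 2: random subset of variables at every split
)
```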

29 Are we actually better than simpler methods?