Molecular Biomedical Informatics 分子生醫資訊實驗室
Machine Learning and Bioinformatics 機器學習與生物資訊學

Feature selection

Related issues
Feature selection
–scheme-independent / scheme-specific
Feature discretization
Feature transformations
–Principal Component Analysis (PCA), text and time series
Dirty data (data cleaning and outlier detection)
Meta-learning
–bagging (with costs), randomization, boosting, …
Using unlabeled data
–clustering for classification, co-training and EM
Engineering the input and output

Just apply a learner? Please DON'T
Scheme/parameter selection
–treat the selection process as part of the learning process
Modifying the input
–data engineering to make learning possible or easier
Modifying the output
–combining models to improve performance

Feature selection
Adding a random (i.e. irrelevant) attribute can significantly degrade C4.5's performance
–attribute selection deeper in the tree is then based on smaller and smaller amounts of data
Instance-based learning is very susceptible to irrelevant attributes
–the number of training instances required increases exponentially with the number of irrelevant attributes
Relevant attributes can also be harmful (why?)
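A quick way to see this effect is to pad a dataset with random noise columns and watch cross-validated k-NN accuracy fall. The sketch below is illustrative only; the synthetic dataset, the number of neighbours and the noise counts are arbitrary choices, not values from the slides.

```python
# Illustrative sketch: adding irrelevant (random) features degrades k-NN accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)

for n_noise in (0, 5, 20, 50):
    noise = rng.normal(size=(X.shape[0], n_noise))   # irrelevant columns
    X_noisy = np.hstack([X, noise])
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                          X_noisy, y, cv=10).mean()
    print(f"{n_noise:2d} irrelevant features -> CV accuracy {acc:.3f}")
```

With enough noise columns the nearest neighbours are determined mostly by the irrelevant dimensions, which is the curse-of-dimensionality behaviour referred to above.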

"What's the difference between theory and practice?" an old question asks. "There is no difference," the answer goes, "—in theory. But in practice, there is."

Scheme-independent selection
Assess features based on general characteristics (relevance), independently of the learning scheme
Find the smallest subset of features that separates the data
Use a different learning scheme
–e.g. use the attributes selected by a decision tree for kNN
kNN can itself select features
–weight features according to "near hits" and "near misses" (see the sketch below)
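The "near hit / near miss" weighting is the idea behind the Relief family of methods. Below is a minimal sketch, assuming a NumPy array of numeric features already scaled to [0, 1]; the L1 distance and the iteration count are arbitrary choices, not part of the slide.

```python
# Minimal Relief-style sketch: a feature is rewarded when it separates an
# instance from its nearest other-class neighbour ("near miss") and penalized
# when it differs from its nearest same-class neighbour ("near hit").
import numpy as np

def relief_weights(X, y, n_iter=100, seed=0):
    """Relief-style feature weights; larger means more relevant."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.randint(n)
        dist = np.abs(X - X[i]).sum(axis=1).astype(float)  # L1 distance to every instance
        dist[i] = np.inf                                    # never pick the instance itself
        hit = np.where(y == y[i], dist, np.inf).argmin()    # nearest same-class neighbour
        miss = np.where(y != y[i], dist, np.inf).argmin()   # nearest other-class neighbour
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter
```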

Redundant (but relevant) features

Feature subsets for weather data

Searching feature space
The number of feature subsets is exponential in the number of features
Common greedy approaches (see the forward-selection sketch below)
–forward selection
–backward elimination
More sophisticated strategies
–bidirectional search
–best-first search: can find the optimal subset
–beam search: an approximation to best-first search
–genetic algorithms
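A greedy forward-selection search can be sketched in a few lines. This version scores each candidate subset by the cross-validated accuracy of whatever scikit-learn estimator is passed in; the stopping rule (stop when no single addition improves the score) and the cv value are illustrative assumptions, and X is assumed to be a NumPy array.

```python
# Greedy forward selection: start from the empty subset and repeatedly add the
# single feature that most improves cross-validated accuracy.
from sklearn.model_selection import cross_val_score

def forward_select(estimator, X, y, cv=5):
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -float("inf")
    while remaining:
        scores = [(cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean(), f)
                  for f in remaining]
        score, f = max(scores)
        if score <= best_score:          # no single feature improves the subset
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the feature whose removal improves (or least hurts) the score.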

Scheme-specific selection
Wrapper approach to attribute selection
–implement a "wrapper" around the learning scheme
–evaluate each subset by cross-validation performance (see the sketch below)
Time consuming
–can be sped up with a prior ranking of attributes
Race search: use a significance test to stop evaluating a candidate early if it is unlikely to "win"
–can be used with forward or backward selection, prior ranking, or special-purpose schemata search
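For an off-the-shelf wrapper, recent versions of scikit-learn (0.24 or later) ship a SequentialFeatureSelector that wraps the target learner and scores every candidate subset by that learner's cross-validated performance. The dataset, the k-NN learner and the 10-feature budget below are illustrative choices only.

```python
# Wrapper-style selection with scikit-learn's built-in sequential selector:
# candidate subsets are scored by cross-validating the wrapped learner itself.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)
selector = SequentialFeatureSelector(knn, n_features_to_select=10,
                                     direction="forward", cv=5)
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```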

Feature selection itself is a research topic in machine learning

Random forest

Random forest
Breiman (1996, 1999)
Handles both classification and regression
Algorithm
–bootstrap aggregation (bagging) of classification trees
–attempts to reduce the variance of a single tree
–out-of-bag (OOB) error rate, a built-in substitute for cross-validation, to assess misclassification rates
–permutation to determine feature importance
Assumes all trees are independent, identically distributed draws, each minimizing the loss function at every node
–randomness comes from drawing the data for each tree and the candidate features for each node

Random forest: the algorithm
Draw a bootstrap sample of the data
Using the roughly 2/3 of instances that appear in the bootstrap sample, fit a tree to its greatest depth, choosing the split at each node by minimizing the loss function over a random sample of covariates (sample size is user specified)
For each tree
–predict the class of the leftover ~1/3 (out-of-bag) instances and calculate the misclassification rate, the OOB error rate
–for each feature in the tree, permute its values, recompute the OOB error and compare with the original; the increase indicates the feature's importance
Aggregate the OOB errors and importance measures from all trees to obtain the overall OOB error rate and the feature importance measures
A sketch of this procedure follows.
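This is a minimal sketch of the procedure described above, using NumPy and a scikit-learn decision tree as the base learner. The "sqrt" feature-sampling rule, the tree count and the averaging of per-tree OOB errors are illustrative assumptions; Breiman's formulation aggregates OOB votes per instance, but the per-tree version keeps the sketch short.

```python
# Illustrative random-forest sketch: bootstrap a sample per tree, fit on the
# in-bag rows, estimate error on the out-of-bag (OOB) rows, and score each
# feature by how much permuting it increases the OOB error.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_oob(X, y, n_trees=100, max_features="sqrt", seed=0):
    rng = np.random.RandomState(seed)
    n, d = X.shape
    oob_errors, importances = [], np.zeros(d)
    for _ in range(n_trees):
        in_bag = rng.randint(0, n, n)               # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), in_bag)    # roughly 1/3 of rows are left out
        tree = DecisionTreeClassifier(max_features=max_features, random_state=rng)
        tree.fit(X[in_bag], y[in_bag])              # grown to full depth by default
        oob_err = np.mean(tree.predict(X[oob]) != y[oob])
        oob_errors.append(oob_err)
        for j in range(d):                          # permutation importance per feature
            X_perm = X[oob].copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            importances[j] += np.mean(tree.predict(X_perm) != y[oob]) - oob_err
    # Simplification: average per-tree OOB errors instead of aggregating OOB votes.
    return np.mean(oob_errors), importances / n_trees
```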

Today's exercise

Feature selection
Use feature selection tricks to refine your feature program. Upload and test it in our simulation system. Finally, commit your best version and send TA Jang a report before 23:59 11/19 (Mon).