Nawanol Theera-Ampornpunt, Seong Gon Kim, Asish Ghoshal, Saurabh Bagchi, Ananth Grama, and Somali Chaterji: Fast Training on Large Genomics Data using Distributed Support Vector Machines

Presentation transcript:

Fast Training on Large Genomics Data using Distributed Support Vector Machines
Nawanol Theera-Ampornpunt, Seong Gon Kim, Asish Ghoshal, Saurabh Bagchi, Ananth Grama, and Somali Chaterji
Department of Computer Science and Department of Electrical and Computer Engineering, Purdue University

Motivation
The amount of genomics data is increasing rapidly
ML classifiers are slow in their training phase
We want to speed up training
– New datasets are being generated through genomics experiments at a fast rate
– Diverse datasets need separate models to be trained
Can we make use of large distributed clusters to speed up training?

Contributions
1. We show how to build a machine learning (ML) classifier (an SVM classifier) for a biologically important use case: prediction of DNA elements (called “enhancers”) that amplify gene expression and can be located at great distances from the target genes, making the prediction problem challenging.
2. We show that a serial SVM cannot handle training on even a fraction of the experimentally available dataset. We then apply a previously proposed technique, Cascade SVM, to the problem and adapt it to create our own computationally efficient classifier, EP-SVM.
3. We present a detailed empirical characterization of the effect of the number of cores, the communication cost, and the number of partitions on Cascade SVM training time.

Background: The Epigenetic Code
[Figure: epigenetic mechanisms involved in the regulation of gene expression.] Cytosine residues within DNA can be methylated, and lysine and arginine residues of histone proteins can be modified.

Background: Histone Modifications
The interaction of DNA methylation, histone modification, nucleosome positioning, and other factors, such as small RNAs, contributes to the overall epigenome.

Enhancer Prediction Problem
Goal: predict the genome-wide locations of intergenic enhancers, based on pattern matching of the proteins flanking the DNA, i.e., the patterns of histone modifications.
Specifically, we look at locations where specific transcription factors bind to the DNA, and use the histone modification patterns at those locations to predict whether enhancers are active.
Epigenetic regulation by TFs: transcription factors are proteins that control which genes are turned on or off in the genome. They do so by binding to DNA and other proteins. Once bound to DNA, they can promote or block the enzyme that controls the reading, or “transcription,” of genes, making genes more or less active.

Support Vector Machine (SVM)
A popular binary classification method: it finds the maximum-margin hyperplane that separates the two classes.
[Figure: linear SVM]
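For reference, the maximum-margin hyperplane is the solution of the standard soft-margin primal problem (textbook formulation, not shown on the slide):

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\left(\mathbf{w}^\top \mathbf{x}_i + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1,\dots,n
```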

SVM — Kernel Trick
The “kernel trick” allows a non-linear decision boundary
– The kernel function maps input features to a higher-dimensional space
Time complexity for training: O(n³)
Running the serial version on the entire dataset (300 GB) would take 45.4 × 10³ years!
[Figure: kernel SVM]
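The trick enters through the standard dual problem, where training points appear only via inner products, which are replaced by a kernel function k; the RBF kernel shown as an example is a common choice, not one stated on the slide:

```latex
\max_{\boldsymbol{\alpha}} \;\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_{i} \alpha_i y_i = 0,
\qquad \text{e.g. } k(\mathbf{x}, \mathbf{z}) = \exp\!\left(-\gamma \|\mathbf{x} - \mathbf{z}\|^2\right)
```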

Cascade SVM (1)
Proposed by Graf et al. at NIPS 2004
SVM learning amounts to finding the support vectors (SVs)
The training data is split into disjoint subsets, and SVs are computed independently for each subset
The SVs from one layer are fed as input to the next layer
A hierarchical arrangement of layers of SV creation finally leads to a single integrated set of SVs

Cascade SVM (2)
[Figure: example schematic of the cascade with 8 partitions]

SV Creation in Cascade SVM
[Figure: a toy problem illustrating the filtering process, panels (a)–(c)]
1. SVs are calculated independently (and concurrently) for partitions of the entire data (panels (a) and (b)).
2. These SVs are merged at the next stage (panel (c)).
3. The result is close to what would be obtained if the SVs had been computed in one pass over the entire dataset (dashed curve in panel (c)).

Cascade SVM (3)
Multiple iterations are needed to guarantee the optimal solution
– Empirically, for us, one iteration is enough to produce a good model
The last layer is the serial bottleneck
– The training time of this step depends on the number of support vectors, which is dataset-dependent
The high memory consumption of SVM is alleviated by the partitioning
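To make the layered filtering concrete, here is a minimal single-iteration sketch in Python with scikit-learn. The function names, hyperparameters, and pairwise merge schedule are illustrative assumptions, not the authors' EP-SVM implementation, and the distribution across machines is omitted:

```python
# Minimal Cascade SVM sketch: one pass, no distribution or iteration logic.
# Assumes shuffled data so every partition contains both classes.
import numpy as np
from sklearn.svm import SVC

def train_and_filter(X, y, C=1.0, gamma="scale"):
    """Train an SVM on one partition and keep only its support vectors."""
    clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)
    return X[clf.support_], y[clf.support_]

def cascade_svm(X, y, n_partitions=8):
    assert n_partitions & (n_partitions - 1) == 0, "use a power of two"
    # Layer 1 inputs: disjoint partitions of the training data.
    parts = list(zip(np.array_split(X, n_partitions),
                     np.array_split(y, n_partitions)))
    # Each pass trains one layer and merges SV sets pairwise (8 -> 4 -> 2 -> 1).
    while len(parts) > 1:
        sv_sets = [train_and_filter(Xp, yp) for Xp, yp in parts]
        parts = [(np.vstack((sv_sets[i][0], sv_sets[i + 1][0])),
                  np.concatenate((sv_sets[i][1], sv_sets[i + 1][1])))
                 for i in range(0, len(sv_sets), 2)]
    # Final (serial) layer: train on the single merged set of SVs.
    Xf, yf = parts[0]
    return SVC(C=1.0, kernel="rbf").fit(Xf, yf)
```

In the real system, each `train_and_filter` call within a layer would run on a different core or machine; the pairwise merge schedule here mirrors the 8-partition schematic above, and the final `fit` is the serial bottleneck the slide describes.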

Data Set
From ENCODE (ENCyclopedia Of DNA Elements)
24 histone modifications from ChIP-seq
– Binned into 100 bp intervals
TFBS as positive samples, TSS as negative samples
Disjoint training and test sets
– 135 MB data size
– 5.5k negative samples
– 25k positive samples
– Negative : Positive = 18 : 82
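A hedged sketch of how such a feature vector might be assembled: the array shapes mirror the slide's 24 marks binned at 100 bp, but the window size, signal values, and function name are illustrative assumptions:

```python
# Illustrative feature construction: 24 histone-mark signals binned at 100 bp,
# flattened over a window around each candidate site (TFBS = positive, TSS = negative).
import numpy as np

def features_at(signals, center_bin, half_window=10):
    """signals: (24, n_bins) array of binned ChIP-seq coverage for one chromosome.
    Returns a flat feature vector over a +/- half_window-bin window."""
    window = signals[:, center_bin - half_window : center_bin + half_window + 1]
    return window.ravel()  # length: 24 * (2 * half_window + 1)

# Toy example: random coverage for 24 marks over 1,000 bins (100 kb).
rng = np.random.default_rng(0)
signals = rng.poisson(3.0, size=(24, 1000)).astype(float)
x = features_at(signals, center_bin=500)
print(x.shape)  # (504,)
```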

Experimental Testbed
Cluster of 8 machines connected by 1 Gigabit Ethernet
– 2.4 GHz Intel Xeon X3430 CPU with 4 cores
– 8 GB memory

Results – Accuracy
[Figure: precision-recall curve]
– At equal weight, precision = 94.6%, recall = 96.2%
Conclusion: SVM is an appropriate classifier for the problem
For DEEP and RFECS, the two most recent approaches that use sophisticated statistical models, the highest recall is below 93% and 88%, respectively
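One way to reproduce such a curve with scikit-learn; this is a sketch on synthetic data whose class balance roughly mirrors the slide's 18:82 negative:positive split, not the authors' evaluation code:

```python
# Sketch: precision-recall curve for an SVM from its decision scores.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: ~18% class 0 (negative), ~82% class 1 (positive).
X, y = make_classification(n_samples=2000, weights=[0.18], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scores = SVC(kernel="rbf").fit(X_tr, y_tr).decision_function(X_te)
precision, recall, _ = precision_recall_curve(y_te, scores)
plt.plot(recall, precision)
plt.xlabel("Recall"); plt.ylabel("Precision"); plt.show()
```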

Results – Partitioning
Training time vs. number of partitions for Cascade SVM
Setup: 2 machines × 1 core each
Conclusion: the number of partitions should be several times the number of cores
– Even at 24× the number of cores, training time still shows a decreasing trend

Results – Distributed SVM (1)
Training time vs. number of cores (number of partitions = number of cores)
Superlinear speedup at 2 cores, due to the partitioning of Cascade SVM
Sublinear speedup beyond 2 cores, due to the serial bottleneck
Number of support vectors = 28.2% of the training data points

Results – Distributed SVM (2)
Training time vs. number of cores (number of partitions = 96)
Conclusion: more partitions = higher speedup
– A speedup of 4 even with a single core (relative to unpartitioned serial SVM)!

Conclusion
We applied a distributed SVM algorithm, Cascade SVM, to a large genomics dataset
SVM gives high accuracy and is thus an appropriate classifier for this domain (recall = 0.96, precision = 0.94)
We achieved a speedup of 8 with 32 cores
– Limited by the number of support vectors at the final stage
The number of partitions should be set larger than the number of cores
Speedup can be obtained even with a single core!
Code at:
