Application of Hadoop to Proteomic Searches

Steven Lewis 1, Attila Csordas 2, Sarah Killcoyne 1, Henning Hermjakob 2, John Boyle 1
1 Institute for Systems Biology, Seattle, Washington, USA
2 PRIDE Group, Proteomics Services Team, EMBL European Bioinformatics Institute

Introduction

Shotgun proteomics involves large search problems that compare many spectra against many possible peptides. As researchers apply modifications and consider alternate cleavages, the search space grows by a few orders of magnitude, and modern searches strain the resources of a single machine. We have an implementation that uses Hadoop, the Apache implementation of Google's Map-Reduce algorithm, to search proteomics databases.

Performance

Running on a 10-node, 80-CPU cluster, the Hadoop job (15,000 proteins, 12 million semitryptic peptides) took 26 minutes and performed 400 million scorings; the same job run with X!Tandem on a single 4-CPU machine took over 24 hours. We are in the process of testing against multiprocessor and alternative Hadoop streaming implementations of X!Tandem.

Funding

This project is supported by Award Number R01GM from NIGMS and R01CA from NCI.

Summary

The advantages of using the Hadoop framework are that the infrastructure is widely used and well tested, that mechanisms for dealing with failures and retries are built into the framework, and that resources can be expanded simply by growing the cluster. Two further advantages of the specific algorithms are the ability to run multiple scoring algorithms in a single pass and the use of databases and other caching. In addition, performance scales with the size of the cluster, allowing the system to keep pace as both the complexity of the fit and the size of the data grow.

Map-Reduce Algorithm

Map-Reduce is an algorithm developed by Google for processing large data sets on massive clusters; Apache provides the open-source Hadoop implementation. Data processing proceeds in two steps: Map and Reduce. In the map step a series of values is read and emitted as key/value pairs. After all maps complete, the keys are sorted and sent to a reduce step. In many cases, including this one, multiple Map-Reduce jobs are chained.

Mapper Contract

There is no guarantee of the order in which keys will be received, or of which map process will handle a given value.

Reducer Contract

All items tagged with a specific key are sent to a single reducer in a single step. All keys sent to a specific reducer arrive in a known sort order. (A minimal mapper/reducer sketch illustrating these contracts appears below.)

Hadoop Infrastructure Contract

Tasks are distributed to processors in a "fair" fashion. Tasks that fail or run slowly are restarted on another machine. Failed tasks are retried before the entire job is failed. Hardware failures are handled by the framework.

SQL Database Population

FASTA files are converted into tables in a SQL database. Peptides, possibly carrying modifications, are stored using the m/z ratio, as an integer, as the key. Separate tables hold tryptic, semitryptic and modified peptides. Databases normally need to be generated only infrequently, since they can be reused across a pipeline. (A schema sketch appears below.)

Multiple Scoring Algorithms

Most of the infrastructure brings peptides and spectra together in the scoring reducer. The scoring algorithm itself is a small and interchangeable portion of that code. The architecture allows multiple algorithms, for example Sequest, K-Score and X!Tandem, to be run in this step and combined in the output. (A pluggable-scorer sketch appears below.)

Architecture

(Shown as a figure in the original poster.)
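To make the mapper and reducer contracts concrete, here is a minimal, hypothetical Hadoop sketch in the spirit of this pipeline: the mapper keys peptides by integer m/z bin, and the reducer receives every peptide in a bin in a single call, in sorted bin order. The class names, input format and mass helper are illustrative assumptions, not the authors' actual search code.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch: bin peptides by integer m/z so the reducer sees
// all peptides for one bin together (the reducer contract above).
public class PeptideBinning {

    // Input: one peptide sequence per line; output key = integer m/z bin.
    public static class PeptideMapper
            extends Mapper<Object, Text, IntWritable, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String peptide = value.toString().trim();
            if (peptide.isEmpty()) return;
            // computeMassOverCharge is an assumed helper, not a Hadoop API.
            int mzBin = (int) computeMassOverCharge(peptide);
            context.write(new IntWritable(mzBin), new Text(peptide));
        }
    }

    // All peptides for a given bin arrive in one reduce call, and bins
    // arrive in sorted order; no ordering is guaranteed on the map side.
    public static class BinReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable bin, Iterable<Text> peptides,
                              Context context)
                throws IOException, InterruptedException {
            StringBuilder joined = new StringBuilder();
            for (Text p : peptides) {
                if (joined.length() > 0) joined.append(',');
                joined.append(p.toString());
            }
            context.write(bin, new Text(joined.toString()));
        }
    }

    // Crude mass estimate, purely illustrative; a real implementation
    // would sum residue masses and account for charge.
    static double computeMassOverCharge(String peptide) {
        return peptide.length() * 110.0; // average residue mass ~110 Da
    }
}
```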
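The peptide tables described under "SQL Database Population" might look like the following sketch. The table and column names, and the embedded SQLite database, are assumptions; the poster does not give the actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.sql.Types;

// Hedged sketch: peptides keyed by an integer m/z, one table per peptide
// class (tryptic, semitryptic, modified). Requires a JDBC driver such as
// sqlite-jdbc on the classpath; all names here are assumptions.
public class PeptideDb {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:peptides.db")) {
            try (Statement s = c.createStatement()) {
                s.execute("CREATE TABLE IF NOT EXISTS semitryptic_peptides ("
                        + " mz_key INTEGER NOT NULL,"  // integer m/z used as the key
                        + " sequence TEXT NOT NULL,"
                        + " modifications TEXT)");
                s.execute("CREATE INDEX IF NOT EXISTS idx_mz"
                        + " ON semitryptic_peptides (mz_key)");
            }
            try (PreparedStatement p = c.prepareStatement(
                    "INSERT INTO semitryptic_peptides"
                    + " (mz_key, sequence, modifications) VALUES (?, ?, ?)")) {
                p.setInt(1, 1044);           // integer m/z key for the peptide
                p.setString(2, "AGSEFK");    // illustrative peptide sequence
                p.setNull(3, Types.VARCHAR); // no modifications in this row
                p.executeUpdate();
            }
        }
    }
}
```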
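The "small and interchangeable" scorer could be expressed as a plain Java interface that the scoring reducer iterates over, one entry per algorithm. The interface, the trivial shared-peak scorer and all names below are assumptions, not the project's real code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pluggable-scorer interface for the scoring reducer.
interface SpectrumScorer {
    String name();  // e.g. "Sequest", "K-Score", "X!Tandem"
    double score(double[] observedPeaks, double[] theoreticalPeaks);
}

// A deliberately trivial scorer that counts peaks matching within a
// tolerance; real scoring algorithms are far more sophisticated.
class SharedPeakScorer implements SpectrumScorer {
    public String name() { return "SharedPeakCount"; }
    public double score(double[] observed, double[] theoretical) {
        final double tolerance = 0.5; // Da, illustrative
        int matches = 0;
        for (double t : theoretical) {
            for (double o : observed) {
                if (Math.abs(t - o) <= tolerance) { matches++; break; }
            }
        }
        return matches;
    }
}

// The reducer would hold several scorers and emit one score per algorithm
// for each peptide/spectrum pair it brings together.
class MultiAlgorithmScoring {
    private final List<SpectrumScorer> scorers;

    MultiAlgorithmScoring(List<SpectrumScorer> scorers) {
        this.scorers = scorers;
    }

    Map<String, Double> scorePair(double[] observed, double[] theoretical) {
        Map<String, Double> results = new LinkedHashMap<>();
        for (SpectrumScorer s : scorers) {
            results.put(s.name(), s.score(observed, theoretical));
        }
        return results;
    }
}
```

Because every scorer sees the same peptide/spectrum pair, adding an algorithm amounts to registering another implementation rather than re-running the whole search.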
Future Directions: Dynamically Generating In Silico Spectra

A static database is good for standard peptides, but when isotopic labeling, post-translational modifications and unconstrained searches are added, maintaining that database becomes expensive. We are working on adding another Hadoop job to generate theoretical spectra dynamically, or to allow scoring against a SpectraST-style database of measured spectra. (A fragment-mass sketch appears below.)
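As a sketch of what generating in silico spectra could involve, here is a routine, callable from a hypothetical Hadoop mapper, that turns a peptide into theoretical singly charged b- and y-ion fragment masses. The abbreviated residue-mass table and the code structure are assumptions; only the standard monoisotopic constants come from the literature.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of dynamic in silico spectrum generation: theoretical
// b- and y-ion masses for a peptide. Not the authors' planned code.
public class InSilicoSpectra {

    private static final double PROTON = 1.007276;  // proton mass (Da)
    private static final double WATER  = 18.010565; // monoisotopic H2O (Da)

    // Abbreviated monoisotopic residue masses; a real table covers all 20.
    private static final Map<Character, Double> RESIDUE = new HashMap<>();
    static {
        RESIDUE.put('G', 57.02146);
        RESIDUE.put('A', 71.03711);
        RESIDUE.put('S', 87.03203);
        RESIDUE.put('V', 99.06841);
        RESIDUE.put('L', 113.08406);
        RESIDUE.put('K', 128.09496);
        RESIDUE.put('R', 156.10111);
    }

    // Returns b1..b(n-1) followed by y1..y(n-1) for a peptide of length n.
    public static double[] fragmentMasses(String peptide) {
        int n = peptide.length();
        double[] masses = new double[2 * (n - 1)];
        double prefix = 0.0;
        for (int i = 0; i < n - 1; i++) {            // b-ions
            prefix += RESIDUE.get(peptide.charAt(i));
            masses[i] = prefix + PROTON;
        }
        double suffix = 0.0;
        for (int i = 0; i < n - 1; i++) {            // y-ions
            suffix += RESIDUE.get(peptide.charAt(n - 1 - i));
            masses[n - 1 + i] = suffix + WATER + PROTON;
        }
        return masses;
    }

    public static void main(String[] args) {
        for (double m : fragmentMasses("GASK")) {
            System.out.printf("%.4f%n", m);
        }
    }
}
```

Inside a Hadoop job, such masses would be generated on the fly in the map step, avoiding the maintenance cost of a static peptide database when modifications or unconstrained searches expand the search space.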