Roundup on Speed. Dennis Wall, PhD; Parul Kudtarkar; Kris St. Gabriel; Todd DeLuca.

1 Roundup on Speed Dennis Wall, PhD Parul Kudtarkar Kris St. Gabriel Todd DeLuca

2 Introduction Roundup is one of the largest repositories of orthologs, covering ~250 genomes. Built using the reciprocal smallest distance (RSD) algorithm.

3 Estimating relative evolutionary rates from sequence comparisons: 1. Identification of probable orthologs. [Figure: species tree and gene tree for S. cerevisiae and C. elegans, genes A-E.] Admissible comparisons: A or B vs. D; C vs. E. Inadmissible comparisons: A or B vs. E; C vs. D.

4 Estimating relative evolutionary rates from orthologs: 1. Orthologs found using the reciprocal smallest distance algorithm. 2. Build an alignment between the two orthologs. 3. Estimate the distance given a substitution matrix. [Figure: species tree and gene tree (genes A-E, S. cerevisiae and C. elegans); fragment of an amino acid substitution matrix; the aligned orthologs:]
>Sequence C MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…
>Sequence E MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…
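The distance step can be illustrated with a much simpler estimator than the Jones-model maximum likelihood the slides actually use: the Poisson correction d = -ln(1 - p), where p is the fraction of mismatched, ungapped aligned sites. A minimal sketch (the toy sequences below are made up, not Roundup data):

```python
import math

def poisson_distance(seq1, seq2):
    # Poisson-corrected distance: a deliberately simplified stand-in for
    # the maximum-likelihood estimate under the Jones (JTT) model that
    # the slides describe. Columns containing a gap ('-') are excluded.
    assert len(seq1) == len(seq2), "sequences must be pre-aligned"
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != '-' and b != '-']
    p = sum(a != b for a, b in pairs) / len(pairs)
    return -math.log(1.0 - p)

# toy aligned fragments: one mismatch in ten ungapped columns
d = poisson_distance("MSGRTILAST", "MSGRTILASK")
```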

5 Reciprocal BLAST is incomplete when searching for orthologs. The highest BLAST hit is often not the nearest phylogenetic neighbor (Koski and Golding, 2001). Reciprocal BLAST is more likely to fail in either the forward or reverse direction, forcing rejection of the pair. 1824 BBH orthologs vs. 2777 RSD orthologs.

6 Reciprocal smallest distance algorithm (forward). [Figure: (1) the query (>orf6.7505.prot) is BLASTed against the subject genome and HSPs are collected; (2) each hit is aligned to the query; (3) the distance is obtained by maximum likelihood under the Jones model of amino acid substitution (Jones et al. 1992), with a continuous gamma distribution, alpha = 1.530.] Candidate distances: ORFP:YKR080W vs. orf6.1984.prot, 0.7786; ORFP:YKR080W vs. orf6.2111.prot, 2.7786. Smallest distance: orf6.1984.prot, 0.7786.

7 Reciprocal smallest distance algorithm (reverse). [Figure: the forward hit (>ORFP:YKR080W) is BLASTed back against the original genome; HSPs are aligned and distances are obtained by maximum likelihood under the Jones model (Jones et al. 1992), continuous gamma, alpha = 1.530.] Candidate distances: orf6.1984.prot vs. ORFP:YKR080W, 1.26; orf6.2111.prot vs. ORFP:YKR080W, 2.7786. Smallest distance: ORFP:YKR080W, 1.26. The reverse best hit is compared with the original query sequence: if they match, an RSD ortholog pair is found.

8 RSD algorithm summary
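The forward and reverse passes above can be summarized in code. In this minimal sketch the `dist` table stands in for the whole BLAST / alignment / maximum-likelihood pipeline, and the gene names and distance threshold are made-up toys:

```python
def rsd_orthologs(genome_a, genome_b, dist, thresh=1.0):
    # Reciprocal smallest distance, sketched with a precomputed distance
    # table dist[(x, y)] standing in for BLAST + alignment + ML estimation.
    # x (from genome A) and y (from genome B) are called orthologs iff each
    # is the other's smallest-distance hit and the distance is <= thresh.
    pairs = []
    for x in genome_a:
        # forward pass: smallest-distance hit of x in genome B
        y = min(genome_b, key=lambda g: dist[(x, g)])
        # reverse pass: smallest-distance hit of y back in genome A
        x_back = min(genome_a, key=lambda g: dist[(g, y)])
        if x_back == x and dist[(x, y)] <= thresh:
            pairs.append((x, y, dist[(x, y)]))
    return pairs
```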

9 RSD vs BBH

10 RSD results. Stored in MySQL; ad hoc exploratory queries are possible: 1) transitive closure; 2) any orthologs between genome A and [b, c, d, …]; 3) all orthologs among genomes [a, b, c, d, …]. [Figure: pairwise ortholog tables (A v B, A v C, A v D) holding gene pairs g_i, g_j with distances (e.g. 0.3), and per-gene binary phylogenetic profiles, e.g. (g1) 0000011000000000000011111110000.]
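The transitive-closure query (1) can be sketched as a union-find pass over the pairwise ortholog tables: any two genes linked through a chain of RSD pairs end up in the same cluster. The gene identifiers below are hypothetical:

```python
from collections import defaultdict

def transitive_closure(pairs):
    # Union-find grouping of genes into ortholog clusters from pairwise
    # RSD results: the kind of ad hoc query the MySQL store allows.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for gene in parent:
        groups[find(gene)].add(gene)
    return list(groups.values())
```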

11 Roundup uses: discovery of functional linkages and uncharacterized cellular pathways; deciphering the network organization of the cell; propensity for gene loss.

12 Use of Logic Relationships to Decipher Protein Network Organization. Peter M. Bowers, Shawn J. Cokus, David Eisenberg, and Todd O. Yeates. Science 306, 2246 (2004).

13 Use of Logic Relationships to Decipher Protein Network Organization. Peter M. Bowers, Shawn J. Cokus, David Eisenberg, and Todd O. Yeates. Science 306, 2246 (2004).

14 [Figure: phylogenetic profiles (e.g. 0000011000000000000011111110000, 0000011000000000000011111110001) are grouped into subgraphs S1 … SN, N = 885. For each subgraph, within-subgraph values, e.g. Kin = [0,0,1,1,0,0], are compared with outside values, e.g. Kout = [1,2,3,4,0,0,3,4,5,1,1,4], by t-test(Kin, Kout), yielding p-values such as P = 0.05, P = 0.01, P = 0.2.] The result is an 885 x 885 matrix of p-values: every subgraph S has 885 - 1 p-values against the others.
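The per-subgraph comparison can be sketched with Welch's t statistic computed over the slide's own Kin and Kout vectors. (The slide reports p-values; turning the statistic into a p-value would additionally require the t distribution's CDF, which the standard library does not provide.)

```python
import math
import statistics

def welch_t(kin, kout):
    # Welch's (unequal-variance) t statistic comparing within-module (Kin)
    # and outside-module (Kout) values; the slide's pipeline turns such
    # statistics into an 885 x 885 matrix of p-values.
    m1, m2 = statistics.mean(kin), statistics.mean(kout)
    v1, v2 = statistics.variance(kin), statistics.variance(kout)
    return (m1 - m2) / math.sqrt(v1 / len(kin) + v2 / len(kout))

# the vectors shown on the slide
t = welch_t([0, 0, 1, 1, 0, 0], [1, 2, 3, 4, 0, 0, 3, 4, 5, 1, 1, 4])
```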

15 Phylogenetic profiles predict function. [Figure: number of significant p-values (accuracy) across functional modules.]

16 [Figure: accuracy vs. profile size across functional modules Z1, Z2, Z3, Z4.]

17 Propensity for gene loss. [Figure: binary presence/absence profiles across species, mapped onto a phylogeny, with a loss count and distance per gene:]
D-PGL  Dist
4      0.939
1      1.103
6      2.778
5      0.977
Phylogenetic profiles drawn from Roundup (DeLuca et al. 2006). Phylogeny built using maximum parsimony. Profiles mapped via Dollo parsimony.
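Dollo-parsimony mapping can be sketched as: the gene is gained exactly once, on the branch above the last common ancestor (LCA) of the species that have it, and every later disappearance counts as one loss. A toy version on a nested-tuple species tree; the species names and topology are illustrative, not the tree from the slide:

```python
def pgl(tree, present):
    # Loss count for one gene under Dollo parsimony: one gain at the LCA
    # of the species carrying the gene (given as the set `present`),
    # losses only below that point. `tree` is a nested 2-tuple of leaves.
    def leaves(node):
        return {node} if isinstance(node, str) else leaves(node[0]) | leaves(node[1])

    def lca(node):
        # smallest subtree containing every species carrying the gene
        if isinstance(node, str):
            return node
        for child in node:
            if present <= leaves(child):
                return lca(child)
        return node

    def losses(node):
        # returns (loss events below node, gene observed anywhere below node)
        if isinstance(node, str):
            return 0, node in present
        (ll, lp), (rl, rp) = losses(node[0]), losses(node[1])
        # one loss event if exactly one child clade retained the gene
        return ll + rl + (1 if lp != rp else 0), lp or rp

    return losses(lca(tree))[0]
```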

18 Roundup Computational Demand. The number of RSD processes scales quadratically with the number of genomes: processes = N(N-1)/2 x 12 parameter combinations. For 1000 genomes that is 5,994,000 processes, about 23,976,000 CPU-hours, which could take ~2,700 years to complete on a single machine.
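The arithmetic behind these figures, with the 4 hours per process implied by the slide's totals (23,976,000 / 5,994,000) made an explicit assumption:

```python
def roundup_workload(n_genomes, param_combos=12, hours_per_process=4):
    # Reproduce the slide's back-of-envelope numbers: RSD runs once per
    # unordered genome pair per parameter combination. hours_per_process
    # is the per-run cost implied by the slide's totals.
    processes = n_genomes * (n_genomes - 1) // 2 * param_combos
    cpu_hours = processes * hours_per_process
    years_single_machine = cpu_hours / (24 * 365)
    return processes, cpu_hours, years_single_machine

procs, hours, years = roundup_workload(1000)
```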

19 Hadoop may help. Apache Hadoop Core is a software platform that lets one easily write and run applications that process vast amounts of data. Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located. Hadoop has been demonstrated on clusters with 2000 nodes; the current design target is 10,000-node clusters.

20 MapReduce: High Level

21 Getting Data To The Mapper

22 Partition And Shuffle

23 Writing The Output

24 HDFS limitations: no file update options (record append, etc.), all files are write-once; does not implement demand replication; designed for streaming, so random seeks devastate performance.

25 Vanilla Hadoop assumptions: the input data are the data to be processed; the data can be arbitrarily subdivided across the cluster; the mapper and reducer are written in Java. Roundup cannot accommodate any of these assumptions without serious pain.

26 Hadoop Streaming is an answer. Hadoop Streaming is a utility that comes with the Hadoop distribution. It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc
The mapper and the reducer are executables that read input from stdin (line by line) and emit output to stdout. The utility creates the map/reduce job, submits it to an appropriate cluster, and monitors its progress until completion.
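A streaming mapper is just a program that reads lines from stdin and writes tab-separated key/value lines to stdout. A minimal word-count-style mapper in Python (an illustration of the streaming contract, not part of Roundup):

```python
import sys

def map_lines(lines):
    # Hadoop-streaming-style mapper body: for each input line, emit one
    # tab-separated "key<TAB>value" record per word (the classic
    # word-count mapper, where every word maps to the count 1).
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# In a real streaming job this would be driven by stdin:
#   for record in map_lines(sys.stdin):
#       print(record)
```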

27 Hadoop Streaming. We cannot treat standard Roundup input (i.e., genomes) as input to Hadoop. But we can "trick" Hadoop into receiving and processing Roundup job commands as its inputs: an undocumented, uncharted use of Hadoop Streaming.

28 MapReduce RSD Code and Hadoop Streaming. Mapper script: input, command lines ("runstuff") for the genomes to be compared; function, runs the RSD algorithm; output, gene pairs and their evolutionary distances. Reducer script: bypassed (set to 0 in the configuration). Streaming: Hadoop programs are typically written in Java, but the Roundup code base is in Python. Options: 1. Translate the Python code into jar files using Jython. 2. Use Hadoop Streaming to pass data to the mapper function via STDIN and STDOUT.

29 Hadoop Job Distribution. AWS Medium High-CPU instance specs: 1. 2.5 EC2 compute units per core*; 2. 2 virtual cores; 3. 1.7 GB memory; 4. 32-bit platform; 5. 1690 GB instance storage; 6. $0.20 per hour. Cost estimate: cost per hour x number of instances x number of hours to run the job. For this particular test: $0.20 x 6 medium instances x ~11 min = $1.20. [Used 10 nodes for 15 comparisons across 6 genomes.] *One EC2 Compute Unit = a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

30 Hadoop Configuration Setup to Run RSD Jobs on AWS. AWS access keys and other parameters are excluded from the config file, as they are private and specific to the group.

31 Mapper:
import sys, os
for cmd in sys.stdin:
    os.system(cmd)
Streaming input file (on HDFS), "rsdrunner", line 1:
python /user/local/hadoop/RSD.py -thresh=0.8 -div=0.2 -s /user/local/hadoop/genome/genome1.aa -q /user/local/hadoop/genome/genome1.aa -output /user/local/hadoop/wank …
Limitations of HDFS and Streaming forced us to print RSD output to stdout… but it worked.
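The "rsdrunner" file itself is just one RSD command line per genome pair, so generating it is a small script. A sketch that reuses the slide's flag names; the genome list and output path below are placeholders, not real Roundup paths:

```python
from itertools import combinations

def make_rsdrunner(genome_files, thresh=0.8, div=0.2,
                   rsd_cmd="python /user/local/hadoop/RSD.py",
                   out_path="/user/local/hadoop/out"):
    # Emit one RSD command per unordered genome pair, in the style of the
    # slide's "rsdrunner" streaming input file. out_path is a made-up
    # placeholder; flag names mirror the slide's example line.
    return [f"{rsd_cmd} -thresh={thresh} -div={div} -s {a} -q {b} "
            f"-output {out_path}"
            for a, b in combinations(genome_files, 2)]

lines = make_rsdrunner(["g1.aa", "g2.aa", "g3.aa"])  # 3 pairs for 3 genomes
```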

32 Hadoop cluster launch on AWS To start the cluster Initializing the cluster

33 Roundup job run on AWS. Copy the input file to HDFS, then run RSD jobs via the Hadoop Streaming utility:
$ bin/hadoop jar contrib/streaming/hadoop-0.18.3-streaming.jar -mapper /usr/local/hadoop-0.18.3/newmapper.py -reducer NONE -input rsdrunner -output RsdResult -jobconf mapred.map.tasks=10

34 Roundup job run on AWS Output of previous command in console

35 Monitoring Roundup Jobs JobTracker and cluster monitoring

36 Monitoring Roundup Jobs TaskTracker Monitoring

37 Monitoring Roundup Jobs MapReduce job summary

38 Monitoring Roundup Jobs Running Task = No. of Map task

39 Monitoring Roundup Jobs Time Split for job completion

40 Monitoring Roundup Jobs Summary Post Job Completion

41 Shutting down the Hadoop cluster on AWS Terminate the cluster after job completion and copying data locally Terminating the cluster

42 Future Considerations. Careful design of Roundup AWS jobs to minimize time and maximize instance resources. Could be prohibitively expensive: Medium High-CPU instance, $0.20/hour; Medium High-CPU = 5 EC2 compute units, so 5 jobs per instance; assume 1 RSD run = 4 hours = $0.16; all of Roundup would cost ~$1M, but would take only ~100 days to complete using 10,000 compute units.
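The cost arithmetic can be checked directly; every parameter below comes from the slides (price, 5 concurrent jobs per instance, 4-hour runs, 10,000 compute units, and the 5,994,000-run workload from slide 18):

```python
def roundup_cost(n_genomes=1000, param_combos=12, hours_per_run=4,
                 price_per_instance_hour=0.20, jobs_per_instance=5,
                 compute_units=10000):
    # Back-of-envelope AWS cost from the slide: a $0.20/hr instance runs
    # 5 RSD jobs at once, so one 4-hour run costs $0.20 / 5 * 4 = $0.16.
    runs = n_genomes * (n_genomes - 1) // 2 * param_combos
    cost_per_run = price_per_instance_hour / jobs_per_instance * hours_per_run
    total_cost = runs * cost_per_run
    # wall-clock time if the CPU-hours are spread over all compute units
    wall_days = runs * hours_per_run / compute_units / 24
    return total_cost, wall_days

cost, days = roundup_cost()  # roughly $1M and roughly 100 days
```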

43 Hadoop Usage Examples. IBM: the Blue Cloud computing cluster at IBM uses Hadoop parallel workload scheduling. Yahoo!: 100,000 CPUs running Hadoop; 2000 nodes currently used by ad systems and web search (webgrep). The New York Times: converted ~11 million articles (scanned images) into PDF format, using EC2 for computation and S3 for data storage. Apache Mahout: uses Hadoop to build scalable machine learning algorithms such as canopy clustering and k-means. Facebook: uses Hadoop to store internal logs and dimension data sources, used for reporting, analytics, and machine learning. Able Grape: the world's smallest Hadoop cluster, 2 nodes @ 8 CPUs/node, a vertical search engine for a wine knowledge base. http://wiki.apache.org/hadoop/PoweredBy

44 Acknowledgements Parul Kudtarkar Kris St. Gabriel Todd DeLuca Rimma Pivovarov Vince Fusaro Prasad Patil Peter Kos Peter Tonellato Mike Banos All palaver participants

