1
Roundup on Speed
Dennis Wall, PhD; Parul Kudtarkar; Kris St. Gabriel; Todd DeLuca
2
Introduction
Roundup is one of the largest repositories of orthologs, covering ~250 genomes. It is built using the reciprocal smallest distance (RSD) algorithm.
3
Estimating relative evolutionary rates from sequence comparisons: 1. Identification of probable orthologs
[Figure: species tree (S. cerevisiae vs. C. elegans) and corresponding gene tree with genes A-E]
Admissible comparisons: A or B vs. D; C vs. E
Inadmissible comparisons: A or B vs. E; C vs. D
4
Estimating relative evolutionary rates from orthologs:
1. Orthologs found using the reciprocal smallest distance algorithm
2. Build an alignment between the two orthologs
3. Estimate distance given a substitution matrix
[Figure: species and gene trees; amino acid substitution matrix; alignment of the two orthologs:]
>Sequence C MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…
>Sequence E MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…
5
Reciprocal BLAST is incomplete when searching for orthologs
The highest BLAST hit is often not the nearest phylogenetic neighbor (Koski and Golding, 2001). Reciprocal BLAST is more likely to fail in either the forward or the reverse direction, forcing rejection of the pair. Example: 1,824 BBH orthologs vs. 2,777 RSD orthologs.
6
Reciprocal smallest distance algorithm (forward)
[Figure: nucleotide sequence translated to protein query orf6.7505.prot; BLAST HSPs against the target genome; alignment of orf6.7505.prot vs. ORFP:YKR080W; amino acid substitution matrix]
Distance obtained by maximum likelihood: Jones model of amino acid substitution (Jones et al. 1992), continuous gamma with alpha = 1.530.
ORFP:YKR080W  orf6.1984.prot  0.7786
ORFP:YKR080W  orf6.2111.prot  2.7786
Smallest distance: orf6.1984.prot  0.7786
7
Reciprocal smallest distance algorithm (reverse)
[Figure: the best forward hit used as the new query; BLAST HSPs against the original genome; alignment of ORFP:YKR080W vs. orf6.7505.prot; amino acid substitution matrix]
Distance obtained by maximum likelihood: Jones model of amino acid substitution (Jones et al. 1992), continuous gamma with alpha = 1.530.
orf6.1984.prot  ORFP:YKR080W  1.26
orf6.2111.prot  ORFP:YKR080W  2.7786
Smallest distance: ORFP:YKR080W  1.26
Compare with the original query sequence; if the smallest-distance hit matches it, the pair is accepted as RSD orthologs.
8
RSD algorithm summary
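The forward and reverse searches described on the previous slides can be sketched in a few lines of Python. This is a toy illustration, not the Roundup implementation: the `DIST` table of maximum-likelihood distances is made up (reusing the slide's example identifiers and values), where the real pipeline derives distances from BLAST hits aligned and scored under the Jones model.

```python
# Toy sketch of the reciprocal smallest distance (RSD) check.
# DIST is a hypothetical table of ML distances; in Roundup these come
# from BLAST HSPs aligned and scored under the Jones model.
DIST = {
    ("ORFP:YKR080W", "orf6.1984.prot"): 0.7786,
    ("ORFP:YKR080W", "orf6.2111.prot"): 2.7786,
    ("orf6.1984.prot", "ORFP:YKR080W"): 1.26,
    ("orf6.1984.prot", "orf6.0001.prot"): 3.1,
}

def smallest_distance(query, candidates):
    """Pick the candidate with the smallest ML distance to the query
    (the nearest phylogenetic neighbor, not the top BLAST score)."""
    scored = [(DIST[(query, c)], c) for c in candidates if (query, c) in DIST]
    return min(scored)[1] if scored else None

def rsd_ortholog(gene_a, genome_a, genome_b):
    """Forward search A->B, then reverse B->A; accept the pair only if
    the reverse search recovers the original query."""
    best_b = smallest_distance(gene_a, genome_b)
    if best_b is None:
        return None
    back = smallest_distance(best_b, genome_a)
    return best_b if back == gene_a else None

yeast = ["ORFP:YKR080W", "orf6.0001.prot"]   # genome A (hypothetical set)
worm = ["orf6.1984.prot", "orf6.2111.prot"]  # genome B (hypothetical set)
print(rsd_ortholog("ORFP:YKR080W", yeast, worm))  # -> orf6.1984.prot
```

The reciprocal condition is what distinguishes RSD from a one-way smallest-distance search: a pair is only reported when each member is the other's nearest neighbor by ML distance.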
9
RSD vs BBH
10
RSD results
Stored in MySQL; ad hoc exploratory queries are possible:
1) Transitive closure
2) Any orthologs between genome A and [b, c, d, …]
3) All orthologs among genomes [a, b, c, d, …]
[Figure: pairwise ortholog tables (A v B: gene gi, gene gj, distance 0.3; A v C; A v D) combined into phylogenetic profile bit vectors:]
(g1) 0000011000000000000011111110000
(g2) 0000011000000000000011111110000
(g3) 0000011000000000000011111110001
(g4) 0000011000000000000011111110001
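The profile bit vectors on this slide can be built directly from pairwise ortholog results. A minimal sketch, assuming hypothetical query results (the `orthologs` dict stands in for rows pulled from the MySQL tables):

```python
# Sketch: building phylogenetic profiles (presence/absence bit strings)
# from pairwise RSD results. The data are hypothetical stand-ins for
# rows queried from the Roundup MySQL tables (genome A vs. each genome).
genomes = ["B", "C", "D"]
orthologs = {
    "B": {"g1", "g2"},   # genome-A genes with an ortholog in B
    "C": {"g1"},
    "D": {"g2", "g3"},
}

def profile(gene):
    """One bit per genome: 1 if the gene has an ortholog there."""
    return "".join("1" if gene in orthologs[g] else "0" for g in genomes)

for gene in ["g1", "g2", "g3"]:
    print(gene, profile(gene))
# g1 110
# g2 101
# g3 001
```

Genes with identical profiles (like g1/g2 and g3/g4 on the slide) are candidates for functional linkage, which is what the following applications exploit.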
11
Roundup uses
- discovery of functional linkages and uncharacterized cellular pathways
- deciphering network organization of the cell
- propensity for gene loss
12
Use of Logic Relationships to Decipher Protein Network Organization Peter M. Bowers, Shawn J. Cokus, David Eisenberg, and Todd O. Yeates Science 306 2246.
13
Use of Logic Relationships to Decipher Protein Network Organization (continued)
14
[Figure: phylogenetic profiles, e.g. 0000011000000000000011111110000 and 0000011000000000000011111110001, mapped onto network subgraphs]
Subgraph 1 (S1): Kin = [0,0,1,1,0,0]; Subgraph 2, Subgraph 3, …, Subgraph N (= 885)
Kout = [1,2,3,4,0,0,3,4,5,1,1,4]
t-test(Kin, Kout) for every pair of subgraphs yields a matrix of p-values (885 × 885), e.g. p = 0.05, 0.01, 0.2; every subgraph S has 885 − 1 p-values.
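The comparison behind each matrix entry is an ordinary two-sample t-test on connectivity counts. As a sketch, the slide's example Kin and Kout vectors can be compared with Welch's t statistic using only the standard library (in practice something like SciPy's `ttest_ind` would also supply the p-value from the t distribution; that lookup is omitted here):

```python
# Sketch of the per-subgraph test: links within a subgraph (Kin) vs.
# links leaving it (Kout), compared with Welch's t statistic.
# Vectors are the example values from the slide.
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / (va + vb) ** 0.5

kin = [0, 0, 1, 1, 0, 0]                     # connectivity inside S1
kout = [1, 2, 3, 4, 0, 0, 3, 4, 5, 1, 1, 4]  # connectivity outside S1
print(round(welch_t(kin, kout), 2))  # -> -3.7
```

Repeating this over all subgraph pairs fills the 885 × 885 matrix of test results described on the slide.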
15
Phylogenetic profiles predict function
[Chart: number of significant p-values (accuracy) across functional modules]
16
[Chart: accuracy vs. profile size (panels Z1-Z4) across functional modules]
17
Propensity for gene loss
[Figure: matrix of presence/absence phylogenetic profiles mapped onto a species tree]
D-PGL  Dist
4      0.939
1      1.103
6      2.778
5      0.977
Phylogenetic profiles drawn from Roundup (DeLuca et al. 2006); phylogeny built using maximum parsimony; profiles mapped via Dollo parsimony.
18
Roundup Computational Demand
The number of RSD processes scales quadratically with the number of genomes: processes = N(N−1)/2 × 12 parameter combinations.
1000 genomes = (1000 × 999)/2 × 12 = 5,994,000 processes = 23,976,000 CPU-hours
That could take ~2,700 years to complete on a single machine.
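The slide's figures check out arithmetically. A quick back-of-the-envelope in Python, assuming ~4 CPU-hours per RSD process (consistent with the cost estimate later in the talk); the 12 parameter combinations are presumably the divergence/E-value threshold settings:

```python
# Back-of-the-envelope check of the slide's numbers: all-vs-all genome
# pairs times 12 parameter combinations, at ~4 CPU-hours per process.
def rsd_processes(n_genomes, param_combos=12):
    return n_genomes * (n_genomes - 1) // 2 * param_combos

procs = rsd_processes(1000)   # 5,994,000 processes
hours = procs * 4             # 23,976,000 CPU-hours
years = hours / (24 * 365)    # ~2,737 years on one machine
print(procs, hours, round(years))
```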
19
Hadoop may help
Apache Hadoop Core is a software platform that lets one easily write and run applications that process vast amounts of data. Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster; MapReduce can then process the data where it is located. Hadoop has been demonstrated on clusters with 2000 nodes; the current design target is 10,000-node clusters.
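Before the Hadoop-specific slides, the MapReduce model itself can be shown in miniature. This is an in-process word-count illustration of the three phases (map, shuffle, reduce), not anything Hadoop-specific; Hadoop runs the same phases distributed across a cluster, moving computation to the nodes where HDFS stores the data:

```python
# Minimal in-process illustration of the MapReduce model:
# map each record to key/value pairs, shuffle (group) by key,
# then reduce each key's list of values.
from collections import defaultdict

def map_phase(record):
    for word in record.split():
        yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

records = ["roundup on speed", "roundup on hadoop"]
pairs = [kv for r in records for kv in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'roundup': 2, 'on': 2, 'speed': 1, 'hadoop': 1}
```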
20
MapReduce: High Level
21
Getting Data To The Mapper
22
Partition And Shuffle
23
Writing The Output
24
HDFS -- Limitations
No file update options (record append, etc.); all files must be written only once
Does not implement demand replication
Designed for streaming – random seeks devastate performance
25
Vanilla Hadoop Assumptions
– The input records are themselves the data to be processed
– The data can be arbitrarily subdivided across the cluster
– The mapper and reducer are Java
Roundup cannot accommodate any of these assumptions without serious pain.
26
Hadoop Streaming is an answer
Hadoop Streaming is a utility that comes with the Hadoop distribution. It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc
The mapper and the reducer are executables that read input from stdin (line by line) and emit output to stdout. The utility creates a map/reduce job, submits it to an appropriate cluster, and monitors its progress until it completes.
27
Hadoop Streaming
We cannot treat standard Roundup input (i.e., genomes) as input to Hadoop, but we can "trick" Hadoop into receiving and processing Roundup job commands as its inputs. This is an undocumented, uncharted use of Hadoop Streaming.
28
MapReduce RSD Code and Hadoop Streaming
Mapper script:
Input: command lines ("runstuff") for the genomes to be compared
Function: runs the RSD algorithm
Output: gene pairs and their evolutionary distances
Reducer script: bypassed (set to 0 in the configuration)
Streaming: Hadoop programs are typically written in Java, but the Roundup code base is in Python. Options:
1. Translate the Python code into jar files using Jython
2. Use Hadoop Streaming to pass data to the mapper function via STDIN and STDOUT
29
HADOOP JOB DISTRIBUTION
AWS Medium High-CPU Instance specs:
1. 2.5 EC2 compute units per core*
2. 2 virtual cores
3. 1.7 GB memory
4. 32-bit platform
5. 1690 GB instance storage
6. $0.20 per hour
Cost estimate: cost per hour × number of instances × number of hours to run the job
For this particular test: $0.20 × 6 medium instances × ~11 min = $1.20 [used 10 nodes for 15 comparisons across 6 genomes]
*One EC2 Compute Unit = 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor
30
HADOOP Configuration Setup to run RSD jobs on AWS
AWS access keys and other parameters are excluded from the config file, as they are private and specific to the group.
31
Mapper
import os
import sys
for cmd in sys.stdin:
    os.system(cmd)
Streaming input file (on HDFS), "rsdrunner":
Line 1: python /user/local/hadoop/RSD.py --thresh=0.8 --div=0.2 -s /user/local/hadoop/genome/genome1.aa -q /user/local/hadoop/genome/genome1.aa -output /user/local/hadoop/wank …
Limitations of HDFS and Streaming forced us to print RSD output to stdout… but it worked.
32
Hadoop cluster launch on AWS
[Screenshots: starting the cluster; initializing the cluster]
33
Roundup job run on AWS
Copy the input file to HDFS, then run the RSD jobs via the Hadoop Streaming utility:
$ bin/hadoop jar contrib/streaming/hadoop-0.18.3-streaming.jar -mapper /usr/local/hadoop-0.18.3/newmapper.py -reducer NONE -input rsdrunner -output RsdResult -jobconf mapred.map.tasks=10
34
Roundup job run on AWS Output of previous command in console
35
Monitoring Roundup Jobs JobTracker and cluster monitoring
36
Monitoring Roundup Jobs TaskTracker Monitoring
37
Monitoring Roundup Jobs MapReduce job summary
38
Monitoring Roundup Jobs Running tasks = number of map tasks
39
Monitoring Roundup Jobs Time Split for job completion
40
Monitoring Roundup Jobs Summary post job completion
41
Shutting down the Hadoop cluster on AWS
Terminate the cluster after the job completes and the data have been copied locally.
[Screenshot: terminating the cluster]
42
Future Considerations
Careful design of Roundup AWS jobs is needed to minimize time and maximize instance resources. A full run could be prohibitively expensive:
– Medium HPC instance: $0.20/hour
– Medium HPC = 5 EC2 compute units (5 jobs per instance)
– Assume 1 RSD run = 4 hours = $0.16
– All of Roundup would cost ~$1M, but would take only ~100 days to complete using 10,000 compute units
43
HADOOP Usage Examples
IBM: the Blue Cloud computing cluster at IBM uses Hadoop parallel workload scheduling
Yahoo!: 100,000 CPUs running Hadoop; 2000 nodes currently used by ad systems and web search (webgrep)
The New York Times: converted ~11 million articles (scanned images) into PDF format using EC2 for computation and S3 for data storage
Apache Mahout: uses Hadoop to build scalable machine learning algorithms such as canopy clustering and k-means
Facebook: uses Hadoop to store internal logs and dimension data sources, used for reporting, analytics, and machine learning
Able Grape: world's smallest Hadoop cluster (2 nodes @ 8 CPUs/node); a vertical search engine for a wine knowledge base
http://wiki.apache.org/hadoop/PoweredBy
44
Acknowledgements Parul Kudtarkar Kris St. Gabriel Todd DeLuca Rimma Pivovarov Vince Fusaro Prasad Patil Peter Kos Peter Tonellato Mike Banos All palaver participants