Superlink-Online: Harnessing the World's Computers to Hunt for Disease-Provoking Genes
MS eScience Workshop 2008
Mark Silberstein, CS, Technion; Dan Geiger, Computational Biology Lab; Assaf Schuster, Distributed Systems Lab

Presentation transcript:
1 Superlink-Online: Harnessing the world's computers to hunt for disease-provoking genes. Mark Silberstein, CS, Technion; Dan Geiger, Computational Biology Laboratory; Assaf Schuster, Distributed Systems Laboratory. Genetics research institutes in Israel, EU, US.

2 Why search?
–Detection of diseases before birth
–Risk assessment and corresponding lifestyle changes
–Finding the mutant proteins and developing medicine
–Understanding basic biological functions
How to search?
–Find families segregating the disease (linkage analysis) or collect unrelated healthy and affected persons (association analysis or LD mapping)
–Take a simple blood test from some individuals
–Analyze the DNA in the lab
–Compute the most likely location of the disease gene

3 Linkage analysis (10^6~10^7 bp), identify genes (10^4~10^5 bp), resequencing (10^0 bp)

4 [Figure: recombinant gametes, male or female]

5 [Figure: example pedigree, highlighting individuals III-15, IV-10, IV-7]

6 [Figure]

7 Pedigree genotype data (columns: id, dad, mom, sex, affection status; genotypes at Marker 1 and Marker 2):
III-21 II-10 II-11 f h 0000
II-5 I-3 I-4 f h AA
III-7 II-4 II-5 f a AT
III-13 II-4 II-5 m a AT
III-14 II-1 II-2 f h AA
III-15 II-4 II-5 m a AA
III-16 II-10 II-11 f h AA
III-5 II-4 II-5 f h AA
IV-1 III-13 III-14 f h AT
IV-2 III-13 III-14 f a AT
IV-3 III-13 III-14 f a AT
[Figure: chromosome pair carrying markers M1, M2]

8 [Figure: pedigree with haplotypes at markers M1-M4, disease loci D1 and D2, and recombination fraction θ]
The computational problem: find a value of θ maximizing Pr(data|θ).
LOD score (to quantify how confident we are): Z(θ) = log10[Pr(data|θ) / Pr(data|θ=½)].
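The LOD score formula on this slide is easy to illustrate directly; the likelihood values below are toy numbers for demonstration, not real pedigree likelihoods:

```python
import math

def lod_score(likelihood_theta: float, likelihood_null: float) -> float:
    """LOD score Z(theta) = log10( Pr(data|theta) / Pr(data|theta=1/2) )."""
    return math.log10(likelihood_theta / likelihood_null)

# Toy values: the data is 1000x more likely under the candidate theta
# than under the no-linkage null (theta = 1/2).
z = lod_score(1e-10, 1e-13)
print(z)  # ≈ 3.0, conventionally taken as significant evidence of linkage
```

A LOD score of 3 means the linkage hypothesis is 10^3 times more likely than no linkage, which is the classical significance threshold in linkage analysis.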

9 [Plot: ln(likelihood) and LOD score versus position in centi-Morgans, with Markers 3, 4, 5, 6, 9, 10 marked]

10 [Figure: Bayesian network over loci 1-4 (locus 2 is the disease locus), with selector variables S, allele variables L, genotype variables X, and phenotype variables Y]
This model depicts the qualitative relations between the variables. We also need to specify the joint distribution over these variables.

11 Computing Pr(data|θ) for a specific value of θ: finding the best elimination order is equivalent to finding the best order for sum-product operations over high-dimensional matrices. Time and space are exponential in:
–#variables: five per person, #markers, #gene loci
–#values per variable: #alleles, non-typed persons (table dimensionality)
–cycles in the pedigree
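The effect of elimination order on these sum-product operations can be sketched in miniature; the three small factors below are toy stand-ins for the per-person probability tables of the pedigree model, not the real ones:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy factors over binary variables a, b, c, standing in for the
# high-dimensional tables of the real pedigree likelihood.
f_ab = rng.random((2, 2))   # factor over (a, b)
f_bc = rng.random((2, 2))   # factor over (b, c)
f_c = rng.random(2)         # factor over (c,)

# Naive order: materialize the full joint table, then sum.
# The joint table grows exponentially with the number of variables.
joint = np.einsum('ab,bc,c->abc', f_ab, f_bc, f_c)
p_naive = joint.sum()

# Good elimination order: sum out each variable as early as possible,
# so intermediate tables stay low-dimensional.
m_b = f_ab.sum(axis=0)          # eliminate a -> table over (b,)
m_c = m_b @ f_bc                # eliminate b -> table over (c,)
p_ordered = float(m_c @ f_c)    # eliminate c -> scalar

assert np.isclose(p_naive, p_ordered)
```

Both orders give the same probability, but the naive one builds a table over all variables at once, which is exactly what becomes infeasible for large pedigrees with many markers.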

12 [Plot: task length distribution, up to months]
–Task length is unknown upon submission: from seconds to millennia
–Computing task length in advance is NP-hard
–Estimate task length as we go

13 Non-trivial parallelization overhead

14 Weak/no quality of service
–Random failures of execution machines
–Preemption due to higher-priority tasks
–Hardware bugs may lead to incorrect results
–Potentially unbounded execution/queue waiting time
–Dynamic/abrupt changes of resource availability
–High network delays (communication over WAN)
–Multiple tasks

15 Basic unit of execution: a batch job
–Non-interactive mode: enqueue, wait, execute, return
–Self-contained execution sandbox
A linkage analysis request is a task
–A bag of (millions of) jobs
–Turnaround time is important

16 The system must be geneticist-friendly
–Interactive experience: low response time for short tasks, prompt user feedback
–Simple, secure, reliable, stable, overload-resistant; concurrent tasks, multiple users...
–Fast computation of previously infeasible long tasks via parallel execution: harness all available resources (grids, clouds, clusters) and use them efficiently!

17 Small tasks are severely slow on grids: a task that takes 5 minutes on a 10-node dedicated cluster may take several hours on a grid. Causes: a long tail due to failures, queuing time in EGEE, the error rate and preempted jobs in the UW Madison pool. Should we move scientific loads to the cloud? YES!
[Plot: remaining jobs in queue over time, grid (k CPUs) vs. cloud (k CPUs), showing queue waiting time]

18 Should we move scientific loads to the cloud? NO!?
–Consider 3.2×10^6 jobs, ~40 min each
–It took 21 days on ~ CPUs
–It would cost about $10K on Amazon's EC2

19 [Chart: dedicated vs. opportunistic resources, compared on reliability (low to high), performance predictability (low to high), potential amount of available resources (high to low), and reuse of existing infrastructure; throughput computing vs. burst computing]

20 Task sensitivity to QoS differs in different stages:
–High-throughput stage: use cheap unreliable resources (grids, community grids, non-dedicated clusters)
–High-performance (tail) stage: use expensive reliable resources (dedicated clusters, clouds)
Dynamically determine, from the remaining jobs in the queue, when the task enters tail mode, and switch to expensive resources (gracefully).
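A tail-mode trigger of this kind can be sketched as a small policy function; the thresholds below are illustrative assumptions, not the actual Superlink-Online policy:

```python
def in_tail_mode(remaining_jobs: int, total_jobs: int,
                 completion_rate: float, tail_fraction: float = 0.05,
                 min_rate: float = 1.0) -> bool:
    """Heuristic sketch (thresholds are assumed, not from the real system):
    enter tail mode when only a small fraction of jobs remains, or when
    throughput on the cheap resources has collapsed. In tail mode the
    scheduler would route the stragglers to reliable dedicated/cloud
    resources instead of the opportunistic grid."""
    few_left = remaining_jobs <= tail_fraction * total_jobs
    stalled = completion_rate < min_rate  # jobs per minute
    return few_left or stalled

# Bulk of the task: keep using cheap opportunistic grid resources.
print(in_tail_mode(500_000, 1_000_000, completion_rate=800.0))  # False
# Long tail: a handful of stragglers and collapsed throughput.
print(in_tail_mode(1_200, 1_000_000, completion_rate=0.3))      # True
```

The graceful switch would then mean letting jobs already running on the grid finish while new (and resubmitted) jobs go only to the reliable pool.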

21 [Diagram: scheduling server with a scheduler and job queue, a virtual cluster maintainer, and submitters to Grid 1, Grid 2, Cloud 1, and Cloud 2]
Issues: granularity, load balancing, firewalls, failed resources, scheduler scalability...

22 Overlay scalability and firewall penetration
–The server may not initiate connections to the agent
Compatibility with community grids
–The server is based on BOINC; agents are upgraded BOINC clients
Elimination of failed resources from scheduling
–Performance statistics are analyzed
Resource allocation depending on the task state
–Dynamic policy update via the Condor ClassAd mechanism

23 [Diagram: task execution and monitoring workflow. A web portal and HTTP frontend feed an upgraded BOINC server (scheduler; database of jobs, monitoring, and system statistics; task state). A virtual cluster maintainer drives submitters to the Technion cluster, the EC2 cloud, OSG, EGEE (BOINC clients), the Madison pool (BOINC clients), and any grid/cluster/cloud; a dedicated cluster serves as fallback]

24 [Screenshot]

25 [Screenshot]

26 ~1720 CPU years for ~18,000 tasks (and counting)
~37 citations (several mutations found). Examples: ichthyosis, "uncomplicated" hereditary spastic paraplegia (1-9 people per 100,000)
Over 250 users (and counting), Israeli and international: Soroka H., Be'er Sheva; Galil Ma'aravi H., Nahariya; Rabin H., Petah Tikva; Rambam H., Haifa; Beney Tzion H., Haifa; Sha'arey Tzedek H., Jerusalem; Hadassa H., Jerusalem; Afula H.; NIH; universities and research centers in the US, France, Germany, UK, Italy, Austria, Spain, Taiwan, Australia, and others...
Task example: 250 days on a single computer, 7 hours on computers. Short tasks: a few seconds even during severe overload.

27 Rabin Hospital, Motti Shochat's group: new locus for mental retardation; infantile bilateral striatal necrosis
Soroka Hospital, Ohad Birk's group: lethal congenital contractural syndrome; congenital cataract
Rambam Hospital, Eli Shprecher's group: congenital recessive ichthyosis; CEDNIK syndrome
Galil Ma'aravi Hospital, Tzipi Falik's group: familial onychodysplasia and dysplasia; familial juvenile hypertrophy

28 Utilizing community computing: ~3.4 TFLOPS, ~3000 users from 75 countries

29 Superlink-online V2 (beta) deployment: submission server, dedicated cluster, Technion Condor pools, EGEE-II BIOMED VO, UW Madison Condor pool, OSG GLOW VO. ~12,000 hosts operational during the last month.

30 [Plot: 60 dedicated CPUs only]

31 Our system integrates clusters, grids, clouds, community grids, etc.
–Geneticist-friendly
–Minimizes use of expensive resources while providing QoS for tasks
–Generic mechanism for scheduling policy: can dynamically reroute jobs from one pool to another according to a given optimization function (budget, energy, etc.)

32 Memory bandwidth: 88 GB/s peak (56 GB/s observed) on an NVIDIA GTX8800 ($550) vs. 21 GB/s peak on a 3.0 GHz Intel Core2 Quad ($1100). Annual growth: CPUs 1.4×, GPUs 1.7×.

33 [Diagram: GPU architecture. Global memory; cached read-only memory; 16 multiprocessors (MP) of 8 scalar processors (SP) each, with a per-MP register file and 16 KB shared memory (~1 cycle latency, ~TB/s bandwidth)]

34 Software-managed cache
–We implement the cache replacement policy in software
–Maximization of data reuse: better compute/memory access ratio
–A simple model for performance bounds: yes, we are (optimal)
–Use special function units for hardware-assisted execution
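A simple performance-bound model of this flavor can be illustrated with a roofline-style calculation: attainable throughput is capped either by raw compute or by memory bandwidth times the kernel's compute/memory ratio. The peak numbers below are rough assumptions for GTX8800-class hardware (only the ~56 GB/s observed bandwidth comes from these slides):

```python
def attainable_gflops(peak_gflops: float, bw_gb_s: float,
                      flops_per_byte: float) -> float:
    """Roofline-style bound: a kernel is limited either by peak compute
    or by memory bandwidth times its arithmetic intensity (flops/byte)."""
    return min(peak_gflops, bw_gb_s * flops_per_byte)

# Assumed peak compute and the observed memory bandwidth from the slides.
peak, bw = 345.0, 56.0
for intensity in (0.5, 2.0, 8.0):
    print(intensity, attainable_gflops(peak, bw, intensity))
```

Improving data reuse with the software-managed cache raises the flops-per-byte ratio, moving a kernel from the bandwidth-bound region of the model toward the compute-bound ceiling.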

35 Experiment setup
–CPU: single-core Intel Core 2, 2.4 GHz, 4 MB L2
–GPU: NVIDIA G80 (GTX8800), 750 MB GDDR4, 128 SPs, 16 KB memory / 512 threads
–Only kernel runtime included (no memory transfers, no CPU setup time)
Use of SFU: expf is about 6× slower than + on the GPU, but ~200× slower on the CPU.
[Chart: speedup ~2500 ≈ 2 × 25 × 25 × 2, from hardware (SFU) use and software-managed caching]

36 Superlink-online team:
–Alumni: Anna Tzemach, Julia Stolin, Nikolay Dovgolevsky, Maayan Fishelson, Hadar Grubman, Ophir Etzion
–Current: Artyom Sharov, Oren Shtark
Thanks to Prof. Miron Livny (Condor pool UW Madison, OSG), the EGEE BIOMED VO and OSG GLOW VO, the Microsoft TCI program, an NIH grant, and the SciDAC Institute for Ultrascale Visualization.
If your grid is underutilized, let us know! Visit us at: project home page:

37 Questions??? Visit us at:

