Superlink-Online: Harnessing the World's Computers to Hunt for Disease-Provoking Genes
MS eScience Workshop 2008
Mark Silberstein, CS, Technion; Dan Geiger, Computational Biology Lab; Assaf Schuster, Distributed Systems Lab

Presentation transcript:
1 Superlink-Online: Harnessing the world's computers to hunt for disease-provoking genes. Mark Silberstein, CS, Technion; Dan Geiger, Computational Biology Laboratory; Assaf Schuster, Distributed Systems Laboratory. Genetics research institutes in Israel, EU, US.

2 Why search?
–Detection of diseases before birth
–Risk assessment and corresponding lifestyle changes
–Finding the mutant proteins and developing medicine
–Understanding basic biological functions
How to search?
–Find families segregating the disease (linkage analysis) or collect unrelated healthy and affected persons (association analysis or LD mapping)
–Take a simple blood test from some individuals
–Analyze the DNA in the lab
–Compute the most likely location of the disease gene

3 Linkage analysis (10^6~10^7 bp), identify genes (10^4~10^5 bp), resequencing (10^0 bp)

4 [Figure: recombinant gametes, male or female]

5 [Figure: example pedigree, highlighting individuals III-15, IV-10, IV-7]

6 [Figure]

7 Pedigree genotype data (columns: id, dad, mom, sex, affection status; genotypes at Marker 1 and Marker 2):
III-21 II-10 II-11 f h 0000
II-5 I-3 I-4 f h AA
III-7 II-4 II-5 f a AT
III-13 II-4 II-5 m a AT
III-14 II-1 II-2 f h AA
III-15 II-4 II-5 m a AA
III-16 II-10 II-11 f h AA
III-5 II-4 II-5 f h AA
IV-1 III-13 III-14 f h AT
IV-2 III-13 III-14 f a AT
IV-3 III-13 III-14 f a AT
[Figure: chromosome pair carrying markers M1, M2]

8 [Figure: pedigree with haplotypes at markers M1-M4, disease loci D1 and D2, and recombination fraction θ]
The computational problem: find a value of θ maximizing Pr(data|θ).
LOD score (to quantify how confident we are): Z(θ) = log10[Pr(data|θ) / Pr(data|θ=½)].
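The LOD score formula on this slide is easy to illustrate directly; the likelihood values below are toy numbers for demonstration, not real pedigree likelihoods:

```python
import math

def lod_score(likelihood_theta: float, likelihood_null: float) -> float:
    """LOD score Z(theta) = log10( Pr(data|theta) / Pr(data|theta=1/2) )."""
    return math.log10(likelihood_theta / likelihood_null)

# Toy values: the data is 1000x more likely under the candidate theta
# than under the no-linkage null (theta = 1/2).
z = lod_score(1e-10, 1e-13)
print(z)  # ≈ 3.0, conventionally taken as significant evidence of linkage
```

A LOD score of 3 means the linkage hypothesis is 10^3 times more likely than no linkage, which is the classical significance threshold in linkage analysis.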

9 [Plot: ln(likelihood) and LOD score versus position in centi-Morgans, with Markers 3, 4, 5, 6, 9, 10 marked]

10 [Figure: Bayesian network over loci 1-4 (locus 2 is the disease locus), with selector variables S, allele variables L, genotype variables X, and phenotype variables Y]
This model depicts the qualitative relations between the variables. We also need to specify the joint distribution over these variables.

11 Computing Pr(data|θ) for a specific value of θ: finding the best elimination order is equivalent to finding the best order for sum-product operations over high-dimensional matrices. Time and space are exponential in:
–#variables: five per person, #markers, #gene loci
–#values per variable: #alleles, non-typed persons (table dimensionality)
–cycles in the pedigree
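The effect of elimination order on these sum-product operations can be sketched in miniature; the three small factors below are toy stand-ins for the per-person probability tables of the pedigree model, not the real ones:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy factors over binary variables a, b, c, standing in for the
# high-dimensional tables of the real pedigree likelihood.
f_ab = rng.random((2, 2))   # factor over (a, b)
f_bc = rng.random((2, 2))   # factor over (b, c)
f_c = rng.random(2)         # factor over (c,)

# Naive order: materialize the full joint table, then sum.
# The joint table grows exponentially with the number of variables.
joint = np.einsum('ab,bc,c->abc', f_ab, f_bc, f_c)
p_naive = joint.sum()

# Good elimination order: sum out each variable as early as possible,
# so intermediate tables stay low-dimensional.
m_b = f_ab.sum(axis=0)          # eliminate a -> table over (b,)
m_c = m_b @ f_bc                # eliminate b -> table over (c,)
p_ordered = float(m_c @ f_c)    # eliminate c -> scalar

assert np.isclose(p_naive, p_ordered)
```

Both orders give the same probability, but the naive one builds a table over all variables at once, which is exactly what becomes infeasible for large pedigrees with many markers.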

12 [Plot: task length distribution, up to months]
–Task length is unknown upon submission: from seconds to millennia
–Computing task length in advance is NP-hard
–Estimate task length as we go

13 Non-trivial parallelization overhead

14 Weak/no quality of service
–Random failures of execution machines
–Preemption due to higher-priority tasks
–Hardware bugs may lead to incorrect results
–Potentially unbounded execution/queue waiting time
–Dynamic/abrupt changes of resource availability
–High network delays (communication over WAN)
–Multiple tasks

15 Basic unit of execution: a batch job
–Non-interactive mode: enqueue, wait, execute, return
–Self-contained execution sandbox
A linkage analysis request is a task
–A bag of (millions of) jobs
–Turnaround time is important

16 The system must be geneticist-friendly
–Interactive experience: low response time for short tasks, prompt user feedback
–Simple, secure, reliable, stable, overload-resistant; concurrent tasks, multiple users...
–Fast computation of previously infeasible long tasks via parallel execution: harness all available resources (grids, clouds, clusters) and use them efficiently!

17 Small tasks are severely slow on grids: a task that takes 5 minutes on a 10-node dedicated cluster may take several hours on a grid. Causes: a long tail due to failures, queuing time in EGEE, the error rate and preempted jobs in the UW Madison pool. Should we move scientific loads to the cloud? YES!
[Plot: remaining jobs in queue over time, grid (k CPUs) vs. cloud (k CPUs), showing queue waiting time]

18 Should we move scientific loads to the cloud? NO!?
–Consider 3.2×10^6 jobs, ~40 min each
–It took 21 days on ~ CPUs
–It would cost about $10K on Amazon's EC2

19 [Chart: dedicated vs. opportunistic resources, compared on reliability (low to high), performance predictability (low to high), potential amount of available resources (high to low), and reuse of existing infrastructure; throughput computing vs. burst computing]

20 Task sensitivity to QoS differs in different stages:
–High-throughput stage: use cheap unreliable resources (grids, community grids, non-dedicated clusters)
–High-performance (tail) stage: use expensive reliable resources (dedicated clusters, clouds)
Dynamically determine, from the remaining jobs in the queue, when the task enters tail mode, and switch to expensive resources (gracefully).
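A tail-mode trigger of this kind can be sketched as a small policy function; the thresholds below are illustrative assumptions, not the actual Superlink-Online policy:

```python
def in_tail_mode(remaining_jobs: int, total_jobs: int,
                 completion_rate: float, tail_fraction: float = 0.05,
                 min_rate: float = 1.0) -> bool:
    """Heuristic sketch (thresholds are assumed, not from the real system):
    enter tail mode when only a small fraction of jobs remains, or when
    throughput on the cheap resources has collapsed. In tail mode the
    scheduler would route the stragglers to reliable dedicated/cloud
    resources instead of the opportunistic grid."""
    few_left = remaining_jobs <= tail_fraction * total_jobs
    stalled = completion_rate < min_rate  # jobs per minute
    return few_left or stalled

# Bulk of the task: keep using cheap opportunistic grid resources.
print(in_tail_mode(500_000, 1_000_000, completion_rate=800.0))  # False
# Long tail: a handful of stragglers and collapsed throughput.
print(in_tail_mode(1_200, 1_000_000, completion_rate=0.3))      # True
```

The graceful switch would then mean letting jobs already running on the grid finish while new (and resubmitted) jobs go only to the reliable pool.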

21 [Diagram: scheduling server with a scheduler and job queue, a virtual cluster maintainer, and submitters to Grid 1, Grid 2, Cloud 1, and Cloud 2]
Issues: granularity, load balancing, firewalls, failed resources, scheduler scalability...

22 Overlay scalability and firewall penetration
–The server may not initiate connections to the agent
Compatibility with community grids
–The server is based on BOINC; agents are upgraded BOINC clients
Elimination of failed resources from scheduling
–Performance statistics are analyzed
Resource allocation depending on the task state
–Dynamic policy update via the Condor ClassAd mechanism

23 [Diagram: task execution and monitoring workflow. A web portal and HTTP frontend feed an upgraded BOINC server (scheduler; database of jobs, monitoring, and system statistics; task state). A virtual cluster maintainer drives submitters to the Technion cluster, the EC2 cloud, OSG, EGEE (BOINC clients), the Madison pool (BOINC clients), and any grid/cluster/cloud; a dedicated cluster serves as fallback]

24 [Screenshot]

25 [Screenshot]

26 ~1720 CPU years for ~18,000 tasks (and counting)
~37 citations (several mutations found). Examples: ichthyosis, "uncomplicated" hereditary spastic paraplegia (1-9 people per 100,000)
Over 250 users (and counting), Israeli and international: Soroka H., Be'er Sheva; Galil Ma'aravi H., Nahariya; Rabin H., Petah Tikva; Rambam H., Haifa; Beney Tzion H., Haifa; Sha'arey Tzedek H., Jerusalem; Hadassa H., Jerusalem; Afula H.; NIH; universities and research centers in the US, France, Germany, UK, Italy, Austria, Spain, Taiwan, Australia, and others...
Task example: 250 days on a single computer, 7 hours on computers. Short tasks: a few seconds even during severe overload.

27 Rabin Hospital, Motti Shochat's group: new locus for mental retardation; infantile bilateral striatal necrosis
Soroka Hospital, Ohad Birk's group: lethal congenital contractural syndrome; congenital cataract
Rambam Hospital, Eli Shprecher's group: congenital recessive ichthyosis; CEDNIK syndrome
Galil Ma'aravi Hospital, Tzipi Falik's group: familial onychodysplasia and dysplasia; familial juvenile hypertrophy

28 Utilizing community computing: ~3.4 TFLOPS, ~3000 users from 75 countries

29 Superlink-online V2 (beta) deployment: submission server, dedicated cluster, Technion Condor pools, EGEE-II BIOMED VO, UW Madison Condor pool, OSG GLOW VO. ~12,000 hosts operational during the last month.

30 [Plot: 60 dedicated CPUs only]

31 Our system integrates clusters, grids, clouds, community grids, etc.
–Geneticist-friendly
–Minimizes use of expensive resources while providing QoS for tasks
–Generic mechanism for scheduling policy: can dynamically reroute jobs from one pool to another according to a given optimization function (budget, energy, etc.)

32 Memory bandwidth: 88 GB/s peak (56 GB/s observed) on an NVIDIA GTX8800 ($550) vs. 21 GB/s peak on a 3.0 GHz Intel Core2 Quad ($1100). Annual growth: CPUs 1.4×, GPUs 1.7×.

33 [Diagram: GPU architecture. Global memory; cached read-only memory; 16 multiprocessors (MP) of 8 scalar processors (SP) each, with a per-MP register file and 16 KB shared memory (~1 cycle latency, ~TB/s bandwidth)]

34 Software-managed cache
–We implement the cache replacement policy in software
–Maximization of data reuse: better compute/memory access ratio
–A simple model for performance bounds: yes, we are (optimal)
–Use special function units for hardware-assisted execution
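A simple performance-bound model of this flavor can be illustrated with a roofline-style calculation: attainable throughput is capped either by raw compute or by memory bandwidth times the kernel's compute/memory ratio. The peak numbers below are rough assumptions for GTX8800-class hardware (only the ~56 GB/s observed bandwidth comes from these slides):

```python
def attainable_gflops(peak_gflops: float, bw_gb_s: float,
                      flops_per_byte: float) -> float:
    """Roofline-style bound: a kernel is limited either by peak compute
    or by memory bandwidth times its arithmetic intensity (flops/byte)."""
    return min(peak_gflops, bw_gb_s * flops_per_byte)

# Assumed peak compute and the observed memory bandwidth from the slides.
peak, bw = 345.0, 56.0
for intensity in (0.5, 2.0, 8.0):
    print(intensity, attainable_gflops(peak, bw, intensity))
```

Improving data reuse with the software-managed cache raises the flops-per-byte ratio, moving a kernel from the bandwidth-bound region of the model toward the compute-bound ceiling.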

35 Experiment setup
–CPU: single-core Intel Core 2, 2.4 GHz, 4 MB L2
–GPU: NVIDIA G80 (GTX8800), 750 MB GDDR4, 128 SPs, 16 KB memory / 512 threads
–Only kernel runtime included (no memory transfers, no CPU setup time)
Use of SFU: expf is about 6× slower than + on the GPU, but ~200× slower on the CPU.
[Chart: speedup ~2500 ≈ 2 × 25 × 25 × 2, from hardware (SFU) use and software-managed caching]

36 Superlink-online team:
–Alumni: Anna Tzemach, Julia Stolin, Nikolay Dovgolevsky, Maayan Fishelson, Hadar Grubman, Ophir Etzion
–Current: Artyom Sharov, Oren Shtark
Thanks to Prof. Miron Livny (Condor pool UW Madison, OSG), the EGEE BIOMED VO and OSG GLOW VO, the Microsoft TCI program, an NIH grant, and the SciDAC Institute for Ultrascale Visualization.
If your grid is underutilized, let us know! Visit us at: project home page:

37 Questions??? Visit us at:

