HPC in the Human Genome Project James Cuff


1 HPC in the Human Genome Project James Cuff james@sanger.ac.uk

2 The Sanger Centre is a research centre funded primarily by the Wellcome Trust, located in 55 acres of parkland. Also on site are the European Bioinformatics Institute (EBI) and the Human Genome Mapping Project Resource Centre (HGMP-RC).

3 The Sanger Centre Founded in 1993; >570 staff members now. Our purpose is to further the knowledge of the biology of organisms, particularly through large scale sequencing and analysis of their genomes. Our lead project is to sequence a third of the human genome as part of the international Human Genome Project.

4 Sanger Centre research programmes
– Pathogen sequencing programme
– Informatics: support data collection; analyse and present results; develop methodology (algorithms and data resources)
– Cancer genome project
– Human genetics programme: study genetic variation (SNPs) and find disease genes

5 The Structure of DNA
The order (or "sequence") of the bases in the DNA chain codes for the genes. The four bases, adenine, cytosine, guanine and thymine, are represented computationally by the characters A, C, G and T.
[Figure: the DNA double helix, showing base pairs of adenine with thymine and guanine with cytosine]
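Because a sequence is just a string over {A, C, G, T}, simple string operations go a long way. A minimal sketch (the helper name and example sequence are illustrative, not from the slides):

```python
# Each base pairs with its complement: A-T and G-C.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence,
    i.e. the opposite strand read in the conventional direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

print(reverse_complement("GATTACA"))  # -> TGTAATC
```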

6 Typical DNA Sequence
Human DNA consists of 3,000,000,000 'letters'. A typist typing at 60 w.p.m. for 8 hours a day would take around 50 years to type the book of life.
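The slide's figure is easy to sanity-check, assuming roughly 5 letters per word and typing every day of the year (both assumptions are mine, not from the slides):

```python
# Rough back-of-envelope check of the "50 years to type the genome" claim.
letters = 3_000_000_000
letters_per_minute = 60 * 5              # 60 w.p.m., assuming ~5 letters per word
letters_per_day = letters_per_minute * 8 * 60   # 8-hour working day
years = letters / letters_per_day / 365
print(round(years))  # around 57 years, consistent with the slide's ~50
```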

7 The era of genome sequencing

Organism        Size (Mbases)  No. of genes  Gene density  Type             Completion date
H. influenzae   2              1,700         1/1kb         Bacterium        1995
Yeast           13             6,000         1/2kb         Eukaryotic cell  1996
Nematode        100            18,000        1/6kb         Animal           1998
Human           3,000          ?40,000       1/60kb        Mammal           2000/3

Sequence data production increase of >2000%
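The gene-density column follows directly from the other two: kilobases of genome per gene. A quick check of the table's figures (the human 1/60kb entry is an approximation, as is the ?40,000 gene count):

```python
# Genome size in Mbases and approximate gene counts from the table above.
genomes = {
    "H. influenzae": (2, 1_700),
    "Yeast": (13, 6_000),
    "Nematode": (100, 18_000),
    "Human": (3_000, 40_000),
}

for name, (mbases, genes) in genomes.items():
    kb_per_gene = mbases * 1_000 / genes   # 1 Mbase = 1,000 kb
    print(f"{name}: one gene per ~{kb_per_gene:.0f} kb")
```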

8 The Sequencing Facility

9 Sanger I.T.
Sanger network: more than 1600 devices
– 300 PCs (various)
– 150+ X-terms/Network Computers (NCD)
– 250 NT/Mac ABI collection devices
– Various other servers, Linux desktop systems, printers, etc.
– Paracel, Compugen and Timelogic systems
– >350 Compaq Alpha systems (DS10, DS20, ES40, 8400)
– 440+ node Sequence Annotation Farm (PC/DS10/DS10L)
– >750 Alpha processors in total

10 RAID, 8400, DS20, PC Farm

11 Systems architecture hierarchy
[Diagram: desktop workgroup systems connect through an ATM switch (ASX-0BX) to front-end compute servers, which feed a compute server farm with fibre-channel RAID storage; jobs are distributed across the hierarchy by LSF]
LSF: Load Sharing Facility, by Platform Computing Ltd

12 Computer Systems Architecture
Fibre Channel / Memory Channel Tru64 clusters. Implementing tightly coupled clustering with Tru64 V5.x, we get:
– Improved disk I/O (fibre channel) and scaleability (multi-CPU, multi-terabyte)
– Improved manageability: a single system image, so whole clusters are managed as single entities

13 ES40 Clusters, F.C Storage

14 Annotation Farms
– 8 racks, each with 40 x Tru64 v5.0 Alpha DS10L (1U high)
– 320 x 466MHz Alpha EV6.7 CPUs; 320GB memory in total, 19.2TB spinning internal storage
– 32 x Cabletron 100Mb/s switches, 16 x RS232 terminal servers
– 2 x 155Mb/s ATM fibre uplinks back to the v5.0 cluster
– Two network subnets (multicast and backbone), 640 x 100Mb Fast Ethernet ports
– 1,920 UTP cable crimps, 8 cabinets, ~100kW of power
– Roughly equivalent to 10 x GS320; performance around 355 Gflops

15

16 Farm Network Overview
– Highly available NFS (Tru64 CAA)
– Fast I/O (ATM > switched full-duplex Ethernet)
– Socket data transfer (via rdist, rcp and MySQL DBI sockets)
– Segmented network architecture via two elans
[Diagram: an 8-node ES40 Memory Channel / Fibre Channel cluster, with ATM uplinks joining the Sanger (172.27) and Farm (172.25) subnets]

17 Compute – systems architecture
[Diagram: an ATM core links the 400-node sequence annotation farm, large-scale assembly and sequencing, pathogen sequence data processing, and informatics (Mapping, SNP, Ensembl); external services sit behind a firewall DMZ: web server, Blast server, Alta Vista, FTP and trace server]

18 Enterprise Clustering
LSF is still key for job scheduling and batch operations. It offers greater granularity of operation and functionality than Tru64 scheduling: individual nodes, cluster-wide and cross-cluster scheduling. With LSF we retain the capability to use many of the 750+ compute nodes as a single Sanger Compute Engine: MODULAR SUPERCOMPUTING.
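The core idea behind a load-sharing scheduler can be sketched in a few lines: each incoming job goes to whichever host currently has the least load. This is only a toy model in the spirit of what LSF does, not its actual algorithm; the host names and job costs are invented for illustration:

```python
import heapq

def schedule(jobs, hosts):
    """Greedy least-loaded placement: assign each (job, cost) pair
    to the host with the lowest accumulated load so far."""
    heap = [(0.0, host) for host in hosts]   # (current load, host name)
    heapq.heapify(heap)
    placement = {}
    for job, cost in jobs:
        load, host = heapq.heappop(heap)     # least-loaded host
        placement[job] = host
        heapq.heappush(heap, (load + cost, host))
    return placement

jobs = [("blast1", 2), ("blast2", 1), ("assembly", 3)]
print(schedule(jobs, ["ds10-a", "ds10-b"]))
```

A real batch system layers queues, priorities and resource requirements on top of this, which is the "greater granularity" the slide refers to.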

19 Projects
– Will involve thousands of CPUs: large numbers of PC farm nodes and high-end, large-memory SMP configurations
– All are computationally expensive
– Will require >100 terabytes of storage
– We need to continue scaling up and deal with the physical limitations

20 Immediate Future
– LSF clustering from institute to institute: closer collaborations between Sanger, the EBI and other organisations bring the need for site-wide shared clusters
– Implement a genome campus Storage Area Network (SAN)
– Install multi-TB storage to enable disk mirroring and controller-to-controller snapshots
[Diagram: the Sanger Centre and the EBI, linked across the Genome Campus by ATM to a shared SAN]

21 Longer Term Future
– Wide-area clusters, needed for large-scale collaborations
– GRID technology: global distributed computing
– International cluster collaborations with other scientific institutes: GLOBAL COMPUTE ENGINES
– Sanger is keen to keep abreast of this emerging technology

22 Questions?

