Storage Solutions for Bioinformatics


1 Storage Solutions for Bioinformatics
Li Yan Director of FlexLab, Bioinformatics core technology laboratory Science and Technology Division, BGI-Shenzhen

2 Outline
  Background
  Hardware Infrastructure of Data Storage
  Data Management
  Data Storage Architecture in BGI
  Distributed Computing on Storage Server

3 Background: Fast Growing Big Data

4 Sequencing, sequencing and sequencing

5 Background Next generation sequencing (NGS) represents a revolution in data generation in the genetic world.  Compared to Sanger sequencing, NGS allows for sequencing the complete genomic content of a sample without the need to make clone libraries.  It allows a researcher or clinician to use a single test to examine a genome in great detail.  What took weeks or months to perform can now be completed in a matter of days.

6 Fast growing big data
From small genomes to large complex genomes
  E. coli genome: 4.9M
  Caenorhabditis elegans genome: 100M
  Human genome: 3G
  Wheat genome: 16G
  Salamander: 45G
From one sample to populations
  Human genome: 3 billion DNA subunits (A, T, C, G)
  80~100X sequencing: 600GB of raw data for an individual study
  1000 Genomes Project: 600TB of raw data for a population study
From first-generation sequencing to second-generation sequencing
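The 600GB and 600TB figures above can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes roughly 2 bytes per sequenced base (one base character plus one quality character, as in uncompressed FASTQ) and 1000 samples for the population study; neither number is stated on the slide.

```python
# Assumption (not from the slides): uncompressed FASTQ stores about
# 2 bytes per sequenced base (base call + quality score), ignoring
# read headers and compression.
BYTES_PER_BASE = 2


def raw_data_bytes(genome_size_bases: int, coverage: int) -> int:
    """Estimate the uncompressed raw data volume for one sample."""
    return genome_size_bases * coverage * BYTES_PER_BASE


human_genome = 3_000_000_000                     # 3G bases, from the slide
per_sample = raw_data_bytes(human_genome, 100)   # 100X sequencing
population = per_sample * 1000                   # assumed 1000 samples

print(per_sample / 1e9, "GB per sample")         # 600.0 GB per sample
print(population / 1e12, "TB for 1000 samples")  # 600.0 TB for 1000 samples
```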

7 Long-Term Data Storage Needs
Properly secure the data
  Plan for data redundancy, which generally means mirroring data in two or more copies
Available (24x7x365) for all kinds of uses
  Readily accessible and in the right format
Fast data transfer for collaborations
  Fast network server (Aspera) instead of mailing a hard drive
Scalable, easy to scale up
Choosing reliable file systems
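The redundancy requirement (two or more mirrored copies) is usually paired with an integrity check. A minimal sketch follows, assuming local directories stand in for independent storage targets; the function names `sha256_of` and `mirror` are illustrative and not part of any BGI tooling.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large read files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror(source: Path, replica_dirs: list[Path]) -> list[Path]:
    """Copy `source` into each replica directory and verify every copy."""
    expected = sha256_of(source)
    copies = []
    for directory in replica_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies
```

Calling `mirror(Path("sample.fq"), [Path("/mnt/a"), Path("/mnt/b")])` (hypothetical paths) would return the two verified copies or raise on a mismatch.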

8 Hardware infrastructure of data storage

9 Types of Storage Infrastructure
Disk library
  A high-capacity storage system that holds a quantity of CD-ROM, DVD or magneto-optic (MO) disks in a storage rack and feeds them to one or more drives for reading and writing.
Magnetic tape
  A high-capacity data storage system for storing, retrieving, reading and writing multiple magnetic tape cartridges.
Redundant array of independent disks (RAID)
  A storage technology that combines multiple disk drive components into a logical unit.
Direct-attached storage (DAS)
  A digital storage system directly attached to a server or workstation, without a storage network in between.
Network-attached storage (NAS)
  File-level computer data storage connected to a computer network, providing data access to heterogeneous clients.
Storage area network (SAN)
  A dedicated network that provides access to consolidated, block-level data storage.
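To make the RAID entry concrete, the sketch below shows XOR parity, the idea behind parity-based RAID levels such as RAID 5. It is a toy model using in-memory byte strings as "disks" (an assumption for illustration), not a description of any particular controller.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data "disks" holding equal-sized blocks.
data_disks = [b"ACGTACGT", b"TTGACCAA", b"GGGCCCAT"]
parity = xor_blocks(data_disks)

# Simulate losing disk 1: XOR of the survivors and the parity block
# reproduces the missing block.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_disks[1])
```

Parity tolerates the loss of one disk per stripe; it does not replace backups, which is the "false sense of security" concern raised for RAID.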

10 Type of Storage: Pros, Cons, General use
Disk library
  Pros: Fast; high storage capacity; high data availability
  Cons: Not as easily accessible as DAS; intended for write-once, read-rarely information
  General use: Disk-to-disk backup; archiving; near-line storage
Magnetic tape
  Pros: Low cost per megabyte; portable; unlimited capacity (with multiple tapes)
  Cons: Inconvenient for fast recovery of individual or group files
  General use: Limited-budget businesses; offsite storage
Redundant array of independent disks (RAID)
  Pros: Reliable; security; fault tolerance
  Cons: Possible false sense of security; some recovery difficulty on some systems; high cost for optimum systems
  General use: Swap files; Internet service providers; redundant storage

11 Type of Storage: Pros, Cons, General use
Direct-attached storage (DAS)
  Pros: Simple; low starting cost; easy to use
  Cons: Needs separate storage for each server; not easy to transfer data over the network; server takes the application processing load
  General use: Data and application sharing; data backup; archiving
Network-attached storage (NAS)
  Pros: Fast file access for multiple clients; ease of data sharing; high storage capacity; redundancy; ease of drive mirroring; consolidated resources
  Cons: Less convenient than SAN for moving large blocks of data
  General use: Backup; redundant storage
Storage area network (SAN)
  Pros: Excellent for moving large blocks of data; exceptional reliability; easily available; fault tolerance; scalability
  Cons: Expensive; lack of standardization; management complexity
  General use: Large databases; bandwidth-intensive applications; mission-critical applications

12 Software Level of Data storage

13 Data flow of NGS
Diagram labels: Sequencer; Raw Data; Complex workflow (Alignment, Assembly, Association); Annotation of features; Variations/Mutations; Protein Structural; Gene Expressions; Function Networks; Meaningful Biology Data; Data Store

14 Data Management
Classify the data into different levels
  First level of storage: dynamic, fast, temporary
  Second level of storage: slower than the first level, but durable and safe
  Third level of storage: high-capacity medium for backups and archives
Choosing file systems
  Currently popular distributed file systems include Lustre, HDFS, MogileFS, FreeNAS, FastDFS, OpenAFS, MooseFS, pNFS, and GoogleFS.

15 Classify the data into different levels
First level of storage: dynamic, fast, temporary
  Intermediate results of data analysis
  Reference data
Second level of storage: slower than the first level, but durable and safe
  Sequencing raw data
  Meaningful data
Third level of storage: high-capacity medium for backups and archives
  Backups and archives of raw data and meaningful data
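The three-level classification above can be expressed as a simple placement table. This is a minimal sketch; the category names and the `storage_levels` helper are hypothetical labels chosen for illustration, not identifiers from BGI's systems.

```python
from enum import Enum


class Level(Enum):
    FIRST = "dynamic, fast, temporary"
    SECOND = "slower, durable and safe"
    THIRD = "high-capacity backups and archives"


# Placement as described on the slide: raw and meaningful data live on
# the second level and are also backed up or archived on the third.
PLACEMENT = {
    "intermediate_result": [Level.FIRST],
    "reference_data": [Level.FIRST],
    "raw_data": [Level.SECOND, Level.THIRD],
    "meaningful_data": [Level.SECOND, Level.THIRD],
}


def storage_levels(category: str) -> list[Level]:
    """Return the storage levels on which a data category should be kept."""
    try:
        return PLACEMENT[category]
    except KeyError:
        raise ValueError(f"unknown data category: {category!r}") from None


print([level.name for level in storage_levels("raw_data")])
```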

16 Distributed File Systems
Lustre
  Lustre is a large-scale, reliable, highly available cluster file system, developed and maintained by Sun. It can support more than 10,000 nodes and petabyte-scale storage systems.
Hadoop (HDFS)
  Hadoop is not just a distributed file system (HDFS) for storage; it is a framework for running large-scale distributed applications on clusters of general-purpose computing devices.
OneFS
  OneFS allows a single cluster to scale to more than 1.6 petabytes of capacity and up to 10 GB/s (gigabytes per second) of throughput.
Diagram labels: Storage Server; Distributed file systems

17 Distributed File Systems
  MogileFS
  FreeNAS
  FastDFS (code.google.com/p/fastdfs)
  OpenAFS
  MooseFS (derf.homelinux.org)
  pNFS
  GoogleFS

18 Data compression & data security
Data compression
  Commonly used: Lempel-Ziv, BWT
  Used specifically for DNA sequences: Biocompress, GenCompress, CTW-LZ, GeNML, fqzcomp, sam_comp
Data security
  RAID system failure / redundancy
  File system
  Network
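To illustrate the difference between general-purpose and DNA-specific compression, the sketch below compares Python's `zlib` (DEFLATE, which is Lempel-Ziv based), `bz2` (BWT based), and a naive 2-bit-per-base packing. The 2-bit packer is a simplified stand-in written for this example; it is not how the listed DNA compressors work, and it only handles the four letters A, C, G, T on randomly generated test data.

```python
import bz2
import random
import zlib

BASES = "ACGT"
CODES = {base: code for code, base in enumerate(BASES)}


def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Inverse of pack_2bit; `length` trims the padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


random.seed(0)
seq = "".join(random.choice(BASES) for _ in range(100_000))
raw = seq.encode("ascii")
print("raw bytes:    ", len(raw))
print("zlib (LZ77):  ", len(zlib.compress(raw, 9)))
print("bz2 (BWT):    ", len(bz2.compress(raw, 9)))
print("2-bit packing:", len(pack_2bit(seq)))
```

Real sequencing files also carry read names, quality scores and ambiguous bases, which is part of why dedicated tools exist.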

19 Data Storage Architecture In BGI

20 Data Storage Architecture In BGI
Diagram labels: Sequencers; Compute Nodes; Read; Write; Archiving; Two Copies; Tape Library

21 Data Storage Architecture In BGI
Diagram labels: Sequencers; Compute Nodes; Read; Write; Archiving; Two Copies; Tape Library; First Level Storage

22 Data Storage Architecture In BGI
Diagram labels: Sequencers; Compute Nodes; Read; Write; Archiving; Two Copies; Tape Library; Second Level Storage

23 Data Storage Architecture In BGI
Diagram labels: Sequencers; Compute Nodes; Read; Write; Archiving; Two Copies; Tape Library; Third Level Storage

24 Data Storage Architecture In BGI
Diagram labels: Sequencers; Compute Nodes; Read; Write; Archiving; Two Copies; Tape Library

25 Distributed Computing on Storage Server

26 Traditional Genome Assembly
Costly, unscalable
Diagram labels: Users; Storage; NGS read file; Sequence Assembly; Large memory server >500GB

27 Distributed Genome Assembly
Cost-effective, scalable
Diagram labels: Several storage servers (IBM3630*16 for human genome); Assembly; ……
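One common way to spread assembly work over several servers is to partition k-mers by a stable hash, so that every occurrence of a k-mer is counted on the same machine. The slide does not say how BGI distributes the work, so the sketch below is only an assumed illustration; `server_for` and `partition_kmers` are hypothetical names, and in-memory counters stand in for the 16 servers.

```python
import zlib
from collections import Counter

NUM_SERVERS = 16  # the slide mentions 16 storage servers for a human genome


def server_for(kmer: str, num_servers: int = NUM_SERVERS) -> int:
    """Assign a k-mer to a server index using a stable CRC32 hash."""
    return zlib.crc32(kmer.encode("ascii")) % num_servers


def partition_kmers(reads, k, num_servers=NUM_SERVERS):
    """Count k-mers per server; each k-mer always lands on one server."""
    shards = [Counter() for _ in range(num_servers)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            shards[server_for(kmer, num_servers)][kmer] += 1
    return shards


shards = partition_kmers(["ACGTACGT"], k=4)
print(sum(sum(shard.values()) for shard in shards), "k-mers distributed")
```

`zlib.crc32` is used instead of the built-in `hash` because Python randomizes string hashes between runs, which would break a stable assignment.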

28 Hecate
  Constructing de Bruijn graph
  Solving tiny repeats
  Merging bubbles
  Scaffolding
  Merging contigs
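The first step listed for Hecate, constructing a de Bruijn graph, can be sketched in a few lines. This is the textbook construction on a single machine and is not taken from Hecate itself; the function name `de_bruijn` is illustrative.

```python
from collections import defaultdict


def de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read.

    Every k-mer in a read contributes one edge from its prefix
    (first k-1 bases) to its suffix (last k-1 bases).
    """
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


graph = de_bruijn(["ACGTC", "CGTCA"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

Overlapping reads share edges, so the two reads above collapse into a single path from which a contig can be read off.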

29 Gaea 2.1
Diagram labels: Reads; Reference genome; Preprocessing; Locating; Aligning; SNP calling
Features
  Distributed indexing for load balancing
  Flexible splitting tolerates more mismatches
  Dynamic programming for robust gap alignment
  Standard mapping quality for SNP calling
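The slide names dynamic programming for gap alignment without giving details. As a generic illustration, the sketch below computes a Needleman-Wunsch global alignment score with a linear gap penalty; the scoring values are arbitrary assumptions, and this is not Gaea's actual algorithm.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch score, keeping only two rows of the DP table."""
    # Row 0: aligning an empty prefix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))  # 4: four matches
print(global_alignment_score("ACGT", "AGT"))   # 1: three matches, one gap
```

Read mappers typically use local or semi-global variants with affine gap penalties; the recurrence has the same shape.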

30 Q&A

