Storage Solutions for Bioinformatics


Storage Solutions for Bioinformatics
Li Yan, Director of FlexLab, Bioinformatics core technology laboratory
liyan3@genomics.cn
http://www.genomics.cn/FlexLab/index.html
Science and Technology Division, BGI-Shenzhen

OUTLINE
- Background
- Hardware Infrastructure of Data Storage
- Data Management
- Data Storage Architecture in BGI
- Distributed Computing on Storage Server

Background: Fast Growing Big Data

Sequencing, sequencing and sequencing

Background
Next-generation sequencing (NGS) represents a revolution in data generation in genetics. Compared with Sanger sequencing, NGS can sequence the complete genomic content of a sample without the need to make clone libraries. It allows a researcher or clinician to use a single test to examine a genome in great detail. What took weeks or months to perform can now be completed in a matter of days.

Fast growing big data

From small genomes to large, complex genomes:
- E. coli genome: 4.9 Mb
- Caenorhabditis elegans genome: 100 Mb
- Human genome: 3 Gb
- Wheat genome: 16 Gb
- Salamander genome: 45 Gb

From one sample to populations:
- Human genome: 3 billion DNA subunits (A, T, C, G)
- 80~100X sequencing: 600 GB of raw data for an individual study
- 1000 Genomes Project: 600 TB of raw data for a population study

From first-generation sequencing to second-generation sequencing
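The 600 GB and 600 TB figures can be reproduced with simple arithmetic. The sketch below is only a rough estimate: it assumes about 2 bytes per sequenced base (one for the base call, one for its quality score) and ignores read headers and compression. That per-base cost, the function name and the 1000-sample count are assumptions for illustration, not values stated on the slide.

```python
# Rough estimate of uncompressed raw sequencing data volume.
# Assumption (not from the slides): about 2 bytes per sequenced base,
# one for the base call and one for its quality score, ignoring read headers.

BYTES_PER_BASE = 2  # assumed FASTQ-like cost per base


def raw_data_bytes(genome_size_bases: int, coverage: int) -> int:
    """Return the estimated raw data size in bytes for one sample."""
    return genome_size_bases * coverage * BYTES_PER_BASE


HUMAN_GENOME = 3_000_000_000  # 3 billion bases, as on the slide

per_sample = raw_data_bytes(HUMAN_GENOME, coverage=100)
population = per_sample * 1000  # a hypothetical 1000-sample study

print(f"one sample at 100X: {per_sample / 1e9:.0f} GB")
print(f"1000 samples:       {population / 1e12:.0f} TB")
```

Under these assumptions one human sample at 100X comes to 600 GB and a thousand such samples to 600 TB, which matches the order of magnitude quoted on the slide.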

Long-Term Data Storage Needs
- Properly secure the data
- Plan for data redundancy, which generally means mirroring data in two or more copies
- Available (24x7x365) for all kinds of uses
- Readily accessible and in the right format
- Fast data transfer for collaborations: a fast network server (Aspera) instead of mailing a hard drive
- Scalable, easy to scale up
- Choose reliable file systems
Source: https://www.intrepidbio.com/next-generation-sequencing-the-data-storage-dilemma/
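The redundancy requirement (two or more mirrored copies) is only useful if each copy is verified. As a minimal sketch, assuming local directories stand in for the mirror targets, the following copies a file to two locations and checks each copy against a SHA-256 checksum. The function names, file name and directory names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large sequencing files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror(source: Path, mirror_dirs: list[Path]) -> list[Path]:
    """Copy source into every mirror directory and verify each copy."""
    expected = sha256_of(source)
    copies = []
    for directory in mirror_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies


# Demo in a temporary directory; the file and directory names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample = root / "sample.fastq"
    sample.write_text("@read1\nACGT\n+\nIIII\n")
    copies = mirror(sample, [root / "mirror_a", root / "mirror_b"])
    all_match = all(sha256_of(c) == sha256_of(sample) for c in copies)
    print(len(copies), "verified copies:", all_match)
```

In practice the mirrors would sit on separate storage systems or sites; the checksum comparison is the part that carries over.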

Hardware infrastructure of data storage

Types of Storage Infrastructure

Disk library
A high-capacity storage system that holds a quantity of CD-ROM, DVD or magneto-optic (MO) disks in a storage rack and feeds them to one or more drives for reading and writing.

Magnetic tape
A high-capacity data storage system for storing, retrieving, reading and writing multiple magnetic tape cartridges.

Redundant array of independent disks (RAID)
A storage technology that combines multiple disk drive components into a logical unit.

Direct-attached storage (DAS)
A digital storage system directly attached to a server or workstation, without a storage network in between.

Network-attached storage (NAS)
File-level computer data storage connected to a computer network, providing data access to heterogeneous clients.

Storage area network (SAN)
A dedicated network that provides access to consolidated, block-level data storage.
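To make the RAID idea concrete, here is a small sketch of single-parity protection, the scheme used by RAID 5. The slides do not name a RAID level, so this is an illustration chosen for clarity rather than the configuration used by the author. The block contents and the helper name are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks, one per disk, all the same length.
data_disks = [b"ACGTACGT", b"TTGACCAA", b"GGGCATTA"]
parity = xor_blocks(data_disks)

# Simulate losing disk 1 and rebuilding it from the survivors plus parity.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_disks[1])
```

Because XOR is its own inverse, any single missing block can be recovered from the remaining blocks and the parity block. Losing two blocks at once cannot be recovered with single parity.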

Comparison of storage types (1)

Disk library
  Pros: Fast; high storage capacity; high data availability
  Cons: Not as easily accessible as DAS; intended for write-once, read-rarely information
  General use: Disk-to-disk backup; archiving; near-line storage

Magnetic tape
  Pros: Low cost per megabyte; portable; unlimited capacity (with multiple tapes)
  Cons: Inconvenient for fast recovery of individual or group files
  General use: Limited-budget businesses; offsite storage

Redundant array of independent disks (RAID)
  Pros: Reliable; security; fault tolerance
  Cons: Possible false sense of security; some recovery difficulty on some systems; high cost for optimum systems
  General use: Swap files; Internet service providers; redundant storage

Source: http://searchdatacenter.techtarget.com/tutorial/Fast-Reference-Storage

Comparison of storage types (2)

Direct-attached storage (DAS)
  Pros: Simple; low starting cost; easy to use
  Cons: Needs separate storage for each server; not easy to transfer data over the network; server takes the application processing load
  General use: Data and application sharing; data backup; archiving

Network-attached storage (NAS)
  Pros: Fast file access for multiple clients; ease of data sharing; high storage capacity; redundancy; ease of drive mirroring; consolidated resources
  Cons: Less convenient than SAN for moving large blocks of data
  General use: Backup; redundant storage

Storage area network (SAN)
  Pros: Excellent for moving large blocks of data; exceptional reliability; easily available; fault tolerance; scalability
  Cons: Expensive; lack of standardization; management complexity
  General use: Large databases; bandwidth-intensive applications; mission-critical applications

Software Level of Data storage

Data flow of NGS (diagram). Labels: Sequencer; Raw Data; Complex workflow (Alignment, Assembly, Association); Annotation of features (Variations/Mutations, Protein Structural, Gene Expressions, Function, Networks); Data Store; Meaningful Biology Data.

Data Management

Classify the data into different levels:
- First level of storage: dynamic, fast, temporary
- Second level of storage: slower than the first level, but durable and safe
- Third level of storage: high-capacity medium for backups and archives

Choosing file systems: currently popular distributed file systems include Lustre, HDFS, MogileFS, FreeNAS, FastDFS, OpenAFS, MooseFS, pNFS, and GoogleFS.

Classify the data into different levels

First level of storage: dynamic, fast, temporary
- Intermediate results of data analysis
- Reference data
- …

Second level of storage: slower than the first level, but durable and safe
- Sequencing raw data
- Meaningful data

Third level of storage: high-capacity medium for backups and archives
- Backups and archives of raw data and meaningful data
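The three-level classification above can be written down as a small lookup. This is a sketch under stated assumptions, not the policy engine used at BGI: the data kinds come from the slide, while the string keys, the `Tier` enum and the `placement` function are hypothetical names.

```python
from enum import Enum


class Tier(Enum):
    FIRST = "first level: dynamic, fast, temporary"
    SECOND = "second level: slower, durable and safe"
    THIRD = "third level: high-capacity backup and archive"


# Data kinds taken from the slide; the string keys themselves are hypothetical labels.
TIER_BY_KIND = {
    "intermediate_result": Tier.FIRST,
    "reference_data": Tier.FIRST,
    "raw_sequencing_data": Tier.SECOND,
    "meaningful_data": Tier.SECOND,
}

# Kinds that also get a backup or archive copy on the third level.
ARCHIVED_KINDS = {"raw_sequencing_data", "meaningful_data"}


def placement(kind: str) -> list[Tier]:
    """Return the tiers on which a given kind of data should live."""
    tiers = [TIER_BY_KIND[kind]]
    if kind in ARCHIVED_KINDS:
        tiers.append(Tier.THIRD)
    return tiers


for kind in TIER_BY_KIND:
    print(kind, "->", [t.name for t in placement(kind)])
```

Keeping the mapping in one table makes the policy easy to review: temporary analysis output never reaches the archive, while raw and meaningful data always have a third-level copy.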

Distributed File Systems

Lustre
Lustre is a large-scale, secure, reliable and highly available cluster file system, developed and maintained by Sun. It can support more than 10,000 nodes and petabyte-scale storage systems.

Hadoop (HDFS)
Hadoop is not just a distributed file system (HDFS) for storage. It is a framework designed to run large-scale distributed applications on clusters of general-purpose computing hardware.

OneFS
OneFS allows a single cluster to scale to more than 1.6 petabytes of capacity and up to 10 gigabytes per second (GB/s) of throughput.

Source: http://www.codeweblog.com/current-popular-distributed-file-system-parade/

(Diagram labels: Storage Server; Distributed file systems)

Distributed File Systems (continued)
- MogileFS (www.danga.com)
- FreeNAS (www.openqrm.org)
- FastDFS (code.google.com/p/fastdfs)
- OpenAFS (www.openafs.org)
- MooseFS (derf.homelinux.org)
- pNFS (www.pnfs.com)
- GoogleFS

Data compression and data security

Data compression
- Commonly used: Lempel-Ziv, BWT
- Used exclusively for DNA sequences: Biocompress, GeneCompress, CTW-LZ, GeNML, fqzcomp, sam_comp

Data security
- RAID system failure / redundancy
- File system
- Network
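Of the general-purpose techniques named under data compression, the Burrows-Wheeler transform (BWT) is easy to show in a few lines. The sketch below is a naive version built from sorted rotations, suitable only for short strings. Production tools use suffix arrays or similar structures instead. The `$` sentinel and the sample read are assumptions for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the Burrows-Wheeler transform of text (naive, via sorted rotations)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    padded = text + sentinel
    rotations = sorted(padded[i:] + padded[:i] for i in range(len(padded)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending and sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


read = "ACGTACGTACGT"  # hypothetical short DNA read
encoded = bwt(read)
print(encoded)
print(inverse_bwt(encoded) == read)
```

The transform itself does not shrink the data. It tends to group identical characters into runs, which a later stage such as run-length or entropy coding can then compress more effectively.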

Data Storage Architecture In BGI

Data Storage Architecture in BGI (diagram, repeated over five slides). Labels: Sequencers, Compute Nodes, Tape Library, Archiving, Two Copies, with Read and Write arrows between them. Successive slides highlight the First Level Storage, Second Level Storage and Third Level Storage regions of the same diagram.

Distributed Computing on Storage Server

Traditional Genome Assembly (diagram): users, storage, an NGS read file, and sequence assembly on a large-memory server (>500 GB). Costly, unscalable.

Distributed Genome Assembly (diagram): assembly spread over several storage servers (IBM3630 x 16 for a human genome). Cost-effective, scalable.

Hecate (steps in slide order)
- Constructing the de Bruijn graph
- Solving tiny repeats
- Merging bubbles
- Scaffolding
- Merging contigs

Gaea 2.1
Pipeline (diagram): Reads and Reference genome, then Preprocessing, Locating, Aligning, SNP calling.
- Distributed indexing for load balancing
- Flexible splitting tolerates more mismatches
- Dynamic programming for robust gap alignment
- Standard mapping quality for SNP calling
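The slide mentions dynamic programming for gap alignment without giving details. As a generic illustration, here is the classic Needleman-Wunsch global alignment score with a linear gap penalty. It is not Gaea's algorithm, and the scoring values (match 1, mismatch -1, gap -2) are assumed for the example.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch score with a linear gap penalty (scores are assumed values)."""
    # Row 0: aligning an empty prefix of a against prefixes of b costs only gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = previous[j] + gap        # gap in b
            left = current[j - 1] + gap   # gap in a
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


# Hypothetical read and reference fragments.
print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the table are kept, so memory grows with the length of one sequence rather than the product of both. Recovering the alignment itself, rather than just its score, would require keeping the full table or a traceback structure.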

Q&A