5 Background
Next-generation sequencing (NGS) represents a revolution in data generation in genetics. Compared to Sanger sequencing, NGS allows for sequencing the complete genomic content of a sample without the need to build clone libraries. It allows a researcher or clinician to use a single test to examine a genome in great detail. What took weeks or months to perform can now be completed in a matter of days.
6 Fast growing big data
From small genomes to large complex genomes
  E. coli genome: 4.9M
  Caenorhabditis elegans genome: 100M
  Human genome: 3G
  Wheat genome: 16G
  Salamander: 45G
From one sample to populations
  Human genome: 3 billion DNA subunits (A, T, C, G)
  80~100X sequencing: 600 GB raw data for an individual study
  1000 Genomes Project: 600 TB raw data for a population study
From first-generation sequencing to second-generation sequencing
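The 600 GB per-sample figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes roughly 2 bytes of uncompressed raw data per sequenced base (one character for the base call and one for its quality score, ignoring read headers). That factor and the function name `estimate_raw_bytes` are assumptions for illustration, not values taken from the slides.

```python
def estimate_raw_bytes(genome_bases: int, coverage: float,
                       bytes_per_base: float = 2.0) -> float:
    """Rough raw-data size: genome size x sequencing depth x bytes per base.

    bytes_per_base=2.0 is an assumed value (base call + quality character).
    """
    return genome_bases * coverage * bytes_per_base


HUMAN_GENOME_BASES = 3_000_000_000  # 3 billion DNA subunits, as on the slide

per_sample_gb = estimate_raw_bytes(HUMAN_GENOME_BASES, 100) / 1e9
per_1000_samples_tb = per_sample_gb * 1000 / 1000  # GB -> TB for 1,000 samples

print(f"100X, one human sample: {per_sample_gb:.0f} GB")
print(f"1,000 such samples:     {per_1000_samples_tb:.0f} TB")
```

Under these assumptions, one sample at 100X comes to 600 GB, and a thousand samples of that size come to 600 TB, which is the order of magnitude quoted for a population study.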
7 Long-Term Data Storage Needs
Properly secure the data
Plan for data redundancy, which generally means mirroring data in two or more copies
Available (24x7x365) for all kinds of uses
Readily accessible and in the right format
Fast data transfer for collaborations
Fast network server (Aspera) instead of mailing a hard drive
Scalable, easy to scale up
Choosing reliable file systems
https://www.intrepidbio.com/next-generation-sequencing-the-data-storage-dilemma/
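As a minimal sketch of the "two or more copies" idea, the snippet below copies a file into several mirror directories and confirms each copy by SHA-256 checksum. The directory names, the sample file, and the helper names `sha256_of` and `mirror_and_verify` are hypothetical; a production setup would normally rely on the storage system's own replication rather than a script like this.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror_and_verify(source: Path, mirror_dirs: list[Path]) -> bool:
    """Copy source into every mirror directory; True only if all checksums match."""
    expected = sha256_of(source)
    for directory in mirror_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        copy = directory / source.name
        shutil.copy2(source, copy)
        if sha256_of(copy) != expected:
            return False
    return True


# Demo in a temporary directory with a tiny made-up FASTQ record.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "sample.fastq"
    source.write_text("@read1\nACGT\n+\nIIII\n")
    ok = mirror_and_verify(source, [root / "mirror_a", root / "mirror_b"])
    copies = sorted(p.parent.name for p in root.glob("mirror_*/sample.fastq"))
    print(ok, copies)
```

Verifying by checksum rather than by file size catches silent corruption during transfer, which matters when the copies are meant for long-term storage.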
9 Types of Storage Infrastructure
Disk library
  A high-capacity storage system that holds a quantity of CD-ROM, DVD or magneto-optical (MO) disks in a storage rack and feeds them to one or more drives for reading and writing.
Magnetic tape
  A high-capacity data storage system for storing, retrieving, reading and writing multiple magnetic tape cartridges.
Redundant array of independent disks (RAID)
  A storage technology that combines multiple disk drive components into a logical unit.
Direct-attached storage (DAS)
  A digital storage system directly attached to a server or workstation, without a storage network in between.
Network-attached storage (NAS)
  File-level computer data storage connected to a computer network, providing data access to heterogeneous clients.
Storage area network (SAN)
  A dedicated network that provides access to consolidated, block-level data storage.
10 Type of Storage (Pros / Cons / General use)
Disk library
  Pros: Fast; high storage capacity; high data availability
  Cons: Not as easily accessible as DAS; intended for write-once, read-rarely information
  General use: Disk-to-disk backup; archiving; near-line storage
Magnetic tape
  Pros: Low cost per megabyte; portable; unlimited capacity (with multiple tapes)
  Cons: Inconvenient for fast recovery of individual or group files
  General use: Limited-budget businesses; offsite storage
Redundant array of independent disks (RAID)
  Pros: Reliable; security; fault tolerance
  Cons: Possible false sense of security; some recovery difficulty on some systems; high cost for optimum systems
  General use: Swap files; Internet service providers; redundant storage
11 Type of Storage (Pros / Cons / General use)
Direct-attached storage (DAS)
  Pros: Simple; low starting cost; easy to use
  Cons: Needs separate storage for each server; not easy to transfer data over the network; server takes the application processing load
  General use: Data and application sharing; data backup; archiving
Network-attached storage (NAS)
  Pros: Fast file access for multiple clients; ease of data sharing; high storage capacity; redundancy; ease of drive mirroring; consolidated resources
  Cons: Less convenient than SAN for moving large blocks of data
  General use: Backup; redundant storage
Storage area network (SAN)
  Pros: Excellent for moving large blocks of data; exceptional reliability; easily available; fault tolerance; scalability
  Cons: Expensive; lack of standardization; management complexity
  General use: Large databases; bandwidth-intensive applications; mission-critical applications
13 Data flow of NGS
[Figure: NGS data flow diagram. Labels: Sequencer; Raw Data; Complex workflow (Alignment, Assembly, Association); Annotation of features; Variations/Mutations; Protein Structural; Gene Expressions; Function Networks; Meaningful Biology Data; Data Store]
14 Data Management
Classify the data into different levels
  First level of storage: dynamic, fast, temporary
  Second level of storage: slower than the first level, but durable and safe
  Third level of storage: high-capacity medium for backups and archives
Choosing file systems
  Current popular distributed file systems include Lustre, HDFS, MogileFS, FreeNAS, FastDFS, OpenAFS, MooseFS, pNFS, and GoogleFS.
15 Classify the data into different levels
First level of storage: dynamic, fast, temporary
  Intermediate results of data analysis
  Reference data
  …
Second level of storage: slower than the first level, but durable and safe
  Sequencing raw data
  Meaningful data
Third level of storage: high-capacity medium for backups and archives
  Backups and archives of raw data and meaningful data
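The three-level classification above can be written down as a simple lookup, which is often the first step toward automating where a pipeline writes its files. The category keys (for example `"intermediate_result"`) and the names `Tier` and `classify` are hypothetical labels chosen for this sketch; only the level descriptions come from the slide.

```python
from enum import Enum


class Tier(Enum):
    FIRST = "dynamic, fast, temporary"
    SECOND = "slower, but durable and safe"
    THIRD = "high-capacity medium for backups and archives"


# Hypothetical category names mapped to the levels listed on the slide.
TIER_BY_KIND = {
    "intermediate_result": Tier.FIRST,
    "reference_data": Tier.FIRST,
    "raw_sequencing_data": Tier.SECOND,
    "meaningful_data": Tier.SECOND,
    "backup": Tier.THIRD,
    "archive": Tier.THIRD,
}


def classify(kind: str) -> Tier:
    """Return the storage level for a data category, or raise on unknown input."""
    try:
        return TIER_BY_KIND[kind]
    except KeyError:
        raise ValueError(f"unknown data category: {kind!r}") from None


for kind in ("intermediate_result", "raw_sequencing_data", "archive"):
    print(f"{kind}: {classify(kind).name} ({classify(kind).value})")
```

Raising on an unknown category, rather than defaulting to a level, keeps unclassified data from silently landing on temporary storage.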
16 Distributed File Systems
Lustre
  Lustre is a large-scale, secure, reliable, highly available cluster file system, developed and maintained by Sun. Lustre can support more than 10,000 nodes and petabyte-scale storage systems.
Hadoop (HDFS)
  Hadoop is not just a distributed file system for storage; it is a framework designed for running large-scale distributed applications on clusters of general-purpose computing hardware.
OneFS
  OneFS allows a single cluster to scale to more than 1.6 petabytes of capacity and up to 10 GB/s (gigabytes per second) of throughput.
[Figure labels: Storage Server; Distributed file systems]
17 Distributed File Systems
MogileFS (www.danga.com)
FreeNAS
FastDFS (code.google.com/p/fastdfs)
OpenAFS
MooseFS (derf.homelinux.org)
pNFS
GoogleFS
18 Data compression & Data security
Data compression
  Commonly used: Lempel-Ziv, BWT
  Used exclusively for DNA sequences: Biocompress, GenCompress, CTW-LZ, GeNML, fqzcomp, sam_comp
Data security
  RAID system failure / redundancy
  File system
  Network
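To make the difference between general-purpose and DNA-specific compression concrete, the sketch below compresses the same sequence with Python's `zlib` (DEFLATE, a Lempel-Ziv-based method) and `bz2` (a BWT-based method), and compares both with plain 2-bit packing of A/C/G/T. The 2-bit packer is a deliberately simple stand-in and is not the method used by any of the DNA tools listed above. The test sequence is random, which real genomes are not, so the sizes are illustrative only.

```python
import bz2
import random
import zlib

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Inverse of pack_2bit; length trims padding in the last byte."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 3])
    return "".join(out[:length])


random.seed(0)
seq = "".join(random.choice(BASES) for _ in range(10_000))
raw = seq.encode("ascii")

sizes = {
    "raw": len(raw),
    "zlib (Lempel-Ziv based)": len(zlib.compress(raw, 9)),
    "bz2 (BWT based)": len(bz2.compress(raw, 9)),
    "2-bit packing": len(pack_2bit(seq)),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

The 2-bit packer only handles the four unambiguous bases; real reads also carry `N` calls and quality scores, which is part of why dedicated tools exist.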
29 Gaea 2.1
[Figure: pipeline diagram. Labels: Reads; Reference genome; Preprocessing; Locating; Aligning; SNP calling]
Distributed indexing for load balancing
Flexible splitting tolerates more mismatches
Dynamic programming for robust gap alignment
Standard mapping quality for SNP calling
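The slides do not show how Gaea implements its gap alignment. As a general illustration of what dynamic-programming alignment with gaps means, here is a textbook Needleman-Wunsch global alignment score with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -2) and the function name `nw_score` are assumptions for this sketch, not Gaea parameters.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1,
             gap: int = -2) -> int:
    """Global alignment score of a and b (Needleman-Wunsch, linear gap cost).

    Only two rows of the DP table are kept, so memory is O(len(b)).
    """
    # Row 0: aligning the empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(nw_score("ACGT", "ACGT"))  # identical sequences
print(nw_score("ACGT", "AGT"))   # one deleted base
print(nw_score("ACGT", "ACCT"))  # one substitution
```

With these scores, the one-base deletion is handled by opening a single gap instead of forcing a run of mismatches, which is the behaviour the "robust gap alignment" bullet refers to in general terms.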