Speaker: Chun-Yuan Lin Assistant Professor, CSIE Chang Gung University Development of Next-Generation Sequencing Tools based on Graphics Processing Units.


1 Speaker: Chun-Yuan Lin Assistant Professor, CSIE Chang Gung University Development of Next-Generation Sequencing Tools based on Graphics Processing Units 2016/6/14 1

2 Background-Biology (1): DNA stores and transmits genetic information. (Figure from: 2001 Summer School)

3 Background-Biology (2): Structure of nucleotides. Figure labels: H (RNA) (DNA). Bases A, T, G, C; base pairs A-T and G-C. (Figure from: 2001 Summer School)

4 Background-Biology (3): The structure of DNA is a double-stranded antiparallel helix. DNA molecules have distinctive base composition. Nucleic acids hybridize by base pairing. James D. Watson. (Figure from: 2001 Summer School)

5 Background-Biology (4): Isolating a gene from a cellular genome. DNA library: genomic library, cDNA library. (Figure from: 2001 Summer School)

6 Background-Biology (5): DNA library. (Figure from: 2001 Summer School)

7 Background-Biology (6) (Figure from: 2001 Summer School)

8 Background-Biology (7): RNA splicing

9 Background-HGP Project (1): Human Genome Project (data from genomics.energy.org). Begun formally in 1990, the U.S. Human Genome Project was a 13-year effort coordinated by the U.S. Department of Energy and the National Institutes of Health (James D. Watson). The project was originally planned to last 15 years, but rapid technological advances accelerated the completion date to 2003. Goals: identify all of the approximately 20,000-25,000 genes in human DNA; determine the sequences of the 3 billion chemical base pairs that make up human DNA. The cost was more than 1 billion US dollars.

10 Background-HGP Project (2): Genome assembly problem. K-mer: a K-mer in a genome is a sequence of K consecutive bases in it. K-mer x is adjacent to K-mer y (written x → y) if there is a (K+1)-mer in the genome whose first K bases are x and whose last K bases are y. (It follows that x and y overlap by K-1 bases.)
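
The adjacency relation on this slide can be made concrete with a few lines of Python. This is only an illustrative sketch: the function names `kmers` and `adjacent_pairs` and the toy genome string are assumptions made for the example, not part of the talk.

```python
def kmers(genome, k):
    """Return all K-mers of the genome, in order of starting position."""
    return [genome[i:i + k] for i in range(len(genome) - k + 1)]


def adjacent_pairs(genome, k):
    """Return the set of pairs (x, y) with x -> y.

    x -> y holds when some (K+1)-mer of the genome starts with x
    and ends with y, so x and y overlap by K-1 bases.
    """
    pairs = set()
    for longer in kmers(genome, k + 1):
        pairs.add((longer[:k], longer[1:]))
    return pairs


# Toy genome (hypothetical), K = 3
pairs = adjacent_pairs("ACGTAC", 3)
for x, y in sorted(pairs):
    print(x, "->", y)
```

Each pair shares K-1 bases (the suffix of x equals the prefix of y), which is the overlap property stated on the slide.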

11 Background-HGP Project (3): www.ncbi.nlm.nih.gov/Genbank. Over 2,000 ongoing genome projects (Liolios et al., 2006). International sequence databases exceed 100 gigabases.

12 Background-HGP Project (4)

13 Background-HGP Project (5): Draft Genome Sequence of the Sexually Transmitted Pathogen Trichomonas vaginalis (陰道鞭毛蟲). Chang Gung University. 12 January 2007, Vol. 315, Issue 5809.

14 Background-HGP Project (6): Post-genome era. Proteomics; gene functions and variations; bio-medical integration; drug design; pathways; treatment; genetic medical therapy; personal care; target therapy.

15 Conventional Sequencing Techniques: Sanger sequencing technique. Sanger F, Nicklen S, Coulson AR: DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA 1977, 74:5463-5467. The human genome reference sequence cost about $1 billion to produce; it is 99.995% accurate and near-complete, containing >99% of the euchromatic region. (David R Bentley, 2006) Advantage: long reads (500~1500 bp) and accurate sequencing data. Disadvantage: slow and expensive to produce sequencing data.

16 Next-Generation Sequencing (1): Goals (David R Bentley, 2006). We could sequence 24–48 human genomes and obtain a baseline measure of human genetic variation at a defined population allele frequency. The resulting dataset would be an unbiased resource for human genetic studies and could be deepened when appropriate. Extensive re-sequencing of case collections in common diseases such as diabetes, obesity or cardiovascular disease would yield catalogues of germline variation to aid searches for novel risk factors. For $1000 per human genome, the concept of personal genome sequencing would become a technical reality, and bacterial sequencing for $4 would be one of the cheapest laboratory tests available.

17 Next-Generation Sequencing (2): New sequencing techniques: microelectrophoresis; sequencing by hybridisation; sequencing by synthesis on arrays (the 454 system, www.454.com; the Solexa system, www.solexa.com); single-molecule sequencing. Solexa system vs. 454 system: a 454 run gives reads of 200~300 bp and produces ~100 MB in 7.5 hr (more and faster now); a Solexa run gives reads of 25~80 bp and produces 800 MB~1 GB in 3~5 days (more and faster now).

18 Next-Generation Sequencing (3): NGS advantage and disadvantage. Advantage: fast and cheap. Disadvantage: computing challenges from the large amount of short reads (e.g. more than 1 billion reads of 76 bp). An example: Solexa DNA, RNA and small RNA sequencing. Raw data: @HWI-EAS82_3_FC204V1AAXX:6:1:886:345 AGAGTTCTACAGTCCGGACGATCTCGTATGCCGTC +HWI-EAS82_3_FC204V1AAXX:6:1:886:345 hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
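
The raw record on this slide has four fields: an `@` header, the bases, a `+` separator line, and one quality character per base. A minimal Python sketch of reading such a record is given below. The function names are hypothetical, and the quality offset of 64 is an assumption (it was common for early Solexa/Illumina output, but the slide does not state the encoding).

```python
def parse_fastq(lines):
    """Parse a list of text lines into (name, sequence, quality) tuples."""
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (s.strip() for s in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch")
        records.append((header[1:], seq, qual))
    return records


def quality_scores(qual, offset=64):
    """Convert quality characters to integers (offset 64 is an assumption)."""
    return [ord(c) - offset for c in qual]


# The record shown on the slide, one field per line.
raw = [
    "@HWI-EAS82_3_FC204V1AAXX:6:1:886:345",
    "AGAGTTCTACAGTCCGGACGATCTCGTATGCCGTC",
    "+HWI-EAS82_3_FC204V1AAXX:6:1:886:345",
    "h" * 35,
]
name, seq, qual = parse_fastq(raw)[0]
scores = quality_scores(qual)
print(name, len(seq), scores[0])
```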

19 An example-Solexa (1)

20 An example-Solexa (2)

21 An example-Solexa (3): Solexa/Illumina; 454 FLX/Roche

22 An example-Solexa (4)

23 NGS applications: Re-sequencing applications: the human 1000 Genomes Project; more and more populations have now been sequenced and analyzed. De novo sequencing: genome assembly techniques; new species have now been sequenced and assembled. Cancer studies: genetic variations. Small RNA, including microRNA, snRNA, etc. Gene expression profiles.

24 Mapping tool for NGS (1): Many new sequencing techniques have been proposed in the last few years, such as the ABI-SOLiD, Roche-454 and Illumina-Solexa systems. Mapping all short reads to a genome is an important job for these applications. The read mapping process is a classical exact or approximate string matching problem: given a query string P of length m, a text string T, and a distance k (k is 0 for the exact string matching problem), find all substrings t of T that are within distance k from P. A practical application has more than a million query strings.
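
The problem statement on this slide can be written down directly as a brute-force scan. The sketch below assumes the distance is the Hamming distance (mismatches only, no insertions or deletions); the slide does not fix the distance measure, and the function names and toy strings are made up for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_matches(P, T, k):
    """Return every start position i where T[i:i+m] is within distance k of P."""
    m = len(P)
    return [i for i in range(len(T) - m + 1)
            if hamming(P, T[i:i + m]) <= k]


T = "ACGTACGAACGT"   # toy text
P = "ACGT"           # toy query
exact = find_matches(P, T, 0)    # k = 0: exact string matching
approx = find_matches(P, T, 1)   # k = 1: one mismatch allowed
print(exact, approx)
```

This scan costs on the order of m times the length of T for each query, which is why the indexing schemes described on the following slides are needed when there are millions of reads.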

25 Mapping tool for NGS (2): Most traditional tools are not useful for next-generation sequencing applications: BLAST, BLAT, SSAHA, GMAP, etc. The challenges of the read mapping process for next-generation sequencing applications: the large amount of reads, and sequencing errors in short reads.

26 Mapping tool for NGS (3): New tools have been developed for next-generation sequencing applications: ELAND, RMAP, MAQ, SeqMap, ZOOM, SOAP, SOAP2, Bowtie, BWA, etc.

27 Mapping tool for NGS (4): The hash look-up table algorithm is a commonly used index method, applied either to the reads or to the genome. For reads (ELAND, RMAP, MAQ, SeqMap, ZOOM), the memory usage depends on the number of reads and the read length. For example, the space complexity of RMAP is $O(4^{n/k} + rk)$, where r is the number of reads, n is the read length and k is the number of mismatches allowed. For a genome (SOAP), the memory usage depends on the size of the genome and the read length. The space requirement of SOAP is $L/3 + (4 \times 3 + 8 \times 6) \times 4^{S} + (4+1) \times 3 \times (L/4) + 4 \times 2^{24}$, where L is the genome size and S is the seed size.

28 Mapping tool for NGS (5): The split-read strategy (d+1 split strategy) is used for the approximate string matching problem (SeqMap, SOAP, etc.). When one mismatch is allowed (d = 1), a read is split into two fragments; when d = 2, a read is split into three fragments, and so on. For d = 1, the mismatch can exist in at most one of the two fragments at the same time, so the other fragment must match exactly.
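
The d+1 split strategy rests on a pigeonhole argument: d mismatches cannot touch all d+1 fragments, so at least one fragment matches exactly and can serve as a seed. The Python sketch below illustrates the idea with plain string search in place of a hash index; the function names, the toy genome and the use of `str.find` are assumptions for the example and do not describe how SeqMap or SOAP are implemented.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def split_read(read, d):
    """Split a read into d+1 fragments; return (offset, fragment) pairs."""
    n = d + 1
    base, extra = divmod(len(read), n)
    frags, start = [], 0
    for i in range(n):
        length = base + (1 if i < extra else 0)
        frags.append((start, read[start:start + length]))
        start += length
    return frags


def map_read(read, genome, d):
    """Return genome positions where the read aligns with at most d mismatches."""
    hits = set()
    for offset, frag in split_read(read, d):
        pos = genome.find(frag)              # exact seed match
        while pos != -1:
            start = pos - offset             # implied start of the whole read
            if 0 <= start <= len(genome) - len(read):
                window = genome[start:start + len(read)]
                if hamming(read, window) <= d:   # verify the full read
                    hits.add(start)
            pos = genome.find(frag, pos + 1)
    return sorted(hits)


genome = "TTTTACGTACGTGGGG"   # toy genome
read = "ACGAACGT"             # differs from genome[4:12] at one base
one_mismatch = map_read(read, genome, 1)
exact_only = map_read(read, genome, 0)
print(one_mismatch, exact_only)
```

Here the first fragment carries the mismatch and finds no seed hit, while the second fragment matches exactly and leads to the verified position.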

29 Mapping tool for NGS (6): Read quality control is very important for the mapping results. Tools allow two (e.g. ELAND) or more (e.g. SeqMap) mismatches in the mapping process, consider read quality scores in the mapping process (RMAP and ZOOM), and filter out the adapter sequence in reads (MAQ and SOAP). SOAP even has a procedure to process mRNA tag sequencing by considering the enzyme site. For unmapped reads, SOAP can also trim off several base pairs at the 3'-end and redo the mapping.

30 Mapping tool for NGS (7): 9,914,527 single-end reads (length 32 bp); 5 Mb human genome. (from SOAP, Bioinformatics 2008) (from SeqMap, Bioinformatics 2008)

31 Mapping tool for NGS (8): Suffix trees (or suffix arrays) and the Burrows-Wheeler Transformation have also been used in recently developed tools: SOAP2, Bowtie and BWA. They greatly reduce the time and space requirements of these tools, but make the approximate string matching problem hard to handle. (from SOAP2, Bioinformatics 2009) (one million reads, human genome)

32 Mapping tool for NGS (9): FRESCO (Frequency-based RE-Sequencing tool based on CO-clustering segmentation) is a microRNA discovery program that maps Solexa small RNA short reads to a genome. Exploiting the concepts of the distance graph, frequency distance and length signature, FRESCO maps reads of variable lengths to the genome without using the hash look-up table algorithm or the Burrows-Wheeler Transformation.

33 Flowchart of FRESCO

34 Mapping tool for NGS (10): Input: reads (Fq format) and genome (Fasta format). Preprocessing: FRESCO scans all reads and records reads with 'N' nucleotide(s) not identified by Solexa, or with six or more continuous A or T at the head or the tail of a read, which is seen as poly-A or poly-T. (These reads can be removed if the user chooses to omit them.) It also records duplicate reads and removes them, preserving only one copy when mapping reads to a genome.
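
A minimal Python sketch of the preprocessing rules on this slide follows. It is not FRESCO's code: the function name `preprocess`, the regular expression and the toy reads are assumptions, and flagged reads are simply set aside rather than deleted, since the slide leaves their removal to the user.

```python
import re

# Six or more consecutive A or T at the head or at the tail of a read.
POLY_AT = re.compile(r"^(A{6,}|T{6,})|(A{6,}|T{6,})$")


def preprocess(reads):
    """Return (unique_reads, copy_counts, flagged_reads)."""
    flagged, counts = [], {}
    for read in reads:
        if "N" in read or POLY_AT.search(read):
            flagged.append(read)                     # record, do not map
        else:
            counts[read] = counts.get(read, 0) + 1   # collapse duplicates
    unique = list(counts)   # one copy per distinct read, in first-seen order
    return unique, counts, flagged


reads = [
    "ACGTACGTACGT",
    "ACGTACGTACGT",   # duplicate of the first read
    "ACGNACGTACGT",   # contains an unidentified base
    "AAAAAACGTACG",   # poly-A head
    "CGTACGTTTTTT",   # poly-T tail
    "ACGAAAAAAGTC",   # internal run of A: not flagged
]
unique, counts, flagged = preprocess(reads)
print(unique, counts, flagged)
```

Keeping the copy count per distinct read means expression-level information is not lost when duplicates are collapsed before mapping.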

35 Mapping tool for NGS (11): For example, there are 702,906 Solexa small RNA reads (chicken) with a length of 33 bp (Glazov et al., Genome Res. 2008): 5,201 reads with 'N' nucleotide(s), 3,922 reads with possible continuous A or T, and no duplicate reads (the released data may already have had duplicate reads removed). There are 3,689,856 (Giardia lamblia) and 4,659,813 (Trichomonas vaginalis) Solexa small RNA reads with a length of 35 bp (Petrus Tang): 5.89% and 2.24% of the reads have 'N' nucleotide(s) or possible continuous A or T, and 73.76% and 89.56% of the reads are duplicates.

36 Mapping tool for NGS (12): Removing the adapter sequence: FRESCO compares the reads with the adapter sequence to filter out possible adapter nucleotides from the reads, then redoes the previous procedure to record and remove duplicate reads.

37 Mapping tool for NGS (13): For the chicken small RNA reads, 2,281 of the 693,783 reads remaining after the preprocessing step are duplicates after removing the adapter sequence. 61% of the reads (among 691,502 reads) contain possible nucleotides of the adapter sequence. The read lengths in the remaining 693,783 reads range over 0~33 bp; only 0.4% of the reads are shorter than 14. For the GlT and TvT small RNA reads, there are 248 and 220 duplicate reads after removing the adapter sequence; 67% and 37% of the reads contain possible nucleotides of the adapter sequence; only 0.01% of the reads are shorter than 14.

38 Mapping tool for NGS (14): Rfam filtering and clustering. Rfam database (Griffiths-Jones, S. et al., Nucleic Acids Res. 2005). FRESCO removes and records reads that are ncRNA (excluding miRNA) candidates by comparing them with RNA data from Rfam release 8.1. For the GlT and TvT small RNA reads, 5.66% and 9.86% of the reads are ncRNA (excluding miRNA) candidates. (from Rfam)

39 Mapping tool for NGS (15): Mapping phase. In Step 1, for the reads (denoted as Set A), each read is recorded by two patterns of length l, extracted from its head and its tail (a read of length exactly l is recorded by one pattern; FRESCO filters out reads of length < l). The genome (denoted as Set B) is recorded by patterns of length l using a sliding window scheme. In Step 2, each pattern in Sets A and B is split into (d+1) segment(s). The patterns of Set A or Set B are grouped into clusters such that all of the patterns in a cluster have the same segment (seed) with the same order number.
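
Steps 1 and 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not FRESCO's implementation: the function names are hypothetical, segments are taken to be equal-sized with the last one absorbing any remainder, and every segment of a pattern is used as a possible seed.

```python
from collections import defaultdict


def read_patterns(read, l):
    """Step 1 for reads: head and tail patterns of length l."""
    if len(read) < l:
        return []                  # reads shorter than l are filtered out
    if len(read) == l:
        return [read]              # a read of length l gives one pattern
    return [read[:l], read[-l:]]


def genome_patterns(genome, l):
    """Step 1 for the genome: sliding window of length l, with positions."""
    return [(i, genome[i:i + l]) for i in range(len(genome) - l + 1)]


def segments(pattern, d):
    """Step 2: split a pattern into d+1 segments (last takes the remainder)."""
    n = d + 1
    size = len(pattern) // n
    return [pattern[i * size:(i + 1) * size] if i < n - 1
            else pattern[i * size:]
            for i in range(n)]


def cluster(patterns, d):
    """Step 2: group patterns by (order number, segment), i.e. by seed."""
    clusters = defaultdict(list)
    for p in patterns:
        for order, seg in enumerate(segments(p, d)):
            clusters[(order, seg)].append(p)
    return clusters


set_a = cluster(read_patterns("ACGTACGTAC", 6), 1)
print(dict(set_a))
```

Clusters from Set A and Set B that share the same key (order number and seed) are the ones that need to be compared in the later steps.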

40 Mapping tool for NGS (16): In Step 3, a (frequency) distance graph is constructed to link each cluster in Set A with the clusters in Set B that have the same seed. In Step 4, the frequency vector (FV) is calculated for the other (non-seed) segment(s) of each pattern. For each pattern in a cluster of Set A, the frequency distances (FDs) are computed by comparing its FV with those of the patterns in the linked clusters of Set B. Patterns with FD > d are filtered out as non-candidates. For each remaining candidate pattern, the Hamming distance (HD) is calculated to determine whether it is a real candidate (HD ≤ d) or not ((14,2) error problem). In Step 5, for a read with two patterns, the patterns may overlap by (2l-RL) bp or be separated by a distance of RL-2l (length signature).
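
The filter-then-verify idea of Step 4 can be sketched in Python. The slide does not define the frequency distance, so the definition used here (the larger of the total surplus and the total deficit between two base-count vectors) is an assumption; for equal-length strings it never exceeds the Hamming distance, which is what makes it a safe filter. For simplicity the sketch compares whole patterns rather than only the non-seed segments, and all names are hypothetical.

```python
def frequency_vector(s):
    """Counts of A, C, G, T in a string."""
    return tuple(s.count(b) for b in "ACGT")


def frequency_distance(u, v):
    """Assumed definition: max of total surplus and total deficit."""
    surplus = sum(a - b for a, b in zip(u, v) if a > b)
    deficit = sum(b - a for a, b in zip(u, v) if b > a)
    return max(surplus, deficit)


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def real_candidates(pattern, genome_patterns, d):
    """Keep genome positions that pass the FD filter and then the HD check."""
    fv = frequency_vector(pattern)
    hits = []
    for pos, g in genome_patterns:
        if frequency_distance(fv, frequency_vector(g)) > d:
            continue                  # cheap filter: cannot be within d
        if hamming(pattern, g) <= d:  # exact verification
            hits.append(pos)
    return hits


genome_side = [(0, "ACGTAC"), (1, "ACGTAA"), (2, "CATGCA"), (3, "GGGGGG")]
hits = real_candidates("ACGTAC", genome_side, 1)
print(hits)
```

The pattern at position 2 has the same base counts as the query (FD = 0) but differs at every position, which shows why the Hamming check is still needed after the filter.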

41 Step 2 ~ Step 4 of the mapping phase in FRESCO

42 Mapping tool for NGS (17): The Solexa short reads used in the case study were obtained by deep sequencing of small RNA libraries from chicken embryo, Gene Expression Omnibus accession no. GSE10686 (Glazov et al., Genome Res. 2008). There are 702,906 Solexa small RNA reads with a length of 33 bp. The reference genome is the chicken genome (galGal3, ~1 Gbp genome size) of May 2006 (Hillier et al., Nature 2004) from UCSC. All experiments were performed on a 32-bit x86 machine with FreeBSD v6, an Intel Core2 Quad CPU at 2.83 GHz and 4 GB of memory, using a single thread.

43

44 Mapping tool for NGS (18): The memory usage can be controlled by users, and FRESCO can deal with a large-scale genome and a large amount of reads on a desktop PC with 1 GB of memory. Execution time is a problem for FRESCO due to duplicate patterns. We developed a new version of FRESCO based on Graphics Processing Units.

45 CUDA-FRESCO (1) (SBBS10): In the GPU version, we replaced the distance graph and frequency distance with a seed and look-up table and an error table. The major cost of mapping reads is checking all of the possible candidates, so this check is done on GPUs.

46 Mapping phase in CUDA-FRESCO. Flowchart labels: unknown RNA reads; read processing A; genome; genome processing; mapping phase (on GPU); read processing B; check mapped position; mapping results.

47 CUDA-FRESCO (2): Genome processing: two structures are used to store the mapping data of the genome, a seed and hash look-up table (used for exact matches) and a genome sliding window array (used for mismatches). Mapping phase: Step 1: build an ERROR table (in constant memory) to query the number of mismatches. Step 2: find all candidate positions of patterns (1 pattern/block). Segments are stored in global memory, accessed by coalesced reads and deposited in shared memory (2 segments/pattern). The hash look-up table is stored in texture memory.

48 CUDA-FRESCO (3): Step 3: use each candidate position as an index into the genome sliding window array (global memory) and do an exclusive-OR comparison with the other segment. Step 4: the resulting value is used as the subscript (index) into the ERROR table to obtain the number of mismatches (<d, mapped position). Check mapping position phase: check the length signature, then record the mapped positions and other information about the reads to disk.
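
The exclusive-OR comparison and ERROR table look-up of Steps 3 and 4 can be sketched on the CPU in Python. The slides do not give the bit encoding or the table size, so the 2-bit code per base and the 256-entry table (one byte, i.e. four bases, per look-up) are assumptions, as are all names below; the real CUDA-FRESCO kernel keeps its table in constant memory on the GPU.

```python
# Assumed 2-bit encoding of bases.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(segment):
    """Pack a string of bases into an integer, two bits per base."""
    value = 0
    for base in segment:
        value = (value << 2) | CODE[base]
    return value


# ERROR[x] = number of non-zero 2-bit groups in the byte x, i.e. the
# number of mismatching bases among the four bases that byte covers.
ERROR = [sum(1 for shift in range(0, 8, 2) if (x >> shift) & 3)
         for x in range(256)]


def mismatches(read_segment, genome_segment):
    """Count mismatches between equal-length segments via XOR + table look-up."""
    x = pack(read_segment) ^ pack(genome_segment)
    total = 0
    while x:
        total += ERROR[x & 0xFF]   # look up one byte (four bases) at a time
        x >>= 8
    return total


same = mismatches("ACGTACGT", "ACGTACGT")
two = mismatches("ACGTACGT", "ACGAACGA")
print(same, two)
```

Any differing base produces a non-zero 2-bit group in the XOR result, so the table look-up replaces a base-by-base comparison loop with a handful of indexed reads.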

49 Experimental Test (1): We implemented CUDA-FRESCO on a single NVIDIA GeForce GTX 260 graphics card installed in a PC with an Intel quad-core i7 920 2.6 GHz CPU and 12 GB of DDR3-1333 RAM running the Linux operating system. The test uses the same Solexa short reads with a 100 Mbp chicken genome. CUDA-FRESCO running on a single GeForce GTX 260 card achieves a speedup of more than 63x compared with sequential FRESCO running on an Intel i7 920 CPU.

50 Experimental Test (2): In overall computing time, CUDA-FRESCO achieves a speedup of more than 20x compared with FRESCO. In addition, CUDA-FRESCO even outperforms SOAP running with 4 threads on an Intel quad-core i7 920 CPU.

51 CUDA-FRESCO execution time analysis (secs): 63x speedup

52 Comparison of overall computing time

