Presentation is loading. Please wait.

Presentation is loading. Please wait.

High-Throughput Sequencing

Similar presentations


Presentation on theme: "High-Throughput Sequencing"— Presentation transcript:

1 High-Throughput Sequencing
Richard Mott with contributions from Gil Mcvean, Gerton Lunter, Zam Iqbal, Xiangchao Gan, Eric Belfield

2 Sequencing Technologies
Capillary (eg ABI 3700) Roche/454 FLX Illumina GAII ABI Solid Others….

3 Capillary Sequencing based on electrophoresis
used to sequence human and mouse genomes read lengths currently around 600bp (but used to be bp) relatively slow – 384 sequences per run in x hours expensive ???

4 Capillary Sequencing Trace
ACGT represented by continuous traces. Base-calling requires the identification of well-defined peaks

5 PHRED Quality Scores PHRED is an accurate base-caller used for capillary traces (Ewing et al Genome Research 1998) Each called base is given a quality score Q Quality based on simple metrics (such as peak spacing) callibrated against a database of hand-edited data Q = 10 * log10(estimated probability call is wrong) 10 prob = 0.1 20 prob = 0.01 30 prob = 0.001 [Q30 often used as a threshold for useful sequence data]

6 Capillary sequence assembly and editing
CONSED screen shots

7 Illumina Sequencing machines
GA-II HiSeq

8 The Illumina Flow-cell
Each flow-cell has 8 lanes (16 on HiSeq) A different sample can be run in each lane It it possible to multiplex up to 12 samples in a lane Each lane comprises 2*60=120 square tiles Each tile is imaged and analysed separately Sometimes a control phiX lane is run (in a control, the genome sequenced is identical to the reference and its GC content is not too far from 50%)

9

10

11 Illumina GA-II “traces”
Discontinuous – a set of 4 intensities at each base position

12 Cross talk: base-calling errors
Whiteford et al Bioinformatics 2009

13 Base-calling errors Typical base-calling error rate ~ 1%,
Error rate increases towards end of read Usually read2 has more errors than read1

14 Assessing Sequence Quality
Example summary for a lane of 51bp paired-end data # reads % optical duplicates % good quality reads (passed chastity filter) % mapped to reference using ELAND (the read-mapper supplied by Illumina) % of differences from reference (upper bound on error rate) in mapped reads

15 Illumina Throughput (April 2013)
Machine Read Length Run Time Output/lane Output/flow cell cost/lane HiSeq2000 2 x 50bp 1 week 15-18Gb Gb Gb 2 x 100 bp 12 days Gb £1.5-2k HiSeq2500 2 x 150bp >30hrs >4.5Gb £0.85k MiSeq 2 x 150 bp 24-27hrs 40-60Gb £2.5-3k Note: 1 human genome = 3Gb. 20-30x coverage of one human genome = 2=3 lanes of HiSeq

16 Illumina HiSeq Throughput (Feb 2011)
Read Length Run Time Output 1 x 35bp 1.5 days 26-35 Gb 2 x 50bp 4 days Gb 1 x 100 bp 8 days Gb Note: 1 human genome = 3Gb.

17 Illumina GA-II throughput per flow-cell
Note: these are correct as of February 2010. Output is constantly improving due to changes in chemistry and software. Consumables costs are indicative only - they don’t include labour, depreciation, overheads or bioinformatics - true costs are roughly double

18 Pooling and Multiplexing
Primer Primer Read Barcode 6bp Up to 96 distinct barcodes can be added to one end of a read useful for low-coverage sequencing of many samples in a simple lane Up to a further 96 barcodes can be added to other end of a read = 96*96 = 9216 samples Useful for bacterial sequencing

19 Pooling Costs Library Preps
£ per sample, depending on type of sequencing £<50 per sample for 96-plex genomic DNA Pooling costs are dominated by library prep, not HiSeq lane costs eg 96-plex of gDNA on on HiSeq lane = £4k

20 Data Formats Sequencing produces vast amounts of data
Rate of data growth exceeds Moore’s law

21 The FastQ format (standard text representation of short reads)
A FASTQ text file normally uses four lines per sequence. Line 1 begins with a character and is followed by a sequence identifier and an optional description (like a FASTA title line). Line 2 is the raw sequence letters. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. The letters encode Phred Quality Scores from 0 to 93 using ASCII 33 to 126 Example @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

22 Binary FastQ Computer-readable compressed form of FASTQ
About 1/3 size of FASTQ Enables much faster reading and writing Standard utility programs will interconvert (eg. maq) Becoming obsolete……

23 SAM and SAMTOOLS http://samtools.sourceforge.net/
SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. SAM aims to be a format that: Is flexible enough to store all the alignment information generated by various alignment programs; Is simple enough to be easily generated by alignment programs or converted from existing alignment formats; Is compact in file size; Allows most of operations on the alignment to work on a stream without loading the whole alignment into memory; Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus. SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.

24 BAM files SAM, BAM are equivalent formats for describing alignments of reads to a reference genome SAM: text BAM: compressed binary, indexed, so it is possible to access reads mapping to a segment without decompressing the entire file BAM is used by lookseq, IGV and other software Current Standard Binary Format Contains: Meta Information (read groups, algorithm details) Sequence and Quality Scores Alignment information one alignment per read

25 Inside a BAM file samtools view -h WTCHG_7618.bam
@HD VN:1.0 GO:none SO:coordinate @SQ SN:chr10 LN: @SQ SN:chr11 LN: ... @SQ SN:chrX LN: @SQ SN:chrY LN: @RG ID:WTCHG_7618 PL:ILLUMINA PU:101001_GAII06_00018_FC_5 LB:070/10_MPX SM:1772/10 CN:WTCHG @PG ID:stampy VN:1.0.5_(r710) CL:--processpart=1/4 --readgroup=ID:WTCHG_7618,SM:1772/10,PL:ILLUMINA,PU:101001_GAII06_00018_FC_5,LB:070/10_MPX,CN:WTCHG --keepreforder --solexa -v0 -g /tmp/Human37 -h /tmp/Human37 -M s_5_1_sequence.txt,s_5_2_sequence.txt --bwaoptions=-t 2 -q10 /tmp/Human37 -o s_5.1_stampy.sam @PG ID:stampy.1 VN:1.0.5_(r710) CL:--processpart=2/4 --readgroup=ID:WTCHG_7618,SM:1772/10,PL:ILLUMINA,PU:101001_GAII06_00018_FC_5,LB:070/10_MPX,CN:WTCHG --keepreforder --solexa -v0 -g /tmp/Human37 -h /tmp/Human37 -M s_5_1_sequence.txt,s_5_2_sequence.txt --bwaoptions=-t 2 -q10 /tmp/Human37 -o s_5.2_stampy.sam @PG ID:stampy.2 VN:1.0.5_(r710) CL:--processpart=3/4 --readgroup=ID:WTCHG_7618,SM:1772/10,PL:ILLUMINA,PU:101001_GAII06_00018_FC_5,LB:070/10_MPX,CN:WTCHG --keepreforder --solexa -v0 -g /tmp/Human37 -h /tmp/Human37 -M s_5_1_sequence.txt,s_5_2_sequence.txt --bwaoptions=-t 2 -q10 /tmp/Human37 -o s_5.3_stampy.sam @PG ID:stampy.3 VN:1.0.5_(r710) CL:--processpart=4/4 --readgroup=ID:WTCHG_7618,SM:1772/10,PL:ILLUMINA,PU:101001_GAII06_00018_FC_5,LB:070/10_MPX,CN:WTCHG --keepreforder --solexa -v0 -g /tmp/Human37 -h /tmp/Human37 -M s_5_1_sequence.txt,s_5_2_sequence.txt --bwaoptions=-t 2 -q10 /tmp/Human37 -o s_5.4_stampy.sam @CO TM:Tue, 26 Oct :21:06 BST WD:/data1/GA-DATA/101001_GAII06_00018_FC/Data/Intensities/BaseCalls/Demultiplexed /004/GERALD_ _johnb.2 HN:comp04.well.ox.ac.uk UN:johnb @CO IX:GCCAAT SN:085 B-cell ID:070/10_MPX GE:Human37 SR:gDNA Indexed CT:false PR:P SM:1771/10 WTCHG_7618:5:40:5848:3669#GCCAAT 145 chr I40M chr TCAGAAAAAAGAAAATGTGGTATATATACACAATGGAGTACTATTCAGCCC GFIIIIIIIIHIIIIHIIIIIIIIFIIHIIHIIHIIIIIHIIIIIIIHHII PQ:i:394 SM:i:0 UQ:i:250 MQ:i:96 XQ:i:40 RG:Z:WTCHG_7618 WTCHG_7618:5:77:5375:15942#GCCAAT 99 chr M = GCAGGGAGAATGGAACCAAGTTGGAAAACACTCTGCAGGATATTATCCAGG GBBHHEBG<GGGGGGEGGGEGDGDGGDBGDHHHFHGGEGEBGGDGHEHHFH PQ:i:91 SM:i:0 UQ:i:38 MQ:i:0 XQ:i:33 RG:Z:WTCHG_7618 WTCHG_7618:5:77:5375:15942#GCCAAT 147 chr M = AGCTGATCTCTCAGCAGAAACCGTACAAGCCAGAAGAGAGTGGGGGCCAAC PQ:i:91 SM:i:0 UQ:i:33 MQ:i:0 XQ:i:38 RG:Z:WTCHG_7618 WTCHG_7618:5:49:18524:13016#GCCAAT 163 chr M = CCCATCTCACGTGCAGAGACACACATAGACTCAAAATAAAAGGATGGAGGA EHHIIIHIIIIHIIIIIFIDIIIEGEIIIHIHIIIIIIHHIHBDHEGFDEI PQ:i:57 SM:i:0 UQ:i:0 MQ:i:0 XQ:i:39 RG:Z:WTCHG_7618 WTCHG_7618:5:2:1789:11020#GCCAAT 161 chr M2D49M chrM GTGGGTTGCAATCCTAGTCTCTGATAAAACAGACTTTAAACCAATAAAGAT GGGGG>DDBGGGGGGIIGIBDFGE?IIDIHIIIIBIGIBIHIIHII<DAI< PQ:i:192 SM:i:0 UQ:i:48 MQ:i:96 XQ:i:0 RG:Z:WTCHG_7618 WTCHG_7618:5:5:8834:6028#GCCAAT 163 chr M = AGAAGAGCTAACTATCCTAAATATATATGCACCCAATACAGGAGCACCCAG EIIIIHHHGGIDIIIHEGIGIHGIGIDFIGBGGGEGGGGGIHDHIDIIHGH PQ:i:23 SM:i:0 UQ:i:0 MQ:i:0 XQ:i:0 RG:Z:WTCHG_7618 WTCHG_7618:5:49:18524:13016#GCCAAT 83 chr M = CCAATACAGGAGCACCCAGATTCATAAAGCAAGTCCTGAGTGACCTACAAT BHHHHGHHHHHHHHFHHHHHHHGHHHGGGGBHHEHHHHHHHHHHHHHHHHH PQ:i:57 SM:i:0 UQ:i:39 MQ:i:0 XQ:i:0 RG:Z:WTCHG_7618 WTCHG_7618:5:5:8834:6028#GCCAAT 83 chr M = TACCCAGGAATTGAACTCAGCTCTGCACCAAGCAGACCTAATAGACATCTA DEHIIIHIIIIDIGIHFHHGIHIGIIIIIIIIHIGIIIHIIHIIIIIIGII PQ:i:23 SM:i:0 UQ:i:0 MQ:i:0 XQ:i:0 RG:Z:WTCHG_7618 samtools view -h WTCHG_7618.bam

26 SAMtools A package for manipulating sequence data
import: SAM-to-BAM conversion view: BAM-to-SAM conversion and subalignment retrieval sort: sorting alignment merge: merging multiple sorted alignments index: indexing sorted alignment faidx: FASTA indexing and subsequence retrieval tview: text alignment viewer pileup: generating position-based output and consensus/indel calling Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25,

27 Pileup Alignments seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<& seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+ seq1 274 T 23 ,.$....,,.,.,...,,,., <7;<;<<<<<<<<<=<;<;<<6 seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<< seq1 276 G T,,.,.,...,,,., ;+<<7=7<<7<&<<1;<<6< seq1 277 T ,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&< seq1 278 G ,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<< seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<

28 Applications

29 Genome Resequencing Align reads to reference genome
assumed to be very similar, most reads will align Identify sequence differences SNPs, indels, rearrangements Focus may be on producing a catalogue of variants (1000 genomes) producing a small number of very accurate genomes (mouse, Arabidopsis) Generate new genome sequences

30

31 Mapping Accuracy in simulated human data
Effects of Indels

32 Read Mapping: read length matters
E.Coli Mb S. cerevisiae Mb A thaliana 120 Mb H sapiens Gb : Genome Res.

33 Read Mapping(1): Hashing
Each nucleotide can be represented as a 2-bit binary number A=00, C=01, G=10, T=11 A string of K nucleotides can be represented as a string of 2K bits eg AAGTC = Each binary string can be interpreted as a unique integer All DNA strings of length K can be mapped to the integers 0,1,…..4K-1 k= ,535 k= ,143 k= ,048,575 k= ,194,303 k= ,777,215 k= ,108,864 (effective limit for 32-bit 4-byte words) Can use this relationship to index DNA for fast mapping Need not use contiguous nucleotides – spaced seeds, templates Trade-off between unique indexing/high memory use

34 MAQ Package for read mapping, SNP calling and management of read data and alignments Genome Research 2008 Easy to use - unix command-line based Although no longer state of the art and comparatively slow , generally produces good results

35 MAQ Read Mapping Indexes all reads in memory and then scans through genome Uses the first 28bp of each read for seed mapping Guarantees to find seed hits with no more than two mismatches, and it also finds 57% of hits with three mismatches Uses a combination of 6 hash tables that index different parts of each read to do this Defines a PHRED-like read mapping quality Qs = −10log10 Pr{read is wrongly mapped}. Based on summing the base-call PHRED scores at mismatched positions Reads that map equally well to multiple loci are randomly assigned one location (and have Q=0) Uses mate pair information to look for pairs of reads correctly oriented within a set distance Defines mapping quality for a pair of consistent reads as the sum of their individual mapping qualities

36 Read-Mapping (2) Bowtie, BWA, Stampy
all use the Burrows-Wheeler transform

37 Burrows Wheeler transform
Represents a sequence in a form such that The original sequence can be recovered Is more compressible (human genome fits into RAM) similar substrings tend to occur together (fast to find words)

38 Bowtie http://bowtie-bio. sourceforge. net/index
Uses the BWA algorithm Indexes the genome, not the reads Not quite guaranteed to find all matching positions with <= 2 mismatches in first 28 bases (Maq’s criterion) Very fast (15-40 times faster than Maq) Low memory usage (1.3 Gb for human genome) Paper focuses on speed and # of mapped reads, not accuracy. “[…] Bowtie’s sensitivity […] is equal to SOAP’s and somewhat less than Maq’s”

39

40 Stampy (Gerton Lunter, WTCHG)
“Statistical Mapper in Python” (+ core in C) Uses BWA and hashing <= 3 mismatches in first 34 bp  match guaranteed More mismatches: gradual loss of sensitivity Algorithm scans full read, rather than just beginning (and no length limit) Handles indels well: Reads are aligned to reference at all candidate positions Faster than Maq, slower than Bowtie 2.7 Gb memory (shared between instances)

41 Performance – sensitivity
Mapping sensitivity Indel size Top panel: Sensitivity for reads with indels Right-hand panel: Sensitivity as function of divergence (Genome: human)

42 Viewing Read Alignments: lookseq http://www. sanger. ac

43 Viewing Read Alignments: IGV http://www.broadinstitute.org/igv/

44 Variant calling A hard problem, several SNP callers exist eg MAQ/SAMTOOLS, Platypus (WTCHG) GATK (Broad) Issue is to distinguish between sequencing errors and sequence variants If variant has been seen before in other samples then problem is easier genotyping vs variant discovery VCF Variant call format is now standard file format MAQ Assumes genome is diploid by default Error model initially assumes that sequence positions are independent, attempts to compute probability of sequence variant Has to use number of heuristics to deal with misalignments SNP caller now part of SAMTOOLS varFilter.pl acts as a filter on a large number of statistics tabulated about each sequence position

45 Problems with Variant Calling
Variant Calling is difficult because a diploid genome will have two haplotypes present, which can differ significantly, eg due to polymorphic indels should be easier with haploid or inbred genomes but even harder when looking at low-coverage pools of individuals (eg 1000 genomes) Coverage can vary depending on GC content problem is sporadic Optical duplicates may give the impression there is more support for a variant often all reads with the same start and end points are thinned to a single representative, but this can cause problems if the coverage is very high read misalignments can produce false positives repetitive reads can be mapped to the wrong place indels near the ends of reads can cause local read misalignments, where mismatches (SNPs) are favoured over indels very divergent sequence is hard to align may fail to give any mapping signal and will look like a deletion problem addressed by local indel realignment (GATK)

46 GC content can affect read coverage
(Arabidopsis data from Plant Sciences, thanks to Eric Belfield and Nick Harberd) coverage global GC content Possible causes: Sanger identified that melting the gel slice by heating to 50 °C in chaotropic buffer decreased the representation of A+T-rich sequences. Nature methods | VOL.5 NO.12 | DECEMBER 2008 PCR bias during library amplification. Nature methods | VOL.6 NO.4 | APRIL 2009 | 291

47 Deletions can cause SNP artefacts, by inducing misalignments at ends of reads
Arabidopsis Data from Eric Belfield and Nick Harberd, Plant Sciences, Oxford

48 de-Novo genome assembly
No close reference genome available Harder than resequencing Only about 80-90% of genome is assembleable due to repeats contiguation scaffolding Different Algorithms More data required: greater depth of coverage range of paired-end insert sizes

49 Assembling Genomes from Scratch de-novo assembly
Software include: VELVET ABySS ALL_PAIRS SOAPdenovo CORTEX

50 Computational limitations
Traditional approach to take reads as fundamental objects, and build algorithms/data structures to encode their overlaps essentially quadratic in the number of reads Next-generation sequencing machines generate too many reads! simply holding the base-calls requires tens of terabytes for large projects analysis produces lots of large intermediate files Whatever we do, it has to scale slower than coverage

51 The de Bruijn Graph a representation of all possible paths joining reads together Pevsner, PNAS 2001
Choose a word length k (5 in this example, but larger in applications) AACTACTTACGCG AACTA

52 The de Bruijn Graph AACTAACTACGCG AACTA ACTAA

53 The de Bruijn Graph AACTAACTACGCG AACTA ACTAA CTAAC

54 The de Bruijn Graph AACTAACTACGCG AACTA ACTAA CTAAC TAACT

55 The de Bruijn Graph AACTAACTACGCG AACTA ACTAA CTACT TAACT

56 Same sequence, different k=3
ACTACTACTGCAGACTACT TGC TAC CTA CTG GCA ACT CAG GAC AGA

57 Same sequence, different k=17
ACTACTACTGCAGACTACT ACTACTACTGCAGACTA TACTACTGCAGACTACT CTACTACTGCAGACTAC

58 Recovering unambiguous contigs
bulge– two different paths; in a diploid genome both might be correct

59 Outline of de-Novo Assembly with the deBruijn Graph
: Genome Res.

60 : Genome Res.

61

62 Examples

63

64

65

66 Resequencing Inbred Lines
Mouse 15 inbred strains, at Sanger (PI David Adams) 2.8Mbp sequenced to 20x Arabidopsis thaliana (plant model) 19 inbred accessions, here 120Mb sequenced to 20-30x

67 Iterative Reassembly (IMQ/DENOM) Xiangchao Gan
Basic idea Should only be one haplotype present (but not always true) Align reads to reference (Stampy) Identify high-confidence SNPs and indels (SAMTOOLS) Modify reference accordingly Realign reads to modified reference Iterate until convergence 5 iterations usually sufficient Combine with denovo assembled contigs to improve assembly

68 Iterative reassembly of inbred strains

69 Iterative Alignment + deNovo Assembly

70 Low Coverage sequencing
“Cheap” alternative to SNP genotyping chips Sequence populations at <1x coverage Impute compute genotypes from population data + haplotype data (1000 genomes…)

71 CONVERGE study of Major Depression
12000 Chinese Women 6000 cases with Major Depression 6000 matched controls Sequenced at ~1x coverage

72

73 Imputation from 1x coverage Comparison with 16 samples genotyped on Illumina Omnichip
orange is after imputation with 570 asian Thousand Genomes Project haplotypes green is before imputation (just using genotype likelihoods)

74 Outbred Mice 2000 commercial outbred mice
descended from standard laboratory inbred strains phenotyped for ~300 traits sequenced at ~ 0.1x coverage

75 Haplotype reconstruction as probabalistic mosaics

76 QTLs

77 Other Applications RNA-Seq Chip-Seq Methylation Mutation detection
gene expression Chip-Seq DNA-protein binding sites Histone marks Nucleosome positioning DNAse hypersensitive sites Methylation bisulphite sequencing Mutation detection from mutagenesis experiments from human trios Multiplex Pooling random genotyping from low coverage read data


Download ppt "High-Throughput Sequencing"

Similar presentations


Ads by Google