Presentation on theme: "Class 12 – 2nd next generation seq. method"— Presentation transcript:
1 Class 12 – 2nd next generation seq. method. High-throughput DNA sequencing using “bridging” (surface) PCR and reversible terminator chemistry. Article from Illumina, Nature 456:53 (2008)
2 Why sequence? 1. basic biology: determine amino acid seq. of proteins; learn role of non-coding seq.; study evolutionary relationships – can help identify functional regions of DNA. 2. medical applications: some mutations cause disease (CF, SCD); shed light on disease mechanism; some sequence variants assoc. w/ disease, drug sensitivity (“personalized med”); diagnose microbial infections. 3. non-medical applications – e.g. plant engineering. 4. forensics – individual identification
3 Key ideas and innovations in Illumina method. Biochemistry: “bridging” PCR to get array of ~10^8 DNA spots on glass slide, each containing ~10^4 copies of an individual ~200 bp DNA species in ~1 µm area; sequencing by synthesis, 1 base at a time, using dNTPs with removable fluors and 3’ blocking groups; reading ~35 b from both ends of each DNA species to get seq. that should be a known distance apart in ref. seq. Image analysis: automated collection and analysis of ~10^6 microscope images/run. Informatics: mapping short seq. runs to genome
4 How does Illumina method differ from Sanger sequencing? (Each item: Sanger step -> Illumina replacement.) 1. clone DNA in bacteria to get many copies needed for adequate signal -> “bridging” PCR to create surface array of clusters of DNA fragments, each cluster containing many copies of a single DNA species. 2. seq. rxn: DNA template + primer + DNA pol + dNTP + 4 diff. dye-labeled ddNTP chain terminators -> reversibly 3’-blocked dye-labeled dNTPs (each a diff. color), extend 1 base, then remove 3’ block, add next base, etc. 3. electrophoresis to size-separate DNA species -> sequential photos to see order of base add. in each cluster
5 First challenge – how to assemble multiple copies of individual templates on solid surface where sequencing will be done. Shear genomic DNA (nebulizer) into segments ~200 – 2000 bp; “blunt” ends w/ DNA pol. Ligate “forked” adapter oligonucleotide. PCR w/ oligos complementary to adapter seq. forked ends -> A, B at 5’-ends of alternate strands of all fragments. [Figure labels: adapter strands A, A’, B, B’]
6 Substrate = glass flow cell, 8 channels ~100 µm height, thin layer of polyacrylamide applied in each channel. Polyacrylamide contains bromo… (BRAPA) which covalently links to phosphorothioate group on 5’ end of new primers. 3’ ~20 bases of attached primers match those of oligo A or B used to PCR the genomic fragments, so melted amplified genomic fragments anneal to the attached primers. Primer ext. w/ DNA pol makes copy of 1 strand of particular genomic fragment at some spot on surface. Next challenge – make multiple copies of each fragment in small region on substrate surface (to have enough copies to get a strong sequencing signal)
7 Now melt off template. Newly synthesized strand anneals at its 3’-end to nearby, 5’-attached oligo. [Figure labels: A, B, A’, B’] Repetition grows thicket of both strands of particular genomic fragment in small spot on surface = “bridging” PCR; note all strands are covalently attached via 5’ ends. For unexplained reason they do this surface PCR by repeated cycles of chemical rather than thermal denaturation
8 Image of DNA fragments on surface after bridging PCR; each fragment is labeled (during sequencing) with 1 of 4 differently colored fluors by method explained below. Each spot = “polony” or “cluster” of many copies of single DNA fragment. Spot diameters ~1 µm; each spot contains ~10^4 strands -> primers ~10 nm apart; areal density of spots consistent with (c/w) initial conc. of annealed genomic fragments ~3 pM
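The "~10 nm apart" figure can be checked with a little geometry. The sketch below assumes the spot is a circle about 1 µm (micrometre) across holding about 10^4 strands spread evenly; the function name and the even-spacing model are illustrative assumptions, not something stated on the slide.

```python
import math

def mean_strand_spacing_nm(spot_diameter_um: float, strands_per_spot: float) -> float:
    """Approximate spacing between strands spread evenly over a circular spot.

    Assumption: each strand occupies an equal square patch of the spot area,
    so the spacing is the square root of the area per strand.
    """
    radius_nm = spot_diameter_um * 1000.0 / 2.0      # µm -> nm
    area_nm2 = math.pi * radius_nm ** 2              # area of the circular spot
    area_per_strand_nm2 = area_nm2 / strands_per_spot
    return math.sqrt(area_per_strand_nm2)

spacing = mean_strand_spacing_nm(spot_diameter_um=1.0, strands_per_spot=1e4)
print(f"{spacing:.1f} nm")  # prints "8.9 nm", i.e. roughly 10 nm
```

A 1 mm spot with the same strand count would put strands several micrometres apart, which is one way to see that the spot size must be on the micrometre scale.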
9 Next challenge – how to make surface PCR’d DNA single-stranded to serve as sequencing template. Clever method – cut one strand of DNA at chemically sensitive site (*) engineered in oligo B, then melt off non-covalently attached DNA, add free primer B that anneals to distal (3’) end of attached template, extend B w/ pol
10 How to make the single-strand cut? Put diol-modified base in attached oligo B; diol can be chemically cleaved by periodate. How to sequence other end of template? After sequencing 1st strand, melt off primer-ext. product, perform another cycle of bridging PCR (ii), make single-stranded cut in attached oligo A, melt off oligo A extension product, seq. w/ soluble primer A. [Figure labels: A, B, diol, step ii]
11 Note you need a new way to make ss cut in oligo A so you can make the A and B cuts separately; here are 2 ways: Synthesize oligo A with uracil (U) instead of T at given position; enzyme uracil glycosylase removes uracil (not normally in DNA); heat or high pH then breaks A strand at site of removed U. Alternative: put oxoG in place of one G in oligo A; enzyme Fpg glycosylase removes abnormal oxoG; heat or high pH then breaks A strand at site of removed G. Novel use of enzymes that remove abnormal bases (repair mutations in vivo) plus ability to insert abnormal bases during oligo synthesis makes this possible
12 Additional complication: any free 3’ ends on DNA on surface might “fold back” and serve as primer for competing sequence rxn. They block this by enzymatically adding nucleotide w/ blocked 3’ OH group to all DNAs before adding seq. primer
13 How is sequence read biochemically? They synthesized novel nucleotides! [Figure: modified T nucleotide; labels: base, fluor attached to T, sugar, 3’ azide group (N3) blocks extension] A, C and G similarly modified but with diff. colored fluors; only one base is added at a time due to 3’ blocking group
14 Treatment with TCEP removes fluor and 3’ blocking group, which allows next nucleotide to be added and its color detected (prev. fluor is removed)
15 Amazing that bulky, unnatural chemical groups left attached do not inhibit polymerase, or mess up base-pairing. They say they had to engineer (mutate) DNA polymerase to get it to incorporate these modified bases efficiently. This is another innovative step!
16 Repeated cycles of flowing in polymerase plus 4 modified nucleotides (1 of which gets incorporated in given spot), washing, taking picture, treating with TCEP -> sequence. Picture taken at step n during sequencing run; all strands in a given cluster label with A, C, G or T depending on sequence at nth base in template strand. How does spot density compare to Ion Torrent?
18 [Figure label: “custom”] Note they use TIRF microscopy to reduce background, only see fluors within < 1 µm of surface. Why “custom” objective?
19 How big is typical microscope field of view (FOV) at 60x magnification? Imagine FOV expanded 60x in each direction and mapped to 3x3 mm CCD. How many images would they need to cover ~10 cm^2 flow cell surface? How long would it take to collect these images serially if they have to move slide 1 FOV between images? Their “custom” lens gives them ?? (0.1 mm)^2 FOV. How many sets of images do they need (1 for each base addition)? How long does it take to collect data for 1 run? ~week
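The questions on this slide are back-of-envelope arithmetic, which can be sketched as follows. The numbers are the slide's own (3x3 mm CCD, 60x magnification, ~10 cm^2 surface, a possible 0.1 mm custom FOV); the function names are illustrative, and tiles are assumed to be square with no overlap.

```python
import math

def field_of_view_um(sensor_side_mm: float, magnification: float) -> float:
    """Side length of the sample area imaged onto a square sensor, in µm."""
    return sensor_side_mm * 1000.0 / magnification

def images_to_tile(surface_cm2: float, fov_side_um: float) -> int:
    """Number of non-overlapping square fields needed to cover a surface."""
    surface_um2 = surface_cm2 * 1e8          # 1 cm^2 = 1e8 µm^2
    return math.ceil(surface_um2 / fov_side_um ** 2)

standard_fov = field_of_view_um(sensor_side_mm=3.0, magnification=60)
print(standard_fov)                          # 50.0 µm on a side
print(images_to_tile(10, standard_fov))      # 400000 fields per imaging pass
print(images_to_tile(10, 100.0))             # 100000 fields with a 0.1 mm FOV
```

Each base addition needs one full pass (times the number of color channels imaged separately), so the per-run image count is this tile count multiplied by the number of cycles and channels.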
20 How do they adjust focus to correct stage drift over hours? Laser spot off-centered on lens and reflected off of surface has different x-y position depending on z-position of slide. Adjusting z so spot is in same x-y position in FOV fixes z so image is in focus
21 Do they need to align the spots in images of the same FOV taken hours apart? Automated spot alignment program. Cross-talk of different fluors – they need to adjust image intensities to correct for “red” fluorescence of “green” fluor, etc. to get best estimate of which dNTP was incorp. If base extension or deblocking is not complete for all strands in cluster, different nucleotides will be incorporated at subsequent steps, and purity of fluorescence signal will erode (phasing prob.). Quality control measures used to decide when base calling is unreliable; e.g. purity filter: intensity of 1 base must be > 0.6 × sum of it plus next brightest base in 1st 12 positions
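The purity filter described on this slide can be written in a few lines. This is a minimal sketch under stated assumptions: intensities are non-negative and already cross-talk corrected, a cluster is rejected if any of its first 12 cycles fails, and the function names are hypothetical rather than taken from the Illumina pipeline.

```python
def purity(intensities):
    """Brightest channel divided by (brightest + second brightest) for one cycle."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0

def passes_purity_filter(cycles, threshold=0.6, n_cycles=12):
    """True if every one of the first n_cycles cycles has purity above threshold.

    cycles: list of per-cycle intensity lists, one value per base (A, C, G, T).
    Assumption: a single failing cycle rejects the whole cluster.
    """
    return all(purity(c) > threshold for c in cycles[:n_cycles])

clean_cluster = [[1000, 100, 50, 20]] * 12
mixed_cluster = clean_cluster[:5] + [[500, 450, 30, 20]] + clean_cluster[6:]
print(passes_purity_filter(clean_cluster))   # True  (purity about 0.91 each cycle)
print(passes_purity_filter(mixed_cluster))   # False (cycle 6 has purity about 0.53)
```

A cluster that is really two overlapping templates tends to show two bright channels early on, which is why the test is applied to the first positions of the read.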
22 # errors determined by sequencing DNA with known seq. [Figure: # errors per 35 bases] Even with QC criteria to select good reads, get only ~35 b reliable seq.!
23 Informatics – mapping short seq. reads to genome. 2 programs used to look for matches betw. the ~35 b end seq. they obtain for a cluster and ref. seq. ELAND – finds all seq. in reference that match first 32 bases of cluster seq., allowing up to 2 mismatches but no gaps; then sees which of these best match cluster seq. at any bases beyond 32. MAQ – more sophisticated in allowing gaps betw. ref. and cluster seq., so picks up more matches with small “indels”, but potentially more errors
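The matching rule attributed to ELAND here (ungapped, up to 2 mismatches in the first 32 bases, then pick the candidate that best matches the rest of the read) can be illustrated with a naive scan. This is only a sketch of the rule as the slide states it: the real program uses an index rather than scanning every position, and all names and the toy data below are made up for illustration.

```python
import random

def hamming(a, b):
    """Number of mismatching positions over the overlapping length of a and b."""
    return sum(x != y for x, y in zip(a, b))

def seed_hits(read, reference, seed_len=32, max_mismatches=2):
    """All reference positions where the read's first seed_len bases match
    with at most max_mismatches substitutions and no gaps."""
    seed = read[:seed_len]
    return [pos for pos in range(len(reference) - len(seed) + 1)
            if hamming(seed, reference[pos:pos + len(seed)]) <= max_mismatches]

def best_hit(read, reference, seed_len=32, max_mismatches=2):
    """Among seed hits, the position with the fewest mismatches over the full read."""
    def mismatches_over_full_read(pos):
        window = reference[pos:pos + len(read)]
        # Bases hanging off the end of the reference count as mismatches.
        return hamming(read, window) + (len(read) - len(window))
    hits = seed_hits(read, reference, seed_len, max_mismatches)
    return min(hits, key=mismatches_over_full_read) if hits else None

# Toy data: a random 500-base "reference" and a 35-base read taken from
# position 100 with two substitutions introduced inside the seed.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(500))
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
read = list(reference[100:135])
for i in (5, 20):
    read[i] = swap[read[i]]
read = "".join(read)
print(best_hit(read, reference))
```

Because gaps are not allowed, a read carrying a small insertion or deletion near its start would fail this test, which is the case the slide says MAQ handles better.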
24 If genome seq. were random, what length seq. would be unique (unlikely to occur more than once)? Complication: some sequences >35 b occur many times; “selfish” genes have replicated and re-inserted in different positions in the genome, e.g. short interspersed nuclear elements (SINEs, Alu) ~300 bp; ~10^6 copies (~10% of genome); long interspersed nuclear elements (LINEs) ~6000 bp; ~10^5 copies (~20% of genome)
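The opening question has a quick numerical answer. In a random genome of G bases, a given sequence of length k is expected to turn up by chance about G / 4^k times, so it is "unique" once that number drops below some small value. The sketch below assumes a 3 Gb genome, equal base frequencies, and counts only one strand; the function name is illustrative.

```python
def min_unique_length(genome_size: float, max_expected_chance_hits: float = 1.0) -> int:
    """Smallest k for which a given k-mer is expected to occur by chance
    fewer than max_expected_chance_hits times in a random genome."""
    k = 1
    while genome_size / 4 ** k >= max_expected_chance_hits:
        k += 1
    return k

print(min_unique_length(3e9))         # 16: fewer than 1 chance occurrence expected
print(min_unique_length(3e9, 0.01))   # 20: fewer than 0.01 chance occurrences expected
```

So 35-base reads would be far more than long enough if the genome were random; the repeats listed on this slide are what make real mapping harder.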
25 Two features help assignment of 35 b reads to correct position in genome: they know the paired end read should map to other DNA strand about 200 bp away in reference sequence; each region of DNA is read many times, so they can just map consensus sequence for any segment
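The second point, that deep coverage lets individual read errors be outvoted, can be shown with a simple per-position majority vote. This is a toy sketch, not the consensus caller used in the paper: reads are assumed to be already placed on the reference without gaps, and the names are illustrative.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority base at each covered reference position.

    aligned_reads: list of (start_position, sequence) tuples.
    Returns a dict mapping position -> most frequent base. Ties are broken
    in favour of the base seen first, which is arbitrary.
    """
    columns = {}
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            columns.setdefault(start + offset, Counter())[base] += 1
    return {pos: counts.most_common(1)[0][0]
            for pos, counts in sorted(columns.items())}

reads = [(0, "ACGTAC"), (0, "ACGTAC"), (0, "ACCTAC")]  # third read has one error
calls = consensus(reads)
print("".join(calls.values()))  # ACGTAC
```

A real caller also has to weigh base qualities and allow for heterozygous sites, where two different bases are both correct.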
26 Tests of quality. How uniformly does their data cover the ref. seq.? If some DNA segments don’t amplify well (? due to high GC content) they might be absent in their seq. If cluster seq. is random sample of ref. seq., Poisson dist. predicts how many times, n, a ref. seq. base should appear in cluster seq.: $p_n = e^{-m} m^n / n!$, where m = aver. # times a base is sampled; m = 130 Gb of cluster seq. / 3 Gb per genome = 43
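The Poisson prediction is easy to evaluate for the coverage quoted on this slide (m = 130 Gb / 3 Gb, about 43). The sketch below computes p_n in log space to avoid overflow at large n; the function names are illustrative.

```python
import math

def poisson_pmf(n: int, m: float) -> float:
    """p_n = e^{-m} m^n / n!, evaluated via logs so large n does not overflow."""
    return math.exp(-m + n * math.log(m) - math.lgamma(n + 1))

def prob_fewer_than(n: int, m: float) -> float:
    """Probability that a base is sampled fewer than n times."""
    return sum(poisson_pmf(i, m) for i in range(n))

m = 130 / 3                                # average fold-coverage, about 43
print(f"{poisson_pmf(0, m):.1e}")          # chance a base is never sampled
print(f"{prob_fewer_than(10, m):.1e}")     # chance of fewer than 10 reads
```

With m near 43 the chance of a base never being sampled is on the order of 10^-19, so under purely random sampling essentially no base of a 3 Gb genome would be missed; real gaps therefore point to non-random effects such as poor amplification or unmappable repeats.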
27 Fig. 2: Take every 50th base of ref. seq.; how many times is an overlapping frag. found in a cluster seq. mapped to the ref. seq.? Make a histogram of the # of such bases found n times in the cluster seq. data set. For interest, consider separately bases that don’t occur in repetitive elements like SINEs and LINEs (unique only). The dist. is pretty close to Poisson (only slt. more samples in tails), so the method seems to sample pretty randomly
28 Does GC content affect how often a region is sampled? Plot # times a particular base is sequenced (in the data set) as function of GC content of seq. in which it occurs. Only cluster sequences with most extreme GC contents were sequenced less than the average ~40 times. So what? If a seq. (with extreme GC content) is under-sampled, you might get only the maternal or paternal copy (allele) in the seq. data set and so miss finding a polymorphism (false negative)
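The false-negative risk from under-sampling can be put in numbers. The sketch below assumes each read covering a heterozygous site comes from the maternal or paternal copy with equal probability and independently of the other reads; the function name is illustrative.

```python
def prob_only_one_allele_seen(depth: int) -> float:
    """Probability that all reads at a heterozygous site come from the same allele.

    With depth reads, each allele is missed with probability 0.5**depth,
    and either allele can be the one that is missed.
    """
    if depth <= 0:
        return 1.0                      # no reads: neither allele is seen
    return min(1.0, 2 * 0.5 ** depth)

for depth in (5, 10, 40):
    print(depth, prob_only_one_allele_seen(depth))
```

At 5 reads the chance of seeing only one allele is about 6%, at 10 reads about 0.2%, and at the average depth of ~40 it is negligible. In practice a caller needs several reads of the variant allele, not just one, so the real miss rate at low depth is higher than this lower bound.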
29 Next evaluation – compare how often SNPs are identified in the seq. vs. SNP hybridization assays (“GT, genotyping”). Note this company makes SNP hybrid. assay, so it is working hard on technology that may replace its current platform! [Table, using ELAND program; columns: std version of hybrid. assay (GT) w/ .5M SNPs; latest version of hybrid. assay w/ >3M SNPs] <1% discordant calls; most often the array assay (GT) finds a SNP missed by seq.
30 Same table, using MAQ program: seq. does slt. better, but in general GT and seq. have similar fail-to-detect rates. [Table column: their new, favorite set of SNPs with least ambiguity] Most GT failures-to-detect are due to person carrying so variant a seq. that it fails to hybridize to anything on the chip. Most seq. failures-to-detect are due to low sampling rate of one allele
31 But seq. picked up ~1M new SNPs in this person! Why? Std SNP panels selected for SNPs that occur fairly frequently in population. This individual is of African ancestry – ? underrepresented in std SNP panel. Maybe most of us carry lots of “private” SNPs that are very rare in the population
32 How can you get information about structural changes larger than 35 bases from 35-base-long reads? Use info from paired end reads! Idea – label ends of genomic DNA segments w/ biotin nucleotides (B) using DNA pol; circularize DNA segments (ligate diluted sample); re-shear DNA; purify biotinylated DNA; make clusters as before and read seq. of ends of junction frags.
33 Now sequence at opposite ends of small frags comes from genomic DNA regions separated by length of circularized fragments; also, orientation wrt each other is flipped. If you can map both end sequences to genome, you can find deletions (end seq. further apart in ref. seq. than circularized fragment length), insertions (end seq. closer together in ref. seq. than circ. frag. length), inversions (orientation reversed)
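The three rules on this slide amount to a small decision procedure on each mapped read pair. The sketch below is an illustration only: the `MappedPair` type, the field names, and the fixed tolerance are assumptions, and a real caller would require several independent pairs supporting the same event before reporting it.

```python
from dataclasses import dataclass

@dataclass
class MappedPair:
    ref_distance: int            # separation of the two ends in the reference, bp
    expected_orientation: bool   # True if the ends face each other as expected

def classify_pair(pair: MappedPair, expected_length: int, tolerance: int) -> str:
    """Label one read pair relative to the known fragment length.

    expected_length: fragment (or circularized fragment) length in the sample.
    tolerance: allowed deviation before a pair is called anomalous (assumed).
    """
    if not pair.expected_orientation:
        return "inversion"
    if pair.ref_distance > expected_length + tolerance:
        return "deletion"     # ends further apart in the reference than in the sample
    if pair.ref_distance < expected_length - tolerance:
        return "insertion"    # ends closer together in the reference than in the sample
    return "concordant"

print(classify_pair(MappedPair(2000, True), expected_length=500, tolerance=100))   # deletion
print(classify_pair(MappedPair(200, True), expected_length=500, tolerance=100))    # insertion
print(classify_pair(MappedPair(500, False), expected_length=500, tolerance=100))   # inversion
```

One consequence of the distance rule is that an insertion longer than the fragment itself cannot be detected this way, since the two ends cannot map closer together than zero.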
34 They identified 1000s of >50 bp deletions, many of which were known selfish DNA elements present in reference seq. but not in the seq. of the person whose DNA they analyzed. 90% of these are SINEs present in reference but not in this individual; 60% are LINEs
35 They also found 2345 insertions. How many are in coding sequences? How many are homozygous?
36 Map of a region containing an inversion flanked by 2 small deletions. What do symbols represent? Note ~2 kb region of ref. seq. with no read pairs (green); “short insert” pairs flanking this region (orange) map to sites ~2 kb apart in ref. but ~.5 kb in this sample (i.e. span deletion)
37 Last level of complexity – bio-medical interpretation of seq. information. Example – variability greater in certain areas of genome, e.g. parts of X chromosome – why?
38 Potentially medically relevant findings – your DNA is likely similar! 26,140 SNPs in protein-coding regions; 5,361 encode non-conservative amino acid changes; 153 encode premature terminations, “many of which are expected to affect protein function” (excerpt of Table 9)
39 Summary – Impressive accomplishment! Innovations in many fields – all needed for useful product. Molecular biology: bridging PCR to get ~10^4 copies of individual fragments arrayed on surface; nicking tricks to convert PCR products to ss for sequencing and getting the complementary ss for sequencing; new dNTPs with reversibly blocked 3’ ends and chemically removable fluors, to seq. 1 base at a time; engineered DNA pols that use these new dNTPs. Photonics, data acquisition, informatics … Lots of detail -> fuller explanation than Ion Torrent
40 Major challenges remaining: quantifying errors; methods for resequencing variants for confirmation; identifying structural variants larger than the pieces of DNA sequenced – e.g. deletions, insertions, duplications, inversions; speeding up (parallelization of) data acquisition; interpretation – clinical significance of variants; implications for human biology
41 Some key ideas you should take away from today: How they get array of spots, each with many copies of a DNA to sequence. How they get sequence, 1 base at a time, using reversible dye terminator chem. How they get information about structural variants larger than the 35 bp runs (paired end reads). How over-sequencing (fold-coverage) helps. How they evaluate seq. accuracy. What kind of mutational load we are all likely to carry in our DNA