Lecture 1 Introduction to high throughput sequencing

2 Lecture 1 Introduction to high throughput sequencing
Michael Brudno, CSC 2431, January 13, 2010. Adapted from presentations by Francis Ouellette, OICR; Michael Stromberg, BC; and Asim Siddiqui, ABI

3 DNA sequencing
How we obtain the sequence of nucleotides of a species
…ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

4 DNA Sequencing
Goal: Find the complete sequence of A, C, G, T’s in DNA
Challenge: There is no machine that takes long DNA as an input and gives the complete sequence as output
Can only sequence ~500 letters at a time

5 Generations of Sequences
Sanger-style: Classic
454: “First Next-gen”
Illumina + ABI SOLiD: “Next-gen”
Helicos: “2.5 Gen”
PacBio: “Next-next-gen”, 3rd gen

6 Why are we sequencing?
Before Next-generation: DNA, RNA, (proteins), (populations)
Problems: sampling, averages, consensus
After Next-generation: Genome sequence and structure; less cloning/PCR; single molecules (for some)

7 Sanger (old-gen) Sequencing vs. Now-Gen Sequencing
Whole Genome
Sanger: Human (early drafts), model organisms, bacteria, viruses and mitochondria (chloroplast), low coverage
Now-Gen: New human (!), individual genome, 1,000 normal, 25,000 cancer matched control pairs, rare samples
RNA
Sanger: cDNA clones, ESTs, Full Length Insert cDNAs, other RNAs
Now-Gen: RNA-Seq: digitization of transcriptome, alternative splicing events, miRNA
Communities: Environmental sampling, 16S RNA populations, ocean sampling, Human microbiome, deep environmental sequencing, Bar-Seq
Other: Epigenome, rearrangements, ChIP-Seq

8 Differences between the various platforms:
Nanotechnology used
Resolution of the image analysis
Chemistry and enzymology
Signal-to-noise detection in the software
Software/images/file size/pipeline
Cost $$$

9 Next Generation DNA Sequencing Technologies
Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome”
Human Genome: 6GB == 6000 MB

Platform | 3730 | 454 | Illumina
Req’d coverage | 6 | 12 | 30
bp/read | 600 | 400 | 2X75
reads/run | 96 | 500,000 | 100,
bp/run | 57,600 | 0.5 GB | 15 GB
# runs req’d | 625,000 | 144 |
runs/day | 2 | 1 | 0.1
Machine days/human genome | 312,500 (856 years) | | 120
Cost/run | $48 | $6,800 | $9,300
Total cost | $15,000,000 | $979,200 | $111,600
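The run counts on this slide follow from one line of arithmetic: runs required = genome size × required coverage / bases per run. The sketch below is a minimal illustration of that calculation, assuming the 6 GB genome size and the per-platform coverage and bp/run figures quoted on the slide; the function name `runs_required` is hypothetical, and any value the code prints that is not legible on the slide is computed here rather than taken from the source.

```python
GENOME_BP = 6_000_000_000  # "Human Genome 6GB" as quoted on the slide

# (required coverage, bases per run) per platform, as quoted on the slide
PLATFORMS = {
    "3730": (6, 57_600),
    "454": (12, 500_000_000),
    "Illumina": (30, 15_000_000_000),
}

def runs_required(genome_bp, coverage, bp_per_run):
    """Machine runs needed to reach the given fold coverage (rounded up)."""
    total_bp = genome_bp * coverage
    return (total_bp + bp_per_run - 1) // bp_per_run  # integer ceiling

for name, (cov, bp_per_run) in PLATFORMS.items():
    print(name, runs_required(GENOME_BP, cov, bp_per_run))
```

Dividing each result by the platform's runs/day gives the machine-days row.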

10 Solexa-based Whole Genome Sequencing
Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome”

11 Illumina (Solexa)

12 Illumina (Solexa)

13 Illumina (Solexa)

14 From Debbie Nickerson, Department of Genome Sciences, University of Washington,

15 What is a base quality?

Base Quality | Perror(obs. base)
3 | 50.12%
5 | 31.62%
10 | 10.00%
15 | 3.16%
20 | 1.00%
25 | 0.32%
30 | 0.10%
35 | 0.03%
40 | 0.01%
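The values on this slide match the standard Phred scale, in which a quality Q corresponds to an error probability of 10^(-Q/10). A minimal sketch that reproduces the slide's numbers (the function name `phred_to_error` is an illustrative choice, not from the lecture):

```python
def phred_to_error(q):
    """Error probability for a Phred-scaled base quality q."""
    return 10 ** (-q / 10)

for q in (3, 5, 10, 15, 20, 25, 30, 35, 40):
    print(f"Q{q}: {phred_to_error(q):.2%}")
```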

16 Next-gen sequencers
From John McPherson, OICR
(Figure: bases per machine run, 1 Mb to 100 Gb, plotted against read length, 10 bp to 1,000 bp.)
AB/SOLiDv3, Illumina/GAII short-read sequencers: 10+ Gb per run, >100M reads, 4-8 days
454 GS FLX pyrosequencer: 0.5-1M reads, 5-10 hours
ABI capillary sequencer: 96 reads, 1-3 hours
Very different types of data. Different run times. Different costs.

17 DNA sequencing – vectors
Shake DNA into fragments
Vector: circular genome (bacterium, plasmid) with a known location (restriction site)
DNA fragments + vector (figure)

18 Method to sequence longer regions
Cut the genomic segment many times at random (Shotgun)
Get two reads of ~500 bp from each segment

19 Reconstructing the Sequence (Fragment Assembly)
Cover the region with reads at ~7-fold redundancy (7X)
Overlap reads and extend to reconstruct the original genomic region
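The overlap-and-extend idea can be sketched with a toy greedy assembler: repeatedly find the pair of reads with the longest exact suffix-prefix overlap and merge them. This is an illustration only, assuming error-free reads and no repeats; it is not the method used by the lecture or by production assemblers, and the names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps of min_len remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        # All-pairs comparison: this is the O(N^2) step in the number of reads.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining reads form separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "CGTTAGC"]))
```

With real data, sequencing errors require approximate overlaps and repeats create false overlaps, which is why greedy merging alone is not sufficient.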

20 Definition of Coverage
Length of genomic segment: L
Number of reads: n
Length of each read: l
Definition: Coverage C = n l / L
How much coverage is enough?
Lander-Waterman model: Assuming uniform distribution of reads, C=10 results in 1 gapped region / 1,000,000 nucleotides
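The Lander-Waterman figure can be checked numerically. Under the model's uniform-placement assumption, the expected number of gaps is approximately n·e^(-C) when the minimum overlap needed to join two reads is ignored. The sketch below assumes 500 bp reads (the read length mentioned earlier in the lecture) over a 1,000,000-nucleotide region; the function names are illustrative.

```python
import math

def coverage(n_reads, read_len, genome_len):
    """Coverage C = n * l / L."""
    return n_reads * read_len / genome_len

def expected_gaps(n_reads, cov):
    """Lander-Waterman estimate of the number of gaps,
    ignoring the minimum overlap required to join reads."""
    return n_reads * math.exp(-cov)

n, l, L = 20_000, 500, 1_000_000   # assumed values giving C = 10
c = coverage(n, l, L)
print(c, expected_gaps(n, c))      # roughly one gap per 1,000,000 nucleotides
```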

21 Challenges with Fragment Assembly
Sequencing errors: ~1-2% of bases are wrong
Repeats: false overlap due to repeat
Computation: ~ O(N^2) where N = # reads

22 History of DNA Sequencing
Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)
1870 Miescher: Discovers DNA
1940 Avery: Proposes DNA as ‘Genetic Material’
1953 Watson & Crick: Double Helix Structure of DNA
1965 Holley: Sequences Yeast tRNA(Ala)
1970 Wu: Sequences λ Cohesive End DNA
1977 Sanger: Dideoxy Chain Termination; Gilbert: Chemical Degradation
1980 Messing: M13 Cloning
1986 Hood et al.: Partial Automation
1990 Cycle Sequencing; Improved Sequencing Enzymes; Improved Fluorescent Detection Schemes
2009 Next Generation Sequencing; Improved enzymes and chemistry; New image processing
Efficiency (bp/person/year), in increasing order along the timeline through 2002 and 2009: 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000

23 Which representative of the species?
Which human?
Answer one:
Answer two: it doesn’t matter
Polymorphism rate: number of letter changes between two different members of a species
Humans: ~1/1,000 – 1/10,000
Other organisms have much higher polymorphism rates

Just beneath the surface of the ocean floats a tiny egg. Within its hull of follicle cells the embryo of Ciona, a sea squirt or Ascidian, is beginning to develop. The shape of the egg does not give any clue of the creature that it is going to be. Now about a third of a millimeter, it is waiting to become a larva, and this larva is a rather surprising creature. The larva is developing within a few weeks. A head and a tail are evolving. When you look carefully you can see a rod that stiffens the tail. We know such a device as a notochord. A trained zoologist would identify the animal as belonging to the Phylum Chordata, and that is the same group we humans belong to. (See footnote). Also the internal organs are beginning to show. But the image is a bit deceiving. What seems to be an eye is in fact a device for equilibrium called the 'Otolith'. Above it we find the light sense organ, the 'Ocellus'. So there is something suspicious about these so-called sea squirts. Freed from its shell a familiar form has appeared. Clearly the resemblance with a tadpole larva is seen. The little creature swims for a few hours to find a good spot somewhere on a solid surface. Then a surprising thing happens. This larva is not going to evolve into a fish, amphibian or anything like that. With the front of its head it attaches itself to a surface. Within minutes resorption of the larva tail commences and the sea squirt will stay on that same spot for all its life.

24 Why humans are so similar
A small population that interbred reduced the genetic variation
Out of Africa ~40,000 years ago

25 Migration of human variation

26 Migration of human variation

27 Migration of human variation

28 Genetic Variations: Why?
Phenotypic differences
Inherited diseases
Ancestral history

29 Genetic Variations: SNPs & INDELs

30 Structural Variations
Paul Medvedev review in prep July 2009

31 SNP Discovery: Goal
(Figure labels: sequencing errors, SNP)

32 SNP Discovery: Base Qualities
High quality Low quality

33 SNPs & Bayesian Statistics
(Equation labels: # of individuals, base quality, allele call in read)
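To make the role of base quality and per-read allele calls concrete, here is a minimal single-sample sketch of Bayesian genotype calling at one site. It is not the formula shown on the slide (which is not recoverable from the transcript and also involves the number of individuals); it assumes a diploid sample, two candidate alleles, a flat prior, independent reads, and errors spread evenly over the three other bases. All names are hypothetical.

```python
import math

def phred_to_error(q):
    """Error probability for a Phred-scaled base quality q."""
    return 10 ** (-q / 10)

def genotype_posteriors(calls, quals, alleles=("A", "C"), priors=None):
    """Posterior over the three diploid genotypes given base calls and qualities."""
    a, b = alleles
    genotypes = [(a, a), (a, b), (b, b)]
    if priors is None:
        priors = [1 / 3] * 3  # assumption: flat prior over genotypes
    log_post = []
    for genotype, prior in zip(genotypes, priors):
        log_p = math.log(prior)
        for call, q in zip(calls, quals):
            e = phred_to_error(q)
            # A read comes from either chromosome with probability 1/2;
            # a miscall lands on each of the 3 other bases with probability e/3.
            p = sum(0.5 * ((1 - e) if call == allele else e / 3)
                    for allele in genotype)
            log_p += math.log(p)
        log_post.append(log_p)
    # Normalise in log space to avoid underflow.
    top = max(log_post)
    weights = [math.exp(x - top) for x in log_post]
    total = sum(weights)
    return {"/".join(g): w / total for g, w in zip(genotypes, weights)}

post = genotype_posteriors("AACACC", [30] * 6)
print(max(post, key=post.get), post)
```

Low-quality bases contribute weaker evidence because their error probability `e` is larger, which is how the model separates sequencing errors from real variants.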

34 SNP Discovery
Haploid: strain 1, strain 2, strain 3
Diploid: individual 1, individual 2, individual 3
Reads carry either AACGTTAGCATA or AACGTTCGCATA (A/C difference at position 7)

35 Genotyping & Consensus Generation
Reads carry either AACGTTAGCATA or AACGTTCGCATA
Haploid: strain 1 [A], strain 2 [C], strain 3 [A]
Diploid: individual 1 [A/C], individual 2 [C/C], individual 3 [A/A]

36 Visualization: Consed

37 1000 Genomes Project

38 1000G: Goals
Discover genetic variations:
1% minor allele frequency (MAF) across the genome
0.1 – 0.5% MAF across gene regions
Variant alleles:
Estimate frequencies
Identify haplotype background
Characterize linkage disequilibrium

39 1000G: Pilot Projects
Pilot 1: Low coverage; 180 samples (70 at 4X, 110 at 2X); 2.7 Tbp total (202 Gbp 454, 1.8 Tbp Illumina, 640 Gbp AB SOLiD)
Pilot 2: Deep trios (CEU & YRI); 6 samples; 1.1 Tbp total (87 Gbp 454, 773 Gbp Illumina, 270 Gbp AB SOLiD)
Pilot 3: Exon capture; 607 samples; 2.2 Mbp of targets; 8800 targets; 10 – 20x coverage

40 Questions about the genome
Obtaining a genome sequence is one step towards understanding biological processes.
Questions that follow from the genome are:
What is transcribed?
Where do proteins bind?
What is methylated?
In other words, how does it work?

41 Central dogma
(Figure: DNA → transcription → mRNA, tRNA, rRNA, snRNA; mRNA → translation → polypeptide)

42 Transcription
The DNA is contained in the nucleus of the cell. A stretch of it unwinds there, and its message (or sequence) is copied onto a molecule of mRNA. The mRNA then exits from the cell nucleus.

43 DNA → RNA
(Figure: a DNA strand and its RNA copy. Base pairing: A = T, G = C; in RNA, T → U)

44 More complexity
The RNA message is sometimes “edited”. Exons are nucleotide segments whose codons will be expressed. Introns are intervening segments (genetic gibberish) that are snipped out. Exons are spliced together to form mRNA.

45 Splicing
frgjjthissentencehjfmkcontainsjunkelm
thissentencecontainsjunk
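The slide's analogy can be written as a few lines of code: given the exon boundaries, splicing keeps the exons in order and discards everything else. The coordinates below (0-based, end-exclusive) were chosen by hand to match the slide's example string, and the function name `splice` is illustrative.

```python
def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive) in order, dropping the rest."""
    return "".join(pre_mrna[start:end] for start, end in exons)

pre = "frgjjthissentencehjfmkcontainsjunkelm"
print(splice(pre, [(5, 17), (22, 34)]))
```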

46 Key player: RNA polymerase
It is the enzyme that brings about transcription by going down the line, pairing mRNA nucleotides with their DNA counterparts.
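The pairing step can be illustrated with a toy function that builds the mRNA complementary to a DNA template strand, using the standard rules A pairs with U, T with A, G with C and C with G. It is a sketch that ignores initiation, termination and strand-direction bookkeeping: the template is assumed to be written 3'→5' so that the output reads 5'→3'. The names are hypothetical.

```python
# DNA template base -> complementary mRNA base
PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5):
    """mRNA (5'->3') complementary to a DNA template strand given 3'->5'."""
    return "".join(PAIRING[base] for base in template_3_to_5)

print(transcribe("TACGGT"))
```

The result equals the coding strand (here ATGCCA) with every T replaced by U.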

47 Promoters
Promoters are sequences in the DNA just upstream of transcripts that define the sites of initiation. The role of the promoter is to attract RNA polymerase to the correct start site so transcription can be initiated.
(Figure labels: 5’, 3’, Promoter)

49 Transcription – key steps
DNA → Initiation → Elongation → Termination → DNA + RNA

50 Genes can be switched on/off
An adult multicellular organism contains a wide variety of cell types, e.g. muscle, nerve and blood cells. The different cell types contain the same DNA, though. This differentiation arises because different cell types express different genes. Promoters are one type of gene regulator.

51 Transcription (recap)
The DNA is contained in the nucleus of the cell. A stretch of it unwinds there, and its message (or sequence) is copied onto a molecule of mRNA. The mRNA then exits from the cell nucleus. Its destination is a molecular workbench in the cytoplasm, a structure called a ribosome.

52 The Transcriptome
The transcriptome is the entire set of RNA transcripts in the cell, tissue or organ. The transcriptome is cell-type specific and time dependent, i.e. it is a function of cell state. The transcriptome can help us understand how cells differentiate and respond to changes in their environment.

53 Transcriptome complexity
Transcripts may be: modified, spliced, edited, degraded.
The transcriptome is substantially more complex than the genome and is time variant.

54 ESTs
ESTs were the first genome-wide scan for transcriptional elements
Different library types: proportional, normalized, subtractive
Can be sequenced from the 5’ or 3’ end

55 “Hello Mr Chips”
Microarray chips introduced in the 90’s
Parallel way to measure many genes
Probes placed on slides
RNA -> cDNA, labelled with fluorescent dye and hybridized; fluorescence measured
Chips have been highly successful:
Simplified analysis
Useful when there is no genome sequence
Linear signal across 500-fold variation
Standardization has aided use in medical diagnostics, e.g. Mammaprint

56 Microarray expression profiling by 2-color assay (“cDNA arrays”)
PCR products: 6250 yeast ORFs
Hybridized cDNAs: green = control, red = experiment
*Schena et al., 1995

57 Chips: pros and cons
Advantages:
Do not require a genome sequence
Highly characterised, with many software packages available
One Affymetrix chip FDA approved
Disadvantages:
Measurements limited to what’s on the array
Hard to distinguish isoforms when used for expression
Can’t detect balanced translocations or inversions when used for resequencing

58 mRNA-seq
Basic work flow:
Align reads (sometimes to the transcriptome first and then the genome)
Tally transcript counts
Align tags to spliced transcripts
Add to transcript counts
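The tallying step of the work flow above can be sketched with a toy model in which "alignment" is just exact substring matching against spliced transcript sequences. Real pipelines use dedicated aligners and tolerate mismatches; the transcript names, sequences and the function `tally_reads` below are made up for illustration.

```python
from collections import Counter

def tally_reads(reads, transcripts):
    """Count reads that match exactly one transcript.

    transcripts: dict mapping transcript name -> spliced sequence.
    Returns (counts, multi_mapped, unmapped).
    """
    counts = Counter()
    multi_mapped, unmapped = [], []
    for read in reads:
        hits = [name for name, seq in transcripts.items() if read in seq]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif hits:
            multi_mapped.append(read)   # ambiguous: matches several transcripts
        else:
            unmapped.append(read)       # may be errors, new exons or new genes
    return counts, multi_mapped, unmapped

transcripts = {
    "geneA": "ATGGCCATTGTAATGGGCC",   # hypothetical spliced transcripts
    "geneB": "ATGAAACGCATTAGCACC",
}
reads = ["GCCATTG", "AACGCAT", "ATG", "TTTTTT"]
print(tally_reads(reads, transcripts))
```

Keeping multi-mapped and unmapped reads in separate lists makes it easy to revisit them later, for example against a library of exon junctions.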

59 Cloonan et al
Used SOLiD to generate 10Gb of data from mouse embryonic stem cells and embryoid bodies
Used a library of exon junctions to map across known splice events

60 Distribution of tags

61 Tag locations

62 General issues
Coverage across the transcript may not be random
Some reads map to multiple locations
Some reads don’t map at all
Reads mapping outside of known exons may represent new gene models or new genes

63 Size of the transcriptome
Carter et al (2005): using arrays, estimated 520,000 to 850,000 transcripts per cell
Use the upper limit and estimate an average transcript size of 2kb
Transcriptome ~2GB
Transcriptome cost ~ genome cost
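The ~2GB estimate is a single multiplication: the upper limit of transcripts per cell times the assumed average transcript length. A small sketch of that arithmetic, using only the figures quoted on the slide (the function name is illustrative):

```python
def transcriptome_size(transcripts_per_cell, avg_transcript_bp):
    """Total bases needed to cover every transcript in a cell once."""
    return transcripts_per_cell * avg_transcript_bp

# Upper limit from Carter et al (2005) and the slide's assumed 2 kb average size
size_bp = transcriptome_size(850_000, 2_000)
print(f"{size_bp / 1e9:.1f} billion bases")
```

The result, 1.7 billion bases, is what the slide rounds to ~2GB, the same order of magnitude as the genome itself.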

64 The Boundome
DNA binding proteins control genome function
Histones impact chromatin structure
Activators and repressors impact gene expression
The location of these proteins helps us understand how the genome works

65 ChIP

66 ChIP-Seq
Instead of probing against a chip, measure directly
Basic work flow:
Align reads to the genome
Identify clusters and peaks
Determine bound sites
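The "identify clusters and peaks" step can be illustrated with a deliberately simple approach: count aligned read start positions in fixed-size bins, keep bins above a threshold, and merge adjacent bins into peaks. This is a sketch under arbitrary assumptions (bin size, threshold, no control sample, no statistical model), not the method of any published peak caller; `call_peaks` is a hypothetical name.

```python
from collections import Counter

def call_peaks(positions, bin_size=100, min_reads=5):
    """Return merged (start, end) intervals of bins holding >= min_reads reads."""
    bins = Counter(pos // bin_size for pos in positions)
    enriched = sorted(b for b, n in bins.items() if n >= min_reads)
    peaks = []
    for b in enriched:
        start, end = b * bin_size, (b + 1) * bin_size
        if peaks and peaks[-1][1] == start:
            peaks[-1] = (peaks[-1][0], end)  # extend the previous peak
        else:
            peaks.append((start, end))
    return peaks

# Toy read start positions on one chromosome
positions = [105, 110, 120, 130, 150, 210, 220, 230, 240, 250, 900]
print(call_peaks(positions))
```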

67 Robertson et al. 2007
Used Illumina technology to find STAT1 binding sites
Comparisons with two ChIP-PCR data sets suggested that ChIP-seq sensitivity was between 70% and 92% and specificity was at least 95%.

68 Tag statistics

69 Typical Profile

70 Mikkelsen et al., 2007
Performed a comparison with ChIP-chip methods
~98% concordance

71 Comparison with ChIP-seq

72 The Methylome
In methylated DNA, cytosines are methylated.
This leads to silencing of genes in the region, e.g. X inactivation.
It is yet another form of transcriptional control and, together with histone modifications, a key component of epigenetics.

73 Bi-sulphite sequencing
Converts un-methylated cytosines to uracil (which becomes thymine when converted to cDNA)
Experimental procedure is difficult
Sequence alignment is tricky, but the basic concepts hold

74 Taylor et al, 2007
Targeted sequencing reduced alignment difficulties
Used dynamic programming to identify alignments of sequences against an in silico bisulphite-converted sequence of the target amplicon regions
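An in silico bisulphite conversion of a reference is straightforward to sketch: every unmethylated C is read out as T, while methylated C positions are left unchanged. The function below is a minimal illustration for one strand only; it is not the code used by Taylor et al, and `bisulfite_convert` and its `methylated` parameter are hypothetical names.

```python
def bisulfite_convert(seq, methylated=frozenset()):
    """Read-out of seq after bisulphite treatment.

    methylated: set of 0-based positions of methylated cytosines,
    which are protected from conversion.
    """
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )

print(bisulfite_convert("ACGTCC"))        # fully unmethylated reference
print(bisulfite_convert("ACGTCC", {1}))   # C at position 1 is methylated
```

Aligning reads against the fully converted reference and then comparing them with the original sequence shows which cytosines stayed as C, i.e. were methylated.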

75 Metagenomics
Craig Venter’s sequencing of the sea is one of the earliest and most well-known examples; it used Sanger sequencing
Many recent studies, including:
Angly et al – studied ocean virome
Cox-Foster et al – studied colony collapse disorder
All use 454 for its longer read length and target amplification of 16S or 18S ribosomal subunits

