Lecture 6: Genotype by sequencing


1 Lecture 6: Genotype by sequencing
Statistical Genomics Lecture 6: Genotype by sequencing Zhiwu Zhang Washington State University

2 Outline
Genetic markers
Sequencing
Full vs. reduced
Experiment
Data processing and format

3 Human genome project
Funded by DOE, NIH and the Wellcome Trust in the UK
Begun in 1990
Originally planned to last 15 years
The Institute for Genomic Research and U. of Washington provided over 450K tagged BACs, giving a tag every 3~4 Kb across the entire human genome

4 Human genome project
Completion date accelerated to 2003
Celera Genomics: Craig Venter was among those sequenced
Identified 20~120K genes
Sequence of 3 billion base pairs
Cost near 3 billion dollars

5 Types of genetic markers
RFLP: Restriction Fragment Length Polymorphism
SSR: Simple Sequence Repeats
SNP: Single Nucleotide Polymorphism
Chip
Sequencing

6 RFLP Restriction Enzyme Restriction fragment length polymorphism

7 SSR

8 SNP by hybridization

9 Frederick Sanger
1958 Nobel Prize in Chemistry for determining protein structure (the amino acid sequence of insulin)
1980 Nobel Prize in Chemistry for DNA sequencing

10 Ladder of DNA length
dNTP (deoxynucleotides)
ddNTP (dideoxynucleotides): chain terminators

11 1st Generation DNA sequencing
Fred Sanger, Alan R. Coulson, et al., Nature 265, 687–695 (1977)

12 2nd generation sequencing
Sequencing-by-synthesis by 454 Life Sciences: Margulies, M. et al. Nature 437, 376–380 (2005).
Multiplex polony sequencing by the George M. Church lab at Harvard Medical School: Shendure, J. et al. Science 309, 1728–1732 (2005).

13 Sequencing-by-synthesis
454 Life Sciences: Margulies, M. et al. Nature 437, 376–380 (2005).
[Figure: flow cycles 1 to 6 shown against example sequences T G C T A C … and T T T T T T …]

14

15 Multiplex Polony sequencing
George M. Church lab at Harvard Medical School: Shendure, J. et al. Science 309, 1728–1732 (2005).

16

17 Cluster Generation

18 $1000 Genome
              Price   Price/unit   $/Genome*   Consumables   $/Gb
HiSeq X Five  $6M     $1.2M        $1,425      $1,200        $10.6
HiSeq X Ten   $10M    $1M          $1,000      $800          $7

19 DNA/RNA fragmentation
Physical Fragmentation
1) Acoustic shearing
2) Sonication
3) Hydrodynamic shear
Enzymatic Methods
4) DNase I or other restriction endonuclease, non-specific nuclease
5) Transposase
Chemical Fragmentation
6) Heat and divalent metal cation

20 Reduced Genotyping Sequencing
Restriction site

21 Restriction enzymes: ApeKI
Recognition: 5’GCWGC3’ (W: A or T)
Expected size: 4x4x2x4x4 = 512 bp = 0.5 Kb
Genome coverage: 100 bp read / 512 bp size = 20%

22 Restriction enzymes: PstI
Recognition: 5’CTGCAG3’
Expected size: 4^6 = 4096 bp = 4 Kb
Genome coverage: 100 bp read / 4096 bp size = 2.5%
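The two expected-size calculations above follow one rule: each position of the recognition site contributes a factor of 4 divided by the number of bases it accepts. The R sketch below encodes that rule. It assumes equal base frequencies and independent positions, and, like the slides, counts one 100 bp read per fragment; the function names are made up for illustration.

```r
# Number of bases matched by each IUPAC symbol (a subset, enough for these sites)
iupac_degeneracy <- c(A = 1, C = 1, G = 1, T = 1,
                      W = 2, S = 2, R = 2, Y = 2, K = 2, M = 2, N = 4)

# Expected distance between recognition sites, in bp, under equal base frequencies
expected_fragment_bp <- function(site) {
  symbols <- strsplit(site, "")[[1]]
  prod(4 / iupac_degeneracy[symbols])
}

# Fraction of the genome covered when one read of read_bp is taken per fragment
genome_coverage <- function(site, read_bp = 100) {
  read_bp / expected_fragment_bp(site)
}

expected_fragment_bp("GCWGC")   # ApeKI: 4*4*2*4*4 = 512
expected_fragment_bp("CTGCAG")  # PstI:  4^6 = 4096
genome_coverage("GCWGC")        # about 0.195, rounded to 20% on the slide
genome_coverage("CTGCAG")       # about 0.024, rounded to 2.5% on the slide
```

Real genomes have unequal base composition and methylation-sensitive enzymes, so observed fragment sizes can differ noticeably from these expectations.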

23 Multiplex barcode
Aalborg University, Denmark: Craig et al. Nat. Methods 2008, 5: 887–893.
Barcode length: 4~8 bases

24 Adapter and Barcode By Sharon Mitchell

25 Genotyping by sequencing (GBS)
1. Digest DNA
2. Ligate adapters with barcodes
3. Pool DNAs
4. PCR
5. Illumina sequencing
Here is a brief introduction to genotyping by sequencing. This flowchart shows the GBS protocol. The DNA samples are digested with a restriction enzyme and barcoded. The DNA is then pooled and amplified by PCR. Finally, the pooled samples are sequenced on an Illumina sequencer. This protocol is simpler than similar methods such as RRL and RAD.
Elshire et al., PLoS One
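Because the samples are pooled before sequencing, each read has to be assigned back to its sample by the barcode at its start. The R sketch below shows one simple way to do that with exact prefix matching. The barcodes, sample names, and reads are hypothetical, and the function is an illustration, not the GBS pipeline's own demultiplexer.

```r
# Assign each read to a sample by an exact barcode match at the start of the read.
# `barcodes` is a named character vector: names are samples, values are barcodes.
demultiplex <- function(reads, barcodes) {
  sample_id <- rep(NA_character_, length(reads))
  insert <- reads
  # Try longer barcodes first, so a short barcode that is a prefix
  # of a longer one cannot claim the longer barcode's reads.
  for (i in order(-nchar(barcodes))) {
    bc <- barcodes[[i]]
    hit <- is.na(sample_id) & startsWith(reads, bc)
    sample_id[hit] <- names(barcodes)[i]
    insert[hit] <- substring(reads[hit], nchar(bc) + 1)  # strip the barcode
  }
  data.frame(sample = sample_id, insert = insert, stringsAsFactors = FALSE)
}

# Hypothetical barcodes of different lengths and three toy reads
barcodes <- c(s1 = "ACGT", s2 = "TTGCAA")
reads <- c("ACGTCAGCTTAGG", "TTGCAACTGCAGA", "GGGGCAGCAAAA")
res <- demultiplex(reads, barcodes)
res
```

Reads that match no barcode keep `NA` as their sample and are left untrimmed. A production tool would usually also tolerate sequencing errors in the barcode, which this sketch does not.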

26 Cost reduction by multiplexing
Besides, GBS is cost-effective: one sample costs only 9 dollars if 384-plex is used. Considering coverage, we used the 96-plex protocol for all the switchgrass samples.

27 Sequencing depth
Definition: expected number of times each base pair is sequenced
Calculation
100 Mb genome, 100M reads of 100 bp: 100M x 100 bp / 100 Mb = 100X
3G genome, 1% reduced, 50 multiplex, 6G data (1 byte per base): 6G/(50x3Gx1%) = 4X
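Both calculations divide the total number of sequenced bases by the number of bases that have to be covered across all pooled samples. A minimal R sketch of that formula follows; the function and argument names are made up for illustration.

```r
# Expected depth = sequenced bases / (samples * genome size * fraction of genome retained)
sequencing_depth <- function(total_bases, genome_bp, fraction = 1, multiplex = 1) {
  total_bases / (multiplex * genome_bp * fraction)
}

# 100 Mb genome, 100M reads of 100 bp, one sample, whole genome
d_full <- sequencing_depth(total_bases = 100e6 * 100, genome_bp = 100e6)

# 3G genome, reduced to 1%, 50 samples pooled, 6G bases of data
d_reduced <- sequencing_depth(total_bases = 6e9, genome_bp = 3e9,
                              fraction = 0.01, multiplex = 50)

c(full = d_full, reduced = d_reduced)  # 100X and 4X, as on the slide
```

The formula gives an average only; actual depth varies from site to site and from sample to sample.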

28 Genomic coverage and depth
                                           ApeKI            PstI
Recognition bases                          5                6
Fragment size                              .5Kb             4Kb
Genome coverage (100bp read)               20%              2.5%
Number of unique sequences (3G genome)     3G/.5Kb=6M       3G/4Kb=.75M
Sequencing depth (60G data on 3G genome)   60/(3x.2)=100X   60/(3x.025)=800X

29 Distribution of length
Expected fragment length = genome length / number of cuts
Variance = squared expectation (need proof)
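A short derivation can stand in for the missing proof. It is a sketch under an assumption the slide does not state: the n cut sites fall independently and uniformly on a genome of length L, and D is the length of one fragment. All fragments then share the same distribution, so the first one (from position 0 to the first cut) can represent them.

```latex
\begin{align*}
P(D > d) &= \left(1 - \frac{d}{L}\right)^{n}, \qquad 0 \le d \le L,\\
E[D] &= \int_0^L \left(1 - \frac{d}{L}\right)^{n}\,\mathrm{d}d = \frac{L}{n+1},\\
E[D^2] &= \int_0^L 2d\left(1 - \frac{d}{L}\right)^{n}\,\mathrm{d}d = \frac{2L^2}{(n+1)(n+2)},\\
\operatorname{Var}[D] &= E[D^2] - E[D]^2 = \frac{n\,L^2}{(n+1)^2(n+2)}
  \;\approx\; \left(\frac{L}{n}\right)^{2} \quad \text{for large } n.
\end{align*}
```

For large n the expectation is approximately L/n and the variance approximately its square, which is the behaviour of an exponential distribution: many short fragments and a long tail of large ones.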

30 Distribution of length
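size=1e6   # genome size; the original value was lost, this one is illustrative only
n=1e4      # number of cuts; not given on the slide, illustrative only
x=round(runif(n,1,size))   # random cut positions
y=sort(x)
interval=y[-1]-y[-n]       # distances between neighbouring cuts
hist(interval)
Ex=size/n                  # expected length
Va=Ex*Ex                   # claimed variance
m=mean(interval)
v=var(interval)
m
v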

31 Distribution of length
Beissinger et al, Genetics. 2013, 193(4):

32 Number of reads

33 FASTQ
@SRR _SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR _SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
Line 1: starts with @ followed by the sequence description
Line 2: sequence
Line 3: starts with + followed by the description
Line 4: symbols of sequence quality values (same length as the sequence), with ! the lowest and ~ the highest. There are 94 symbols, with ASCII codes from 33 to 126.
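The quality symbols become numbers once each character is replaced by its ASCII code. The R sketch below parses one four-line record and subtracts an offset of 33, so that `!` maps to 0. That offset (the common Phred+33 convention) and the conversion to an error probability are assumptions; the slide only gives the ASCII range. The read name is hypothetical.

```r
# One FASTQ record: sequence and quality string taken from the slide.
record <- c(
  "@read1 length=36",                      # hypothetical read name
  "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC",
  "+read1 length=36",
  paste0(strrep("I", 30), "9IG9IC")        # 30 I's followed by 9IG9IC
)

parse_fastq_record <- function(lines, offset = 33) {
  # Check the four-line layout described on the slide
  stopifnot(length(lines) == 4,
            startsWith(lines[1], "@"),
            startsWith(lines[3], "+"),
            nchar(lines[2]) == nchar(lines[4]))
  quality <- utf8ToInt(lines[4]) - offset  # ASCII code minus offset
  list(id = substring(lines[1], 2),
       sequence = lines[2],
       quality = quality,
       error_prob = 10^(-quality / 10))    # Phred definition, assumed here
}

rec <- parse_fastq_record(record)
range(rec$quality)  # lowest and highest quality score in this read
```

With this offset the symbol `I` (ASCII 73) corresponds to a score of 40 and `9` (ASCII 57) to a score of 24.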

34 ASCII code
x    CHAR(x)   x    CHAR(x)   x    CHAR(x)   x    CHAR(x)
33   !         56   8         80   P         103  g
34   "         57   9         81   Q         104  h
35   #         58   :         82   R         105  i
36   $         59   ;         83   S         106  j
37   %         60   <         84   T         107  k
38   &         61   =         85   U         108  l
39   '         62   >         86   V         109  m
40   (         63   ?         87   W         110  n
41   )         64   @         88   X         111  o
42   *         65   A         89   Y         112  p
43   +         66   B         90   Z         113  q
44   ,         67   C         91   [         114  r
45   -         68   D         92   \         115  s
46   .         69   E         93   ]         116  t
47   /         70   F         94   ^         117  u
48   0         71   G         95   _         118  v
49   1         72   H         96   `         119  w
50   2         73   I         97   a         120  x
51   3         74   J         98   b         121  y
52   4         75   K         99   c         122  z
53   5         76   L         100  d         123  {
54   6         77   M         101  e         124  |
55   7         78   N         102  f         125  }
               79   O                        126  ~

35 Post-sequencing

36 Hapmap format IUPAC code
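In HapMap files a diploid SNP call is often written as a single IUPAC letter: A, C, G, T for homozygotes and R, Y, S, W, K, M for heterozygotes. The R sketch below turns such calls into a 0/1/2 count of one chosen allele. The function name and the choice of which allele to count are assumptions for illustration, not the coding used by any particular package.

```r
# IUPAC single-letter codes expanded to the two alleles they stand for
iupac_pairs <- c(A = "AA", C = "CC", G = "GG", T = "TT",
                 R = "AG", Y = "CT", S = "CG", W = "AT", K = "GT", M = "AC")

# Count copies of allele `alt` in each call; codes not in the table (e.g. "N") give NA
iupac_to_numeric <- function(calls, alt) {
  pairs <- iupac_pairs[calls]
  a1 <- substr(pairs, 1, 1)
  a2 <- substr(pairs, 2, 2)
  unname((a1 == alt) + (a2 == alt))
}

calls <- c("A", "R", "G", "N")      # one A/G SNP scored in four individuals
num <- iupac_to_numeric(calls, alt = "G")
num                                  # 0 1 2 NA
```

Counting the other allele instead simply swaps 0 and 2; heterozygotes stay at 1 either way.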

37 Genotype in Numeric format
myGD=read.table(file="

38 Genetic map myGM=read.table(file="

39 Outline
Genetic markers
Sequencing
Full vs. reduced
Experiment
Data processing and format

