Special Topics in Genomics Lecture 1: Introduction Instructor: Hongkai Ji Department of Biostatistics

1 Special Topics in Genomics
Lecture 1: Introduction
Instructor: Hongkai Ji, Department of Biostatistics
Email: hji@jhsph.edu

2 Outline of today’s lecture
- Introduction to genome and genomics
- Topics and tools
- Relevance of statistics

3 DNA
- DNA (deoxyribonucleic acid) molecules store the genetic information of a living organism.
- DNA consists of two polymers made from four types of nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T).
- Purines: A, G; pyrimidines: C, T.
- The two polymers are complementary to each other and form a double-helix structure:
  5’-ACCGTTCGACGGTAA-3’
     |||||||||||||||
  3’-TGGCAAGCTGCCATT-5’
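The base-pairing rule on this slide (A pairs with T, C pairs with G) can be checked with a short Python sketch. The function names `complement` and `reverse_complement` are illustrative choices, not part of the lecture.

```python
# Watson-Crick pairing: A<->T, C<->G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Base-by-base complement, written in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Complementary strand rewritten in its own 5'->3' direction."""
    return complement(strand)[::-1]


top = "ACCGTTCGACGGTAA"          # 5'->3' strand from the slide
bottom = complement(top)         # 3'->5' strand as drawn under it
print("5'-" + top + "-3'")
print("3'-" + bottom + "-5'")
```

Reading the lower strand in its own 5’ to 3’ direction corresponds to `reverse_complement(top)`.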

4 Chromosome

5 Genome
TCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCT
CTCACACCTGACATGAAAAGGCACATGAGGATCCTCAAATACCCCGTGATCAGTCTCAGGGTAGCTCTCATAGCCTGGACA
GGGCCCCCCTCGGGGGTTGCGCCCAGGTCCAGGCGGGGGATGCACAGCAACAGTCACCGAAGCAGAAGCCGTCACAGTGGT
GATGGGCTGGCAGTAGCTGGGCACAGAGCTGCCCATGGCGGTGGACGTTGGGTTCCGAGGGTTGTGAGAACGGGCCCCACG
GGGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGG
GGGTGGCCGCAAGCCTTCTCTAGGGGGATCCCTTCGAGGCTGCTGGCCTTGCCGTCCAGGGGACAAGGAGCCAGAGTCCAG
GTGGGGCTGTTGCCGAGGGGTCAAGGGAGGCTGATGTCTGGAGTCCGGATGGACCACCTGCAGAGGAGAGACATAGGTCAA
CACAGGGAGGTAGGATGGTGGTGATGTTCCACCCACAAAAGAAAACCTATTCCTTTAGAAACCTCCAGGATGTGAATCCTG
CCTGCACCTGCACAGCTGGCTGGAGGCATATAGCCACTGCCCATAGATCTCAACTTACCCTCACAACCAACTGCCCCCAGG
CCTAAGTTCTCTGCCTCAAAACTGCCAAGGCCTGGATAGCCAAGAGCCTGGGTGTCTTGGAAATATGCAACCATAAATAGT
AGCTTTTAGAAGTATAAGGCTCCTGTTTCTGGGTCATATTAGTGTTGTTTTCACCTGTCCCCAGCCCTAAGCCAGGTGTGG
CCAGAAGCAAATGTACTGTAAGAGCAGAGCAAAAACTTCCACACAGATAGTTCTGTTAGGCAATACATCTCTGCCTGACTA
TTAGGAATCTGGTTTCTGGGTCCTCTGTACAAAGCTCGGAGCAACACAGTGGCCACATCAATCAAAAGGACCGTGACCAAC
TTCAAAGTCGGTGAGCTTGTACCTATTTTTAGGCTCCTGCTGAACAGAACCAGATTCACACTACAGCTCAGCAGGGCATCG
TCACGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGGGGGGGGGGGTGGACAGAGGACGGGGACACAATT
CACTGGCCAGCCCTTCTCTCCTTCAAGGAAGGCTGCTCTAGCCTGGGACTGGAATACACATTTCCTGTAAACATGGTGGGG
GCCTCAGGCAAGCCAGAGTTTTGGAGCCTTCCTTAACTCTTCAAGGTGAGCATCTTGACTTGGAGGGTGGGGGTGCGGGTA
AGGAAGGAACCTGTGGACTCCTCCCTACAAGACAGAAAAGGAATAAGCCACGAAGACAATAACGATTTTTGTATCAAGCGT
CCTCTCCCATTTCAGCTTACCTGACAATGAAATCAAATTCGGACCCTGCAAGCATCAGTACACCCAGCAGAGTGGACACAG
CACCGTCCAGAACGGGAGCAAACATGTGCTCCAGAGCGAGCATAGCCCTGTGGTTCTTGTCCCCAATGGCTGTCAGAAAGG
CCTGAACAAAGGAGAAAATTGACACGGTCACATTCTGGGTGTGGTAAAGTGCTCAGCTGTGTCTATACTTGGGTTTTGTAT
…
Total amount of DNA in the human genome: 3 × 10^9 base pairs (bp)

6 Gene

7 Central Dogma Gene expression

8 Topic 1: gene expression and microarray
[Figure: genes A, B, C under "Expression" and genes X, Y, Z under "No Expression"; expression varies temporally and spatially]

9 Microarray probe cDNA sample

10 Microarray data

11 Topic 2: transcriptional regulation
- Transcription factors (TF): TF1, TF2
- Transcription factor binding sites (TFBS): CCACCCAC, TAATAAAAT
[Figure: TF1 and TF2 shown bound to the sequence below]
TTATGTAACCTGCACTTACTACCACCCACAACATAATAAAATCTAAACCACTGAATGAAATACAAAATCTATGTATGA...
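A minimal Python sketch of locating the two binding sites named on this slide in the example sequence. It only does exact matching on the strand shown; the helper name `find_sites` is illustrative, and real binding sites tolerate mismatches, which is what the motif matrices on the following slides capture.

```python
def find_sites(sequence: str, motif: str) -> list[int]:
    """0-based start positions of every exact (possibly overlapping) match."""
    hits = []
    start = sequence.find(motif)
    while start != -1:
        hits.append(start)
        start = sequence.find(motif, start + 1)
    return hits


# Sequence and site strings copied from the slide (the trailing "..." is omitted).
sequence = ("TTATGTAACCTGCACTTACTACCACCCACAACATAATAAAATCTAAACC"
            "ACTGAATGAAATACAAAATCTATGTATGA")
tfbs = ["CCACCCAC", "TAATAAAAT"]

for motif in tfbs:
    for start in find_sites(sequence, motif):
        print(f"{motif} found at 0-based position {start}")
```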

12 Transcription factor binding motif
Sequences containing the TF binding sites:
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG
AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC
ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG

Transcription Factor Binding Sites (TFBS), aligned by position:
123456789
TGGGTGGTC
TGGGTGGTA
TGGGAGGTC
TGGGTGGTG
TGAGTGGTC
TGGGTGGTC

Count matrix:
     1  2  3  4  5  6  7  8  9
A    0  0  1  0  1  0  0  0  1
C    0  0  0  0  0  0  0  0  4
G    0  6  5  6  0  6  6  0  1
T    6  0  0  0  5  0  0  6  0

Motif (frequency matrix):
     1     2     3     4     5     6     7     8     9
A    0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
C    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
G    0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
T    1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00
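The count and frequency matrices on this slide follow mechanically from the six aligned sites. A short Python sketch of that tabulation (the function names are illustrative):

```python
BASES = "ACGT"

# The six aligned binding sites from the slide.
sites = [
    "TGGGTGGTC",
    "TGGGTGGTA",
    "TGGGAGGTC",
    "TGGGTGGTG",
    "TGAGTGGTC",
    "TGGGTGGTC",
]


def count_matrix(aligned_sites):
    """counts[base][position] = number of sites with that base at that position."""
    width = len(aligned_sites[0])
    counts = {base: [0] * width for base in BASES}
    for site in aligned_sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts


def frequency_matrix(counts, n_sites):
    """Divide every count by the number of sites."""
    return {base: [c / n_sites for c in row] for base, row in counts.items()}


counts = count_matrix(sites)
freqs = frequency_matrix(counts, len(sites))
for base in BASES:
    print(base, counts[base], [f"{f:.2f}" for f in freqs[base]])
```

Position 9 has C in 4 of 6 sites, that is 4/6 ≈ 0.667, which the slide displays as 0.66.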

13 Motifs are regulatory codes in the genome TCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCT CTCACACCACCCATGTTTTGTTTATGAGGATCCTCAAATACCCCGTGATCAGTCTCAGGGTAGCTCTCATAGCCTGGACAG GGCCCCCCTCGGGGGTTGCGCCCAGGTCCAGGCGGGGGATGCACAGCAACAGTCACCGAAGCAGAAGCCGTCACAGTGGTG ATGGGCTGGCAGTAGCTGGGCACAGAGCTGCCCATGGCGGTGGACGTTGGGTTCCGAGGGTTGTGAGAACGGGCCCCACGG GGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGGG GGTGGCCGCAAGCCTTCTCTAGGGGGATCCCTTCGTTGCTGCTGGCCTTGCCGTCCAGGGGACAAGGAGCCAGAGTCCAGG TGGGGCTGTTGCCGAGGGGTCAAGGGAGGCTGATGTCTGGAGTCCGGATGGACCACCTGCAGAGGAGAGACATAGGTCAAC ACAGGGAGGTAGGATGGTGGTGATGTTCCACCCACAAAAGAAAACCTATTCCTTTAGAAACCTCCAGGATGTGAATCCTGC CTGCACCTGCACAGCTGGCTGGAGGCATATAGCCACTGCCCATAGATCTCAACTTACCCTCACAACCAACTGCCCCCAGGC CTAAGTTCTCTGCCTCAAAACTGCCAAGGCCTGGATAGCCAAGAGCCTGGGTGTCTTGGAAATATGCAACCATAAATAGTA GCTTTTAGAAGTATAAGGCTCCTGTTTCTGGGTCATATTAGTTTTGTTTTCACCTGTCCCCACCCATAAGCCAGGTGTGGC CAGAAGCAAATGTACTGTAAGAGCAGAGCAAAAACTTCCACACAGATAGTTCTGTTAGGCAATACATCTCTGCCTGACTAT TAGGAATCTGGTTTCTGGGTCCTCTGTACAAAGCTCGGAGCAACACAGTGGCCACATCAATCAAAAGGACCGTGACCAACT TCAAAGTCGGTGAGCTTGTACCTATTTTTAGGCTCCTGCTGAACAGAACCAGATTCACACTACAGCTCAGCAGGGCATCGT CACGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGGGGGGGGGGGTGGACAGAGGACGGGGACACAATTC ACTGGCCAGCCCTTCTCTCCTTCAAGGAAGGCTGCTCTAGCCTGGGACTGGAATACACATTTCCTGTAAACATGGTGGGGG CCTCAGGCAAGCCAGAGTTTTGGAGCCTTCCTTAACTCTTCAAGGTGAGCATCTTGACTTGGAGGGTGGGGGTGCGGGTAA GGAAGGAACCTGTGGACTCCACCCAACAAGACAGAAAAGGAATAAGCCACGAAGACAATAACGATTTTTGTATCAAGCGTC CTCTCCCATTTCAGCTTACCTGACAATGAAATCAAATTCGGACCCTGCAAGCATCAGTACACCCAGCAGAGTGGACACAGC ACCGTCCAGAACGGGAGCAAACATGTGCTCCAGAGCGAGCATAGCCCTGTGGTTCTTGTCCCCAATGGCTGTCAGAAAGGC CTGAACAAAGGAGAAAATTGACACGGTCACATTCTGGGTGTGGTAAAGTGCTCAGCTGTGTCTATACTTGGGTTTTGTAT Transcription Factor Binding Sites (TFBS) Gene

14 Gene regulatory network
[Figure: transcription factors (TF1, TF2, TF3) and other genes (Gene1, Gene2, Gene3) linked by activation, repression and other interactions]
Gene1: TACTACCACCCACAACATAATAAAATCTAA
Gene2: TTAATAAAATACCACCCACAACCTAAGGAT
Misregulation → Diseases

15 Motif discovery and decoding regulatory programs in the genome
Human language:
  guesswhatthestoryisaslongasyouknowthelanguageitshouldbeprettyeasy
  Guess what the story is. As long as you know the language, it should be pretty easy.
  Dictionary: Know, Guess, Be, …
Genomic language:
  GGCCCTGAGCGGTCCCTATTGCTGGGTGGTCAATGCCCTTCATCTGAAATTTC
  AAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGG
  GGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTC
  GGCCCTGAGCGGTCCCTATTGCTGGGTGGTCAATGCCCTTCATCTGGAATTTC
  AAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGG
  GGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTC
  Dictionary
[Figure: arrows labelled step1 and step2 connect each text with its dictionary]

16 Finding motifs from co-regulated genes (Roth et al., 1998; Hughes et al., 2000; etc.)
[Figure: expression of Gene 1, Gene 2, Gene 3, …, Gene N under Condition1 and Condition2]
Gene1: GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
Gene2: CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
Gene3: TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA

17 Motif discovery is difficult in mammalian genomes due to a low signal-to-noise ratio
yeast: Gene1 100~1000 bp; Gene2 100~1000 bp; Gene3 100~1000 bp
human: Gene1 10k~1000k bp; Gene2 10k~1000k bp; Gene3 10k~1000k bp

18 Topic 3: ChIP-chip and tiling array
ChIP-chip (Chromatin ImmunoPrecipitation coupled with Microarray)
[Figure: IP and No IP samples; fragments 500~2000 bp long]

19 ChIP-chip on tiling arrays
Fragments (IP, CT): 500~2000 bp long
Probe: 25~60 bp long, 35~300 bp spacing
Probe intensities:
IP1   1000   20   32  1120  800   50   12  1700  600   11   20   17   80  780   60
IP2   1200   30   25  1500  730   45   11  1650  700   15   30   23   90  790   70
CT1     80   32   30    21   32   35   22    50   30   24   25   33   12   30   10
CT2     20   25   27    50   29   60   17    45   20   13   15   29   21   45   13
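One crude way to read the intensity table on this slide is to compare the average IP signal with the average CT signal at each probe. The Python sketch below does that with a log2 ratio. It assumes CT denotes control samples, and the 8-fold cutoff (`THRESHOLD = 3.0`) is a hypothetical value chosen for illustration; neither the statistic nor the cutoff is taken from the lecture.

```python
import math

# Probe intensities copied from the slide (15 probes per sample).
ip1 = [1000, 20, 32, 1120, 800, 50, 12, 1700, 600, 11, 20, 17, 80, 780, 60]
ip2 = [1200, 30, 25, 1500, 730, 45, 11, 1650, 700, 15, 30, 23, 90, 790, 70]
ct1 = [80, 32, 30, 21, 32, 35, 22, 50, 30, 24, 25, 33, 12, 30, 10]
ct2 = [20, 25, 27, 50, 29, 60, 17, 45, 20, 13, 15, 29, 21, 45, 13]


def log2_enrichment(ip_reps, ct_reps):
    """Per-probe log2(mean IP intensity / mean CT intensity)."""
    scores = []
    for probe in range(len(ip_reps[0])):
        ip_mean = sum(rep[probe] for rep in ip_reps) / len(ip_reps)
        ct_mean = sum(rep[probe] for rep in ct_reps) / len(ct_reps)
        scores.append(math.log2(ip_mean / ct_mean))
    return scores


THRESHOLD = 3.0  # hypothetical cutoff: IP at least 8-fold above CT

scores = log2_enrichment([ip1, ip2], [ct1, ct2])
enriched = [i for i, s in enumerate(scores) if s > THRESHOLD]
print("enriched probes (0-based):", enriched)
```

A per-probe ratio ignores that neighbouring probes on a tiling array cover the same 500~2000 bp fragment, so a practical analysis would usually combine adjacent probes and account for probe-level variance.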

20 A combined approach to study gene regulation
ChIP-chip: 500~2000 bp
Sequence analysis: 6~30 bp
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG
AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC

21 Topic 4: alternative splicing and exon array
[Figure labels: gene, promoter, transcription start site (TSS), exon, intron, splicing]

22 Alternative splicing
[Figure: Isoform 1, Isoform 2 and Isoform 3 assembled from exon 1, exon 2, exon 3, exon 4 and exon 5]

23 Exon array

24 Topic 5: single nucleotide polymorphism and SNP array
SNPs:
- occur every 100 to 1000 bp
- make up 90% of genetic variations
- minor allele frequency >= 1% (otherwise we call them mutations)

25 SNP array
Target (both strands):
  ACCGTGGA[C/T]CTGAACCG
  ||||||||  |  ||||||||
  TGGCACCT[G/A]GACTTGGC
Probes:
  ACCGTGGA[G]CTGAACCG
  ACCGTGGA[C]CTGAACCG
  ACCGTGGA[T]CTGAACCG
  ACCGTGGA[A]CTGAACCG
What will happen when the genotype is CC? CT? TT?
Applications:
1. Genotyping & genome-wide association study
2. Copy number variations and loss of heterozygosity
3. Allele specific expression
…

26 Topic 6: next-generation sequencing Traditional sequencing

27 Next-generation sequencing
Prepare genomic DNA → Attach DNA to surface → Bridge amplification → Fragments become double stranded → Denature the double-stranded molecules → Complete amplification → Determine first base → Image first base → Determine second base → Image second base → Sequence reads over multiple cycles → Align data.
>50 million clusters/flow cell, each 1000 copies of the same template; 1 billion bases per run; 1% of the cost of the capillary-based method.
(From: http://www.illumina.com/downloads/SS_DNAsequencing.pdf)

28 Array vs. next-generation sequencing

29
Microarray, Exon array → RNA-seq
ChIP-chip → ChIP-seq
SNP array → SNP/mutation detection by sequencing
… → …

30 Other topics
- Epigenomics
- Transposon
- miRNA

31 Relevance of statistics
Genomics ↔ Statistics
- Need new statistical theories and tools
- Guide development of efficient data analysis strategies

32 Example 1: differential gene expression

33 Example 1: multiple testing
Gene                    i=1    i=2    i=3    …    i=I
t-statistic             1.2    6.7    5.1    …    -0.5
p-value                 0.30   0.001  0.002  …    0.56
Bonferroni adjustment   …
Rejections              …
- Multiplicity needs to be adjusted for in order to determine statistical significance.
- The Bonferroni adjustment is too stringent.
- Alternative: false discovery rate
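A minimal Python sketch of the Bonferroni adjustment, applied to the four p-values that are visible on this slide. The total number of tests I is elided on the slide, so `n_tests=10_000` below is a hypothetical value used only to show why the adjustment becomes stringent when many genes are tested.

```python
def bonferroni_reject(p_values, n_tests, alpha=0.05):
    """Reject H0 for a test when its p-value is at most alpha / n_tests."""
    threshold = alpha / n_tests
    return [p <= threshold for p in p_values]


# Only the p-values displayed on the slide; the "…" entries are unknown.
shown = [0.30, 0.001, 0.002, 0.56]

# If these four were the only tests, the cutoff is 0.05 / 4 = 0.0125.
print(bonferroni_reject(shown, n_tests=len(shown)))

# With a hypothetical 10,000 genes the cutoff drops to 0.000005,
# and none of the displayed p-values survive.
print(bonferroni_reject(shown, n_tests=10_000))
```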

34 False discovery rate (FDR, Benjamini & Hochberg, 1995)
            Accept   Reject   Total
True H0     U        V        m0
True H1     T        S        m - m0
Total       m - R    R        m

False discovery rate: FDR = E(V/R) = Pr(R > 0) E(V/R | R > 0), with V/R taken as 0 when R = 0
Family-wise error rate: FWER = Pr(V ≥ 1)
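The slide cites Benjamini & Hochberg (1995) but does not spell out their step-up procedure, so the Python sketch below is a supplementary illustration: sort the m p-values, find the largest rank k with p(k) ≤ k·q/m, and reject the k smallest. The function name and the target level `q=0.05` are illustrative choices.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns one reject flag per test."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank  # remember the largest rank that passes its own cutoff
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# The four p-values displayed on the multiple-testing slide.
print(benjamini_hochberg([0.30, 0.001, 0.002, 0.56]))

# A case where the FDR procedure rejects more than Bonferroni would:
# Bonferroni (cutoff 0.05 / 4 = 0.0125) keeps only 0.005 and 0.01.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Under independence of the test statistics this procedure keeps FDR at or below q, which is a weaker guarantee than controlling FWER and is why it typically yields more rejections.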

35 Pooling information
[Figure: for tests 1, 2, 3, …, I, each sample variance (with its df) is combined with a common distribution to give variance estimates, which are then used in modified t-statistics]
Multiplicity causes problems in controlling type I errors, but it can also be used to improve statistical power!
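The slide does not give a formula for the modified t-statistic. One common form of the idea, shown here purely as an assumption, shrinks each gene's sample variance toward a shared prior variance, weighting by degrees of freedom, and uses the shrunken variance in the t denominator. All names and numbers in this Python sketch (`prior_var`, `prior_df`, the example values) are hypothetical.

```python
import math


def shrink_variance(sample_var, df, prior_var, prior_df):
    """Weighted average of a gene's own variance and a shared prior variance."""
    return (prior_df * prior_var + df * sample_var) / (prior_df + df)


def modified_t(mean_diff, sample_var, df, n1, n2, prior_var, prior_df):
    """Two-sample t-statistic computed with the shrunken variance."""
    s2 = shrink_variance(sample_var, df, prior_var, prior_df)
    return mean_diff / math.sqrt(s2 * (1 / n1 + 1 / n2))


# Hypothetical gene: 3 vs 3 samples (df = 4), mean difference 1.0,
# and a sample variance that happens to be very small.
ordinary = 1.0 / math.sqrt(0.02 * (1 / 3 + 1 / 3))
moderated = modified_t(1.0, 0.02, 4, 3, 3, prior_var=1.0, prior_df=4)
print(f"ordinary t = {ordinary:.2f}, modified t = {moderated:.2f}")
```

With few replicates a gene can have a tiny variance estimate by chance, which inflates the ordinary t-statistic; borrowing information across genes tempers that.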

36 Example 2: motif discovery
S: GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA
A: 000000000000001000000000000000000000000001000000000000000000000000000000

Background (θ0):
     A    C    G    T
A    .3   .2   .2   .3
C    .2   .3   .3   .2
G    .2   .3   .3   .2
T    .3   .2   .2   .3

Motif (Θ):
     1     2     3     4     5     6     7     8     9
A    0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
C    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
G    0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
T    1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00

Θ ⇄ A: inference by iterative estimation/sampling (Gibbs sampler) from f(A, Θ | S)
Marginalization: f(A | S) = ∫ f(A, Θ | S) dΘ
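Each iteration of such a scheme needs, for the current motif matrix, a measure of how well every window of S fits the motif compared with background. The Python sketch below computes a log-odds score per start position. It is not the full Gibbs sampler and it simplifies the slide's model: the background is taken as uniform (0.25 per base) rather than the 4×4 background matrix, and a pseudocount (`PSEUDO = 0.05`, a hypothetical value) keeps zero frequencies from giving log(0).

```python
import math

# Motif frequency matrix copied from the slide (positions 1..9).
MOTIF = {
    "A": [0.00, 0.00, 0.17, 0.00, 0.17, 0.00, 0.00, 0.00, 0.17],
    "C": [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.66],
    "G": [0.00, 1.00, 0.83, 1.00, 0.00, 1.00, 1.00, 0.00, 0.17],
    "T": [1.00, 0.00, 0.00, 0.00, 0.83, 0.00, 0.00, 1.00, 0.00],
}
WIDTH = 9
BACKGROUND = 0.25  # assumption: uniform background, not the slide's matrix
PSEUDO = 0.05      # assumption: pseudocount added to every frequency


def site_log_odds(window: str) -> float:
    """log2 of P(window | motif) / P(window | background)."""
    score = 0.0
    for pos, base in enumerate(window):
        p_motif = (MOTIF[base][pos] + PSEUDO) / (1.0 + 4 * PSEUDO)
        score += math.log2(p_motif / BACKGROUND)
    return score


def scan(sequence: str) -> list[float]:
    """Score every window of length WIDTH, indexed by 0-based start."""
    return [site_log_odds(sequence[i:i + WIDTH])
            for i in range(len(sequence) - WIDTH + 1)]


S = ("GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA"
     "CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA")
scores = scan(S)
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print("two best start positions (0-based):", ranked[:2])
```

The two highest-scoring starts coincide with the two 1s in the indicator vector A on the slide (1-based positions 15 and 42). A sampler would turn such scores into probabilities and draw A from them, then re-estimate Θ from the sampled sites.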

