Summer Inst. Of Epidemiology and Biostatistics, 2010: Gene Expression Data Analysis 1:30pm – 5:00pm in Room W2015 Carlo Colantuoni

1 Summer Inst. Of Epidemiology and Biostatistics, 2010: Gene Expression Data Analysis 1:30pm – 5:00pm in Room W2015 Carlo Colantuoni carlo@illuminatobiotech.com http://www.illuminatobiotech.com/GEA2010/GEA2010.htm

2 Class Outline
Basic Biology & Gene Expression Analysis Technology
Data Preprocessing, Normalization, & QC
Measures of Differential Expression
Multiple Comparison Problem
Clustering and Classification
The R Statistical Language and Bioconductor
GRADES – independent project with Affymetrix data.
http://www.illuminatobiotech.com/GEA2010/GEA2010.htm

3 Class Outline - Detailed
Basic Biology & Gene Expression Analysis Technology
–The Biology of Our Genome & Transcriptome
–Genome and Transcriptome Structure & Databases
–Gene Expression & Microarray Technology
Data Preprocessing, Normalization, & QC
–Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
–Background correction (PM-MM, RMA, GCRMA)
–Global Mean Normalization
–Loess Normalization
–Quantile Normalization (RMA & GCRMA)
–Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
–Quality Control: PCA and MDS for dimension reduction
–SVA: Surrogate Variable Analysis
Measures of Differential Expression
–Basic Statistical Concepts
–T-tests and Associated Problems
–Significance analysis in microarrays (SAM) [ & Empirical Bayes]
–Complex ANOVAs (limma package in R)
Multiple Comparison Problem
–Bonferroni
–False Discovery Rate Analysis (FDR)
Differential Expression of Functional Gene Groups
–Functional Annotation of the Genome
–Hypergeometric test?, χ², KS, pDens, Wilcoxon Rank Sum
–Gene Set Enrichment Analysis (GSEA)
–Parametric Analysis of Gene Set Enrichment (PAGE)
–geneSetTest
–Notes on Experimental Design
Clustering and Classification
–Hierarchical clustering
–K-means
–Classification: LDA (PAM), kNN, Random Forests; Cross-Validation
Additional Topics: eQTL (expression + SNPs); Next-Gen Sequencing data: RNAseq, ChIPseq; Epigenetics?
–The R Statistical Language: http://www.r-project.org/
–Bioconductor: http://www.bioconductor.org/docs/install/
–Affymetrix data processing example

4 Questions for you: Student’s training and experience: Statistics or Biology; MS or MD or PhD. Student’s goals. Student’s data? R Statistical Language? Other programming experience? Extra topics: Student’s interests.

5 DAY #1: Genome Biology The Transcriptome Microarray Technology

6 The Human Genome [Figure labels: DAD, MOM, YOU] 2 copies of the entire genome in each cell: 3.3 billion “bases” (Gb), ~30K genes, millions of variants. We each get 1 copy from MOM & 1 from DAD. Each parent passes on a “mixed copy” (from their parents). Each copy of the genome is contained in 23 chromosomes: 22 + X or Y (2 copies = 46 / cell). All in DNA!

7 DNA: A deoxyribonucleic acid or DNA molecule is a double-stranded polymer composed of four basic molecular units called nucleotides. Each nucleotide contains a phosphate group, a deoxyribose sugar, and one of four nitrogen bases: adenine (A), guanine (G), cytosine (C), and thymine (T). The two chains are held together by hydrogen bonds. Base-pairing occurs according to the following rule: G pairs with C, and A pairs with T. Directionality & Complementarity: Reverse Complements hybridize.
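
The pairing rule on this slide (G with C, A with T) and the statement that reverse complements hybridize can be illustrated with a minimal Python sketch. The function names `reverse_complement` and `can_hybridize` are made up for this illustration, and the check only covers a perfect, full-length duplex, which is a simplification of real hybridization.

```python
# Watson-Crick pairing rule from the slide: G pairs with C, A pairs with T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    # Complement every base, then reverse so the result also reads 5'->3'.
    return seq.upper().translate(COMPLEMENT)[::-1]


def can_hybridize(strand_a: str, strand_b: str) -> bool:
    """True when strand_b is the exact reverse complement of strand_a.

    Simplifying assumption: only a perfect, full-length duplex counts.
    """
    return reverse_complement(strand_a) == strand_b.upper()


if __name__ == "__main__":
    print(reverse_complement("GATTACA"))        # TGTAATC
    print(can_hybridize("GATTACA", "TGTAATC"))  # True
```

Both strands are written 5’ to 3’, which is why the complement has to be reversed before the comparison.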

8 How do these molecular interactions influence directionality and complementarity? G-C pairs are “stickier” than A-T pairs (3 vs. 2 H-bonds). A + G = purines (2 rings). T + C + U = pyrimidines (1 ring) (T in DNA, U in RNA).
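
As a rough illustration of why G-C rich sequences are “stickier”, the sketch below counts hydrogen bonds in a perfectly paired duplex, using 3 per G-C pair and 2 per A-T pair as stated on the slide. The helper names are hypothetical, and the bond count is only a crude proxy for duplex stability, not a melting-temperature model.

```python
# Hydrogen bonds contributed by each base when it sits in a Watson-Crick pair:
# G-C pairs form 3 H-bonds, A-T pairs form 2 (values from the slide).
H_BONDS = {"G": 3, "C": 3, "A": 2, "T": 2}


def duplex_h_bonds(seq: str) -> int:
    """Total H-bonds when seq is fully paired with its reverse complement."""
    return sum(H_BONDS[base] for base in seq.upper())


def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


if __name__ == "__main__":
    print(duplex_h_bonds("GGCC"))  # 12
    print(duplex_h_bonds("AATT"))  # 8
```

Two probes of the same length can therefore differ in how strongly they bind their targets, which is one reason probe sequence matters for the hybridization behaviour discussed later in the deck.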

9 Another View of DNA Where does an individual gene lie in this schematic?

10 Another View of DNA

12 Central Dogma of Modern Cellular & Molecular Biology:

13 Transcription From DNA to mRNA: Transcription occurs at Genes (T in DNA => U in RNA)

14 Transcript Processing

15 Translation From RNA to Protein: In the exons of protein coding genes (and their mRNA intermediates), each codon (3 consecutive bases) encodes 1 amino acid in the protein.
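
The codon rule can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard genetic code, reads from the first base, and stops at the first stop codon; the function name `translate` is made up here, and start-codon search, introns, and alternative genetic codes are deliberately ignored.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order for each
# codon position; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_seq: str) -> str:
    """Translate a coding sequence codon by codon until the first stop codon.

    Accepts DNA (T) or mRNA (U) letters; a trailing partial codon is ignored.
    """
    seq = coding_seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGCCUAA"))  # MA
```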

16 Perspective: Biological Setup Every cell in the human body contains the entire human genome: 3.3 Gb in which ~30K genes exist. The investigation of gene expression is meaningful because different cells, in different environments, doing different jobs express different genes. Cellular “Plans”: DNA - RNA - PROTEIN

17 Cellular Biology, Gene Expression, and Microarray Analysis DNA RNA Protein A protein-coding gene is a segment of chromosomal DNA that directs the synthesis of a protein via an mRNA intermediate. How do we design and implement probes that will effectively assay expression of ALL (most? many?) genes simultaneously.

18 Laboratory Methods: The Genome and The Transcriptome. Easy to sequence some genomic DNA. Easy to sequence some expressed mRNAs. NOT EASY to catalogue all genomic DNA, all expressed mRNAs, and to map out the exact relations between all these sequences.

19 Molecular Cell Biology: Components of the Central Dogma [Figure labels: Genomic DNA (3.3 Gb); Transcription; mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA); Translation; Protein]

20 Gene: Protein coding unit of genomic DNA with an mRNA intermediate. [Figure labels: Genomic DNA (3.3 Gb, ~30K genes); Transcription; mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA); DNA Probe] Sequence is a Necessity.

21 From Genomic DNA to mRNA Transcripts [Figure labels: EXONS, INTRONS; ~30K, >30K] RNA editing & SNPs. Alternative splicing. Alternative start & stop sites in same RNA molecule. Transcript coverage. Homology to other transcripts. Hybridization dynamics. 3’ bias. Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns.

22 Designing DNA Probes From Genomic DNA Sequence Sequence & assemble the entire human genome. Search for genes predicted to produce mRNA transcripts. Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns. Completeness? Design DNA probes. [ Genomic DNA databases & assembly ]

23 Designing DNA Probes From mRNA Sequences Sequence ALL expressed mRNA molecules. Completeness? Design DNA probes.

24 Sequence Quality! Redundancy! Completeness? Unsurpassed as source of expressed sequence Chaos?!?

25 From Genomic DNA to mRNA Transcripts ~30K >30K >>30K

26 Transcript-Based Gene-Centered Information

27 From Genomic DNA to mRNA Transcripts

34 DAY #1: Genome Biology The Transcriptome Microarray Technology

35 RNA Expression Measurement: Northern Blot [Figure labels: SAMPLE 1, SAMPLE 2; RNA Extraction; RNA 1, RNA 2 (“target”); electrophoretic separation; electrophoretic transfer to membrane; hybridization of labeled probe; Seq DB; Design + construction of labeled “probe”]

36 RNA Expression Measurement: Northern Blot & Microarrays. Target: unknown (sample). Probe: known (synthetic). SEQUENCE knowledge is REQUIRED for BOTH! Northern blots seek to interrogate the expression of ONE gene in a SINGLE hybridization reaction. Microarrays seek to interrogate the expression of MANY genes simultaneously in a MULTIPLEX hybridization reaction. [Figure labels: Northern: Target, Probe; Microarray: Target, Probes]

37 Hybridization on a Northern Blot [Figure labels: 1 Labeled Probe; MANY Unlabeled Targets; 1 Hybrid; MEMBRANE] Target: unknown. Probe: known. Edwin Southern et al, Nature Genetics Suppl 1999

38 Hybridization on a Microarray [Figure labels: Labeled Target; MANY Unlabeled Probes; MANY Hybrids; Solid Support] Target: unknown. Probe: known. Edwin Southern et al, Nature Genetics Suppl 1999

39 Essentials of Microarray Experimental Design: Probe sequence selection & design; Probe deposition on solid support; Target Labeling; Target Hybridization; Signal detection. [Figure labels: Microarray, Target, Probes]

40 cDNA Microarray Fabrication cDNA Microarray Printing onto standard glass microscope slides or nylon Bacterial clones in 96 well plates

41 cDNA Microarray Experimentation [Figure labels: Sample, Standard; RNA; cDNA; Cy5, Cy3; Hybridized Microarray; Scan]

42 cDNA Microarray Scanning [Figure labels: Cy5, Cy3, Merged Image; Cy3 Channel Data, Cy5 Channel Data; Quantification]

43 cDNA Microarray Quantification

46 cDNA Microarray Quantification [Plot axis: Log Intensity]

47 cDNA Microarray Quantification [Plot axes: Log Intensity (the two channel signals combined with “+”), Log Ratio (the two channel signals combined with “/”)]
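
The two quantities named on these quantification slides, log ratio and log intensity, are commonly computed per spot from the background-corrected Cy5 and Cy3 signals. The sketch below uses the conventional definitions (log ratio as log2 of Cy5 over Cy3, log intensity as the mean of the two log2 signals). These exact formulas are an assumption, since the slide's own expressions did not survive in the transcript, and the function names are made up for illustration.

```python
import math


def log_ratio(cy5: float, cy3: float) -> float:
    """Log2 ratio of the two channels (often called M)."""
    return math.log2(cy5 / cy3)


def log_intensity(cy5: float, cy3: float) -> float:
    """Mean log2 intensity of the two channels (often called A).

    Assumption: the conventional average of the two log signals is used here.
    """
    return 0.5 * (math.log2(cy5) + math.log2(cy3))


if __name__ == "__main__":
    # Hypothetical spot: Cy5 signal four times the Cy3 signal.
    print(log_ratio(2000.0, 500.0))      # 2.0, i.e. a 4-fold difference
    print(log_intensity(1024.0, 256.0))  # 9.0
```

On the log scale a 2-fold increase and a 2-fold decrease sit at +1 and -1, symmetric around zero, which is the usual argument for plotting ratios against intensities after log transformation.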

48 Essentials of Microarray Experimental Design: Probe sequence selection / design; Probe deposition on solid support; Target Labeling; Target Hybridization; Signal detection. [Figure labels: Microarray, Target, Probes]

49 Agilent (HP) Microarrays 2-channel fluorescence on glass slides. 44,000 oligonucleotides (60 NT’s) synthesized in situ using inkjet printing and solid phase phosphoramidite chemistry.

50 NIA Microarray: 10K full-length cDNAs; ³³P; one-channel; spotted on nylon.

51 Affymetrix GeneChip One-channel data generated using biotin labeling. 1,300,000 oligonucleotides (25 NT’s) in 54,000 “probe sets” (11 PM’s and 11 MM’s). Oligo’s synthesized in situ on a silicon wafer using photolithography.

52 Affymetrix GeneChip

53 Affymetrix Probe Set Design
Reference sequence (5’ to 3’): …TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT…
Perfect match (PM): GTACTACCCAGTCTTCCGGAGGCTA (NSB & SB: nonspecific and specific binding)
Mismatch (MM): GTACTACCCAGTGTTCCGGAGGCTA (NSB: nonspecific binding only)
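
The relationship between the three sequences on this slide can be checked with a short Python sketch. It assumes the probes are written base-for-base under the reference (complemented but not reversed), and that the mismatch probe is the perfect match with its middle (13th) base complemented. The start offset of 10 and the helper names are choices made for this illustration, not part of any Affymetrix tool.

```python
# Watson-Crick complement without reversal: position i of the probe pairs
# with position i of the targeted stretch of the reference.
PAIR = str.maketrans("ACGT", "TGCA")

# Reference sequence as printed on the slide (ellipses dropped).
REFERENCE = "TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT"
PROBE_LENGTH = 25


def perfect_match(reference: str, start: int, length: int = PROBE_LENGTH) -> str:
    """Complement of the reference stretch beginning at a 0-based start."""
    return reference[start:start + length].translate(PAIR)


def mismatch(pm: str) -> str:
    """Copy of the PM probe with its middle base swapped for its complement."""
    middle = len(pm) // 2  # index 12, i.e. the 13th base of a 25-mer
    return pm[:middle] + pm[middle].translate(PAIR) + pm[middle + 1:]


if __name__ == "__main__":
    pm = perfect_match(REFERENCE, start=10)  # offset chosen to match the slide
    mm = mismatch(pm)
    print(pm)  # GTACTACCCAGTCTTCCGGAGGCTA
    print(mm)  # GTACTACCCAGTGTTCCGGAGGCTA
```

The single changed base is meant to disrupt specific binding, so the MM signal is read as nonspecific binding while the PM signal contains both components.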

54 NimbleGen Microarrays: Oligonucleotides synthesized in situ on a glass slide using a maskless digital micromirror device. 195,000 oligonucleotides (60 NT’s): 5 probes / gene. One-channel data.

55 Amersham’s CodeLink Arrays One-channel data. 54,841 oligonucleotides (30NT’s). Spotted into a 3-D aqueous polyacrylamide gel surface on a glass slide.

56 ABI’s Human Genome Survey Array: One-channel data using digoxigenin/AP. Oligonucleotides spotted into a 3-D nylon matrix. 31,077 oligonucleotides (60 NT’s).

57 Illumina’s BeadChip One-channel data using biotin. Oligonucleotides anchored on beads distributed in random arrays of plasma etched pits in the silicon wafer. 1,700,000 oligonucleotides (50 NT’s) immobilized on beads and represented ~30 times (6 full arrays per glass slide).

58 Essentials of Microarray Experimental Design: Probe sequence; Probe deposition on solid support; Target Labeling; Target Hybridization; Signal detection. [Figure labels: Microarray, Target, Probes] Oligo vs. cDNA (Design: follow-up). 1 vs. 2 channel most important for experimental and analysis design. Specifics of each technology will determine idiosyncrasies of data preprocessing. Probe length: Specificity & Sensitivity. Signal? Amplification?

59 An Example to Remind us of Gene Structure and Gene Cross-Referencing Issues 2 independent probes (!) on your microarray interrogate the same gene (!) and both show an extreme expression change in your cell line following treatment: YES!!! However, the directionality of this change is opposite: one probe shows induction while the other shows repression: NO !?!
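
One way to catch the situation described on this slide is to scan for genes whose probes show large changes in opposite directions. The sketch below is a hypothetical illustration: the function name, the threshold of 1.0 on the log2 scale, and all numeric values are invented, and only the gene symbol SF1 comes from the deck.

```python
def discordant_genes(probe_log_ratios: dict, threshold: float = 1.0) -> list:
    """Return genes with at least one strongly induced and one strongly
    repressed probe.

    probe_log_ratios maps a gene symbol to the log2 ratios (treated vs.
    control) of every probe annotated to that gene.
    """
    flagged = []
    for gene, ratios in probe_log_ratios.items():
        induced = any(r >= threshold for r in ratios)
        repressed = any(r <= -threshold for r in ratios)
        if induced and repressed:
            flagged.append(gene)
    return flagged


if __name__ == "__main__":
    # Invented example values, for illustration only.
    example = {
        "SF1": [2.1, -1.8],     # two probes, opposite directions
        "GENE_B": [1.5, 1.2],   # two probes that agree
        "GENE_C": [0.1, -0.2],  # no substantial change
    }
    print(discordant_genes(example))  # ['SF1']
```

A flagged gene is not necessarily an error: probes that target different transcript isoforms of the same gene can legitimately move in opposite directions, so flagged genes call for a closer look at gene structure.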

60 cDNA Microarray Quantification [Plot axis: Log Intensity]

61 cDNA Microarray Quantification [Plot axes: Log Intensity, Log Ratio] Probes designed to interrogate expression of the same gene!

62 From Genomic DNA to mRNA Transcripts

63 SF1 in Entrez Gene (RefSeq): A Complex Transcriptional Profile

64 Lacks regulatory SPSP phosphorylation motif Probe Decreased Probe Increased

65 SF1 in AceView: A Complex Transcriptional Profile!

66 Gene: Protein coding unit of genomic DNA with an mRNA intermediate. [Figure labels: Genomic DNA (3.3 Gb, ~30K genes); Transcription; mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA); DNA Probe] Sequence is a Necessity.

67 From Genomic DNA to mRNA Transcripts [Figure labels: EXONS, INTRONS; ~30K, >30K] RNA editing & SNPs. Alternative splicing. Alternative start & stop sites in same RNA molecule. Transcript coverage. Homology to other transcripts. Hybridization dynamics. 3’ bias. Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns.

68 UCSC Genome Browser: Genes, Transcripts, Probes

78 (Live Web Demo) UCSC example with genes, transcripts, and probe mapping – custom tracks.

