Introduction to Bioinformatics - RNA seq - Marcel Willemsen

1 Introduction to Bioinformatics - RNA seq - Marcel Willemsen

2 Biology intro Central dogma molecular biology RNA o mRNA o tRNA o miRNA => Regulation o... (*) Proteins o Structure o Hormones o Enzymes o... Number of mRNA molecules = gene expression measure protein bio-synthesis *

3 General RNA-seq pipeline Steps: Sample prep, Sequencing, QC, Alignment, Re-alignment, Alignment analysis, QC filtering, Quantifying transcript abundance, Statistical analysis. Applications: Whole transcriptome, Gene expression profiling, microRNA

4 General RNA-seq pipeline Sample prep Alignment re-alignment Sequencing QC Whole transcriptome Gene expression profiling microRNA Alignment analysis QC Filtering Quantifying transcript abundance Statistical analysis Applications:

5 Sample preparation General RNA-seq pipeline Strand specific

6 Agarose gel electrophoresis purification General RNA-seq pipeline - Sample prep Digestion by restriction endonucleases (EcoRI) Cutting out desired length

7 General RNA-seq pipeline Sample prep Alignment re-alignment Sequencing QC Whole transcriptome Gene expression profiling microRNA Alignment analysis QC Filtering Quantifying transcript abundance Statistical analysis Applications:

8 NGS platforms General RNA-seq pipeline - Sequencing Short reads o Illumina-Solexa o ABI SOLiD (Life Technologies) Long reads o Roche-454 Third generation (direct/single molecule) o Helicos o Pacific Biosciences o Oxford Nanopore No amplification, ligation or cDNA synthesis!

9 Intermezzo: colorspace General RNA-seq pipeline - QC A ACGT C A T G C A T C G

10 Intermezzo: colorspace SNP A A C G T A C G A T A C G T T C G A T G First 2 'bases' are often removed SNP A

11 Intermezzo: colorspace Sequencing error A A C G T A C G A T A Translate reference to colorspace during alignments! Sequencing error! A C G T T G C T A

12 General RNAseq pipeline Sample prep Alignment re-alignment Sequencing QC Whole transcriptome Gene expression profiling microRNA Alignment analysis QC Filtering Quantifying transcript abundance Statistical analysis Applications:

13 Quality Control General RNAseq pipeline - QC QC tools -> FastQC,... o Input: Fastq file o Output: Quality report

14 Quality Control General RNAseq pipeline - QC Per base sequence quality

15 Quality Control General RNAseq pipeline - QC Per sequence quality scores

16 Quality Control General RNAseq pipeline - QC Per base sequence content Fastq input Double encoded!

17 Quality Control General RNAseq pipeline - QC Per base GC content

18 Quality Control General RNAseq pipeline - QC Per sequence GC content

19 General RNAseq pipeline Sample prep Alignment re-alignment Sequencing QC Whole transcriptome Gene expression profiling microRNA Alignment analysis QC Filtering Quantifying transcript abundance Statistical analysis Applications:

20 Alignment General RNAseq pipeline Tool: IGV Integrative Genomics Viewer (Broad Institute) Coverage Direction Annotation: RefSeq genes

21 Coverage General RNA-seq pipeline The more samples the less coverage!

22 Alignment General RNAseq pipeline - SAGE sample Position Introns

23 Alignment General RNAseq pipeline - SAGE sample Start Stop Start Stop Mismatches Minus reads are reversed! Seeding (-l 25 -k 1 -n 100)

24 SAM/BAM Format General RNA-seq pipeline Header: reference name, reference length Alignment fields: o query name = sample name:bead coordinates o strand (flag): 0 = plus, 16 = minus, 4 = no match o leftmost position: 5' end for plus strands, 3' end for minus strands o mapping quality (Phred scaled) o CIGAR string o query sequence, on the same strand as the reference

25 SAM/BAM Format General RNA-seq pipeline o query quality (encoded) o optional fields: TAG:TYPE:VALUE o mate info (mate pair seq)
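
The field layout on the two SAM/BAM slides can be sketched as a small parser. This is a minimal illustration, assuming the 11 mandatory tab-separated SAM columns followed by optional TAG:TYPE:VALUE fields. The example record, the helper names and the `is_confident` filter are hypothetical; a real pipeline would normally rely on a dedicated library.

```python
# Minimal SAM alignment-line parser (illustrative only).
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus tags."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # Optional fields have the form TAG:TYPE:VALUE.
    record["tags"] = {}
    for field in parts[11:]:
        tag, typ, value = field.split(":", 2)
        record["tags"][tag] = int(value) if typ == "i" else value
    return record


def strand(flag):
    """Interpret the flag values listed on the slide (0, 16 and 4)."""
    if flag & 4:
        return "unmapped"
    return "minus" if flag & 16 else "plus"


def is_confident(record):
    """Keep mapped reads with mapping quality > 0 (assumed filter)."""
    return strand(record["flag"]) != "unmapped" and record["mapq"] > 0


# Invented record: a 25-nt read mapped to the minus strand of chr1.
example = ("sample1:12_34_56\t16\tchr1\t1000\t37\t25M\t*\t0\t0\t"
           "ACGTACGTACGTACGTACGTACGTA\t" + "I" * 25 + "\tNM:i:1")
rec = parse_sam_line(example)
print(rec["qname"], strand(rec["flag"]), rec["pos"], rec["tags"])
```

The flag is a bit field, so testing individual bits with `&` also handles combined values such as paired-end flags.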

26 Original RNA read is reverse complement!

27 Alternative splicing => isoforms Re-alignment RNA splicing

28 Re-alignment RNA splicing donor acceptor

29 Y = T or C (pyrimidine) Re-alignment RNA splicing

30 Re-alignment RNA splicing

31 Re-alignment RNA splicing

32 Re-alignment General RNA-seq pipeline

33 Re-alignment General RNA-seq pipeline Database available from ALEXA-Seq

34 General RNA-seq pipeline Sample prep Alignment re-alignment Sequencing QC Whole transcriptome Gene expression profiling microRNA Alignment analysis QC Filtering Quantifying transcript abundance Statistical analysis Applications:

35 Alignment analysis General RNA-seq pipeline 24nt (-l 24 -k 1 -n 100 -o 0 -t 4 -c) 20nt

36 QC filtering General RNA-seq pipeline Mapping quality > 0 Unique mappings BWA default settings

37 Alignment analysis General RNA-seq pipeline Read length vs. Mapping % BWA default settings Unique mappings Mapping qual > 0 Error = 1 - (reads mapping in exons / total reads mapping) Mapping Percentage 1 = 100% Read length

38 General RNA-seq pipeline Sample prep Alignment re-alignment Sequencing QC Whole transcriptome Gene expression profiling microRNA Alignment analysis QC Filtering Quantifying transcript abundance Statistical analysis Applications:

39 Whole transcriptome General RNA-seq pipeline  Total RNA or poly(A) RNA  RNAs: mRNA, tRNA, rRNA, pri-miRNA, snRNA...  Which RNA is polyadenylated?  RNA is fragmented  Enzymatic digestion / physical shearing  Several reads per transcript possible!  Sliding window analysis  Discover new RNAs  Intergenic regions

40 Sliding window analysis - Scaling Whole transcriptome Scaled wt coverage = [mean pos (mean i (coverage ko)) / mean pos (mean i (coverage wt))] × mean i (coverage wt) Plots: mean coverage vs. position, wildtype vs. knock-out (unscaled) and wildtype vs. knock-out (scaled)

41 Sliding window analysis Whole transcriptome Sliding window = hypothetical transcriptional unit Parameters : o Windowsize: 200, 100 and 20 nt o Threshold: log2 mean window (FoldChange) > 2, 1.6, 1.3, 1 o Background: (wt < 10 AND ko < 10)
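
The scaling and sliding-window steps could be sketched as follows. This is only an interpretation of the slides: it assumes per-position mean coverage lists for wildtype and knock-out, scales the wildtype to the knock-out's overall mean, scores each window by the log2 of the mean knock-out/wildtype fold change, and skips windows where both samples have mean coverage below 10. The pseudocount, the direction of the fold change and all function names are assumptions.

```python
import math


def scale_to(reference_cov, target_cov):
    """Scale target coverage so that its overall mean matches the reference."""
    factor = (sum(reference_cov) / len(reference_cov)) / (
        sum(target_cov) / len(target_cov))
    return [c * factor for c in target_cov]


def sliding_windows(wt, ko, size=20, threshold=1.0, background=10.0,
                    pseudocount=1.0):
    """Return (start, log2 mean fold change) for windows above threshold."""
    hits = []
    for start in range(len(wt) - size + 1):
        w = wt[start:start + size]
        k = ko[start:start + size]
        # Background filter: skip windows with low coverage in both samples.
        if sum(w) / size < background and sum(k) / size < background:
            continue
        # Per-position fold change; the pseudocount avoids division by zero.
        fold = [(b + pseudocount) / (a + pseudocount) for a, b in zip(w, k)]
        score = math.log2(sum(fold) / size)
        if score > threshold:
            hits.append((start, score))
    return hits


# Toy coverage: flat background with a 20-nt region enriched in the knock-out.
wt = [100.0] * 200
ko = [100.0] * 90 + [800.0] * 20 + [100.0] * 90
wt_scaled = scale_to(ko, wt)
hits = sliding_windows(wt_scaled, ko, size=20, threshold=1.0)
print(hits[0], hits[-1])
```

Overlapping hits around the enriched region would still need to be merged into one candidate transcriptional unit; that step is left out here.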

42 Sliding window analysis Whole transcriptome + strand- strand Position index

43 Sliding window analysis Whole transcriptome

44 General RNA-seq pipeline Sample prep Alignment re-alignment Sequencing QC Whole transcriptome Gene expression profiling microRNA Alignment analysis QC Filtering Quantifying transcript abundance Statistical analysis Applications:

45 Intermezzo: Post transcriptional modifications 5' Capping of pre-mRNA 3' Polyadenylation of pre-mRNA

46 Intermezzo: 5' Capping - Stability - Export out of nucleus - Promote translation - Splicing mitochondrial and chloroplast mRNA are not capped => not in CAGE

47 Intermezzo: 3' Polyadenylation Nuclear export, translation, stability

48 Gene expression profiling General RNA-seq pipeline Qualitative o Which part of the genome is expressed, in which cells, which mRNA isoforms Quantitative o Compare across conditions, understand biological processes/mechanisms  Tumor vs. Normal tissue  Knock-out vs. wild-type mouse  Changing nutrient conditions in yeast  Etc.

49 Gene expression profiling General RNA-seq pipeline DeepSAGE = Digital Gene Expression SAGE = Serial Analysis of Gene Expression Tag-based: one read per transcript o DeepSAGE -> most 3' CATG o DeepCAGE -> 5' end DE analysis GSEA

50 SAGE Gene expression profiling Magnetic beads capture poly(A) RNA cDNA synthesis with reverse transcriptase (E.coli) NlaIII digestion o Every 250 bp o ~99% human transcripts Adapter A ligation o Complementary overhang o EcoP15I recognition site o PCR primer site (P2) EcoP15I digestion o Asymmetric o 27 bp downstream from adapter A Adapter B ligation o PCR primer site (P1) o Sequencing initiation site

51 SAGE Mapping against Genome Exons Tag library (SAGE Genie) No re-alignment No gene length bias

52 General RNA-seq pipeline Sample prep Alignment re-alignment Sequencing QC Whole transcriptome Gene expression profiling microRNA Alignment analysis QC Filtering Quantifying transcript abundance Statistical analysis Applications:

53 RNA interference siRNA o Silencing through methylation Exogenous Viral dsRNA Endogenous miRNA

54 microRNA Short Mature ~ bases Difficult to map uniquely o 2 miRNAs may differ by 1 base o Adapter removal Reads are cut out of gel at desired length Mapping against miRBase New miRNA (target) discovery / prediction o miRDeep o miRFinder o etc...

55 Adapter (P2) removal Before alignment cut move

56 Adapter (P2) removal During alignment

57 Adapter (P2) removal During alignment

58 Adapter (P2) removal Total number of Read number of

59 General RNA-seq pipeline Sample prep Alignment re-alignment Sequencing QC Whole transcriptome Gene expression profiling microRNA Alignment analysis QC Filtering Quantifying transcript abundance Statistical analysis Applications:

60 Quantifying transcript abundance How many reads? Depends on Sequencing method Sequencing depth Cell type Genes detected Number of aligned reads

61 Quantifying transcript abundance exon ' 3' Exons Alternative transcripts

62 Counts Quantifying transcript abundance 2 wildtype replicates 2 knockout replicates features Total tag count Libraries

63 Normalization Quantifying transcript abundance Technical bias o SOLiD barcodes random spread Sample size bias Gene length bias o Proportion of significant DE genes increases with transcript length o Has in particular implications for the ranking of differentially expressed genes => introduces bias in gene set testing Normalization: o RPKM = Reads per kb per million mapped reads o RPKM = total exon reads / (mapped reads (millions) × exon length (kb))
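
As a quick worked example of the RPKM formula, the sketch below normalizes two hypothetical genes with identical read counts but different exon lengths. The numbers and the function name are invented for illustration.

```python
def rpkm(exon_reads, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    millions = total_mapped_reads / 1e6
    kilobases = exon_length_bp / 1e3
    return exon_reads / (millions * kilobases)


# Same raw count, different exon length, same library size (20 M reads).
short_gene = rpkm(exon_reads=500, exon_length_bp=1_000,
                  total_mapped_reads=20_000_000)
long_gene = rpkm(exon_reads=500, exon_length_bp=5_000,
                 total_mapped_reads=20_000_000)
print(short_gene, long_gene)  # 25.0 and 5.0
```

The five-fold longer gene receives a five-fold lower RPKM for the same count, which is the length correction the slide refers to.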

64 General RNA-seq pipeline Sample prep Alignment re-alignment Sequencing QC Whole transcriptome Gene expression profiling microRNA Alignment analysis QC Filtering Quantifying transcript abundance Statistical analysis Applications:

65 RNA-seq vs. microarrays Statistical analysis RNA-seq o Counts o Absolute abundance of transcript o All transcripts present Microarray o Hybridization signal to complementary probe o Relative abundance o Content limited o Cross-hybridization

66 Count data - Poisson distribution Discrete probability distribution (not continuous) Probability of a number of events occurring in a fixed period of time Events occur with a known average rate and independently of the time since the last event λ = expected number of occurrences k = number of occurrences DESeq/edgeR: negative binomial distribution (related to Poisson)
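
For reference, the Poisson probability of observing k events when λ are expected is P(k) = λ^k e^(-λ) / k!. The sketch below evaluates it and checks numerically that mean and variance both equal λ. The value λ = 4 is arbitrary, and the truncation at k = 60 is an assumption that is harmless for such a small λ.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 4.0
# Truncate the infinite sum; the tail beyond k = 60 is negligible here.
probs = [poisson_pmf(k, lam) for k in range(60)]
mean = sum(k * p for k, p in enumerate(probs))
var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
print(round(mean, 6), round(var, 6))
```

Because the Poisson variance is tied to the mean, it cannot describe replicate-to-replicate variation larger than the mean; the negative binomial adds a dispersion parameter for that purpose.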

67 Multiple Hypothesis testing One test: H 0 : μ 1 = μ 2 H 1 : μ 1 ≠ μ 2 Multiple tests:
                  | H 0 true              | H 0 false
H 0 rejected      | V (false positives)   | S
H 0 not rejected  | U                     | T (false negatives)
"Discoveries" = rejected H 0 (V + S) Sensitivity vs. Specificity o false negatives => not sensitive enough o false positives => not specific enough

68 Multiple Hypothesis testing K tests at level α = 0.05: expect 0.05 * K false discoveries (1 out of 20) K = => 2000

69 Multiple Hypothesis testing Bonferroni (Holm) controls the FWER (familywise error rate) o α' = α/k Benjamini-Hochberg controls the FDR (false discovery rate) o FDR = expected proportion of false positives among all discoveries o FDR = E[V / (V + S)], with V = false discoveries and V + S = total discoveries Controlling the FDR is less conservative than controlling the FWER
                  | H 0 true | H 0 false
H 0 rejected      | V        | S
H 0 not rejected  | U        | T
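
The difference between the two corrections can be seen on a handful of p-values. The sketch below implements plain Bonferroni (not the Holm step-down variant mentioned on the slide) and the Benjamini-Hochberg step-up adjustment. The p-values are invented for illustration.

```python
def bonferroni(p_values):
    """FWER control: multiply each p-value by the number of tests."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


def benjamini_hochberg(p_values):
    """FDR control: adjusted p-values from the BH step-up procedure."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    adjusted = [0.0] * k
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(k, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * k / rank)
        adjusted[i] = running_min
    return adjusted


p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bonf = bonferroni(p)
bh = benjamini_hochberg(p)
print(sum(q < 0.05 for q in bonf), "significant after Bonferroni")
print(sum(q < 0.05 for q in bh), "significant after Benjamini-Hochberg")
```

On this toy input Bonferroni keeps one test and Benjamini-Hochberg keeps two, which matches the slide's point that FDR control is the less conservative of the two.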

70 Differential Expression Condition 1: Wildtype vs. Condition 2: Knockout Statistical testing (DESeq) etc. Multiple Testing! Correcting gives adjusted p-value design=c("wt","wt","d","d")

71 Differential Expression Results: log 2 FC / p-value / padj o Fold change => log 2 FC mean: wt wt ko ko fold change: 121/114 = 1.06, log 2 FC = 0.08 Other way around: 114/121 = 0.94, log 2 FC = -0.08

72 Differential Expression Results: o M vs. A plot (minus vs. average) M = log2(wt) - log2(ko) = log2(wt/ko) = log 2 FC (fold change) A = 0.5 (log2(wt) + log2(ko)) = average expression (baseMean) Axes: M vs. A; highlighted points: padj < 0.05
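
The M and A coordinates can be computed directly from two mean expression values. The sketch below reuses the counts 121 and 114 from the fold-change slide; which of the two belongs to wildtype and which to knock-out is an assumption, and the function name is hypothetical.

```python
import math


def ma_values(wt, ko):
    """Return (M, A): log2 ratio and mean log2 expression of two values."""
    m = math.log2(wt) - math.log2(ko)
    a = 0.5 * (math.log2(wt) + math.log2(ko))
    return m, a


m, a = ma_values(121, 114)          # assumed: wt = 121, ko = 114
m_swapped, a_swapped = ma_values(114, 121)
print(round(m, 3), round(m_swapped, 3), round(a, 3))
```

Swapping the two conditions flips the sign of M and leaves A unchanged, so up- and down-regulation are symmetric around zero on the log scale, unlike the raw ratios 1.06 and 0.94.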

73 Differential Expression Results: o Hierarchical clustering of samples and genes - Heatmap Samples: 9.5 ED, 7.5 ED, 8.5 ED => replicates Top 100 varying genes (features) o F-test / ANOVA o Linear regression o Variance stabilizing transformation (VST)

74 Differential Expression Variance-stabilizing transformation Find a simple function ƒ to apply to values x in a data set to create new values y = ƒ(x) such that the variability of the values y is not related to their mean value
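
A classic textbook instance of such a function is the square root for Poisson counts: the variance of the raw counts grows with the mean, while the variance of 2·sqrt(x) stays close to 1. The sketch below verifies this by summing over the Poisson distribution exactly. It is a simplified stand-in and not the transformation DESeq fits; the function names and the chosen means are assumptions.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability, computed in log space to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def variance_under_poisson(transform, lam, k_max=2000):
    """Variance of transform(X) for X ~ Poisson(lam), summed up to k_max."""
    probs = [poisson_pmf(k, lam) for k in range(k_max)]
    mean = sum(transform(k) * p for k, p in enumerate(probs))
    return sum((transform(k) - mean) ** 2 * p for k, p in enumerate(probs))


for lam in (10, 100, 1000):
    raw = variance_under_poisson(lambda x: x, lam)
    vst = variance_under_poisson(lambda x: 2 * math.sqrt(x), lam)
    print(lam, round(raw, 2), round(vst, 3))
```

After the transformation, highly and lowly expressed genes contribute on a comparable scale, which is what makes clustering and heatmaps of the transformed values meaningful.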

75 Differential Expression Panels: variance stabilizing transformed counts; raw counts

76 Venn diagram Differential Expression Intersection of differentially expressed genes

77 Gene Set Enrichment Analysis (GSEA) edgeR, baySeq, DESeq... o Return a list of differentially expressed genes Biological theory is not about isolated genes o Typical biological research questions and hypotheses are about pathways, biological processes and areas of the genome o That is, about sets of related genes Question o How to analyze RNA-seq data from a gene set perspective?

78 GSEA Which gene sets? Any defined set that has something in common o Pathways  KEGG, Reactome, Biocarta o Gene Ontology terms  Biological process, Molecular function, Cellular component o Chromosomal regions  Chromosome arms, cytobands, linkage peaks, genes o Published gene sets  Predictive signatures, gene lists

79 GSEA H 0 : null hypothesis o Genes in the gene set are as often differentially expressed as genes outside H 1 : alternative hypothesis o Genes in the gene set are more often differentially expressed than genes outside

80 GSEA Scoring gene sets Ordered gene list L Gene set G: {1,3,...100,...}
Gene    | Adjusted p-value | Member of set G
gene 1  | 3.44E-006        | yes
gene 2  | 1.77E-005        | no
gene 3  | 9.92E-005        | yes
...
gene n  | 1                | no

81 GSEA Scoring gene sets Ordered gene list L Gene set G: {1,3,...100,...} Define cut off and count members of G above and below cutoff (= 0.5)
Gene    | Adjusted p-value | Member of set G
gene 1  | 3.44E-006        | yes
gene 2  | 1.77E-005        | no
gene 3  | 9.92E-005        | yes
...
gene n  | 1                | no

82 GSEA Fisher’s exact test Define cutoff (C=100) and count members of G above and below cutoff Fisher’s exact test:
                      | # genes above C | # genes below C | Total
# genes in set G      | 5 (a)           | 0 (b)           | 5 (a + b)
# genes not in set G  | 1 (c)           | 4 (d)           | 5 (c + d)
Total                 | 6 (a + c)       | 4 (b + d)       | 10 (a + b + c + d = n)

83 GSEA Fisher’s exact test P-value = sum of all probabilities ≤ P cutoff
                 | diff. expr. gene | non-diff. expr. gene
in gene set      | 5                | 0
not in gene set  | 1                | 4
= 1 P-value = Reject H 0 => enriched set!
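
The 2x2 table above is small enough to evaluate by hand. The sketch below computes the hypergeometric probability of each table with the same margins and sums them into a one-sided p-value (tables at least as enriched as the observed one) and a two-sided p-value (tables no more probable than the observed one). The function names are hypothetical, and the slide does not say which variant was used.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_exact(a, b, c, d):
    """Return (one-sided, two-sided) p-values for enrichment of cell a."""
    row1, row2, col1 = a + b, c + d, a + c
    observed = table_probability(a, b, c, d)
    one_sided = two_sided = 0.0
    # Enumerate every value of cell a that is compatible with the margins.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if x >= a:
            one_sided += p
        if p <= observed * (1 + 1e-9):  # small tolerance for float ties
            two_sided += p
    return one_sided, two_sided


one_sided, two_sided = fisher_exact(5, 0, 1, 4)
print(round(one_sided, 4), round(two_sided, 4))  # 5/210 and 10/210
```

With only ten genes the observed table is already the most extreme one in its direction, so the one-sided p-value equals the probability of the observed table itself.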

84 GSEA Alternative: Chi-squared test Tool: goseq (R package) o Correction for length bias o Not in SAGE!

85 GSEA Example KEGG: Time series 7.5 ED vs. 9.5 ED Panels: 7.5 ED, 9.5 ED All genes in pathway

86 The End

