Introduction to Bioinformatics - RNA seq -


1 Introduction to Bioinformatics - RNA seq -
Marcel Willemsen

2 Biology intro Central dogma molecular biology
RNA: mRNA, tRNA, miRNA (=> regulation), ... (*)
Proteins: structure, hormones, enzymes
Number of mRNA molecules = measure of gene expression
* Reminder, central dogma: primary transcript => post-transcriptional modifications => mature mRNA => translation (protein biosynthesis on ribosomes)
Abundance of all mRNAs => expression profile

3 General RNA-seq pipeline
Pipeline: Sample prep, Sequencing, QC, Alignment, Re-alignment, Alignment analysis, QC filtering, Statistical analysis
Applications:
Whole transcriptome: identify novel transcripts, novel exon-exon junctions, non-coding RNAs, alternatively spliced transcripts, and expressed SNPs with the stranded assay
Gene expression profiling: SAGE/CAGE
microRNA: regulation
Quantifying transcript abundance
Transcription factors/RNA pol -> not really RNA-seq, but related!


5 Sample preparation General RNA-seq pipeline
Total RNA => poly(A) enrichment (not for miRNA) => fragment (enzymes) or cut out of gel => ligate RNA adapters => reverse transcription (RT) to cDNA
Strand specificity: strand-specific protocol => separate reads by strand => antisense transcription

6 Agarose gel electrophoresis purification General RNA-seq pipeline - Sample prep
Digestion by restriction endonucleases (EcoRI) Cutting out desired length

7 General RNA-seq pipeline
For details on sequencing, refer to the previous lecture by Frank Baas.

8 NGS platforms General RNA-seq pipeline - Sequencing
Short reads: Illumina-Solexa, ABI-SOLiD (Life Technologies)
Long reads: Roche-454
Third generation (direct/single molecule): Helicos, Pacific Biosciences, Oxford Nanopore
Third generation: long reads, de novo, without amplification, ligation or cDNA synthesis!

9 Intermezzo: colorspace General RNA-seq pipeline - QC
A dinucleotide and its reverse complement get the same color.
Thymine / Uracil
[Figure: example sequence A A C G T C A T G C A T C G in colorspace]

10 Intermezzo: colorspace SNP
Reference: A C G T A C G A T
Read with SNP: A C G T T C G A T
A SNP => 2 differences in colorspace!
The first 2 'bases' are often removed.

11 Intermezzo: colorspace Sequencing error
Reference: A C G T A C G A T
Read decoded after one color error: A C G T T G C T A (all bases after the error change)
A sequencing error => only one difference in colorspace
Translate the reference to colorspace during alignment!
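The colorspace intermezzo can be reproduced with a few lines of Python. This is a minimal sketch, assuming the usual two-base encoding in which each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The function names are illustrative, not part of any SOLiD tool.

```python
# Two-base (colorspace) encoding sketch; base codes are an assumed convention.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def to_colorspace(seq):
    """Encode a base sequence as one color (0-3) per pair of adjacent bases."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def from_colorspace(first_base, colors):
    """Decode colors back to bases, starting from a known first base."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


def count_differences(x, y):
    """Number of positions at which two equally long sequences differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


reference = "ACGTACGAT"
snp_read = "ACGTTCGAT"          # single-base SNP from the slide

ref_colors = to_colorspace(reference)
snp_colors = to_colorspace(snp_read)

# A real SNP changes two adjacent colors.
snp_color_diffs = count_differences(ref_colors, snp_colors)

# A sequencing error changes one color only (here the 4th color is miscalled).
error_colors = list(ref_colors)
error_colors[3] = 0
error_color_diffs = count_differences(ref_colors, error_colors)

# Decoding the erroneous colors naively corrupts every base after the error,
# which is why the reference is translated to colorspace for alignment.
error_decoded = from_colorspace("A", error_colors)

print(ref_colors, snp_colors, snp_color_diffs, error_color_diffs, error_decoded)
```

A dinucleotide and its reverse complement also encode to the same color under this scheme, for example `to_colorspace("AC") == to_colorspace("GT")`.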

12 General RNAseq pipeline

13 Quality Control General RNAseq pipeline - QC
QC tools -> FastQC, ... Input: Fastq file Output: Quality report

14 Quality Control General RNAseq pipeline - QC
Per base sequence quality
The longer the reads, the more the quality drops (toward the end of the read)

15 Quality Control General RNAseq pipeline - QC
Per sequence quality scores average quality for whole read
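The two quality views above (per base and per sequence) can be sketched directly from FASTQ quality strings. This is only an illustration, assuming Phred+33 encoding; it is not how FastQC itself is implemented, and the quality strings are made up.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def per_base_mean_quality(quality_strings):
    """Mean Phred score at each read position, across all reads."""
    max_len = max(len(q) for q in quality_strings)
    means = []
    for pos in range(max_len):
        scores = [phred_scores(q)[pos] for q in quality_strings if len(q) > pos]
        means.append(sum(scores) / len(scores))
    return means


def per_sequence_mean_quality(quality_strings):
    """Average Phred score of each whole read."""
    return [sum(phred_scores(q)) / len(q) for q in quality_strings]


# Made-up quality strings; quality drops toward the end of each read.
example_qualities = ["IIIIH5", "IIII5+", "IIIH+!"]

per_base = per_base_mean_quality(example_qualities)
per_read = per_sequence_mean_quality(example_qualities)
print(per_base)
print(per_read)
```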

16 Quality Control General RNAseq pipeline - QC
Per base sequence content
FASTQ input is double encoded!

17 Quality Control General RNAseq pipeline - QC
Per base GC content

18 Quality Control General RNAseq pipeline - QC
Per sequence GC content

19 General RNAseq pipeline

20 Alignment General RNAseq pipeline
Tool: IGV Integrative Genomics Viewer (Broad Institute) Coverage Direction Annotation: Ref. seq genes

21 Coverage General RNA-seq pipeline
[Figure: per-position coverage, e.g. 3 4 3 2 3 2]
The more samples, the less coverage (per sample)!
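Coverage is simply the number of reads overlapping each position. A minimal sketch, assuming reads are given as hypothetical 0-based, end-exclusive intervals on a single reference:

```python
def coverage(read_intervals, region_length):
    """Count how many reads cover each position of a region."""
    depth = [0] * region_length
    for start, end in read_intervals:
        # Clip each read to the region before counting.
        for pos in range(max(start, 0), min(end, region_length)):
            depth[pos] += 1
    return depth


# Made-up alignments: (start, end) with end exclusive.
reads = [(0, 4), (1, 5), (2, 6), (1, 3)]
depth = coverage(reads, 6)
print(depth)
```

With a fixed number of reads per run, splitting the run over more samples leaves fewer reads, and therefore lower depth, for each sample.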

22 Alignment General RNAseq pipeline - SAGE sample
Position: chromosome position!
Introns: eukaryote => introns

23 Alignment General RNAseq pipeline - SAGE sample
[Figure labels: start and stop positions, mismatches]
Minus-strand reads are reversed!
Seeding (-l 25 -k 1 -n 100)

24 SAM/BAM Format General RNA-seq pipeline
Header: reference name, reference length
Alignment fields:
query name = sample name:bead coordinates
strand (flag): 0 = plus, 16 = minus, 4 = no match
leftmost position: 5' for plus strands, 3' for minus strands
mapping quality (Phred scaled)
CIGAR string
query sequence, on the same strand as the reference

25 SAM/BAM Format General RNA-seq pipeline
mate info (mate pair seq)
query quality (encoded): ASCII code - 33 gives the Phred base quality
optional fields: TAG:TYPE:VALUE
XA: alternative mapping locations

26 Original RNA read is reverse complement!
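The SAM fields described above can be pulled apart with plain string handling. The sketch below assumes a tab-separated SAM alignment line with the 11 mandatory columns followed by optional TAG:TYPE:VALUE fields; the example record is made up, and real pipelines would normally use a dedicated SAM/BAM library instead.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def parse_sam_line(line):
    """Parse one SAM alignment line into a dictionary (minimal sketch)."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    strand = "-" if flag & 16 else "+"
    record = {
        "qname": fields[0],
        "flag": flag,
        "unmapped": bool(flag & 4),
        "strand": strand,
        "rname": fields[2],
        "pos": int(fields[3]),          # 1-based leftmost position
        "mapq": int(fields[4]),         # Phred-scaled mapping quality
        "cigar": fields[5],
        "seq": fields[9],               # stored on the reference strand
        "qual": [ord(ch) - 33 for ch in fields[10]],
        "tags": {},
    }
    for optional in fields[11:]:
        tag, value_type, value = optional.split(":", 2)
        record["tags"][tag] = (value_type, value)
    # For minus-strand alignments the original read is the reverse complement.
    record["original_read"] = (
        reverse_complement(record["seq"]) if strand == "-" else record["seq"]
    )
    return record


# Hypothetical record: a minus-strand alignment with one optional tag.
example = "sample1:12_34_56\t16\tchr1\t1000\t37\t6M\t*\t0\t0\tACGTTG\tIIIIII\tNM:i:0"
record = parse_sam_line(example)
print(record["strand"], record["original_read"], record["tags"])
```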

27 Re-alignment RNA splicing
Alternative splicing => isoforms

28 Re-alignment RNA splicing
5' splice donor: GU
3' splice acceptor: AG

29 Re-alignment RNA splicing
Y = T or C (pyrimidine, IUPAC code)
Covalent bond

30 Re-alignment RNA splicing
oxygen of hydroxyl attacks phosphate group

31 Re-alignment RNA splicing

32 Re-alignment General RNA-seq pipeline

33 Re-alignment General RNA-seq pipeline
Database available from ALEXA-Seq

34 General RNA-seq pipeline

35 Alignment analysis General RNA-seq pipeline
24nt (-l 24 -k 1 -n 100 -o 0 -t 4 -c)

36 QC filtering General RNA-seq pipeline
BWA default settings Unique mappings Mapping quality > 0

37 Alignment analysis General RNA-seq pipeline
Mapping percentage (1 = 100%) vs. read length
BWA default settings, unique mappings, mapping quality > 0
Define the optimum read length:
Error = 1 - (reads mapping in exons / total reads mapping)

38 General RNA-seq pipeline

39 Whole transcriptome General RNA-seq pipeline
Input: total RNA or poly(A) RNA
RNAs: mRNA, tRNA, rRNA, pri-miRNA, snRNA (small nuclear RNA, regulation) ...
Which RNA is polyadenylated?
RNA is fragmented: enzymatic digestion / physical shearing
Several reads per transcript possible!
Sliding window analysis: discover new RNAs, intergenic regions

40 Sliding window analysis - Scaling Whole transcriptome
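Mean coverage by position, wildtype vs. knockout, and wildtype vs. knockout scaled (running mean)
Scaling: $\text{scaled coverage}_{wt} = \dfrac{\text{mean}_{pos}(\text{mean}_i(\text{coverage}_{ko}))}{\text{mean}_{pos}(\text{mean}_i(\text{coverage}_{wt}))} \times \text{mean}_i(\text{coverage}_{wt})$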

41 Sliding window analysis Whole transcriptome
Sliding window = hypothetical transcriptional unit
Parameters:
Window size: 200, 100 and 20 nt
Threshold: log2 mean_window(FoldChange) > 2, 1.6, 1.3, 1
Background: (wt < 10 AND ko < 10)
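The sliding window parameters above can be turned into a small scan over per-position coverage. This is a rough sketch under stated assumptions: fold change is taken as knockout over wildtype, a pseudocount of 1 avoids division by zero, and background positions are skipped within each window. None of these choices is confirmed by the slides, and the coverage values are made up.

```python
import math


def sliding_window_hits(wt, ko, window_size, log2_threshold,
                        background=10, pseudocount=1.0):
    """Return window start positions whose log2 mean fold change exceeds the threshold."""
    hits = []
    for start in range(len(wt) - window_size + 1):
        fold_changes = []
        window = zip(wt[start:start + window_size], ko[start:start + window_size])
        for wt_cov, ko_cov in window:
            if wt_cov < background and ko_cov < background:
                continue  # background position: too little signal in both samples
            fold_changes.append((ko_cov + pseudocount) / (wt_cov + pseudocount))
        if not fold_changes:
            continue  # window consists of background only
        mean_fold_change = sum(fold_changes) / len(fold_changes)
        if math.log2(mean_fold_change) > log2_threshold:
            hits.append(start)
    return hits


# Made-up coverage: a 4 nt region that is roughly 10x higher in the knockout.
wt_coverage = [5] * 4 + [20] * 4 + [5] * 4
ko_coverage = [5] * 4 + [200] * 4 + [5] * 4

hits = sliding_window_hits(wt_coverage, ko_coverage, window_size=4, log2_threshold=2)
print(hits)
```

In practice the scan would be repeated for each window size and threshold listed on the slide, and per strand.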

42 Sliding window analysis Whole transcriptome
[Figure: + strand and - strand; x-axis: position index]

43 Sliding window analysis Whole transcriptome

44 General RNA-seq pipeline
Break!

45 Intermezzo: Post transcriptional modifications
5' capping of pre-mRNA
3' polyadenylation of pre-mRNA
(Post-transcriptional modifications are discussed for a good understanding of the rest of the lecture.)

46 Intermezzo: 5' Capping
Guanosine triphosphate + RNA => cap + pyrophosphate (5'-5' triphosphate bridge)
Functions of the 5' cap in eukaryotes:
- Stability (protection against nucleases)
- Export out of the nucleus
- Promote translation
- Splicing (in the nucleus)
CAGE => 5' cap used to isolate RNA
Mitochondrial and chloroplast mRNA are not capped => not in CAGE

47 Intermezzo: 3' Polyadenylation
3' hydroxyl (OH), polyadenylation signal
Functions: nuclear export, translation, stability
The poly(A) tail is shortened over time and, when it is short enough, the mRNA is enzymatically degraded

48 Gene expression profiling General RNA-seq pipeline
Qualitative: which part of the genome is expressed, in which cells, which mRNA isoforms
Quantitative: compare across conditions, understand biological processes/mechanisms
Examples: tumor vs. normal tissue, knock-out vs. wild-type mouse, changing nutrient conditions in yeast, etc.

49 Gene expression profiling General RNA-seq pipeline
SAGE = Serial Analysis of Gene Expression
DeepSAGE = Digital Gene Expression
Tag-based: one read per transcript
DeepSAGE -> most 3' CATG
DeepCAGE -> 5' end
DE analysis, GSEA

50 SAGE Gene expression profiling
Magnetic beads capture poly(A) RNA
cDNA synthesis with reverse transcriptase (E. coli)
NlaIII digestion: site every ~250 bp, in ~99% of human transcripts
Adapter A ligation: complementary overhang, EcoP15I recognition site, PCR primer site (P2)
EcoP15I digestion: asymmetric, 27 bp downstream from adapter A
Adapter B ligation: PCR primer site (P1), sequencing initiation site
Emulsion PCR -> sequencing

51 SAGE Mapping
Mapping against: genome, exons, tag library (SAGE Genie)
No re-alignment
No gene length bias

52 General RNA-seq pipeline

53 RNA interference
Exogenous: viral dsRNA => siRNA
Endogenous: miRNA (primary miRNA => precursor miRNA)
Dicer cuts double-stranded RNA into small fragments (siRNA)
Guide strand / target strand
RNA-induced silencing complex (RISC)
Silencing through methylation

54 microRNA
Short: mature ~ bases
Difficult to map uniquely: 2 miRNAs may differ by 1 base
Reads are cut out of gel at desired length
Adapter removal
Mapping against miRBase
New miRNA (target) discovery / prediction: miRDeep, miRFinder, etc.

55 Adapter (P2) removal Before alignment
Move the adapter along the read and count the mismatches at each position (red = mismatch)
Look for the minimum and cut the read there!
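The "move, count mismatches, cut at the minimum" idea can be sketched as below. The adapter sequence, the minimum overlap of 3 nt and the 20% mismatch cutoff are all hypothetical choices for illustration, not values from the slides or from any specific trimming tool.

```python
def best_adapter_position(read, adapter, min_overlap=3):
    """Slide the adapter along the read; return (position, mismatch rate) of the best fit."""
    best_pos, best_rate = len(read), 1.0
    for pos in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - pos)
        mismatches = sum(1 for i in range(overlap) if read[pos + i] != adapter[i])
        rate = mismatches / overlap
        if rate < best_rate:  # keep the first position with the lowest rate
            best_pos, best_rate = pos, rate
    return best_pos, best_rate


def trim_adapter(read, adapter, max_mismatch_rate=0.2):
    """Cut the read at the best adapter position if the match is good enough."""
    pos, rate = best_adapter_position(read, adapter)
    return read[:pos] if rate <= max_mismatch_rate else read


# Hypothetical adapter and a 22 nt insert followed by the first 8 adapter bases.
adapter = "CGCCTTGGCCGT"
insert = "TGAGGTAGTAGGTTGTATAGTT"
read = insert + adapter[:8]

trimmed = trim_adapter(read, adapter)
print(trimmed)
```

Normalizing by the overlap length keeps partial adapters at the read end comparable with full-length matches.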

56 Adapter (P2) removal During alignment

57 Adapter (P2) removal During alignment

58 Adapter (P2) removal Total number of Read number of

59 General RNA-seq pipeline

60 Quantifying transcript abundance
How many reads? Depends on: sequencing method, sequencing depth, cell type
[Figure: genes detected vs. number of aligned reads; relative abundance; microarrays]

61 Quantifying transcript abundance
[Figure: exons and alternative transcripts, 5' to 3'; three exons labeled +6/-2, +4/-1, +4/-0]

62 Counts Quantifying transcript abundance
Libraries: 2 wildtype replicates, 2 knockout replicates; total tag count per library
The more replicates, the more power in your statistical test and the more chance of finding significant differentially expressed genes
Features: miRNAs, genes, intergenic regions ...
Watch counts on plus strand features

63 Normalization Quantifying transcript abundance
Technical bias: SOLiD barcodes, random spread
Sample size bias
Gene length bias: the proportion of significant DE genes increases with transcript length. This has implications in particular for the ranking of differentially expressed genes => introduces bias in gene set testing
Normalization: RPKM = reads per kb per million mapped reads
$\text{RPKM} = \dfrac{\text{total exon reads}}{\text{mapped reads (millions)} \times \text{exon length (kb)}}$
Positive association between gene counts and length: "It is expected from the mRNA-Seq assay that longer transcripts contribute more 'sequencible' fragments than shorter ones expressed at the same level. There is clearly a positive association between gene counts and length, an association that is not entirely removed via scaling by gene length, as in the RPKM of [7] ([Additional file 1: Supplemental Figure S4]). This suggests either higher expression among longer genes or non-linear dependence of gene counts on length"
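As a worked example of the RPKM definition above, here is a small sketch. The function name and the numbers are made up for illustration.

```python
def rpkm(exon_reads, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon per million mapped reads."""
    mapped_millions = total_mapped_reads / 1e6
    exon_length_kb = exon_length_bp / 1e3
    return exon_reads / (mapped_millions * exon_length_kb)


# Made-up numbers: 500 reads on 2 kb of exon, 10 million mapped reads in the library.
value = rpkm(exon_reads=500, total_mapped_reads=10_000_000, exon_length_bp=2_000)
print(value)

# The same 500 reads on a gene twice as long give half the RPKM.
longer_gene = rpkm(exon_reads=500, total_mapped_reads=10_000_000, exon_length_bp=4_000)
print(longer_gene)
```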

64 General RNA-seq pipeline

65 RNA-seq vs. microarrays Statistical analysis
RNA-seq: counts; absolute abundance of transcript; all transcripts present
Microarray: hybridization signal to complementary probe; relative abundance; cross-hybridization; content limited to the probes on the array

66 Count data - Poisson distribution
Discrete probability distribution (integers, counts), not continuous
Probability of a number of events occurring in a fixed period of time
Events occur with a known average rate and independently of the time since the last event
$P(k; \lambda) = \dfrac{\lambda^k e^{-\lambda}}{k!}$, with λ = expected number of occurrences, k = number of occurrences
DESeq/edgeR: negative binomial distribution (related to Poisson)
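The Poisson probability for a given count can be evaluated directly from λ and k. A minimal sketch using only the standard library; the example rate is made up.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when lam events are expected."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)


# Made-up example: a feature with an expected count of 2 reads.
expected = 2.0
probabilities = [poisson_pmf(k, expected) for k in range(6)]
print(probabilities)

# For a Poisson distribution the variance equals the mean; RNA-seq replicates
# usually vary more than that, which is why DESeq/edgeR use the negative binomial.
mean = sum(k * poisson_pmf(k, expected) for k in range(100))
variance = sum((k - mean) ** 2 * poisson_pmf(k, expected) for k in range(100))
print(mean, variance)
```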

67 Multiple Hypothesis testing
One test: H0: μ1 = μ2, H1: μ1 ≠ μ2
Multiple tests:
                   H0 true               H0 false
H0 rejected        V (false positives)   S
H0 not rejected    U                     T (false negatives)
Rejected hypotheses = "discoveries"
Sensitivity vs. specificity: false negatives => not sensitive enough; false positives => not specific enough

68 Multiple Hypothesis testing
K tests at level α = 0.05. Expect 0.05 * K false discoveries 1 out of 20 K = => 2000

69 Multiple Hypothesis testing
                   H0 true   H0 false
H0 rejected        V         S
H0 not rejected    U         T
Bonferroni (Holm) controls the FWER (familywise error rate): α’ = α/k, FWER = P(V ≥ 1)
Benjamini–Hochberg controls the FDR (false discovery rate), the expected proportion of false positives: FDR = E[V / (V + S)], with V = false discoveries and V + S = total discoveries
FDR is less conservative than FWER
E = expected value, P = probability
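Both corrections can be written as adjusted p-values in a few lines. This is a sketch of the textbook Bonferroni and Benjamini-Hochberg procedures in plain Python, not the code used by DESeq; the p-values are made up.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the FDR)."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    adjusted = [0.0] * k
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(k, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * k / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values from four tests.
p_values = [0.01, 0.04, 0.03, 0.20]
fwer_adjusted = bonferroni(p_values)
fdr_adjusted = benjamini_hochberg(p_values)
print(fwer_adjusted)
print(fdr_adjusted)
```

At a cutoff of 0.05, Bonferroni keeps only the first test here, which illustrates why FWER control is the more conservative of the two.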

70 Differential Expression
Condition 1: wildtype vs. condition 2: knockout
design=c("wt","wt","d","d")
Statistical testing (DESeq)
Libraries: the more replicates, the more power in your statistical test and the more chance of finding significant differentially expressed genes
Features: miRNAs, genes, intergenic regions ...
Watch counts on plus strand etc.
Multiple testing! Correcting gives the adjusted p-value

71 Differential Expression
Results: log2FC / p-value / padj
Fold change => log2FC
Example (replicates wt, wt, ko, ko; group means 114 and 121):
fold change: 121/114 = 1.06, log2 = 0.08
Other way around: 114/121 = 0.94, log2 = -0.08
Back-transform: 2^0.08 = 1.06

72 Differential Expression
Results: M vs. A plot (minus vs. average)
M = log2 wt - log2 ko = log2(wt/ko) = log2 fold change (log2FC)
A = 0.5 (log2 wt + log2 ko) = average expression (baseMean)
[Figure: M (y-axis) vs. A (x-axis); padj < 0.05 highlighted]
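The M and A values follow directly from the two group means. A small sketch; the means 121 and 114 are borrowed from the fold-change example on the previous slide, and which of them is wildtype is an assumption made here.

```python
import math


def ma_values(wt_mean, ko_mean):
    """Return (M, A): log2 ratio and average log2 expression of two group means."""
    m = math.log2(wt_mean) - math.log2(ko_mean)
    a = 0.5 * (math.log2(wt_mean) + math.log2(ko_mean))
    return m, a


m_value, a_value = ma_values(wt_mean=121, ko_mean=114)
print(m_value, a_value)

# Swapping the groups flips the sign of M but leaves A unchanged.
m_swapped, a_swapped = ma_values(wt_mean=114, ko_mean=121)
print(m_swapped, a_swapped)
```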

73 Differential Expression
Results: hierarchical clustering of samples and genes (heatmap)
Samples: 7.5 ED, 8.5 ED, 9.5 ED => replicates
Refer to Perry for details on clustering/distance measures
Relation between variance and mean
Top 100 varying genes (features): F-test / ANOVA, linear regression
Variance stabilizing transformation (VST)

74 Differential Expression
Variance-stabilizing transformation: find a simple function ƒ to apply to values x in a data set to create new values y = ƒ(x) such that the variability of the values y is not related to their mean value
Binomial: K ~ B(n, p), the probability of getting exactly k successes in n trials
Count data: the sample variance exceeds the sample mean. In such cases, the observations are overdispersed with respect to a Poisson distribution

75 Differential Expression
Variance stabilizing transformed Raw counts

76 Venn diagram Differential Expression
Intersection of differentially expressed genes

77 Gene Set Enrichment Analysis (GSEA)
edgeR, baySeq, DESeq ... return a list of differentially expressed genes
Biological theory is not about isolated genes
Typical biological research questions and hypotheses are about:
- pathways
- biological processes
- areas of the genome
- sets of related genes
Question: how to analyze RNA-seq data from a gene set perspective?

78 GSEA Which gene sets? Any defined set that has something in common
Pathways: KEGG, Reactome, Biocarta
Gene Ontology terms: biological process, molecular function, cellular component
Chromosomal regions: chromosome arms, cytobands, linkage peaks, genes
Published gene sets: predictive signatures, gene lists

79 GSEA H0: null hypothesis
H0: genes in the gene set are as often differentially expressed as genes outside
H1 (alternative hypothesis): genes in the gene set are more often differentially expressed than genes outside

80 GSEA Scoring gene sets Gene set G: {1,3,...100,...}
Ordered gene list L:
Gene       Adjusted p-value   Member of set G
gene 1     3.44E-006          yes
gene 2     1.77E-005          no
gene 3     9.92E-005          yes
...
gene 100   0.49               yes
gene 101   0.51
gene n     1

81 GSEA Scoring gene sets Gene set G: {1,3,...100,...}
Ordered gene list L:
Gene       Adjusted p-value   Member of set G
gene 1     3.44E-006          yes
gene 2     1.77E-005          no
gene 3     9.92E-005          yes
...
gene 100   0.49               yes
gene 101   0.51
gene n     1
Define a cutoff (= 0.5) and count the members of G above and below the cutoff

82 GSEA Fisher’s exact test
Define cutoff (C=100) and count members of G above and below cutoff
Fisher’s exact test:
                       # genes above C   # genes below C   Total
# genes in set G       5 (a)             0 (b)             5 (a + b)
# genes not in set G   1 (c)             4 (d)             5 (c + d)
Total                  6 (a + c)         4 (b + d)         10 (n)
n = a + b + c + d

83 GSEA Fisher’s exact test
P-value = sum of all probabilities ≤ Pcutoff
                  diff. expr. gene   non-diff. expr. gene
in gene set       5                  0
not in gene set   1                  4
P-value =
Reject H0 => enriched set!
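The test on the 2x2 table above can be worked through with the hypergeometric probability of each possible table. The sketch below sums the probabilities of all tables that are at most as likely as the observed one (a two-sided test); whether the lecture used the one-sided or two-sided variant is not clear from the slides, so the value it prints is this sketch's result, not the number from the original slide.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of a 2x2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_exact_two_sided(a, b, c, d):
    """Sum the probabilities of all tables no more likely than the observed one."""
    p_observed = table_probability(a, b, c, d)
    row1, row2, col1 = a + b, c + d, a + c
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if p <= p_observed * (1 + 1e-9):  # small tolerance for float comparison
            p_value += p
    return p_value


# Table from the slide: a = 5, b = 0, c = 1, d = 4.
p_observed = table_probability(5, 0, 1, 4)
p_two_sided = fisher_exact_two_sided(5, 0, 1, 4)
print(p_observed, p_two_sided)
```

For this table the observed configuration has probability 5/210, and the only equally extreme table is the mirror image, giving 10/210 (about 0.048) for the two-sided test.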

84 GSEA
Alternative: Chi2 test
Tool: goseq (R package), with correction for length bias (not in SAGE!)
Disease of the myocardium (the muscle of the heart)

85 GSEA Example KEGG: Time series 7.5 ED vs. 9.5 ED 7.5 ED 9.5 ED
All genes in pathway

86 The End

