6 Month Allelic Series RNAseq QC 1

QC summary 2
QC was performed on all 192 samples, focusing on identifying failed or outlier samples. Four samples are recommended for omission from the final analysis dataset based on evidence of RNA degradation, PCA, and model-based gene outlier detection. Those four samples can be found on slide 19.
Additionally, two correctable issues were identified. First, one flowcell's worth of samples was run an additional time to bring the read depth up to the required 100 million. This re-run was inadvertently run as 75-mers instead of 50-mers, so those samples contain a mix of read lengths. Second, a subset of cortex samples (Q92 and Q140) appears to contain a very small but detectable amount of liver tissue. The overall dilution is x, but given the extraordinary sensitivity of RNAseq this is still measurable. We have recommended a simple filter to remove those liver transcripts based on their recognizable correlation pattern (listed on slide 29), but other methods may be more sensitive.

How does CHDI QC RNAseq data in general? 3
Mostly we’re looking for outliers
Also showing overall experiment worked
When we find outliers, we try to determine the cause
–That helps show it is an outlier and not part of the biology
Methods
–Principal Components Analysis
–RNA degradation plots
–Paired end insert size
–Read lengths
–Read mapping efficiency
–Repetitive sequences and their origin
–Highly expressed genes
–# gene outliers

PCA whole dataset 4
Not surprisingly, tissues cluster
Strong sex effect in liver
Cortex is tightly clustered
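The deck does not say which tool produced the PCA plots. As a rough illustration only, the sketch below computes first-principal-component scores with plain power iteration on a tiny, made-up expression matrix. The function name `first_pc_scores` and all values are hypothetical; a real analysis would more likely use an established statistics package.

```python
import math
import random


def first_pc_scores(matrix, iterations=100, seed=0):
    """Return each sample's score on the first principal component.

    matrix: list of samples, each a list of log-scale expression values
    (rows = samples, columns = genes).
    """
    n_samples, n_genes = len(matrix), len(matrix[0])
    # Center each gene (column) on its mean across samples.
    means = [sum(row[g] for row in matrix) / n_samples for g in range(n_genes)]
    centered = [[row[g] - means[g] for g in range(n_genes)] for row in matrix]

    # Power iteration on X^T X finds the leading direction in gene space.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(n_genes)]
    for _ in range(iterations):
        proj = [sum(x * w for x, w in zip(row, v)) for row in centered]  # X v
        v = [sum(centered[s][g] * proj[s] for s in range(n_samples))
             for g in range(n_genes)]                                   # X^T (X v)
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]


# Hypothetical toy data: two "tissues" with opposite expression patterns.
tissue_a = [[8.0, 7.5, 1.0, 1.2], [8.2, 7.4, 0.9, 1.1], [7.9, 7.6, 1.1, 1.0]]
tissue_b = [[1.0, 1.1, 9.0, 8.5], [1.2, 0.9, 9.1, 8.4], [0.9, 1.0, 8.9, 8.6]]
scores = first_pc_scores(tissue_a + tissue_b)
for label, score in zip(["A1", "A2", "A3", "B1", "B2", "B3"], scores):
    print(label, round(score, 2))
```

In this toy case the two tissues land on opposite sides of zero on the first component, which is the kind of separation the slide describes. The sign of a principal component is arbitrary, so only the grouping is meaningful.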

PCA striatum 5
Q lengths cluster, a good sign the design worked
Q92, 111, 140, 175 uniquely cluster
They even stagger in Q length order
Couple of potential outliers (in red outline)
30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped

PCA cortex 6
Only Q175 stands outside the main cluster
Possible Q175 outliers, but hard to be certain
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped

PCA liver 7
Strong sex clustering will need to be accounted for
No strong Q clusters (sex masking?)
One potential outlier
450_Liver_Q175_HET_M_L8.LB1_1.clipped

Duplication in brain (representative examples) 8
Duplication is consistent and hovers between 13% and 24%
No red flags
Generally higher in striatum than cortex
Origin of the majority of the duplicated sequences is mitochondrial

Liver duplication (representative examples) 9
Liver duplication is much higher, 40-50%
Major duplicated sequences are all mouse major urinary proteins (Mup1-21), the pheromone-binding proteins
Hurts our true read depth, but nothing terrible
Should keep in mind for future liver work

5’ -> 3’ degradation charts (representative examples) 10
Displays the likelihood of getting full length transcripts for various mRNA lengths
Very high quality samples in general
Most samples show >70% of all mRNA molecules are >80% complete
Liver on average more degraded
Some samples have degradation in the longer mRNA species (one marked in red)

Suspect samples by 5’ -> 3’ degradation
_Liver_Q175_HET_M_L3._1.clipped
456_Liver_Q175_HET_M_L7.LB4_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
845_Liver_Q92_HET_F_L6.LB25_1.clipped
522_Liver_Q111_HET_F_L8.LB13_1.clipped
452_Liver_Q175_HET_M_L1.LB2_1.clipped
776_Liver_Q140_HET_F_L8.LB13_1.clipped
716_Liver_Q80_HET_F_L7.LB6_1.clipped
843_Liver_Q92_HET_F_L6.LB23_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped

GC content per read has a red flag 12
8 of the samples have a “shoulder” in the GC# chart
This is usually a really bad thing
Suggests non-mouse or non-biological sequence

Those same samples flag for read length as well 13
Those same samples have a mix of 50mer reads and 75mer reads
That’s very odd
At this point we asked our sequencing lab for clarification on what happened
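A mixed read-length sample like the ones described here can be spotted by tallying sequence lengths in the FASTQ file. The sketch below is only an illustration. It assumes standard 4-line FASTQ records, and the helper names and the 1% threshold are hypothetical choices, not something the deck specifies.

```python
from collections import Counter


def read_length_counts(fastq_lines):
    """Count read lengths in FASTQ text given as an iterable of lines."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence is the second line of each 4-line record
            counts[len(line.strip())] += 1
    return counts


def has_mixed_lengths(counts, min_fraction=0.01):
    """True if more than one read length makes up at least min_fraction of reads."""
    total = sum(counts.values())
    common = [length for length, n in counts.items() if n / total >= min_fraction]
    return len(common) > 1


# Hypothetical toy FASTQ: three 50-base reads and one 75-base read.
toy_fastq = []
for name, length in [("r1", 50), ("r2", 50), ("r3", 50), ("r4", 75)]:
    toy_fastq += ["@" + name, "A" * length, "+", "I" * length]

length_counts = read_length_counts(toy_fastq)
print(dict(length_counts), has_mixed_lengths(length_counts))
```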

Our sequencing partner found the cause
The 8 suspect samples (sample, virtual run)
_1_450L_cortex_Q175_HET_M    V02604VIRT
_1_528L_cortex_Q111_HET_F    V02604VIRT
_1_644L_cortex_Q20_HET_M     V02604VIRT
_1_844L_striatum_Q92_WT_F    V02604VIRT
_1_845L_cortex_Q92_HET_F     V02761VIRT
_1_847L_cortex_Q92_HET_F     V02761VIRT
_Liver_Q175_HET_M            V02761VIRT
_Liver_Q175_HET_F            V02761VIRT4
For these 8 samples, the initial run didn’t reach the full 100 million reads. When that happens, the lab runs the samples again and merges the runs into a “virtual run” at the full read depth we paid for. That’s all good. The strange thing this time was that the run they added our 8 samples to (they add them to ongoing flow cells) happened to be a 75mer run. Again, this is usually no big problem: they clip off 25 bases in their processing and everything is compatible. This time they forgot to trim, so we saw the ugly intermediate state. What this means is that the data for those 8 samples are fine. The reads are longer, but they are still good reads from our samples.
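The missing trimming step amounts to clipping every read, and its quality string, back to 50 bases. The sketch below assumes the extra 25 bases are removed from the 3’ end, which the slide does not state. The function name `trim_fastq_record` is hypothetical and this is not the sequencing lab's actual pipeline.

```python
def trim_fastq_record(record, length=50):
    """Clip one FASTQ record (header, sequence, separator, quality) to `length` bases.

    Assumption: bases are removed from the 3' end. Reads already at or
    below `length` are returned unchanged.
    """
    header, seq, plus, qual = record
    return (header, seq[:length], plus, qual[:length])


# Hypothetical records: one 75mer from the re-run and one 50mer from the first run.
long_read = ("@rerun_read", "ACGTA" * 15, "+", "I" * 75)
short_read = ("@first_run_read", "ACGTA" * 10, "+", "I" * 50)

trimmed_long = trim_fastq_record(long_read)
trimmed_short = trim_fastq_record(short_read)
print(len(trimmed_long[1]), len(trimmed_short[1]))
```

Trimming the quality string together with the sequence keeps each record valid for downstream aligners.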

Mitochondrial rate in brain
% of the reads are mtRNA
Nothing surprising there

Mitochondrial rate in liver
% of reads are mtRNA
Again in line with expectations

Other QC parameters that looked great 17
Insert sizes: All right around 175 as expected
Sense/antisense sequence ratio: 1:1 as expected
Sequence coverage
–40% of mouse transcriptome detected in brain
–About 30% of mouse transcriptome detected in liver
Mapped read rate in the upper 90s
–98% for brain, 96% for liver
95-97% of our reads are mapped to known genes
–3-5% intergenic regions

Model based outlier detection 18
Method by which we count the number of genes that are outliers in each sample after accounting for our modeled effects
2 samples stand out, and an additional 4-6 are suspect, but probably OK (Q92 Het males)
30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
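The deck does not describe the model behind this count, so the following is only a sketch of the general idea under stated assumptions. The group median stands in for the fitted model effect, residuals are scaled by the median absolute deviation, and the cutoff of 3 is arbitrary. All names and values are hypothetical.

```python
import statistics


def count_gene_outliers(expr, groups, z_cutoff=3.0):
    """Count, per sample, the genes whose residual is an outlier.

    expr:   {gene: {sample: expression value}}
    groups: {sample: group label}, e.g. genotype; this is the modeled effect here
    """
    counts = {sample: 0 for sample in groups}
    for gene, values in expr.items():
        # Stand-in for a fitted model: the median of each group.
        by_group = {}
        for sample, value in values.items():
            by_group.setdefault(groups[sample], []).append(value)
        fitted = {g: statistics.median(vs) for g, vs in by_group.items()}
        resid = {s: v - fitted[groups[s]] for s, v in values.items()}

        # Robust scale estimate from the median absolute deviation (MAD).
        center = statistics.median(resid.values())
        mad = statistics.median(abs(r - center) for r in resid.values())
        if mad == 0:
            continue  # no spread to compare against; skip this gene
        for sample, r in resid.items():
            if abs(r - center) / (1.4826 * mad) > z_cutoff:
                counts[sample] += 1
    return counts


# Hypothetical toy data: sample s5 is off in two of three genes.
groups = {"s1": "WT", "s2": "WT", "s3": "WT", "s4": "Q175", "s5": "Q175", "s6": "Q175"}
expr = {
    "geneA": {"s1": 5.0, "s2": 5.1, "s3": 4.9, "s4": 7.0, "s5": 7.1, "s6": 6.9},
    "geneB": {"s1": 3.0, "s2": 3.2, "s3": 2.8, "s4": 3.1, "s5": 9.0, "s6": 2.9},
    "geneC": {"s1": 1.0, "s2": 1.1, "s3": 0.9, "s4": 2.0, "s5": 6.0, "s6": 2.1},
}
outlier_counts = count_gene_outliers(expr, groups)
print(outlier_counts)
```

Samples with an unusually high count relative to the rest would be the ones that "stand out" in the sense of this slide.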

Integrating the sample QC to choose omissions 19
A very simple way to determine what to throw out is to look for multiple strikes against a sample
5’ -> 3’ charts
454_Liver_Q175_HET_M_L3._1.clipped
456_Liver_Q175_HET_M_L7.LB4_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
845_Liver_Q92_HET_F_L6.LB25_1.clipped
522_Liver_Q111_HET_F_L8.LB13_1.clipped
452_Liver_Q175_HET_M_L1.LB2_1.clipped
776_Liver_Q140_HET_F_L8.LB13_1.clipped
716_Liver_Q80_HET_F_L7.LB6_1.clipped
843_Liver_Q92_HET_F_L6.LB23_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
PCA outliers
30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
450_Liver_Q175_HET_M_L8.LB1_1.clipped
Model based outliers
30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped

Final list of proposed samples for omission
_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
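The "multiple strikes" rule can be written as a simple tally over the three flag lists from slide 19. The sketch below assumes that "multiple" means at least two of the three QC methods, which the deck does not state explicitly. The variable names are hypothetical; the sample names are the ones listed on slide 19.

```python
from collections import Counter

# Samples flagged by each QC method, as listed on slide 19.
degradation = [
    "454_Liver_Q175_HET_M_L3._1.clipped",
    "456_Liver_Q175_HET_M_L7.LB4_1.clipped",
    "20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped",
    "845_Liver_Q92_HET_F_L6.LB25_1.clipped",
    "522_Liver_Q111_HET_F_L8.LB13_1.clipped",
    "452_Liver_Q175_HET_M_L1.LB2_1.clipped",
    "776_Liver_Q140_HET_F_L8.LB13_1.clipped",
    "716_Liver_Q80_HET_F_L7.LB6_1.clipped",
    "843_Liver_Q92_HET_F_L6.LB23_1.clipped",
    "21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped",
]
pca_outliers = [
    "30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped",
    "35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped",
    "20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped",
    "21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped",
    "450_Liver_Q175_HET_M_L8.LB1_1.clipped",
]
model_outliers = [
    "30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped",
    "35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped",
]

# One strike per method that flagged the sample.
strikes = Counter()
for flagged in (degradation, pca_outliers, model_outliers):
    strikes.update(set(flagged))

# Assumption: "multiple strikes" means two or more methods agree.
MIN_STRIKES = 2
omit = sorted(sample for sample, n in strikes.items() if n >= MIN_STRIKES)
for sample in omit:
    print(strikes[sample], sample)
```

With these lists, a threshold of two selects four samples: the two striatum samples flagged by both PCA and the model-based method, and the two cortex Q175 samples flagged by both the degradation charts and PCA.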

Liver contamination in cortex Q140 and Q92? 21
While the sequencing lab was looking into the 75mer issue, I ran cortex through some basic statistical modeling (omitting the samples mentioned previously)
I found changes, but the pattern and biology were all wrong

22
Every single change is an increase
Completely off in Q175, 111, 80, and 20
On (but not that strongly) in Q140 and Q92
It makes no sense for Q111 to be skipped and for Q175 to go back to normal
Albumin is the top hit? Isn’t albumin liver specific?
Chart axis label: Logged FPKM

Some of the other changed genes are suspicious 23
Albumin
ApoC3, C1
Mup3, 10, 18, 19
FABP1
Urate oxidase
All reasonably solid liver markers
DAVID functional annotation also suggests the altered genes are liver related (p < )

24 A subset of genes with good correlation between liver and cortex but shifted from the 1:1 axis

25 Same chart with the “significant” genes in red

26 Same chart and shading in Q111; notice the lack of linear correlation

What we suspect happened 27
The basic problem is that liver specific transcripts should not have correlated expression in cortex
A very small amount of liver contamination has occurred. The shift is 500 to 1000 times lower than normal liver expression
What this means is that only the very highest expressed liver genes are detected at all
The challenge is uniquely identifying the affected genes
Table (values not recoverable): FPKMs of albumin in cortex, striatum, and liver; albumin should not exist in brain

28
Liver filter created as
–Liver mean count > 2000
–Mean ratio of liver to cortex > 500
–Cortex count > 0
Not a bad first approximation
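The three filter conditions translate directly into a per-gene test. The sketch below is an approximation under stated assumptions: it uses the ratio of mean counts as a stand-in for the "mean ratio", and the gene values are invented for illustration. The function name `liver_contamination_genes` is hypothetical.

```python
def liver_contamination_genes(liver_mean, cortex_mean, min_liver=2000, min_ratio=500):
    """Return genes matching the liver filter from slide 28.

    liver_mean / cortex_mean: {gene: mean count} for each tissue.
    Assumption: the liver-to-cortex ratio is taken as a ratio of means.
    """
    flagged = []
    for gene, liver in liver_mean.items():
        cortex = cortex_mean.get(gene, 0)
        if cortex > 0 and liver > min_liver and liver / cortex > min_ratio:
            flagged.append(gene)
    return flagged


# Hypothetical mean counts, chosen only to exercise each condition.
liver_mean = {"Alb": 50000, "Apoc3": 8000, "Actb": 3000, "Snap25": 1, "Mup3": 20000}
cortex_mean = {"Alb": 60, "Apoc3": 10, "Actb": 3500, "Snap25": 4000, "Mup3": 0}

genes_to_remove = liver_contamination_genes(liver_mean, cortex_mean)
print(genes_to_remove)
```

In this toy data only the genes that are very high in liver and barely present in cortex are flagged. A gene with zero cortex counts is not flagged, since there is nothing to remove from the cortex data.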

Effect of filtering out the liver specific genes from the cortex data 29

Summary of QC 30
All but 4 of the 192 samples can move forward to the analysis
A filter to clear out highly expressed liver genes is needed for the cortex Q140 and Q92 sets
Striatum PCA plots show that CAG length is the single largest global source of variance!