Quality Control of Illumina Data
Mick Watson, Director of ARK-Genomics, The Roslin Institute

QUALITY SCORES

Quality scores
The sequencer outputs a base call at each position of a read
It also outputs a quality value at each position
– This relates to the probability that the base call is incorrect
The most common quality value is the Sanger Q score, or Phred score
– Q_sanger = -10 * log10(p)
– Where p is the probability that the call is incorrect
– If p = 0.05, there is a 5% chance, or 1 in 20 chance, it is incorrect
– If p = 0.01, there is a 1% chance, or 1 in 100 chance, it is incorrect
– If p = 0.001, there is a 0.1% chance, or 1 in 1000 chance, it is incorrect
Using the equation:
– p = 0.05, Q_sanger = 13 (rounded from 13.01)
– p = 0.01, Q_sanger = 20
– p = 0.001, Q_sanger = 30

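For the geeks…
In R, you can investigate this:
    sangerq <- function(x) {return(-10 * log10(x))}
    sangerq(0.05)
    sangerq(0.01)
    sangerq(0.001)
    # The step size is missing from the transcript; 0.001 is an assumed value
    step <- 0.001
    # At p = 0 the score is infinite, so that point is not drawn
    plot(seq(0,1,by=step), sangerq(seq(0,1,by=step)), type="l")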

The plot

For the geeks…
And the other way round:
    qtop <- function(x) {return(10^(x/-10))}
    qtop(30)
    qtop(20)
    qtop(13)
    plot(seq(40,1,by=-1), qtop(seq(40,1,by=-1)), type="l")

The important stuff
Q30 – 1 in 1000 chance the base is incorrect
Q20 – 1 in 100 chance the base is incorrect
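To put Q20 and Q30 in context, the sketch below tabulates a few Q scores against their error probabilities. The names `q_to_p` and `error_table` are illustrative, and Q10 and Q40 are added only for comparison; the values follow directly from the Phred formula.

```r
# Convert a Phred quality score to the probability that the base call is wrong
q_to_p <- function(q) 10^(-q / 10)

q <- c(10, 20, 30, 40)
error_table <- data.frame(
  Q = q,
  p_error = q_to_p(q),                      # probability the call is incorrect
  one_in = 1 / q_to_p(q),                   # "1 in N" chance of an error
  accuracy_percent = 100 * (1 - q_to_p(q))  # chance the call is correct
)
print(error_table)
```

Each step of 10 in Q makes an error ten times less likely.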

QUALITY ENCODING

Quality Encoding
Bioinformaticians do not like to make your life easy!
Q scores of 20, 30 etc. take two digits
Bioinformaticians would prefer they took only one character
In computers, every letter has a corresponding ASCII code
Therefore, to save space, we convert each Q score (two digits) to the single character with the corresponding ASCII code
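The ASCII lookup can be explored in base R with `utf8ToInt()` and `intToUtf8()`. This is a minimal sketch: it assumes the offset of 33 used in the encoding process that follows, and `ascii_table` is an illustrative name.

```r
# utf8ToInt() gives the ASCII code of a character; intToUtf8() goes the other way
utf8ToInt("A")   # 65
intToUtf8(65)    # "A"

# Characters for Q scores 0 to 40, assuming an offset of 33
q <- 0:40
ascii_table <- data.frame(
  Q = q,
  ascii = q + 33,
  char = intToUtf8(q + 33, multiple = TRUE),  # one character per code
  stringsAsFactors = FALSE
)
head(ascii_table)
```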

The process in full
p (probability base is wrong) : 0.001
Q (-10 * log10(p)) : 30
Add 33 : 63
Encode as character : ?

P        Q     Code
0.001    30    ?
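The whole process can be sketched as a single R function. The name `p_to_code` is hypothetical, and rounding Q to a whole number before encoding is an assumption made here so that every probability maps to exactly one character.

```r
# Encode an error probability as a single quality character (offset 33)
p_to_code <- function(p, offset = 33) {
  q <- round(-10 * log10(p))              # Phred score, rounded (assumption)
  intToUtf8(q + offset, multiple = TRUE)  # ASCII code to character
}

p_to_code(0.001)                 # "?"
p_to_code(c(0.05, 0.01, 0.001))  # one character per probability
```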

For the geeks…
    code2Q <- function(x) { return(utf8ToInt(x)-33) }
    code2Q(".")
    code2Q("5")
    code2Q("?")
    code2P <- function(x) { return(10^((utf8ToInt(x)-33)/-10)) }
    code2P(".")
    code2P("5")
    code2P("?")
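Because `utf8ToInt()` returns one code per character, the same idea extends to the whole quality string of a read. A minimal sketch: `decode_quality` is a hypothetical helper and the quality string is made up for illustration.

```r
# Decode a whole quality string (offset 33) into Q scores and error probabilities
decode_quality <- function(qual, offset = 33) {
  q <- utf8ToInt(qual) - offset
  data.frame(position = seq_along(q), Q = q, p_error = 10^(-q / 10))
}

qual <- "II?5."   # made-up quality string, not real data
decoded <- decode_quality(qual)
print(decoded)
```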

QC OF ILLUMINA DATA

FastQC
FastQC is a free piece of software
Written by the Babraham Bioinformatics group
Available on Linux, Windows etc.
Command-line or GUI

Read the documentation
Follow the course notes

Per base sequence quality
One of the most important plots from FastQC
Plots a box at each position
The box shows the distribution of quality values at that position across all reads
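The idea behind the box-per-position plot can be reproduced on a small scale in base R. This is a sketch on made-up quality strings, assuming an offset of 33. It is not how FastQC itself is implemented, and `quals` and `qmat` are illustrative names.

```r
# Toy quality strings (made up for illustration), one per read, all the same length
quals <- c("IIIIIIII??55", "IIIIII??55..", "IIIIIIIII?55", "IIIII???55..")

# One row per read, one column per position, holding Q scores (offset 33)
qmat <- t(sapply(quals, function(s) utf8ToInt(s) - 33, USE.NAMES = FALSE))

# Median quality at each position across all reads
apply(qmat, 2, median)

# One box per read position, showing the spread of Q scores at that position
boxplot(qmat, xlab = "Position in read", ylab = "Q score")
```

In this toy data the boxes drop towards the end of the read, which is the kind of pattern the plot makes visible.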

Obvious problems

Less obvious problems

Really bad problems

Other useful plots
Per base N content
– May identify cycles that are unreliable
Overrepresented sequences
– May identify Illumina adapters and primers
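Both checks can be illustrated on a handful of reads in base R. This is a rough sketch on made-up sequences, not FastQC's own method, and `reads`, `bases` and `n_content` are illustrative names.

```r
# Toy reads (made up for illustration), all the same length
reads <- c("ACGTNACGTA", "ACGTNACGTT", "ACGTAACGTN", "ACGTNACGTA")

# One row per read, one column per position
bases <- do.call(rbind, strsplit(reads, ""))

# Percentage of N calls at each position; a spike suggests an unreliable cycle
n_content <- 100 * colMeans(bases == "N")
n_content

# Count identical reads; heavily duplicated sequences may point to adapters or primers
sort(table(reads), decreasing = TRUE)
```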