National Center for Genome Analysis Support: Carrie Ganote, Ram Podicheti, Le-Shin Wu, Tom Doak
Quality Control and Assessment of RNA-Seq Data
What do the data look like?
6046 length=76
GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACTCTTGTCGTAGCCGCGTCCATTGCGCCCT
+SRR length=76
FASTQ is a common format for storing next-generation sequencing data. It is text based and stores both the sequence and the quality information. It was originally developed at the Wellcome Trust Sanger Institute and later adopted by Solexa (Bennett, 2004). The information for each read comprises 4 lines.
Bennett, S. (2004). Solexa Ltd. Pharmacogenomics, 5(4), doi: /
FASTQ Format, line 1: Sequence Identifier
@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+
The identifier begins with the '@' symbol and comprises: instrument name, flowcell, lane, tile, X and Y coordinates of the cluster on the tile, member of a pair (1 or 2), and index.
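The identifier layout described above can be taken apart with a few string splits. This is a minimal sketch, not an official parser: it assumes the colon-separated Illumina layout seen in the example, and the names for the fields the slide does not list (run number, filter flag, control number) are assumptions based on that convention. The function name `parse_identifier` is hypothetical.

```python
def parse_identifier(line: str) -> dict[str, object]:
    """Split an Illumina-style FASTQ identifier line into named fields."""
    if not line.startswith("@"):
        raise ValueError("identifier line must begin with '@'")
    # The part before the space locates the cluster; the part after
    # it describes the read itself.
    name, _, description = line[1:].partition(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    pair_member, filtered, control, index = description.split(":")
    return {
        "instrument": instrument,
        "run": int(run),              # assumed field, not listed on the slide
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "pair_member": int(pair_member),
        "filtered": filtered,         # assumed field, not listed on the slide
        "control": control,           # assumed field, not listed on the slide
        "index": index,
    }


fields = parse_identifier("@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG")
print(fields["instrument"], fields["flowcell"], fields["lane"], fields["index"])
```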
FASTQ Format, line 2: Read Sequence
@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+
The read sequence uses the characters A, G, T, C and N.
FASTQ Format, line 3: '+' character
@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+
The '+' character can be followed by the same sequence identifier as in line 1.
FASTQ Format, line 4: Base Quality Scores
@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+
Base quality scores (Phred33) for the sequence in line 2. This line must contain the same number of characters as the sequence.
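The four-line layout can be checked mechanically. The sketch below assumes strict four-line records (no wrapped sequences); `FastqRecord` and `parse_fastq` are hypothetical names. The example read is shortened to 8 bases and its quality string is invented purely for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four non-empty lines, validating each."""
    cleaned = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per read")
    for i in range(0, len(cleaned), 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"line 1 must begin with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"line 3 must begin with '+': {plus!r}")
        if len(seq) != len(qual):
            raise ValueError("quality line and sequence differ in length")
        yield FastqRecord(header[1:], seq, qual)


example = [
    "@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG",
    "CGTTCAGT",   # read shortened for the example
    "+",
    "IIIIHHGF",   # invented quality string, for illustration only
]
records = list(parse_fastq(example))
print(records[0].sequence, records[0].quality)
```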
Quality Scores
Sequencers can assign a "confidence" value to each base call, based on how ambiguous the call is. The sequencer estimates the probability that a given base call is NOT correct (Ewing 1998).
Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194. doi: /gr PMID
Quality Scores
The PHRED score is defined as $Q = -10 \log_{10}(P)$ (Ewing 1998), where $P$ is the probability that the base call is not correct.
Table columns: $P$; $-10 \log_{10}(P)$; estimated accuracy $= 1 - P$.
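The relationship between the error probability, the PHRED score and the estimated accuracy can be tabulated directly from the definition. A small sketch; the function names are hypothetical and the chosen scores (10 to 40) are just convenient examples.

```python
import math


def phred_from_error(p: float) -> float:
    """Q = -10 * log10(P), where P is the probability the call is wrong."""
    if not 0.0 < p <= 1.0:
        raise ValueError("P must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def error_from_phred(q: float) -> float:
    """Invert the definition: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# Rebuild the three table columns for a few example scores.
for q in (10, 20, 30, 40):
    p = error_from_phred(q)
    print(f"Q={q:>2}  P={p:.4f}  estimated accuracy={1 - p:.2%}")
```

A score of 20, the threshold used later for trimming, corresponds to a 1 in 100 chance that the base call is wrong.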
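Quality Score
Why not just have numbers? A numeric score takes one or two digits and needs separators, so the quality line would no longer line up base for base with the read.
1:N:0:ACAGTG CGTTCAGT… …
Quality symbols to the rescue: each score is written as a single character.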
Quality Score Encodings
Letters are represented deep down in the computer as numbers. The quality score plus a constant offset (usually 33 or 64) gives a number, and that number is converted to the quality symbol using the ASCII table.
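The offset arithmetic described above is a one-liner in each direction. A minimal sketch with hypothetical function names; the offset defaults to 33 (Phred33) and can be set to 64 for the other common encoding.

```python
def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Turn quality symbols into scores: ASCII code minus the offset."""
    scores = [ord(symbol) - offset for symbol in quality]
    if any(score < 0 for score in scores):
        raise ValueError("negative score: the offset is probably wrong")
    return scores


def encode_quality(scores: list[int], offset: int = 33) -> str:
    """Turn scores into quality symbols: score plus the offset, as ASCII."""
    return "".join(chr(score + offset) for score in scores)


print(decode_quality("I5!"))           # symbols -> scores
print(encode_quality([40, 20, 0]))     # scores -> symbols
print(decode_quality("h", offset=64))  # same score under the 64 offset
```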
ASCII Table
Quality Scores
FastQC is an excellent program for visualizing the overall quality of all reads in a FASTQ file.
FastQC is developed by the Babraham Bioinformatics Group.
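To give a feel for the kind of summary such a report is built on, the sketch below computes the mean quality at each read position. It is not FastQC's implementation, only an illustration; it assumes Phred33-encoded quality strings, and `mean_quality_by_position` is a hypothetical name.

```python
def mean_quality_by_position(quality_strings: list[str], offset: int = 33) -> list[float]:
    """Average the decoded scores column by column across all reads."""
    totals: list[int] = []
    counts: list[int] = []
    for quality in quality_strings:
        for position, symbol in enumerate(quality):
            if position == len(totals):   # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[position] += ord(symbol) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


# Two short reads of different lengths; 'I' decodes to 40 and '5' to 20.
print(mean_quality_by_position(["II5", "I5"]))
```

A falling mean toward the end of the reads is the usual cue that trimming is worth considering.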
Trimming Based on Quality
Tactics for increasing overall quality: we want to cut away the low-quality bases!
Trimming Based on Quality
Wholesale cutting by base position.
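Cutting by position ignores the quality values entirely and removes a fixed number of bases from each end of every read. A minimal sketch; `clip_by_position` and its `head`/`tail` parameters are hypothetical names.

```python
def clip_by_position(sequence: str, quality: str, head: int = 0, tail: int = 0) -> tuple[str, str]:
    """Drop `head` bases from the start and `tail` bases from the end."""
    if head < 0 or tail < 0:
        raise ValueError("head and tail must be non-negative")
    end = len(sequence) - tail
    if head >= end:                # nothing would be left
        return "", ""
    # Sequence and quality are always cut together so they stay aligned.
    return sequence[head:end], quality[head:end]


print(clip_by_position("ACGTACGTAC", "IIIIIIII55", head=2, tail=2))
```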
Trimming Based on Quality
Start from the ends of the read and cut bases away until the quality is above a specified threshold (usually 20).
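A sketch of this end-trimming strategy, assuming the scores have already been decoded to integers. The name `trim_ends` is hypothetical, and the sketch treats a score equal to the threshold as acceptable. Low-quality bases in the interior of the read are deliberately left alone.

```python
def trim_ends(sequence: str, scores: list[int], threshold: int = 20) -> tuple[str, list[int]]:
    """Remove bases from both ends until each end meets the threshold."""
    start, end = 0, len(scores)
    while start < end and scores[start] < threshold:
        start += 1                 # walk in from the 5' end
    while end > start and scores[end - 1] < threshold:
        end -= 1                   # walk in from the 3' end
    return sequence[start:end], scores[start:end]


# The score of 12 in the middle survives: only the ends are examined.
print(trim_ends("ACGTACGT", [3, 6, 22, 30, 12, 36, 22, 2]))
```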
Trimming Based on Quality
Start from one end and keep bases until the quality falls below a specified threshold.
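This variant is stricter than trimming the ends only: the read is cut at the first low-quality base, even if good bases follow it. A minimal sketch that scans from the start of the read; `trim_at_first_low` is a hypothetical name and the default threshold of 20 is an assumption carried over from the previous strategy.

```python
def trim_at_first_low(sequence: str, scores: list[int], threshold: int = 20) -> tuple[str, list[int]]:
    """Keep bases from the start up to, but not including, the first low score."""
    for position, score in enumerate(scores):
        if score < threshold:
            return sequence[:position], scores[:position]
    return sequence, scores        # no base fell below the threshold


# The good bases after the score of 12 are discarded as well.
print(trim_at_first_low("ACGTACGT", [36, 30, 22, 12, 36, 22, 2, 3]))
```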
Trimming Based on Quality
Sliding windows and minimum vs. average quality scores.
Window statistics: average, minimum, maximum (25 in the example). Target: average below 20.
Trimming Based on Quality
Sliding windows and minimum vs. average quality scores.
Window statistics: average, minimum, maximum. Target: average below 20. Step size = 5, window size = 6.
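The sliding-window idea can be sketched with the parameters from the slides (window size 6, step size 5, cut when the window average drops below 20). This is one possible interpretation, not the algorithm of any particular tool: it cuts the read at the start of the first failing window, and trailing bases that do not fill a whole window are left unchecked. `sliding_window_trim` is a hypothetical name.

```python
def sliding_window_trim(
    sequence: str,
    scores: list[int],
    window: int = 6,
    step: int = 5,
    min_average: float = 20.0,
) -> tuple[str, list[int]]:
    """Cut the read where the first window with a low average begins."""
    if not scores:
        return sequence, scores
    # Reads shorter than one window are judged as a single window.
    last_start = max(len(scores) - window, 0)
    for start in range(0, last_start + 1, step):
        current = scores[start:start + window]
        if sum(current) / len(current) < min_average:
            return sequence[:start], scores[:start]
    return sequence, scores


# Ten good bases followed by six poor ones: windows start at 0, 5 and 10.
read_scores = [30] * 10 + [10] * 6
print(sliding_window_trim("A" * 16, read_scores))
```

Using the average makes the trimmer tolerant of a single poor base inside a good window; testing the window minimum instead would be the more aggressive choice.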
Trimming Based on Quality
Mate pairs, orphans and minimum sequence length.
Example: the right read is too short to keep, while the left read survives trimming and becomes an orphan.
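After trimming, pairs have to be sorted into those where both mates survive and those where only one does. A minimal sketch; `split_pairs` is a hypothetical name and the minimum length of 36 is an arbitrary value chosen for the example, not a recommendation from the slides.

```python
def split_pairs(
    pairs: list[tuple[str, str]], min_length: int = 36
) -> tuple[list[tuple[str, str]], list[str]]:
    """Separate intact pairs from orphans after trimming."""
    kept_pairs: list[tuple[str, str]] = []
    orphans: list[str] = []
    for left, right in pairs:
        left_ok = len(left) >= min_length
        right_ok = len(right) >= min_length
        if left_ok and right_ok:
            kept_pairs.append((left, right))
        elif left_ok:
            orphans.append(left)       # right mate was too short
        elif right_ok:
            orphans.append(right)      # left mate was too short
        # if both mates are too short, the pair is dropped entirely
    return kept_pairs, orphans


trimmed = [("A" * 50, "C" * 48), ("G" * 50, "T" * 10), ("A" * 5, "C" * 5)]
pairs_out, orphans_out = split_pairs(trimmed)
print(len(pairs_out), "pairs kept;", len(orphans_out), "orphan")
```

Keeping orphans in a separate file matters because many downstream tools expect the two paired files to list mates in the same order.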
Trimming Software
Trimmomatic: Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
Trim Galore!: developed by the Babraham Bioinformatics Group.
FASTX Toolkit
Galaxy trimming tools:
Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol Aug 25;11(8):R86.
Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology Jan; Chapter 19:Unit
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research Oct; 15(10):
Kmers
What's a kmer? For a given sequence and a number K, a kmer is a subsequence of length K. How many subsequences of length K are there?
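A sequence of length L contains L - K + 1 overlapping windows of length K, and counting how often each distinct kmer occurs takes only a few lines. A minimal sketch with the hypothetical name `count_kmers`.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping subsequence of length k."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty range and an empty Counter.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


kmers = count_kmers("ACGTACGT", 5)
print(sum(kmers.values()), "kmers in total:", dict(kmers))
```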
Kmers
Why? (K = 5)
Primers and Adapters
When a fragment is shorter than the total length of the read, the sequencer reads past the end of the fragment, so adapters will be sequenced on both mates of a paired-end read. For example, if we use technology that can sequence up to 100 bp:
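When read-through happens, the adapter shows up after the fragment, either in full or cut off by the end of the read. The sketch below handles both cases with exact string matching only; real adapter trimmers also tolerate sequencing errors. `clip_adapter` and `min_overlap` are hypothetical names, and the adapter string is just an example value that should be replaced with the adapter used in your own library prep.

```python
def clip_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a full adapter, or a partial adapter at the 3' end, from a read."""
    position = read.find(adapter)
    if position != -1:
        return read[:position]     # full adapter found: keep what precedes it
    # Otherwise look for the longest adapter prefix that ends the read.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read


ADAPTER = "AGATCGGAAGAGC"          # example adapter sequence (assumption)
print(clip_adapter("ACGTACGTAC" + ADAPTER, ADAPTER))   # full adapter
print(clip_adapter("ACGTACGTAC" + "AGATC", ADAPTER))   # adapter cut off by read end
print(clip_adapter("ACGTACGTAC", ADAPTER))             # no adapter present
```

The `min_overlap` guard is a trade-off: a lower value removes more adapter remnants but also clips more genuine bases that match the adapter start by chance.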
Primers and Adapters
When to suspect this: patterns toward the ends of reads.
Primers and Adapters
Software for removing adapters:
Cutadapt: Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. 2011, 17(1). doi: /ej pp
FASTX-Toolkit
Scythe
Poly-A Tails and Other Artifacts
Library prep: poly-As/poly-Ts are retained and sequenced.
When to suspect this:
Poly-A Tails and Other Artifacts
PRINSEQ (Schmieder 2011) for trimming poly-Ts: it takes the percentage of the read that is made up of Ts and sorts such reads out. Conservatively: if 60% of a read is T, kick it out. It can filter on % base, sequence complexity and duplicates.
Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27: [PMID: ]
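The 60% rule of thumb is easy to express as a filter. This sketch only illustrates the idea and is not PRINSEQ's implementation; the function names are hypothetical, and a read with exactly 60% T is treated as failing.

```python
def base_fraction(sequence: str, base: str) -> float:
    """Fraction of the read made up of the given base (case-insensitive)."""
    if not sequence:
        return 0.0
    return sequence.upper().count(base.upper()) / len(sequence)


def filter_by_base_fraction(
    reads: list[str], base: str = "T", max_fraction: float = 0.6
) -> tuple[list[str], list[str]]:
    """Split reads into kept and removed according to their base content."""
    kept: list[str] = []
    removed: list[str] = []
    for read in reads:
        if base_fraction(read, base) >= max_fraction:
            removed.append(read)
        else:
            kept.append(read)
    return kept, removed


kept_reads, removed_reads = filter_by_base_fraction(["TTTTTTTTAC", "ACGTACGTAC"])
print("kept:", kept_reads, "removed:", removed_reads)
```

Calling the same function with `base="A"` applies the equivalent check for poly-A tails.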
Conservative QC vs Aggressive QC: factors
How much sequence one can afford to cut out depends on the following:
Coverage: if your sample was sequenced at very low coverage, you may not want to cut aggressively.
Sequence length: you can afford to cut 20 bp out of a 150 bp read, but not out of a 30 bp read.
Goals: depending on your end goal, cut more or less aggressively.
References
Bennett, S. (2004). Solexa Ltd. Pharmacogenomics, 5(4), doi: /
Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology Jan; Chapter 19:Unit
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194. doi: /gr PMID
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research Oct; 15(10):
Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol Aug 25;11(8):R86.
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. 2011, 17(1). doi: /ej pp
Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27: [PMID: ]
Fin
Thanks for watching!
Questions and comments: