Experimental Design and Quality Control

1 Experimental Design and Quality Control
Introduction to NGS: Experimental Design and Quality Control. Antonio Rueda

2 NGS technologies (01/30/12)

3 Library preparation

4 Amplification

5 Sequencing (02/04/12)
The identity of each base of a cluster is read off from sequential images. From Michael Metzker,

6 Concepts
Reads: Each of the sequences read by the sequencer. Each position of a read is read in one sequencing cycle. For each position, a base and a quality value are reported. Each read has a unique identifier.

7 Concepts
Base quality (sequencing quality): All sequencers make errors during the sequencing process, so a quality value is reported for each sequenced base.

8 Concepts
Mapping: Consists of placing each read at its position in the reference genome. Although other formats exist, SAM/BAM is the most widely used.

9 Concepts
Coverage: The number of times each position is read. When analyzing coverage, keep two important points in mind:
The amount of coverage. We measure it with the mean coverage. It must be high enough to give confidence in the sequenced region.
The uniformity of coverage. Coverage should be uniform across all the regions we want to sequence. We quantify it with the % of positions covered and the standard deviation.
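
The three coverage metrics named on this slide (mean coverage, % of positions covered, standard deviation) can be sketched as follows. This is a minimal illustration, not the method used by any particular tool: the function name `coverage_summary`, the `min_depth` cutoff and the example depths are assumptions made for the example.

```python
from statistics import mean, pstdev

def coverage_summary(depths, min_depth=1):
    """Summarise per-position read depths over the region of interest.

    depths: one integer per reference position (hypothetical input).
    min_depth: depth at which a position counts as covered (assumed cutoff).
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "mean_coverage": mean(depths),
        "pct_covered": 100.0 * covered / len(depths),
        "std_dev": pstdev(depths),  # population standard deviation
    }

# Hypothetical depths for five positions; one position was never read.
summary = coverage_summary([10, 12, 0, 8, 10])
print(summary)
```

A high mean with a large standard deviation or a low percentage of covered positions points to uneven coverage, which is why the mean alone is not enough.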

10 Concepts
Variant calling (variant detection): Consists of searching for differences between the reads and the reference genome. (Figure label: Reference.)

11 Experimental Design
Resequencing; Mutation calling; Profiling; Genome annotation
RNA-seq / Transcriptomics; Quantitative; Descriptive; Alternative splicing; miRNA profiling
De novo sequencing; Exome sequencing; Targeted sequencing
ChIP-seq / Epigenomics; Protein-DNA interactions; Active transcription factor binding sites; Histone methylation
Copy number variation; Metagenomics; Metatranscriptomics

12 DNA sequencing - 1
Whole genome resequencing. Needs a reference genome. Variation discovery.

13 DNA sequencing - 2
Whole genome "de novo" sequencing. For uncharacterized genomes with no reference genome available, and for known genomes where significant structural variation is expected. Uses long reads or mate-pair libraries. Sequencing is mostly done with Roche 454 and also Illumina. Assembly of reads is needed, which is computationally intensive. E.g. bacterial genome sequencing.

14 DNA sequencing - 3
Whole exome resequencing. Needs a reference genome. Available for human and mouse. Variation discovery on ORFs. 2% of the human genome (lower cost). 85% of disease mutations are in the exome. Needs probes complementary to exons (Nimblegen, Agilent). E.g. human exome.

15 DNA sequencing - 4
Targeted resequencing. Capture of specific regions in the genome. Custom gene panel sequencing: allows covering a high number of genes related to a disease (e.g. a disease gene panel). Lower cost and quicker than capillary sequencing. Multiplexing is possible. Needs custom probes complementary to the genomic regions (Nimblegen, Agilent).

16 Transcriptomics - 1: RNA-Seq (Introduction to NGS Technologies, 2 Oct, 2013)
Sequencing of mRNA. rRNA-depleted samples. Very high dynamic range. No prior knowledge of expressed genes is needed. Gives information (richer than microarrays) about: differential expression of known or unknown transcripts during a treatment or condition; isoforms; new alternative splicing events; non-coding RNAs; post-transcriptional mutations or editing; gene fusions.

17 Transcriptomics - 1: RNA-Seq
Sequencing of mRNA. Detecting gene fusions.

18 Common Problems
Signal errors: intensity signal errors (454, Ion Torrent).

19 Common Problems
Diploid and polyploid genomes: is a mismatch a sequencing error or a heterozygous variant? Use coverage to decide.
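
One way to see why coverage helps here is to compare how well two simple binomial models explain the number of reads carrying the alternate base: a heterozygous site (each read shows the alternate base with probability 0.5) versus pure sequencing error. The sketch below is an illustration under those assumptions only; the 1% error rate and the function names are hypothetical, and real variant callers use richer models that also weigh base and mapping qualities.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def het_vs_error_ratio(alt_reads, depth, error_rate=0.01):
    """Likelihood of the data under 'heterozygous' divided by 'error only'.

    error_rate is an assumed per-base error probability.
    Values well above 1 favour a real heterozygous variant.
    """
    het = binom_pmf(alt_reads, depth, 0.5)
    err = binom_pmf(alt_reads, depth, error_rate)
    return het / err

low_depth = het_vs_error_ratio(1, 2)     # half the reads, but only 2 reads
high_depth = het_vs_error_ratio(10, 20)  # half the reads, 20 reads
lone_read = het_vs_error_ratio(1, 40)    # a single odd read among 40
print(low_depth, high_depth, lone_read)
```

With the same 50% alternate fraction, the evidence is far stronger at depth 20 than at depth 2, and a single mismatching read among 40 is better explained by error.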

20 Common Problems
Polymerase errors (all platforms, excluding SOLiD). Ligase errors (SOLiD). Mapping errors. Variant calling errors. Human error. Maybe quality control is needed!

21 Comparison
Roche 454: long fragments; low throughput; expensive; poly-nucleotide (homopolymer) errors. Applications: de novo sequencing, amplicon sequencing, metagenomics, RNA-Seq.
Illumina: short fragments; high throughput; cheap; GC bias. Applications: resequencing, de novo sequencing, ChIP-Seq, RNA-Seq, Methyl-Seq.
SOLiD: short fragments; high throughput; cheap; color-space. Applications: resequencing, ChIP-Seq, RNA-Seq, Methyl-Seq.

22 Common to all NGS platforms: pipeline and LOTS of data
DNA sample -> (library preparation) -> NGS instrument -> (sequencing) -> data -> (data analysis).
NGS is relatively cheap, but think carefully about the question you want to answer, because the analysis will not work magic.

23 Basic steps of NGS data processing
QC and read cleaning

24 Basic steps of NGS data processing
QC and read cleaning; Mapping

25 Basic steps of NGS data processing
QC and read cleaning; Mapping; DNA binding site

26 Hands-on
Download software: FastQC, Qualimap. Download files (course web page): Fastq file, Bam file.

27 Quality Control: Raw Data
Number of input reads. Base quality. Read quality. Biases. Software: FastQC

28 Where are we?
Sequence processing; Mapping; Variant calling; Transcript quantification; Variant annotation; DE analysis; miRNA prediction

29 Fastq format
Fastq format "is a fasta with qualities":
Header line (like fasta but starting with "@")
Sequence (string of nucleotides)
"+" and sequence ID (optional)
Quality values of the sequence, each encoded as a single-byte ASCII code
File extension: .fastq

Example record:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Sequence quality encoding
Base quality must be encoded in just 1 byte! Each base has a corresponding quality value: the quality at position n belongs to the base at position n.
Encoding procedure: error probability -> Phred transformation (inverse integer value) -> ASCII encoding
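
The four-line record layout described on this slide can be read with a few lines of code. The sketch below assumes the simple case where each record occupies exactly four lines (no wrapped sequences); the function name `read_fastq` is made up for this example, and the record is the one shown on the slide.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(seq) != len(qual):
            # quality at position n must belong to the base at position n
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual

example = io.StringIO(
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(example))
print(records[0][0], len(records[0][1]))
```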

30 Fastq format. Sequence quality encoding
Encoding procedure: error probability -> Phred transformation (inverse integer value) -> ASCII encoding

Prob. of incorrect base call   Phred quality score   Base call accuracy
1 in 10                        10                    90%
1 in 100                       20                    99%
1 in 1000                      30                    99.9%
1 in 10000                     40                    99.99%
1 in 100000                    50                    99.999%

Phred + 33: Sanger [0,40], Illumina 1.8 [0,41], Illumina 1.9 [0,41]
Phred + 64: Illumina 1.3 [0,40], Illumina 1.5 [3,40]
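
The table follows from the standard Phred relation Q = -10 · log10(P), where P is the probability of an incorrect base call, and the ASCII step adds an offset of 33 or 64 to Q. A small sketch of both steps follows; the function names are chosen for this example only.

```python
import math

def phred_from_error_prob(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def error_prob_from_phred(q):
    """Error probability for a Phred score q."""
    return 10 ** (-q / 10)

def decode_quality(qual, offset=33):
    """Turn a FASTQ quality string into a list of integer Phred scores."""
    return [ord(char) - offset for char in qual]

def encode_quality(scores, offset=33):
    """Turn integer Phred scores into a FASTQ quality string."""
    return "".join(chr(score + offset) for score in scores)

print(round(phred_from_error_prob(1 / 1000)))  # 1 error in 1000 calls
print(decode_quality("!5I"))                   # Phred+33 characters
print(decode_quality("h", offset=64))          # Phred+64 character
```

Decoding a Phred+64 file with offset 33 inflates every score by 31, so the encoding has to be identified before any quality-based filtering.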

31 Per base sequence quality
Good data: consistent, high quality along the read.
Shows an overview of the range of quality values across all bases at each position in the fastq file. The central red line is the median value. The yellow box represents the inter-quartile range (25-75%). The upper and lower whiskers represent the 10% and 90% points. The blue line represents the mean quality.
Plot zones: good quality, reasonable quality, poor quality.
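
The box-and-whisker values described on this slide (median, inter-quartile range, 10% and 90% points, mean) can be computed per read position roughly as follows. This is a sketch under assumptions: Phred+33 encoding, a simple linear-interpolation percentile, and made-up function names; FastQC's own percentile calculation may differ in detail.

```python
from statistics import mean

def percentile(sorted_scores, pct):
    """Linear-interpolation percentile of an already sorted, non-empty list."""
    if len(sorted_scores) == 1:
        return float(sorted_scores[0])
    rank = (len(sorted_scores) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_scores) - 1)
    spread = sorted_scores[upper] - sorted_scores[lower]
    return sorted_scores[lower] + spread * (rank - lower)

def per_base_quality(quality_strings, offset=33):
    """Return one dict of summary statistics per read position."""
    columns = []
    for qual in quality_strings:
        for pos, char in enumerate(qual):
            if pos == len(columns):
                columns.append([])
            columns[pos].append(ord(char) - offset)
    summary = []
    for scores in columns:
        scores.sort()
        summary.append({
            "mean": mean(scores),
            "median": percentile(scores, 50),
            "q1": percentile(scores, 25),
            "q3": percentile(scores, 75),
            "p10": percentile(scores, 10),
            "p90": percentile(scores, 90),
        })
    return summary

# Three hypothetical two-base reads: position 1 is good, position 2 varies.
stats = per_base_quality(["I5", "I!", "I+"])
print(stats[1])
```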

32 Per base sequence quality
Bad data: high variance; quality decreases towards the end of the read.
Plot zones: good, reasonable, poor.

33 Per sequence quality scores
Shows whether a subset of sequences has universally low quality values (low-quality reads).
Good data: most of the reads are high-quality sequences.
Bad data: distribution with bi-modalities.

34 Per base sequence content
Good data: smooth over the read.
Bad data: sequence position bias. Library contamination (overrepresented sequence)?
Plots, for each base position in a file, the proportion of calls made for each of the four normal DNA bases -> little or no difference between the different bases in a random library. The relative amount of each base should reflect the overall amount of these bases in your genome, but in any case they should not be hugely imbalanced from each other.
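
The proportions plotted in this module can be sketched as a per-position count of each base. The function name and the toy reads below are assumptions for illustration; N is included alongside the four normal bases so that uncalled positions remain visible.

```python
from collections import Counter

def per_base_content(sequences):
    """Return, for each read position, the fraction of A, C, G, T and N calls."""
    counts = []
    for seq in sequences:
        for pos, base in enumerate(seq.upper()):
            if pos == len(counts):
                counts.append(Counter())
            counts[pos][base] += 1
    content = []
    for column in counts:
        total = sum(column.values())
        content.append({base: column[base] / total for base in "ACGTN"})
    return content

# Hypothetical reads: every read starts with A, a clear position bias.
content = per_base_content(["ACGT", "ACGA", "AN"])
print(content[0])
```

In a random library the four fractions should run roughly parallel along the read; a position where one base dominates, as at the first position here, is the kind of bias the slide warns about.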

35 Per base GC content
Good data: smooth over the read.
Bad data: sequence position bias. Library contamination (overrepresented sequence)?
Plots the GC content of each base position in a file -> little or no difference between the different positions (random library). The overall GC content should reflect the GC content of the underlying genome.

36 Per sequence GC content
Good data: normal distribution; the distribution fits the expected one (organism dependent).
Bad data: the distribution does not fit the expected one. Library contamination?
Measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content.
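
As a rough illustration of what this module measures, the sketch below computes the GC percentage of each read and the mean and standard deviation of the resulting distribution. It does not reproduce the fit to a modelled normal distribution; the function names and reads are hypothetical.

```python
from statistics import mean, pstdev

def gc_percent(seq):
    """GC percentage over the called bases of one read (None if no base was called)."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / called

def gc_distribution(sequences):
    """Mean and standard deviation of per-read GC percentages."""
    values = [gc for gc in map(gc_percent, sequences) if gc is not None]
    return {"mean_gc": mean(values), "std_gc": pstdev(values), "reads": len(values)}

# Hypothetical reads; the all-N read is skipped.
dist = gc_distribution(["ATGC", "GGCC", "ATAT", "NNNN"])
print(dist)
```

A second peak in the per-read GC values, away from the expected value for the organism, is one pattern that would suggest contamination.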

37 Per base N content
Good data. Bad data: there is an N bias at specific base positions.
Plots the percentage of base calls at each position for which an N was called. It's not unusual to see a very low proportion of Ns in a sequence, especially nearer the end of the sequence.

38 Sequence length distribution
Some sequencers generate sequence fragments of uniform length. Some sequencers output reads of different lengths (for example, Roche 454).

39 Sequence duplication levels
Bad data: high number of duplicates. May indicate some kind of enrichment bias (e.g. PCR over-amplification).
Good data: low level of duplication. May indicate a very high level of coverage in the target sequence.
In transcriptomics, a higher number of duplicated sequences is expected. In genomics, a low number of duplicated sequences is expected.
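
A simple way to get a feel for duplication levels is to count exact duplicate reads. The sketch below does only that; it is an assumption-laden illustration with made-up names, and it does not reproduce the sampling rules that FastQC applies in its own duplication module.

```python
from collections import Counter

def duplication_summary(sequences):
    """Count exact duplicate reads in an iterable of sequence strings."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sequences given")
    distinct = len(counts)
    return {
        "total_reads": total,
        "distinct_reads": distinct,
        # reads beyond the first copy of each distinct sequence
        "pct_duplicated": 100.0 * (total - distinct) / total,
        "most_common": counts.most_common(1)[0],
    }

# Hypothetical reads: AAA appears three times.
dup = duplication_summary(["AAA", "AAA", "CCC", "GGG", "AAA"])
print(dup)
```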

40 Overrepresented sequences and K-mer content
The exact same sequence appears too many times. Is that a problem? It depends: PCR primers, adapters, …

41 Typical artifacts Sequence adapters

42 Typical artifacts Platform dependent

43 Sequence Filtering
It is important to remove bad-quality data -> this improves our confidence in the downstream analysis.

44 Sequence Filtering
Sequence filtering: mean quality; read length; read length after trimming; percentage of bases above a quality threshold.
Adapter trimming: adapter reads.
Minimum quality threshold.
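
Two of the criteria listed on this slide, trimming against a minimum quality threshold and filtering on mean quality and read length after trimming, can be sketched as below. The thresholds (Q20, 30 bases), the Phred+33 offset and the function names are assumptions for illustration, not the defaults of any of the tools used in the course.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their quality is below min_q."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def passes_filter(qual, min_mean_q=20, min_length=30, offset=33):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    if len(qual) < min_length:
        return False
    scores = [ord(char) - offset for char in qual]
    return sum(scores) / len(scores) >= min_mean_q

# Hypothetical read whose last two bases have quality 0.
seq, qual = trim_3prime("ACGTAC", "IIII!!")
print(seq, qual, passes_filter(qual, min_length=4))
```

Trimming first and filtering on the trimmed length afterwards keeps reads that are good apart from a poor tail, instead of discarding them for a low overall mean.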

45 Sequence Filtering
Sequence filtering tools: Fastx-toolkit, Galaxy, SeqTK, Cutadapt, …

46 Where are we?
Sequence processing; Mapping; Variant calling; Transcript quantification; Variant annotation; DE analysis; miRNA prediction

47 Quality Control: Mapping Data
Coverage. Mapped reads. Uniformity. Biases. Software: BamQC

48 Qualimap Example

49 Quality Control: Capture Data
Sensitivity. Specificity. Uniformity. Biases. Software: ngsCAT

50 ngsCAT Example

