Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca

Module 5: Small variant calling & annotation
Guillaume Bourque
Informatics on High-throughput Sequencing Data, June 10-11, 2015

Learning Objectives of Module
- Have an overview of the variant calling analysis workflow
- Understand the basic principles of variant calling
- Know what can improve the variant calls
- Learn how to filter and annotate variants
- Be able to call and annotate small variants
- Learn about the VCF format
- Visualize SNPs and indels in IGV

Simplified variant analysis workflow Louis Letourneau

Main analysis steps
- Quality control
- Pre-processing (trimming, adapter removal, …)
- Mapping (Module 2)
- Small variant calling (Module 5, this module!)
  - SNP and indel calling
  - Variant filtering and annotation
- Structural variant calling (Module 6)

Importance of quality control
Before you start an analysis, it’s very important to look at your raw data!
- Are all of your samples sequenced using the same protocol and instruments?
- Are there any technical issues affecting some of the samples?
This is especially important if you plan to compare different samples or different conditions.

Running FastQC on read 1: very good!

Running FastQC on read 2: pretty good!

Adapter sequences in reads
Sources: http://www.illumina.com, http://srna-workbench.cmp.uea.ac.uk

Check for over-represented sequences

Read trimming tools
For example, Trimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-end data:
- ILLUMINACLIP: Cut adapter and other Illumina-specific sequences from the read.
- SLIDINGWINDOW: Perform sliding-window trimming, cutting once the average quality within the window falls below a threshold.
- LEADING: Cut bases off the start of a read if they are below a threshold quality.
- TRAILING: Cut bases off the end of a read if they are below a threshold quality.
- CROP: Cut the read to a specified length.
- HEADCROP: Cut the specified number of bases from the start of the read.
- MINLEN: Drop the read if it is below a specified length.
- TOPHRED33: Convert quality scores to Phred-33.
- TOPHRED64: Convert quality scores to Phred-64.
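The sliding-window and minimum-length ideas can be sketched in a few lines of Python. This is a simplified illustration of the principle, not Trimmomatic's actual implementation, which may differ in detail. The function names and the default values (window of 4, mean quality 15, minimum length 36) are assumptions chosen for the example.

```python
def sliding_window_trim(seq, quals, window=4, min_avg_q=15):
    """Cut the read at the first window whose mean quality is below min_avg_q.

    seq is the base string, quals a list of integer Phred scores.
    Returns the kept prefix of both.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    if not quals:
        return seq, quals
    # Slide from the 5' end; a read shorter than the window gets one short window.
    last_start = max(len(quals) - window, 0)
    for start in range(last_start + 1):
        win = quals[start:start + window]
        if sum(win) / len(win) < min_avg_q:
            return seq[:start], quals[:start]
    return seq, quals


def trim_read(seq, quals, window=4, min_avg_q=15, min_len=36):
    """Sliding-window trim, then drop the read (return None) if it is too short."""
    seq, quals = sliding_window_trim(seq, quals, window, min_avg_q)
    if len(seq) < min_len:
        return None
    return seq, quals


if __name__ == "__main__":
    read = "ACGTACGTAC"
    quals = [30, 30, 30, 30, 30, 30, 10, 5, 2, 2]
    print(sliding_window_trim(read, quals))
    print(trim_read(read, quals, min_len=5))
```

Trimming before mapping removes the low-quality tail that would otherwise produce mismatches resembling variants.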

SNP Discovery: Goal
Distinguish true SNPs from sequencing errors. (Figure: Michael Strömberg)

Base quality
$Q_{\mathrm{Phred}} = -10 \log_{10} P(\mathrm{error})$
$P(\mathrm{error}) = 10^{-Q/10}$

Q    P(error)
10   10%
20   1%
30   0.1%
40   0.01%
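As a quick check of the Phred relationship, the sketch below converts between quality scores and error probabilities. The function names are illustrative; the Phred+33 decoding assumes the common FASTQ encoding in which the ASCII code of each character minus 33 gives the quality.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality for an error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: P(error) = {phred_to_error_prob(q):.2%}")
    print(decode_phred33("I5+!"))
```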

SNP Discovery: Base Qualities
Figure labels: consecutive base pairs; high quality; low quality. (Michael Strömberg)

SNPs & Bayesian Statistics
Figure annotations: # of individuals; base quality; allele call in read. (Michael Strömberg)
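To make the Bayesian idea concrete, here is a minimal single-sample sketch that combines the allele call in each read with its base quality to obtain diploid genotype posteriors. It is not the model of any particular caller: the uniform error model (an error is equally likely to give any of the three other bases), the prior values, and the function names are all assumptions made for illustration.

```python
import math


def _base_prob(base, allele, err):
    """P(observed base | true allele) under a uniform error model (assumption)."""
    return 1 - err if base == allele else err / 3


def genotype_posteriors(pileup, ref, alt, het_prior=1e-3):
    """Posterior probability of ref/ref, ref/alt and alt/alt at one site.

    pileup is a list of (base, phred_quality) pairs, one per read.
    het_prior is an illustrative prior for a heterozygous site.
    """
    genotypes = {"ref/ref": (ref, ref), "ref/alt": (ref, alt), "alt/alt": (alt, alt)}
    priors = {
        "ref/ref": 1 - 1.5 * het_prior,
        "ref/alt": het_prior,
        "alt/alt": het_prior / 2,
    }
    log_post = {}
    for name, (a1, a2) in genotypes.items():
        total = math.log(priors[name])
        for base, qual in pileup:
            # Cap the error at 0.75 so a Q0 base is uninformative, not impossible.
            err = min(10 ** (-qual / 10), 0.75)
            # A read samples either chromosome with probability 1/2.
            total += math.log(0.5 * _base_prob(base, a1, err)
                              + 0.5 * _base_prob(base, a2, err))
        log_post[name] = total
    # Normalise in log space to avoid underflow at high depth.
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {name: math.exp(v - top) / norm for name, v in log_post.items()}


if __name__ == "__main__":
    balanced = [("A", 30)] * 5 + [("G", 30)] * 5
    one_poor_read = [("A", 30)] * 9 + [("G", 10)]
    print(genotype_posteriors(balanced, "A", "G"))
    print(genotype_posteriors(one_poor_read, "A", "G"))
```

With balanced high-quality support for both alleles the heterozygous genotype dominates, while a single low-quality mismatch is explained more cheaply as a sequencing error.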

Strategies that improve variant calling
- Local realignment
- Duplicate marking
- Base quality recalibration
- Population structure and imputation

Local realignment
Figure panels: before realignment; after realignment. (DePristo et al. Nat Genet 2011)

Duplicate marking www.broadinstitute.org
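The principle behind duplicate marking can be sketched as follows: reads that start at the same position on the same strand are assumed to come from the same original fragment, and only the best-quality one is kept unmarked. This is a simplified single-end illustration, not the algorithm of any specific tool; production tools also consider mate positions and clipping. The Read fields and the function name are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    chrom: str
    start: int          # 5' alignment start
    strand: str         # "+" or "-"
    qual_sum: int       # sum of base qualities, used to pick the representative
    is_duplicate: bool = False


def mark_duplicates(reads):
    """Flag all but the highest-quality read in each (chrom, start, strand) group."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.start, read.strand)].append(read)
    for group in groups.values():
        best = max(group, key=lambda r: r.qual_sum)
        for read in group:
            read.is_duplicate = read is not best
    return reads


if __name__ == "__main__":
    reads = [
        Read("r1", "chr1", 100, "+", 3000),
        Read("r2", "chr1", 100, "+", 3400),
        Read("r3", "chr1", 100, "-", 2900),
        Read("r4", "chr1", 250, "+", 3100),
    ]
    for read in mark_duplicates(reads):
        print(read.name, read.is_duplicate)
```

Marking rather than deleting keeps the reads in the file while letting the variant caller ignore them, so PCR-amplified errors are not counted as independent evidence.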

Base quality recalibration Adapted from DePristo et al. Nat Genet 2011
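The core idea of recalibration is to compare the quality the sequencer reported with the mismatch rate actually observed at positions believed to be non-variant. The sketch below bins observations by reported quality only; real recalibration tools use additional covariates such as machine cycle and sequence context. The function name, the input layout, and the +1/+2 smoothing are assumptions for illustration.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map each reported quality to an empirical Phred quality.

    observations is an iterable of (reported_quality, is_mismatch) pairs taken
    from positions assumed not to be real variants, so mismatches count as errors.
    """
    counts = defaultdict(lambda: [0, 0])   # reported quality -> [bases seen, mismatches]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_mismatch)
    # The +1 / +2 smoothing keeps the estimate finite when no mismatch is seen.
    return {
        q: -10 * math.log10((mismatches + 1) / (seen + 2))
        for q, (seen, mismatches) in counts.items()
    }


if __name__ == "__main__":
    # Bases reported as Q30 that are wrong about 1% of the time behave like Q20.
    obs = [(30, True)] * 9 + [(30, False)] * 989
    print(empirical_qualities(obs))
```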

Using haplotypes for base calling
Suppose that only 2 haplotypes have been observed in a population:
  Chr1: ..........A....T.......G..........
  Chr1: ..........C....G.......A..........
And that you observe the following reads:
  ......A....N.......G..
  ..A....N.......G.....
  ...A....N.......G...
Can you guess the value of N?
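The reasoning in this exercise can be written down directly: find the known haplotype that agrees with every called base in the read and use it to fill in the missing one. In the sketch below, the site names (site1 to site3) and the function name are hypothetical labels for the three variable positions shown on the slide.

```python
# The two haplotypes from the slide, reduced to their three variable sites.
HAPLOTYPES = [
    {"site1": "A", "site2": "T", "site3": "G"},
    {"site1": "C", "site2": "G", "site3": "A"},
]


def impute_missing(read, haplotypes):
    """Replace 'N' calls using the single haplotype compatible with the read.

    read maps site names to called bases ('N' means unknown). If zero or
    several haplotypes are compatible, the read is returned unchanged.
    """
    compatible = [
        hap for hap in haplotypes
        if all(base == "N" or hap[site] == base for site, base in read.items())
    ]
    if len(compatible) != 1:
        return dict(read)
    hap = compatible[0]
    return {site: hap[site] if base == "N" else base for site, base in read.items()}


if __name__ == "__main__":
    observed = {"site1": "A", "site2": "N", "site3": "G"}
    print(impute_missing(observed, HAPLOTYPES))
```

All three reads carry A at the first site and G at the third, which matches only the first haplotype, so N is most likely T.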

Impact of using multi-samples and haplotype information Nielsen et al. Nat Rev Genet 2011

GATK framework DePristo et al. Nat Genet 2011

GATK framework Module 5 Module 2 DePristo et al. Nat Genet 2011

File format: BAM files | File size: 200 GB | Recalibrated BQ, duplicates removed
Tools: GATK, samtools, freeBayes, cortex_var | Time: 10 hours
File format: Raw variants (VCF) | File size: 1 GB | Sites with non-reference bases are genotyped
Adapted from Mark DePristo

VCF format
Annotated fields in the example file:
- Mandatory header lines
- Reference base
- Alternative base
- Quality score
- Allele frequency, read depth, etc.
Specification: https://samtools.github.io/hts-specs/VCFv4.2.pdf
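To show how these fields appear in practice, the sketch below parses one tab-separated VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The example line is made up for illustration, and the function name is an assumption; for real work a dedicated VCF library is a safer choice.

```python
def parse_vcf_record(line):
    """Parse the eight fixed columns of a single VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line has at least 8 tab-separated columns")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),                       # 1-based position
        "id": var_id,
        "ref": ref,                            # reference base(s)
        "alt": alt.split(","),                 # one or more alternative alleles
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,                     # e.g. read depth, allele frequency
    }


# Made-up record: an A>G SNP with read depth 20 and allele frequency 0.5.
EXAMPLE_LINE = "1\t12345\t.\tA\tG\t50.0\tPASS\tDP=20;AF=0.5;DB"

if __name__ == "__main__":
    print(parse_vcf_record(EXAMPLE_LINE))
```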

Variant filtering
Raw variant calls contain many false positives. How to filter?
- Manual filtering based on different parameters (e.g. using GATK VariantFiltration or SnpSift):
  - Based on quality score, depth of coverage, etc.
  - Difficult and requires time and expertise
- Learn the filters from the data itself (e.g. GATK VariantRecalibrator):
  - Better rank-order variants based on their likelihood of being real
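Manual (hard) filtering amounts to applying fixed thresholds to each call. The sketch below does this for quality score and depth of coverage on plain Python dicts; it does not reproduce the syntax of GATK VariantFiltration or SnpSift. The thresholds, the filter labels, and the record layout are assumptions for illustration and would need tuning on real data.

```python
def hard_filter(variants, min_qual=30.0, min_depth=10):
    """Split variant calls into passed and failed lists using fixed thresholds.

    Each variant is a dict with at least 'qual' and 'depth' keys. A copy of the
    record is returned with a 'filter' key: 'PASS' or the reasons for failing.
    """
    passed, failed = [], []
    for variant in variants:
        reasons = []
        if variant["qual"] < min_qual:
            reasons.append("LowQual")
        if variant["depth"] < min_depth:
            reasons.append("LowDepth")
        labelled = {**variant, "filter": ";".join(reasons) or "PASS"}
        (failed if reasons else passed).append(labelled)
    return passed, failed


if __name__ == "__main__":
    calls = [
        {"pos": 101, "qual": 55.0, "depth": 24},
        {"pos": 202, "qual": 12.0, "depth": 30},
        {"pos": 303, "qual": 8.0, "depth": 3},
    ]
    kept, removed = hard_filter(calls)
    print("kept:", kept)
    print("removed:", removed)
```

Choosing thresholds by hand is what makes this approach slow and expertise-dependent, which motivates learning the filter from the data instead.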

QC: HapMap & dbSNP
- International HapMap Project (phase III)
  - 1301 individuals in 11 populations genotyped
  - ~1 SNP per 2 kb
  - Proxy for false negatives
- dbSNP (build 130)
  - 14 million SNPs in human genome
  - Varying quality
  - Proxy for false positives
Michael Strömberg
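These two proxies translate into simple set arithmetic. In the sketch below, the fraction of calls absent from a known-variant database stands in for the false-positive rate, and the fraction of trusted genotyped sites that were recovered stands in for sensitivity. The function name, the (chrom, pos, alt) site representation, and the toy data are assumptions for illustration.

```python
def call_set_qc(called, known_db, truth_sites):
    """Summarise a call set against a known-variant database and a truth set.

    All three arguments are collections of hashable site keys, for example
    (chrom, pos, alt) tuples. truth_sites should only contain sites where the
    sample is known to carry the variant.
    """
    called, known_db, truth_sites = set(called), set(known_db), set(truth_sites)
    in_db = len(called & known_db) / len(called) if called else 0.0
    recovered = len(called & truth_sites) / len(truth_sites) if truth_sites else 0.0
    return {
        "fraction_in_known_db": in_db,          # high values suggest few false positives
        "fraction_novel": 1.0 - in_db if called else 0.0,
        "truth_sensitivity": recovered,         # low values suggest false negatives
    }


if __name__ == "__main__":
    called = {("1", 100, "A"), ("1", 200, "T"), ("2", 300, "G"), ("2", 400, "C")}
    known = {("1", 100, "A"), ("1", 200, "T"), ("2", 300, "G")}
    truth = {("1", 100, "A"), ("3", 500, "T")}
    print(call_set_qc(called, known, truth))
```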

Variant Quality Recalibration DePristo et al. Nat Genet 2011

Somatic Mutations in 100 kidney tumours 1000 mutations (Total 575693) Scelo G et al. Nat Commun 2014

Somatic Mutations in 100 kidney tumours 1000 mutations (Total 575693) 1000 coding mutations (Total 6172) Scelo G et al. Nat Commun 2014

Annotating variants with SnpEff
- Annotations using reference genomes
- Calculate effects:
  - Coding (e.g. Syn, Non-Syn, Stop gained, Splice)
  - Non-coding (e.g. TFBS)
- Basic prioritizations (putative impact): {HIGH, MODERATE, LOW, MODIFIER}
- And many other things…
Pablo Cingolani
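Once each variant carries a putative impact, a basic prioritization is just a sort on that category. The sketch below assumes the annotations have already been extracted into dicts; it does not parse SnpEff output, and the gene names, effects, and record layout are made up for illustration.

```python
# Lower rank means higher priority; categories are the four listed on the slide.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def prioritize(variants):
    """Return variants ordered from highest to lowest putative impact.

    Python's sort is stable, so variants with equal impact keep their input order.
    """
    return sorted(variants, key=lambda v: IMPACT_RANK[v["impact"]])


def keep_at_least(variants, min_impact="MODERATE"):
    """Keep only variants whose impact is min_impact or more severe."""
    cutoff = IMPACT_RANK[min_impact]
    return [v for v in variants if IMPACT_RANK[v["impact"]] <= cutoff]


if __name__ == "__main__":
    # Hypothetical annotated calls.
    annotated = [
        {"gene": "GENE_A", "effect": "Syn", "impact": "LOW"},
        {"gene": "GENE_B", "effect": "Stop gained", "impact": "HIGH"},
        {"gene": "GENE_C", "effect": "Non-Syn", "impact": "MODERATE"},
        {"gene": "GENE_D", "effect": "Intergenic", "impact": "MODIFIER"},
    ]
    for variant in prioritize(annotated):
        print(variant["impact"], variant["gene"], variant["effect"])
```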

File format: BAM files | File size: 200 GB | Recalibrated BQ, duplicates removed
Tools: samtools, GATK, freeBayes, cortex_var | Time: 10 hours
File format: Raw variants (VCF) | File size: 1 GB | Sites with non-reference bases are genotyped
Tools: GATK, SnpSift & SnpEff | Time: 30 min
Tools: Expert user judgment | Time: days
File format: Filtered & annotated variants (VCF) | File size: 1 GB | Separate true segregating variation from machine/alignment artifacts
Adapted from Mark DePristo

Lab time!

We are on a Coffee Break & Networking Session