Introduction to Variant Analysis of Exome- and Amplicon sequencing data. Lecture by: Dr. Christian Rausch. Date: 29 May 2015.


1 Introduction to Variant Analysis of Exome- and Amplicon sequencing data. Lecture by: Dr. Christian Rausch. Date: 29 May 2015. Training: TraIT Galaxy Training. Extended version see: http://tinyurl.com/o74uehq

2 Focus of lecture and practical part. Lecture: from NGS data to variant analysis. Hands-on training: we will analyze NGS read data of a panel of cancer genes (Illumina TruSeq Amplicon - Cancer Panel) from the prostate cancer cell line VCaP. The analysis software tools will be run interactively through Galaxy, “a web-based platform for data intensive biomedical research”. Image source: A survey of tools for variant analysis of next-generation genome sequencing data, Pabinger et al., Brief Bioinform (2013) doi: 10.1093/bib/bbs086

3 Workflow: QC & Mapping reads. Input reads (FASTQ files) → Quality check with FastQC → Not OK? Quality- & adapter-trimming. OK? Map reads to reference genome using e.g. BWA or Bowtie2 → Sort by coordinates using SAMtools sort or Picard Tools SortSam → Output: sorted BAM file (BAM = binary SAM; SAM = Sequence Alignment/Map).

4 Variant Calling & Annotation pipeline. Input: reads mapped to reference genome (SAM or BAM file). SAMtools Mpileup: analyze mismatches & compute likelihoods of SNPs etc. Varscan2: does the actual calling. Output: VCF with various statistics on the quality of each variant (read depth etc.), homozygous/heterozygous etc. Filter variant allele frequency: discard variants with a variant allele frequency below the threshold. Slice VCF: cut the VCF file to retain only the regions that were enriched for sequencing (= discard regions covered by off-target reads). ANNOVAR: annotate SNPs with statistics (dbSNP, 1000 genomes etc.) and predictions (SIFT, PolyPhen etc.). DGIdb (Drug Gene Interaction Database): find drugs for diseases arising from gene mutations.

5 Target enrichment used for selecting exomes Image source: Agilent

6 Selecting parts of the genome for sequencing. The Illumina TruSeq Amplicon - Cancer Panel uses multiplex PCR to amplify a selected part of the genome (a selection of the exons of 48 genes is targeted with 212 amplicons).

7 Properties of Reads (Illumina). Typical read length: 50 … 100 … 150 … 200 … 300 bp. Paired reads: insert size 200 – 500 bp. Mate pairs: insert size several kbp. Depending on which Illumina platform is used, the read quality drops after 100, 150 or 200 bp. Errors in Illumina reads are typically substitution errors. Source: evomics.org/2014/01/alignment-methods/ Image source: Mate Pair v2 Sample Prep Guide For 2-5 kb Libraries

8 Quality Measure: Phred Score. Phred scores are quality scores originally developed by the base calling program Phred, used with Sanger sequencing data. The Phred quality score Q is defined as a property which is logarithmically related to the base-calling error probability P: error rate $P = 10^{-Q/10}$, i.e. $Q = -10 \log_{10} P$. Example: Phred score 30 = error rate $10^{-3}$ = 1 base in 1000 will be wrong. Illumina’s ‘Q score’ = Phred score. The base calling programs that convert raw data to sequence data (the ‘base callers’) need to be ‘trained’ to give realistic quality values.
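
The two Phred formulas on this slide can be tried out with a few lines of Python. This is only a minimal sketch; the function names are made up for illustration and do not come from any sequencing tool.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability P for a Phred score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score Q for an error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Print the error probability for a few commonly quoted Phred scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

For Q = 30 this reproduces the slide's example of one wrong base in 1000.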

9 FASTA and FASTQ format. FASTA: standard format to store sequence data (DNA and protein seq.). A line starting with ‘>’ is the FASTA header, which often contains unique identifiers and descriptions of the sequence. FASTQ: standard format to store sequencing reads with quality information (‘Q’ stands for Quality). Each FASTQ record has four parts: ‘@’ followed by unique identifiers and optional descriptions of the sequence; the actual DNA sequence; a ‘+’ separator, optionally followed by a description; the quality values of the sequence (one character per nucleotide). For more info see Wikipedia ‘FASTQ_format’.
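
As an illustration of the FASTQ layout described above, here is a small reader sketch. It assumes the common case of exactly four lines per record (no wrapped sequence lines); the names `FastqRecord` and `parse_fastq` and the example read are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@'
    sequence: str    # the actual DNA sequence
    quality: str     # one quality character per nucleotide


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ input (assumption: no line wrapping)."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical example read, not taken from the course data.
example = ["@read1 hypothetical example read\n", "GATTACA\n", "+\n", "IIIIHH#\n"]
for record in parse_fastq(example):
    print(record.identifier, record.sequence, record.quality)
```

The same function also accepts an open file handle, since it only iterates over lines.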

10 Quality control with FastQC. The quality of reads needs to be checked before further analysis. The program FastQC is the quasi-standard. Sequencing platform companies also provide their own tools for quality control.

11 Quality control: FastQC. The encoding tells which set of characters stands for which Phred scores. There is also the encoding Illumina 1.5, and others. Other programs might not automatically recognize the encoding. In Galaxy, the encoding of a FastQ file can be set via the ‘pen’ symbol.
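
What “encoding” means in practice can be sketched as follows: each quality character is turned into a Phred score by subtracting an offset from its ASCII code. The offsets used here (33 for Sanger / Illumina 1.8+, 64 for Illumina 1.3 and 1.5) are the commonly documented ones; the function name is hypothetical.

```python
def decode_qualities(quality_string: str, offset: int = 33) -> list[int]:
    """Convert FASTQ quality characters to Phred scores.

    offset=33: Sanger / Illumina 1.8+ encoding
    offset=64: Illumina 1.3 and 1.5 encodings
    """
    return [ord(char) - offset for char in quality_string]


# The same Phred score of 40 is written as 'I' with offset 33 and 'h' with offset 64.
print(decode_qualities("I", offset=33))
print(decode_qualities("h", offset=64))
```

Decoding a file with the wrong offset shifts every score by 31, which is why a tool that misreads the encoding gives misleading quality values.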

12 Read Quality Control with FastQC. Examples of per base sequence quality of all reads. Historical example of very first Solexa reads, 2006 (Solexa acquired by Illumina in 2007). Not so good, might still be usable, depending on application. 50 bp.

13 Examples of other quality measures in FastQC. The upper 4 graphs are from the data set of the practical course: many reads are repeated and are apparently not uniformly distributed over the whole genome. Overrepresented sequences: the sequenced fragment was too short and the sequencing reaction ran into the adapter/PCR primer.

14 “Mapping reads to the reference” is finding where their sequence occurs in the genome. Source: Wikimedia, file:Mapping Reads.png. Figure labels: 100 bp identified; 200 – 500 bp unknown sequence.

15 “Mapping reads to the reference”: naïve text search algorithms are too slow. Naïve approach: compare each read with every position in the genome. This takes too long and will not find sequences with mismatches. Search programs typically create an index of the reference sequence (or text) and store the reference sequence (text) in an advanced data structure for fast searching. An index is basically like a phone book (with addresses) → quickly find the address (location) of a person. Example of an algorithm using ‘indexed seed tables’ to quickly find locations of exact parts of a read.
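
The idea of an indexed seed table can be sketched in a few lines: store the positions of every k-mer of the reference in a dictionary, then look up non-overlapping seeds of a read to get candidate mapping positions. This is a toy illustration with made-up names and a made-up reference, not the algorithm of any particular mapper; real tools verify and score each candidate afterwards.

```python
from collections import defaultdict


def build_seed_index(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read: str, index: dict, k: int) -> list:
    """Look up non-overlapping seeds of the read; return candidate read start positions."""
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for hit in index.get(seed, []):
            start = hit - offset  # where the read would begin if this seed is correct
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)


reference = "ACGTACGTTAGC"  # toy reference sequence
index = build_seed_index(reference, k=4)
print(candidate_positions("ACGTTAGC", index, k=4))  # exact read
print(candidate_positions("AGGTTAGC", index, k=4))  # same read with one mismatch
```

The read with a mismatch is still located because its second seed matches exactly, which is the point of splitting reads into seeds.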

16 “Mapping reads to the reference”: frequently used programs. BLAST, the most famous bioinformatics program since 1990, is used to find similar sequences in DNA and protein databases: 0.1-1 sec to find a result; mapping 60 million reads would take ~ 2 months on one CPU [1] → too slow for NGS. Popular tools for read mapping [1]: Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, and mrsFAST/mrFAST; CLCbio read mapper (commercial). No tool is the best tool in all conditions: they differ in speed and are differently optimized for mismatches, gap models, insertions & deletions, taking into account read base quality, local realignment of matches etc. [1] Hatem et al. BMC Bioinformatics 2013, 14:184 http://www.biomedcentral.com/1471-2105/14/184
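
The “~ 2 months” estimate can be checked with simple arithmetic, assuming the reads are searched one after another on a single CPU at the quoted 0.1-1 sec per query. The helper name is made up for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def mapping_days(n_reads: int, seconds_per_read: float) -> float:
    """Days needed to search n_reads one after another on a single CPU."""
    return n_reads * seconds_per_read / SECONDS_PER_DAY


# Lower and upper end of the per-query time quoted on the slide.
for seconds in (0.1, 1.0):
    print(f"{seconds} s per read: {mapping_days(60_000_000, seconds):.0f} days")
```

At 0.1 sec per read this gives roughly 69 days, in line with the slide's figure; at 1 sec per read it would be about ten times longer.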

17 Read Mappers: BWA and Bowtie2. Both are based on the Burrows-Wheeler Transform (BWT). BWT: special sorting of all letters in the text (sequence). Similar suffixes (word ends) will be close to each other. Easier to compress. Good for approximate string matching (sequence alignment). An index (FM index) is used for finding the locations of matched strings (sequences) in the genome.
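
To make the BWT less abstract, here is a naive sketch that builds the transform by sorting all rotations of the text and then counts exact occurrences of a pattern with backward search, the core operation behind an FM index. It is a teaching example only: BWA and Bowtie2 use far more memory-efficient constructions and precomputed count tables, and the function names are hypothetical.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    text += terminator  # the terminator must sort before all other characters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string: str, pattern: str) -> int:
    """Count exact matches of pattern using backward search on the BWT."""
    first_occurrence = {}
    for row, char in enumerate(sorted(bwt_string)):
        first_occurrence.setdefault(char, row)  # first row starting with char
    low, high = 0, len(bwt_string)  # half-open range of matching rows
    for char in reversed(pattern):
        if char not in first_occurrence:
            return 0
        low = first_occurrence[char] + bwt_string[:low].count(char)
        high = first_occurrence[char] + bwt_string[:high].count(char)
        if low >= high:
            return 0
    return high - low


transformed = bwt("banana")
print(transformed)
print(count_occurrences(transformed, "ana"))
```

Note how the transform groups equal letters together, which is what makes it easy to compress.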

18 Read Mapping: General problems. A read can match equally well at more than one location (e.g. repeats, pseudo-genes). A read can even fit less well to its actual position, e.g. if it carries a break point, insertions and/or deletions.

19 SAM and BAM files. SAM = Sequence Alignment/Map. BAM = binary SAM = compressed SAM. The Sequence Alignment/Map format contains information about how sequence reads map to a reference genome. Requires ~1 byte per input base to store sequences, qualities and meta information. Supports paired-end reads and color space from SOLiD. Is produced by Bowtie, BWA and other mapping tools. Partly from: genetics.stanford.edu/gene211/lectures/Lecture3_Resequencing_Functional_Genomics-2014.pdf

20 Example from: genetics.stanford.edu/gene211/lectures/Lecture3_Resequencing_Functional_Genomics-2014.pdf

21 Harvesting Information from SAM. Query name: QNAME (SAM) / read_name (BAM). FLAG provides the following information: are there multiple fragments? are all fragments properly aligned? is this fragment unmapped? is the next fragment unmapped? is this query the reverse strand? is the next fragment the reverse strand? is this the last fragment? is this a secondary alignment? did this read fail quality controls? is this read a PCR or optical duplicate? Source: www.cs.colostate.edu/~cs680/Slides/lecture3.pdf
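
The FLAG field is a single integer in which each question above corresponds to one bit. A short sketch of how to decode it, using the bit values from the SAM specification (which also defines a “first fragment” bit, 0x40, not listed on the slide); the table and function names are made up for illustration.

```python
# Bit values as defined in the SAM specification.
SAM_FLAGS = [
    (0x1, "template has multiple fragments (paired)"),
    (0x2, "all fragments properly aligned"),
    (0x4, "this fragment is unmapped"),
    (0x8, "next fragment is unmapped"),
    (0x10, "this fragment is on the reverse strand"),
    (0x20, "next fragment is on the reverse strand"),
    (0x40, "first fragment in the template"),
    (0x80, "last fragment in the template"),
    (0x100, "secondary alignment"),
    (0x200, "failed quality controls"),
    (0x400, "PCR or optical duplicate"),
]


def decode_flag(flag: int) -> list:
    """Return the description of every bit that is set in a SAM FLAG value."""
    return [description for bit, description in SAM_FLAGS if flag & bit]


# 99 = 1 + 2 + 32 + 64, a typical value for the first read of a proper pair.
for line in decode_flag(99):
    print(line)
```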

22 Variant Calling & Annotation

23 Possible reasons for a mismatch: True SNP. Error generated in library preparation. Base calling error (may be reduced by better base calling methods, but cannot be eliminated). Misalignment (mapping error); local re-alignment can improve the mapping. Error in reference genome sequence. Partly from www.biostat.jhsph.edu/~khansen/LecSNP2.pdf

24 Variant Calling: Principles. Naïve approach (used in early NGS studies): filter base calls according to quality, then filter by frequency. Typically, a quality filter of Phred Q20 was used (i.e., probability of error 1%). Then, frequency thresholds were applied to the frequency of the non-ref base, f(b). The frequency heuristic works well if the sequencing depth is high, so that the probability of a heterozygous nucleotide falling outside of the 20% - 80% region is low. Problems with the frequency heuristic: for low sequencing depth, it leads to undercalling of heterozygous genotypes; the use of a quality threshold leads to loss of information on individual read/base qualities; it does not provide a measure of confidence in the call. In parts from: compbio.charite.de/contao/index.php/genomics.html
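
A rough sketch of the naïve heuristic follows. The exact thresholds are an assumption: the cut-offs of 20% and 80% are inferred from the “20% - 80% region” mentioned above, and the function name and input layout (a list of base/quality pairs at one position) are hypothetical.

```python
def naive_genotype(pileup, ref_base, min_quality=20, low=0.2, high=0.8):
    """Call a genotype at one position from (base, phred_quality) pairs.

    Assumed thresholds: f(b) < low -> homozygous reference,
    f(b) > high -> homozygous non-reference, otherwise heterozygous.
    """
    # Step 1: filter base calls by quality.
    kept = [base for base, quality in pileup if quality >= min_quality]
    if not kept:
        return "no call"
    # Step 2: filter by the frequency of the non-reference base, f(b).
    nonref_fraction = sum(1 for base in kept if base != ref_base) / len(kept)
    if nonref_fraction < low:
        return "homozygous reference"
    if nonref_fraction > high:
        return "homozygous non-reference"
    return "heterozygous"


deep = [("A", 30)] * 6 + [("G", 30)] * 4
shallow = [("A", 30)] * 3 + [("G", 10)]  # the only G read is below Q20
print(naive_genotype(deep, ref_base="A"))
print(naive_genotype(shallow, ref_base="A"))
```

The second example shows the weaknesses listed on the slide: at low depth the single low-quality G is thrown away entirely and the site is called homozygous reference, with no measure of how uncertain that call is.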

25 Variant Calling: Principles. Today’s variant callers rely on probability calculations, using Bayes’ Theorem. E.g. MAQ, one of the first widely used read mappers and variant callers: takes into account a quality score for the whole read alignment & the quality of the base at the individual position; calls the most likely genotype given the observed substitutions; a reliability score can be calculated.
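
How Bayes' Theorem enters genotype calling can be sketched with a deliberately simplified model. This is not MAQ's actual model: it ignores alignment quality, considers only one alternative allele, treats reads as independent, and uses invented prior probabilities. All names are hypothetical.

```python
import math


def base_probability(observed: str, allele: str, error: float) -> float:
    """P(observed base | true allele), with errors spread over the 3 other bases."""
    return 1 - error if observed == allele else error / 3


def genotype_posteriors(observations, ref, alt, priors=None):
    """Posterior probability of each genotype given (base, phred_quality) pairs."""
    if priors is None:
        # Hypothetical priors, chosen only for illustration.
        priors = {"ref/ref": 0.998, "ref/alt": 0.001, "alt/alt": 0.001}
    genotypes = {"ref/ref": (ref, ref), "ref/alt": (ref, alt), "alt/alt": (alt, alt)}
    log_posterior = {}
    for name, (allele1, allele2) in genotypes.items():
        log_likelihood = 0.0
        for base, quality in observations:
            # Cap the error so that a Q0 base is uninformative, not impossible.
            error = min(10 ** (-quality / 10), 0.75)
            p = 0.5 * base_probability(base, allele1, error) \
                + 0.5 * base_probability(base, allele2, error)
            log_likelihood += math.log(p)
        log_posterior[name] = math.log(priors[name]) + log_likelihood
    # Normalise in log space to avoid numerical underflow.
    largest = max(log_posterior.values())
    unnormalised = {g: math.exp(v - largest) for g, v in log_posterior.items()}
    total = sum(unnormalised.values())
    return {g: v / total for g, v in unnormalised.items()}


observations = [("A", 30)] * 10 + [("G", 30)] * 10
posteriors = genotype_posteriors(observations, ref="A", alt="G")
print(max(posteriors, key=posteriors.get), posteriors)
```

Unlike the frequency heuristic, every base contributes according to its own quality, and the posterior of the best genotype doubles as a reliability score for the call.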

26 Variant Calling & Annotation: Popular Tools: SAMtools (Mpileup & Bcftools), GATK, Varscan2, Freebayes, MAQ

27 VCF = Variant Call Format; BCF = binary version of VCF

28 dbSNP and snpEff. dbSNP = the Single Nucleotide Polymorphism Database @ NCBI. Different collections of SNPs are available: ‘all humans’, human subpopulations, different clinical significance (www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf). snpEff is a program that can annotate a collection of SNVs according to information available in dbSNP and information extracted from the location of the SNV (exon, intron, silent/nonsense mutation etc.).

29 ANNOVAR: a Swiss Army knife to annotate genetic variants (SNPs and CNVs). Input: variants as VCF file; various databases with statistical and predictive annotations: dbSNP, 1000 genomes, … Output: In coding region? Which gene? How frequently observed in the 1000 genomes project? (and more statistics), according to coordinates from RefSeq genes, UCSC genes, ENSEMBL genes, GENCODE genes or others. In non-coding region? Conserved region? According to conservation in 44 species, transcription factor binding sites, GWAS hits, etc. Predicted effect? Scores of SIFT, PolyPhen-2, GERP++.

30 Annotation with DGIdb: mining the druggable genome. Drug Gene Interaction Database: matching disease genes with potential drugs. Searches genes against a compendium of drug-gene interactions and identifies potentially 'druggable' genes.

32 Practical part

33 Import Workflow into Galaxy. Import the workflow ‘Goecks Exome pipeline hg19_noDedup’ to your workflows by either clicking on ‘Shared Data’ → ‘Published Workflows’ → ‘Goecks Exome pipeline hg19_noDedup’ → green plus symbol to import, or by clicking here: http://bioinf-galaxian.erasmusmc.nl/galaxy/u/crausch/w/imported-goecks-exome-pipeline-hg19 This workflow has been imported with small modifications from: http://www.usegalaxy.org/cancer Ref: Goecks et al., Cancer Med. 2014.

34 Import data set. Import data by either clicking on: http://bioinf-galaxian.erasmusmc.nl/galaxy/u/crausch/h/vcap-variant-analysis or going to ‘Shared Data’ → ‘Published Histories’ → ‘VCaP Variant Analysis’, or download the data from http://tinyurl.com/pfmshlu, unzip and upload to Galaxy.

35 Run Goecks’ Exome analysis pipeline on your data. Choose the parameter ‘Variant allele frequency’: e.g. 10%. Choose the name of the data. The genomic regions (BED file) contain the locations of the exons / amplicons. Overview of the workflow: see next slide. Because the whole workflow runs about 20 min, you can import all results from: http://bioinf-galaxian.erasmusmc.nl/galaxy/u/crausch/h/goecks-exome-pipeline-hg19nodedupvcap Before looking at the results of the variant analysis pipeline, check that our sequencing data has good quality, using the program FastQC (logically, that would be the first thing one would do…).

36 Overview of the Variant Analysis workflow Legend: see next slide

37 Overview of the Variant Analysis workflow. 1. Input: 2 FASTQ files, of the forward and reverse reads. Make sure that sequencing adapters and parts of reads with low quality have been removed. 2. Mapping of the reads to the reference genome hg19, output format: BAM. 3. Sorting: reads are sorted according to their coordinate position on the genome. 4. Marking and removing of ‘duplicate reads’: reads with the identical position on the genome are likely duplicates created during the PCR amplification step. Exome sequencing typically relies on hybridization-based selection of sheared genomic DNA fragments in the so-called target regions. Because the DNA shearing step is (assumed to be) a random process, reads starting at exactly the same position are more likely to be due to PCR amplification than to originate from two independent fragments starting at the same position. Note: this step is omitted when processing amplicon sequencing reads, because reads of a given amplicon all start at the same position but can be copies of different original templates. 5. Summary of alignment statistics. 6. MPileup: variant counts per position and statistics (1st step of variant calling). 7. Varscan2: variant calling with the program Varscan2. Check: different analysis types are possible. In this course, “Analysis type: single nucleotide variation” is selected. Output format: VCF (Variant Call Format). 8. Label and filter out variants with low variant allele frequency. 9. Slice VCF: discard all genomic regions except the exons (defined in the input BED file). 10. ANNOVAR: filter and annotate variants. 11. Annotate with DGIdb (Drug Gene Interaction database).
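
Steps 8 and 9 can be illustrated with a small sketch that keeps only variants with a sufficient variant allele frequency that lie inside the target regions. It is not the code of the Galaxy tools used in the workflow: the variant records are simplified dictionaries with hypothetical field names and made-up read counts. One detail worth noting is that BED intervals are 0-based and half-open, whereas VCF positions are 1-based.

```python
def variant_allele_frequency(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a position that support the variant allele."""
    depth = ref_reads + alt_reads
    return alt_reads / depth if depth else 0.0


def in_regions(chrom: str, pos: int, regions) -> bool:
    """True if the 1-based VCF position lies in a 0-based, half-open BED interval."""
    return any(c == chrom and start < pos <= end for c, start, end in regions)


def filter_variants(variants, regions, min_vaf: float = 0.10):
    """Keep variants with VAF >= min_vaf that fall inside the target regions."""
    kept = []
    for variant in variants:
        vaf = variant_allele_frequency(variant["ref_reads"], variant["alt_reads"])
        if vaf >= min_vaf and in_regions(variant["chrom"], variant["pos"], regions):
            kept.append({**variant, "vaf": vaf})
    return kept


# Made-up target region and variants, for illustration only.
regions = [("chr1", 100, 200)]
variants = [
    {"chrom": "chr1", "pos": 150, "ref_reads": 60, "alt_reads": 40},  # kept
    {"chrom": "chr1", "pos": 160, "ref_reads": 95, "alt_reads": 5},   # VAF too low
    {"chrom": "chr1", "pos": 500, "ref_reads": 50, "alt_reads": 50},  # off target
]
for variant in filter_variants(variants, regions):
    print(variant)
```

The default of 10% mirrors the example value for the ‘Variant allele frequency’ parameter used in the practical part.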

38 Variant filtering and visualization. Open the ANNOVAR output in a new tab. Move the mouse over the lines of the VCF file and open the file in Trackster (click on the bar chart symbol). In Trackster, load the sorted BAM file as an additional track. Choose one mutation that is annotated by the last annotation step, DGI. What was done in the last filter step before DGI was run? Read all annotations of the mutation you have selected. Browse to this genomic location in Trackster. What is the coverage? Is this variant reliably covered?


