National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Variant Calling Workshop.

Overview
There will be two parts to the workshop:
Variant calling analysis (on the cluster)
Visualization (on the desktop) using IGV
Commands (what you will type) are preceded by ‘$’; the lines after a command are its output:
$ mkdir foo
$ cd foo
$ ls -la
total 96
drwxrwxr-x 2 cjfields cjfields Jun 23 22:51 .
drwxr-x cjfields cjfields Jun 23 22:51 ..

Prelude : Variant Calling Setup
1. Log into the cluster using your classroom account.
2. Create a work folder (I call mine ‘mayo_test’):
$ mkdir mayo_test
$ cd mayo_test
$ ll
total 0

Part Ia : Variant Calling Setup
3. Link in all scripts from the main work folder to this directory:
$ ln -s /home/mirrors/gatk_bundle/mayo_workshop/*.sh .
$ ls
annotate_snpeff.sh call_variants_ug.sh hard_filtering.sh post_annotate.sh

Part Ia : Variant Calling Setup
Data for this workshop is from the 1000 Genomes project and is WGS at 60x coverage.
The initial part of the GATK pipeline (alignment, local realignment, base quality score recalibration) has been done, and the BAM file has been reduced to a portion of human chromosome 20.
Otherwise, we would not even finish the alignment within the next few days, let alone the other steps.

Part Ia : Variant Calling
Start the variant calling job. Check the status of the job using ‘qstat’:
$ qsub call_variants_ug.sh
biocluster.igb.illinois.edu
$ qstat -u
biocluster.igb.illinois.edu:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
biocluste cjfields default call_variants_ug gb -- R 00:01

Part Ia : Variant Calling
Discussion: what did we just do?
We ran the GATK UnifiedGenotyper to call variants.
Show the script…

Part Ia : Variant Calling
Job done yet? Should only be a few minutes…
What do the data look like? (anyone here use UNIX?)
$ qstat -u
$ ll *vcf*
-rw-rw-r-- 1 cjfields cjfields Jun 23 23:10 raw_indels.vcf
-rw-rw-r-- 1 cjfields cjfields 2829 Jun 23 23:10 raw_indels.vcf.idx
-rw-rw-r-- 1 cjfields cjfields Jun 23 23:08 raw_snps.vcf
-rw-rw-r-- 1 cjfields cjfields Jun 23 23:08 raw_snps.vcf.idx
$ tail -n 2 raw_indels.vcf
rs CAGAC AC=1;AF=0.500;AN=2;BaseQRankSum=3.130;DB;DP=75;FS=0.936;MLEAC=1;MLEAF=0.500;MQ=57.75;MQ0=0;MQRankSum=0.407;QD=5.80;ReadPosRankSum=0.371 GT:AD:DP:GQ:PL 0/1:44,26:75:99:1343,0,
rs GTG AC=1;AF=0.500;AN=2;BaseQRankSum=3.814;DB;DP=83;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=57.12;MQ0=0;MQRankSum=-1.411;QD=18.11;ReadPosRankSum=1.387 GT:AD:DP:GQ:PL 0/1:33,36:76:99:1540,0,1253
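The INFO column of a VCF record packs many annotations into one semicolon-separated string, which is hard to read in raw form. A minimal sketch of splitting it into one annotation per line, using the INFO string of the first indel record shown in the tail output; the variable names `info` and `dp` are only for illustration:

```shell
# INFO string copied from the first raw_indels.vcf record in the tail output
info='AC=1;AF=0.500;AN=2;BaseQRankSum=3.130;DB;DP=75;FS=0.936;MLEAC=1;MLEAF=0.500;MQ=57.75;MQ0=0;MQRankSum=0.407;QD=5.80;ReadPosRankSum=0.371'

# Print one annotation per line (flags such as DB have no "=value" part)
printf '%s\n' "$info" | tr ';' '\n'

# Pull out a single annotation by key, here the depth (DP)
dp=$(printf '%s\n' "$info" | tr ';' '\n' | awk -F'=' '$1 == "DP" {print $2}')
echo "DP is $dp"
```

The same `tr` pipeline works on column 8 of any record, for example after selecting it with `cut -f8`.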

Part Ia : Variant Calling
How many SNPs and Indels were called? Any found in dbSNP?
$ grep -c -v '^#' raw_snps.vcf
$ grep -c -v '^#' raw_indels.vcf
1070
$ grep -c 'rs[0-9]*' raw_snps.vcf
$ grep -c 'rs[0-9]*' raw_indels.vcf
1019
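One caveat on the dbSNP count: the pattern 'rs[0-9]*' allows zero digits, so it matches any line containing the letters "rs", including header lines. A stricter sketch checks only the ID column (column 3) of the variant records. The example file and the function names `count_records` and `count_dbsnp` are made up for illustration and are not part of the workshop scripts:

```shell
# Tiny made-up VCF so the sketch runs anywhere (not workshop data)
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
vcf="$tmpdir/example.vcf"
{
  printf '##fileformat=VCFv4.1\n'
  printf '##source=parsers\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '20\t1000\trs111\tA\tG\t50\t.\tDP=10\n'
  printf '20\t2000\t.\tC\tT\t60\t.\tDP=12\n'
  printf '20\t3000\trs222\tG\tA\t70\t.\tDP=15\n'
} > "$vcf"

# Count variant records (every non-header line)
count_records() {
  grep -c -v '^#' "$1"
}

# Count records whose ID column starts with "rs" followed by at least one digit
count_dbsnp() {
  awk -F'\t' '!/^#/ && $3 ~ /^rs[0-9]+/ {n++} END {print n + 0}' "$1"
}

echo "records:  $(count_records "$vcf")"
echo "in dbSNP: $(count_dbsnp "$vcf")"
echo "loose grep count: $(grep -c 'rs[0-9]*' "$vcf")"
```

On this example the loose grep also counts the "##source=parsers" header line, while the column-based count does not.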

Part Ib : Hard filtering
We need to filter the variant calls.
Generally, for human data we would use variant quality score recalibration, but we have a very small set of variants, so here we use hard filtering.
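To make the idea of hard filtering concrete, here is a rough sketch in awk of what a threshold-based filter does: read annotations from the INFO column, compare them with fixed cutoffs, and write the result to the FILTER column. This is not the workshop's hard_filtering.sh, which is not shown here. The cutoffs (QD < 2.0, FS > 60.0) are assumptions based on commonly cited GATK hard-filtering recommendations for SNPs, and the filter names `LowQD` and `HighFS` and the function `hard_filter` are hypothetical:

```shell
# Tiny made-up VCF so the sketch runs anywhere (not workshop data)
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
vcf="$tmpdir/example.vcf"
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '20\t1000\t.\tA\tG\t50\t.\tDP=75;FS=0.936;QD=5.80\n'
  printf '20\t2000\t.\tC\tT\t20\t.\tDP=40;FS=2.000;QD=1.50\n'
  printf '20\t3000\t.\tG\tA\t90\t.\tDP=60;FS=75.000;QD=10.00\n'
} > "$vcf"

# Label each record PASS or with the name(s) of the failed filter(s).
# Thresholds and filter names are illustrative assumptions.
hard_filter() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { print; next }               # header lines pass through unchanged
    {
      qd = ""; fs = ""
      n = split($8, kv, ";")           # INFO is column 8, entries split on ";"
      for (i = 1; i <= n; i++) {
        if (kv[i] ~ /^QD=/) qd = substr(kv[i], 4)
        if (kv[i] ~ /^FS=/) fs = substr(kv[i], 4)
      }
      fail = ""
      if (qd != "" && qd + 0 < 2.0)  fail = "LowQD"
      if (fs != "" && fs + 0 > 60.0) fail = (fail == "" ? "HighFS" : fail ";HighFS")
      $7 = (fail == "" ? "PASS" : fail)  # FILTER is column 7
      print
    }' "$1"
}

hard_filter "$vcf"
```

Note that records failing a filter are kept and flagged rather than deleted, which is worth remembering when comparing record counts before and after filtering.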

Part Ib : Hard filtering
Start the hard filtering step. This will be fast:
$ qsub hard_filtering.sh
biocluster.igb.illinois.edu
$ qstat -u
biocluster.igb.illinois.edu:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
biocluste cjfields default hard_filtering.s gb -- R --
You will have two new VCF files in a minute:
hard_filtered_snps.vcf
hard_filtered_indels.vcf

Part Ib : Hard filtering
What are we doing?
Questions:
Did we lose any variants?
How many PASS’ed the filter?
What is the difference between the filtered and raw output?
$ grep -c 'PASS' hard_filtered_snps.vcf
8270
$ grep -c 'PASS' hard_filtered_indels.vcf
1041
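To answer these questions more precisely than a plain grep, the FILTER column (column 7) can be tabulated directly. A minimal sketch on a made-up file; the function names `count_pass` and `filter_summary` and the filter name `example_filter` are hypothetical:

```shell
# Tiny made-up filtered VCF so the sketch runs anywhere (not workshop data)
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
vcf="$tmpdir/example_filtered.vcf"
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '20\t1000\t.\tA\tG\t50\tPASS\tDP=10\n'
  printf '20\t2000\t.\tC\tT\t20\texample_filter\tDP=12\n'
  printf '20\t3000\t.\tG\tA\t90\tPASS\tDP=15\n'
} > "$vcf"

# Count records whose FILTER column is exactly PASS
count_pass() {
  awk -F'\t' '!/^#/ && $7 == "PASS" {n++} END {print n + 0}' "$1"
}

# Tabulate every value seen in the FILTER column with its record count
filter_summary() {
  awk -F'\t' '!/^#/ {n[$7]++} END {for (f in n) print f "\t" n[f]}' "$1" | sort
}

echo "PASS records: $(count_pass "$vcf")"
filter_summary "$vcf"
```

Comparing the total record count of the raw and filtered files (for example with `grep -c -v '^#'`) then shows whether any variants were dropped or only flagged.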

Part Ic : Annotate the variants (SnpEff)
Run the next job, which uses SnpEff to add annotation to the VCF:
$ qsub annotate_snpeff.sh
biocluster.igb.illinois.edu
$ qstat -u
biocluster.igb.illinois.edu:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
biocluste cjfields default annotate_snpeff gb -- R --
This takes a couple of minutes…
Two new VCF files:
hard_filtered_snps_annotated.vcf
hard_filtered_indels_annotated.vcf

Part Ic : Annotate the variants (SnpEff)
SnpEff adds information about where the variants are in relation to specific genes.
The IDs for the human assembly version we use are from Ensembl (ENSGXXXXXXXXXXX).
The Ensembl ID for FOXA2 is ENSG

Part Ic : Annotate the variants (SnpEff)
The Ensembl ID for FOXA2 is ENSG
Are there any variants called for FOXA2?
SnpEff also creates some additional output files; we’ll see those in a bit.
$ grep -c 'ENSG ' hard_filtered_snps_annotated.vcf
3
$ grep -c 'ENSG ' hard_filtered_indels_annotated.vcf
0
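Beyond counting, it is often useful to list which records mention a gene. A minimal sketch that prints chromosome, position, REF and ALT for every record whose INFO column contains a given gene ID. The function name `variants_for_gene`, the IDs `GENE_ID_A` and `GENE_ID_B`, and the simplified `GENE=` annotation are placeholders made up for illustration; they are not real Ensembl IDs and not the actual SnpEff annotation layout:

```shell
# Tiny made-up annotated VCF so the sketch runs anywhere (not workshop data)
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
vcf="$tmpdir/example_annotated.vcf"
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '20\t1000\t.\tA\tG\t50\tPASS\tDP=10;GENE=GENE_ID_A\n'
  printf '20\t2000\t.\tC\tT\t60\tPASS\tDP=12;GENE=GENE_ID_B\n'
  printf '20\t3000\t.\tG\tA\t70\tPASS\tDP=15;GENE=GENE_ID_A\n'
} > "$vcf"

# $1 = gene ID to look for, $2 = annotated VCF file
# Prints CHROM, POS, REF, ALT for records whose INFO column contains the ID
variants_for_gene() {
  awk -F'\t' -v gene="$1" \
    '!/^#/ && index($8, gene) {print $1 "\t" $2 "\t" $4 "\t" $5}' "$2"
}

variants_for_gene GENE_ID_A "$vcf"
```

With the workshop files, the first argument would be the Ensembl gene ID and the second one of the annotated VCF files.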

Part Id : GATK VariantAnnotator
SnpEff adds a lot of information to the VCF.
GATK VariantAnnotator helps remove a lot of the extraneous information.

Part Id : GATK VariantAnnotator
The last step:
$ qsub post_annotate.sh
biocluster.igb.illinois.edu
$ qstat -u
biocluster.igb.illinois.edu:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
biocluste cjfields default post_annotate.sh gb -- R 00:01
This may take about 5-10 minutes.

While this is going on…
Let’s start a little tutorial on the Integrative Genomics Viewer (IGV, also from Broad).

Prelude to Part II
We need to download the results from your user folders to the local desktop.
We’ll use FileZilla for this.

FileZilla

Transfer folder to the desktop

Part II : Viewing Results in IGV
Open IGV.
Switch genome to ‘Human (b37)’.

Part II : Viewing Results in IGV
Load the VCF files marked ‘final’.
Click on the ‘20’ (chr 20).

Part II : Viewing Results in IGV
Click and drag from the ‘20 mb’ mark to roughly the centromeric region on the chromosome (~26 mb).

Part II : Viewing Results in IGV
Should look something like this:

Part II : Viewing Results in IGV
Right click on the track to bring up a menu, and then select ‘Set Feature Visibility Window’.
Set to ‘ ’ (10 million, or 10 mb).

Part II : Viewing Results in IGV
In the search box, enter ‘FOXA2’.
The browser jumps to that gene.

Part II : Viewing Results in IGV
How many SNPs are here?
How many indels?
How many SNPs are ‘hets’?
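The answers read off the IGV display can be cross-checked on the command line by tallying genotypes within a position range. A minimal sketch, assuming a single-sample VCF in which the genotype (GT) is the first sub-field of the sample column (column 10). The function name `genotypes_in_region`, the example file and the coordinates are made up for illustration:

```shell
# Tiny made-up single-sample VCF so the sketch runs anywhere (not workshop data)
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
vcf="$tmpdir/example_sample.vcf"
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n'
  printf '20\t1000\t.\tA\tG\t50\tPASS\tDP=10\tGT:DP\t0/1:10\n'
  printf '20\t2000\t.\tC\tT\t60\tPASS\tDP=12\tGT:DP\t1/1:12\n'
  printf '20\t3000\t.\tG\tA\t70\tPASS\tDP=15\tGT:DP\t0/1:15\n'
  printf '20\t9000\t.\tT\tC\t80\tPASS\tDP=20\tGT:DP\t0/1:20\n'
} > "$vcf"

# $1 = VCF file, $2 = first position, $3 = last position (1-based, inclusive)
# Prints each genotype seen in the range with its record count
genotypes_in_region() {
  awk -F'\t' -v lo="$2" -v hi="$3" '
    !/^#/ && $2 + 0 >= lo + 0 && $2 + 0 <= hi + 0 {
      split($10, sample, ":")        # GT is the first sub-field of the sample column
      n[sample[1]]++
    }
    END { for (g in n) print g "\t" n[g] }' "$1" | sort
}

genotypes_in_region "$vcf" 1000 3000
```

Heterozygous calls appear as 0/1 in this tally and homozygous-variant calls as 1/1. Running it separately on the SNP and indel files gives the two counts asked for.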

Part II : Viewing Results in IGV
Now we’ll load in a ‘small’ BAM file. This is the same BAM file you analyzed on the cluster.
In the ‘gatk_bundle’ folder there is a BAM file named ‘NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20_arm1.bam’. Load that file.
What is happening here?

Part II : Viewing Results in IGV
Right click on the BAM track and select ‘Show coverage track’.
Note the colors in the coverage track.

Part II : Viewing Results in IGV
Zoom in to regions with SNPs.
Mouse over the SNP regions in the VCF tracks to get more information (including annotation from SnpEff and VariantAnnotator).

Browse around…
How are filtered SNPs displayed?
How about low-quality bases?
Can you change the display to…
Sort BAM reads by strand?

Part III : SnpEff Results
SnpEff gives a nice summary HTML file that should have been downloaded with your results (one for SNP, one for indel results).
Open those up in a browser.

Part II : Viewing Results in IGV
This gives us raw reads, but we need to calculate coverage.
Let’s use igvtools. Set the command to ‘Count’, then select the input file (the BAM file). Hit ‘Run’.