Presentation is loading. Please wait.

Presentation is loading. Please wait.

Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Similar presentations


Presentation on theme: "Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &"— Presentation transcript:

1 Steve Newhouse 28 Jan 2011

2  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code & theory  Based on the 1KG pipeline from the Broad Institute using their Genome Analysis Tool Kit (GATK).  Focus on Illumina paired-end sequence data  Alignment with BWA or Novoalign  SNP & Indel calling with GATK  NB: This is one way processing the data that works well

3  BRC Cluster Software : http://compbio.brc.iop.kcl.ac.uk/cluster/software.phphttp://compbio.brc.iop.kcl.ac.uk/cluster/software.php  Maq: http://maq.sourceforge.net/http://maq.sourceforge.net/  Fastqc : http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/  Fastx: http://hannonlab.cshl.edu/fastx_toolkit/http://hannonlab.cshl.edu/fastx_toolkit/  cmpfastq.pl : http://compbio.brc.iop.kcl.ac.uk/software/cmpfastq.phphttp://compbio.brc.iop.kcl.ac.uk/software/cmpfastq.php  BWA: http://bio-bwa.sourceforge.net/bwa.shtmlhttp://bio-bwa.sourceforge.net/bwa.shtml  Novoalign: http://www.novocraft.comhttp://www.novocraft.com  Genome Analysis Toolkit: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit  PICARD TOOLS: http://picard.sourceforge.net/http://picard.sourceforge.net/  SAMTOOLS: http://samtools.sourceforge.net/http://samtools.sourceforge.net/  VCFTOOLS: http://vcftools.sourceforge.net/http://vcftools.sourceforge.net/  FASTQ Files : http://en.wikipedia.org/wiki/FASTQ_format,http://en.wikipedia.org/wiki/FASTQ_format  SAM/BAM Format : http://samtools.sourceforge.net/SAM1.pdfhttp://samtools.sourceforge.net/SAM1.pdf  PHRED Scores: http://en.wikipedia.org/wiki/Phred_quality_scorehttp://en.wikipedia.org/wiki/Phred_quality_score  Next Generation Sequencing Library: http://ngslib.genome.tugraz.athttp://ngslib.genome.tugraz.at  http://seqanswers.com http://seqanswers.com  http://www.broadinstitute.org/gsa/wiki/index.php/File:Ngs_tutorial_depristo_1210. pdf http://www.broadinstitute.org/gsa/wiki/index.php/File:Ngs_tutorial_depristo_1210. pdf

4  Convert Illumina Fastq to sanger Fastq  QC raw data  Mapping (BWA, QC-BWA, Novoalign)  Convert Sequence Alignment/Map (SAM) to BAM  Local realignment around Indels  Remove duplicates  Base Quality Score Recalibration  Variant Discovery

5 Illumina Raw fastq Convert Illumina Fastq to sanger Fastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs

6  Fastq Format :*_sequence.txt; ◦ s_1_1_sequence.txt = lane 1, read 1 ◦ s_1_2_sequence.txt = lane 1, read 2  Text file storing both nucleotide sequence and quality scores.  Both the sequence letter and quality score are encoded with a single ASCII character for brevity.  Standard for storing the output of high throughput sequencing instruments such as the Illumina Genome Analyzer  http://en.wikipedia.org/wiki/FASTQ_format

7  Raw Data :- @315ARAAXX090414:8:1:567:552#0 TGTTTCTTTAAAAAGGTAAGAATGTTGTTGCTGGGCTTAGAAATATGAATAACCATATGCCAGATAGATAGATGGA + ; ====<<<;;;:::::99999988887766655554443333222211111000//  @315ARAAXX090414: the unique instrument name  8: flowcell lane  1: tile number within the flowcell lane  567: 'x'-coordinate of the cluster within the tile  552: 'y'-coordinate of the cluster within the tile  # :index number for a multiplexed sample (0 for no indexing)  0 :the member of a pair, /1 or /2 (paired-end or mate-pair reads only)  http://en.wikipedia.org/wiki/FASTQ_format

8 Illumina Raw fastq Convert Illumina Fastq to sanger Fastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs Convert Illumina Fastq to sanger Fastq QC raw data

9  Convert Illumina Fastq to sanger Fastq $: maq ill2sanger s_1_1_sequence.txt foo.1.sanger.fastq $: maq ill2sanger s_1_2_sequence.txt foo.2.sanger.fastq

10  FastQC: Provides a simple way to do some quality control checks on raw sequence data. ◦ Quick impression of whether the data has problems. ◦ Import of data from BAM, SAM or FastQ ◦ Summary graphs and tables to quickly assess your data ◦ Export of results to an HTML based permanent report ◦ Offline operation to allow automated generation of reports without running the interactive application $: fastqc foo.1.sanger.fastq; $: fastqc foo.2.sanger.fastq; http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

11

12 Illumina Raw fastq Convert Illumina Fastq to sanger Fastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM

13  Available genomes ◦ Homo_sapiens_assembly18.fasta ◦ human_b36_both.fasta ◦ human_g1k_v37.fasta (1000 genomes)  Indexed for use with BWA or Novoalign  Location: /scratch/data/reference_genomes/human  Human reference sequences and dbSNP reference metadata are available in a tarball: ◦ ftp://ftp.broadinstitute.org/pub/gsa/gatk_resources.tgz ftp://ftp.broadinstitute.org/pub/gsa/gatk_resources.tgz

14 ## Align with BWA $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: bwa aln -t 8 $REF foo.1.sanger.fastq > foo.1.sai; $: bwa aln -t 8 $REF foo.2.sanger.fastq > foo.2.sai; ## Generate alignment in the SAM format $: bwa sampe $REF foo.1.sai foo.2.sai foo.1.sanger.fastq foo.2.sanger.fastq > foo.bwa.raw.sam; ## Sort bwa SAM file using PICARD TOOLS SortSam.jar - this will also produce the BAM file $: java -jar SortSam.jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT \ INPUT= foo.bwa.raw.sam OUTPUT= foo.bwa.raw.bam; ## samtools index $: samtools index foo.novo.raw.bam;  Use option -q15 if the quality is poor at the 3' end of reads  http://bio-bwa.sourceforge.net/bwa.shtml http://bio-bwa.sourceforge.net/bwa.shtml

15  Fastx: http://hannonlab.cshl.edu/fastx_toolkithttp://hannonlab.cshl.edu/fastx_toolkit ◦ QC filter raw sequence data ◦ always use -Q 33 for sanger phred scaled data (-Q 64) $: cat foo.1.sanger.fastq | \ fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT | \ fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT | \ fastx_clipper -Q 33 -l 20 -v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC | \ fastx_clipper -Q 33 -l 20 -v –a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC | \ fastq_quality_trimmer -Q 33 -t 20 -l 20 -v | \ fastx_artifacts_filter -Q 33 -v | \ fastq_quality_filter -Q 33 -q 20 -p 50 -v -o foo.1.sanger.qc.fastq; $: cat foo.2.sanger.fastq | \ fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT | \ fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT | \ fastx_clipper -Q 33 -l 20 –v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC | \ fastx_clipper -Q 33 -l 20 -v –a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC | \ fastq_quality_trimmer -Q 33 -t 20 -l 20 -v | \ fastx_artifacts_filter -Q 33 -v | \ fastq_quality_filter -Q 33 -q 20 -p 50 -v -o foo.2.sanger.qc.fastq;#

16  Compare QCd fastq files ◦ One end of each read could be filtered out in QC ◦ BWA cant deal with mixed PE & SE data ◦ Need to id reads that are still paired after QC ◦ Need to id reads that are no longer paired after QC $: perl cmpfastq.pl foo.1.sanger.qc.fastq foo.2.sanger.qc.fastq  Reads matched on presence/absence of id's in each file : ◦ foo.1.sanger.qc.fastq : @315ARAAXX090414:8:1:567:552#1 ◦ foo.2.sanger.qc.fastq : @315ARAAXX090414:8:1:567:552#2  Output: 2 files for each *QC.fastq file ◦ foo.1.sanger.qc.fastq -common-out (reads in foo.1.* == reads in foo.2.*) ◦ foo.1.sanger.qc.fastq -unique-out (reads in foo.1.* not in foo.2.*) ◦ foo.2.sanger.qc.fastq -commont-out ◦ foo.2.sanger.qc.fastq -unique-out http://compbio.brc.iop.kcl.ac.uk/software/cmpfastq.php

17  Align with BWA $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: bwa aln -t 8 $REF foo.1.sanger.qc.fastq-common-out > foo.1.common.sai; $: bwa aln -t 8 $REF foo.2.sanger.qc.fastq-common-out > foo.2.common.sai; $: bwa aln -t 8 $REF foo.1.sanger.qc.fastq-unique-out > foo.1.unique.sai; $: bwa aln -t 8 $REF foo.1.sanger.qc.fastq-unique-ou > foo.2.unique.sai;  Multi threading option : -t N  http://bio-bwa.sourceforge.net/bwa.shtml http://bio-bwa.sourceforge.net/bwa.shtml

18  Generate alignments in the SAM/BAM format $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta ## bwa sampe for *common* $: bwa sampe $REF foo.1.common.sai foo.2.common.sai foo.1.sanger.qc.fastq-common-out foo.2.sanger.qc.fastq-common-out > foo.common.sam; ## bwa samse for *unique* $: bwa samse $REF foo.1.unique.sai foo.1.sanger.qc.fastq-unique-out > foo.1.unique.sam; $: bwa samse $REF foo.2.unique.sai foo.2.sanger.qc.fastq-unique-out > foo.2.unique.sam; ## merge SAM files using PICARD TOOLS MergeSamFiles.jar - this will also sort the BAM file $: java -jar MergeSamFiles.jar INPUT=foo.common.sam INPUT=foo.1.unique.sam INPUT=foo.2.unique.sam ASSUME_SORTED=false VALIDATION_STRINGENCY=SILENT OUTPUT=foo.bwa.raw.bam; ## samtools index samtools index foo.bwa.raw.bam;  Details SAM/BAM Format : http://samtools.sourceforge.net/SAM1.pdf

19  Has options for adaptor stripping and quality filters – and much more  More accurate than BWA but slower unless running MPI version  $1,990/year for full set of tools – worth it! $: REF=/scratch/data/reference_genomes/human/human_g1k_v37 $: novoalign -d $REF -F STDFQ -f foo.1.sanger.fastq foo.2.sanger.fastq \ -a GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG ACACTCTTTCCCTACACGACGCTCTTCCGATCT \ -r Random -i PE 200,50 -c 8 --3Prime -p 7,10 0.3,10 -k -K foo.novo.test \ -o SAM $'@RG\tID:foo\tPL:Illumina\tPU:Illumina\tLB:tumour\tSM:foo' \ > foo.novo.stats > foo.novo.raw.sam;  http://www.novocraft.com http://www.novocraft.com

20  Novoalign produces a name sorted SAM file which needs to be co-ordinate sorted for downstream processing ## Sort novo SAM file using PICARD TOOLS SortSam.jar - this will also produce the BAM file $: java -jar SortSam.jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT \ INPUT= foo.novo.raw.sam OUTPUT= foo.novo.raw.bam; ## samtools index $: samtools index foo.novo.raw.bam;

21  Local realignment around Indels  Remove duplicate reads  Base Quality Score Recalibration  GATK: http://www.broadinstitute.org/gsa/wiki/index.php/T he_Genome_Analysis_Toolkit http://www.broadinstitute.org/gsa/wiki/index.php/T he_Genome_Analysis_Toolkit  PICARD TOOLS: http://picard.sourceforge.nethttp://picard.sourceforge.net  SAMTOOLS: http://samtools.sourceforge.nethttp://samtools.sourceforge.net  Many other quality stats/options for processing files using these tools : see web documentation

22 Illumina Raw fastq Convert Illumina Fastq to sanger Fastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs Local realignment around Indels

23  Sequence aligners are unable to perfectly map reads containing insertions or deletions ◦ Alignment artefacts ◦ False positives SNPs  Steps to the realignment process: ◦ Step 1: Determining (small) suspicious intervals which are likely in need of realignment ◦ Step 2: Running the realigner over those intervals ◦ Step 3: Fixing the mate pairs of realigned reads http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels

24 Original BAM file forRealigner.intervals (interval file) Realigned BAM file RealignerTargetCreator (GATK) IndelRealigner (GATK) Co-ordinate sorted Realigned BAM file SortSam (PICARD) Co-ordinate sorted Realigned fixed BAM file FixMateInformation (PICARD)

25 $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod $: TMPDIR=~/scratch/tmp $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar ## 1. Creating Intervals : RealignerTargetCreator $: java –Xmx5g -jar $GATK -T RealignerTargetCreator -R $REF -D $ROD \ -I foo.novo.bam -o foo.novo.bam.forRealigner.intervals; ## 2. Realigning : IndelRealigner $: java -Djava.io.tmpdir=$TMPDIR –Xmx5g -jar $GATK -T IndelRealigner \ -R $REF -D $ROD -I foo.novo.bam -o foo.novo.realn.bam \ -targetIntervals foo.novo.bam.forRealigner.intervals; ## samtools index $: samtools index foo.novo.realn.bam;

26 ## 3. Sort realigned BAM file using PICARD TOOLS SortSam.jar ## GATK IndelRealigner produces a name sorted BAM $: java –Xmx5g -jar SortSam.jar \ INPUT= foo.novo.realn.bam OUTPUT=foo.novo.realn.sorted.bam \ SORT_ORDER=coordinate TMP_DIR=$TMPDIR VALIDATION_STRINGENCY=SILENT; ## samtools index $: samtools index foo.novo.realn.soretd.bam; ## 4. Fixing the mate pairs of realigned reads using Picard tools FixMateInformation.jar $: java -Djava.io.tmpdir=$TMPDIR -jar -Xmx6g FixMateInformation.jar \ INPUT= foo.novo.realn.sorted.bam OUTPUT= foo.novo.realn.sorted.fixed.bam \ SO=coordinate VALIDATION_STRINGENCY=SILENT TMP_DIR=$TMPDIR; ## samtools index samtools index foo.novo.realn.sorted.fixed.bam ;

27 Illumina Raw fastq Convert Illumina Fastq to sanger Fastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs Remove duplicates

28  Examine aligned records in the supplied SAM or BAM file to locate duplicate molecules and remove them $: TMPDIR=~/scratch/tmp ## Remove duplicate reads with Picard tools MarkDuplicates.jar $: java -Xmx6g –jar MarkDuplicates.jar \ INPUT= foo.novo.realn.sorted.fixed.bam \ OUTPUT= foo.novo.realn.duperemoved.bam \ METRICS_FILE=foo.novo.realn.Duplicate.metrics.file \ REMOVE_DUPLICATES=true \ ASSUME_SORTED=false TMP_DIR=$TMPDIR \ VALIDATION_STRINGENCY=SILENT; ## samtools index samtools index foo.novo.realn.duperemoved.bam;

29 Illumina Raw fastq Convert Illumina Fastq to sanger Fastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs Base Quality Score Recalibration Analysis-ready reads

30  Correct for variation in quality with machine cycle and sequence context  Recalibrated quality scores are more accurate  Closer to the actual probability of mismatching the reference genome  Done by analysing the covariation among several features of a base ◦ Reported quality score ◦ The position within the read ◦ The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine ◦ Probability of mismatching the reference genome ◦ Known SNPs taken into account (dbSNP 131)  Covariates are then used to recalibrate the quality scores of all reads in a BAM file http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

31 Co-ordinate sorted Realigned fixed BAM file Covariates table (.csv) Final Recalibrated BAM file CountCovariates TableRecalibraion Recalibrated covariates table (.csv) Recalibrated covariates table (.csv) CountCovariates AnalyzeCovariates Post-recalibration analysis plots Post-recalibration analysis plots Pre-recalibration analysis plots Pre-recalibration analysis plots AnalyzeCovariates

32 ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## 1. GATK CountCovariates java -Xmx8g -jar $GATK -T CountCovariates -R $REF --DBSNP $ROD \ -I foo.novo.realn.duperemoved.bam \ -recalFile foo.novo.realn.duperemoved.bam.recal_data.csv \ --default_platform Illumina \ -cov ReadGroupCovariate \ -cov QualityScoreCovariate \ -cov CycleCovariate \ -cov DinucCovariate \ -cov TileCovariate \ -cov HomopolymerCovariate \ -nback 5; http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

33 ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: GATKacov=/share/apps/gatk_20100930/Sting/dist/AnalyzeCovariates.jar $: GATKR=/share/apps/gatk_20100930/Sting/R $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod $: Rbin=/share/apps/R_current/bin/Rscript ## 2. GATK AnalyzeCovariates java -Xmx5g –jar $GATKacov \ -recalFile foo.novo.realn.duperemoved.bam.recal_data.csv \ -outputDir foo.novo.realn.duperemoved.bam.recal.plots \ -resources $GATKR \ -Rscript $Rbin; http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

34  Generate the final analysis ready BAM file for Variant Discovery and Genotyping ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## 3. GATK TableRecalibration $: java –Xmx6g -jar $GATK -T TableRecalibration -R $REF \ -I foo.novo.realn.duperemoved.bam \ --out foo.novo.final.bam \ -recalFile foo.novo.realn.duperemoved.bam.recal_data.csv \ --default_platform Illumina; ##samtools index $: samtools index foo.novo.final.bam; http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

35 Illumina Raw fastq Convert Illumina Fastq to sanger Fastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs SNP & Indel calling with GATK

36 Final Recalibrated BAM file IndelGenotyperV2 gatk.raw.indels.verbose.output.bed gatk.raw.indels.bed gatk.raw.indels.detailed.output.vcf gatk.indels.filtered.bed filterSingleSampleCalls.pl gatk.filtered.indels.simple.bed... chr1 556817 556817 +G:3/7 chr1 3535035 3535054 -TTCTGGGAGCTCCTCCCCC:9/21 chr1 3778838 3778838 +A:15/48... gatk.filtered.indels.simple.bed... chr1 556817 556817 +G:3/7 chr1 3535035 3535054 -TTCTGGGAGCTCCTCCCCC:9/21 chr1 3778838 3778838 +A:15/48... UnifiedGenotyper makeIndelMask.py gatk.raw.snps.vcf VariantFiltration indels.mask.bed gatk.filtered.snps.vcf

37 ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: GATKPERL=/share/apps/gatk_20100930/Sting/perl $: GATKPYTHON=/share/apps/gatk_20100930/Sting/python $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## Generate raw indel calls $: java -Xmx6g -jar $GATK -T IndelGenotyperV2 -R $REF --DBSNP $ROD \ -I foo.novo.final.bam \ -bed foo.gatk.raw.indels.bed \ -o foo.gatk.raw.indels.detailed.output.vcf \ --metrics_file foo.gatk.raw.indels.metrics.file \ -verbose foo.gatk.raw.indels.verbose.output.bed \ -minCoverage 8 -S SILENT –mnr 1000000; ## Filter raw indels $: perl $GATKPERL/filterSingleSampleCalls.pl --calls foo.gatk.raw.indels.verbose.output.bed \ --max_cons_av_mm 3.0 --max_cons_nqs_av_mm 0.5 --mode ANNOTATE > foo.gatk.filtered.indels.bed http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0

38  The output of the IndelGenotyper is used to mask out SNPs near indels.  “makeIndelMask.py” creates a bed file representing the masking intervals based on the output of IndelGenotyper. $: GATKPYTHON=/share/apps/gatk_20100930/Sting/python ## Create indel mask file $: python $GATKPYTHON/makeIndelMask.py foo.gatk.raw.indels.bed 5 indels.mask.bed  The number in this command stands for the number of bases that will be included on either side of the indel. http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0

39 ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## SNP Calling $: java -Xmx5g -jar $GATK -T UnifiedGenotyper -R $REF -D $ROD \ -baq CALCULATE_AS_NECESSARY -baqGOP 30 -nt 8 \ -A DepthOfCoverage -A AlleleBalance -A HaplotypeScore -A HomopolymerRun -A MappingQualityZero -A QualByDepth -A RMSMappingQuality -A SpanningDeletions \ -I foo.novo.final.bam -o foo.gatk.raw.snps.vcf \ -verbose foo.gatk.raw.snps.vcf.verbose -metrics foo.gatk.raw.snps.vcf.metrics;  This results in a VCF (variant call format) file containing raw SNPs. ◦ VCF is a text file format. It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome (SNP/INDEL calls). http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper

40  VariantFiltration is used annotate suspicious calls from VCF files based on their failing given filters.  It annotates the FILTER field of VCF files for records that fail any one of several filters: ◦ Variants that lie in clusters, using the specified values to define a cluster (i.e. the number of variants and the window size). ◦ Any variant which overlaps entries from a masking rod. ◦ Any variant whose INFO field annotations match a specified expression (i.e. the expression is used to describe records which should be filtered out). ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## VariantFiltration & annotation $: java –Xmx5g -jar $GATK -T VariantFiltration -R $REF -D $ROD \ -o foo.gatk.VariantFiltration.snps.vcf \ -B variant,VCF, foo.gatk.raw.snps.vcf \ -B mask,Bed, indels.mask.bed --maskName InDel \ --clusterSize 3 --clusterWindowSize 10 \ --filterExpression "DP <= 8" --filterName "DP8" \ --filterExpression "SB > -0.10" --filterName "StrandBias" \ --filterExpression "HRun > 8" --filterName "HRun8" \ --filterExpression "QD < 5.0" --filterName "QD5" \ --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" --filterName "hard2validate";  More information on the parameters used can be found in: http://www.broadinstitute.org/gsa/wiki/index.php/VariantFiltrationWalker http://www.broadinstitute.org/gsa/wiki/index.php/Using_JEXL_expressions http://www.broadinstitute.org/gsa/wiki/index.php/VariantFiltrationWalker http://www.broadinstitute.org/gsa/wiki/index.php/Using_JEXL_expressions

41  VCFTOOLS: methods for working with VCF files: filtering,validating, merging, comparing and calculate some basic population genetic statistics.  Example of some basic hard filtering: ## filter poor quality & suspicious SNP calls vcftools --vcf foo.gatk.VariantFiltration.snps.vcf \ --remove-filtered DP8 --remove-filtered StrandBias --remove-filtered LowQual \ --remove-filtered hard2validate --remove-filtered SnpCluster \ --keep-INFO AC --keep-INFO AF --keep-INFO AN --keep-INFO DB \ --keep-INFO DP --keep-INFO DS --keep-INFO Dels --keep-INFO HRun --keep-INFO HaplotypeScore -- keep-INFO MQ --keep-INFO MQ0 --keep-INFO QD --keep-INFO SB --out foo.gatk.good.snps ;  this produces the file " foo.gatk.good.snps.recode.vcf"

42  VCFTOOLS can be used to generate useful QC measures, PLINK ped/map, Impute input and more.... ## QC & info $: for MYQC in missing freq2 counts2 depth site-depth site-mean-depth geno- depth het hardy singletons;do vcftools --vcf foo.gatk.good.snps.recode.vcf --$MYQC \ --out foo.gatk.good.snps.QC; done ## write genotypes, genotype qualities and genotype depth to separate files $: for MYFORMAT in GT GQ DP;do vcftools --vcf foo.gatk.good.snps.recode.vcf \ --extract-FORMAT-info $MYFORMAT --out foo.gatk.good.snps; done ## make PLINK ped and map files vcftools --vcf foo.gatk.good.snps.recode.vcf --plink --out foo.gatk.good.snps http://vcftools.sourceforge.net/

43  See : http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_fennell.pdfhttp://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_fennell.pdf

44  Email : stephen.newhouse@kcl.ac.ukstephen.newhouse@kcl.ac.uk  More useful links: ◦ http://www.broadinstitute.org/gsa/wiki/index.php/Prer equisites http://www.broadinstitute.org/gsa/wiki/index.php/Prer equisites ◦ http://www.broadinstitute.org/gsa/wiki/index.php/Buil ding_the_GATK http://www.broadinstitute.org/gsa/wiki/index.php/Buil ding_the_GATK ◦ http://www.broadinstitute.org/gsa/wiki/index.php/Dow nloading_the_GATK http://www.broadinstitute.org/gsa/wiki/index.php/Dow nloading_the_GATK ◦ http://www.broadinstitute.org/gsa/wiki/index.php/Inpu t_files_for_the_GATK http://www.broadinstitute.org/gsa/wiki/index.php/Inpu t_files_for_the_GATK ◦ http://www.broadinstitute.org/gsa/wiki/index.php/Prep aring_the_essential_GATK_input_files:_the_reference_gen ome http://www.broadinstitute.org/gsa/wiki/index.php/Prep aring_the_essential_GATK_input_files:_the_reference_gen ome ◦ http://www.broadinstitute.org/gsa/wiki/index.php/The_ DBSNP_rod http://www.broadinstitute.org/gsa/wiki/index.php/The_ DBSNP_rod


Download ppt "Steve Newhouse 28 Jan 2011.  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &"

Similar presentations


Ads by Google