3 What is DNAseq? DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.
4 Why DNAseq? Whole genome sequencing: whole-genome SNV detection; structural variants; captures regulatory region information; cancer analysis; de novo genome assembly. Whole exome sequencing: cheaper; captures coding region information; rare disease analysis.
5 What is the DNAseq problem about? Strings of 100 to ≈1,000 letters. A puzzle of 3,000,000,000 letters. Usually you have 120,000,000,000 letters that you need to fit. Many pieces don't fit: sequencing errors, SNPs, structural variants. Many pieces fit in many places: low-complexity regions, microsatellites, repeats.
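The numbers on this slide imply an average depth of coverage, which can be worked out directly. The sketch below uses only the two figures given above; the 100-letter read length is an assumption taken from the lower bound of the stated range.

```python
# Figures taken from the slide.
genome_size = 3_000_000_000        # letters in the puzzle (the genome)
total_bases = 120_000_000_000      # letters you need to fit (all reads)

# Average depth of coverage: how many times each position is covered on average.
coverage = total_bases / genome_size

# Assumption: every read is 100 letters long (lower bound of the stated range).
read_length = 100
n_reads = total_bases // read_length

print(f"average coverage: {coverage:.0f}x")
print(f"number of {read_length}-letter reads: {n_reads:,}")
```

Under that assumption, the puzzle has more than a billion pieces, each covering a tiny fraction of the picture.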
16 Read Filtering Clip Illumina adapters. Trim trailing bases with quality < 30. Filter for read length ≥ 32 bp. usadellab.org
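The last two filtering steps can be illustrated with a minimal sketch. This is not the implementation of the tool hosted at usadellab.org; it assumes Phred+33 encoded quality strings, skips adapter clipping entirely, and the function names are hypothetical.

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 encoded quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_trailing(seq, quals, min_quality=30):
    """Remove bases from the 3' end while their quality is below min_quality."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


def filter_read(seq, quality_string, min_quality=30, min_length=32):
    """Trim the trailing low-quality bases, then drop reads that became too short.

    Returns the trimmed (sequence, qualities) pair, or None if the read is rejected.
    """
    quals = decode_phred33(quality_string)
    seq, quals = trim_trailing(seq, quals, min_quality)
    if len(seq) < min_length:
        return None
    return seq, quals


# Hypothetical reads: 'D' encodes quality 35, '+' encodes quality 10.
kept = filter_read("A" * 40, "D" * 36 + "+" * 4)     # 36 bases survive trimming
dropped = filter_read("A" * 40, "D" * 30 + "+" * 10)  # only 30 bases survive
print(len(kept[0]), dropped)
```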
17 Assembly vs. Mapping Mapping: reads are compared all vs. reference. Assembly: reads are compared all vs. all to build contigs (contig1).
18 RNA-seq: Assembly vs Mapping Reference-based mapping: DNA-seq reads are placed on the reference genome. De novo assembly: DNA-seq reads are joined into contigs (contig1, contig2).
19 Read Mapping The mapping problem is challenging: millions of short reads need to be mapped to a genome; the genome is a text with billions of letters; many mapping locations are possible; matching is NOT exact because of sequencing errors and biological variants (substitutions, insertions, deletions, splicing). Clever use of the Burrows-Wheeler Transform increases speed and reduces memory footprint. Mapper used: BWA. Other mappers: Bowtie, STAR, GEM, etc.
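To give an idea of how the Burrows-Wheeler Transform helps, here is a toy sketch of the transform and of backward search, which counts exact occurrences of a read in a text without scanning the text. It is not how BWA is implemented: it handles exact matches only, builds the suffix array naively, and recomputes occurrence counts on the fly, whereas real mappers use compressed, precomputed tables and allow mismatches and gaps. The function names are illustrative.

```python
def bwt(text):
    """Return the Burrows-Wheeler Transform of text (a '$' terminator is appended)."""
    text += "$"
    # Naive suffix array: sort suffix start positions by the suffix they point to.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix (wrapping around).
    return "".join(text[i - 1] for i in suffix_array)


def count_occurrences(pattern, bwt_string):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first_row_start[c] = number of characters in the text smaller than c.
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    first_row_start = {}
    total = 0
    for ch in sorted(counts):
        first_row_start[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_string[:i]; real indexes precompute this.
        return bwt_string[:i].count(ch)

    lo, hi = 0, len(bwt_string)  # current range of matching sorted suffixes
    for ch in reversed(pattern):
        if ch not in first_row_start:
            return 0
        lo = first_row_start[ch] + occ(ch, lo)
        hi = first_row_start[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


genome = "GATTACAGATTACA"  # toy "genome" for illustration
index = bwt(genome)
print(index, count_occurrences("TTA", index))
```

Each step of the search touches only the index, so the cost depends on the read length rather than on the genome length once the occurrence table is precomputed.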
20 SAM/BAM SAM: Sequence Alignment/Map format. Used to store alignments. SAM = text, BAM = binary. Sample1.bam, Sample2.bam: between 10 Gb and 500 Gb per BAM file. Example record: SRR M =NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGA… Fields: read name, flag, reference position, CIGAR, mate position, bases, base qualities.
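Because SAM is tab-separated text, a record can be split into its fields with a few lines of code. The sketch below follows the 11 mandatory columns of the SAM format; the example record is made up for illustration and is not the one shown on the slide. In practice a library such as pysam or the samtools command line would be used instead.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory fields of a SAM alignment line."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate ('=' means same reference)
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # bases
    qual: str    # base qualities (Phred+33)


def parse_sam_line(line):
    """Parse one alignment line; optional tags after column 11 are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]), fields[9], fields[10],
    )


# Hypothetical record, for illustration only.
example = "read1\t99\tchr1\t100\t60\t8M\t=\t150\t58\tACGTACGT\tIIIIIIII"
record = parse_sam_line(example)
print(record.qname, record.flag, record.rname, record.pos, record.cigar)
```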
21 Sort, View, Index, Statistics, Etc. The BAM/SAM format tools: samtools.sourceforge.net, picard.sourceforge.net. Example:
$ samtools flagstat C1.bam
in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
mapped (100.00%:nan%)
paired in sequencing
read1
read2
properly paired (85.06%:nan%)
with itself and mate mapped
singletons (3.44%:nan%)
with mate mapped to a different chr
with mate mapped to a different chr (mapQ>=5)
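The categories reported by flagstat are derived from the bitwise FLAG field of each alignment. The sketch below tallies a few of them from a list of flag values, using the bit definitions of the SAM format. It is a simplified approximation, not the samtools implementation, and the flag values in the example are hypothetical.

```python
# Bit definitions from the SAM format.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
READ1 = 0x40
READ2 = 0x80
DUPLICATE = 0x400


def tally_flags(flags):
    """Count flagstat-like categories from an iterable of SAM FLAG values."""
    stats = {
        "total": 0, "duplicates": 0, "mapped": 0, "paired": 0, "read1": 0,
        "read2": 0, "properly_paired": 0, "both_mapped": 0, "singletons": 0,
    }
    for flag in flags:
        mapped = not (flag & UNMAPPED)
        stats["total"] += 1
        stats["duplicates"] += bool(flag & DUPLICATE)
        stats["mapped"] += mapped
        if flag & PAIRED:
            mate_mapped = not (flag & MATE_UNMAPPED)
            stats["paired"] += 1
            stats["read1"] += bool(flag & READ1)
            stats["read2"] += bool(flag & READ2)
            stats["properly_paired"] += bool(mapped and (flag & PROPER_PAIR))
            stats["both_mapped"] += bool(mapped and mate_mapped)
            stats["singletons"] += bool(mapped and not mate_mapped)
    return stats


# Hypothetical flags: two proper-pair mates, a singleton, an unmapped mate,
# and a duplicate-marked read.
stats = tally_flags([99, 147, 73, 133, 1123])
print(stats)
```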
28 Single Nucleotide Variant calling Aim: differentiate real SNPs from sequencing errors. Accurate SNP discovery depends on good base quality and sufficient depth of coverage. (Figure: sequencing errors vs. SNP.)
29 SNP and genotype calling workflow Variants from multiple samples are called simultaneously using the mpileup method from samtools and quality-filtered using bcftools. Approaches: Bayesian approaches, MLE approaches. Nielsen et al., June 2011
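The difference between the MLE and Bayesian approaches can be illustrated on a single site. The sketch below uses a simplified textbook model, not the model implemented in samtools mpileup or bcftools: each base is assumed to be drawn from one of the two alleles with equal probability, and a base with Phred quality Q is wrong with probability 10^(-Q/10), spread evenly over the three other bases. The prior values are hypothetical.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score into an error probability."""
    return 10 ** (-q / 10)


def genotype_log_likelihoods(bases, quals, ref, alt):
    """Log-likelihood of the observed bases under each diploid genotype."""
    genotypes = {"0/0": (ref, ref), "0/1": (ref, alt), "1/1": (alt, alt)}
    result = {}
    for name, (allele1, allele2) in genotypes.items():
        log_lik = 0.0
        for base, q in zip(bases, quals):
            e = phred_to_error(q)
            p1 = 1 - e if base == allele1 else e / 3
            p2 = 1 - e if base == allele2 else e / 3
            log_lik += math.log(0.5 * p1 + 0.5 * p2)
        result[name] = log_lik
    return result


def posterior(log_liks, priors):
    """Bayesian step: weight each likelihood by its prior and normalise."""
    unnormalised = {g: math.exp(ll) * priors[g] for g, ll in log_liks.items()}
    total = sum(unnormalised.values())
    return {g: value / total for g, value in unnormalised.items()}


# Site 1: a single low-quality alternate base among 8 reads.
site1 = genotype_log_likelihoods("AAAAAAAG", [30] * 7 + [10], ref="A", alt="G")
# Site 2: half of the high-quality reads support the alternate base.
site2 = genotype_log_likelihoods("AAAAGGGG", [30] * 8, ref="A", alt="G")

mle_site1 = max(site1, key=site1.get)   # maximum-likelihood genotype
mle_site2 = max(site2, key=site2.get)

# Hypothetical priors favouring the homozygous reference genotype.
priors = {"0/0": 0.998, "0/1": 0.0015, "1/1": 0.0005}
post_site2 = posterior(site2, priors)
print(mle_site1, mle_site2, max(post_site2, key=post_site2.get))
```

The first site shows why base quality matters: one low-quality mismatch is better explained as a sequencing error than as a heterozygous genotype. The second shows why depth matters: several concordant high-quality bases outweigh even a strong prior for the reference genotype.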
30 The variant format: VCF (Variant Call Format) The FORMAT column defines the ":"-separated values of each sample column: GT = genotype, DP = depth, … vcftools.sourceforge.net/specs.html
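The relation between the FORMAT column and the sample columns can be shown with a short sketch. It assumes the standard VCF layout of eight fixed columns followed by FORMAT and one column per sample; the data line and sample names below are made up for illustration, and the function name is hypothetical.

```python
def parse_vcf_samples(line, sample_names):
    """Pair the keys of the FORMAT column with the values of each sample column."""
    columns = line.rstrip("\n").split("\t")
    format_keys = columns[8].split(":")   # e.g. ["GT", "DP"]
    sample_columns = columns[9:]          # one column per sample
    parsed = {}
    for name, column in zip(sample_names, sample_columns):
        parsed[name] = dict(zip(format_keys, column.split(":")))
    return parsed


# Hypothetical data line with two samples.
line = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=60\tGT:DP\t0/1:35\t0/0:25"
samples = parse_vcf_samples(line, ["Sample1", "Sample2"])
print(samples["Sample1"]["GT"], samples["Sample1"]["DP"])
```

Values are kept as strings here; a real parser would convert them according to the types declared in the VCF header.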
31 VCF visualization in IGV broadinstitute.org/igv/viewing_vcf_files
33 Variant annotation Hypo- or hyper-mappability flag: mark SNVs in low-confidence regions. dbSNP [SnpSift]: mark already known variants. Variant effects [SnpEff]: predict the effects of variants on genes (such as amino acid changes). dbNSFP [SnpSift]: functional annotations of the change. COSMIC [SnpSift]: known somatic mutations.
34 SNV statistics Statistics are generated from the SnpEff stats outputs. Example of one of the SNV metrics graphs.
36 Home-made Rscript: Generate report Files generated: a Nozzle-based HTML report which describes the entire analysis and provides QC, summary statistics, as well as the entire set of results; index.html, with links to detailed statistics and plots. For examples of reports generated using our pipeline, please visit our website.