3. What is DNAseq?
DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.
4. Why DNAseq?
Whole genome sequencing:
- Whole-genome SNV detection
- Structural variants
- Captures the regulatory region information
- Cancer analysis
- De novo genome assembly
Whole exome sequencing:
- Cheaper
- Captures the coding region information
- Rare disease analysis
5. What is the DNAseq problem about?
- The pieces are strings of 100 to ≈1,000 letters (≈1 kb)
- The puzzle has 3,000,000,000 letters
- You usually have 120,000,000,000 letters to fit
- Many pieces don't fit: sequencing errors, SNPs, structural variants
- Many pieces fit in many places: low-complexity regions, microsatellites, repeats
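The two numbers on this slide imply an average depth of coverage. A minimal sketch of that arithmetic follows; the function names and the 100-letter read length in the usage note are illustrative assumptions, not part of the original slides.

```python
def mean_coverage(total_sequenced_bases: int, genome_size: int) -> float:
    """Average number of times each genome position is covered by a read."""
    return total_sequenced_bases / genome_size


def reads_needed(genome_size: int, coverage: int, read_length: int) -> int:
    """Number of reads required to reach a target coverage (rounded up)."""
    total_bases = genome_size * coverage
    # Integer ceiling division avoids floating-point rounding on large numbers.
    return -(-total_bases // read_length)


if __name__ == "__main__":
    # Figures taken from the slide: 120 billion letters over a 3 billion letter genome.
    print(mean_coverage(120_000_000_000, 3_000_000_000))
    # Hypothetical read length of 100 letters (the lower bound quoted on the slide).
    print(reads_needed(3_000_000_000, 40, 100))
```

With the slide's figures, the genome is covered about 40 times on average, which corresponds to roughly 1.2 billion reads if every read were 100 letters long.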
16. Read filtering
- Clip Illumina adapters
- Trim trailing bases with quality < 30
- Keep only reads of length ≥ 32 bp
Source: usadellab.org
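The trailing-quality trim and the length filter can be sketched in a few lines. This is an illustration of the idea only, not the implementation of the tool used in the pipeline; it assumes Phred+33 quality encoding, and the function names are hypothetical. Adapter clipping is not covered.

```python
def trim_trailing(seq: str, qual: str, min_quality: int = 30, offset: int = 33):
    """Remove bases from the 3' end while their quality is below min_quality.

    Assumes qualities are Phred scores encoded as ASCII with the given offset.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_quality:
        end -= 1
    return seq[:end], qual[:end]


def passes_length_filter(seq: str, min_length: int = 32) -> bool:
    """Keep a read only if it is at least min_length bases long."""
    return len(seq) >= min_length


if __name__ == "__main__":
    # Hypothetical 40-base read: 35 high-quality bases ('I' = Q40), 5 poor ones ('#' = Q2).
    seq = "ACGT" * 10
    qual = "I" * 35 + "#" * 5
    trimmed_seq, trimmed_qual = trim_trailing(seq, qual)
    print(len(trimmed_seq), passes_length_filter(trimmed_seq))
```

Trimming stops at the first base from the end that meets the threshold, so low-quality bases in the middle of a read are left in place.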
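17. Assembly vs. mapping
- Mapping: all reads vs. the reference
- Assembly: all reads vs. all reads
[Figure: reads aligned to a reference (mapping); reads assembled into contig1 (assembly)]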
18. RNA-seq: Assembly vs. mapping
[Figure: DNA-seq reads are either aligned to a reference genome (reference-based mapping) or assembled into contigs, contig1 and contig2 (de novo assembly)]
19. Read mapping
The mapping problem is challenging:
- Millions of short reads must be mapped to a genome
- Genome = text with billions of letters
- Many mapping locations are possible
- NOT exact matching: sequencing errors and biological variants (substitutions, insertions, deletions, splicing)
Clever use of the Burrows-Wheeler Transform increases speed and reduces the memory footprint.
Mapper used: BWA
Other mappers: Bowtie, STAR, GEM, etc.
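To give a feel for why the Burrows-Wheeler Transform helps, here is a toy sketch that builds the transform naively and counts exact occurrences of a pattern by backward search. It is a teaching example under simplifying assumptions (exact matching only, quadratic-memory construction, function names invented here) and is not how BWA is implemented.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler Transform via sorted rotations ('$' marks the end of text)."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact matches of pattern in the original text using backward search."""
    # first[c] = index of the first row starting with character c in the sorted rotations.
    first = {}
    for index, char in enumerate(sorted(bwt_text)):
        first.setdefault(char, index)

    def occ(char: str, upto: int) -> int:
        # Occurrences of char in bwt_text[:upto]; real indexes precompute this.
        return bwt_text[:upto].count(char)

    low, high = 0, len(bwt_text)  # current range of matching rows, [low, high)
    for char in reversed(pattern):
        if char not in first:
            return 0
        low = first[char] + occ(char, low)
        high = first[char] + occ(char, high)
        if low >= high:
            return 0
    return high - low


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(count_occurrences(transformed, "ana"))
```

The search touches the pattern one letter at a time and never scans the genome itself, which is the property that real mappers build on. Handling mismatches and gaps requires additional machinery that this sketch does not attempt.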
20. SAM/BAM
SAM: Sequence Alignment/Map format
- Used to store alignments
- SAM = text, BAM = binary
- One BAM file per sample (Sample1.bam, Sample2.bam), between 10 GB and 500 GB each
Example record (truncated):
SRR M =NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGA…
Fields shown: read name, flag, reference position, CIGAR, mate position, bases, base qualities
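As a rough illustration of the text format, the sketch below splits one SAM alignment line into its eleven mandatory tab-separated fields. The record used in the example is hypothetical (the one on the slide is truncated), and the class and function names are invented for this sketch; optional tags after the eleventh column are ignored.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate ('=' means same as rname)
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # bases
    qual: str    # base qualities (ASCII-encoded)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; header lines (starting with '@') are not handled."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]), fields[9], fields[10],
    )


if __name__ == "__main__":
    # Hypothetical record, made up for illustration.
    example = "read1\t99\tchr1\t100\t60\t8M\t=\t300\t208\tACGTACGT\tIIIIIIII"
    record = parse_sam_line(example)
    print(record.qname, record.rname, record.pos, record.cigar)
```

In practice BAM files are read through a dedicated library or samtools rather than parsed by hand; the sketch is only meant to show what each column holds.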
21. The BAM/SAM format: Sort, View, Index, Statistics, etc.
Tools: samtools.sourceforge.net, picard.sourceforge.net
$ samtools flagstat C1.bam
in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
mapped (100.00%:nan%)
paired in sequencing
read1
read2
properly paired (85.06%:nan%)
with itself and mate mapped
singletons (3.44%:nan%)
with mate mapped to a different chr
with mate mapped to a different chr (mapQ>=5)
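The categories reported by flagstat are derived from the bitwise FLAG field of each record. The sketch below tallies a few of them from a list of flag values; it loosely mirrors the categories above but is a simplified assumption about how they are counted, not the samtools source. The flag values in the example are hypothetical.

```python
# Bit values defined by the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
READ1 = 0x40
READ2 = 0x80
DUPLICATE = 0x400


def summarize_flags(flags):
    """Tally flagstat-like categories from an iterable of integer SAM flags."""
    counts = {
        "total": 0, "duplicates": 0, "mapped": 0, "paired": 0, "read1": 0,
        "read2": 0, "properly_paired": 0, "both_mapped": 0, "singletons": 0,
    }
    for flag in flags:
        counts["total"] += 1
        is_mapped = not (flag & UNMAPPED)
        if flag & DUPLICATE:
            counts["duplicates"] += 1
        if is_mapped:
            counts["mapped"] += 1
        if flag & PAIRED:
            counts["paired"] += 1
            if flag & READ1:
                counts["read1"] += 1
            if flag & READ2:
                counts["read2"] += 1
            if is_mapped and flag & PROPER_PAIR:
                counts["properly_paired"] += 1
            if is_mapped and not (flag & MATE_UNMAPPED):
                counts["both_mapped"] += 1
            if is_mapped and flag & MATE_UNMAPPED:
                counts["singletons"] += 1  # read mapped, mate not mapped
    return counts


if __name__ == "__main__":
    # Hypothetical flags: a proper pair (99, 147), a singleton (73), an unmapped read (133).
    print(summarize_flags([99, 147, 73, 133]))
```

Percentages such as "properly paired" are then simple ratios of these counts, for example properly paired reads over paired reads.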
28. Single nucleotide variant calling
Aim: differentiate real SNPs from sequencing errors.
Accurate SNP discovery depends on good base quality and sufficient depth of coverage.
[Figure: read pileup contrasting sequencing errors with a SNP]
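One way to see why base quality and depth both matter is a simple per-site genotype likelihood. The sketch below assumes a diploid site with one reference and one alternate allele, independent reads, and a base error probability of 10^(-Q/10) spread evenly over the three wrong bases. It is a textbook-style simplification with invented function names, not the model implemented in samtools or bcftools.

```python
import math


def base_likelihood(observed: str, allele: str, error_prob: float) -> float:
    """P(observed base | true allele) under a uniform error model."""
    return 1.0 - error_prob if observed == allele else error_prob / 3.0


def genotype_log_likelihoods(bases, quals, ref: str, alt: str):
    """Log-likelihood of the three diploid genotypes given a pileup column."""
    genotypes = {
        f"{ref}/{ref}": (ref, ref),
        f"{ref}/{alt}": (ref, alt),
        f"{alt}/{alt}": (alt, alt),
    }
    result = {}
    for name, (allele1, allele2) in genotypes.items():
        total = 0.0
        for base, quality in zip(bases, quals):
            # Phred quality to error probability, capped so the log stays finite.
            error = min(10 ** (-quality / 10), 0.75)
            prob = 0.5 * base_likelihood(base, allele1, error) \
                 + 0.5 * base_likelihood(base, allele2, error)
            total += math.log(prob)
        result[name] = total
    return result


def call_genotype(bases, quals, ref: str, alt: str) -> str:
    """Maximum-likelihood genotype (no prior, no multi-sample information)."""
    likelihoods = genotype_log_likelihoods(bases, quals, ref, alt)
    return max(likelihoods, key=likelihoods.get)


if __name__ == "__main__":
    # Hypothetical pileup: 20 reference bases and a single mismatch, all at Q30.
    print(call_genotype("A" * 20 + "G", [30] * 21, "A", "G"))
    # Hypothetical pileup: an even split between the two alleles.
    print(call_genotype("A" * 10 + "G" * 10, [30] * 20, "A", "G"))
```

With 20 reference bases and one mismatch the lone G is best explained as an error, so the call is A/A; with an even split the heterozygous genotype wins. At low depth the same single mismatch is much harder to classify, which is why coverage matters.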
29. SNP and genotype calling workflow
Variants from multiple samples are called simultaneously using the mpileup method from samtools, then quality-filtered using bcftools.
[Figure: Bayesian approaches vs. MLE approaches. Source: Nielsen et al., June 2011]
30. The variant format: VCF
- VCF = Variant Call Format
- The FORMAT column defines ":"-separated values
- GT = genotype
- DP = depth
- …
Specification: vcftools.sourceforge.net/specs.html
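The relationship between the FORMAT column and a sample column can be shown with a short sketch: the keys come from FORMAT, the values from the sample, both split on ":". The field values below are hypothetical, and the function name is invented for illustration.

```python
def parse_sample_field(format_field: str, sample_field: str) -> dict:
    """Pair the ':'-separated FORMAT keys with one sample's ':'-separated values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip stops at the shorter list, so omitted trailing values are simply absent.
    return dict(zip(keys, values))


if __name__ == "__main__":
    # Hypothetical FORMAT and sample columns from one VCF line.
    sample = parse_sample_field("GT:DP", "0/1:35")
    print(sample["GT"], int(sample["DP"]))
```

All values come back as strings; converting DP to an integer, or splitting GT on "/" or "|", is left to the caller.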
31. VCF visualization in IGV
broadinstitute.org/igv/viewing_vcf_files
33. Variant annotation
- Hypo- or hyper-mappability flag: mark SNVs in low-confidence regions
- dbSNP [SnpSift]: mark already known variants
- Variant effects [SnpEff]: predict the effects of variants on genes (such as amino acid changes)
- dbNSFP [SnpSift]: functional annotations of the change
- COSMIC [SnpSift]: known somatic mutations
34. SNV statistics
Statistics are generated from the SnpEff stats outputs.
[Figure: example of one of the SNV metrics graphs]
36. Generate report
Home-made R script
Nozzle-based HTML report that describes the entire analysis and provides QC, summary statistics, and the entire set of results.
Files generated:
- index.html, with links to detailed statistics and plots
For examples of reports generated with our pipeline, please visit our website.