Previous Lecture: Next-Generation DNA Sequencing Technology
This Lecture: NGS Alignment. Slides are from Stratos Efstathiadis, Cole Trapnell, Steven Salzberg, Ben Langmead, Thomas Keane, Dennis Wall, Jianhua Ruan, and J Fass.
Learning Objectives: Challenge of NGS read alignment; Suffix arrays; FM index; BWA algorithm; SAM/BAM format; CIGAR string for variants; Software to work directly with SAM/BAM files; Sequence viewers
Short Read Alignment: Given a reference and a set of reads, report at least one “good” local alignment for each read if one exists. This is an approximate answer to: where in the genome did the read originate?
What is “good”? For now, we concentrate on:
– Fewer mismatches is better: read GATCAA against …TGATCATA… is better than read GAGAAT against …TGATCATA…
– Failing to align a low-quality base is better than failing to align a high-quality base: read GATcaT against …TGATATTA… is better than read GTACAT against …TGATcaTA…
– Match uniqueness of the alignment
Multiple mapping A single tag may occur more than once in the reference genome. The user may choose to ignore tags that appear more than n times. As n gets large, you get more data, but also more noise in the data.
Inexact matching: An observed tag may not exactly match any position in the reference genome. Sometimes, the tag almost matches one or more positions. Such mismatches may represent a SNP (single-nucleotide polymorphism) or a bad read-out. The user can specify the maximum number of mismatches, or a Phred-style quality score threshold. As the number of allowed mismatches goes up, the number of mapped tags increases, but so does the number of incorrectly mapped tags.
Read Length is Not As Important For Resequencing Jay Shendure
New alignment algorithms must address the requirements and characteristics of NGS reads:
– Hundreds of millions of reads per run (30x genome coverage)
– Short reads (as short as 36 bp)
– Different types of reads (single-end, paired-end, mate-pair, etc.)
– Base-calling quality factors (should the aligner use them?)
– Sequencing errors (~1%)
– Repetitive regions
– Sequencing sample vs. reference genome
– Must adjust to evolving sequencing technologies and data formats
Index: an auxiliary data structure. Two classes of indexing algorithms:
(1) Hash tables (the “old” way)
– Hash of reads (MAQ, ELAND, ZOOM, …): smaller but variable memory requirements (depends on the number of reads).
– Hash of reference (SOAP, MOSAIK, …): predictable memory requirements.
(2) Suffix arrays (the “new” way): BWA, Bowtie, SOAP2, …
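A minimal sketch of the "hash of reference" idea, assuming a toy reference string and exact matching only. The function names, the seed length k=4 and the sequences are illustrative choices, not taken from SOAP or MOSAIK.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def lookup_exact(index, reference, read, k):
    """Use the first k bases of the read as a seed, then verify the whole read."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits

reference = "tgatcatagatcaa"            # hypothetical toy reference
index = build_kmer_index(reference, k=4)
print(lookup_exact(index, reference, "gatca", k=4))   # 0-based start positions
```

The index size depends only on the reference, which is why the memory requirement of this scheme is predictable.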
Indexing: Genomes and reads are too large for direct approaches like dynamic programming, so indexing is required. Choice of index is key to performance: suffix tree, suffix array, seed hash tables, and many variants, including spaced seeds.
Suffix Array: Is the process reversible? Find “ctat” in the reference.
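The lookup asked for on this slide can be sketched as follows. The reference string here is a made-up toy example (the slide's own reference is not in the transcript), and the construction is the naive sort-all-suffixes approach, which would be far too slow for a genome.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary search for the block of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                       # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

reference = "gctatcgctatg$"              # hypothetical toy reference
sa = build_suffix_array(reference)
print(find_occurrences(reference, sa, "ctat"))   # 0-based start positions
```

All suffixes sharing a prefix sit next to each other in the sorted order, so two binary searches are enough to find every occurrence.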
NGS Read Alignment: Burrows-Wheeler Transform (BWT). Invented by David Wheeler in 1983 (Bell Labs). Published in: Burrows M, Wheeler DJ (1994), “A Block Sorting Lossless Data Compression Algorithm”, Systems Research Center Technical Report No 124, Palo Alto, CA: Digital Equipment Corporation. Originally developed for compressing large files (bzip2, etc.). Lossless, fully reversible. Alignment tools based on BWT: bowtie, BWA, SOAP2, etc. Approach:
– Align reads on the transformed reference genome, using an efficient index (FM index)
– Solve the simple problem first (align one character) and then build on that solution to solve a slightly harder problem (two characters), etc.
Results in great speed and efficiency gains (a few gigabytes of RAM for the entire human genome). Other approaches require tens of gigabytes of memory and are much slower.
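To make "lossless, fully reversible" concrete, here is a small sketch of the transform and its inverse using the textbook sorted-rotations method. It is meant only for short strings; real aligners build the BWT from a suffix array and never materialize the rotation matrix. The function names are illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (text must end with '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last):
    """Recover the text by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The row that ends with the terminator is the original text.
    return next(row for row in table if row.endswith("$"))

transformed = bwt("acaacg$")
print(transformed)                # gc$aaac
print(inverse_bwt(transformed))   # acaacg$
```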
FM-index: The Burrows-Wheeler Transform is the basis of the bzip2 file compression tool. BWT-based aligners use an FM-index (Ferragina & Manzini), which allows efficient finding of substring matches within compressed text. The algorithm is sub-linear with respect to the time and storage space required for a given set of input data (reference genome), giving a reduced memory footprint and faster execution.
Burrows-Wheeler: Store the entire reference genome. Align the tag base by base from the end. When the tag is traversed, all active locations are reported. If no match is found, then back up and try a substitution. (Jianhua Ruan, The University of Texas at San Antonio)
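The "back up and try a substitution" step can be sketched as a recursive backward search with a mismatch budget. This is a simplified illustration, not BWA's or Bowtie's actual search: the index is built naively, the Occ table is stored in full, only substitutions (no gaps) are tried, and there is no quality-aware ordering or pruning. All names are hypothetical.

```python
def build_fm_index(text):
    """Naive FM-index pieces for a '$'-terminated text."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    counts, total = {}, 0               # counts[ch] = number of smaller characters
    for ch in alphabet:
        counts[ch] = total
        total += text.count(ch)
    occ = {ch: [0] for ch in alphabet}  # occ[ch][i] = occurrences of ch in bwt[:i]
    for b in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (b == ch))
    return sa, counts, occ

def search_with_mismatches(pattern, sa, counts, occ, max_mismatches):
    """Match the pattern right to left; substitute a base when budget allows."""
    hits = set()

    def extend(i, top, bot, budget):
        if i < 0:                       # whole pattern consumed: report rows
            hits.update(sa[top:bot])
            return
        for ch in counts:
            if ch == "$":
                continue
            cost = 0 if ch == pattern[i] else 1
            if cost > budget:
                continue
            new_top = counts[ch] + occ[ch][top]
            new_bot = counts[ch] + occ[ch][bot]
            if new_top < new_bot:       # range still non-empty: keep going
                extend(i - 1, new_top, new_bot, budget - cost)

    extend(len(pattern) - 1, 0, len(sa), max_mismatches)
    return sorted(hits)

sa, counts, occ = build_fm_index("agcagcagact$")
print(search_with_mismatches("gct", sa, counts, occ, 0))   # no exact match
print(search_with_mismatches("gct", sa, counts, occ, 1))   # one substitution allowed
```

Raising the budget finds more candidate locations but also widens the search, which mirrors the trade-off between mapped and incorrectly mapped tags.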
Burrows-Wheeler Matrix (sorted rotations of acaacg$):
$acaacg
aacg$ac
acaacg$
acg$aca
caacg$a
cg$acaa
g$acaac
See the suffix array?
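Key observation: “last first (LF) mapping”. The i-th occurrence of character X in the last column corresponds to the same text character as the i-th occurrence of X in the first column.
Text, with each character numbered by occurrence: a1 c1 a2 a3 c2 g1 $1
Sorted rotations, with the occurrence number of the first and last character of each row:
first  row      last
$1     $acaacg  g1
a2     aacg$ac  c1
a1     acaacg$  $1
a3     acg$aca  a2
c1     caacg$a  a1
c2     cg$acaa  a3
g1     g$acaac  c2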
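A small check of the LF property on the same text, under the assumption that each character is identified by its offset in the text. For every character, the offsets seen top to bottom in the first column should equal those seen top to bottom in the last column. The helper names are illustrative.

```python
def rotation_order(text):
    """Start offsets of the rotations of text, in sorted-rotation order."""
    n = len(text)
    return sorted(range(n), key=lambda i: text[i:] + text[:i])

def column_offsets(text):
    """For each character, list the text offsets it occupies in the first
    and in the last column of the sorted rotation matrix, top to bottom."""
    n = len(text)
    first, last = {}, {}
    for start in rotation_order(text):
        first.setdefault(text[start], []).append(start)
        end = (start - 1) % n           # last character of this rotation
        last.setdefault(text[end], []).append(end)
    return first, last

first, last = column_offsets("acaacg$")
for ch in sorted(first):
    print(ch, first[ch], last[ch])      # the two lists agree for every character
```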
Exact match (another example)
BWT(agcagcagact) = tgcc$ggaaaac
Sorted rotations:
$agcagcagact
act$agcagcag
agact$agcagc
agcagact$agc
agcagcagact$
cagact$agcag
cagcagact$ag
ct$agcagcaga
gact$agcagca
gcagact$agca
gcagcagact$a
t$agcagcagac
Search for pattern: gca
Test with your own seq and pattern at:
Exact Matching with FM Index To match Q in T using BWT(T), repeatedly apply rule: top = LF(top, qc); bot = LF(bot, qc) – Where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc
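The update rule above can be sketched with a count table and an occurrence table standing in for LF. This is a teaching version under simplifying assumptions: the suffix array and the Occ table are stored in full rather than sampled or checkpointed, and the names `counts` and `occ` are illustrative.

```python
def build_fm_index(text):
    """Naive FM-index pieces for a '$'-terminated text."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    counts, total = {}, 0               # counts[ch] = number of smaller characters
    for ch in alphabet:
        counts[ch] = total
        total += text.count(ch)
    occ = {ch: [0] for ch in alphabet}  # occ[ch][i] = occurrences of ch in bwt[:i]
    for b in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (b == ch))
    return sa, counts, occ

def backward_search(pattern, sa, counts, occ):
    """Return sorted 0-based text positions where pattern occurs exactly."""
    top, bot = 0, len(sa)               # half-open row range [top, bot)
    for qc in reversed(pattern):        # consume the query right to left
        if qc not in counts:
            return []
        top = counts[qc] + occ[qc][top]
        bot = counts[qc] + occ[qc][bot]
        if top >= bot:                  # range became empty: no match
            return []
    return sorted(sa[top:bot])

sa, counts, occ = build_fm_index("agcagcagact$")
print(backward_search("gca", sa, counts, occ))   # [1, 4]
```

Each query character costs two table lookups regardless of the reference length, which is where the speed of this approach comes from.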
FM Index is Small. Entire FM Index on DNA reference consists of:
– BWT (same size as T), assuming 2-bit-per-base encoding and no compression, as in Bowtie
– Checkpoints (~15% size of T), assuming a 16-byte checkpoint every 448 characters, as in Bowtie
– SA sample (~50% size of T), assuming Bowtie defaults for suffix-array sampling rate, etc.
Total: ~1.65x the size of T
Figure labels (relative index sizes): >45x, >15x, ~1.65x
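The percentages above can be reproduced with a few lines of arithmetic. The 2-bit encoding and the 16-byte checkpoint every 448 characters come from the slide; the suffix-array figures (a 4-byte entry kept for every 32nd row) are assumptions chosen here to match the quoted ~50%, not values stated in the transcript.

```python
BITS_PER_BASE = 2            # from the slide: 2-bit-per-base encoding
CHECKPOINT_BYTES = 16        # from the slide: one 16-byte checkpoint ...
CHECKPOINT_INTERVAL = 448    # ... every 448 characters
SA_ENTRY_BYTES = 4           # assumption: 32-bit suffix-array entries
SA_SAMPLE_INTERVAL = 32      # assumption: every 32nd suffix-array row is kept

def fm_index_size_ratios():
    """Return component sizes as multiples of the size of the 2-bit-encoded text T."""
    text_bytes_per_base = BITS_PER_BASE / 8
    bwt = 1.0
    checkpoints = (CHECKPOINT_BYTES / CHECKPOINT_INTERVAL) / text_bytes_per_base
    sa_sample = (SA_ENTRY_BYTES / SA_SAMPLE_INTERVAL) / text_bytes_per_base
    return bwt, checkpoints, sa_sample

bwt_ratio, checkpoint_ratio, sa_ratio = fm_index_size_ratios()
total_ratio = bwt_ratio + checkpoint_ratio + sa_ratio
print(round(checkpoint_ratio, 3), sa_ratio, round(total_ratio, 2))
```

Under these assumptions the checkpoints come to about 14% of T and the total to about 1.64x, in line with the rounded figures on the slide.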
FASTQ format: The de facto file format for sharing sequence read data. Holds a sequence and a per-base quality score.
SAM (Sequence Alignment/Map) format: A unified format for storing read alignments to a reference genome. Generally large files (a byte per bp).
BAM (Binary Alignment/Map) format: A binary equivalent to SAM. Very compact in size but computationally efficient to access. Developed for fast processing and indexing.
FASTQ: The de facto file format for sharing DNA sequence read data.
– 4 lines per read
– Sequence line and a per-base Phred quality score line per read
– FASTQ files are text files
– There is no file header
Example (36 bp read, 36 quality scores, sequence ID in Illumina style): GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG + B;30; A2?0 79##
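A minimal reader for the 4-lines-per-read layout, assuming unwrapped records (one sequence line and one quality line per read). The record used below is invented for illustration and is not the read shown on the slide.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:                  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality

# Hypothetical record: ID line, sequence line, '+' line, quality line.
example = StringIO("@read_1\nGATAGTTC\n+\nIIIIHHG#\n")
records = list(read_fastq(example))
print(records)
```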
An Introduction to Phred Quality Score
Q = -10 × log10(P), where Q is the Phred Quality Score and P is the Error Probability: the probability that a base call is wrong.
Q    Probability the base call is wrong    (confidence)
40   0.01%                                 (99.99%)
30   0.1%                                  (99.9%)
20   1%                                    (99%)
10   10%                                   (90%)
Phred Quality Score encoding in FASTQ/SAM files: ASCII character = Q + 33
FASTQ files: Q represents Base Call Quality: the probability that the base call is wrong.
SAM files: Q represents Mapping Quality: the probability that the mapping position of the read is incorrect.
$ perl -e 'print chr(33);'
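The conversions on this slide can be written out directly. This sketch assumes the Q + 33 ASCII offset stated above; the function names are illustrative.

```python
import math

def phred_to_error_probability(q):
    """P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality_string(quality, offset=33):
    """Turn an ASCII quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]

print(phred_to_error_probability(30))
print(error_probability_to_phred(0.01))
print(decode_quality_string("!I5"))     # [0, 40, 20]
```

The character "!" is chr(33), so it encodes Q = 0, the lowest possible score.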
The SAM file
SAM data fields
Mapping Quality (MAPQ) in BWA. Mapping Quality is a function of edit distance and the uniqueness of the alignment.
BWA Mapping Quality:
0: A read aligns equally well to multiple positions (hits). BWA picks one of the positions at random and assigns MAPQ=0.
1 –: Only 1 best hit (with no suboptimal hits) with more than 2 mismatches, or only 1 best hit with 1 suboptimal hit.
37: Only 1 best hit (no suboptimal hits), with up to 2 mismatches (edit distance could be more than 2).
SAM/BAM format
Header section: SN:chr20 AS:HG18 ID:L1 PU:SC_1_10 LB:SC_1 ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
Alignment section:
V00-HWI-EAS132:3:38:959:2035#0 147 chr M = 79 0 GATCTGATGGCAGAAAACCCCTCTCAGTCCGTCGTG aaX`[\`Y^Y^]ZX``\EV_BBBBBBBBBBBBBBBB NM:i:1
V00-HWI-EAS132:4:99:122:772#0 177 chr M = AAAGGATCTGATGGCAGAAAACCCCTCTCAGTCCGT aaaaaa\OWaI_\WL\aa`Xa^]\ZUaa[XWT\^XR NM:i:1
V00-HWI-EAS132:4:44:473:970#0 25 chr M * 0 0 GTCGTGGTGAAGGATCTGATGGCAGAAAACACCTCT __YaZ`W[aZNUZ[U[_TL[KVVX^QURUTDRVZBB NM:i:2
V00-HWI-EAS132:4:29:113:1934#0 99 chr M = GGGTTTTCTGCCATCAGATCCTTTACCACGACAGAC aaaQaa__``]\\_^``^a^`a`_^^^_XQ[ZS\XX NM:i:1
Field labels: Query name; Ref sequence; position of alignment; query sequence (same strand as ref); query
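A short sketch of how the eleven mandatory SAM columns, the FLAG bits and a CIGAR string can be decoded. The alignment line built below is a made-up example (position, mate position, CIGAR and qualities are invented); only the flag values 99 and 147 and the read sequence are borrowed from the slide. The flag names are informal labels, not official identifiers.

```python
import re

FLAG_BITS = [
    (0x1, "paired"), (0x2, "proper_pair"), (0x4, "unmapped"),
    (0x8, "mate_unmapped"), (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"), (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
]
COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
           "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def decode_flag(flag):
    """List the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def parse_sam_line(line):
    """Parse one tab-separated alignment line into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(COLUMNS, fields[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = fields[11:]        # optional fields such as NM:i:1
    return record

seq = "GGGTTTTCTGCCATCAGATCCTTTACCACGACAGAC"
# Hypothetical alignment: 30 matches, a 1-base insertion, 5 more matches.
line = "\t".join(["read_1", "99", "chr20", "1000", "37", "30M1I5M",
                  "=", "1200", "236", seq, "I" * len(seq), "NM:i:1"])
record = parse_sam_line(line)
print(decode_flag(record["FLAG"]))
print(parse_cigar(record["CIGAR"]))
print(decode_flag(147))
```

An insertion or deletion relative to the reference shows up as an I or D operation in the CIGAR string, which is how small indel variants are encoded in the alignment.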
Post-processing: Tools and programming APIs for parsing and manipulating alignments.
Samtools:
– Convert SAM to BAM and vice versa
– Sort and index BAM files
– Merge multiple BAM files
– Show alignments in text viewer
– Remove duplicates from PCR amplification step
Picard Tools (Java-based)
BAM is Indexed & Binary Compressed SAM. BAM is indexed by genome location. The software toolkit allows other software to extract sequence data from BAM at specific genome locations, so a program does not need to hold the entire data file in memory during operation.
SAMTools/Picard: SAMTools is a simple toolkit to transform SAM to BAM, merge, sort, and index. It can also:
– calculate statistics (mean quality, depth of coverage, etc.)
– filter duplicate reads
– create multiple alignments of all reads over a genomic interval
– call variants
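SAMTools commands
Download and install SAMTools.
Make the BAM file: samtools view -bt ref.fa.fai -o data.bam data.sam (the -t option takes a tab-delimited list of reference names and lengths, such as the .fai file written by samtools faidx ref.fa)
Sort the BAM file: samtools sort -o data.sorted.bam data.bam (older 0.1.x releases take an output prefix instead: samtools sort data.bam data.sorted)
Index the BAM file: samtools index data.sorted.bam (writes data.sorted.bam.bai)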
Manual Reference Pages - samtools (1)
DESCRIPTION
Samtools is a set of utilities that manipulate alignments in the BAM format. It imports from and exports to the SAM (Sequence Alignment/Map) format, does sorting, merging and indexing, and allows to retrieve reads in any regions swiftly.
SYNOPSIS
samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
samtools sort aln.bam aln.sorted
samtools index aln.sorted.bam
samtools idxstats aln.sorted.bam
samtools view aln.sorted.bam chr2:20,100,000-20,200,000
samtools merge out.bam in1.bam in2.bam in3.bam
samtools faidx ref.fasta
samtools pileup -vcf ref.fasta aln.sorted.bam
samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
samtools tview aln.sorted.bam ref.fasta
bcftools index in.bcf
bcftools view in.bcf chr2: > out.vcf
bcftools view -vc in.bcf > out.vcf 2> out.afs
Heng Li from the Sanger Institute wrote the C version of samtools.
Visualization: The BAM file format also simplifies visualization of NGS data. You must make a .BAI index for each BAM file using the samtools index command. The BAM and .BAI files must be located on your own computer. The index allows the viewer to quickly find and load only the read data for a specific genomic interval. Integrative Genomics Viewer (IGV) from the Broad Institute is our current favorite. Use the Java Webstart to download and run IGV (use the 1.2 GB version):
Summary: Challenge of NGS read alignment; Suffix array; FM index; BWA algorithm; SAM/BAM format; CIGAR string for variants; Software to work directly with SAM/BAM files; Sequence viewers