Presentation is loading. Please wait.

Presentation is loading. Please wait.

Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers

Similar presentations


Presentation on theme: "Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers"— Presentation transcript:

1 Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers
Mapping Sequence Tags Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers

2 Genomic Data Analysis Course Moscow July 2013 Mark Reimers, Ph.D
Mapping Sequence Tags Genomic Data Analysis Course Moscow July 2013 Mark Reimers, Ph.D

3 Raw Data Formats Most common format is FASTQ
Four rows per read @HWI-EAS255_4_FC2010Y_1_43_110_790 TTAATCTACAGAATAGATAGCTAGCATATATTT + hhhhhhhhhhhhhhhdhhhhhhhhhhhdRehdh First row identifies read by experiment and location on lane or plate Second row gives sequence as read Fourth row gives quality information

4 Quality Scores Modelled after PHRED scores: 10*log10( P(error))
Calibrated by manufacturer based on previous experience May be systematically conservative or optimistic

5 Representing Quality Scores
SOLiD QV and 454 represent quality scores as decimal numbers Illumina compresses scores as ASCII characters: quality scores range from 0 to 62 using ASCII 64 to 126 (normally scores are between 0 and 40) hhhhhhhhhhhhhhhdhhhhhhhhhhhdRehdh R 82 (82-64)=18 e 101 (101-64)=37 h 104 (104-64)=40 d 100 (100-64)=36

6 Variation: SOLiD CSFASTA & QV
Color Space FASTA >1_51_64_F3 T >1_51_127_F3 T Each second line contains color space with last base of primer QV file contains integers with PHRED scores

7 Alignment

8 Pre-processing – Mapping Reads
Huge read numbers (10M – 100M) Classical Dynamic Programming alignment is far too slow Reads have more errors than typical Sanger sequence RNA-Seq reads may be spliced and map between two locations on genome

9 Mapping Software BLAST and even BLAT (2002 high-speed method) are too slow Many mapping algorithms Eland mrFAST MAQ SOAP GSNAP MoM BowTie BWA

10 Fast Alignment Approaches
Key to rapid alignment is efficient error-tolerant indexing of reference genome MAQ uses spaced indexing SOAP uses seed and hash look-up table for binary representation of query and reference Bowtie & BWA both use Burrows-Wheeler transform to compress query and reference and then map Go from here to slide 19

11 Spaced Seed Method (MAQ)
Each position in the reference is cut into equal-sized pieces, called 'seeds' and these seeds are paired and stored in a lookup table. Each read is also cut up according to this scheme, and pairs of seeds are used as keys to look up matching positions in the reference. The index can be very large – human genome takes 50GB Trapnell & Salzburg, Nature Biotech, 2009

12 Burrows-Wheeler Transform
First generate all rotations of the text, then sort them in lexicographic order. The last column is the transform Repeated motifs get transformed into repeated characters – easily compressed Easily invertible Allows hierarchical searching Aligners maps compressed reads to compressed genome

13 Burrows-Wheeler Algorithm
Reads are aligned character by character from right to left against the transformed string. With each new character, the algorithm updates an interval (indicated by blue 'beams') in the transformed string. Each successively aligned new character winnows the list of positions to which the read might map. When all characters in the read have been processed, alignments are represented by any positions within the interval. Much faster than spaced seed method Very efficient storage (2GB for human genome Trapnell & Salzburg, Nature Biotech, 2009

14 BowTie Run from command line Arguments: genome, FASTQ file, output
>bowtie e_coli ./e_coli_1000.fq e_coli.map # reads processed: 1000 # reads with at least one reported alignment: 699 (69.90%) # reads that failed to align: 301 (30.10%) Reported 699 alignments to 1 output stream(s) Pre-built indices at SourceForge BowTie 2 - gapped alignments (v. Tophat) Myrna - BowTie alignment ‘in the cloud’

15 BWA Two stages: indexing and mapping
Stage 1 arguments: number of errors, number of errors in ‘seed’, reference, query and index filename bwa aln -n 6 -k 1 ref.db.fasta input.query.fq index.sai This allows 1 error in ‘seed’ but up to 6 errors in alignment Time taken grows exponentially with k Multi-thread option -t Durbin group – well engineered

16 BWA – Second Stage Transform index file into SAM file; still needs the reference file bwa samse -n 2 ref.db.fasta index.sai input.query.fq out.sam This writes out up to 2 matches for each read bwa sampe - paired end Usually convert SAM to BAM file for easy storage

17 Alignments in Color Space
Make color-space version of reference genome BW transform reference CS sequence Transform csfasta and qv to fastq Map reads

18 Multiple Mapping Many short reads will map to multiple loci
Defaults are uniquely mapped reads only Low complexity short reads will often map to very many possible loci Most aligners don’t do a very good job of handling more than two loci Hard to work with SAM/BAM information for multiply mapped reads Rather important for low-complexity regions

19 Using Quality Scores in Mapping
Alignment is already so demanding that most aligners do not use quality information Some SNP callers use quality information

20 SNP Calling Variants in 3’ end of reads are more likely to be errors than variants in middle of reads In Illumina data a true G is read as a T with higher probability than other substitutions Therefore G->T variants are more likely than other variants to be errors (Bayes Theorem)


Download ppt "Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers"

Similar presentations


Ads by Google