SeqMap: mapping massive amount of oligonucleotides to the genome Hui Jiang et al. Bioinformatics (2008) 24: 2395-2396 The GNUMAP algorithm: unbiased probabilistic.


1 SeqMap: mapping massive amount of oligonucleotides to the genome
Hui Jiang et al. Bioinformatics (2008) 24: 2395-2396
The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing
Nathan Clement et al. Bioinformatics (2010) 26: 38-45
Presented by: Xia Li

2 Short-read mapping software
Software    Technique                                         Reference
GNUMAP      Hashing refs + base quality + repeated regions    Clement et al., 2010
Novoalign   Hashing refs                                      Novocraft, unpublished
SOAP        Hashing refs                                      Li et al., 2008
SeqMap      Hashing reads                                     Jiang et al., 2008
RMAP        Hashing reads + read quality                      Smith et al., 2008
Eland       Hashing reads                                     Cox, unpublished
Bowtie      BWT                                               Langmead et al., 2009
Slider      Lexicographic sorting + base quality              Malhis et al., 2009

3 SeqMap Motivation
– Hashing the genome usually needs a large amount of memory (e.g. SOAP needs 14 GB of memory when mapping to the human genome)
– Allow more substitutions and insertions/deletions

4 SeqMap
Pigeonhole principle
– Spaced seed alignment
– ELAND, SOAP, RMAP
Hash reads
Insertion/deletion: 2/4 combinations with 1/2 shifted one nucleotide to its left or right
[Figure: short read split into 4 parts; all combinations of 2/4 parts; short read look-up table (indexed by 2 parts); reference genome. Image credit: J. Ruan]
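The pigeonhole idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not SeqMap's actual implementation: the function names (`part_bounds`, `build_read_index`, `map_reads`) are hypothetical, all reads are assumed to have the same length, and only substitutions are handled (the shifted-part trick for insertions/deletions is left out). With a read split into 4 parts and at most 2 substitutions, at least 2 parts must match exactly, so indexing every read by all combinations of 2 of its 4 parts is enough to find every candidate location.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def part_bounds(read_len, n_parts=4):
    """(start, end) boundaries that split a read into n_parts pieces."""
    step = read_len // n_parts
    bounds = [(i * step, (i + 1) * step) for i in range(n_parts)]
    bounds[-1] = (bounds[-1][0], read_len)  # last part absorbs the remainder
    return bounds


def build_read_index(reads, n_parts=4, n_keep=2):
    """Look-up table keyed by (which parts, their sequences) -> read ids."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        bounds = part_bounds(len(read), n_parts)
        for combo in combinations(range(n_parts), n_keep):
            key = (combo, tuple(read[bounds[i][0]:bounds[i][1]] for i in combo))
            index[key].append(rid)
    return index


def map_reads(genome, reads, max_mismatches=2, n_parts=4):
    """Scan the genome once and report (read id, position) hits.

    Assumption: all reads share one length and only substitutions occur.
    """
    read_len = len(reads[0])
    n_keep = n_parts - max_mismatches  # parts guaranteed to be intact
    index = build_read_index(reads, n_parts, n_keep)
    bounds = part_bounds(read_len, n_parts)
    combos = list(combinations(range(n_parts), n_keep))
    hits = set()
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        for combo in combos:
            key = (combo, tuple(window[bounds[i][0]:bounds[i][1]] for i in combo))
            for rid in index.get(key, ()):
                # Seed hit: verify the full window against the read.
                if hamming(window, reads[rid]) <= max_mismatches:
                    hits.add((rid, pos))
    return sorted(hits)
```

Because the reads are hashed rather than the genome, the table size grows with the number of reads, and the genome is only streamed through once.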

5 Experiment & Result

6 Deal with more substitutions and insertions/deletions
Randomly generate a DNA sequence of length 1 Mb, then add 100 Kb of random substitutions, N’s and insertions/deletions
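A test sequence of this kind could be generated roughly as follows. This is a sketch under stated assumptions: the slide does not say how the edits were distributed, so the equal mix of event types, the function names (`random_dna`, `mutate`) and the small default sizes are all illustrative choices, not the procedure used in the paper.

```python
import random


def random_dna(length, rng):
    """Uniformly random sequence over A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(seq, n_events, rng):
    """Apply n_events random edits; returns (mutated sequence, event counts).

    Assumption: substitutions, N's, insertions and deletions are equally likely.
    """
    out = list(seq)
    counts = {"sub": 0, "n": 0, "ins": 0, "del": 0}
    for _ in range(n_events):
        pos = rng.randrange(len(out))
        kind = rng.choice(["sub", "n", "ins", "del"])
        if kind == "del" and len(out) == 1:
            kind = "sub"  # never delete the last remaining base
        if kind == "sub":
            out[pos] = rng.choice([b for b in "ACGT" if b != out[pos]])
        elif kind == "n":
            out[pos] = "N"
        elif kind == "ins":
            out.insert(pos, rng.choice("ACGT"))
        else:
            del out[pos]
        counts[kind] += 1
    return "".join(out), counts


if __name__ == "__main__":
    rng = random.Random(0)        # fixed seed for reproducibility
    reference = random_dna(10_000, rng)
    edited, counts = mutate(reference, 1_000, rng)
    print(len(reference), len(edited), counts)
```

Scaling `random_dna` to 1,000,000 bases and `mutate` to 100,000 events would mirror the sizes quoted on the slide.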

7 GNUMAP Motivation
– Base uncertainty
  Such as nearly equal or low probabilities for A, C, G or T
  Filter low-quality reads [RMAP] -> discard up to half of the reads (Harismendy et al., 2009)
– Repeated regions in the genome
  Discard them -> loss of up to half of the data (Harismendy et al., 2009)
  Record one -> unequal mapping to some of the repeat regions
  Record all -> each location having 3 times the correct score

8 GNUMAP Flow-chart

9 Probabilistic Needleman-Wunsch
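The slide itself gives no formula, so the following is only one plausible reading of "probabilistic Needleman-Wunsch": each read position is a probability distribution over A, C, G, T instead of a single called base, and the match term of the usual global-alignment recurrence is replaced by its expected value under that distribution. The scoring values, the linear gap penalty and the names `expected_score`, `prob_nw` and `certain` are assumptions made for illustration, not GNUMAP's actual parameters.

```python
def expected_score(probs, ref_base, match=1.0, mismatch=-1.0):
    """Expected match/mismatch score of an uncertain read base vs. a reference base."""
    return sum(p * (match if b == ref_base else mismatch) for b, p in probs.items())


def prob_nw(read_probs, ref, gap=-2.0, match=1.0, mismatch=-1.0):
    """Global alignment score of a probabilistic read against a reference string.

    read_probs: list of dicts mapping each base in "ACGT" to its probability.
    """
    n, m = len(read_probs), len(ref)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + expected_score(
                read_probs[i - 1], ref[j - 1], match, mismatch
            )
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


def certain(seq):
    """Helper: turn a plain sequence into probability 1.0 for each called base."""
    return [{b: 1.0 if b == c else 0.0 for b in "ACGT"} for c in seq]
```

With fully confident base calls this reduces to ordinary Needleman-Wunsch; a position with equal probability for all four bases contributes a lower expected score instead of forcing the read to be discarded.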

10 Alignment Score
[Figure labels: ACTGAACCATACGGGTACTGAACCATGAA; AACCAT; GGGTACAACCATTAC; Read from sequencer: GGGTAC, AACCAT]
Read is added to both repeat regions proportionally to their match quality weighted by its # of occurrences in the genome
Slide credit: N. Clement
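The proportional assignment described on this slide can be illustrated with a short Python sketch. It assumes the simplest reading of the statement: each candidate location receives a share of the read equal to its alignment score divided by the sum of scores over all candidate locations. The function names (`find_exact`, `allocate_read`) are hypothetical, and exact matching stands in for the real alignment step.

```python
def find_exact(genome, read):
    """0-based start positions where the read matches the genome exactly."""
    k = len(read)
    return [i for i in range(len(genome) - k + 1) if genome[i:i + k] == read]


def allocate_read(scores):
    """Split one read across locations in proportion to its score at each.

    scores: dict mapping location -> non-negative alignment score.
    Returns a dict mapping location -> fraction of the read (fractions sum to 1).
    """
    total = sum(scores.values())
    if total <= 0:
        return {loc: 0.0 for loc in scores}
    return {loc: s / total for loc, s in scores.items()}


if __name__ == "__main__":
    genome = "ACTGAACCATACGGGTACTGAACCATGAA"  # sequence shown on the slide
    for read in ("GGGTAC", "AACCAT"):
        # Assumption: score = read length for an exact match.
        scores = {pos: float(len(read)) for pos in find_exact(genome, read)}
        print(read, allocate_read(scores))
```

In the slide's sequence, AACCAT occurs twice, so each copy receives half of the read, while the unique GGGTAC is assigned entirely to its single location.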

11 Experiment & Result

12 Comments
SeqMap
– Pros: handles more substitutions and insertions/deletions
– Cons: memory-consuming, not fast
GNUMAP
– Pros: considers base quality and repeated regions -> generates more useful information and achieves the best performance (~15% increase)
– Cons: memory-consuming, slow, more noise

