SeqMap: mapping massive amount of oligonucleotides to the genome Hui Jiang et al. Bioinformatics (2008) 24: 2395-2396 The GNUMAP algorithm: unbiased probabilistic.

Presentation transcript:

SeqMap: mapping massive amount of oligonucleotides to the genome
Hui Jiang et al. Bioinformatics (2008) 24: 2395-2396
The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing
Nathan Clement et al. Bioinformatics (2010) 26:
Presented by: Xia Li

Short-read mapping software
Software    Technique                                         Reference
GNUMAP      Hashing refs + base quality + repeated regions    Clement et al., 2010
Novoalign   Hashing refs                                      Novocraft, unpublished
SOAP        Hashing refs                                      Li et al., 2008
SeqMap      Hashing reads                                     Jiang et al., 2008
RMAP        Hashing reads + read quality                      Smith et al., 2008
Eland       Hashing reads                                     Cox, unpublished
Bowtie      BWT                                               Langmead et al., 2009
Slider      Lexicographic sorting + base quality              Malhis et al., 2009

SeqMap
Motivation
– Hashing the genome usually needs a lot of memory (e.g. SOAP needs 14 GB of memory when mapping to the human genome)
– Allow more substitutions and insertions/deletions

SeqMap
Pigeonhole principle
– Spaced seed alignment
– ELAND, SOAP, RMAP
Hash reads
Insertion/deletion: 2/4 combinations with 1/2 shifted one nucleotide to its left or right
[Figure: a short read is split into 4 parts; all combinations of 2/4 parts go into a short-read lookup table (indexed by 2 parts), which is matched against the reference genome. Image credit: J. Ruan]
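The pigeonhole idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not SeqMap's actual implementation: it handles substitutions only (no shifted parts for insertions/deletions), assumes all reads share one length divisible by 4, and the names `build_read_index` and `map_reads` are hypothetical. If a read differs from a genome window by at most 2 substitutions, at least 2 of its 4 parts match exactly, so indexing reads by every pair of parts cannot miss such a hit.

```python
from collections import defaultdict
from itertools import combinations


def build_read_index(reads, parts=4, keep=2):
    """Index every read by each choice of `keep` of its `parts` equal-length pieces."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        size = len(read) // parts
        for combo in combinations(range(parts), keep):
            pieces = tuple(read[i * size:(i + 1) * size] for i in combo)
            index[(combo, pieces)].append(read_id)
    return index


def map_reads(genome, reads, max_mismatches=2, parts=4, keep=2):
    """Scan the genome once and return a set of (read_id, position, mismatches)."""
    read_len = len(reads[0])  # assumption: all reads have the same length
    size = read_len // parts
    index = build_read_index(reads, parts, keep)
    hits = set()
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        for combo in combinations(range(parts), keep):
            pieces = tuple(window[i * size:(i + 1) * size] for i in combo)
            # Candidate reads share this pair of parts exactly; verify the rest.
            for read_id in index.get((combo, pieces), ()):
                mismatches = sum(a != b for a, b in zip(window, reads[read_id]))
                if mismatches <= max_mismatches:
                    hits.add((read_id, pos, mismatches))
    return hits
```

Because the reads are hashed rather than the genome, the index size grows with the number of reads and the genome is only streamed through once, which matches the memory motivation given on the previous slide.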

Experiment & Result

Deal with more substitutions and insertions/deletions
Randomly generate a DNA sequence 1 Mb long, then add 100 Kb of random substitutions, N's and insertions/deletions
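A test sequence of this kind could be simulated roughly as follows. This is a sketch only: the slide does not say how the 100 Kb of changes are split between substitutions, N's and insertions/deletions, so the counts are left as parameters, and the function names `random_dna` and `mutate` are hypothetical.

```python
import random


def random_dna(length, rng):
    """Return a uniformly random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(seq, n_sub, n_n, n_indel, rng):
    """Apply substitutions and N's at distinct positions, then random indels."""
    bases = list(seq)
    positions = rng.sample(range(len(bases)), n_sub + n_n)
    for p in positions[:n_sub]:
        # Substitute with a base that differs from the current one.
        bases[p] = rng.choice([b for b in "ACGT" if b != bases[p]])
    for p in positions[n_sub:]:
        bases[p] = "N"
    for _ in range(n_indel):
        p = rng.randrange(len(bases))
        if rng.random() < 0.5:
            bases.insert(p, rng.choice("ACGT"))  # single-base insertion
        else:
            del bases[p]  # single-base deletion
    return "".join(bases)


rng = random.Random(0)  # fixed seed so the simulation is repeatable
reference = random_dna(10_000, rng)  # scaled down from 1 Mb for a quick run
mutated = mutate(reference, n_sub=500, n_n=250, n_indel=250, rng=rng)
```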

GNUMAP
Motivation
– Base uncertainty
  e.g. nearly equal or low probabilities for A, C, G or T
  Filter low-quality reads [RMAP] -> discards up to half of the reads (Harismendy et al., 2009)
– Repeated regions in the genome
  Discard them -> loss of up to half of the data (Harismendy et al., 2009)
  Record one -> unequal mapping to some of the repeat regions
  Record all -> each location having 3 times the correct score

GNUMAP Flow-chart

Probabilistic Needleman-Wunsch
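The slide gives only the name of this step, so the following is a hedged sketch of one way a Needleman-Wunsch score can take base uncertainty into account, not GNUMAP's actual code. It assumes each read position is given as a probability distribution over A, C, G and T, and scores a read position against a reference base by the expected match/mismatch score. The function name `prob_nw_score` and the scoring values are illustrative assumptions.

```python
def prob_nw_score(read_probs, ref, match=1.0, mismatch=-1.0, gap=-2.0):
    """Global alignment score of a probabilistic read against a reference string.

    read_probs: one dict per read position, mapping 'A','C','G','T' to a probability.
    """
    n, m = len(read_probs), len(ref)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Expected substitution score under the read's base probabilities.
            p_match = read_probs[i - 1].get(ref[j - 1], 0.0)
            sub = p_match * match + (1.0 - p_match) * mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align read base with reference base
                score[i - 1][j] + gap,      # gap in the reference
                score[i][j - 1] + gap,      # gap in the read
            )
    return score[n][m]


# A read whose first base is an even call between A and C.
read = [{"A": 0.5, "C": 0.5}, {"C": 1.0}, {"G": 1.0}, {"T": 1.0}]
print(prob_nw_score(read, "ACGT"))
```

With a fully confident read the function reduces to ordinary Needleman-Wunsch scoring, so low-quality bases lower the score gradually instead of forcing the read to be discarded.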

Alignment Score
[Figure sequences and labels, in order: ACTGAACCATACGGGTACTGAACCATGAA / AACCAT / GGGTACAACCATTAC / Read from sequencer / GGGTAC / AACCAT]
Read is added to both repeat regions proportionally to their match quality, weighted by its number of occurrences in the genome
Slide credit: N. Clement
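One reading of the allocation rule on this slide is that a read's single unit of evidence is shared among all the places it matches, in proportion to match quality. The sketch below illustrates that reading only; the function name `allocate_read` and the equal scores are assumptions, and the sequence and 6-mer are taken from the slide.

```python
from collections import defaultdict


def allocate_read(location_scores):
    """Split one read across candidate locations in proportion to their scores.

    location_scores: dict mapping genome position -> non-negative match quality.
    Returns a dict mapping position -> fractional share; the shares sum to 1.
    """
    total = sum(location_scores.values())
    if total == 0:
        return {}
    return {pos: s / total for pos, s in location_scores.items()}


genome = "ACTGAACCATACGGGTACTGAACCATGAA"
kmer = "AACCAT"
# Every start position at which the 6-mer occurs in the genome.
positions = [i for i in range(len(genome) - len(kmer) + 1)
             if genome.startswith(kmer, i)]

# Assumption: both occurrences match equally well, so each gets half the read.
coverage = defaultdict(float)
for pos, share in allocate_read({p: 1.0 for p in positions}).items():
    coverage[pos] += share
```

Under this scheme a repeated region is neither discarded nor counted in full at every copy, which addresses the "discard", "record one" and "record all" problems listed in the motivation.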

Experiment & Result

Comments
SeqMap
– Pros: deals with more substitutions and insertions/deletions
– Cons: memory consuming, not fast
GNUMAP
– Pros: considers base quality and repeated regions -> generates more useful information and achieves the best performance (~15% increase)
– Cons: memory consuming, slow, more noise