MapNext: a software tool for spliced and unspliced alignments and SNP detection of short sequence reads 2009-09-10 Hua Bao Sun Yat-sen University, Guangzhou,

Presentation transcript:

MapNext: a software tool for spliced and unspliced alignments and SNP detection of short sequence reads Hua Bao Sun Yat-sen University, Guangzhou, China Evolution.sysu.edu.cn InCoB 2009

Next-generation sequencing: high throughput (tens of millions of reads per lane); short read length (25-50 bp); higher sequencing error rate than Sanger sequencing. Applications: genome sequencing, transcriptome sequencing, pooled population sequencing.

Objectives
1. Unspliced alignment of reads onto the genome
2. Spliced alignment of transcript reads over exon-intron boundaries
3. SNP detection from population sequences

Seed hash table

Reads:
Read 1   TACACCACGGTCAGACTTGCATCACAACTGTTAAGC
Read 2   AGACTTGCATCACAACTGTTAAGCTACACCACGGTC
Read n   …

Seed hash table (seed, followed by its occurrences):
TACACCACGGTC   Position 1, Read 1, +; Position 25, Read 2, +; …
AGACTTGCATCA   Position 13, Read 1, +; Position 1, Read 2, +; …
TGATGCAAGTCT   Position 25, Read 1, -; Position 13, Read 2, -; …
Other seeds (k-mers)   …
GACCGTGGTGTA   Position 1, Read 1, -; Position 25, Read 2, -; …
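As a minimal sketch of how such a table could be built, the Python below cuts each read into non-overlapping 12-mers and files every seed under both its forward sequence and its reverse complement. The function names, the non-overlapping seed layout (inferred from the positions 1, 13 and 25 on the slide) and the convention that a minus-strand entry records the seed's position on the original read are assumptions made for illustration. The slide's own minus-strand positions may follow a different convention.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def build_seed_table(reads, k=12):
    """Map each seed to a list of (read id, 1-based position, strand).

    Assumption: seeds are non-overlapping k-mers, and a '-' entry stores
    the position of the seed on the original (forward) read.
    """
    table = defaultdict(list)
    for read_id, read in enumerate(reads, start=1):
        for start in range(0, len(read) - k + 1, k):
            seed = read[start:start + k]
            table[seed].append((read_id, start + 1, "+"))
            table[reverse_complement(seed)].append((read_id, start + 1, "-"))
    return table

reads = [
    "TACACCACGGTCAGACTTGCATCACAACTGTTAAGC",  # Read 1 from the slide
    "AGACTTGCATCACAACTGTTAAGCTACACCACGGTC",  # Read 2 from the slide
]
table = build_seed_table(reads)
for seed in ("TACACCACGGTC", "AGACTTGCATCA", "GACCGTGGTGTA"):
    print(seed, table[seed])
```

A dictionary keyed by the seed string keeps the sketch short; a numeric key into a fixed-size array is the more compact alternative.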

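Seed hash table: key computation of the seed

Coding: A: 0, T: 1, G: 2, C: 3

k-mer CCGATT: $\text{key} = 3 \cdot 4^5 + 3 \cdot 4^4 + 2 \cdot 4^3 + 0 \cdot 4^2 + 1 \cdot 4^1 + 1 \cdot 4^0$

Seed hash table, indexed by key [0] [1] [2] [..] [n]: each entry lists (read id, position, strand), e.g. (1,1,+) (2,13,-) …
Reads, indexed by read id [0] [1] [2] [..] [n]: each entry holds the read sequence, e.g. [1] CCGATTGGCTAAA …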
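The key computation can be sketched in a few lines of Python. The base coding is the one given on the slide; the function names and the direct-address list with 4^k slots are illustrative assumptions, not MapNext's actual data layout.

```python
# Base coding as given on the slide.
BASE_CODE = {"A": 0, "T": 1, "G": 2, "C": 3}

def seed_key(kmer):
    """Read the k-mer as a base-4 number, most significant base first."""
    key = 0
    for base in kmer:
        key = key * 4 + BASE_CODE[base]
    return key

def empty_seed_table(k):
    """Direct-address table with one slot per possible k-mer (4**k slots)."""
    return [[] for _ in range(4 ** k)]

k = 6
table = empty_seed_table(k)
example_key = seed_key("CCGATT")
# Hypothetical entry: seed CCGATT occurs in read 1 at position 1, forward strand.
table[example_key].append((1, 1, "+"))
print(example_key, table[example_key])
```

Because every k-mer maps to a distinct integer below 4^k, a lookup is a single array access, which is where the O(1) on the following slides comes from.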

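Unspliced alignment
Genome: TACACCACGGTCAGACTTGCATCA … (k-mer: 8-12 bp; step size: 1 bp)
Each k-mer is converted to a key (e.g. Key=3) and looked up in the seed hash table in O(1): [3] (1,1,+) (2,13,-) …
Each entry (read id, position, strand) points to a read sequence in the reads array, which is then extended along the genome (Extension).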
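The seed-and-extend loop described on this slide might look roughly like the sketch below. It indexes the reads, slides a k-mer window along the genome one base at a time, and extends each seed hit by comparing the whole read with the genome. The ungapped, mismatch-counting extension, the `max_mismatches` threshold and all function names are assumptions for illustration; the slide does not say how MapNext performs the extension.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def index_reads(reads, k):
    """Map each non-overlapping seed to (read index, strand, offset in the oriented read)."""
    table = defaultdict(list)
    for idx, read in enumerate(reads):
        for strand, seq in (("+", read), ("-", reverse_complement(read))):
            for off in range(0, len(seq) - k + 1, k):
                table[seq[off:off + k]].append((idx, strand, off))
    return table

def align_unspliced(genome, reads, k=8, max_mismatches=2):
    """Return sorted (read index, 1-based genome start, strand, mismatches)."""
    table = index_reads(reads, k)
    oriented = {"+": reads, "-": [reverse_complement(r) for r in reads]}
    hits = set()
    for g in range(len(genome) - k + 1):  # step size: 1 bp
        for idx, strand, off in table.get(genome[g:g + k], ()):
            seq = oriented[strand][idx]
            start = g - off
            if start < 0 or start + len(seq) > len(genome):
                continue
            # Assumed extension: ungapped comparison over the full read length.
            window = genome[start:start + len(seq)]
            mismatches = sum(a != b for a, b in zip(seq, window))
            if mismatches <= max_mismatches:
                hits.add((idx, start + 1, strand, mismatches))
    return sorted(hits)

genome = "TTGA" + "TACACCACGGTCAGACTTGCATCA" + "CAACTG"
reads = ["CACGGTCAGACTTGCA", reverse_complement("CACGGTCAGACTTGCA")]
alignments = align_unspliced(genome, reads)
print(alignments)
```

Collecting hits in a set removes the duplicates that arise when several seeds of the same read agree on one genome position.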

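Spliced alignment
Genome: TACACCACGGTCAGACTTGCATCA … (k-mer: 6-10 bp; step size: 1 bp)
Hash table, looked up by key in O(1) (e.g. Key=2): each entry lists (read id, position, strand), where the position is H or T, e.g. [2] (1,H,+) (2,T,-) …
Seed hit list, one entry per read: (genome position, read position, strand), e.g. [1] (1,H,+) (780,T,+) …; [2] (1,T,-) …
Reads: read sequences indexed by read id, e.g. [1] TACACCACG …
Example: the read TACACCACGGTCAGA GTGCCATGGCTAGT aligns to the genome as TACACCACGGTCAGA gt ac … cc ag GTGCCATGGCTAGT, with seed hits at genome positions 1 and 780.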
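One way to picture the pairing of seed hits across an intron is the sketch below. It takes a seed from each end of the read, finds where both occur in the genome, and tries every split point that leaves an intron beginning with GT and ending with AG between the two exon segments. Several simplifications are assumptions here: `str.find` stands in for the hash-table lookup, only exact forward-strand matches are considered, H and T are read as the head and tail of the read, and the names and the `min_intron`/`max_intron` limits are hypothetical.

```python
def find_all(text, pattern):
    """Yield every 0-based start of pattern in text (stand-in for a seed lookup)."""
    pos = text.find(pattern)
    while pos != -1:
        yield pos
        pos = text.find(pattern, pos + 1)

def align_spliced(genome, read, k=6, min_intron=10, max_intron=10000):
    """Return (exon1 start, exon1 end, exon2 start, exon2 end), 1-based inclusive."""
    n = len(read)
    head, tail = read[:k], read[-k:]
    results = []
    for h in find_all(genome, head):
        for t in find_all(genome, tail):
            tail_end = t + k                 # exclusive end of the read on the genome
            intron_len = tail_end - h - n    # genome span minus read length
            if not min_intron <= intron_len <= max_intron:
                continue
            for split in range(k, n - k + 1):
                donor = h + split            # 0-based first intron base
                acceptor = donor + intron_len  # 0-based first base of exon 2
                if (genome[h:donor] == read[:split]
                        and genome[acceptor:tail_end] == read[split:]
                        and genome[donor:donor + 2] == "GT"
                        and genome[acceptor - 2:acceptor] == "AG"):
                    results.append((h + 1, donor, acceptor + 1, tail_end))
    return results

exon1, exon2 = "TACACCACGGTCAGA", "GTGCCATGGCTAGT"  # read halves from the slide
intron = "GTAAGTTTTCCCTTTCAG"                        # made-up intron for the example
genome = "CC" + exon1 + intron + exon2 + "AA"
spliced = align_spliced(genome, exon1 + exon2)
print(spliced)
```

Requiring the GT and AG dinucleotides resolves most of the ambiguity about where the read is split, which fits the slide's point that no training step is needed.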

Accuracy of alignment
A total of [count not preserved] reads (35 bp long, spliced and unspliced) were simulated from 5796 coding DNA sequences of chromosome I of Arabidopsis thaliana as the query dataset.
Columns for unspliced alignment: true positive (%), false positive (%), running time (s).
Columns for spliced alignment: true positive (%), false positive (%), running time (m).
Programs compared: SHRiMP, SeqMap, SOAP and MAQ (spliced alignment: N/A); Qpalma (unspliced alignment: N/A); MapNext (both). [Numeric values not preserved in this transcript.]

SNP detection from population sequences

… TACACACGGTCAGACTAGCATCAGTCCGTAATGCT …
      CACGGTCAGACGAGCATCAGTCC
    CACACGGTCAGACGAGCATCAGT
         GGTCAGACGAGCATCAGTCCGTA
            CAGACTAGCATCAGTCCGTAATG
    CACACGGTCAGACTAGCATCAGT
         GGTCAGACTAGCATCAGACCGTA
         GGTCAGACTAGCATCAGTCCGTA
        CGGTCAGACTAGCATCAGTCCG

Quality control: minimum quality score (MQS), minimum neighbour quality score (MNQS)
Significance control: minimum coverage (MC), minimum minor allele frequency (MMAF)

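SNP detection from population sequences (flowchart)
1. Start from the clustered short reads.
2. Did the reads pass QC? N: stop. Y: continue.
3. Is the polymorphic site covered by at least MC reads? N: stop. Y: continue.
4. Is the frequency of the minor allele higher than MMAF? N: stop. Y: continue.
5. Report the site as a candidate SNP.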
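The three checks in the flowchart can be sketched as a single filter over the bases observed at one site. The default thresholds are the values quoted in the transcript for the simulation (25, 20, 50 and 0.01); the observation format (base, base quality, lowest neighbouring quality), the inclusive quality comparisons and the function name are assumptions for illustration.

```python
from collections import Counter

def call_snp(observations, mqs=25, mnqs=20, mc=50, mmaf=0.01):
    """Return (major allele, minor allele, minor allele frequency) or None.

    observations: iterable of (base, base quality, lowest neighbour quality),
    one tuple per read covering the site (assumed input format).
    """
    # Quality control: drop bases below MQS or with low-quality neighbours.
    passed = [base for base, q, nq in observations if q >= mqs and nq >= mnqs]
    # Significance control 1: minimum coverage.
    if len(passed) < mc:
        return None
    counts = Counter(passed).most_common(2)
    if len(counts) < 2:  # only one allele observed: not polymorphic
        return None
    (major, _), (minor, minor_n) = counts
    maf = minor_n / len(passed)
    # Significance control 2: minor allele frequency must exceed MMAF.
    if maf <= mmaf:
        return None
    return major, minor, maf

site = [("T", 30, 30)] * 45 + [("G", 30, 30)] * 10 + [("G", 10, 30)] * 5
result = call_snp(site)
print(result)
```

The minor allele frequency returned for a passing site is the same quantity the presentation later compares with the real frequency.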

Accuracy of SNP detection from population sequencing

Coverage   True positive    False positive
4X         1961 (90.70%)    690 (29.51%)
6X         1998 (92.41%)    23 (1.06%)
8X         2015 (93.20%)    8 (0.37%)
10X        2043 (94.50%)    0 (0.00%)
12X        2068 (95.65%)    0 (0.00%)

There were 2162 true SNPs in 50 haploid individuals in our simulation. Coverage is the sequencing depth per individual. MQV, MNQV, MMAF and MC were set to 25, 20, 0.01 and 50 (1X per individual), respectively.

Accuracy of MAF estimation from population sequencing [Figure: estimated minor allele frequency compared with real minor allele frequency]

Summary 1. MapNext supports both spliced and unspliced alignment of short reads; spliced alignment does not require a training process. 2. MapNext can detect SNPs and estimate minor allele frequencies from population sequences.

Thank you!