Reconstruction of Haplotype Spectra from NGS Data Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science & Engineering.

Presentation transcript:

Reconstruction of Haplotype Spectra from NGS Data Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science & Engineering University of Connecticut

Haplotype Spectra Reconstruction Given NGS reads, reconstruct: – Full length sequences – Sequence frequencies Example applications: – Single individual haplotyping – Allele specific transcriptome reconstruction – Viral quasispecies reconstruction

Single Individual Haplotyping Somatic cells are diploid, containing two nearly identical copies of each autosomal chromosome – Heterozygous loci found by mapping reads to reference genome – Long haplotype fragments can be generated by sequencing fosmid pools [Duitama et al. 2012]

Single Individual Haplotyping Input: Matrix M of m fragments covering n loci (rows = fragments f1 … fm, columns = loci 1 … n, * = locus not covered by the fragment) f1 = *01100, f2 = 110*11, …, fm = **1*11

RefHap Algorithm [Duitama et al. 12] (1) Reduce the problem to Max-Cut (2) Solve Max-Cut (3) Build haplotypes according to the cut Example fragments over loci 1–5: f1 = *0110, f2 = 110*1, f3 = 1**0*, f4 = *00*1 [Figure: weighted graph on fragments f1–f4 and the haplotypes h built from the cut] Chr. 22, 32k SNPs, 14k fragments
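A minimal Python sketch of the three RefHap steps, run on the four example fragments from this slide. The edge weight used here (mismatches minus matches over loci covered by both fragments) and the exhaustive Max-Cut search are assumptions made for illustration: the slide does not spell out the scoring, and an instance the size of Chr. 22 would need a heuristic Max-Cut solver instead. All function names are hypothetical.

```python
from itertools import combinations, product


def edge_weight(a, b):
    """Assumed score: mismatches minus matches over loci covered by both fragments."""
    w = 0
    for x, y in zip(a, b):
        if x != "*" and y != "*":
            w += 1 if x != y else -1
    return w


def max_cut_brute_force(frags):
    """Try every 2-way partition (fragment 0 fixed on side 0) and keep the best cut.

    Exponential in the number of fragments, so only suitable for toy inputs.
    """
    m = len(frags)
    best_score, best_side = None, None
    for rest in product((0, 1), repeat=m - 1):
        side = (0,) + rest
        score = sum(
            edge_weight(frags[i], frags[j])
            for i, j in combinations(range(m), 2)
            if side[i] != side[j]
        )
        if best_score is None or score > best_score:
            best_score, best_side = score, side
    return best_score, best_side


def haplotypes_from_cut(frags, side):
    """Majority vote per locus; fragments on side 1 vote for the complementary allele.

    Ties and uncovered loci default to allele 0 in this sketch.
    """
    h1 = []
    for k in range(len(frags[0])):
        votes = 0
        for f, s in zip(frags, side):
            if f[k] == "*":
                continue
            allele = int(f[k]) ^ s
            votes += 1 if allele == 1 else -1
        h1.append("1" if votes > 0 else "0")
    h1 = "".join(h1)
    h2 = "".join("1" if c == "0" else "0" for c in h1)
    return h1, h2


frags = ["*0110", "110*1", "1**0*", "*00*1"]  # f1..f4 from the slide
score, side = max_cut_brute_force(frags)
print(score, side)                       # 5 (0, 1, 1, 1)
print(haplotypes_from_cut(frags, side))  # ('00110', '11001')
```

Under this scoring, f1 ends up alone on one side of the cut because it disagrees with f2 at all three loci they share.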

Haplotype Spectra Reconstruction Given short sequence fragments, reconstruct: – Full length sequences – Sequence frequencies Example applications: – Single individual haplotyping – Allele specific transcriptome reconstruction – Viral quasispecies reconstruction

Transcriptome Reconstruction Challenge: Alternative Splicing [Griffith and Marra 07]

[Figure: transcripts t1, t2, t3, t4]

TRIP: Transcriptome Reconstruction using Integer Programming (1) Map the RNA-Seq reads to the genome (2) Construct splice graph G(V,E) – V: exons – E: splicing events (3) Generate candidate transcripts – Depth-first search (DFS) (4) Filter candidate transcripts – Fragment length distribution (FLD) – Integer programming
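The candidate-generation step can be sketched as a depth-first enumeration of paths in the splice graph. This is an illustrative Python sketch, not the TRIP implementation: it assumes the splice graph is a DAG given as exon names plus directed splicing edges, and that a candidate transcript is any path from an exon with no incoming edge to an exon with no outgoing edge. The exon names and the function name are hypothetical.

```python
def enumerate_transcripts(exons, splice_edges):
    """Return every source-to-sink path of the splice graph, found by DFS."""
    succ = {e: [] for e in exons}
    has_pred = set()
    for u, v in splice_edges:
        succ[u].append(v)
        has_pred.add(v)

    paths = []

    def dfs(node, path):
        path.append(node)
        if not succ[node]:          # sink exon: the current path is a candidate
            paths.append(tuple(path))
        else:
            for nxt in succ[node]:
                dfs(nxt, path)
        path.pop()                  # backtrack

    for source in (e for e in exons if e not in has_pred):
        dfs(source, [])
    return paths


exons = ["e1", "e2", "e3", "e4"]
splice_edges = [("e1", "e2"), ("e1", "e3"), ("e2", "e3"), ("e2", "e4"), ("e3", "e4")]
for t in enumerate_transcripts(exons, splice_edges):
    print("-".join(t))
# e1-e2-e3-e4
# e1-e2-e4
# e1-e3-e4
```

The number of paths can grow exponentially with the number of exons, which is one reason a filtering step follows.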

How to filter? Select the smallest set of putative transcripts that yields a good statistical fit between the fragment length distribution – empirically determined during library preparation – implied by “mapping” read pairs Example: mean 500; std. dev. 50
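The filter compares the fragment length implied by mapping a read pair onto a candidate transcript with the library's fragment length distribution. A hedged Python sketch of that comparison follows. It assumes each transcript is a list of inclusive exon intervals in genome coordinates, and that a transcript is kept for a read pair when the implied length lies within k standard deviations of the mean. The mean of 500 and std. dev. of 50 come from the slide, while k = 1, the coordinates, and all names are illustrative assumptions. The integer program that then selects the smallest transcript set is not shown.

```python
def implied_fragment_length(transcript_exons, left, right):
    """Transcript bases between genomic positions left and right (inclusive).

    Returns None when either end of the fragment falls outside the transcript's exons.
    """
    def in_exon(pos):
        return any(s <= pos <= e for s, e in transcript_exons)

    if left > right or not (in_exon(left) and in_exon(right)):
        return None
    length = 0
    for s, e in transcript_exons:
        lo, hi = max(s, left), min(e, right)
        if lo <= hi:
            length += hi - lo + 1   # only exonic bases count; introns are spliced out
    return length


def compatible_transcripts(transcripts, left, right, mean=500.0, std=50.0, k=1.0):
    """Names of transcripts whose implied fragment length is within k std. dev. of the mean."""
    kept = []
    for name, exons in transcripts.items():
        length = implied_fragment_length(exons, left, right)
        if length is not None and abs(length - mean) <= k * std:
            kept.append(name)
    return kept


transcripts = {
    "t1": [(1, 300), (1001, 1300), (2001, 2300)],  # includes the middle exon
    "t2": [(1, 300), (2001, 2300)],                # skips the middle exon
}
print(implied_fragment_length(transcripts["t1"], 201, 2100))  # 500
print(implied_fragment_length(transcripts["t2"], 201, 2100))  # 200
print(compatible_transcripts(transcripts, 201, 2100))         # ['t1']
```

In this toy case the read pair is evidence for t1 only: on t2 the same pair would imply a 200-base fragment, six standard deviations below the mean.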

Allele Specific Expression

Haplotype Spectra Reconstruction Given short sequence fragments, reconstruct: – Full length sequences – Sequence frequencies Example applications: – Single individual haplotyping – Allele specific transcriptome reconstruction – Viral quasispecies reconstruction

RNA Virus Replication High mutation rate (~10^-4) Lauring & Andino, PLoS Pathogens 2011

How Are Quasispecies Contributing to Virus Persistence and Evolution? Variants differ in – Virulence – Ability to escape immune response – Resistance to antiviral therapies – Tissue tropism Lauring & Andino, PLoS Pathogens 2011

Shotgun vs. Amplicon Reads Shotgun reads: starting positions distributed ~uniformly Amplicon reads: predefined start/end positions covering fixed overlapping windows

Reconstruction from Shotgun Reads: ViSpA Shotgun reads → Read Error Correction → Read Alignment → Preprocessing of Aligned Reads → Read Graph Construction → Contig Assembly → Frequency Estimation → Quasispecies sequences w/ frequencies

Reconstruction from Amplicon Reads: VirA Inputs: reference in FASTA format, error-corrected SAM/BAM read data → Estimate Amplicons → Amplicon Read Graph → Max-Bandwidth Paths → Frequency Estimation → Viral population variants with frequencies

Amplicon Read Graph K amplicons represented by K-layer read graph Vertices ⇔ distinct reads Edges ⇔ reads with consistent overlap Vertices have count function c(v)
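A small Python sketch of a K-layer read graph and of a max-bandwidth path through it, the path type named in the VirA pipeline. Several simplifications are assumed here and are not taken from the slides: reads are error-free strings, consecutive amplicons overlap by a fixed number of bases, two reads overlap consistently when the suffix of one equals the prefix of the next, and the bandwidth of a path is the smallest count c(v) along it. The function names are hypothetical.

```python
from collections import Counter


def build_read_graph(layers, overlap):
    """layers[k] lists the reads of amplicon k; returns per-layer counts c(v) and edges."""
    if overlap <= 0:
        raise ValueError("overlap must be positive")
    counts = [Counter(reads) for reads in layers]   # vertices = distinct reads
    edges = []
    for k in range(len(layers) - 1):
        edges.append({
            u: [v for v in counts[k + 1] if u[-overlap:] == v[:overlap]]
            for u in counts[k]
        })
    return counts, edges


def max_bandwidth_path(counts, edges):
    """Layer-by-layer DP: best[v] = (bottleneck count, path) over paths ending at v."""
    best = {v: (c, (v,)) for v, c in counts[0].items()}
    for k in range(1, len(counts)):
        nxt = {}
        for u, (bw, path) in best.items():
            for v in edges[k - 1][u]:
                cand = min(bw, counts[k][v])
                if v not in nxt or cand > nxt[v][0]:
                    nxt[v] = (cand, path + (v,))
        best = nxt
    if not best:
        return 0, ()
    return max(best.values(), key=lambda item: item[0])


layers = [
    ["ACGT", "ACGT", "ACGA"],
    ["GTTC", "GTTC", "GATC"],
    ["TCAA", "TCAA", "TCAG"],
]
counts, edges = build_read_graph(layers, overlap=2)
print(max_bandwidth_path(counts, edges))  # (2, ('ACGT', 'GTTC', 'TCAA'))
```

In this toy input every read of the second amplicon overlaps consistently with every read of the third, so those two layers form a complete bipartite subgraph.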

Read Graph Transformation Heuristic to reduce edges in dense graphs Replace bipartite cliques with star subgraphs
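The edge saving from this transformation can be illustrated with a short Python sketch. A complete bipartite subgraph between a reads on one layer and b reads on the next has a·b edges; routing them through one extra hub vertex needs only a + b edges while keeping every left-to-right connection. How VirA finds the cliques and what count the hub carries are not stated on the slide, so the hub vertex, its name, and the function below are assumptions for illustration only.

```python
def clique_to_star(edges, left, right, hub):
    """Replace the complete bipartite edge set left x right by left -> hub -> right."""
    left, right = list(left), list(right)
    clique = {(u, v) for u in left for v in right}
    if not clique <= edges:
        raise ValueError("left x right is not a complete bipartite subgraph")
    star = {(u, hub) for u in left} | {(hub, v) for v in right}
    return (edges - clique) | star


left = ["a1", "a2", "a3"]
right = ["b1", "b2", "b3"]
edges = {(u, v) for u in left for v in right} | {("a1", "c")}  # 9 clique edges + 1 other
reduced = clique_to_star(edges, left, right, hub="hub")
print(len(edges), len(reduced))  # 10 7
```

The edge outside the clique is left untouched, and each original pair (u, v) is still connected by the two-edge path u → hub → v.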

Challenges Scalability – Exploit inherent sparsity of biological instances – E.g., exact scaffolding algorithm using non-serial dynamic programming based on SPQR trees Flexibility – Long (noisy) reads + short reads – Heterogeneous data, e.g., RNA-Seq + TSSeq + PolyA-Seq Quantifying reconstruction uncertainty – Compute intensive, e.g., bootstrapping

Acknowledgements Jorge Duitama, Sahar Al Seesi, Mazhar Khan, Rachel O’Neill, Alexander Artyomenko, Adrian Caciula, Nicholas Mancuso, Serghei Mangul, Bassam Tork, Alex Zelikovsky, Irina Astrovskaya, Pavel Skums