Reliable Identification of Genomic Variants from RNA-seq Data
Robert Piskol, Gokul Ramaswami, Jin Billy Li
Presented by Gayathri Rajan, Vineela Gangalapudi

CONTENTS
- Why find genomic variants?
- Current techniques for identifying genomic variants
- Limitations of current protocols
- What is RNA-seq?
- SNPiR protocol
- Results
- Conclusions

Why is finding genomic variants important?
- Differentiate genotype and phenotype
- Basis of diseases such as cancer and Mendelian diseases
Current methods:
- WGS, WES
Proposed method:
- SNPiR, an RNA-seq based method

RNA-seq (figure slide; source link truncated to "seq.html")

Why RNA-seq over WGS?
- Cost-effective
- Answers multiple questions
  - Gene expression
  - Alternative splicing
  - Allele-specific expression
  - Gene fusion
  - RNA editing
- Validates variants found from WGS data
- De novo calling: identifies new variants
- Heterogeneity of diseases: variant calling from WGS data

Difficulty:
- Splicing
- Errors in read alignment
Solution:
- Strong filtering
- Analysing data from multiple individuals

Incorrect mapping of RNA reads to the reference genome
- Highly similar regions
- Artifacts in library construction
- Not mapping reads in a splice-aware manner
  - Average exon length: ~150 bp
  - Read length: 100 bp (so a read has a high probability of spanning a splice site)
- Alternative splicing
- RNA editing

SNPiR protocol:
- Mapping reads in a splice-aware manner
- Variant calling using GATK (complicated by errors produced during library preparation and by highly similar regions)
- Rigorous filtering of false positives (evaluated by comparison on two well-characterized samples)

CANDIDATE SEQUENCES
- GM12878 lymphoblastoid cells and peripheral-blood mononuclear cells (PBMCs)
- The transcriptome, exome, and whole genome of these samples have been deeply sequenced.
- The matched RNA and DNA samples enable verification of RNA SNP calls, because the calls can be compared to the variation present in the DNA.
- The GM12878 cell line has been extensively studied, and SNPs detected in its genome have been continuously deposited into dbSNP.

Mapping RNA-seq data:
- GM12878 lymphoblastoid cells (from ENCODE)
  - HiSeq: two replicates of [counts missing in source] million paired-end 76 bp reads
- Peripheral-blood mononuclear cells (PBMCs) from one healthy individual
  - HiSeq: 20-point time series, 3,232 million paired-end 101 bp reads
- Burrows-Wheeler Aligner (BWA): reads mapped to a combined reference of the genome and transcriptome, i.e. hg19 plus exonic sequences surrounding known splice sites

Selection Criteria
- Alignment of the splice regions
- BLAT step: reads with q > 10 selected
- SAMtools: remove identical reads mapped to the same location
- Retain reads with the highest mapping quality
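The selection criteria above can be sketched in a few lines. This is only an illustration, not the SNPiR implementation: it assumes that "q" is a mapping quality score, that "same location" means the same chromosome, start position, and strand, and it uses a hypothetical `Read` record instead of real SAM/BAM parsing.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal read record (not a real SAM/BAM parser)."""
    name: str
    chrom: str
    pos: int
    strand: str
    mapq: int


def select_reads(reads, min_mapq=10):
    """Keep reads with mapq > min_mapq; among reads mapped to the same
    location, retain only the one with the highest mapping quality."""
    best = {}
    for read in reads:
        if read.mapq <= min_mapq:
            continue  # fails the q > 10 criterion
        key = (read.chrom, read.pos, read.strand)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.chrom, r.pos, r.strand))


reads = [
    Read("r1", "chr1", 100, "+", 60),
    Read("r2", "chr1", 100, "+", 30),  # duplicate location, lower quality
    Read("r3", "chr1", 250, "-", 8),   # mapping quality too low
    Read("r4", "chr2", 40, "+", 20),
]
selected = select_reads(reads)
print([r.name for r in selected])
```

In practice the duplicate-removal step is done with SAMtools on the alignment file, as the slide states; the dictionary here only mimics that behaviour on toy data.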

RNA-seq: variant calling and filtering
GATK
- IndelRealigner (local realignment)
- TableRecalibration (base-score recalibration)
- UnifiedGenotyper (candidate variant calling)

Filtering candidate variants:
- Loose filtering: calls with Q > 20 selected
- BLAT: used to remap all the reads supporting a variant
- Ignored variants in homopolymer runs >5 bp, within 4 bp of splice junctions, or in the first six bases of a read
- Removed variants at known RNA editing sites
- ANNOVAR: annotate variants based on gene models (GENCODE, RefSeq, Ensembl, UCSC browser)
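A minimal sketch of three of the positional filters listed above (homopolymer runs, splice-junction proximity, and read-start bases). The variant dictionaries, 0-based coordinates, and function names are assumptions made for illustration; the thresholds follow the slide, and the BLAT remapping and RNA-editing steps are not modelled.

```python
def homopolymer_run_length(seq, idx):
    """Length of the run of identical bases that contains seq[idx]."""
    base = seq[idx]
    left = idx
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = idx
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1


def passes_filters(variant, ref_seq, junctions,
                   max_run=5, junction_window=4, read_start_bases=6):
    """Apply the positional filters to one candidate variant.

    variant: dict with 'pos' (0-based reference index) and 'read_offsets'
    (0-based position of the variant within each supporting read).
    """
    pos = variant["pos"]
    if homopolymer_run_length(ref_seq, pos) > max_run:
        return False  # inside a homopolymer run longer than 5 bp
    if any(abs(pos - j) <= junction_window for j in junctions):
        return False  # within 4 bp of a splice junction
    # Ignore support that comes only from the first six bases of reads.
    usable = [off for off in variant["read_offsets"] if off >= read_start_bases]
    return len(usable) > 0


ref = "ACGTAAAAAAGTCAGTCCGATCGTAGCTAGGCT"
junctions = [20]
candidates = [
    {"id": "v1", "pos": 6, "read_offsets": [10]},     # in a run of six A's
    {"id": "v2", "pos": 18, "read_offsets": [10]},    # 2 bp from the junction
    {"id": "v3", "pos": 26, "read_offsets": [2, 3]},  # only read-start support
    {"id": "v4", "pos": 30, "read_offsets": [1, 40]},
]
kept = [v["id"] for v in candidates if passes_filters(v, ref, junctions)]
print(kept)
```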

Categorised variants
- Known: found support in WGS data or the SNP database
- Novel: found only in RNA-seq
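The known/novel split amounts to set membership. The sketch below assumes each variant is represented as a hypothetical `(chrom, pos, ref, alt)` tuple; the example sites are made up.

```python
def categorise(rna_variants, wgs_variants, dbsnp_variants):
    """Split RNA-seq variant calls into known and novel sets.

    Known: also present in WGS data or in the SNP database.
    Novel: found only in RNA-seq.
    """
    supported = wgs_variants | dbsnp_variants
    known = rna_variants & supported
    novel = rna_variants - supported
    return known, novel


# Made-up example sites: (chrom, pos, ref, alt)
rna = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "T", "C")}
wgs = {("chr1", 100, "A", "G")}
dbsnp = {("chr1", 250, "C", "T"), ("chr3", 7, "G", "A")}

known, novel = categorise(rna, wgs, dbsnp)
print(sorted(known))
print(sorted(novel))
```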

WES, WGS variant calling and filtering
- Lymphoblastoid cells: 1000 Genomes Project (coverage 44x)
- PBMC: Sequence Read Archive
- Mapping with BWA
- Variant calling with GATK, using the same parameters as in the 1000 Genomes Project
- Generate gold standard for reads

Results - known sites:
- 99.6% (172,322) of variants in GM12878 and 97.7% (292,224) of variants in PBMC were supported by evidence from WGS or dbSNP
- For known sites, ts/tv ratios were about 2.25, and ~3 in exonic regions
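The ts/tv (transition/transversion) ratio quoted above is a simple count ratio: transitions are purine-to-purine (A<->G) or pyrimidine-to-pyrimidine (C<->T) changes, and all other substitutions are transversions. A small sketch, with a made-up list of substitutions:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv_ratio(substitutions):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    snvs = [(r, a) for r, a in substitutions if r != a]
    ts = sum(1 for r, a in snvs if is_transition(r, a))
    tv = len(snvs) - ts
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ts / tv


# Made-up example: 4 transitions and 2 transversions
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(ts_tv_ratio(calls))  # 2.0
```

A ratio well below the expected value is commonly read as a sign of false-positive calls, which is how the slides use it in the RNASEQR comparison.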

Novel sites:
- ~27% of novel variants in GM12878 and ~7% in PBMC were supported by variant reads in WGS data
- Higher ts/tv ratios than for the known sites
- Remaining novel sites: enrichment of A>G and T>C variation, which points to RNA editing
- The RNA editing catalogue is not yet complete

Enrichment of Variants in Functional Categories
- SNPiR detects variants better in coding exons, UTRs, and introns
- 33.4% of SNPs identified by WES in coding regions of GM12878 cells were also identified by SNPiR

High sensitivity
- WGS and WES: no variant detection in coding region
- SNPiR: 40.2% and 47.7% of variants in GM12878 and PBMCs
- When the RNA-seq variants were compared to WGS variants in expressed genes only, the sensitivity rose to >70% (many genes are expressed at low levels)
- Similar results were obtained for random samplings of 5, 10, 20, 50, and 100 million reads from the GM12878 RNA-seq data set (showing high precision for low-depth regions)

Comparison of Sensitivity and Precision between RNA-Seq and WES Experiments
- Consensus coding region:
  - PBMC WES library: 94.1 million mapped reads
  - 22,052 variants through WGS
  - 17,922 of them (81.3%) through WES
  - 9,892 of them (44.9%) through RNA-seq
- Exon regions: 23,693 (38.2%) of WGS variants by using WES and 24,987 (40.3%) by using RNA-seq
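The percentages for the consensus coding region follow directly from the counts on the slide, taking the WGS call set as the reference. A short check (the function name is chosen here for illustration):

```python
def sensitivity(detected, reference_total):
    """Percentage of reference (WGS) variants recovered by a method."""
    if reference_total <= 0:
        raise ValueError("reference_total must be positive")
    return 100.0 * detected / reference_total


wgs_total = 22052                         # coding variants found through WGS
wes_pct = sensitivity(17922, wgs_total)   # recovered through WES
rna_pct = sensitivity(9892, wgs_total)    # recovered through RNA-seq

print(f"WES: {wes_pct:.1f}%  RNA-seq: {rna_pct:.1f}%")
# WES: 81.3%  RNA-seq: 44.9%
```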

Comparison to RNASEQR (a Bowtie-based mapping program with high accuracy)
- SNPiR identified a smaller number of variants
- Novel SNPs from SNPiR show enrichment of A>G and T>C
- ts/tv ratios of variants identified by RNASEQR were low (suggesting false-positive hits)
- Novel SNPs from RNASEQR did not show enrichment of A>G and T>C

- A portion of novel RNASEQR variants did not show support in WGS data
- Of the 23,878 coding SNPs identified from WGS, SNPiR identified 9,607 (40.2%) and RNASEQR identified 5,571

Conclusions
- A computational approach for accurate identification of genomic variants from transcriptome sequencing, combining a splice-aware RNA-seq read-mapping procedure with subsequent variant filtering that takes the specifics of RNA-seq experiments into account
- For the highest possible accuracy, reads are mapped simultaneously to the reference genome and to short pseudochromosomes created from the sequences around all currently known splice junctions
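The pseudochromosome idea can be illustrated by joining the exonic sequence on either side of a known junction, so that a spliced read aligns to it without a gap. This is a sketch under stated assumptions, not the SNPiR code: coordinates are 0-based, the flank length of read length minus one is an assumption (it lets any junction-spanning read fit entirely within the joined sequence), and exons shorter than the flank are not handled.

```python
def junction_sequence(genome, chrom, donor_end, acceptor_start, read_length):
    """Build one splice-junction pseudochromosome.

    donor_end: exclusive end of the upstream exon (0-based).
    acceptor_start: start of the downstream exon (0-based).
    Returns the last `flank` exonic bases before the junction joined to
    the first `flank` exonic bases after it.
    """
    flank = read_length - 1  # assumed flank size
    seq = genome[chrom]
    left = seq[max(0, donor_end - flank):donor_end]
    right = seq[acceptor_start:acceptor_start + flank]
    return left + right


# Toy genome: exon (upper case), intron (lower case), exon (upper case)
genome = {"chr1": "AAAACCCC" + "gtnnnnag" + "GGGGTTTT"}
pseudo = junction_sequence(genome, "chr1", donor_end=8, acceptor_start=16,
                           read_length=5)
print(pseudo)  # CCCCGGGG
```

A read such as "CCGGG", which spans the junction, matches `pseudo` contiguously, whereas against the genome alone it would need a gapped (spliced) alignment.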

- More precise and sensitive than TopHat2
- More novel RNA variants are likely to be found in previously unstudied data sets (with potential roles in diseases)
- Future directions: accurate read mapping without a well-assembled genome and a well-annotated transcriptome

Questions?