Published by Nicholas Warren. Modified over 8 years ago.
1
Reliable Identification of Genomic Variants from RNA-seq Data
Robert Piskol, Gokul Ramaswami, Jin Billy Li
Presented by Gayathri Rajan and Vineela Gangalapudi
2
CONTENTS
Use of finding genome variants
Current techniques for identifying gene variants
Limitations of current protocols
What is RNA-seq?
SNPiR protocol
Results
Conclusions
3
Why is finding genomic variants important?
Differentiate genotype and phenotype
Basis of diseases like cancer and Mendelian diseases
Current methods: WGS, WES
Proposed method: SNPiR, an RNA-seq based method
4
RNA-seq
http://bioinfo.mc.vanderbilt.edu/NGS/rna-seq.html
5
Why RNA-seq over WGS?
Cost effective
Answers multiple questions
- Gene expression
- Alternative splicing
- Allele specific expression
- Gene fusion
- RNA editing
Validates variants found from WGS data
De novo calling: identifies new variants
Heterogeneity of diseases: variant calling from WGS data
6
Difficulty:
- Splicing
- Errors in read alignment
Solution:
- Strong filtering
- Analysing data from multiple individuals
7
Incorrect mapping of RNA reads to reference genome
Highly similar regions
Artifacts in library construction
Not mapping reads in a splice-aware manner
- Average exon length: ~150 bp
- Read length: 100 bp (high probability that a read spans a splice site)
Alternative splicing
RNA editing
8
SNPiR protocol:
Mapping reads in a splice-aware manner
Variant calling using GATK (errors produced during library preparation, difficulty due to highly similar regions)
Rigorous filtering of false positives (comparison of two well-characterized samples)
9
CANDIDATE SEQUENCES
GM12878 lymphoblastoid cells and peripheral-blood mononuclear cells (PBMCs)
The transcriptome, exome, and whole genome of these samples have been deeply sequenced.
The matched RNA and DNA samples enable verification of RNA SNP calls, because the calls can be compared to the variation present in the DNA.
The GM12878 cell line has been extensively studied, and SNPs detected in its genome have been continuously deposited into dbSNP.
10
Mapping
RNA-seq data:
- Whole GM12878 lymphoblastoid cells (from ENCODE), HiSeq: two replicates of 235.8 and 263.7 million paired-end 76 bp reads
- Peripheral-blood mononuclear cells (PBMCs) from one healthy individual, HiSeq: 20-point time series, 3,232 million paired-end 101 bp reads
Burrows-Wheeler Aligner (BWA): reference genome, transcriptome, hg19 + exonic sequences surrounding known splice sites
11
Selection Criteria
Alignment of the splice regions
BLAT step: reads with q>10 selected
SAMtools: remove identical reads mapped to the same location
Retain reads with the highest mapping quality
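The selection rule on this slide (quality threshold, then one read per group of identical reads) can be sketched in Python. This is only an illustration, not the SAMtools implementation: the `Read` record and its fields are hypothetical, and the assumption that "identical" means same chromosome, position, strand and sequence is ours.

```python
from typing import NamedTuple


class Read(NamedTuple):
    # Hypothetical minimal read record; a real pipeline works on BAM records.
    name: str
    chrom: str
    pos: int
    strand: str
    seq: str
    mapq: int


def filter_and_dedupe(reads, min_mapq=10):
    """Keep reads with mapping quality > min_mapq, then keep one read per
    (chrom, pos, strand, seq) group: the one with the highest mapping quality."""
    best = {}
    for r in reads:
        if r.mapq <= min_mapq:
            continue  # fails the q > 10 criterion
        key = (r.chrom, r.pos, r.strand, r.seq)
        if key not in best or r.mapq > best[key].mapq:
            best[key] = r
    return list(best.values())


reads = [
    Read("r1", "chr1", 100, "+", "ACGT", 30),
    Read("r2", "chr1", 100, "+", "ACGT", 60),  # identical to r1, higher MAPQ
    Read("r3", "chr1", 250, "-", "TTGA", 8),   # below the quality threshold
    Read("r4", "chr2", 500, "+", "GGCA", 40),
]
kept = filter_and_dedupe(reads)
print([r.name for r in kept])
```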
12
RNA-seq: variant calling and filtering
GATK
- IndelRealigner (local realignment)
- TableRecalibration (base-score recalibration)
- UnifiedGenotyper (candidate variant calling)
13
Filtering candidate variants:
Loose filtering: reads with Q>20 selected
BLAT: to remap all the reads supporting a variant
Ignored variants in homopolymer runs >5 bp, within 4 bp of splice junctions, or in the first six bases of a read
Removed variants in RNA editing sites
ANNOVAR: annotate variants based on gene models (GENCODE, RefSeq, Ensembl, UCSC browser)
Categorised variants:
- Known: found support in WGS data/SNP database
- Novel: found only in RNA-seq
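A minimal Python sketch of the three position-based filters listed above (homopolymer runs, splice-junction proximity, first six read bases). It is not the SNPiR code: the function names, the 0-based coordinates, and the rule that a variant survives if at least one supporting read carries it outside the first six bases are all assumptions made for illustration.

```python
def homopolymer_run_length(ref_seq, idx):
    """Length of the run of identical bases in ref_seq that contains idx."""
    base = ref_seq[idx]
    start = idx
    while start > 0 and ref_seq[start - 1] == base:
        start -= 1
    end = idx
    while end < len(ref_seq) - 1 and ref_seq[end + 1] == base:
        end += 1
    return end - start + 1


def passes_filters(ref_seq, idx, junctions, read_offsets,
                   max_homopolymer=5, junction_margin=4, read_start_skip=6):
    """Apply the three filters to one candidate variant.

    idx          -- 0-based position of the variant in ref_seq
    junctions    -- 0-based splice-junction positions on the same sequence
    read_offsets -- 0-based offset of the variant within each supporting read
    """
    # Filter 1: variant sits in a homopolymer run longer than 5 bp.
    if homopolymer_run_length(ref_seq, idx) > max_homopolymer:
        return False
    # Filter 2: variant lies within 4 bp of a splice junction.
    if any(abs(idx - j) <= junction_margin for j in junctions):
        return False
    # Filter 3 (assumed rule): ignore support from the first six read bases.
    usable = [o for o in read_offsets if o >= read_start_skip]
    return len(usable) > 0


ref = "ACGT" + "A" * 7 + "GCTAGCTAGGATCC"
print(passes_filters(ref, 6, [], [20]))         # inside a 7 bp run of A
print(passes_filters(ref, 15, [17], [20]))      # 2 bp from a junction
print(passes_filters(ref, 22, [17], [2, 40]))   # one usable supporting read
print(passes_filters(ref, 22, [17], [0, 5]))    # support only at read starts
```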
14
Categorised variants
Known: found support in WGS data/SNP database
Novel: found only in RNA-seq
15
WES, WGS variant calling and filtering
Lymphoblastoid cells: 1000 Genomes Project (coverage 44x)
PBMC: Sequence Read Archive
Mapping with BWA
Variant calling with GATK, with the same parameters as in the 1000 Genomes Project
Generate gold standard for reads
16
Results - known sites:
99.6% (172,322) of variants in GM12878 and 97.7% (292,224) of variants in PBMC were supported by evidence from WGS or dbSNP
For known sites, ts/tv ratios were 2.25 (expected approx. 2-2.1 genome-wide; exonic regions ~3)
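For reference, the ts/tv (transition/transversion) ratio quoted on this slide can be computed as in the short Python sketch below. Transitions are purine-to-purine (A<->G) or pyrimidine-to-pyrimidine (C<->T) changes; every other single-base change is a transversion. The input format, a list of `(ref, alt)` base pairs, and the toy data are assumptions for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions (ref and alt must differ)."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snvs):
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    ts = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ts
    return ts / tv if tv else float("inf")


# Toy call set: 4 transitions and 2 transversions.
snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(ts_tv_ratio(snvs))
```

A call set contaminated with random errors tends toward a lower ratio, because there are twice as many possible transversions as transitions; this is why a low ts/tv ratio is read as a sign of false positives.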
17
Novel sites:
~27% of novel variants in GM12878 and ~7% in PBMC were supported by variant reads in WGS data
Higher ts/tv ratios than for the known sites
Remaining novel sites: enrichment of A>G and T>C variation, indicating RNA editing
RNA editing catalogue is not yet complete
18
Enrichment of Variants in Functional Categories
SNPiR detects variants better in coding exons, UTRs, and introns
33.4% of SNPs identified by WES in coding regions of GM12878 cells were also identified by SNPiR
19
High sensitivity
WGS and WES: no variant detection in coding region
SNPiR: 40.2% and 47.7% of variants in GM12878 and PBMCs
When we compared the RNA-seq variants to WGS variants in expressed genes, the sensitivity rose to >70% (many genes are expressed at low levels)
Similar results were obtained for random samplings of 5, 10, 20, 50, and 100 million reads from the GM12878 RNA-seq data set (showing high precision for low-depth regions)
21
Comparison of Sensitivity and Precision between RNA-Seq and WES Experiments
Consensus coding region:
- PBMC WES library: 94.1 million mapped reads
- 22,052 variants through WGS
- 17,922 variants through WES (81.3%)
- 9,892 (44.9%) of them through RNA-seq
Exon regions:
- 23,693 (38.2%) WGS variants by using WES and 24,987 (40.3%) variants by using RNA-seq
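The percentages on this slide are sensitivities measured against the WGS call set. A small Python sketch of how sensitivity and precision are computed from two variant sets follows; the `(chrom, pos, alt)` tuple representation and the toy sets are assumptions, while the counts in the last lines are the ones given on the slide.

```python
def sensitivity(gold, calls):
    """Fraction of gold-standard variants that also appear in calls."""
    return len(gold & calls) / len(gold)


def precision(gold, calls):
    """Fraction of called variants that are confirmed by the gold standard."""
    return len(gold & calls) / len(calls)


# Toy example: variants keyed by (chrom, pos, alt).
gold = {("chr1", 100, "G"), ("chr1", 200, "T"), ("chr2", 50, "A"), ("chr2", 75, "C")}
calls = {("chr1", 100, "G"), ("chr2", 50, "A"), ("chr3", 10, "T")}
print(sensitivity(gold, calls), precision(gold, calls))

# Counts from the slide: 22,052 WGS variants in the consensus coding region.
wes_pct = round(100 * 17922 / 22052, 1)     # WES recovers 17,922 of them
rnaseq_pct = round(100 * 9892 / 22052, 1)   # RNA-seq recovers 9,892 of them
print(wes_pct, rnaseq_pct)
```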
23
Comparison to RNASEQR
RNASEQR: a Bowtie-based mapping program with high accuracy
Smaller number of variants identified by SNPiR
Novel SNPs from SNPiR show enrichment of A>G and T>C
ts/tv ratios of variants identified by RNASEQR were low (false positive hits)
Novel SNPs from RNASEQR did not show enrichment of A>G and T>C
24
A portion of novel RNASEQR variants did not show support in WGS data
Of the 23,878 coding SNPs identified from WGS, SNPiR identified 9,607 (40.2%) and RNASEQR identified 5,571
25
Conclusions
A computational approach for accurate identification of genomic variants from transcriptome sequencing, combining a splice-aware RNA-seq read-mapping procedure with subsequent variant filtering that takes the specifics of RNA-seq experiments into account
Highest possible accuracy: reads are mapped simultaneously to the reference genome and to short pseudochromosomes created from sequences around all currently known splice junctions
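The pseudochromosome idea can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: the function name, the 0-based junction coordinates, and the choice of a flank of `read_length - 1` bases on each side (so that any read aligned to the pseudochromosome must cross the junction) are assumptions. The sketch also ignores the case of an exon shorter than the flank.

```python
def junction_sequences(genome, junctions, read_length):
    """Build one short pseudochromosome per known splice junction.

    genome    -- dict mapping chromosome name to its sequence
    junctions -- list of (chrom, donor_end, acceptor_start), 0-based, where
                 donor_end is the last base of the upstream exon and
                 acceptor_start is the first base of the downstream exon
    """
    flank = read_length - 1  # assumed flank size, see lead-in
    out = {}
    for chrom, donor_end, acceptor_start in junctions:
        seq = genome[chrom]
        left = seq[max(0, donor_end - flank + 1): donor_end + 1]
        right = seq[acceptor_start: acceptor_start + flank]
        # Exonic flanks are joined directly, with the intron removed.
        out[f"{chrom}:{donor_end}-{acceptor_start}"] = left + right
    return out


# Toy genome: exon (upper case), intron (lower case), exon (upper case).
genome = {"chr1": "AAAACCCCgtnnnnagGGGGTTTT"}
pseudo = junction_sequences(genome, [("chr1", 7, 16)], read_length=5)
print(pseudo)
```

Reads that span the junction align contiguously to such a sequence, while reads that lie entirely within an exon still align to the reference genome itself.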
26
More precise and sensitive than TopHat2
More novel RNA variants would be found in previously unstudied data sets (potential roles in diseases)
Future directions: accurate read mapping without a well-assembled genome and a well-annotated transcriptome
27
Questions?