Reliable Identification of Genomic Variants from RNA-seq Data Robert Piskol, Gokul Ramaswami, Jin Billy Li PRESENTED BY GAYATHRI RAJAN VINEELA GANGALAPUDI.

1 Reliable Identification of Genomic Variants from RNA-seq Data Robert Piskol, Gokul Ramaswami, Jin Billy Li PRESENTED BY GAYATHRI RAJAN VINEELA GANGALAPUDI

2 CONTENTS  Uses of finding genomic variants  Current techniques for identifying genomic variants  Limitations of current protocols  What is RNA-seq?  SNPiR protocol  Results  Conclusions

3 Why is finding genomic variants important?  Differentiate genotype and phenotype  Basis of diseases such as cancer and Mendelian disorders Current methods:  WGS, WES Proposed method:  SNPiR – an RNA-seq based method

4 RNA-seq http://bioinfo.mc.vanderbilt.edu/NGS/rna-seq.html

5 Why RNA-seq over WGS?  Cost effective  Answers multiple questions - Gene expression - Alternative splicing - Allele specific expression - Gene fusion - RNA editing  Validates variants found from WGS data  De novo calling – identifies new variants  Heterogeneity of diseases – Variant calling from WGS data

6 Difficulty:  Splicing  Errors in read alignment Solution:  Strong filtering  Analysing data from multiple individuals

7 Incorrect mapping of RNA reads to reference genome  Highly similar regions  Artifacts in library construction  Not mapping reads in a splice-aware manner - Average exon length ~150 bp - Read length – 100 bp (high probability that a read spans a splice site)  Alternative splicing  RNA editing
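
The length argument on this slide can be made concrete with a back-of-the-envelope sketch. It assumes read start positions are uniform within a single exon and ignores everything else (paired ends, variable exon lengths); the function name and the model are illustrative assumptions, not part of the SNPiR paper.

```python
def fraction_reads_leaving_exon(exon_len: int, read_len: int) -> float:
    """Fraction of read start positions within an exon whose read extends past the exon end.

    Simplifying assumption (not from the slides): starts are uniform over the exon.
    """
    if exon_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    # A read starting at 0-based offset s stays inside the exon
    # only if s + read_len <= exon_len.
    starts_fully_inside = max(0, exon_len - read_len + 1)
    return 1.0 - starts_fully_inside / exon_len


if __name__ == "__main__":
    # 150 bp and 100 bp are the figures quoted on the slide.
    print(f"{fraction_reads_leaving_exon(150, 100):.2f}")
```

Under this toy model roughly two thirds of the 100 bp reads starting in a 150 bp exon run past its end, which is why a splice-unaware aligner mis-places so many RNA-seq reads.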

8 SNPiR protocol:  Mapping reads in a splice-aware manner  Variant calling using GATK (errors produced during library preparation, difficulty due to highly similar regions)  Vigorous filtering of false positives (comparison of two well-characterized samples)

9 CANDIDATE SEQUENCES  GM12878 lymphoblastoid cells & peripheral-blood mononuclear cells (PBMCs)  The transcriptome, exome, and whole genome of these samples have been deeply sequenced.  The matched RNA and DNA samples enable verification of RNA SNP calls because they can be compared to variation present in the DNA  The GM12878 cell line has been extensively studied, and SNPs detected in its genome have been continuously deposited into dbSNP

10 Mapping RNA-seq data:  Whole GM12878 lymphoblastoid cells (from ENCODE) -- HiSeq: two replicates of 235.8 and 263.7 million paired-end 76 bp reads  Peripheral-blood mononuclear cells (PBMCs) from one healthy individual -- HiSeq: 20-point time series - 3,232 million paired-end 101 bp reads  Burrows-Wheeler Aligner (BWA) – reference genome, transcriptome, hg19 + exonic sequences surrounding known splice sites

11 Selection Criteria  Alignment of the splice regions  BLAT step – reads with q>10 selected  SAMtools – remove identical reads mapped to the same location  Retain reads with the highest mapping quality
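
The quality cut-off and duplicate removal on this slide can be sketched in plain Python. This is an illustration only: the real pipeline uses SAMtools on BAM files, and the `Alignment` record, the location key (chromosome, start, strand) and the function name are assumptions made for the example.

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    # Hypothetical minimal record; real data would come from a BAM file.
    read_id: str
    chrom: str
    start: int
    strand: str
    mapq: int


def filter_and_dedup(alignments: Iterable[Alignment], min_mapq: int = 10) -> List[Alignment]:
    """Keep alignments with mapq > min_mapq, one per location, preferring the highest mapq."""
    best = {}
    for aln in alignments:
        if aln.mapq <= min_mapq:  # slide: reads with q>10 selected
            continue
        key = (aln.chrom, aln.start, aln.strand)
        # Among reads mapped to the same location, retain the best-mapped one.
        if key not in best or aln.mapq > best[key].mapq:
            best[key] = aln
    return sorted(best.values(), key=lambda a: (a.chrom, a.start, a.strand))
```

Keying duplicates on location alone is the simplest choice; a stricter definition could also compare the read sequences.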

12 RNA-seq: variant calling and filtering  GATK IndelRealigner (local realignment)  GATK TableRecalibration (base-score recalibration)  GATK UnifiedGenotyper (candidate variant calling)

13 Filtering candidate variants:  Loose filtering – variants with Q>20 selected  BLAT – to remap all the reads supporting a variant  Ignored variants in homopolymer runs >5 bp, within 4 bp of splice junctions, and in the first six bases of reads  Removed variants at known RNA editing sites  ANNOVAR - annotate variants based on gene models (GENCODE, RefSeq, Ensembl, UCSC browser)  Categorised variants: --Known – found support in WGS data/SNP database --Novel -- found only in RNA-seq
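
Three of the positional filters listed on this slide (homopolymer runs, distance to splice junctions, first six read bases) can be sketched as below. All function names, argument layouts and the way the filters are combined are assumptions for illustration; the thresholds are the ones quoted on the slide and are left as parameters.

```python
from typing import Iterable, List, Sequence


def homopolymer_run_length(seq: str, pos: int) -> int:
    """Length of the run of identical bases that contains seq[pos]."""
    base = seq[pos]
    left = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = pos
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1


def in_homopolymer(seq: str, pos: int, max_run: int = 5) -> bool:
    # Slide: ignore variants in homopolymer runs >5 bp.
    return homopolymer_run_length(seq, pos) > max_run


def near_splice_junction(pos: int, junction_positions: Iterable[int], window: int = 4) -> bool:
    # Slide: ignore variants within 4 bp of a splice junction.
    return any(abs(pos - j) <= window for j in junction_positions)


def offsets_outside_read_start(mismatch_offsets: Sequence[int], n_first: int = 6) -> List[int]:
    # Slide: ignore mismatches in the first six bases of a read (0-based offsets 0..5).
    return [o for o in mismatch_offsets if o >= n_first]


def passes_filters(ref_window: str, idx_in_window: int, genomic_pos: int,
                   junction_positions: Iterable[int], mismatch_offsets: Sequence[int]) -> bool:
    """Hypothetical combination: a variant survives only if no filter rejects it."""
    if in_homopolymer(ref_window, idx_in_window):
        return False
    if near_splice_junction(genomic_pos, junction_positions):
        return False
    # At least one supporting read must carry the mismatch outside its first six bases.
    return len(offsets_outside_read_start(mismatch_offsets)) > 0
```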

14 Categorised variants  Known – found support in WGS data/SNP database  Novel -- found only in RNA-seq

15 WES, WGS variant calling and filtering  Lymphoblastoid cells – 1000 Genomes Project (coverage 44x)  PBMC – Sequence Read Archive  Mapping with BWA  Variant calling with GATK with the same parameters as in the 1000 Genomes Project  Generate gold standard for reads

16 Results - known sites:  99.6% (172,322) of variants in GM12878 and 97.7% (292,224) of variants in PBMC were supported by evidence from WGS or dbSNP  For known sites, ts/tv ratios were 2.25 (expected approx. 2-2.1); exonic regions ~3
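
The ts/tv (transition/transversion) ratio used here as a quality check can be computed from a list of single-nucleotide substitutions. A minimal sketch, with a hypothetical input format of `(ref, alt)` pairs:

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A transition swaps purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T)."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt must differ")
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ts_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Number of transitions divided by number of transversions."""
    ts = tv = 0
    for ref, alt in snvs:
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; ratio undefined")
    return ts / tv
```

Random errors produce transversions twice as often as transitions (eight possible transversions versus four transitions), so a call set contaminated with false positives drifts toward a lower ratio.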

17 Novel sites:  ~27% of novel variants in GM12878 and ~7% in PBMC were supported by variant reads in WGS data  Higher ts/tv ratios than for the known sites  Remaining novel sites -- enrichment of A>G and T>C variation -- RNA editing  RNA editing catalogue is not yet complete
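
A-to-I RNA editing is read as A>G by the sequencer (and as T>C for transcripts from the opposite strand), so the enrichment mentioned on this slide can be checked by tallying substitution types. A minimal sketch with a hypothetical `(ref, alt)` input format:

```python
from collections import Counter
from typing import Iterable, Tuple


def substitution_counts(snvs: Iterable[Tuple[str, str]]) -> Counter:
    """Count substitutions by type, e.g. 'A>G'."""
    return Counter(
        f"{ref.upper()}>{alt.upper()}"
        for ref, alt in snvs
        if ref.upper() != alt.upper()
    )


def editing_like_fraction(snvs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of substitutions that are A>G or T>C (the signature of A-to-I editing)."""
    counts = substitution_counts(snvs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no substitutions given")
    return (counts["A>G"] + counts["T>C"]) / total
```

With 12 possible substitution types, a fraction far above 2/12 among novel RNA-only sites points to editing rather than genomic variation.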

18 Enrichment of Variants in Functional Categories  SNPiR detects variants better in coding exons, UTRs, and introns  33.4% of SNPs identified by WES in coding regions of GM12878 cells were also identified by SNPiR

19 High sensitivity  WGS and WES – no variant detection in coding region  SNPiR – 40.2% and 47.7% of coding variants recovered in GM12878 and PBMCs  When RNA-seq variants were compared to WGS variants in expressed genes only, sensitivity rose to >70% (many genes are expressed at low levels)  Similar results obtained for random samplings of 5, 10, 20, 50, and 100 million reads from the GM12878 RNA-seq data set (showing high precision for low depth regions)

21 Comparison of Sensitivity and Precision between RNA-Seq and WES Experiments  Consensus coding region:  PBMC WES library 94.1 million mapped reads  22,052 variants through WGS  17,922 variants through WES (81.3%)  9,892 (44.9%) of them through RNA-seq  Exon Regions : 23,693 (38.2%) WGS variants by using WES and 24,987(40.3%) variants by using RNA-seq.
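
The percentages on this slide are sensitivities relative to the WGS call set. A minimal sketch of how sensitivity and precision could be computed from sets of variant keys; the key format and function names are assumptions, and the counts in the usage lines are the ones quoted on the slide:

```python
from typing import Hashable, Iterable


def sensitivity(called: Iterable[Hashable], truth: Iterable[Hashable]) -> float:
    """Fraction of gold-standard variants that were also called."""
    truth_set = set(truth)
    if not truth_set:
        raise ValueError("empty truth set")
    return len(truth_set & set(called)) / len(truth_set)


def precision(called: Iterable[Hashable], truth: Iterable[Hashable]) -> float:
    """Fraction of called variants that are supported by the gold standard."""
    called_set = set(called)
    if not called_set:
        raise ValueError("empty call set")
    return len(called_set & set(truth)) / len(called_set)


if __name__ == "__main__":
    wgs_total = 22_052  # WGS variants in the consensus coding region (from the slide)
    print(round(100 * 17_922 / wgs_total, 1))  # share recovered by WES
    print(round(100 * 9_892 / wgs_total, 1))   # share recovered by RNA-seq
```

The two printed values reproduce the 81.3% and 44.9% quoted on the slide.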

23 Comparison to RNASEQR  A Bowtie-based mapping program with high accuracy  Smaller number of variants identified by SNPiR  Novel SNPiR SNPs show enrichment of A>G and T>C  ts/tv ratios of variants identified by RNASEQR were low (false positive hits)  Novel RNASEQR SNPs did not show enrichment of A>G and T>C

24  A portion of novel RNASEQR variants did not show support in WGS data  Of the 23,878 coding SNPs identified from WGS, SNPiR identified 9,607 (40.2%) and RNASEQR identified 5,571

25 Conclusions  A computational approach for accurate identification of genomic variants from transcriptome sequencing through the combination of a splice-aware RNA-seq read-mapping procedure and subsequent variant filtering that takes the specifics of RNA-seq experiments into account  Highest possible accuracy => reads simultaneously mapped to the reference genome & short pseudochromosomes created from sequences around all currently known splice junctions
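
The "short pseudochromosomes" idea can be sketched as follows: for every known junction, the exonic sequence just upstream of the donor site is joined to the exonic sequence just downstream of the acceptor site, so that a spliced read aligns to it without a gap. The coordinate convention (0-based, half-open), the naming scheme and the `flank` parameter are assumptions for this example, not details taken from the slides.

```python
from typing import Dict, Iterable, Tuple


def junction_pseudochromosomes(
    genome: Dict[str, str],
    junctions: Iterable[Tuple[str, int, int]],
    flank: int,
) -> Dict[str, str]:
    """Build one short sequence per splice junction.

    genome:    chromosome name -> sequence
    junctions: (chrom, donor_end, acceptor_start); the upstream exon ends just
               before donor_end, the downstream exon begins at acceptor_start
    flank:     number of exonic bases kept on each side (hypothetical parameter;
               a value just under the read length is a natural choice)
    """
    pseudo = {}
    for chrom, donor_end, acceptor_start in junctions:
        seq = genome[chrom]
        upstream = seq[max(0, donor_end - flank):donor_end]
        downstream = seq[acceptor_start:acceptor_start + flank]
        pseudo[f"{chrom}:{donor_end}-{acceptor_start}"] = upstream + downstream
    return pseudo
```

Reads would then be aligned against the reference genome and these junction sequences together, and hits on a junction sequence translated back to genomic coordinates.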

26  More precise and sensitive than TopHat2  More novel RNA variants would be found in previously unstudied data sets (potential roles in diseases)  Future directions: accurate read mapping without a well assembled genome and a well-annotated transcriptome

27 Questions??????

