Presentation is loading. Please wait.

Presentation is loading. Please wait.

Aziz Khan, Ning Wang April 22, 2013

Similar presentations


Presentation on theme: "Aziz Khan, Ning Wang April 22, 2013"— Presentation transcript:

1 Aziz Khan, Ning Wang April 22, 2013
Landscape of transcription in human cells S. Djebali, C. A. Davis, A. Merkel, A. Dobin, T. Lassmann, A. Mortazavi, A. Tanzer, J. Lagarde, W. Lin, F. Schlesinger, C. Xue, G. K. Marinov, J. Khatun, B. A. Williams, C. Zaleski, J. Rozowsky, M. R¨oder, F. Kokicinski, R. F. Abdelhamid, T. Alioto, I. Antoshechkin, M. T. Baer, N. S. Bar, P. Batut, K. Bell, I. Bell, S. Chakrabortty, X. Chen, J. Chrast, J. Curado, T. Derrien, J. Drenkow, E. Dumais, J. Dumais, R. Duttagupta, E. Falconnet, M. Fastuca, K. Fejes-Toth, P. Ferreira, S. Foissac, M. J. Fullwood, H. Gao, D. Gonzalez, A. Gordon, H. Gunawardena, C. Howald, S. Jha, R. Johnson, P. Kapranov, B. King, C. Kingswood, O. J. Luo, E. Park, K. Persaud, J. B. Preall, P. Ribeca, B. Risk, D. Robyr, M. Sammeth, L. Schaffer, L.-H. See, A. Shahab, J. Skancke, A. M. Suzuki, H. Takahashi, H. Tilgner, D. Trout, N. Walters, H. Wang, H. Wrobel, Y. Yu, X. Ruan, Y. Hayashizaki, J. Harrow, M. Gerstein, T. Hubbard, A. Reymond, S. E. Anonarakis, G. Hannon, M. C. Giddings, Y. Ruan, B. Wold, P. Carninci, R. Guig´o, & T. R. Gingeras. Nature 489 (2012) 101–108. Aziz Khan, Ning Wang April 22, 2013 GROUP: 3

2 Outline Motivation & goal Workflow Data generation Results Conclusion
Long RNA expression landscape Short RNA expression landscape RNA editing & allele-specific expression Repeat region transcription Characterization of enhancer RNA Conclusion Landscape of transcription in human cells, Djebali et al. Nature 2012

3 Motivation and goal ENCODE pilot phase (2003–2007)
Examine 1% of the human genome The ENCODE second phase ( ) To interrogate the complete human genome 80% of the human genome have at least one biochemical function Goal: Provide a genome-wide catalogue of the produced RNAs Identify the subcellular localization for the produced RNAs ENCODE project started in September 2003, brought together an international group of scientists with the goal to to identify all functional elements in the human genome sequence. Initially focusing on 1% of the genome, the pilot project was expanded to the whole genome in 2007 which completed in 2012. Here they report identification and characterization of annotated and novel RNAs that are enriched in either of the two major cellular subcompartments (nucleus and cytosol) for all 15 cell lines studied, and in three additional subnuclear compartments in one cell line (K562). Landscape of transcription in human cells, Djebali et al. Nature 2012

4 Experimental workflow
K562 cell line They performed subcellular compartment fractionation (whole cell, nucleus and cytosol) before RNA isolation in the used cell line to interrogate deeply the human transcriptome. Total RNA was then isolated and partitioned into RNA > 200bp (long) and < 200bp (short). The long RNA was further partitioned into polyA+ and polyA- transcripts. The K562 cell line also underwent additional fractionation into nucleoli, nucleoplasm and chromatin. RNA-seq was conducted on polyA+, polyA- and total (K562) RNA samples. CAGE( cap analysis gene expression) was conducted primarily on polyA+ and total RNA but also on some polyA- samples. RNA-PET(pair end tags) was conducted on PolyA+ samples only RNA-PET(pair end tags) sites of 5’ & 3’ transcripts termini CAGE(cap analysis of gene expression) sites of initiation of transcription Landscape of transcription in human cells, Djebali et al. Nature 2012

5 15 ENCODE cell lines To facilitate integration of data between the contributing research groups, the ENCODE Consortium has identified common cell types for use by ENCODE contributors. These cell types are divided into two Tiers. Tier1 cells are of higher priority, and should be used within experiments before Tier2 cells. Additional cell types beyond Tier1 and Tier2 may be used, but must be registered with the DCC before submitting data. These additional cell types are designated Tier3. Landscape of transcription in human cells, Djebali et al. Nature 2012

6 RNA data processing Raw read data (FASTQs) from each biological replicate is independently mapped against the hg19 ENCODE sex-specific assemblies according to the gender of the sample to generate .BAM (alignment) and .wig (signal) files. Each data type is processed though a custom pipeline developed by the production lab and subsequently distilled into elements, like splice junctions, contigs, de novo assembled genes and transcripts, CAGE and PET clusters. The mapped data is also used to quantify Gencode annotated features, such as genes, transcripts and exons. All elements and features are assessed for their reproducibility using npIDR or IDR (irreproducable detection rate). Finally, all data is sent to the Data Co-ordination Centre (DCC) at UCSC. Landscape of transcription in human cells, Djebali et al. Nature 2012

7 RNA data & processing software
The mapped data was used to assemble and quantify annotated GENCODE v7 elements Elements and quantifications were further assessed for reproducibility between replicates using a non-parametric version of IDR Landscape of transcription in human cells, Djebali et al. Nature 2012

8 Long RNA expression landscape.

9 Detection of annotated transcripts
70% of annotated splice junctions, transcripts and genes were cumulatively detected ~ 85% of annotated exons with an average of coverage 96% (by RNA-seq). Small variation in the proportion of detected GENCODE elements Only a small proportion of GENCODE elements are detected exclusively in the Poly-A− RNA fraction Figure shows GENCODE-detected elements in the Poly-A+ and Poly-A- fractions of cellular compartments (cumulative counts for both RNA fractions and compartments refer to elements present in any of the fractions or compartments). Each box plot is generated from values across all cell lines, thus capturing the dispersion across cell lines. The largest point shows the cumulative value over all cell lines. GENCODE elements are detected by RNA-seq data Landscape of transcription in human cells, Djebali et al. Nature 2012

10 Detection of novel transcripts
The identified novel elements covered 78% of the intronic nucleotides and 34% of the intergenic sequences Used Cufflinks to predict (over all long RNA-seq samples) the following elements in intergenic and antisense regions: 94,800 exons (19%) 69,052 splice junctions (22%) 73,325 transcripts (45%) 41,204 genes (80%) Landscape of transcription in human cells, Djebali et al. Nature 2012

11 Independent validation of multi-exonic transcript models
Using overlapping targeted Roche FLX 454 paired-end reads and mass spectrometry ~ 3,000 intergenic and antisense transcript models tested Validation rates from70%to 90%were observed These experiments identified >22,000 novel splice Almost 8x increase compared to the sites originally detected with RNA-seq At a 1% false discovery rate (FDR), we identified 419 novel modelswith 5 ormore spectral and/or 2 ormore peptide hits, of which only 56 were intergenic or antisense to GENCODE genes Thus, most novel transcripts seem to lack protein-coding capacity Landscape of transcription in human cells, Djebali et al. Nature 2012

12 The transcriptome of nuclear subcompartments (K562)
For K562 cell line, total RNA isolated from: Chromatin, Nucleolus and Nucleoplasm 51.64% (18,330) of the GENCODE (v7) annotated genes detected for all 15 cell lines (35,494) were identified Only a small fraction of annotated/novel elements was unique to that compartment Thus, by analysing short and long RNAs in the different subcellular compartments: They confirm that splicing predominantly occurs during transcription. Landscape of transcription in human cells, Djebali et al. Nature 2012

13 Co-transcriptional splicing Short read mappings for exon-based splicing completion
Read mappings that allow assessment of splicing completion around exons are shown. Reads providing evidence of splicing completion for the region containing the exon (with either exon inclusion (a, b) or exclusion (c)) are shown. Reads providing evidence for the splicing of the region containing the exon not being completed yet are indicated by d and e. There previous study showed that the fraction of RNA molecules in which the region containing the exon has already been spliced. So, a coSI value of 1 means splicing completed, whereas a value of 0 indicates that splicing has not yet been initiated. The complete splicing index (coSI): a, b = exon inclusion c = exon exclusion d, e = exons not being completed coSI = Landscape of transcription in human cells, Djebali et al. Nature 2012

14 Co-transcriptional splicing Distribution of coSI scores computed on GENCODE internal exons
By using RNA-seq to measure the degree of completion of splicing (Fig. 2a), they observed that around most exons, introns are already being spliced in chromatin-associated RNA—the fraction that includes RNAs in the process of being transcribed (Fig. 2b) This shows that co-transcriptional splicing provides an explanation for the increasing evidence connecting chromatin structure to splicing regulation, and they have observed that exons in the process of being spliced are enriched in a number of chromatin marks. Distribution in the total chromatin RNA fraction Distribution in cytosolic Ploy-A+ RNA fraction Landscape of transcription in human cells, Djebali et al. Nature 2012

15 Gene expression across cell lines
The distribution of gene expression is very similar across cell lines Protein-coding genes having on average higher expression levels than lncRNAs. Two-dimensional kernel density plots of nuclear over cytosolic enrichment (y axis) versus overall gene expression in the whole cell extract (x axis), for protein coding, long non-coding and novel genes over all cell lines. RPKM(Reads Per Kilobase per Million mapped reads) is a method of quantifying gene expression from RNA sequencing data by normalizing for total read length and the number of sequencing reads. Only genes present in all three RNA extracts are displayed, as well as two representative genes (ACTG1 in red and H19 in blue), for which the expression in each individual cell line is shown. The actual values of the estimated kernel density are indicated by contour lines and colour shades. Abundance of gene types in cellular compartments lncRNAs contributes more to cell-line specificity than protein-coding genes Abundance of gene types in cellular compartments Landscape of transcription in human cells, Djebali et al. Nature 2012

16 Isoform expression within a gene
a, Number of expressed isoforms per gene per cell line. Genes tend to express many isoforms simultaneously as the number of genes grows. How ever is that linear at(10-12). It b, Relative expression of the most abundant isoform per gene per cell line. There is generally one dominant isoform in a given condition. Each box plot was constructed using the number of genes with 1, 2, 3, 4, etc. up to 25 isoforms. Cannot distinguish whether this is the result of multiple isoforms expressed in the same cell or of different isoforms expressed in different cells. Landscape of transcription in human cells, Djebali et al. Nature 2012

17 Transcription initiation
Workflow of CAGE processing and elements: Raw CAGE reads  mapped to hg19 genome (using Delve) Delve: a probabilistic mapper using HMM Reads with bad mapping quality were discarded Mapped reads  clustered using paraclu Clusters shorter than 200bp were selected Using TSS predictor: A non-supervised classifier based on modeling sequences surrounding CAGE regions via HMMs. To capture sequence motifs of length 2–8 present at a certain distance from the middle of each cluster Landscape of transcription in human cells, Djebali et al. Nature 2012

18 Transcription initiation
82,783 non-redundant TSSs were identified ~ 48% of the CAGE-identified TSSs are located within 500bp of an annotated RNA-seq detected GENCODE TSS. 3% are within 500bp of a novel TSS. Landscape of transcription in human cells, Djebali et al. Nature 2012

19 Transcription termination
128,824 sites mapping within annotated GENCODE transcripts were identified. Trim unmapped RNA-seq reads with long terminal poly-As first. ~ 20% mapped proximal to annotated poly-A sites (PAS). ~ 80% correspond to novel PAS. It increased the average number of PAS per gene from 1.1 to 2.5 A cell-specific preference for proximal PAS in the cytosol compared to the nucleus. Landscape of transcription in human cells, Djebali et al. Nature 2012

20 Short RNA expression landscape

21 Annotated small RNAs 1,989 (28) miRNA snoRNA snRNA tRNA Other†
Expression of GENCODE (v7) annotated small RNA genes Gene type* GENCODE total Detected genes (% detected) No. genes expressed in only one cell line (% detected) No. genes expressed in 12 cell lines (% detected) miRNA guide fragment‡ miRNA passenger fragment§ Internal fragments∥ of annotated small RNA (average per detected gene) miRNA 1,756 497 (28) 59 (12) 147 (30) 454 (454) 175 (175) 18 snoRNA 1,521 458 (30) 73 (16) 223 (49) NA 60 snRNA 1,944 378 (19) 123 (33) 41 (11) 36 tRNA 624 465 (75) 29 (6) 197 (42) 52 Other† 1,209 191 (16) 69 (36) 24 (13) 32 Total GENCODE 7,054 1,989 (28) 353 (18) 632 (32) 40 Small rnas include miRNA, (small nucleolar)snoRNA, (small nuclear)snRNA and (transfer RNA)tRNA 28% annotated small RNAs are expressed in at least one cell line *Includes all other GENCODE small transcript biotypes except for pseudogenes. †All elements that have passed npIDR (0.1). ‡Number of detected miRNAs with an expressed annotated guide (with an annotated guide in mirbase). §Number of detected miRNAs with an expressed annotated passenger (with an annotated passenger in mirbase). ∥Short RNA-seq mapping for which the 5′ end starts 5 bp after the start and ends 5 bp before the end of a detected gene. Landscape of transcription in human cells, Djebali et al. Nature 2012

22 Annotated small RNAs (K562)
As known to perform their functions, miRNA and tRNA enriched in cytosol and snoRNA enriched in nucleolus Landscape of transcription in human cells, Djebali et al. Nature 2012

23 Unannotated short RNAs
Two types: 1. Subfragments of annotated small RNAs 2. Second and largest source of unannotated short RNAs corresponds to novel short RNAs , associated with promoter, terminator regions of annotated genes ,and termini-associated short RNAs . Landscape of transcription in human cells, Djebali et al. Nature 2012

24 Genealogy of short RNAs
Proportion of annotated elements in different genomic domains that overlap different classes of small RNAs. In fact, when controlling for relative exon/intron length, we found that exons from lncRNAs are compara- tively enriched as hosts for snoRNAs Landscape of transcription in human cells, Djebali et al. Nature 2012

25 Genealogy of short RNAs
About 6% of all annotated long transcripts overlap with small RNAs and are probably precursors to these small RNAs. Many long RNAs seem to have dual roles, as functional RNAs, and as precursors for many important classes of small RNAs. Landscape of transcription in human cells, Djebali et al. Nature 2012

26 RNA editing & allele-specific expression.

27 RNA editing Cells can make discrete changes to specific nucleotide sequences within a RNA molecule after it has been generated by DNA A-I editing (main form of RNA editing in mammals) and C-U editing They found: A-I editing (88%) and T-C (5%) Notice:Their results do not support a recent report of a substantial number of non-canonical SNV edits in the RNA of human lymphoblastoid cells30. Landscape of transcription in human cells, Djebali et al. Nature 2012

28 Allele-specific expression (GM12878 RNA-seq datasets)
18% of both GENCODE annotated protein-coding & non- coding genes exhibit allele-specific expression. Similar proportion of genes with allele-specific expression in whole-cell, cytoplasm, & nucleus. Used AlleleSeq pipeline [Rozowsky et al. Mol. Syst. Biol. 2011] Landscape of transcription in human cells, Djebali et al. Nature 2012

29 Repeat region transcription

30 Repeat region transcription
18% (14,828) of CAGE-defined TSS regions overlap repetitive elements. They found that CAGE clusters mapping to repeat regions were noticeably more narrowly expressed than CAGE clusters mapping within genic regions (Shown by Shannon Entropy) Cell-line specificity Indicate they may have real functions Cell line specific expression of repeat elements. Shannon Entropy of CAGE cluster expression profiles across all experiments separated by broad annotation categories. A low entropy means a narrow expression across experiments, thus intergenic LINE, SINE and LTR repeats are noticeably more narrowly expressed than other genomic categories. Landscape of transcription in human cells, Djebali et al. Nature 2012

31 Characterization of enhancer RNA

32 Transcription at enhancers
RNA polymerase II binds some distal enhancer regions and produce enhancer-associated transcripts (eRNAs) Landscape of transcription in human cells, Djebali et al. Nature 2012

33 Pattern of RNA elements around enhancer predictions
Previously reported that eRNA lacked modification by polyadenylation However, this paper reported that there is eRNA with polyadenylation That is interesting. Landscape of transcription in human cells, Djebali et al. Nature 2012

34 Enhancer transcripts differ from promoter transcripts
b, Enhancer transcripts differ from promoter transcripts. Some eRNA seemed to be polyadenylated in the nucleus. And this pattern was significantly compared to transcripts Landscape of transcription in human cells, Djebali et al. Nature 2012

35 Chromatin state at transcribed enhancers
Transcribed enhancers on average show a significantly different pattern of chromatin modification than non-transcribed ones All associated with transcriptional initiation and elongation Landscape of transcription in human cells, Djebali et al. Nature 2012

36 Enhancer activity & transcription is cell-type specific
d, Enhancer activity and transcription is cell-type specific.. Landscape of transcription in human cells, Djebali et al. Nature 2012

37 Summary Extend the current genome-wide annotated catalogue of long poly- adenylated & small RNAs of GENCODE 62.1% and 74.7% of the human genome are covered by either processed or primary transcripts Primary = contigs + introns + GENCODE genes Processed = contigs + GENCODE exons On average one cell line transcribe 39% of the genome No cell line transcribes more than 56.7% of the transcriptomes across all cell lines The consequent reduction in the length of intergenic regions leads to a significant overlapping of neighbouring genic regions and prompts a redefinition of a gene Landscape of transcription in human cells, Djebali et al. Nature 2012

38 Summary A tendency for genes to express many isoforms simultaneously
(plateau: 10–12 expressed isoforms per gene per cell line) Cell-type-specific enhancers are promoters differentiable from other regulatory regions Coding & non-coding transcripts are predominantly localized in the cytosol and nucleus, respectively ~ 6% of all annotated coding and non-coding transcripts overlap with small RNAs and probably precursors to these small RNAs Landscape of transcription in human cells, Djebali et al. Nature 2012

39 Redefinition of the concept of a gene
Novel genes increase the proportion of small intergenic regions Landscape of transcription in human cells, Djebali et al. Nature 2012

40 Thank you! Landscape of transcription in human cells, Djebali et al. Nature 2012


Download ppt "Aziz Khan, Ning Wang April 22, 2013"

Similar presentations


Ads by Google