ENCODE: understanding our genome Ewan Birney The ENCODE Project Consortium Biosapiens Network of Excellence.

1 ENCODE: understanding our genome Ewan Birney The ENCODE Project Consortium Biosapiens Network of Excellence


3 ENCODE experiments

4 ENCODE experiments: areas, assays and groups
Area                      Assay                        Groups
Proteins                  Manual annotation, RT-PCR    Guigo, Harrow+Hubbard, Reymond
Transcripts               Tiling Arrays                Gingeras, Snyder
Transcripts               Tag seq.                     Yijun, Riken
General Chromatin Marks   Tiling Arrays, ChIP          Dunham, Reng
Sequence sp. Factors      Tiling Arrays, ChIP          Snyder, Gingeras, Farnham, Dunham
DNaseI sens.              PCR, Tiling arrays           Stam., Crawford
Replication               Tiling arrays                Dutta
Conservation              Comparative sequence         Green, Sidow, Miller
DNA structure             Hydroxyl radical             Leib
Promoter                  Reporter assays              Myers

5 ENCODE Pilot
Considered too expensive and too risky to decide on winning technologies (started in 2004)
1% of the genome (30 Mb) chosen; all experiments on the same 1%
Pilot phase ended
  – Analysis and publication
  – Scale up to genome wide now funded

6 A lot of ChIP-chip

7 Nowadays, a lot of ChIP-seq

8 Transcription

9 Lots of it
  – And not all of it genes
  – And even when it is inside a gene, not all of it with open reading frames
  – And even when it has an open reading frame, not all of it making sense! (evolutionarily or structurally)
Not technical false positives

10 Protein coding loci are far more complex than we think
On average 5 transcripts per locus
Many do not encode proteins (as far as we can see)
Even among the ones which do encode proteins, many of the proteins look weird

11 Implausible structures

12 Many effects on potential function

13 Signal peptides, TM Helices
1097 protein transcripts from 487 loci
  – 219 have signal peptides (107 loci)
  – 12 loci have an isoform without the signal peptide
  – 41 transcripts have a gain or loss of a transmembrane helix (sometimes up to 8!)

14 The Clade B Serpins: Potential Missing fragments [figure panels: (a) inactive, "stressed"; (b) active (beta inserted); (c) to (f)]

15 Transcription Start Sites

16 Technologies on TSS
Gencode Manual Ann.
Unbiased TxFrag
Ditag data
CAGE data
Histone mod.
DNase I sens.
Sequence sp. Factors (e.g. Myc)

17 Integration Strategy
Anchor on 5' ends
GenCode 5' and CAGE/DiTag
Categorise and assess using transcript-based evidence
Exons, TxFrags, CpG islands
Assess categories with Histone and TF data
16,051 unique TSS
8,587 TSS tight clusters
5 different classes
First 4 low P-values
First 4 categories have biological signals: 4,491 TSS
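The step from unique TSS to "tight clusters" can be illustrated with a minimal sketch. It assumes a simple single-linkage rule: 5' end tags on the same chromosome and strand are merged when neighbouring tags lie within a gap threshold. The function name `cluster_tss`, the tuple layout, and the 60 bp default are hypothetical choices for illustration; the slides do not state the clustering rule the consortium actually used.

```python
from collections import defaultdict


def cluster_tss(tags, max_gap=60):
    """Group 5' end tags into clusters.

    tags: iterable of (chrom, strand, position) tuples.
    max_gap: hypothetical threshold (not from the slides); neighbouring tags
             on the same chromosome and strand that are at most this many
             bases apart are placed in the same cluster.
    Returns a list of (chrom, strand, start, end, n_tags) tuples.
    """
    by_key = defaultdict(list)
    for chrom, strand, pos in tags:
        by_key[(chrom, strand)].append(pos)

    clusters = []
    for (chrom, strand), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= max_gap:
                # Close enough to the previous tag: extend the current cluster.
                prev = pos
                count += 1
            else:
                # Gap too large: close the current cluster and open a new one.
                clusters.append((chrom, strand, start, prev, count))
                start = prev = pos
                count = 1
        clusters.append((chrom, strand, start, prev, count))
    return clusters


# Toy tags, invented for illustration only.
tags = [("chr1", "+", 100), ("chr1", "+", 130), ("chr1", "+", 500), ("chr1", "-", 120)]
clusters = cluster_tss(tags)
for c in clusters:
    print(c)
```

Keeping strands separate matters here: a tag at a nearby position on the opposite strand represents a different transcription start and should not be merged.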

18 TSS Categories
Category      Number (non-redundant)   P-value of overlap
GenCode 5'    1730                     2e-70
Exon(sense)   1437                     6e-39
Exon(anti)    521                      3e-8
TxFrag        639                      7e-63
CpG           164                      4e-90
No support    2666

19 GenCode 5' ends
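The category counts can be checked against the totals quoted on the neighbouring slides with a few lines of arithmetic. This sketch assumes the flattened table reads as category, count, P-value (for example, "GenCode 5'", 1730, 2e-70); the dictionary name and keys are illustrative.

```python
# Counts as read from the TSS Categories table (non-redundant TSS per category).
tss_counts = {
    "GenCode 5'": 1730,
    "Exon(sense)": 1437,
    "Exon(anti)": 521,
    "TxFrag": 639,
    "CpG": 164,
    "No support": 2666,
}

# TSS in any category that has supporting evidence.
supported = sum(n for name, n in tss_counts.items() if name != "No support")

# Share of supported TSS that coincide with an annotated GenCode 5' end.
gencode_fraction = tss_counts["GenCode 5'"] / supported

print(supported)
print(f"{gencode_fraction:.1%}")
```

Under that reading, the supported categories sum to the 4,491 TSS quoted on the Integration Strategy slide, and the GenCode 5' share comes to roughly 38%, in line with the figure on the Conclusion slide.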

20 Unsupported tags

21 Novel TSSs

22 Conclusion
There are 4,491 TSS with multiple lines of evidence supporting them
This is ~10 fold more than the number of genes
Only 38% would be traditionally classified as TSS (less if one took Ensembl or RefSeq)

23 Implications of many more TSSs
Consistent with considerable diversity of transcripts
Independently integrating ChIP-chip data suggested ~1,000 Regulatory Clusters
  – 25% proximal considering Ensembl/RefSeq
  – 65% when this TSS catalog is considered

24 More subtle conclusions
Sequence specific factors are distributed symmetrically around the TSS
  – Should we only be taking upstream regions for reporter genes?
Histone information is highly correlated with gene on/off status
  – Generalising many locus specific studies
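Whether binding sites sit symmetrically around a TSS depends on measuring offsets in the direction of transcription, so minus-strand genes must have their coordinates flipped. The sketch below shows that bookkeeping. The function names, the 1,000 bp window, and the toy coordinates are assumptions for illustration, not the consortium's analysis.

```python
def signed_offset(site_pos, tss_pos, strand):
    """Offset of a site from a TSS along the direction of transcription.

    Negative values are upstream of the TSS, positive values downstream.
    """
    offset = site_pos - tss_pos
    return offset if strand == "+" else -offset


def upstream_downstream_counts(sites, tss_pos, strand, window=1000):
    """Count sites within `window` bases upstream and downstream of a TSS.

    window is a hypothetical cutoff chosen for this example.
    """
    upstream = downstream = 0
    for pos in sites:
        off = signed_offset(pos, tss_pos, strand)
        if -window <= off < 0:
            upstream += 1
        elif 0 <= off <= window:
            downstream += 1
    return upstream, downstream


# Toy binding-site positions around a TSS at position 1000.
sites = [900, 950, 1020]
plus_counts = upstream_downstream_counts(sites, 1000, "+")
minus_counts = upstream_downstream_counts(sites, 1000, "-")
print(plus_counts, minus_counts)
```

The same three sites count as mostly upstream for a plus-strand gene and mostly downstream for a minus-strand gene, which is why strand-aware offsets are needed before pooling sites across many TSSs.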

25 Gene On/Off

26 Gene status prediction

27 Distal sites

28 Finding distal sites
ChIP-chip not great
  – Most look close to one of these new TSSs
  – Factor bias?
DNaseI Hypersensitive Sites
  – All factors give a DHS signal
  – 55% of DHSs are distal to any TSS
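A distal fraction such as the one quoted for DHSs depends both on the TSS catalog and on the distance cutoff. A minimal sketch of the calculation follows, assuming positions on a single chromosome, a non-empty TSS list, and a hypothetical 2,500 bp cutoff; the slides do not give the cutoff used.

```python
from bisect import bisect_left


def nearest_distance(pos, sorted_tss):
    """Distance from pos to the closest entry of a sorted, non-empty TSS list."""
    i = bisect_left(sorted_tss, pos)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(pos - t) for t in candidates)


def distal_fraction(dhs_positions, tss_positions, cutoff=2500):
    """Fraction of DHS positions farther than `cutoff` from every TSS.

    cutoff is a hypothetical threshold chosen for this example.
    """
    tss = sorted(tss_positions)
    distal = sum(1 for p in dhs_positions if nearest_distance(p, tss) > cutoff)
    return distal / len(dhs_positions)


# Toy coordinates, invented for illustration only.
dhs = [500, 5000, 19000, 30000]
annotated_tss = [1000, 20000]
expanded_tss = annotated_tss + [5200]  # catalog with one newly found TSS

before = distal_fraction(dhs, annotated_tss)
after = distal_fraction(dhs, expanded_tss)
print(before, after)
```

In the toy data, adding a single new TSS moves one site from distal to proximal, which mirrors the point that many apparently distal regions turn out to be close to promoters once the larger TSS catalog is used.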

29 Distal DHS

30 Most surveyed factors are proximal

31 Replication

32 H3K27me3 is correlated

33 Evolutionary conservation and ENCODE

34 Evolutionary conservation

35 …but not everything is constrained

36 Why is there a discrepancy?
False positives in the experiments
  – But experiments validate at >80% and cross-validate each other
False negatives in the constraint detection
  – But can detect elements down to 8 bp, and within neutral zone of alignability
Neutral turnover model

37 Neutral biochemical events Time

38 Lineage specific Time

39 Functional conservation Human Mouse

40 Special case: Transcription [figure labels: Gene, Regulatory Information, Constrained sequence, Pre-miRNAs]

41 What should we learn from ENCODE
Wacky transcription is real (but god knows what it does)
  – Unconventional Transcript
Lots more TSSs than we understand
  – Many distal regions are actually close to promoters
Broad specificity marks are more useful
  – DNaseI sites, Histone marks

42 Neutral model for biochemical events on the genome
That things happen reproducibly in multiple tissues does not imply selection (this is not the same as experimental variance)
Could imply functional conservation outside of orthologous bases
  – Comparative genomics sequencing not enough (but a great starting point!)
  – Comparative functional investigation

43 Consortia work
ENCODE
  – Experimentally led consortium
  – Needs a lot of computational collaboration
Biosapiens
  – Computationally led consortium
  – Needs experimental collaboration (!)
DNA: ENCODE
Protein: Biosapiens

44 What happens next?

45 Ensembl Regulatory Build elements Chr 14, 5677077-567896 Status GM06990 Cells, Myc bound

46 Initial Regulatory Build
DNaseI Hypersensitive sites, 6 histone modifications, CTCF binding
~110,000 elements, ~2 Mb of DNA
6,000 promoter associated by inherent pattern (DNaseI + H3K36me3)
Available now
This year: Mouse, more classification

47 Regulatory build

48 Ensembl - at your service
Web browser
MySQL DB access
BioMart
Geek for a week
  – You send someone to us for a week
Xose for a day
  – We send someone to you for a day

49 The ENCODE Project Consortium
Damian Keefe, Yutao Fu, Zhiping Weng, Mike Snyder, Elliott Margulies, John Stam., Manolis Dermitzakis, Tom Gingeras, Roderic Guigo, Ian Dunham, Christophe Koch, Anindya Dutta, Paul Flicek and 293 others…
The Biosapiens Network of Excellence
Michael Tress, Alfonso Valencia, Janet Thornton, Roderic Guigo, Soren Brunak, David Jones, Martin Vingron, Anna Tramontano, Jacques van Helden and 57 others…
