[BejeranoFall13/14] 1 MW 12:50-2:05pm in Beckman B302 Profs: Serafim Batzoglou & Gill Bejerano TAs: Harendra Guturu & Panos.

Presentation transcript:

CS273A Lecture 16: Functional Genomics. MW 12:50-2:05pm in Beckman B302. Profs: Serafim Batzoglou & Gill Bejerano. TAs: Harendra Guturu & Panos Achlioptas. [BejeranoFall13/14]

Gene set enrichment analysis: The gene regulatory version

Cluster all genes for differential expression. (Figure labels: Experiment vs. Control (replicates); genes ordered as most significantly up-regulated genes, unchanged genes, most significantly down-regulated genes.)

TJL The Gene Sets to test come from the Ontologies

Ask about whole gene sets. Each gene set (Gene Set 1, Gene Set 2, Gene Set 3) is scored with the ES/NES statistic over the experiment vs. control comparison. In the example, gene set 3 is up-regulated and gene set 2 is down-regulated.
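The slide names the ES statistic without defining it. As a rough illustration only, the sketch below computes an unweighted running-sum enrichment score over a ranked gene list. The unweighted scheme, the function name, and the toy gene names are assumptions, not taken from the lecture, and the normalization step that yields NES is not shown.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (Kolmogorov-Smirnov-like).

    ranked_genes: genes ordered from most up-regulated to most down-regulated.
    gene_set: the set of genes annotated with one ontology term.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up on a gene-set member, step down otherwise.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical toy ranking: "a" is most up-regulated, "f" most down-regulated.
ranked = ["a", "b", "c", "d", "e", "f"]
es_up = enrichment_score(ranked, {"a", "b"})
es_down = enrichment_score(ranked, {"e", "f"})
```

A gene set concentrated at the top of the ranking gets a positive score, and one concentrated at the bottom gets a negative score.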

Combinatorial Regulatory Code. 2,000 different proteins can bind specific DNA sequences. A regulatory region encodes 3-10 such protein binding sites. When all are bound by proteins, the regulatory region turns “on”, and the nearby gene is activated to produce protein. (Figure labels: proteins; DNA; protein binding site; gene.)

ChIP-Seq: first glimpses of the regulatory genome in action. (Figure labels: peak calling; cis-regulatory peak.)

What is the transcription factor I just assayed doing?
Collect known literature of the form:
Function A: Gene1, Gene2, Gene3,...
Function B: Gene1, Gene2, Gene3,...
Function C:...
Ask whether the binding sites you discovered are preferentially binding (regulating) any one or more of the functions listed above. Form a hypothesis and perform further experiments.
(Figure labels: gene transcription start site; cis-regulatory peak.)

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile. ChIP-seq identified 2,429 SRF binding peaks in human Jurkat cells [1]. SRF is known as a “master regulator of the actin cytoskeleton”. In the ChIP-seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation. (Figure labels: gene transcription start site; SRF binding ChIP-seq peak.)

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile. Existing, gene-based method to analyze enrichment: ignore distal binding events, count affected genes, and rank terms by hypergeometric enrichment p-value. For an ontology term (e.g. ‘actin cytoskeleton’):
N = 8 genes in genome
K = 3 genes annotated with the term
n = 2 genes selected by proximal peaks
k = 1 selected gene annotated with the term
P = Pr(k ≥ 1 | n = 2, K = 3, N = 8)
(Figure labels: gene transcription start site; SRF binding ChIP-seq peak.)
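The toy hypergeometric p-value on this slide can be checked with a few lines of standard-library Python. This is a minimal sketch: the function name is an assumption, and real gene-list tools add multiple-testing correction on top of this tail probability.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance that at least k of n selected genes carry the term,
    when K of the N genes in the genome are annotated with it
    (selection without replacement)."""
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Slide values: N = 8 genes, K = 3 annotated, n = 2 selected, k = 1 annotated hit.
p_gene_based = hypergeom_tail(k=1, n=2, K=3, N=8)
print(round(p_gene_based, 4))  # 18/28
```

With only eight genes the result is far from significant, which is expected for a toy example. The point is the shape of the test, which counts genes and ignores how much genomic sequence surrounds each gene.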

We have (reduced ChIP-Seq into) a gene list! What is the gene list enriched for? Pro: A lot of tools are out there for the analysis of gene lists. Con: These tools are built for microarray analysis. Does it matter? (Figure labels: microarray tool; microarray data; gene regulation data.)

SRF gene-based enrichment results. SRF acts on genes both in nucleus and cytoplasm that are involved in transcription and various types of binding. Original authors can only state: “basic cellular processes, particularly those related to gene expression” are enriched [1]. Where’s the signal? Top “actin” term is ranked #28 in the list.
[1] Valouev A. et al., Nat. Methods, 2008

Associating only proximal peaks loses a lot of information. Restricting to proximal peaks often leads to complete loss of key enrichments. (Figure: relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets.)

Bad Solution: Associating distal peaks brings in many false enrichments. The SRF ChIP-seq set has >2,000 binding events. Throw a random set of 2,000 regions at the genome. What do you get from a gene list analysis?
Term                                    Bonferroni corrected p-value
nervous system development              5×10^-9
system development                      8×10^-9
anatomical structure development        7×10^-8
multicellular organismal development    1×10^-7
developmental process                   2×10^-6
Why bad? 14% of human genes are tagged ‘multicellular organismal development’, but 33% of base pairs have such a gene nearest upstream/downstream. Large “gene deserts” are often next to key developmental genes.

Real Solution: Do not convert to gene list. Analyze the set of genomic regions. For an ontology term (e.g. ‘actin cytoskeleton’):
p = 0.33 of genome annotated with the term
n = 6 genomic regions
k = 5 genomic regions hit annotation
P = Pr_binom(k ≥ 5 | n = 6, p = 0.33)
Since 33% of base pairs are near a ‘multicellular organismal development’ gene, we now expect 33% of genomic regions to hit this term by chance. => Toss 2,000 random regions at the genome, get NO (false) enrichments. GREAT = Genomic Regions Enrichment of Annotations Tool. (Figure labels: gene transcription start site; gene regulatory domain; genomic region (ChIP-seq peak).)
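The region-based binomial p-value from this slide can be reproduced directly. The sketch below is a minimal standard-library version with an assumed function name. It only illustrates the formula and is not GREAT's actual implementation.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance that at least k of n genomic
    regions land in the fraction p of the genome annotated with the term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Slide values: p = 0.33 of the genome annotated, n = 6 regions, k = 5 hits.
p_region_based = binom_tail(k=5, n=6, p=0.33)
print(f"{p_region_based:.4f}")
```

Because the null expectation is set by the annotated fraction of base pairs rather than the annotated fraction of genes, terms whose genes sit next to large gene deserts are no longer rewarded just for covering more sequence.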

How does GREAT know how to assign distal binding peaks to genes?
Currently: Flexible computational definitions allow assignment of peaks to nearest gene, nearest two genes, etc. Default: each gene has a “basal regulatory domain” of 5 kb upstream and 1 kb downstream of the transcription start site, which extends to the basal domain of the nearest genes within 1 Mb.
Future: High-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms.
Though some associations may be missed or incorrect, in general signal richness and robustness are greatly improved by associating distal peaks.
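The default association rule can be sketched for a single chromosome as follows. This is an illustration under stated assumptions, not GREAT's code: the class and function names and the toy genes are hypothetical, the 1 Mb extension is measured from the TSS, and neighbours are taken in TSS order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    tss: int     # transcription start site coordinate
    strand: str  # "+" or "-"


def basal_domain(gene, upstream=5_000, downstream=1_000):
    """Basal domain: 5 kb upstream and 1 kb downstream of the TSS (strand-aware)."""
    if gene.strand == "+":
        return gene.tss - upstream, gene.tss + downstream
    return gene.tss - downstream, gene.tss + upstream


def regulatory_domains(genes, max_extension=1_000_000):
    """Extend each basal domain toward its neighbours' basal domains,
    but never more than max_extension from the gene's own TSS (assumption)."""
    ordered = sorted(genes, key=lambda g: g.tss)
    basal = [basal_domain(g) for g in ordered]
    domains = {}
    for i, gene in enumerate(ordered):
        start, end = basal[i]
        left_limit = gene.tss - max_extension
        if i > 0:
            left_limit = max(left_limit, basal[i - 1][1])
        right_limit = gene.tss + max_extension
        if i + 1 < len(ordered):
            right_limit = min(right_limit, basal[i + 1][0])
        # A gene always keeps at least its own basal domain.
        domains[gene.name] = (max(0, min(start, left_limit)), max(end, right_limit))
    return domains


def genes_for_peak(position, domains):
    """Names of all genes whose regulatory domain contains the peak position."""
    return sorted(n for n, (s, e) in domains.items() if s <= position < e)


# Hypothetical toy chromosome with three genes.
genes = [Gene("A", 10_000, "+"), Gene("B", 500_000, "-"), Gene("C", 3_000_000, "+")]
domains = regulatory_domains(genes)
```

In this toy layout a peak at position 300,000 falls between genes A and B and is associated with both, while a peak at 1,800,000 lies more than 1 Mb from every TSS and is associated with no gene.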

GREAT infers many specific functions of SRF from its binding profile. Top GREAT enrichments of SRF, shown against the top gene-based enrichments of SRF (top actin-related term 28th in list). Columns: Ontology, Term, # Genes, Binomial P-value, Experimental support*.
Gene Ontology: actin cytoskeleton; actin binding (P-values: 7x x10 -5; support: Miano et al)
Pathway Commons: TRAIL signaling; Class I PI3K signaling (P-values: 5x x10 -6; support: Bertolotto et al, Poser et al)
TreeFam: FOS gene family (P-value: 1x; support: Chai & Tarnawski 2002)
TF Targets: Targets of SRF; Targets of GABP; Targets of YY1; Targets of EGR1 (P-values: 5x x x x10 -4; support: positive control, ChIP-Seq support, Natesan & Gilman)
* Known from literature: the function is known, SOME of the genes are known, and the binding sites highlighted are NOT.
Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq [McLean et al., Nat Biotechnol., 2010]

Limb P300: I was blind and I can see. (Figure label: Gene List.)

GREAT works with ANY cis-regulatory rich set. Example: GWAS Compendium set, height-associated unlinked SNPs.

GREAT analysis of histone mark combinations

GREAT includes multiple ontologies. Twenty ontologies spanning broad categories of biology. 44,832 total ontology terms tested in each GREAT run. Terms per ontology: (2,800 terms) (5,215) (834) (5,781) (427) (456) (150) (1,253) (288) (706) (6,700) (3,079) (911) (615) (19) (222) (9) (6,857) (8,272) (238). Michael Hiller
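Testing 44,832 terms in one run means raw p-values need a multiple-testing correction. An earlier slide reports Bonferroni corrected p-values, so the sketch below shows that correction. The significance level of 0.05 and the function name are assumptions for illustration.

```python
def bonferroni(p_values, n_tests):
    """Bonferroni correction: multiply each raw p-value by the number of
    tests performed, capping the result at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


N_TERMS = 44_832  # total ontology terms tested per GREAT run (from the slide)
ALPHA = 0.05      # assumed significance level, not stated in the lecture

# Raw p-value a single term must beat to stay significant after correction.
raw_threshold = ALPHA / N_TERMS

# Two hypothetical raw p-values, corrected for all 44,832 tests.
adjusted = bonferroni([1e-7, 0.5], N_TERMS)
```

Under these assumptions a term needs a raw p-value of roughly 1.1×10^-6 or smaller to survive the correction.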

Advantages of the GREAT approach. Tailored to the biology of gene regulation:
Distal sites are incorporated, not ignored
Variable length gene regulatory domains
Multiple bindings next to same target gene rewarded
Binding sites associated to (both) TSS, not gene body
Extensive ontologies, some home-made

Algorithmic Optimization: A. it works; B. make it efficient

Enter GREAT.stanford.edu. Choose genome. Input peak list. Hit submit!

GREAT web app: (Optional): alter association rules. Three association rule choices. Literature-curated domains for a small subset of genes. (Figure: Lnp, Evx2, HoxD cluster [adapted from Spitz, Gonzalez, & Duboule, Cell, 2003].)

GREAT web app: output summary. Ontology-specific enrichments. Additional ontologies, term statistics, multiple hypothesis corrections, etc. Cool visualization opportunities!

GREAT web app: term details page. Genes annotated as “actin binding” with associated genomic regions. Genomic regions annotated with “actin binding”. Drill down to explore how a particular peak regulates Plectin and its role in actin binding.

You can also submit any track straight from the UCSC Table Browser. A simple, well documented programmatic interface allows any tool to submit directly to GREAT. (See our Help / Inquiries welcome!)

GREAT web app: export data. HTML output displays all user selected rows and columns. Tab-separated values also available for additional postprocessing.

GREAT Web Stats: job submissions per day, from 8,000 IP addresses. Over 175,000 jobs served, hundreds of citations.

Adding a new species to GREAT. We need:
1. A good assembly
2. A high quality gene set
3. Good gene annotations*
*Most valuable for species with independent annotations!

Test case: early neocortex development (E12.5, E14.5, E16.5). Dissected the dorsal cerebral wall. We performed p300, H3K27Ac ChIP-seq. What have we learned?

Collect, Call Peaks: Mostly Distal. Did the experiment work?

In silico validation. Most enriched terms: what you want, where & when you want it

Are all cell populations represented? (Figure labels: CP; SVZ-IZ; VZ.) [Ayoub et al., PNAS. 2011]

What are different enhancer groups doing?

Which genes are most regulated? [Ayoub et al, 2011]

The most densely regulated genes

Mn1 – Cryba4 gene desert

ChIP-seq is a powerful technique: crosslink, fragment, antibody immunoprecipitation, high throughput sequencing, map to genome + find peaks.
ChIP-seq lessons:
Transcription factors (TF) bind next to many relevant target genes. Example: SRF regulates actin cytoskeleton. Binds near 40 actin genes in immune cells.
Most binding is distal. Example: 75% of SRF binding sites near actin genes are well over 1 kb from the TSS (up to 300 kb).
Binding is context specific. Example: SRF also regulates muscle development. SRF is not enriched near muscle genes when assayed in immune cells.
Many TF functions are yet undiscovered.

How can we discover new TF functions?
ChIP-seq requires: quality antibody; enough cells of the right context; sequencing costs. => Seldom used in exploratory mode (ENCODE sampled 100 mostly basal TFs in very few contexts).
TF function exploration: expression perturbations, genetic screens. Which TF next? Where to look? Educated guesses.
Exciting new technologies: designer genome editing. Which TF to knock out next? Where to look for its effects?
Goal: Devise a rapid method for TF function prediction

How can we discover new TF functions? Observations:
1. Gene function annotation is very rich. (Examples: GO, MGI expression & phenotype, etc.)
2. Conserved non-coding, likely cis-regulatory, sequence makes up 5-10% of mammalian genomes.
3. Conserved binding site prediction has become accurate. (Predict only with strong evidence, minimize false positives.)
Idea: Use the tendency of transcription factors to bind next to many context-relevant genes to make function predictions.

Transcription Factor Function Prediction. Example: predict SRF to regulate actin cytoskeleton, muscle development.
1. Curate a rich library of TF binding site motifs. Pro: hundreds are now known (inc. from Tim, Jussi’s work).
2. Predict cross-species conserved binding sites. Pro: careful conserved binding site prediction is accurate.
3. Search for extreme binding site concentration next to genes of any particular function. Pro: Leverage the observed phenomenon, and the very large body of knowledge about all (target) genes in the genome. Con: You will not get all binding sites of any function. Pro: You may predict many diverse TF functions.

This is what we did: 1. Build motif library. 332 motifs for 300 transcription factors (updating for many more).

2. Predict conserved binding sites. (Figure: multiple alignment of the human sequence TTTCCCTTAAAAGGCTTAAATAAACTCACCAGTGTTTAATT against Chimp, Gorilla, Orangutan, Rhesus, Tarsier, Mouse lemur, Bushbaby, Tree shrew, Mouse, Guinea pig, Squirrel, Rabbit, Alpaca, Cow, Cat, Microbat, Megabat, Hedgehog, Rock hyrax, Tenrec, Armadillo and Sloth; dots mark bases that are the same as human.)
We in fact allow: imperfect motif matches; binding site / alignment wobble; subset of species support.
Guard against alignment fragmentations. Predict efficiently. Improve state of the art using “Excess conservation” scoring.

Compare to shuffled motifs & weed out! SRF motif vs. shuffle #1, shuffle #2, shuffle #3, shuffle #4, …, shuffle #10
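The slide does not say how the shuffled control motifs are built. One common way, assumed here purely for illustration, is to permute the columns of the motif's position weight matrix, which preserves its base composition and length while destroying the binding specificity. The toy matrix below is hypothetical and is not the real SRF motif.

```python
import random


def shuffled_motifs(pwm, n_shuffles=10, seed=0):
    """Return n_shuffles column-permuted copies of a position weight matrix.

    pwm: list of columns, each a tuple of (A, C, G, T) probabilities.
    Column shuffling is an assumed control scheme, not the lecture's stated one.
    """
    rng = random.Random(seed)  # fixed seed keeps the controls reproducible
    controls = []
    for _ in range(n_shuffles):
        columns = list(pwm)
        rng.shuffle(columns)
        controls.append(columns)
    return controls


# Hypothetical 5-column toy motif; each column sums to 1.
toy_pwm = [
    (0.7, 0.1, 0.1, 0.1),
    (0.1, 0.7, 0.1, 0.1),
    (0.1, 0.1, 0.7, 0.1),
    (0.1, 0.1, 0.1, 0.7),
    (0.25, 0.25, 0.25, 0.25),
]
controls = shuffled_motifs(toy_pwm)
```

Each control can then be run through the same binding site prediction and enrichment steps as the real motif, so that enrichments which also appear for the shuffles can be weeded out. A shuffle may occasionally reproduce the original column order, which a real pipeline would need to reject.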

Use three reference genomes rich in gene function annotations. Predicted for human, for mouse and for zebrafish.

3. Predict binding site/TF functions. Enhancer to gene association => SRF must regulate muscle structure development: predicted to bind next to 157 genes (p=7.43× ). (Figure labels: gene transcription start site; SRF binding site; genes BMP4, MYOG, MYF6, NKX2-5, ACTA1, CAV3.)

Control false positives when predicting function for hundreds of transcription factors: conservative multiple testing correction. Compare the SRF motif with shuffle #1, shuffle #2, shuffle #3, shuffle #4, …, shuffle #10. Remove terms with E[occurrences] ≥ 1. Terms shown: kidney morphogenesis, absent semicircular canals, dorsal spinal cord development, … Retains only 5% of GREAT predictions. 2,543 transcription factor to function links (16% FDR).

PRISM vs. ChIP-seq. Table columns: Term, PRISM, ChIP-seq.
actin cytoskeleton: Known
structural constituent of muscle: Known
dilated heart ventricles: Known
regulation of insulin secretion: Novel
(Figure labels: SRF, T cell; SRF, heart muscle; actin cytoskeleton.)
Every known function is supported by dozens or hundreds of novel binding sites.

PRISM re-discovers known functions. Table columns: TF, function, p-value, target genes.
SRF     muscle structure development          7.43×
GLI2    skeletal system development           7.07×
CRX     retinal photoreceptor degeneration    1.30×
AR      abnormal spermiogenesis               1.19×
Is the number of re-discovered known functions impressive?

PRISM re-discovers many known functions. (Figure: SRF and genes involved in “muscle structure development”.)

Distal binding sites are important 3.6×

PRISM predicts many novel functions. Table columns: TF, function, p-value, target genes.
MYF6     abnormal pancreas development       1.67×
GABPA    transcription by RNA polymerase I   3.64×
GATA6    abnormal pancreas development       5.69×
…
Function prediction word cloud (word size correlates with its frequency in our function predictions). Nature Genetics, Dec 2011

PRISM works: Experimental validation. Table columns: TF, function, p-value, target genes.
MYF6     abnormal pancreas development       1.67×

PRISM is built for the experimentalist. Search: Human, Mouse, or Zebrafish. Search for:
Transcription factor: e.g. “MYOG”,
Function: e.g. “insulin”,
Target gene: e.g. “ACTA1”, or
Target genomic region: e.g. 9p21