CS173 Lecture 14: Personal Genomics, GSEA/GREAT

Name: CS173 Lecture 14: Personal Genomics, GSEA/GREAT
Uploaded: 2017-11-24T16:19:05+00:00
Duration: PTM38S5
Channel: Camilla Chandler
Description: CS173 Lecture 14: Personal Genomics, GSEA/GREAT

CS173 Lecture 14: Personal Genomics, GSEA/GREAT
MW 11:00-12:15 in Beckman B302. Prof: Gill Bejerano. TAs: Jim Notwell & Harendra Guturu. [BejeranoWinter12/13]

Announcements
The coming Monday (3/4) lecture is again in LK101 (see the class website for room reminders). I’ll be working on grad student admissions, so Harendra will lecture about his work (we’ll prepare the ground today).

Quick recap

Sequencing
[Figures: public project; Celera project]

Human Structural Variation

Human Disease
Cancer; congenital defects; disease association studies.
Genic and cis-regulatory contributions.

Personal genomics

Gameplan
1. As your budget allows, characterize all the variants in an individual’s genome:
   against the reference genome;
   against variants known in the population;
   if possible, against unaffected relatives.
2. Compare the structural variants you observe to the body of knowledge about genome content & function. Seek culprit mutations.
3. Having detected a smoking gun mutation, attempt to recreate it in a cell population or organism to obtain a “disease model”.
Variant Types: Single Nucleotide Variants (SNVs), Small Insertions / Deletions (indels), Copy Number Variants (CNVs), Structural Variants (SVs), Novel Sequence.

Targeted Sequencing, or looking under the lamp, is 50x cheaper
Capture Methods vs. Shotgun [Figure labels: Exome Library; Shotgun; Genomic DNA; Exon 1; Exon 2]
Targeted sequencing refers to methods that enrich for a certain portion of the genome prior to the actual sequencing procedure.
They use biotinylated oligo baits attached to streptavidin beads.
Targeted sequencing allows for much higher coverage at less cost.
It will only capture known sites.
These methods also introduce significant capture bias, including failure to capture sites that differ significantly from the reference genome (analogous to microarrays).
The problem is that a large amount of sequence is needed for accurate SNV calls, regardless of the model.
Modified from Meyerson et al., Advances in understanding cancer genomes through second-generation sequencing. Nature Reviews Genetics 11, no. 10 (October):

Consumer genomics

Gameplan
1. Collect scientific literature about all structural variant correlations with human disease & traits.
2. Genotype customers for as many informative loci as is commercially viable.
3. Offer counseling for your findings, and their meaning.
4. Ask customers to phenotype themselves.
5. Discover new associations!

Pay, send biosample, get genotyped

Trait associations

Disease Risk Alleles
http://cs173.stanford.edu

Side Effects: Serious Ethical Issues

Gene set enrichment analysis: The genic version

Imagine you did a microarray experiment

Cluster all genes for differential expression
[Figure: Experiment (replicates) vs. Control (replicates); genes ordered from most significantly up-regulated, through unchanged, to most significantly down-regulated]

Determine cut-offs, examine individual genes
[Figure: Experiment (replicates) vs. Control (replicates); genes ordered from most significantly up-regulated, through unchanged, to most significantly down-regulated]

Genes usually work in groups
Biochemical pathways, signaling pathways, etc. Asking about the expression perturbation of groups of genes is both more appealing biologically, and more powerful statistically (you sum perturbations).

Ask about whole gene sets
[Figure: Experiment vs. Control; ES/NES statistic from + to -; gene set 3 up-regulated; gene set 2 down-regulated]

One approach: GSEA
[Figure: dataset distribution vs. gene set 3 distribution; axes: number of genes, gene expression level]

Another popular approach: DAVID
Input: list of genes of interest (without expression values).

Multiple Testing Correction
Note that statistically you cannot just run individual tests on 1,000 different gene sets. You have to apply further statistical corrections, to account for the fact that even in 1,000 random experiments a handful may come out significant by chance. (E.g., experiment = toss a coin 10 times and ask if it is biased. A fair coin gives all heads in 1 of every 1,024 such runs, so if you repeat the experiment 1,000 times you will quite likely see an all-heads series. You mustn’t deduce that the coin is biased.)
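The coin example can be checked with a few lines of arithmetic. The sketch below is only an illustration of the argument: it computes the chance of an all-heads run from a fair coin, how often such a run shows up across 1,000 repetitions, and a Bonferroni-corrected threshold. The significance level of 0.05 is an assumed convention, not something the slide specifies.

```python
# Illustrative sketch of the coin example; alpha = 0.05 is an assumed convention.
n_flips = 10
n_experiments = 1000
alpha = 0.05

# Probability that a fair coin gives all heads in one run of 10 flips.
p_all_heads = 0.5 ** n_flips

# Expected number of all-heads runs among 1,000 independent runs.
expected_runs = n_experiments * p_all_heads

# Probability of seeing at least one all-heads run among the 1,000 runs.
p_at_least_one = 1 - (1 - p_all_heads) ** n_experiments

# Bonferroni correction: divide the significance level by the number of tests.
bonferroni_threshold = alpha / n_experiments

print(f"P(all heads in one run)  = {p_all_heads:.6f}")
print(f"Expected all-heads runs  = {expected_runs:.2f}")
print(f"P(at least one in 1,000) = {p_at_least_one:.3f}")
print(f"Bonferroni threshold     = {bonferroni_threshold:.5f}")
print("Single all-heads run significant after correction?",
      p_all_heads < bonferroni_threshold)
```

Taken alone, one all-heads run looks significant at 0.05, but it does not pass the corrected threshold, which is the point of the slide.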

What will you test?
Also note that this is a very general approach to test gene lists. Instead of a microarray experiment you can do RNA-seq. Instead of up/down-regulated genes you can test all the genes in a personal genome where you see surprising mutations. Any gene list can be tested.

Gene Sets: Cataloging biological knowledge

Keyword lists are not enough
Anatomy keywords:
The sheer number of terms is too much to remember and sort (a question of scale).
Need standardized, stable, carefully defined terms.
Need to describe domains at different levels of detail.
So defined terms need to be related in a hierarchy.
With structured vocabularies/hierarchies:
Parent/child relationships exist between terms.
Increased depth -> increased resolution.
Can annotate data at the appropriate level.
May query at the appropriate level.
Anatomy hierarchy example: organ system > cardiovascular system > heart > …
And thus we started the GO.

Annotate genes to most specific terms
TJL-2004

General Implementations for Vocabularies
Hierarchy example (anatomy): organ system, embryo, cardiovascular, heart, …
DAG example (molecular function): enzyme regulator, enzyme activator, chaperone regulator, chaperone activator.
(A DAG is a directed acyclic graph: unlike in a strict hierarchy, a term may have more than one parent.)
Query for a term: returns things annotated to its descendants.
What does structure buy you?
1. Annotate at the appropriate level, query at the appropriate level.
2. Queries for higher-level terms include annotations to lower-level terms.
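The "query returns descendants" behaviour can be sketched in a few lines. The example below is a toy under stated assumptions: the parent links between the four molecular function terms and the gene names `geneA`, `geneB`, `geneC` are hypothetical, chosen only to show a term with two parents, and are not taken from the real Gene Ontology.

```python
# Hypothetical toy ontology: child term -> list of parent terms.
# In a DAG (directed acyclic graph) a term may have more than one parent.
PARENTS = {
    "enzyme regulator": ["molecular function"],
    "chaperone regulator": ["enzyme regulator"],
    "enzyme activator": ["enzyme regulator"],
    "chaperone activator": ["chaperone regulator", "enzyme activator"],
}

# Hypothetical annotations: each gene is annotated to its most specific term.
ANNOTATIONS = {
    "geneA": ["chaperone activator"],
    "geneB": ["enzyme activator"],
    "geneC": ["chaperone regulator"],
}


def ancestors(term):
    """Return the term itself plus every term reachable by following parents."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def query(term):
    """Return genes annotated to the term or to any of its descendants."""
    return sorted(
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(term in ancestors(t) for t in terms)
    )


print(query("enzyme activator"))
print(query("enzyme regulator"))
```

Genes are stored only at their most specific term; a query on a higher-level term picks them up by walking the parent links, so annotations never need to be duplicated up the graph.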

Gene Sets
Gene Ontology (“GO”): Biological Process, Molecular Function, Cellular Location.
Pathway Databases: KEGG, BioCarta, Broad Institute.

Other Gene Sets
Transcription factor targets: all the genes regulated by particular TFs.
Protein complex components: sets of genes whose protein products function together (ion channel receptors, RNA / DNA polymerase).
Paralogs: families of genes descended (in eukaryotic times) from a common ancestor.

Natural Language Processing (NLP) Opportunities
Map genes to ontology using literature.
[Figure labels: Ontology; Literature; Genes]

Gene set enrichment analysis: The gene regulatory version

Combinatorial Regulatory Code
2,000 different proteins can bind specific DNA sequences. A regulatory region encodes 3-10 such protein binding sites. When all are bound by proteins the regulatory region turns “on”, and the nearby gene is activated to produce protein.
[Figure labels: DNA; proteins; protein binding site; gene]

ChIP-Seq: first glimpses of the regulatory genome in action
[Figure: peak calling; cis-regulatory peak]

What is the transcription factor I just assayed doing?
Collect known literature of the form:
Function A: Gene1, Gene2, Gene3, ...
Function B: Gene1, Gene2, Gene3, ...
Function C: ...
Ask whether the binding sites you discovered are preferentially binding (regulating) any one or more of the functions listed above. Form a hypothesis and perform further experiments.
[Figure labels: cis-regulatory peak; gene transcription start site]

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile
ChIP-seq identified 2,429 SRF binding peaks in human Jurkat cells [1] (Jurkat: human T cell lymphoblast-like cell line).
SRF is known as a “master regulator of the actin cytoskeleton”.
In the ChIP-seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation.
Description: serum response factor (c-fos serum response
RefSeq Summary (NM_003131): This gene encodes a ubiquitous nuclear protein that stimulates both cell proliferation and differentiation. It is a member of the MADS (MCM1, Agamous, Deficiens, and SRF) box superfamily of transcription factors. This protein binds to the serum response element (SRE) in the promoter region of target genes. This protein regulates the activity of many immediate-early genes, for example c-fos, and thereby participates in cell cycle regulation, apoptosis, cell growth, and cell differentiation. This gene is the downstream target of many pathways; for example, the mitogen-activated protein kinase pathway (MAPK) that acts through the ternary complex factors (TCFs). [provided by RefSeq].
[Figure legend: gene transcription start site; SRF binding ChIP-seq peak]

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile
Existing, gene-based method to analyze enrichment: ignore distal binding events, count affected genes, rank by enrichment hypergeometric p-value.
Toy example, for an ontology term π (e.g. ‘actin cytoskeleton’):
N = 8 genes in genome
K = 3 genes annotated with π
n = 2 genes selected by proximal peaks
k = 1 selected gene annotated with π
P = Pr(k ≥ 1 | n = 2, K = 3, N = 8)
[Figure legend: gene transcription start site; SRF binding ChIP-seq peak; ontology term π]
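The toy hypergeometric p-value above can be worked out directly. The sketch below uses only the Python standard library; the function name `hypergeom_tail` is an illustrative choice, and the four numbers are the ones on the slide.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation."""
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Toy numbers from the slide: 8 genes, 3 annotated, 2 selected, at least 1 hit.
p_value = hypergeom_tail(N=8, K=3, n=2, k=1)
print(f"P = {p_value:.4f}")
```

With these numbers, 18 of the 28 possible gene pairs contain at least one annotated gene, so the toy p-value is 18/28 and the term would not count as enriched.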

We have (reduced ChIP-Seq into) a gene list! What is the gene list enriched for?
Pro: There are a lot of tools out there for the analysis of gene lists.
Con: These tools are built for microarray analysis. Does it matter?
[Figure labels: microarray data; gene regulation data; microarray tool]

SRF Gene-based enrichment results
Original authors can only state: “basic cellular processes, particularly those related to gene expression” are enriched [1].
SRF acts on genes both in nucleus and cytoplasm, that are involved in transcription and various types of binding.
Where’s the signal? Top “actin” term is ranked #28 in the list.
[1] Valouev A. et al., Nat. Methods, 2008

Associating only proximal peaks loses a lot of information
[Figure: relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets]
Restricting to proximal peaks often leads to complete loss of key enrichments.

Bad Solution: Associating distal peaks brings in many false enrichments
Why bad? 14% of human genes are tagged ‘multicellular organismal development’, but 33% of base pairs have such a gene nearest upstream/downstream. Large “gene deserts” are often next to key developmental genes.
The SRF ChIP-seq set has >2,000 binding events. Throw a random set of 2,000 regions at the genome. What do you get from a gene list analysis?
Term                                    Bonferroni corrected p-value
nervous system development              x10^-9
system development                      x10^-9
anatomical structure development        x10^-8
multicellular organismal development    1x10^-7
developmental process                   x10^-6

Real Solution: Do not convert to gene list
Analyze the set of genomic regions.
GREAT = Genomic Regions Enrichment of Annotations Tool
Toy example, for an ontology term π (e.g. ‘actin cytoskeleton’):
p = 0.33 of genome annotated with π
n = 6 genomic regions
k = 5 genomic regions hit annotation
P = Pr_binom(k ≥ 5 | n = 6, p = 0.33)
The fraction of the genome resulting in the annotation is explicitly used in the enrichment calculation.
Since 33% of base pairs are near a ‘multicellular organismal development’ gene, we now expect 33% of genomic regions to hit this term by chance. => Toss 2,000 random regions at the genome, get NO (false) enrichments.
[Figure legend: gene transcription start site; gene regulatory domain; genomic region (ChIP-seq peak); ontology term π]
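The toy binomial test above can be evaluated the same way. This is a minimal sketch of the region-based calculation as the slide states it, not GREAT's actual code; the function name `binom_tail` is an illustrative choice.

```python
from math import comb


def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p): each of n regions independently
    lands in the annotated fraction p of the genome."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Toy numbers from the slide: 6 regions, 5 hit a term covering 33% of the genome.
p_value = binom_tail(n=6, p=0.33, k=5)
print(f"P = {p_value:.4f}")

# Null expectation for 2,000 random regions against the same term.
expected_random_hits = 2000 * 0.33
print(f"Expected hits among 2,000 random regions = {expected_random_hits:.0f}")
```

Because the annotated fraction of the genome enters the test as p, a term that covers a third of the genome needs far more than a third of the regions before it looks enriched.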

How does GREAT know how to assign distal binding peaks to genes?
Future: High-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms.
Currently: Flexible computational definitions allow assignment of peaks to the nearest gene, nearest two genes, etc.
Default: each gene has a “basal regulatory domain” of 5 kb upstream and 1 kb downstream of its transcription start site, which extends to the basal domain of the nearest genes, within 1 Mb.
Though some associations may be missed or incorrect, in general signal richness and robustness is greatly improved by associating distal peaks.
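The default rule can be sketched as follows. This is a simplified reading of the slide, not GREAT's implementation: it handles a single chromosome, measures the 1 Mb cap from the transcription start site (an assumption), and uses only the immediate neighbours in TSS order. The `Gene` class, function names, and coordinates are all hypothetical.

```python
from dataclasses import dataclass

BASAL_UP = 5_000           # bases upstream of the TSS
BASAL_DOWN = 1_000         # bases downstream of the TSS
MAX_EXTENSION = 1_000_000  # cap on extension, measured from the TSS (assumption)


@dataclass
class Gene:
    name: str
    tss: int
    strand: str  # "+" or "-"


def basal_domain(gene):
    """Basal domain as (start, end); upstream/downstream depend on strand."""
    if gene.strand == "+":
        return gene.tss - BASAL_UP, gene.tss + BASAL_DOWN
    return gene.tss - BASAL_DOWN, gene.tss + BASAL_UP


def regulatory_domains(genes):
    """Extend each basal domain toward the basal domains of its immediate
    neighbours (in TSS order), capped at MAX_EXTENSION from the TSS."""
    genes = sorted(genes, key=lambda g: g.tss)
    basal = [basal_domain(g) for g in genes]
    domains = {}
    for i, gene in enumerate(genes):
        start, end = basal[i]
        left_limit = gene.tss - MAX_EXTENSION
        right_limit = gene.tss + MAX_EXTENSION
        if i > 0:
            left_limit = max(left_limit, basal[i - 1][1])
        if i < len(genes) - 1:
            right_limit = min(right_limit, basal[i + 1][0])
        # Never shrink below the basal domain; clamp at the chromosome start.
        domains[gene.name] = (max(0, min(start, left_limit)), max(end, right_limit))
    return domains


def genes_for_peak(position, domains):
    """Names of all genes whose regulatory domain contains the peak position."""
    return sorted(name for name, (s, e) in domains.items() if s <= position < e)


genes = [Gene("A", 100_000, "+"), Gene("B", 500_000, "-"), Gene("C", 3_000_000, "+")]
domains = regulatory_domains(genes)
print(domains)
print(genes_for_peak(300_000, domains))
print(genes_for_peak(1_800_000, domains))
```

In this toy layout a peak lying between genes A and B falls inside both regulatory domains and is associated with both genes, while a peak more than 1 Mb from every TSS is associated with none.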

GREAT infers many specific functions of SRF from its binding profile
Top GREAT enrichments of SRF:
Ontology          Term                     # Genes   Binomial P-value
Gene Ontology     actin cytoskeleton       30        7x10^-9
Gene Ontology     actin binding            31        5x10^-5
Pathway Commons   TRAIL signaling          32        5x10^-7
Pathway Commons   Class I PI3K signaling   26        2x10^-6
TreeFam           FOS gene family          5         1x10^-8
TF Targets        Targets of SRF           84        5x10^-76
TF Targets        Targets of GABP          28        4x10^-9
TF Targets        Targets of YY1           44        1x10^-6
TF Targets        Targets of EGR1          23        2x10^-4
Experimental support*, by ontology: Gene Ontology: Miano et al. 2007. Pathway Commons: Bertolotto et al. 2000; Poser et al. 2000. TreeFam: Chai & Tarnawski 2002. TF Targets: positive control; ChIP-Seq support; Natesan & Gilman 1995.
Top gene-based enrichments of SRF, for comparison: top actin-related term is 28th in the list.
SRF is “master regulator of the actin cytoskeleton.”
SRF is key regulator of FOS oncogene and has been shown to act in conjunction with YY1 to regulate FOS.
Demonstrated associations between SRF and TRAIL signaling.
SRF is needed for PI3K-dependent cell proliferation.
cFOS and FOSB are known targets of SRF.
* Known from literature: the function is known, SOME of the genes are known, and the binding sites highlighted are NOT.
Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq [McLean et al., Nat Biotechnol., 2010]

Limb P300: I was blind and I can see
[Figure labels: Gene List; fraction of genome resulting in annotation explicitly used in enrichment calculation]

GREAT works with ANY cis-regulatory rich set
Example: GWAS Compendium set
[Figure labels: height-associated unlinked SNPs; fraction of genome resulting in annotation explicitly used in enrichment calculation]

GREAT analysis of histone mark combinations

GREAT includes multiple ontologies
Twenty ontologies spanning broad categories of biology.
44,832 total ontology terms tested in each GREAT run.
[Figure: number of terms per ontology: 2,800; 6,700; 5,215; 3,079; 834; 911; 5,781; 615; 427; 19; 456; 222; 9; 150; 1,253; 6,857; 288; 8,272; 706; 238]
Michael Hiller

Advantages of the GREAT approach
Tailored to the biology of gene regulation:
Distal sites are incorporated, not ignored.
Variable-length gene regulatory domains.
Multiple bindings next to the same target gene are rewarded.
Extensive ontologies, some home-made.

Algorithmic Optimization: (A) it works; (B) make it efficient

enter GREAT.stanford.edu
Choose genome. Input peak list. Hit submit!

GREAT web app: (Optional) alter association rules
Three association rule choices.
Literature-curated domains for a small subset of genes.
[Figure: Lnp, Evx2, HoxD cluster; adapted from Spitz, Gonzalez, & Duboule, Cell, 2003]

GREAT web app: output summary
Ontology-specific enrichments. Additional ontologies, term statistics, multiple hypothesis corrections, etc. Cool visualization opportunities!

GREAT web app: term details page
Genes annotated as “actin binding” with associated genomic regions. Genomic regions annotated with “actin binding”. Drill down to explore how a particular peak regulates Plectin and its role in actin binding.

You can also submit any track straight from the UCSC Table Browser
A simple, well documented programmatic interface allows any tool to submit directly to GREAT. (See our Help. Inquiries welcome!)

GREAT web app: export data
HTML output displays all user-selected rows and columns. Tab-separated values are also available for additional postprocessing.

GREAT Web Stats
200-400 job submissions per day, from 7,000 IP addresses.

Adding a new species to GREAT
We need: a good assembly; a high quality gene set; good gene annotations*.
*Most valuable for species with independent annotations!

Adapting GREAT for zebrafish
We need: a good assembly; a high quality gene set; good gene annotations.
Assembly   # Scaffolds   Avg. Scaffold Length   # Assembly Gaps
Zv8        11,724        129 Kb                 ~55,000
Zv9        1,133         1,250 Kb               ~27,000
Zv9 = UCSC danRer7. Older assemblies? -> liftover to Zv9/danRer7.

We need: a good assembly; a high quality gene set; good gene annotations.
Carefully combine (95% identity, 80% coverage): RefSeq transcripts, Ensembl coding genes, RefSeq proteins, Uniprot proteins.
-> Obtain 14,567 genes, all with ZFIN gene identifiers.
Using only RefSeq would miss 1,912 annotated genes.
Using only Ensembl would miss 1,218 annotated genes.

We need: a good assembly; a high quality gene set; good gene annotations.
Curate zebrafish:
Gene Ontology (GO): Function, Process, Cellular Component
ZFIN Phenotype
Wiki Pathways
ZFIN Wildtype Expression
InterPro: protein domains, families and functional sites
TreeFam: gene families of paralogs

96% of our gene set is annotated
At least one gene is annotated with the term.
