Mining Public Data for Insights into Human Disease
11/16/2009
Baliga Lab Meeting
Chris Plaisier

Utility of Gene Expression for Human Disease

Microarray Technology

Big Picture

Data Access

Gene Expression Microarray Repositories
- Gene Expression Omnibus (GEO)
  - Hosted by: NCBI
  - Platform: All accepted
  - Normalization: Experiment-by-experiment basis
  - Access: R (GEOquery), EUtils
  - Meta-Information: GEOMetaDB
- ArrayExpress
  - Hosted by: EMBL
  - Platform: All accepted
  - Normalization: Experiment-by-experiment basis
  - Access: Web interface, EMBL API
  - Meta-Information: ? (API)
- Many smaller repositories have more phenotypic information for specific diseases
  - Phenotypic information may be hard to access
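As a rough illustration of the EUtils access route listed for GEO, the sketch below only builds an ESearch query URL against the GEO DataSets database (`db=gds`); it does not send the request. The helper name `geo_search_url` and the example search term are assumptions made for illustration, not part of the talk.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def geo_search_url(term, retmax=20):
    """Build an ESearch URL for the GEO DataSets database (db=gds).

    `term` is an Entrez query string; `retmax` caps the number of IDs returned.
    """
    params = {"db": "gds", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Hypothetical query: glioma records on platform GPL570 (HG-U133 Plus 2.0).
url = geo_search_url("glioma AND GPL570[ACCN]")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return an XML list of record IDs; that step is left out so the sketch runs without network access.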

Gene Expression Omnibus

Samples Per Platform in GEO
[Figure: samples per platform in GEO. Labels: HGU133 Plus 2.0, HGU133A, Latest 3’ Affymetrix Array]
Affymetrix arrays account for ~67% of human gene expression data in public repositories.

Affymetrix Probesets
- Probe: 25 nucleotides
- Probe pair: Perfect Match and Mismatch
- Probeset: 11 probe pairs
- GeneChip U133 Plus 2.0 Array: >54,000 probesets (image stored as CEL file)

Pre-Processing 101

Pre-Processing Gene Expression Data

Removing Mis-Targeted and Non-Specific Probes
- Normally: CEL file + CDF file -> intensities
  - CDF file comes from Affymetrix
- Alternative: CEL file + AltCDF file -> intensities
  - Alternative CDF file is thoroughly cleaned
- Zhang, et al. 2005

What Makes Cells Different?

PANP: Presence/Absence Filtering
- Use Negative Strand Matching Probesets (NSMPs) to determine the true background distribution
  - NSMPs are designed to hybridize to the strand opposite the expressed strand
- Use the background distribution from these NSMPs to threshold the entire dataset
- Output is a call for each gene on each array
  - Calls are: P = present, M = marginal, A = absent
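The thresholding idea can be sketched as follows. This is a simplified stand-in, not the PANP implementation: it uses a purely empirical background built from the NSMP intensities, and the function name `background_calls` and the 0.01 / 0.02 cutoffs are assumptions chosen for illustration.

```python
from bisect import bisect_left


def background_calls(intensities, nsmp_intensities, tight=0.01, loose=0.02):
    """Assign P/M/A calls using an empirical NSMP background distribution.

    The p-value of an intensity is the fraction of NSMP (background)
    intensities that are greater than or equal to it.
    """
    background = sorted(nsmp_intensities)
    n = len(background)
    calls = []
    for x in intensities:
        # bisect_left counts background values strictly below x.
        p = (n - bisect_left(background, x)) / n
        if p < tight:
            calls.append("P")  # well above background: present
        elif p < loose:
            calls.append("M")  # borderline: marginal
        else:
            calls.append("A")  # indistinguishable from background: absent
    return calls


# Toy background of 100 NSMP intensities (0..99) and three probesets.
print(background_calls([200, 99, 50], list(range(100))))
```

With this toy background, the three intensities are called P, M, and A respectively.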

Identifying Present Genes
- Filter out genes that are absent in ≥ 50% of arrays
  - Whole dataset
  - Subsets
- Only present genes are used in subsequent analyses
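A minimal sketch of the presence filter, assuming the calls are available as a mapping from gene to a list of P/M/A calls (one per array). The function name `present_genes` is hypothetical, and treating marginal (M) calls as not absent is an assumption.

```python
def present_genes(calls_by_gene, max_absent_fraction=0.5):
    """Keep genes whose fraction of 'A' (absent) calls is below the cutoff.

    Genes that are absent in >= 50% of arrays are filtered out.
    Marginal ('M') calls are counted as not absent (an assumption).
    """
    kept = []
    for gene, calls in calls_by_gene.items():
        absent = sum(1 for call in calls if call == "A")
        if absent / len(calls) < max_absent_fraction:
            kept.append(gene)
    return kept


calls = {
    "gene1": ["P", "P", "A", "M"],  # 25% absent -> kept
    "gene2": ["A", "A", "P", "P"],  # 50% absent -> filtered out
    "gene3": ["A", "A", "A", "P"],  # 75% absent -> filtered out
}
print(present_genes(calls))
```

To filter on a subset instead of the whole dataset, pass only the calls for the arrays in that subset.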

Removing Redundancy

Reason for Removing Redundancy Before Running

Removing Redundancy
- Collapse Affymetrix probeset IDs to Entrez IDs
- Test for correlation between probesets
  - If correlation is ≥ 0.8, combine the probesets
  - If not, leave them separate
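One way the collapsing rule could look in code is sketched below. Several details are assumptions, because the slide does not specify them: probesets for a gene are combined only if every pair has a Pearson correlation ≥ 0.8, combining means averaging the profiles, and uncombined probesets are labelled `gene|probeset`. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def collapse_probesets(expr, probe_to_gene, min_r=0.8):
    """Collapse probeset profiles to gene level where probesets agree."""
    by_gene = {}
    for probe, gene in probe_to_gene.items():
        by_gene.setdefault(gene, []).append(probe)

    collapsed = {}
    for gene, probes in by_gene.items():
        # A gene with a single probeset trivially passes this check.
        agree = all(
            pearson(expr[a], expr[b]) >= min_r
            for a, b in combinations(probes, 2)
        )
        if agree:
            profiles = [expr[p] for p in probes]
            collapsed[gene] = [sum(v) / len(v) for v in zip(*profiles)]
        else:
            for p in probes:
                collapsed[f"{gene}|{p}"] = expr[p]
    return collapsed


expr = {
    "p1": [1, 2, 3, 4],
    "p2": [2, 4, 6, 8],  # correlates with p1 -> combined
    "p3": [4, 3, 2, 1],
    "p4": [1, 2, 3, 4],  # anti-correlated with p3 -> kept separate
}
probe_to_gene = {"p1": "100", "p2": "100", "p3": "200", "p4": "200"}
result = collapse_probesets(expr, probe_to_gene)
print(sorted(result))
```

Constant profiles would cause a division by zero in `pearson`; a real pipeline would need to handle them, for example by removing them beforehand.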

Pre-Processing Pipeline
[Figure legend: pipeline steps are marked as implemented in R or implemented in Python]

Big Picture

Glioma: A Deadly Brain Cancer (image: Wikimedia Commons)

Brain Anatomy (image: Wikimedia Commons)

What do they do?

Neurophysiology

Hierarchy of Nervous Tissue Tumors

Glioma
WHO Grade | Tumor Type | Percentage of CNS Tumors
I | Pilocytic Astrocytoma | 9.8%
II | Diffuse or Low-Grade Astrocytoma |
III | Anaplastic Astrocytoma |
IV | Glioblastoma Multiforme | 20.3%
Gliomas account for 40% of all tumors and 78% of malignant tumors. Buckner et al., 2007

Glioma Survival
[Figure: glioma survival at 5 years and 10 years. Source: http://www.neurooncology.ucla.edu/]

Repository of Molecular Brain Neoplasia Data (REMBRANDT)
- REMBRANDT (Madhavan et al., 2009)
- Currently 257 individual specimens
  - Glioblastoma multiforme (GBM) = 110
  - Astrocytoma = 50
  - Oligodendroglioma = 55
  - Mixed = 21
  - Non-Tumor = 21
- Phenotypes
  - Tumor type: GBM, Astrocytoma, etc.
  - WHO Grade: 176 individuals
  - Age: 253 individuals
  - Sex: 250 individuals (partially inferred using Y chromosome genes)
  - Survival (days post diagnosis): 169 individuals

REMBRANDT: Chromosome Y Expression
[Figure: sex-specific gene expression, with samples grouped as Female and Male]
- Conversions of male to female should be more common than the reverse, because it is difficult for females to express Y chromosome genes.
- 4 females cluster with males
- 8 males cluster with females

REMBRANDT: Chr. Y Expression – Intelligent Reassignment
[Figure: sex-specific gene expression, with samples grouped as Female and Male]
- Intelligent reassignment: if the previously recorded sex disagrees with the expression-based group, the call is set to NA.
- All unknowns are given a call.
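The reassignment rule is small enough to write out. In this sketch the function name `reassign_sex` and the encoding (`"M"`, `"F"`, and `None` for unknown or NA) are assumptions; the expression-based call is taken as given, since the slide does not say how it was computed.

```python
def reassign_sex(recorded, inferred):
    """Combine the recorded sex with the chromosome Y expression call.

    recorded: "M", "F", or None when the sex is unknown.
    inferred: "M" or "F", from chromosome Y gene expression.
    Returns the final call, or None (NA) when the two disagree.
    """
    if recorded is None:
        return inferred  # unknowns are given the expression-based call
    if recorded != inferred:
        return None      # conflicting evidence: set to NA
    return recorded


print(reassign_sex(None, "M"))  # unknown sample receives a call
print(reassign_sex("F", "M"))   # disagreement becomes NA (None)
print(reassign_sex("M", "M"))   # agreement keeps the recorded sex
```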

Progression of Astrocytic Glioma Furnari, et al. (2007)

Modeling Glioma
[Figure: progression schema with stages scored 0, 1, 2]
- Increasing metastatic potential and severity of glioma could be modeled using this simple schema
- Correlation of the model with survival post diagnosis is -0.68

Exploring Meta-Information
- Age explains 31% of survival post diagnosis
- Age explains 25% of the progression model
- Sex does not have a significant effect on either survival or the progression model
  - Yet it is known that glioblastoma is slightly more common in men than in women

Summary
- Ample dataset with a good amount of meta-information
- Ready for dimensionality reduction and network inference!

Big Picture

Clustering as Dimensionality Reduction

Big Picture

Likely Issues
- Size of eukaryotic genomes
- Added complexity of regulatory regions
- Tissue and cell type heterogeneity
- Patient genetic and environmental heterogeneity

Relative Genome Sizes

Solutions
- Pre-process genomic sequences
- Reduce data complexity by collapsing redundancies
- Use filters that select only the most variant genes
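A variance filter of the kind mentioned in the last point might look like the sketch below. The function name `most_variant_genes`, the use of sample variance as the ranking statistic, and the `top_n` cutoff are assumptions; the slides do not say which statistic or cutoff was used.

```python
from statistics import variance


def most_variant_genes(expr, top_n):
    """Return the `top_n` genes with the highest expression variance.

    expr maps each gene to its expression values across samples
    (at least two values per gene).
    """
    ranked = sorted(expr, key=lambda gene: variance(expr[gene]), reverse=True)
    return ranked[:top_n]


expr = {
    "geneA": [1, 1, 1, 1],    # no variation
    "geneB": [0, 10, 0, 10],  # high variation
    "geneC": [4, 5, 6, 5],    # modest variation
}
print(most_variant_genes(expr, top_n=2))
```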

Eukaryotic Gene Structure

Regulatory Regions
- 3’ UTR: miRNA binding sites (4-9bp motifs)
  - Median 3’ UTR length is 831bp
- Promoter: transcription factor binding sites (6-12bp motifs)
  - No set length for promoters in eukaryotes
  - Grabbing 2Kbp, so we can use 2Kbp or smaller

Three Examples
- After capture, 85% (n = 36,177) of probesets are associated with a sequence

Solution
- Do motif detection on both promoter and 3’ UTR sequences
- Incorporate both of these regulatory regions into the cMonkey bi-cluster scoring matrix

Promoter Sequences
- Looking for transcription factor binding sites (TFBS)
  - Using MEME with 6-12bp motif widths
- Used RefSeq gene mapping to identify putative promoter regions
  - 2Kbp of sequence upstream of the transcriptional start site (TSS) was extracted
- If two RefSeq gene mappings did not overlap, the longest transcript's promoter was taken
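Extracting the upstream region depends on strand, which the following sketch makes explicit. It assumes a 0-based TSS coordinate on a chromosome sequence held as a string; the function names are hypothetical, and the toy example uses a 3bp window in place of 2Kbp so the result is easy to check by eye.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def promoter_sequence(chrom_seq, tss, strand, upstream=2000):
    """Return up to `upstream` bases immediately upstream of the TSS.

    tss is the 0-based index of the first transcribed base. For '-' strand
    genes, upstream lies at higher genomic coordinates, and the sequence is
    reverse-complemented so that it reads 5' to 3' relative to the gene.
    """
    if strand == "+":
        return chrom_seq[max(0, tss - upstream):tss]
    if strand == "-":
        return reverse_complement(chrom_seq[tss + 1:tss + 1 + upstream])
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


chrom = "ACGTACGGTTCA"
print(promoter_sequence(chrom, tss=4, strand="+", upstream=3))
print(promoter_sequence(chrom, tss=4, strand="-", upstream=3))
```

Near a chromosome end the returned sequence is simply shorter than requested, which suits motif finders such as MEME that accept variable-length input.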

3’ UTR Sequences
- Looking for miRNA binding sites
  - miRNAs are ~21-nucleotide RNA molecules that bind to mRNA and alter expression
  - Using MEME with 4-9bp motif widths

Complexity of Mammalian Systems

Cellular Heterogeneity in Tissues

What Makes Cells Different?

Solution
- Filter out genes that are not expressed in each tissue, leaving only those that are expressed
- Enhance the capability of the software to handle missing data

Intelligent Sample Collection
- Genetic and environmental heterogeneity are real-world issues
- Can try to match for certain confounders
- Or stratify analyses based on particular confounders

Running cMonkey
- Running cMonkey on the AEGIR cluster
  - 10 nodes with 8 cores per node
  - 1 node has 24GB RAM
  - 2 others have 16GB RAM
- Completion time depends heavily on the size of the run

Beautiful New Result Interface

Looking at a Cluster

Chris’s Graphics Mods

Original cMonkey Output

Sorted cMonkey Output

Boxplot For All Samples

Boxplot for In Samples

Integrating Phenotypes

What to do when you find a cluster?

Checking Out PSSM #1

Known Motif?

Motif Known?

What do the genes do?

Functional Enrichment?

Functional Enrichment

Genes?

Interesting Cluster

Phenotype Correlations
- Survival
  - Correlation coefficient = -0.48
  - P-value = 3.2 x 10^-11
- Progression Model
  - Correlation coefficient = 0.55
  - P-value = 6.7 x 10^-16
- Age
  - Correlation coefficient = 0.32
  - P-value = 2.2 x 10^-7
- Sex
  - Correlation coefficient = -0.27
  - P-value = 0.0012
- Bonferroni-corrected significance threshold: p-value ≤ 0.05 / (585 * 4) ≈ 2.1 x 10^-5
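The Bonferroni arithmetic on this slide can be checked with a few lines. Reading 585 as the number of clusters and 4 as the number of phenotypes is an assumption based on the formula; the p-values are those reported above.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Assumed interpretation: 585 clusters, each tested against 4 phenotypes.
n_clusters, n_phenotypes = 585, 4
threshold = bonferroni_threshold(0.05, n_clusters * n_phenotypes)

p_values = {
    "Survival": 3.2e-11,
    "Progression model": 6.7e-16,
    "Age": 2.2e-7,
    "Sex": 0.0012,
}
significant = {name: p <= threshold for name, p in p_values.items()}

print(f"threshold = {threshold:.2e}")
for name, passed in significant.items():
    print(f"{name}: {'significant' if passed else 'not significant'}")
```

Under this threshold the survival, progression model, and age correlations pass, while the sex correlation (p = 0.0012) does not.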

Genes from Cluster
AFFY_ID | Gene Symbol | Gene Name
212067_S_AT | C1R | complement component 1, r subcomponent
208747_S_AT | C1S | complement component 1, s subcomponent
201743_AT | CD14 | cd14 antigen
215049_X_AT | CD163 | cd163 antigen
203854_AT | CFI | complement factor i
213060_S_AT | CHI3L2 | chitinase 3-like 2
208146_S_AT | CPVL | carboxypeptidase, vitellogenic-like
201798_S_AT | FER1L3 | fer-1-like 3, myoferlin (c. elegans)
206584_AT | LY96 | lymphocyte antigen 96
202180_S_AT | MVP | major vault protein
204150_AT | STAB1 | stabilin 1
204924_AT | TLR2 | toll-like receptor 2
Highlighted genes = previously known to be differentially expressed in GBM.

Motif Matches PSSM #2 PSSM #1

Summary
- Very promising results
- Need to further develop certain aspects of cMonkey to better utilize the human data
- Then need to build the network inference component

General Questions
- Bi-clustering or not?
- How many genes to run?
- How much sequence to feed MEME?
- Can more than one experiment be included?

Cluster Samples, or Not?
- Bi-clustering clusters not only on genes but also on experimental conditions (samples)
- Because we are using just one experiment, it may not be necessary to cluster samples
- Although it may be useful again once other experiments are included

Bi-clustering or Not?
[Figure panels: Bi-clustering; Gene Clustering Only]

Brief Glance
- For this dataset, it looks like it may make more sense to cluster only genes
  - More clusters with significant motifs
- Although this is likely to change once we add more experiments to the mix
- Need a method to quantify this

Maxing Out cMonkey
- Can cMonkey handle running all genes?
  - Yes, without doing motif finding
  - With motif finding this will take a long time (weeks?), and tends to crash
- Essentially need to balance sequence length for motif finding with cluster size and number of clusters
- Need a method to quantify this

Length for Promoters?
- MEME suggests 1Kbp or less for input sequences
- Tried using 500bp, 1Kbp, 2Kbp, 2.5Kbp, and 5Kbp

Brief Glance
- So far it looks like 500bp gives the most clusters with motifs
- Need a method to quantify this

Breast Cancer Metastasis Bos et al., 2009

cMonkey for Eukaryotes
- Future modifications to cMonkey for eukaryotes:
  - Preprocess sequence data
  - Add 3’ UTR miRNA motif detection
  - Integrate 3’ UTR miRNA motif scores with promoter motif scores

Network Inference
- cMonkey software is used to produce the bi-clusters
- Inferelator can then be used to identify regulatory factors
- Simple correlation with phenotypes can relate bi-clusters to disease

Acknowledgements
- Baliga Lab: Nitin, David, Chris, Dan
- Hood Lab: Burak Kutlu
- Luxembourg Project
- REMBRANDT