Published by Alexina Cummings. Modified over 8 years ago.
1
Mining Public Data for Insights into Human Disease 11/16/2009 Baliga Lab Meeting Chris Plaisier
2
Utility of Gene Expression for Human Disease
3
Microarray Technology
4
Big Picture
5
Data Access
6
Gene Expression Microarray Repositories
Gene Expression Omnibus (GEO): Hosted by: NCBI; Platform: All accepted; Normalization: Experiment-by-experiment basis; Access: R (GEOquery), EUtils; Meta-Information: GEOmetadb
ArrayExpress: Hosted by: EMBL; Platform: All accepted; Normalization: Experiment-by-experiment basis; Access: Web interface, EMBL API; Meta-Information: ? (API)
Many smaller repositories have more phenotypic information for specific diseases
Phenotypic information may be hard to access
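As a rough illustration of the EUtils access route to GEO, the sketch below only builds an ESearch query URL and does not contact NCBI. It assumes the GEO DataSets database name `gds`, the `[ACCN]` and `[ETYP]` field tags, and GPL570 as the platform accession for the HG-U133 Plus 2.0 array. The function name `build_geo_search_url` is hypothetical and not part of the original pipeline.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(platform_accession, max_results=20):
    """Build an ESearch URL listing GEO series (GSE) run on one platform.

    Hypothetical helper for illustration; it only constructs the URL.
    """
    params = {
        "db": "gds",  # GEO DataSets database
        "term": f"{platform_accession}[ACCN] AND gse[ETYP]",
        "retmax": max_results,
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# GPL570 is assumed here to be the HG-U133 Plus 2.0 platform accession.
url = build_geo_search_url("GPL570")
print(url)
```

The returned URL could then be fetched with any HTTP client, and the resulting IDs passed to ESummary to retrieve series-level meta-information.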
7
Gene Expression Omnibus
8
Samples Per Platform in GEO HGU133 Plus 2.0 HGU133A Latest 3’ Affymetrix Array Affymetrix arrays account for ~67% of human gene expression data in public repositories.
9
Affymetrix Probesets: Probe; Probe Pair (Perfect Match, Mismatch); Probeset (11 Probe Pairs). Probes are 25 nucleotides. GeneChip U133 Plus 2.0 Array: >54,000 Probesets (image stored as CEL file).
10
Pre-Processing 101
11
Pre-Processing Gene Expression Data
12
Removing Mis-Targeted and Non-Specific Probes. Normally: CEL File + CDF File → Intensities (CDF file comes from Affymetrix). Zhang, et al. 2005. Alternative: CEL File + AltCDF File → Intensities (alternative CDF file, thoroughly cleaned).
13
Pre-Processing Gene Expression Data
14
What Makes Cells Different?
15
PANP: Presence/Absence Filtering. Use Negative Strand Matching Probesets (NSMPs) to determine the true background distribution. NSMP probesets are designed to hybridize to the opposite strand from the expressed strand. Use the background distribution from these NSMPs to threshold the entire dataset. Output is a call for each gene on each array. Calls are: P = present, M = marginal, A = absent.
16
Identifying Present Genes Filter out genes ≥ 50% absent Whole dataset Subsets Only present genes are utilized in future analyses
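The presence filter above can be sketched as follows. This is a minimal illustration, assuming the PANP calls are already available as one list of 'P'/'M'/'A' values per gene and that marginal calls count as not absent (the slides do not say how 'M' is treated). The function name and the example data are hypothetical.

```python
def filter_present_genes(calls, max_absent_fraction=0.5):
    """Return genes whose fraction of 'A' (absent) calls is below the cutoff.

    calls: dict mapping gene ID -> list of 'P'/'M'/'A' calls, one per array.
    Assumption: 'M' (marginal) is treated as not absent.
    """
    kept = []
    for gene, gene_calls in calls.items():
        absent_fraction = gene_calls.count("A") / len(gene_calls)
        # Genes that are >= 50% absent are filtered out.
        if absent_fraction < max_absent_fraction:
            kept.append(gene)
    return kept


# Made-up calls for three genes across four arrays.
example_calls = {
    "gene_a": ["P", "P", "M", "A"],  # 25% absent, kept
    "gene_b": ["A", "A", "P", "P"],  # 50% absent, filtered out
    "gene_c": ["A", "A", "A", "P"],  # 75% absent, filtered out
}
present = filter_present_genes(example_calls)
print(present)
```

The same function can be applied to the whole dataset or to a subset of arrays by passing only the calls for that subset.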
17
Pre-Processing Gene Expression Data
18
Removing Redundancy
19
Reason for Removing Redundancy Before Running
20
Removing Redundancy Collapse Affymetrix Probeset IDs to EntrezIDs Test for correlation between probesets If correlation is ≥ 0.8 then combine probesets If not then leave them separate
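A minimal sketch of the collapsing rule above. The slide does not say how correlated probesets are combined or how genes with more than two probesets are handled, so this version makes two assumptions: probesets are combined by averaging, and a gene is collapsed only when every pair of its probesets has a Pearson correlation of at least 0.8. All names and data are illustrative.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def collapse_probesets(expression, probeset_to_entrez, min_r=0.8):
    """Collapse probesets to Entrez IDs when their profiles agree.

    Assumption: correlated probesets are averaged; otherwise each probeset
    is kept separately under the key "<entrez>|<probeset>".
    """
    by_gene = {}
    for probeset, entrez in probeset_to_entrez.items():
        by_gene.setdefault(entrez, []).append(probeset)

    collapsed = {}
    for entrez, probesets in by_gene.items():
        profiles = [expression[p] for p in probesets]
        correlated = all(
            pearson(a, b) >= min_r for a, b in combinations(profiles, 2)
        )
        if correlated:
            # Average the probesets sample by sample.
            collapsed[entrez] = [sum(v) / len(v) for v in zip(*profiles)]
        else:
            for p in probesets:
                collapsed[f"{entrez}|{p}"] = expression[p]
    return collapsed


# Made-up data: ps1/ps2 are perfectly correlated, ps3/ps4 are not.
expression = {
    "ps1": [1.0, 2.0, 3.0, 4.0],
    "ps2": [2.0, 4.0, 6.0, 8.0],
    "ps3": [1.0, 3.0, 2.0, 5.0],
    "ps4": [5.0, 2.0, 4.0, 1.0],
}
mapping = {"ps1": "100", "ps2": "100", "ps3": "200", "ps4": "200"}
collapsed = collapse_probesets(expression, mapping)
print(sorted(collapsed))
```

Constant profiles would need special handling, since their correlation is undefined.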
21
Pre-Processing Gene Expression Data
22
Pre-Processing Pipeline = Implemented in R = Implemented in Python
23
Big Picture
24
Glioma: A Deadly Brain Cancer Wikimedia commons
25
Brain Anatomy Wikimedia commons
26
What do they do?
27
Neurophysiology
28
Hierarchy of Nervous Tissue Tumors
29
Glioma. WHO Grade, Tumor Type, Percentage of CNS Tumors: I Pilocytic Astrocytoma 9.8%; II Diffuse or Low-Grade Astrocytoma; III Anaplastic Astrocytoma; IV Glioblastoma Multiforme 20.3%. Gliomas account for 40% of all tumors and 78% of malignant tumors. Buckner et al., 2007
30
Glioma Survival http://www.neurooncology.ucla.edu/ 5 years 10 years
32
Repository of Molecular Brain Neoplasia Data (REMBRANDT) REMBRANDT (Madhavan et al., 2009) Currently 257 individual specimens Glioblastoma multiforme (GBM) = 110 Astrocytoma = 50 Oligodendroglioma = 55 Mixed = 21 Non-Tumor = 21 Phenotypes Tumor type: GBM, Astrocytoma, etc. WHO Grade: 176 individuals Age: 253 individuals Sex: 250 individuals (partially inferred using Y chromosome genes) Survival (days post diagnosis): 169 individuals
33
REMBRANDT: Chromosome Y Expression. Sex-specific gene expression (Female, Male). Conversions of male to female should be more common than the reverse, because it is difficult for females to express the Y chromosome. 4 females cluster with males; 8 males cluster with females.
34
REMBRANDT: Chr. Y Expression – Intelligent Reassignment. Sex-specific gene expression (Female, Male). Intelligent Reassignment: if the previous call of sex is for the other group, then the call is turned into an NA. All unknowns are given a call.
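The reassignment rule can be written as a small function. This sketch assumes each specimen has a recorded sex (possibly missing) and an expression-based call obtained from the chromosome Y clustering. Both inputs, and the use of `None` for NA, are illustrative choices rather than details of the original analysis.

```python
def reassign_sex(recorded, expression_based):
    """Apply the reassignment rule to one specimen.

    recorded: 'F', 'M', or None when the sex is unknown.
    expression_based: 'F' or 'M', the group the specimen clusters with.
    Returns 'F', 'M', or None (NA).
    """
    if recorded is None:
        # Unknowns are given the expression-based call.
        return expression_based
    if recorded != expression_based:
        # Recorded sex conflicts with the expression cluster: set to NA.
        return None
    return recorded


# Made-up specimens: (recorded sex, expression-based call).
specimens = [("F", "F"), ("M", "F"), (None, "M"), ("F", "M")]
final_calls = [reassign_sex(rec, expr) for rec, expr in specimens]
print(final_calls)
```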
35
Progression of Astrocytic Glioma Furnari, et al. (2007)
36
Modeling Glioma. Increasing metastatic potential and severity of glioma could be modeled using this simple schema (model values: 0, 1, 2). Correlation of model to survival post diagnosis is -0.68.
37
Exploring Meta-Information Age explains 31% of survival post diagnosis Age explains 25% of the progression model Sex does not have a significant effect on either survival or the progression model Yet it is known that glioblastoma is slightly more common in men than in women
38
Summary Very ample dataset with good amount of meta-information Ready for dimensionality reduction and network inference!
39
Big Picture
40
Clustering as Dimensionality Reduction
41
Big Picture
42
Likely Issues Size of eukaryotic genomes Added complexity of regulatory regions Tissue and cell type heterogeneity Patient genetic and environmental heterogeneity
43
Relative Genome Sizes
44
Solutions Pre-process genomic sequences Reduce data complexity by collapsing redundancies Utilize filters that select for only the most variant genes
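One simple way to select the most variant genes is to rank genes by their sample variance and keep the top N. The slides do not specify the variability measure or the cutoff, so both are assumptions in this sketch, and the data are made up.

```python
from statistics import variance


def most_variant_genes(expression, top_n):
    """Return the top_n gene IDs ranked by sample variance (highest first).

    expression: dict mapping gene ID -> list of expression values.
    """
    ranked = sorted(
        expression,
        key=lambda gene: variance(expression[gene]),
        reverse=True,
    )
    return ranked[:top_n]


# Made-up expression profiles for three genes across four samples.
example_expression = {
    "g1": [1.0, 1.0, 1.0, 1.1],  # nearly constant
    "g2": [1.0, 5.0, 2.0, 8.0],  # highly variable
    "g3": [3.0, 4.0, 3.0, 4.0],  # moderately variable
}
top_genes = most_variant_genes(example_expression, top_n=2)
print(top_genes)
```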
45
Likely Issues Size of eukaryotic genomes Added complexity of regulatory regions Tissue and cell type heterogeneity Patient genetic and environmental heterogeneity
46
Eukaryotic Gene Structure
50
Regulatory Regions 3’ UTR miRNA binding sites (4-9bp motifs) Promoter Transcription Factor Binding Sites (6-12bp motifs) No set length for promoters in eukaryotes. Grabbing 2Kbp, so we can use 2Kbp or smaller. Median 3’ UTR length is 831bp
51
Three Examples After Capture 85% (n = 36,177) of probesets are associated with a sequence
52
Solution Do motif detection on both promoter and 3’ UTR sequences Incorporate both of these regulatory regions into the cMonkey bi-cluster scoring matrix
53
Promoter Sequences. Looking for transcription factor binding sites (TFBS). Using MEME with 6-12bp motif widths. Utilized RefSeq gene mapping to identify putative promoter regions. 2Kbp of sequence upstream of the transcriptional start site (TSS) was grabbed. If two RefSeq gene mappings did not overlap then the longest transcript's promoter was taken.
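Grabbing sequence upstream of a TSS has to respect the strand of the gene. The sketch below computes the promoter interval only, assuming 1-based inclusive coordinates and a TSS position taken from a gene mapping such as RefSeq. The coordinate convention and the function name are assumptions, not details from the slides.

```python
def promoter_interval(tss, strand, upstream=2000):
    """Return (start, end) of the region upstream of the TSS.

    Coordinates are 1-based and inclusive (an assumed convention).
    On the '+' strand upstream means smaller coordinates; on the '-'
    strand it means larger coordinates. Clipping at the chromosome
    end on the '-' strand is left to the caller.
    """
    if strand == "+":
        start = max(1, tss - upstream)  # do not run off the chromosome start
        end = tss - 1
    elif strand == "-":
        start = tss + 1
        end = tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return start, end


plus_promoter = promoter_interval(10000, "+")
minus_promoter = promoter_interval(10000, "-")
print(plus_promoter, minus_promoter)
```

For a minus-strand gene the extracted sequence would also need to be reverse-complemented before motif detection, so that all promoters read in the same orientation relative to the gene.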
54
3’ UTR Sequences. Looking for miRNA binding sites. miRNA are 21bp RNA molecules that bind to mRNA and alter expression. Using MEME with 4-9bp motif widths.
55
Likely Issues Size of eukaryotic genomes Added complexity of regulatory regions Tissue and cell type heterogeneity Patient genetic and environmental heterogeneity
56
Complexity of Mammalian Systems
57
Cellular Heterogeneity in Tissues
58
What Makes Cells Different?
59
Solution. Filter out genes that are not expressed for each tissue, leaving only those that are expressed. Enhance the capability of the software to handle missing data.
60
Likely Issues Size of eukaryotic genomes Added complexity of regulatory regions Tissue and cell type heterogeneity Patient genetic and environmental heterogeneity
61
Intelligent Sample Collection Genetic and environmental heterogeneity are real world issues Can try to match for certain confounders Or stratify analyses based on particular confounders
62
Running cMonkey Running cMonkey on AEGIR cluster 10 nodes with 8 cores per node 1 node has 24GB ram 2 others have 16GB ram Completion time depending heavily on the size of the run
63
Beautiful New Result Interface
64
Looking at a Cluster
65
Chris’s Graphics Mods
66
Original cMonkey Output
67
Sorted cMonkey Output
68
Boxplot For All Samples
69
Boxplot for In Samples
70
Integrating Phenotypes
71
What to do when you find a cluster?
72
Checking Out PSSM #1
73
Known Motif?
74
Motif Known?
75
What do the genes do?
76
Functional Enrichment?
77
Functional Enrichment
78
Genes?
79
Interesting Cluster
80
Phenotype Correlations
Survival: Correlation coefficient = -0.48, P-value = 3.2 x 10^-11
Progression Model: Correlation coefficient = 0.55, P-value = 6.7 x 10^-16
Age: Correlation coefficient = 0.32, P-value = 2.2 x 10^-7
Sex: Correlation coefficient = -0.27, P-value = 0.0012
Bonferroni-corrected significant p-value ≤ 0.05 / (585 × 4) ≈ 2.1 x 10^-5
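The Bonferroni threshold and the resulting significance calls can be checked with a few lines. The sketch reads the 585 in the slide as the number of clusters tested and the 4 as the number of phenotypes, which is an interpretation of the formula rather than something the slide states. The p-values are the ones reported for this cluster.

```python
alpha = 0.05
n_clusters = 585   # assumed meaning of the 585 in the slide's formula
n_phenotypes = 4   # survival, progression model, age, sex

# Bonferroni correction: divide alpha by the total number of tests.
threshold = alpha / (n_clusters * n_phenotypes)

# P-values reported for this cluster.
p_values = {
    "Survival": 3.2e-11,
    "Progression model": 6.7e-16,
    "Age": 2.2e-7,
    "Sex": 0.0012,
}
significant = {name: p <= threshold for name, p in p_values.items()}

print(f"threshold = {threshold:.2e}")
for name, is_sig in significant.items():
    print(name, "significant" if is_sig else "not significant")
```

Under this threshold the survival, progression model, and age correlations pass, while the sex correlation (p = 0.0012) does not.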
81
Genes from Cluster
AFFY_ID        Gene Symbol   Gene Name
212067_S_AT    C1R           complement component 1, r subcomponent
208747_S_AT    C1S           complement component 1, s subcomponent
201743_AT      CD14          cd14 antigen
215049_X_AT    CD163         cd163 antigen
203854_AT      CFI           complement factor i
213060_S_AT    CHI3L2        chitinase 3-like 2
208146_S_AT    CPVL          carboxypeptidase, vitellogenic-like
201798_S_AT    FER1L3        fer-1-like 3, myoferlin (c. elegans)
206584_AT      LY96          lymphocyte antigen 96
202180_S_AT    MVP           major vault protein
204150_AT      STAB1         stabilin 1
204924_AT      TLR2          toll-like receptor 2
= Previously known to be differentially expressed in GBM.
82
Motif Matches PSSM #2 PSSM #1
83
Summary Very promising results Need to further develop certain aspects of cMonkey to better utilize the human data Then need to build network inference component
84
General Questions Biclustering or not? How many genes to run? How much sequence to feed MEME? Can more than one experiment be included?
85
Cluster Samples, or Not? Bi-clustering clusters not only on genes but also by experimental conditions (samples) Because we are using just one experiment it may not be necessary to cluster samples Although it may be useful again once other experiments are included
86
Bi-clustering or Not? Bi-clustering Gene Clustering Only
87
Brief Glance Looks like for this dataset it may make more sense to only cluster genes More clusters with significant motifs Although this is likely to change once we add more experiments to the mix Need a method to quantify this
88
General Questions Biclustering or not? How many genes to run? How much sequence to feed MEME? Can more than one experiment be included?
89
Maxing Out cMonkey. Can cMonkey handle running all genes? Yes, without doing motif finding. With motif finding this will take a long time (weeks?), and tends to crash out. Essentially need to balance sequence length for motif finding with cluster size and number of clusters. Need a method to quantify this.
90
General Questions Biclustering or not? How many genes to run? How much sequence to feed MEME? Can more than one experiment be included?
91
Length for Promoters? MEME suggests 1Kbp or less for sequences as input Tried using 500bp, 1Kbp, 2Kbp, 2.5Kbp, and 5Kbp
92
Brief Glance So far looks like the 500bp give the most clusters with motifs Need a method to quantify this
93
General Questions Biclustering or not? How many genes to run? How much sequence to feed MEME? Can more than one experiment be included?
94
Breast Cancer Metastasis Bos et al., 2009
95
cMonkey for Eukaryotes Future Modifications to cMonkey for eukaryotes: Preprocess sequence data Add 3’ UTR miRNA motif detection Integrate 3’ UTR miRNA motif scores with promoter motif scores
96
Network Inference. cMonkey software is utilized to produce the bi-clusters. Inferelator can then be used to identify regulatory factors. Simple correlation with phenotypes can relate bi-clusters to disease.
97
Acknowledgements Baliga Lab Nitin David Chris Dan Hood Lab Burak Kutlu Luxembourg Project REMBRANDT