Mining Public Data for Insights into Human Disease 11/16/2009 Baliga Lab Meeting Chris Plaisier.


1 Mining Public Data for Insights into Human Disease 11/16/2009 Baliga Lab Meeting Chris Plaisier

2 Utility of Gene Expression for Human Disease

3 Microarray Technology

4 Big Picture

5 Data Access

6 Gene Expression Microarray Repositories
Gene Expression Omnibus (GEO)
- Hosted by: NCBI
- Platform: All accepted
- Normalization: Experiment-by-experiment basis
- Access: R (GEOquery), EUtils
- Meta-Information: GEOmetadb
ArrayExpress
- Hosted by: EMBL
- Platform: All accepted
- Normalization: Experiment-by-experiment basis
- Access: Web interface, EMBL API
- Meta-Information: ? (API)
Many smaller repositories have more phenotypic information for specific diseases
- Phenotypic information may be hard to access

7 Gene Expression Omnibus

8 Samples Per Platform in GEO
[Figure: samples per platform in GEO; labeled platforms are HGU133 Plus 2.0 (latest 3’ Affymetrix array) and HGU133A]
Affymetrix arrays account for ~67% of human gene expression data in public repositories.

9 Affymetrix Probesets
[Figure labels: probe (25 nucleotides); probe pair (perfect match and mismatch); probeset (11 probe pairs); GeneChip U133 Plus 2.0 Array (>54,000 probesets; image stored as CEL file)]

10 Pre-Processing 101

11 Pre-Processing Gene Expression Data

12 Removing Mis-Targeted and Non-Specific Probes
Normally: CEL File + CDF File -> Intensities (CDF file comes from Affymetrix)
Alternative: CEL File + AltCDF File -> Intensities (alternative CDF file thoroughly cleaned)
Zhang, et al. 2005

13 Pre-Processing Gene Expression Data

14 What Makes Cells Different?

15 PANP: Presence/Absence Filtering
Use Negative Strand Matching Probesets (NSMPs) to determine the true background distribution
- NSMPs are designed to hybridize to the opposite strand from the expressed strand
Use the background distribution from these NSMPs to threshold the entire dataset
Output is a call for each gene on each array
- Calls are: P = present, M = marginal, A = absent
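The thresholding idea on this slide can be sketched in a few lines of Python. This is only an illustration of calling P/M/A against an empirical NSMP background, not the PANP package itself; the function name `panp_like_calls` and the two p-value cutoffs (0.01 and 0.02) are assumptions made for the example.

```python
from bisect import bisect_left


def panp_like_calls(intensities, nsmp_intensities, tight=0.01, loose=0.02):
    """Assign P/M/A calls by comparing intensities to an NSMP background.

    The cutoffs `tight` and `loose` are illustrative assumptions.
    """
    background = sorted(nsmp_intensities)
    n = len(background)
    calls = []
    for value in intensities:
        # Empirical p-value: fraction of background values at or above `value`.
        p = (n - bisect_left(background, value)) / n
        if p < tight:
            calls.append("P")   # clearly above background
        elif p <= loose:
            calls.append("M")   # borderline
        else:
            calls.append("A")   # indistinguishable from background
    return calls


# Toy background of 100 NSMP intensities (1..100) and three probesets on one array.
toy_background = list(range(1, 101))
toy_calls = panp_like_calls([150.0, 99.5, 50.0], toy_background)
print(toy_calls)
```

In a real dataset the background would come from the NSMP probesets on each array, so the calls are made array by array.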

16 Identifying Present Genes
Filter out genes that are called absent in ≥ 50% of arrays
- Whole dataset
- Subsets
Only present genes are utilized in future analyses
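A minimal sketch of the ≥ 50% absent filter, assuming the calls are stored as a dictionary of per-array P/M/A lists. The function name and the choice to count marginal (M) calls as not absent are assumptions for illustration.

```python
def filter_present_genes(calls, max_absent_fraction=0.5):
    """Keep genes whose fraction of 'A' calls is below the cutoff.

    Marginal ('M') calls are treated as not absent (an assumption).
    """
    kept = {}
    for gene, gene_calls in calls.items():
        absent_fraction = gene_calls.count("A") / len(gene_calls)
        if absent_fraction < max_absent_fraction:
            kept[gene] = gene_calls
    return kept


example_calls = {
    "geneA": ["P", "P", "M", "A"],  # 25% absent: kept
    "geneB": ["A", "A", "P", "P"],  # 50% absent: removed
    "geneC": ["A", "A", "A", "P"],  # 75% absent: removed
}
present = filter_present_genes(example_calls)
print(sorted(present))
```

The same function could be applied to a subset of arrays (for example one tumor type) by passing only those arrays' calls.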

17 Pre-Processing Gene Expression Data

18 Removing Redundancy

19 Reason for Removing Redundancy Before Running

20 Removing Redundancy
Collapse Affymetrix Probeset IDs to EntrezIDs
Test for correlation between probesets
- If correlation is ≥ 0.8 then combine probesets
- If not then leave them separate
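The collapsing rule can be sketched as follows. The slide does not say how correlated probesets are combined or how genes with more than two probesets are handled, so averaging the profiles and requiring every pair to reach the 0.8 cutoff are assumptions, as are the function names and the toy identifiers.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def collapse_probesets(expression, probeset_to_entrez, min_r=0.8):
    """Merge probesets of the same Entrez ID when all pairs correlate >= min_r."""
    by_gene = {}
    for probeset, entrez in probeset_to_entrez.items():
        by_gene.setdefault(entrez, []).append(probeset)

    collapsed = {}
    for entrez, probesets in by_gene.items():
        profiles = [expression[p] for p in probesets]
        if all(pearson(a, b) >= min_r for a, b in combinations(profiles, 2)):
            # Combine by averaging across probesets (assumed combination rule).
            collapsed[entrez] = [sum(v) / len(v) for v in zip(*profiles)]
        else:
            # Leave discordant probesets as separate rows.
            for p in probesets:
                collapsed[f"{entrez}|{p}"] = expression[p]
    return collapsed


toy_expression = {
    "ps1": [1.0, 2.0, 3.0, 4.0],
    "ps2": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with ps1
    "ps3": [5.0, 1.0, 4.0, 2.0],
    "ps4": [1.0, 5.0, 2.0, 4.0],  # anti-correlated with ps3
}
toy_mapping = {"ps1": "1001", "ps2": "1001", "ps3": "2002", "ps4": "2002"}
toy_collapsed = collapse_probesets(toy_expression, toy_mapping)
print(sorted(toy_collapsed))
```

Keeping discordant probesets separate preserves cases where probesets for one gene measure different transcripts.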

21 Pre-Processing Gene Expression Data

22 Pre-Processing Pipeline = Implemented in R = Implemented in Python

23 Big Picture

24 Glioma: A Deadly Brain Cancer Wikimedia commons

25 Brain Anatomy Wikimedia commons

26 What do they do?

27 Neurophysiology

28 Hierarchy of Nervous Tissue Tumors

29 Glioma
WHO Grade   Tumor Type                          Percentage of CNS Tumors
I           Pilocytic Astrocytoma               9.8%
II          Diffuse or Low-Grade Astrocytoma
III         Anaplastic Astrocytoma
IV          Glioblastoma Multiforme             20.3%
Gliomas account for 40% of all tumors and 78% of malignant tumors. Buckner et al., 2007

30 Glioma Survival http://www.neurooncology.ucla.edu/ 5 years 10 years

31

32 Repository of Molecular Brain Neoplasia Data (REMBRANDT)
REMBRANDT (Madhavan et al., 2009)
- Currently 257 individual specimens
  Glioblastoma multiforme (GBM) = 110
  Astrocytoma = 50
  Oligodendroglioma = 55
  Mixed = 21
  Non-Tumor = 21
- Phenotypes
  Tumor type: GBM, Astrocytoma, etc.
  WHO Grade: 176 individuals
  Age: 253 individuals
  Sex: 250 individuals (partially inferred using Y chromosome genes)
  Survival (days post diagnosis): 169 individuals

33 REMBRANDT: Chromosome Y Expression
[Figure: sex-specific gene expression, samples labeled Female and Male]
Males recorded as female should be more common than the reverse, because females are not expected to express Y chromosome genes.
4 females cluster with males
8 males cluster with females

34 REMBRANDT: Chr. Y Expression – Intelligent Reassignment
[Figure: sex-specific gene expression, samples labeled Female and Male]
Intelligent reassignment: if the previous sex call disagrees with the expression-based group, the call is turned into an NA. All unknowns are given a call.
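The reassignment rule can be written as a small function. This is a sketch under the assumption that each sample has a recorded sex (possibly unknown) and an expression-based call from the Y chromosome clustering; the function name, the sample IDs, and the use of `None` for NA are illustrative.

```python
def reassign_sex(recorded, inferred):
    """Apply the reassignment rule to one sample.

    recorded: "F", "M", or None when the sex was unknown.
    inferred: "F" or "M" from Y chromosome expression clustering.
    """
    if recorded is None:
        return inferred   # unknowns are given the expression-based call
    if recorded != inferred:
        return None       # conflicting calls become NA
    return recorded


toy_recorded = {"s1": "F", "s2": "M", "s3": None, "s4": "M"}
toy_inferred = {"s1": "F", "s2": "F", "s3": "M", "s4": "M"}
toy_final = {s: reassign_sex(toy_recorded[s], toy_inferred[s]) for s in toy_recorded}
print(toy_final)
```

Turning conflicts into NA rather than overwriting them avoids trusting either source when they disagree.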

35 Progression of Astrocytic Glioma Furnari, et al. (2007)

36 Modeling Glioma
Increasing metastatic potential and severity of glioma could be modeled using this simple schema
[Figure: progression schema with levels 0, 1, 2]
Correlation of model to survival post diagnosis is -0.68

37 Exploring Meta-Information
Age explains 31% of survival post diagnosis
Age explains 25% of the progression model
Sex does not have a significant effect on either survival or the progression model
- Yet it is known that glioblastoma is slightly more common in men than in women

38 Summary
Ample dataset with a good amount of meta-information
Ready for dimensionality reduction and network inference!

39 Big Picture

40 Clustering as Dimensionality Reduction

41 Big Picture

42 Likely Issues
- Size of eukaryotic genomes
- Added complexity of regulatory regions
- Tissue and cell type heterogeneity
- Patient genetic and environmental heterogeneity

43 Relative Genome Sizes

44 Solutions
- Pre-process genomic sequences
- Reduce data complexity by collapsing redundancies
- Utilize filters that select for only the most variant genes

45 Likely Issues
- Size of eukaryotic genomes
- Added complexity of regulatory regions
- Tissue and cell type heterogeneity
- Patient genetic and environmental heterogeneity

46 Eukaryotic Gene Structure

47

48

49

50 Regulatory Regions
3’ UTR: miRNA binding sites (4-9bp motifs)
Promoter: transcription factor binding sites (6-12bp motifs)
No set length for promoters in eukaryotes. Grabbing 2Kbp, so we can use 2Kbp or smaller.
Median 3’ UTR length is 831bp

51 Three Examples After Capture 85% (n = 36,177) of probesets are associated with a sequence

52 Solution
- Do motif detection on both promoter and 3’ UTR sequences
- Incorporate both of these regulatory regions into the cMonkey bi-cluster scoring matrix

53 Promoter Sequences
Looking for transcription factor binding sites (TFBS)
- Using MEME with 6-12bp motif widths
Utilized RefSeq gene mapping to identify putative promoter regions
- 2Kbp of sequence upstream of the transcriptional start site (TSS) was grabbed
If two RefSeq gene mappings did not overlap then the longest transcript's promoter was taken
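Grabbing sequence upstream of a TSS has to account for strand, which the following sketch illustrates on a toy sequence. It assumes a 0-based TSS coordinate on an in-memory chromosome string; the function names and the toy sequence are hypothetical and this is not the pipeline used in the talk.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_promoter(chrom_seq, tss, strand, length=2000):
    """Return up to `length` bases upstream of the TSS, read 5' to 3'.

    tss is the 0-based index of the first transcribed base (assumption).
    """
    if strand == "+":
        start = max(0, tss - length)
        return chrom_seq[start:tss]
    # On the minus strand, upstream lies at higher genomic coordinates.
    end = min(len(chrom_seq), tss + 1 + length)
    return reverse_complement(chrom_seq[tss + 1:end])


toy_chrom = "ACGTACGGTTCA"
plus_promoter = upstream_promoter(toy_chrom, tss=6, strand="+", length=4)
minus_promoter = upstream_promoter(toy_chrom, tss=5, strand="-", length=4)
print(plus_promoter, minus_promoter)
```

Shorter promoter windows (for example 500bp or 1Kbp) can be obtained by changing `length`, since the sequence is clipped at the chromosome ends.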

54 3’ UTR Sequences
Looking for miRNA binding sites
- miRNAs are ~21-nucleotide RNA molecules that bind to mRNA and alter expression
- Using MEME with 4-9bp motif widths

55 Likely Issues
- Size of eukaryotic genomes
- Added complexity of regulatory regions
- Tissue and cell type heterogeneity
- Patient genetic and environmental heterogeneity

56 Complexity of Mammalian Systems

57 Cellular Heterogeneity in Tissues

58 What Makes Cells Different?

59 Solution
- Filter out genes that are not expressed for each tissue, leaving only those that are expressed
- Enhance the capability of the software to handle missing data

60 Likely Issues
- Size of eukaryotic genomes
- Added complexity of regulatory regions
- Tissue and cell type heterogeneity
- Patient genetic and environmental heterogeneity

61 Intelligent Sample Collection
- Genetic and environmental heterogeneity are real world issues
- Can try to match for certain confounders
- Or stratify analyses based on particular confounders

62 Running cMonkey
Running cMonkey on AEGIR cluster
- 10 nodes with 8 cores per node
- 1 node has 24GB RAM
- 2 others have 16GB RAM
Completion time depends heavily on the size of the run

63 Beautiful New Result Interface

64 Looking at a Cluster

65 Chris’s Graphics Mods

66 Original cMonkey Output

67 Sorted cMonkey Output

68 Boxplot For All Samples

69 Boxplot for In Samples

70 Integrating Phenotypes

71 What to do when you find a cluster?

72 Checking Out PSSM #1

73 Known Motif?

74 Motif Known?

75 What do the genes do?

76 Functional Enrichment?

77 Functional Enrichment

78 Genes?

79 Interesting Cluster

80 Phenotype Correlations
Survival
- Correlation coefficient = -0.48
- P-value = 3.2 x 10^-11
Progression Model
- Correlation coefficient = 0.55
- P-value = 6.7 x 10^-16
Age
- Correlation coefficient = 0.32
- P-value = 2.2 x 10^-7
Sex
- Correlation coefficient = -0.27
- P-value = 0.0012
Bonferroni-corrected significance threshold: p-value ≤ 0.05 / (585*4) ≈ 2.1 x 10^-5
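The Bonferroni threshold on this slide can be checked with a short calculation. The numbers are taken from the slide; reading 585*4 as the number of clusters times the four phenotypes is an assumption, as are the variable names.

```python
ALPHA = 0.05
N_TESTS = 585 * 4  # from the slide; presumably clusters x phenotypes (assumption)

threshold = ALPHA / N_TESTS

# P-values reported for this cluster on the slide.
p_values = {
    "Survival": 3.2e-11,
    "Progression model": 6.7e-16,
    "Age": 2.2e-7,
    "Sex": 0.0012,
}

significant = [name for name, p in p_values.items() if p <= threshold]
print(f"threshold = {threshold:.2e}")
print("significant after correction:", significant)
```

Under this threshold the correlations with survival, the progression model, and age remain significant, while the correlation with sex does not.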

81 Genes from Cluster
AFFY_ID       Gene Symbol   Gene Name
212067_S_AT   C1R           complement component 1, r subcomponent
208747_S_AT   C1S           complement component 1, s subcomponent
201743_AT     CD14          cd14 antigen
215049_X_AT   CD163         cd163 antigen
203854_AT     CFI           complement factor i
213060_S_AT   CHI3L2        chitinase 3-like 2
208146_S_AT   CPVL          carboxypeptidase, vitellogenic-like
201798_S_AT   FER1L3        fer-1-like 3, myoferlin (c. elegans)
206584_AT     LY96          lymphocyte antigen 96
202180_S_AT   MVP           major vault protein
204150_AT     STAB1         stabilin 1
204924_AT     TLR2          toll-like receptor 2
[Highlight marker] = Previously known to be differentially expressed in GBM.

82 Motif Matches PSSM #2 PSSM #1

83 Summary
- Very promising results
- Need to further develop certain aspects of cMonkey to better utilize the human data
- Then need to build network inference component

84 General Questions
- Biclustering or not?
- How many genes to run?
- How much sequence to feed MEME?
- Can more than one experiment be included?

85 Cluster Samples, or Not?
- Bi-clustering clusters not only on genes but also by experimental conditions (samples)
- Because we are using just one experiment it may not be necessary to cluster samples
- Although it may be useful again once other experiments are included

86 Bi-clustering or Not? Bi-clustering Gene Clustering Only

87 Brief Glance
Looks like for this dataset it may make more sense to only cluster genes
- More clusters with significant motifs
Although this is likely to change once we add more experiments to the mix
Need a method to quantify this

88 General Questions
- Biclustering or not?
- How many genes to run?
- How much sequence to feed MEME?
- Can more than one experiment be included?

89 Maxing Out cMonkey
Can cMonkey handle running all genes?
- Yes, without doing motif finding
- With motif finding this will take a long time (weeks?), and tends to crash
Essentially need to balance sequence length for motif finding with cluster size and number of clusters
Need a method to quantify this

90 General Questions
- Biclustering or not?
- How many genes to run?
- How much sequence to feed MEME?
- Can more than one experiment be included?

91 Length for Promoters?
- MEME suggests 1Kbp or less for sequences as input
- Tried using 500bp, 1Kbp, 2Kbp, 2.5Kbp, and 5Kbp

92 Brief Glance
- So far it looks like the 500bp sequences give the most clusters with motifs
- Need a method to quantify this

93 General Questions
- Biclustering or not?
- How many genes to run?
- How much sequence to feed MEME?
- Can more than one experiment be included?

94 Breast Cancer Metastasis Bos et al., 2009

95 cMonkey for Eukaryotes
Future modifications to cMonkey for eukaryotes:
- Preprocess sequence data
- Add 3’ UTR miRNA motif detection
- Integrate 3’ UTR miRNA motif scores with promoter motif scores

96 Network Inference
- cMonkey software is utilized to produce the bi-clusters
- Inferelator can then be used to identify regulatory factors
- Simple correlation with phenotypes can relate bi-clusters to disease

97 Acknowledgements
Baliga Lab: Nitin, David, Chris, Dan
Hood Lab: Burak Kutlu
Luxembourg Project
REMBRANDT

