University at Buffalo, The State University of New York. Bioinformatics: Gene Expression Data Analysis. Aidong Zhang, Professor, Computer Science and Engineering.


1 Bioinformatics: Gene Expression Data Analysis
Aidong Zhang, Professor, Computer Science and Engineering, University at Buffalo, The State University of New York
05.12.03

2 What is Bioinformatics
Broad definition
- The study of how information technologies are used to solve problems in biology
Narrow definition
- The creation and management of biological databases in support of genomic sequences
Oxford English Dictionary (proposed)
- Conceptualizing biology in terms of molecules and applying information techniques to understand and organize the information associated with these molecules, on a large scale

3 Aims of Bioinformatics
Simplest
- Organize data in a way that allows researchers to access information and submit new entries as they are produced
Higher
- Develop tools and resources that aid in the analysis of data
Advanced
- Use these tools to analyze the data and interpret the results in a biologically meaningful manner

4 Subjects of Bioinformatics
Data source | Data size | Topics
Raw DNA sequence | 8.2 million sequences (9.5 billion bases) | Separating regions; gene product prediction
Protein sequence | 300,000 sequences (~300 amino acids each) | Sequence comparison, alignments, identification
Macromolecular structure | 13,000 structures (~1,000 atomic coordinates each) | Structure prediction, 3D alignment; protein geometry measurements
Genomes | 40 complete genomes (1.6 million – 3 billion bases each) | Molecular simulations; phylogenetic analysis; genomic-scale censuses; linkage analysis
Gene expression | ~20 time point measurements for ~6,000 genes | Clustering, correlating patterns, mapping data to sequence, structural and biochemical data
Literature | 11 million citations | Digital libraries; knowledge databases
Metabolic pathways | | Pathway simulations

5 Figure taken from http://www.oml.gov/hgmis

6 DNA Microarray Experiments
http://www.ipam.ucla.edu/programs/fg2000/fgt_speed7.ppt

7 Gene Expression Data: Gene Expression Data Matrix
- Each row represents a gene G_i.
- Each column represents an experimental condition S_j.
- Each cell X_ij is a real value representing the expression level of gene G_i under condition S_j.
- X_ij > 0: over-expressed; X_ij < 0: under-expressed.
- A time-series gene expression data matrix typically contains O(10^3) genes and O(10) time points.
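As a small illustration of the matrix described on this slide, the sketch below builds a toy matrix and labels each cell by its sign. The gene names, condition names, and values are made up, and the "unchanged" label for a value of exactly zero is an assumption, since the slide only defines the positive and negative cases.

```python
# Hypothetical toy data: rows are genes G_i, columns are conditions S_j.
genes = ["G1", "G2", "G3"]
conditions = ["S1", "S2", "S3", "S4"]
X = [
    [1.2, 0.8, -0.1, -0.9],
    [-0.5, -1.1, 0.3, 0.7],
    [0.0, 0.2, 0.1, -0.2],
]

def expression_state(value):
    """Label one cell X_ij using the sign convention from the slide."""
    if value > 0:
        return "over"
    if value < 0:
        return "under"
    return "unchanged"  # assumption: the slide does not define the zero case

def label_matrix(matrix):
    """Apply the sign convention to every cell of the matrix."""
    return [[expression_state(v) for v in row] for row in matrix]

labels = label_matrix(X)
for gene, row in zip(genes, labels):
    print(gene, row)
```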

8 Gene Expression Data
Matrix layout (rows: genes; columns: samples):
sample 1 | sample 2 | sample 3
X_11 | X_12 | X_13
X_21 | X_22 | X_23
X_31 | X_32 | X_33
- Asymmetric dimensionality: 10 ~ 100 samples/conditions versus 1000 ~ 10000 genes
- Two-way analysis: sample space and gene space

9 Microarray Data Analysis
Analysis from two angles:
- Sample as object, gene as attribute
- Gene as object, sample/condition as attribute

10 Challenges of Gene Data Analysis (1)
Gene space: automatically identify clusters of genes that express similar patterns in the data set
- Robust to a large amount of noise
- Effective at handling highly intersecting clusters
- Able to visualize the clustering results

11 Co-expressed Genes
[Figure: gene expression data matrix, gene expression patterns, co-expressed genes]
Why look for co-expressed genes?
- Co-expression indicates co-function.
- Co-expression also indicates co-regulation.

12 Challenges of Gene Data Analysis (2)
Sample space: unsupervised sample clustering presents interesting but also very challenging problems
- The sample space and gene space are of very different dimensionality (10^1 ~ 10^2 samples versus 10^3 ~ 10^4 genes).
- High percentage of irrelevant or redundant genes.
- People usually have little knowledge about how to construct an informative gene space.

13 Sample Clustering
Gene expression data clustering

14 Microarray Data Analysis
[Figure: workflow diagram with the labels Microarray Images, Microarray Data, Gene Expression Matrices, Gene Expression Data Analysis, Gene Expression Patterns, Sample Clusters, Visualization, and Important patterns]

15 Our Approaches
Density-based approach: recognizes a dense area as a cluster, and organizes the cluster structure of a data set into a hierarchical tree.
- Calculate the density of each data object based on its neighboring data distribution.
- Construct the "attraction" relationship between data objects according to object density.
- Organize the attraction relationship into the "attraction tree".
- Summarize the attraction tree by a hierarchical "density tree".
- Derive clusters from the density tree.
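The steps above can be sketched roughly as follows. This is a heavily simplified illustration, not the authors' algorithm: the density definition (neighbor count within a radius), the attraction rule (nearest denser neighbor, ties broken by index), and the function names are all assumptions, and the hierarchical density tree step is skipped, so clusters are read directly off the roots of the attraction tree.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def densities(points, radius):
    """Assumed density: number of other points within `radius` of each point."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and euclidean(p, q) <= radius)
        for i, p in enumerate(points)
    ]

def attraction_tree(points, radius):
    """Map each point to its attractor (nearest denser neighbor), or None for a root."""
    dens = densities(points, radius)
    parent = []
    for i, p in enumerate(points):
        candidates = [
            (euclidean(p, q), j)
            for j, q in enumerate(points)
            if j != i
            and euclidean(p, q) <= radius
            # Strict ordering (density, then lower index) keeps the tree acyclic.
            and (dens[j], -j) > (dens[i], -i)
        ]
        parent.append(min(candidates)[1] if candidates else None)
    return parent

def derive_clusters(parent):
    """Group points by the root they reach when following attractors."""
    def root(i):
        while parent[i] is not None:
            i = parent[i]
        return i
    groups = {}
    for i in range(len(parent)):
        groups.setdefault(root(i), []).append(i)
    return sorted(groups.values())

# Hypothetical 2-D points forming two well-separated dense areas.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
parent = attraction_tree(points, radius=2.0)
clusters = derive_clusters(parent)
print(clusters)
```

Each dense area ends up under a single root, which is the sense in which a dense area is recognized as a cluster.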

16 Our Approaches (2)
Interrelated dimensional clustering automatically performs two tasks:
- Detection of meaningful sample patterns
- Selection of the genes that are significant for the empirical pattern

17 Our Approaches (3)
Visualization tool: offers insightful information and detects the structure of the data set
Three aspects
- Explorative
- Confirmative
- Representative
Microarray analysis status
- Numerical methods are dominant
- Visualization serves as graphical presentation of the major clustering methods
- Visualization applied
  - Global visualization (TreeView)
  - Sammon's mapping
[Figure: TreeView]

18 VizStruct Architecture
- Explorative visualization: sample space
- Confirmative visualization: gene space

19 VizStruct - Dimension Tour
- Interactively adjust dimension parameters
- Manually or automatically
- May cause false clusters to break
- Create dynamic visualization

20 Visualized Results for a Time Series Data Set

21 Elements of Clustering
- Feature selection: properly select the features on which clustering is to be performed.
- Clustering algorithm:
  - Criteria (e.g. objective function)
  - Proximity measure (e.g. Euclidean distance, Pearson correlation coefficient)
- Cluster validation: the assessment of clustering results.
- Interpretation of the results.
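The two proximity measures named on this slide behave quite differently on expression profiles. The sketch below implements both with the standard formulas; the profile values are made up for illustration.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)  # undefined for a constant profile

# Hypothetical profiles: same shape, different magnitude.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]

distance = euclidean_distance(gene_a, gene_b)
correlation = pearson_correlation(gene_a, gene_b)
print(distance, correlation)
```

Here the two profiles are far apart in Euclidean terms yet essentially perfectly correlated, which is why correlation is often preferred when the shape of the pattern matters more than its magnitude.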

22 Supervised Analysis
- Select training samples (hold out…)
- Sort genes (t-test, ranking…)
- Select informative genes (top 50 ~ 200)
- Clustering or classification based on informative genes
[Figure: genes g_1, g_2, …, g_4131, g_4132 shown as on/off (1/0) patterns across Class 1 and Class 2]
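The "sort genes, then select the top ones" steps can be sketched as below. This is a minimal illustration that assumes Welch's t statistic as the ranking score and two classes labeled 1 and 2; the slide only says "t-test, ranking…", so the exact statistic, the function names, and the toy values are assumptions. The hold-out step is omitted.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two groups (each needs >= 2 values and nonzero spread)."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def rank_genes(matrix, labels, top_k):
    """Return the indices of the top_k genes with the largest |t| between the classes."""
    scores = []
    for gene_index, row in enumerate(matrix):
        class1 = [v for v, c in zip(row, labels) if c == 1]
        class2 = [v for v, c in zip(row, labels) if c == 2]
        scores.append((abs(welch_t(class1, class2)), gene_index))
    scores.sort(reverse=True)
    return [gene_index for _, gene_index in scores[:top_k]]

# Hypothetical data: 3 genes x 6 samples, first three samples in class 1.
matrix = [
    [1.0, 1.1, 0.9, 1.0, 1.05, 0.95],  # similar in both classes
    [2.0, 2.1, 1.9, 0.1, 0.0, 0.2],    # clearly different between classes
    [0.5, 0.7, 0.6, 0.55, 0.65, 0.6],  # similar in both classes
]
labels = [1, 1, 1, 2, 2, 2]

informative = rank_genes(matrix, labels, top_k=1)
print(informative)
```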

23 Unsupervised Analysis
Microarray data analysis methods can be divided into two categories: supervised and unsupervised analysis. We will focus on unsupervised sample classification, which assumes that no membership information is assigned to any sample.
- Since the initial biological identification of sample classes has been slow, typically evolving through years of hypothesis-driven research, automatically discovering sample patterns is a significant contribution to microarray data analysis.
- Unsupervised sample classification is much more complex than the supervised case. Many mature statistical methods such as the t-test, Z-score, and Markov filter cannot be applied unless the phenotypes of the samples are known in advance.

24 Problem Statement
- Given: a data matrix M in which the number of samples and the number of genes differ by orders of magnitude (|G| >> |S|), and the number of sample categories K.
- Goal: find K mutually exclusive groups of samples matching their empirical types, thereby discovering their meaningful pattern, and find the set of genes that manifests that pattern.

25 Problem Statement
[Figure: gene 1 through gene 8 across samples 1 to 7, with the genes divided into informative and non-informative genes]

26 Problem Statement (2)
[Figure: gene 1 through gene 7 across samples 1 to 10, with the genes divided into informative and non-informative genes]

27 Problem Statement (3)
[Figure: gene a through gene f across Class 1, Class 2, and Class 3]

28 Related Work
- New tools using traditional methods:
  - Tools: TreeView, CLUTO, CIT, CNIO, GeneSpring, J-Express, CLUSFAVOR
  - Methods: SOM, K-means, hierarchical clustering, graph-based clustering, PCA
- Their similarity measures, computed over the full gene space, are distorted by the high percentage of noise.

29 Related Work (2)
Clustering with feature selection (CLIFF, leaf ordering, two-way ordering):
1. Filtering the invariant genes
   - Bayes model
   - Rank variance
   - PCA
2. Partitioning the samples
   - Ncut
   - Min-Max Cut
3. Pruning genes based on the partition
   - Markov blanket filter
   - T-test
   - Leaf ordering

30 Related Work (3)
Subspace clustering:
- Bi-clustering
- δ-clustering

31 Intra-pattern-steadiness
- Variance of a single gene: [formula not recovered]
- Average row variance: [formula not recovered]
We require each gene to show either all "on" or all "off" within each sample class.
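The formulas on this slide did not survive in the transcript. The sketch below is one plausible reading, stated as an assumption: the variance of a single gene is taken as the sample variance (n - 1 denominator) of that gene's values within one sample class, and the average row variance is the mean of those variances over a chosen gene set. Function names and data are hypothetical.

```python
def gene_variance(values):
    """Sample variance of one gene's values within a sample class (assumed n - 1 form)."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)

def average_row_variance(matrix, sample_idx, gene_idx):
    """Mean within-class variance over the selected genes (rows)."""
    total = sum(
        gene_variance([matrix[g][s] for s in sample_idx]) for g in gene_idx
    )
    return total / len(gene_idx)

# Hypothetical 2 genes x 6 samples; samples 0-2 form one class.
matrix = [
    [1, 1, 1, 5, 5, 5],  # steady ("all on" or "all off") within each class
    [0, 2, 4, 0, 2, 4],  # fluctuates within each class
]
arv = average_row_variance(matrix, sample_idx=[0, 1, 2], gene_idx=[0, 1])
print(arv)
```

Under this reading, a gene that is all "on" or all "off" within a class contributes a variance near zero, so a small average row variance means a steady pattern.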

32 Intra-pattern-consistency (2)
Measurement | Data (A) | Data (B)
residue | 0.1975 | 0.4506
MSR | 0.0494 | 0.4012
ARV* | 339.0667 | 5.3000

33 Inter-pattern-divergence
- In our model, both "inter-pattern-steadiness" and "intra-pattern-dissimilarity" on the same gene are reflected.
- Average block distance: [formula not recovered]
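The average block distance formula is missing from the transcript. As a hedged illustration only, the sketch below assumes it is the mean, over a chosen gene set, of the absolute difference between each gene's average value in two sample classes. The function names and the data are hypothetical.

```python
def class_mean(row, sample_idx):
    """Average value of one gene over the samples of one class."""
    return sum(row[s] for s in sample_idx) / len(sample_idx)

def average_block_distance(matrix, class_a, class_b, gene_idx):
    """Assumed divergence: mean absolute gap between class averages, over genes."""
    total = sum(
        abs(class_mean(matrix[g], class_a) - class_mean(matrix[g], class_b))
        for g in gene_idx
    )
    return total / len(gene_idx)

# Hypothetical 2 genes x 6 samples; samples 0-2 and 3-5 are the two classes.
matrix = [
    [1, 1, 1, 5, 5, 5],  # separates the classes (class means 1 vs 5)
    [0, 2, 4, 0, 2, 4],  # identical class means (2 vs 2)
]
div = average_block_distance(matrix, [0, 1, 2], [3, 4, 5], gene_idx=[0, 1])
print(div)
```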

34 Pattern Quality
The purpose of pattern discovery is to identify the empirical pattern where the patterns inside each class are steady and the divergence between each pair of classes is large.

35 Pattern Quality (2)
Measurement | Data (A) | Data (B) | Data (C)
Con | 4.25 | 3.44 | 4.52
Div | 41.60 | 25.20 | 46.16
Ω | 14.2687 | 9.6074 | 15.3526
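The quality formula itself is not in the transcript, but the three rows of the table are numerically consistent with a simple relation. The check below assumes two sample classes that share the listed Con value and computes Div / sqrt(Con + Con); this relation is inferred from the numbers and is not stated on the slide.

```python
import math

# Values copied from the table: (Con, Div, reported quality) per data set.
table = {
    "A": (4.25, 41.60, 14.2687),
    "B": (3.44, 25.20, 9.6074),
    "C": (4.52, 46.16, 15.3526),
}

def quality_two_classes(con, div):
    """Assumed form: Div / sqrt(Con + Con) for two classes with the same Con."""
    return div / math.sqrt(con + con)

recomputed = {
    name: quality_two_classes(con, div) for name, (con, div, _) in table.items()
}
for name, value in recomputed.items():
    print(name, round(value, 4), "reported:", table[name][2])
```

Read this way, quality grows when the classes diverge more (larger Div) and when each class is steadier (smaller Con), which matches the stated purpose of pattern discovery.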

36 The Problem
Input
1. m samples, each measured on n genes
2. The number of sample categories K
Output
A K-partition of the samples (empirical pattern) and a subset of genes (informative space) such that the pattern quality of the partition projected on the gene subset is maximized.

37 Strategy
Start with a random K-partition of samples and a subset of genes as the candidate informative space. Iteratively adjust the partition and the gene set toward the optimal solution.
Basic elements:
- A state:
  - A partition of samples {S_1, S_2, …, S_K}
  - A set of genes G' ⊆ G
  - The corresponding pattern quality Ω
- An adjustment:
  - For a gene ∉ G', insert it into G'
  - For a gene ∈ G', remove it from G'
  - For a sample in group S', move it to another group

38 Strategy (2)
Iteratively adjust the partition and the gene set toward the optimal pattern.
- For each gene, try the possible insert/remove.
- For each sample, try the best movement.
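The iterative adjustment can be sketched as a greedy local search. Everything below is an illustrative assumption rather than the authors' implementation: the quality function (sum over class pairs of divergence divided by the square root of the summed within-class variances), the balanced random start, the accept-only-improvements rule, and all names and data are hypothetical.

```python
import math
import random
from itertools import combinations

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def quality(matrix, labels, genes, k):
    """Assumed pattern quality: sum over class pairs of Div / sqrt(Con_a + Con_b)."""
    genes = sorted(genes)
    groups = [[s for s, c in enumerate(labels) if c == cls] for cls in range(k)]
    if not genes or any(len(group) < 2 for group in groups):
        return 0.0  # degenerate states get the lowest score

    def con(group):
        return sum(variance([matrix[g][s] for s in group]) for g in genes) / len(genes)

    def mean(g, group):
        return sum(matrix[g][s] for s in group) / len(group)

    total = 0.0
    for a, b in combinations(groups, 2):
        div = sum(abs(mean(g, a) - mean(g, b)) for g in genes) / len(genes)
        total += div / math.sqrt(con(a) + con(b) + 1e-9)  # epsilon avoids division by zero
    return total

def local_search(matrix, k, iterations=10, seed=0):
    rng = random.Random(seed)
    n_genes, n_samples = len(matrix), len(matrix[0])
    labels = [s % k for s in range(n_samples)]
    rng.shuffle(labels)  # random balanced K-partition
    genes = {g for g in range(n_genes) if rng.random() < 0.5} or set(range(n_genes))
    best = quality(matrix, labels, genes, k)
    history = [best]
    for _ in range(iterations):
        for g in range(n_genes):  # for each gene, try insert/remove
            trial = genes ^ {g}
            q = quality(matrix, labels, trial, k)
            if q > best:
                genes, best = trial, q
        for s in range(n_samples):  # for each sample, keep its best move
            for cls in range(k):
                trial = labels[:s] + [cls] + labels[s + 1:]
                q = quality(matrix, trial, genes, k)
                if q > best:
                    labels, best = trial, q
        history.append(best)
    return labels, genes, best, history

# Hypothetical 4 genes x 6 samples; genes 0 and 1 separate samples 0-2 from 3-5.
toy = [
    [5.0, 5.2, 4.9, 0.1, 0.0, 0.3],
    [0.2, 0.1, 0.0, 4.8, 5.1, 5.0],
    [2.0, 0.5, 3.1, 1.2, 2.9, 0.4],
    [1.0, 3.0, 0.2, 2.5, 0.7, 3.3],
]
labels, genes, best, history = local_search(toy, k=2)
print(labels, sorted(genes), best)
```

Because only improving moves are accepted, the recorded quality never decreases, but the search can stall in a local optimum or drift toward a very small gene set.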

39 Improvement
- Data standardization
  - The original gene intensity values → relative values, where [formula not recovered]
- Random order
- Conduct a negative action with a probability
  - Simulated annealing
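Two of these improvements can be illustrated briefly. The standardization formula is missing from the transcript, so the sketch assumes the common z-score form (subtract the gene's mean, divide by its standard deviation). The acceptance rule assumes the usual simulated-annealing form, where a move that lowers quality by |delta| is accepted with probability exp(delta / T). Both choices and all names are assumptions.

```python
import math
import random

def standardize_row(row):
    """Assumed standardization: z-score of one gene's intensities across samples."""
    n = len(row)
    m = sum(row) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in row) / (n - 1))
    return [(v - m) / sd for v in row]  # undefined for a constant row (sd == 0)

def accept(delta, temperature, rng=random):
    """Accept improvements always; accept a negative action with probability exp(delta / T)."""
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)

relative = standardize_row([1.0, 2.0, 3.0])
print(relative)

# A slightly worse move is often accepted at high temperature, rarely at low temperature.
rng = random.Random(0)
hot = sum(accept(-0.5, 10.0, rng) for _ in range(1000))
cold = sum(accept(-0.5, 0.01, rng) for _ in range(1000))
print(hot, cold)
```

Lowering the temperature over the iterations makes the search behave more and more like the purely greedy version while still allowing early escapes from local optima.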

40 Experimental Results
Data sets (genes × samples):
- Multiple-sclerosis data
  - MS-IFN: 4132 × 28 (14 MS vs. 14 IFN)
  - MS-CON: 4132 × 30 (15 MS vs. 15 Control)
- Leukemia data
  - 7129 × 38 (27 ALL vs. 11 AML)
  - 7129 × 34 (20 ALL vs. 14 AML)
- Colon cancer data
  - 2000 × 62 (22 normal vs. 40 tumor colon tissue)
- Hereditary breast cancer data
  - 3226 × 22 (7 BRCA1, 8 BRCA2, 7 Sporadics)

41 Experimental Results (2)

42 Interrelated Dimensional Clustering
The approach is applied to classifying multiple-sclerosis patients and IFN-drug-treated patients.
- (A) shows the distribution of the original 28 samples. Each point represents a sample, mapped from that sample's intensity vector over 4132 genes.
- (B) shows the distribution of the 28 samples on 2015 genes.
- (C) shows the distribution of the 28 samples on 312 genes.
- (D) shows the distribution of the same 28 samples after using our approach, which reduces the 4132 genes to 96 genes.

43 Experimental Results (3)

44 Experimental Results (4)

45 Applications
Gene function
- Co-expressed genes in the same cluster tend to share common roles in cellular processes, and genes of unrelated sequence but similar function cluster tightly together.
- A similar tendency was observed in both yeast data and human data.
Gene regulation
- By searching for common DNA sequences in the promoter regions of genes within the same cluster, regulatory motifs specific to each gene cluster are identified.
Cancer prediction
Normal vs. tumor tissue classification
Drug treatment evaluation
…

46 Summary
- We have developed advanced approaches for gene expression data analysis which work more effectively than traditional analysis approaches.
- This research area is exciting and challenging.
- There are many interesting research issues.

