Fuzzy Machine Learning Methods for Biomedical Data Analysis


1 Fuzzy Machine Learning Methods for Biomedical Data Analysis
Yanqing Zhang, Department of Computer Science, Georgia State University, Atlanta, GA

2 Outline
Background; Fuzzy Association Rule Mining for Decision Support (FARM-DS); FARM-DS on Medical Data; FARM-DS on Microarray Expression Data; Fuzzy-Granular Gene Selection on Microarray Expression Data; Conclusion and Future Work
This is the outline of my presentation. First, I will briefly review the background. After introducing the proposed fuzzy association rule mining for decision support system, I will present the experimental results on biomedical datasets. Because that part was already reported in my proposal, today we will focus on microarray data analysis with fuzzy-granular methods, including FARM-DS. Finally, I will summarize the presentation and discuss future work.

3 Background
Theory: Computational Intelligence, Granular Computing, Fuzzy Sets; Knowledge Discovery and Data Mining (KDD); Decision Support Systems (DSS); Rule-Based Reasoning (RBR); Association Rule Mining. Application: Bioinformatics, Medical Informatics, etc. Concerns: accuracy and interpretability.
In the last decade, with the advent of genomic and proteomic technologies, more and more biomedical databases have been created and have been growing very fast. The general target of my research is intelligent data analysis with hybrid Computational Intelligence techniques, including fuzzy sets, granular computing, clustering, and association rule mining, to extract knowledge from these databases and ease the biomedical decision-making process. An effective DSS is expected to be both accurate and easy to interpret.

4 Outline
Background; Fuzzy Association Rule Mining for Decision Support (FARM-DS); FARM-DS on Medical Data; FARM-DS on Microarray Expression Data; Fuzzy-Granular Gene Selection on Microarray Expression Data; Conclusion and Future Work
So now I will quickly go through the algorithm design of FARM-DS.

5 Motivation – deal with numeric data
Traditional association rule mining: IF X, THEN Y, with conf = Pr(Y|X) and supp = Pr(X and Y); it does not work on numeric data. Fuzzy logic (Zadeh, 1965) enables the feature transformation needed for fuzzy AR mining.
Basically, association rules identify feature subsets that are statistically related in the underlying data. An association rule has the form "IF X, THEN Y", where X and Y are disjoint conjunctions of feature-value pairs. The confidence of the rule is the conditional probability of Y given X, Pr(Y|X), and the support of the rule is the prior probability of X and Y, Pr(X and Y); here probability is taken to be the observed frequency in the dataset. However, traditional AR mining algorithms can only handle datasets with categorical features. To extend them to discover correlations in numeric data, it is natural to use fuzzy logic to split a numeric feature into discrete fuzzy sets for feature transformation; traditional AR mining algorithms can then work on the transformed dataset. A lot of work has been conducted on fuzzy association rule mining under this basic idea.
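
To make the support and confidence definitions concrete, here is a minimal Python sketch; the transactions, feature names, and helper functions are illustrative, not from the source.

```python
# A minimal sketch of support and confidence for a rule "IF X THEN Y",
# where X and Y are sets of (feature, value) items and each transaction
# is a set of such items. Toy data below is illustrative only.
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, X, Y):
    """Pr(Y | X) estimated as supp(X and Y) / supp(X)."""
    supp_x = support(transactions, X)
    return support(transactions, X | Y) / supp_x if supp_x > 0 else 0.0

transactions = [
    {("outlook", "sunny"), ("windy", "no"), ("play", "yes")},
    {("outlook", "sunny"), ("windy", "yes"), ("play", "no")},
    {("outlook", "rain"), ("windy", "no"), ("play", "yes")},
]
X = {("outlook", "sunny")}
Y = {("play", "yes")}
print(support(transactions, X | Y))   # supp = Pr(X and Y) = 1/3
print(confidence(transactions, X, Y)) # conf = Pr(Y|X) = 1/2
```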

6 Motivation – decision support
FARs for classification: accuracy vs. interpretability. Very few works: Hu et al. (2002), which suffers combinatorial rule explosion, and Chatterjee et al. (2004), which requires human intervention.
However, most AR or FAR mining algorithms are used to describe, or interpret, correlations inside a dataset. At the other end of the story, humans always need to make decisions in the real world. The simplest decision is binary classification: given a sample, we may need to decide whether it is good or bad. State-of-the-art classifiers such as SVMs and neural networks, although they achieve high classification accuracy, are well known as "black boxes": how they classify or predict a sample is hard for humans to understand, and hence they cannot provide effective decision support. Because association rules are easy to understand, it is promising to go one step beyond data description, that is, to utilize these ARs or FARs for decision support to help human experts make decisions, provided that predictions based on these rules are accurate enough. As far as we know, very few works have been conducted in this promising research field. Hu et al. proposed a FARM system in 2002; however, their system faces combinatorial rule explosion and hence cannot handle data with a high-dimensional feature space. Chatterjee et al. designed another FARM system in 2004; one shortcoming is that some parameters of the system must be predefined by humans based on experience, and it is usually difficult for a human to estimate these parameters accurately.

7 FARM-DS
Target: numeric data, binary classification. Effectiveness: accuracy and interpretability. Modeling process: training and testing.
To extract fuzzy association rules from numeric data, and to use these FARs to provide effective decision support for binary classification problems, we propose the FARM-DS system in this work. Notice that effectiveness is evaluated by both accuracy and interpretability. The new FARM-DS system consists of two phases: the training phase and the testing phase. In the training phase, four steps are executed to extract fuzzy association rules. These FARs are thereafter used to predict unseen samples in the testing phase.

8 Step 1: Fuzzy Interval Partition
1-in-1-out 0-order TSK model; ANFIS for model optimization and parameter selection (Jang, 1993).
Step 1 in the training phase is fuzzy interval partition. In this step, we build a very simple 1-in-1-out 0-order TSK fuzzy model for each feature. As a result, we split a numeric feature into multiple fuzzy intervals, which are represented by these simple fuzzy rules and their corresponding fuzzy membership functions. Notice that the conclusion part of each rule is just the class label, and that there are at least two fuzzy sets per feature. Here is an example that splits a feature into two fuzzy sets with the linguistic terms "low" and "high": if the feature value is in the low fuzzy set, the sample is negative; if it is high, the sample is positive. An ANFIS system is used to find the optimal number of fuzzy sets, as well as fuzzy MFs with the optimal shape and parameters, using a validation heuristic.
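
The following is a minimal sketch of this step, not the ANFIS-tuned model from the talk: it hard-codes trapezoidal "low"/"high" membership functions with illustrative breakpoints lo and hi, whereas FARM-DS would learn the number, shape, and parameters of the MFs with ANFIS.

```python
import numpy as np

def trapmf(x, a, b, c, d):
    """Standard trapezoidal MF: rises on [a,b], flat on [b,c], falls on [c,d]."""
    x = np.asarray(x, dtype=float)
    left = np.clip((x - a) / (b - a), 0, 1) if b > a else (x >= a).astype(float)
    right = np.clip((d - x) / (d - c), 0, 1) if d > c else (x <= d).astype(float)
    return np.minimum(left, right)

def low_high(x, lo=0.2, hi=0.8):
    """Membership of x in the 'low' and 'high' fuzzy sets of a [0,1] feature.
    Encodes the two 0-order TSK rules 'IF f is low THEN y=-1' and
    'IF f is high THEN y=+1'; lo/hi are illustrative breakpoints."""
    mu_low = trapmf(x, -1.0, 0.0, lo, hi)   # fully 'low' below lo
    mu_high = trapmf(x, lo, hi, 1.0, 2.0)   # fully 'high' above hi
    return mu_low, mu_high

print(low_high(0.113))  # a value below lo is fully 'low': (1.0, 0.0)
```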

9 Step 2: Data Abstraction
Clustering: k-means, fuzzy C-means. Validation: number of clusters, optimal clustering, silhouette value. Clusters are labeled positive or negative.
In parallel with the first step, the second step performs data abstraction. Here clustering algorithms such as k-means, FCM, and self-organizing maps can be utilized. The basic idea is to group similar samples into multiple clusters based on their patterns in the feature space. In our FARM-DS system, a cluster with more positive samples than negative samples is defined to be a positive cluster, and a cluster with more negative samples than positive samples is defined to be a negative cluster. Similar to the first step, a validation heuristic is used to decide the optimal number of clusters and the optimal clustering result. The optimization target is to maximize the overall silhouette value: a larger silhouette value means that samples in the same cluster are more similar and samples from different clusters are more different. (The silhouette value of a sample measures how similar the sample is to samples in its own cluster compared with samples in other clusters, and ranges from -1 to +1.) After clustering, each cluster can be represented by some representative samples; for example, the center may be used to denote a cluster. As a result, a high-level data abstraction is achieved. In this way, the number of transactions, and of the rules that follow, is independent of the dimension of the input feature space; it is decided only by the number of clusters, which yields a compact rule base and in turn enhances generalization and interpretability when predicting unknown new samples. Currently, only the k-means clustering algorithm is used for data abstraction; we may try other clustering algorithms such as fuzzy C-means and self-organizing maps in the near future.
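
As an illustration of this step, here is a minimal sketch using scikit-learn's k-means and silhouette score on toy data; the validation heuristic (scan a range of k, keep the clustering with the best mean silhouette) follows the slide's description, while the data and the candidate range of k are made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # toy feature matrix
y = rng.choice([-1, 1], size=100)    # toy binary labels

# Keep the clustering with the largest mean silhouette value.
best = max(
    (KMeans(n_clusters=k, n_init=10, random_state=0).fit(X) for k in range(2, 8)),
    key=lambda km: silhouette_score(X, km.labels_),
)

# Label each cluster positive/negative by majority vote of its members.
for c in range(best.n_clusters):
    members = y[best.labels_ == c]
    n_pos, n_neg = (members == 1).sum(), (members == -1).sum()
    print(f"cluster {c}: {'positive' if n_pos > n_neg else 'negative'}, "
          f"|s+ - s-| = {abs(n_pos - n_neg)}")
```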

10 Step 3: Generating Fuzzy Discrete Transactions
Project the center of each cluster onto each feature to create transactions: for a positive cluster, +1 is inserted; for a negative cluster, -1 is inserted.
After fuzzy interval partition at step 1 for numeric data transformation, and clustering at step 2 for data abstraction, step 3 generates fuzzy discrete transactions. The idea is straightforward: given a cluster with sk+ positive samples and sk- negative samples, |sk+ - sk-| identical "fuzzy discrete transactions" are created. If it is a positive cluster, +1 is inserted into these transactions; if it is a negative cluster, -1 is inserted. After that, the center of the cluster is projected onto each feature. If the difference between the projection values on the different fuzzy sets of feature Fi is not significant, Fi is not inserted into these transactions; that is, Fi is pruned. This pruning improves the interpretability of the rules because shorter rules are induced. Otherwise, if the difference is significant enough (>= alpha), Fi is inserted in the form "Fi_1" or "Fi_0". Currently, only two MFs per feature are considered at step 1: on each input feature fi, two membership values are calculated for a center by projecting the center onto the feature. A typical example projects a center with fi = 0.113 onto trapezoidal membership functions. If there are multiple fuzzy sets for a feature, the alpha-cut may be replaced with other, more general operations such as max or sum.

11 Step 3 - example
A cluster with 5 positive and 2 negative samples yields 5 - 2 = 3 transactions, each containing the items "1" and "f1_1"; f2 is pruned.
Here is an example of projecting a cluster onto two features to generate fuzzy discrete transactions. The cluster contains 5 positive samples and 2 negative samples, so it is a positive cluster, and 5 minus 2 = 3 identical transactions are generated from it; each transaction includes the item "1". After that, the center is projected onto the two features F1 and F2. Because the difference between the projection values on the two fuzzy sets is not significant for F2, F2 is not inserted into the transactions. For F1, the projection value on "high" is significantly larger than the projection value on "low", so F1 is inserted into these transactions in the form "F1_1". The advantage is that we avoid combinatorial rule explosion, because the number of rules is not directly related to the dimensionality but is decided by the number of clusters.
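
A minimal sketch of the step described on the last two slides, reusing the low_high() membership sketch from step 1; alpha and the center values below are illustrative, while the item encoding (F<i>_0 / F<i>_1 plus the class item) follows the slides.

```python
def cluster_to_transactions(center, label, count, alpha=0.5):
    """Turn one cluster center into `count` identical fuzzy discrete
    transactions; count = |s+ - s-| and label = +1/-1 for the cluster."""
    items = [str(label)]                      # "+1" / "-1" class item
    for i, v in enumerate(center, start=1):
        mu_low, mu_high = low_high(v)         # sketch from step 1
        if abs(mu_high - mu_low) >= alpha:    # significant: keep the feature
            items.append(f"F{i}_{1 if mu_high > mu_low else 0}")
    return [frozenset(items)] * count         # identical transactions

# A positive cluster with 5 positive and 2 negative samples yields 3
# identical transactions; F2, near the fuzzy-set crossover, is pruned.
print(cluster_to_transactions(center=[0.9, 0.5], label=+1, count=5 - 2))
```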

12 Step 4: Association Rule Mining
Association rule mining on fuzzy discrete transactions with the traditional Apriori algorithm (Agrawal and Srikant, 1994). Rules have the form: IF f1 is low, f2 is high, ..., fh is low, THEN y = 1/-1. Rule pruning: for a pair of rules A and B, if B is more specific than A (that is, A is included in B) and B has the same support value as A, then A is eliminated. Example: A: IF f1 is low, THEN y = 1, sup = 50%; B: IF f1 is low and f2 is high, THEN y = 1, sup = 50%.
In the first three steps, numeric data is transformed into fuzzy discrete transactions, so it is easy to utilize traditional AR mining algorithms such as the Apriori algorithm proposed by Agrawal and Srikant in 1994. The association rules are represented in the form above; notice that the number of features in an AR is usually smaller than the number of original features, because insignificant features have already been removed from the corresponding fuzzy transactions. To improve interpretability and simplify the model, the rule-pruning process defined above is conducted.
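
The pruning criterion translates into a few lines of Python; the rule representation, (antecedent, consequent, support) triples with items like "F1_0" for "f1 is low", is an assumption carried over from the transaction sketch above, and the Apriori step itself is assumed to have produced the rules.

```python
def prune(rules):
    """Eliminate rule A when a strictly more specific rule B (same
    consequent, antecedent a proper superset of A's) has the same
    support, matching the slide's pruning criterion."""
    kept = []
    for ant_a, y_a, sup_a in rules:
        redundant = any(
            y_a == y_b and sup_a == sup_b and ant_a < ant_b  # A strictly inside B
            for ant_b, y_b, sup_b in rules
        )
        if not redundant:
            kept.append((ant_a, y_a, sup_a))
    return kept

rules = [
    (frozenset({"F1_0"}), +1, 0.50),          # A: IF f1 is low THEN y=1
    (frozenset({"F1_0", "F2_1"}), +1, 0.50),  # B: IF f1 is low AND f2 is high THEN y=1
]
print(prune(rules))  # A is eliminated; only the more specific rule B survives
```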

13 Testing Phase
In the testing phase, the performance of the fuzzy association rules is evaluated on the testing dataset. Assume there are r+ positive rules and r- negative rules. For each new sample, its positive weight weight+ is defined as the sum of the firing strengths of all positive rules; the negative weight weight- is defined similarly over the negative rules. The firing strength of a rule is calculated by projecting the sample onto each feature in the rule and then calculating the activation difference on the different fuzzy sets. Finally, the class label is determined from the difference between the positive weight and the negative weight, plus a bias, which can be optimized by cross-validation.
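
The transcript does not pin down how the per-feature activations are combined into a rule's firing strength, so the sketch below makes an assumption (the product t-norm, a common fuzzy AND) and should be read as one plausible reading rather than the exact FARM-DS computation. `mfs[i]` is a per-feature membership function such as low_high() above, and rules reuse the (antecedent, consequent, support) triples from the pruning sketch.

```python
def firing_strength(sample, antecedent, mfs):
    """Product of the sample's memberships in each fuzzy set the rule names."""
    strength = 1.0
    for item in antecedent:                   # items look like "F1_1" / "F2_0"
        feat, side = item.split("_")
        i = int(feat[1:]) - 1                 # feature index
        strength *= mfs[i](sample[i])[int(side)]
    return strength

def classify(sample, pos_rules, neg_rules, mfs, bias=0.0):
    """Class label from weight+ - weight- + bias, as on the slide."""
    w_pos = sum(firing_strength(sample, ant, mfs) for ant, _, _ in pos_rules)
    w_neg = sum(firing_strength(sample, ant, mfs) for ant, _, _ in neg_rules)
    return 1 if w_pos - w_neg + bias > 0 else -1
```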

14 Adaptive FARM-DS
Train: (1) fuzzy interval partition; (2) data abstraction; (3) generate fuzzy discrete transactions; (4) AR mining. Then test. (He, et al. 2006a, IJDMB)
As a summary, the new FARM-DS algorithm consists of two phases: the training phase and the testing phase. In the training phase, four steps are executed to mine fuzzy association rules. At step 1, a 1-in-1-out ANFIS system is used to generate fuzzy intervals on each input feature; each fuzzy interval is defined by a fuzzy membership function. At step 2, clustering is conducted for data abstraction to extract inherent data-distribution knowledge. At step 3, FARM-DS naturally transforms quantitative samples into "fuzzy discrete transactions" by projecting the center of each cluster extracted at step 2 onto the fuzzy intervals generated at step 1. Finally, at step 4, simple "IF-THEN" fuzzy association rules are mined from the "fuzzy discrete transactions" by the traditional Apriori association rule mining algorithm. These FARs are thereafter used to predict unseen samples in the testing phase. Notice that steps 1 and 2 can be executed independently in parallel.

15 Outline
Background; Fuzzy Association Rule Mining for Decision Support (FARM-DS); FARM-DS on Medical Data; FARM-DS on Microarray Expression Data; Fuzzy-Granular Gene Selection on Microarray Expression Data; Conclusion and Future Work
I'd like to skip this part, as it has already been reported in my proposal.

16 Empirical Studies
Classification algorithms: C4.5 decision trees (Quinlan, 1993); support vector machines (Vapnik, 1995); FARM-DS (He, et al. 2006a, IJDMB). Accuracy estimation: 5-fold cross-validation. Interpretability.
In this group of empirical studies, we compared the FARM-DS system with two other popular classifiers: C4.5 decision trees, proposed by Quinlan in 1993, and support vector machines, proposed by Vapnik in 1995. 5-fold cross-validation was used to evaluate performance: we randomly split the original dataset into 5 equal-size subsets; four subsets are combined as the training dataset and the remaining one is taken as the testing dataset. The training-testing process is repeated five times so that each subset is used as the testing dataset exactly once, and the average accuracy is reported as the accuracy of the system. After that, the parameters with the best validation accuracy are used to extract FARs on the whole dataset for interpretability analysis.
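
For concreteness, the 5-fold protocol looks like this in scikit-learn; the bundled breast cancer data is the diagnostic Wisconsin set, a stand-in for the BCW and Cleveland data actually used, and the SVM stands in for any of the three compared classifiers.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(SVC(), X, y, cv=5)  # one accuracy per held-out fold
print(scores.mean())                         # averaged as the reported accuracy
```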

17 Evaluation Metrics
Accuracy: classification error; area under the ROC curve (future work; Bradley, 1997). Interpretability: number of rules; average rule length.
We adopt multiple metrics to evaluate performance, focusing on two aspects: accuracy and interpretability. For accuracy we currently use the classification-error metric; we will add the area under the ROC curve in the near future. A smaller error and a larger AUC mean a more accurate classifier. The number of extracted rules and the average rule length are used to evaluate interpretability: intuitively, a classifier is easier to interpret when it has a small number of short rules.

18 Datasets
The datasets used in this group are the Wisconsin breast cancer dataset and the Cleveland heart-disease dataset, both available from the UCI repository of machine learning databases (Merz, et al. 1998).

19 Result analysis on Accuracy
In this work, we ran FARM-DS and SVM on these two datasets; because our experimental conditions are exactly the same as in Bennett's 1997 work, their results can be compared directly with ours. The SVM modeled by us is called SVM1, and the SVM modeled by Bennett is called SVM2. The results demonstrate that FARM-DS has almost the same accuracy as the optimal SVM, and higher accuracy than the C4.5 decision tree classifier: FARM-DS ≈ SVM > C4.5. (SVM2 and C4.5 results are from Bennett et al. 1997.)

20 Result analysis on Interpretability
SVM: high accuracy, hard to interpret. C4.5: low accuracy, easy to interpret. FARM-DS: high accuracy, easy to interpret.
As we all know, an SVM is a black box because its classification decisions are hard to understand. Decision trees induce rules that are easy to interpret, but their low accuracy decreases the effectiveness of the induced rules. Our FARM-DS has both high accuracy and good interpretability.

21 Interpretability (1)
FARs extracted by FARM-DS are short and compact, and hence easy to understand.
In the experiments, FARM-DS is executed again on the whole dataset. Taking the BCW data as an example, 22 positive rules and 8 negative rules are extracted. On average, the length of a positive rule is 2.6, the length of a negative rule is 4.3, and every sample activates 3.3 positive rules and 5.6 negative rules. We believe that both the short length and the small number of activated rules make the extracted FARs easy to understand for further study.

22 Interpretability (2)
FARs may help human experts to correct wrongly classified samples.
There are 19 samples wrongly classified by FARM-DS in the Wisconsin dataset; however, we notice that 12 of these samples activate some correct rules.

23 Interpretability (3)
The larger support of the negative rules may help human experts to make the final correct decision and find inherent disease-causing mechanisms.
For example, the first validation sample in fold 1 is classified as positive but is actually negative (a false positive). Its positive weight is weight+ = 2.0000. For this sample, FARM-DS returns 2 fired positive rules and 5 fired negative rules, of which the most general and the most specific ones are shown in the table on this slide. The larger support of the negative rules may help human experts make the final correct decision and find inherent disease-causing mechanisms.

24 Interpretability (4)
FARs are helpful for selecting important features: a higher activation frequency means a more important feature.
Intuitively, the more frequently a feature is activated, the more important it is. In the experiments, we calculate the activation frequency. For the BCW data, f4, f6, and f8 are more important, while f1, f7, and f9 are less important. Human experts may work on the important features first.

25 Outline
Background; Fuzzy Association Rule Mining for Decision Support (FARM-DS); FARM-DS on Medical Data; FARM-DS on Microarray Expression Data; Fuzzy-Granular Gene Selection on Microarray Expression Data; Conclusion and Future Work
Now I will report the results of microarray expression data analysis with FARM-DS.

26 Microarray Expression Data
Extremely high dimensionality; gene selection; cancer classification; rule-based reasoning.
A typical microarray expression dataset is extremely sparse compared with a traditional classification dataset. For example, the AML/ALL leukemia dataset has only 72 samples (tissues) with 7129 features (gene expression measurements). That means that, without gene selection, we have to discriminate and classify very few samples in a very high-dimensional space. This is unnecessary, and even harmful for classification, because it is believed that no more than 10% of these 7129 genes are relevant to leukemia classification (Golub, et al. 1999); such extreme sparseness is believed to significantly deteriorate the performance of a classifier. As a result, the ability to extract a subset of informative genes while removing irrelevant or redundant genes is crucial for accurate classification; it is also helpful for biologists seeking the inherent cancer-causing mechanisms. After gene selection, FAR mining is conducted on the expression data, and the resulting FARs are used to classify new tissue samples. Moreover, thanks to their easy interpretability, FARs may also support human experts in rule-based reasoning.

27 Empirical Studies
Rule-based reasoning/classification: CART decision trees (Breiman, et al. 1984); ANFIS fuzzy neural networks (Jang, 1993); FARM-DS (He, et al. 2006a, IJDMB).
The FARM-DS system is compared with two other rule-based classifiers: CART decision trees, proposed by Breiman in 1984, and ANFIS, proposed by Jang in 1993. The accuracy of each model is estimated with leave-one-out cross-validation.

28 Evaluation Metrics
Accuracy: classification error; area under the ROC curve (Bradley, 1997). Accuracy estimation: leave-one-out cross-validation. Interpretability: number of rules; average rule length.
Accuracy is evaluated with the classification error and the area under the ROC curve; a smaller error and a larger AUC mean a more accurate classifier. We use leave-one-out cross-validation to estimate the real classification performance on unknown new samples: with n samples, in each fold we build a classifier on n-1 samples and test it on the remaining one. This process is repeated n times so that each sample is used for testing exactly once, and the accuracy averaged over the n folds estimates the real classification accuracy. Interpretability is evaluated with the number of rules and the rule lengths: a classifier is easy to interpret if its rules are few and short.
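
A minimal leave-one-out sketch with scikit-learn, on toy data standing in for a microarray matrix: with n samples there are n folds, each holding out exactly one sample, and the mean of the n 0/1 outcomes estimates accuracy.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))            # 20 toy samples, 8 features
y = rng.choice([0, 1], size=20)

acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                      cv=LeaveOneOut()).mean()   # n folds, one sample held out each
print(acc)
```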

29 AML/ALL leukemia dataset
The AML/ALL leukemia dataset is used in these experiments. The 8 gene features listed in the table on this slide are believed to be related to leukemia and hence are used for rule extraction (Tang, et al. 2006).

30 Result analysis: AML/ALL leukemia dataset
FARM-DS is more accurate than CART: there are 2 classification errors with FARM-DS versus 7 with CART, and FARM-DS also has the largest area under the ROC curve. On the other hand, compared with ANFIS, FARM-DS extracts much shorter rules (average rule length 4.8 vs. 8) and is thus easier to interpret. In short: higher accuracy than CART, easier to interpret than ANFIS.

31 Rules extracted by FARM-DS: AML/ALL leukemia dataset
IF gene2 (Y12670), gene3 (D14659), and gene5 (M80254) are down-regulated, THEN the tissue is ALL (-1).
The table on this slide lists the 5 rules extracted by FARM-DS; for example, the first rule reads as above. These rules are easy to interpret and hence may be more helpful to biomedical studies.

32 Prostate cancer dataset
Another group of experiments uses the prostate cancer dataset. There are 102 tissue samples, each described by a very large number of gene features; the 8 genes listed here are closely related to prostate cancer (Tang, et al. 2006).

33 Result analysis: prostate cancer dataset
Similar to the leukemia dataset, FARM-DS has higher accuracy than CART: there are 13 errors with CART but only 7 with FARM-DS, and FARM-DS has a larger area under the ROC curve than CART. On the other hand, compared with ANFIS, FARM-DS extracts much shorter rules (average rule length 3.1 vs. 8) and is hence easier to interpret. In short: higher accuracy than CART, easier to interpret than ANFIS.

34 Rules extracted by FARM-DS: prostate cancer dataset
The 15 fuzzy association rules are listed in the table on this slide, and they show some interesting patterns: it appears that if gene G5 is down-regulated the sample is healthy, and otherwise it is a cancerous tissue. A similar pattern is also seen for gene G1.

35 Outline
Background; Fuzzy Association Rule Mining for Decision Support (FARM-DS); FARM-DS on Medical Data; FARM-DS on Microarray Expression Data; Fuzzy-Granular Gene Selection on Microarray Expression Data; Conclusion and Future Work
We also designed a fuzzy-granular method to select marker genes from microarray expression data.

36 Gene Selection and Cancer Classification on Microarray Expression Data
Extremely high dimensionality: the AML/ALL leukemia dataset is 72 × 7129, with no more than 10% relevant genes (Golub, et al. 1999). Gene selection enables accurate classification and is helpful for cancer study.
As discussed earlier, extracting a subset of informative genes while removing irrelevant or redundant genes is crucial for accurate classification and helpful for finding the inherent cancer-causing mechanisms. From the data mining viewpoint, this gene selection problem is essentially a feature selection, or dimensionality reduction, problem.

37 Gene Categorization and Gene Ranking
Informative genes, which are really cancer-related. Redundant genes, which are also cancer-related, but some other informative genes are regulated similarly and more significantly for cancer classification. Irrelevant genes, which are not cancer-related and whose existence does not affect cancer classification. Noisy genes, which are not cancer-related but have negative effects on cancer classification.

38 Information Loss
Noise: a noisy gene may overfit by itself, be complementary to redundant or irrelevant genes, or conflict with informative genes. Imbalanced gene selection. Inflexibility.
A noisy gene may individually contribute to discriminating the training samples through some non-cancer-related factor, so that it is ranked high (overfitting); it may be complementary to some redundant or irrelevant genes, so that those genes are ranked higher; or it may conflict with some informative genes, so that those informative genes are ranked lower. (Notice that the pre-filtering step by the RI metric is targeted at minimizing this kind of effect by eliminating most irrelevant genes.) How do we decrease information loss? Granulation!

39 Coarse Granulation with Relevance Indexes
Targets: remove irrelevant genes, and tune the thresholds to select genes in balance (moving from imbalanced to balanced selection).

40 Fine Granulation with Fuzzy C-Means Clustering
We explicitly group genes with similar expression patterns into clusters, after which the lower-ranked genes in each cluster can be safely removed as redundant. The assumption is that genes with similar expression patterns also have similar functions in regulating cancers. Furthermore, due to the complex correlation between genes, this similarity is by no means a "crisp" concept: fuzzy C-means handles it by assigning a gene to multiple clusters, so a really informative gene gets more than one opportunity to survive. In summary: clustering is done in the space of training samples; genes with similar expression patterns have similar functions; and a gene may have multiple functions (this is where fuzziness works!).
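
Below is a hand-rolled fuzzy C-means sketch (the textbook FCM update equations, fuzzifier m = 2) clustering genes in the space of training samples; this illustrates the technique, not necessarily the exact implementation used in the dissertation, and the toy data is illustrative.

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    """Standard FCM: X has genes on rows, training samples on columns.
    Returns cluster centers and the c x n membership matrix U; unlike
    k-means, every gene gets a membership in every cluster."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                           # memberships sum to 1 per gene
    for _ in range(iters):
        W = U ** m
        centers = W @ X / W.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        p = 2 / (m - 1)
        U = d ** (-p) / np.sum(d ** (-p), axis=0)  # standard membership update
    return centers, U

genes = np.random.default_rng(1).normal(size=(50, 10))  # 50 genes, 10 samples
centers, U = fuzzy_cmeans(genes, c=3)
print(U.sum(axis=0)[:5])  # ~1.0 each: memberships per gene sum to one
```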

41 Conquer with Correlation-Based Ranking
After clustering, the third step ranks the genes in each cluster separately. Three correlation-based methods are used for gene ranking: S2N, proposed by Furey in 2000; FC, designed by Pavlidis in 2001; and T-statistics, used by Duan in 2004. In these formulas, a larger weight value means a higher rank, and the gene with the largest weight is the most informative in a cluster. Lower-ranked genes are removed as redundant genes.
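
The transcript does not reproduce the slide's formulas, so the sketch below uses the standard definitions of the three scores from the cited papers: S2N = |μ+ − μ−| / (σ+ + σ−), FC = (μ+ − μ−)² / (σ+² + σ−²), and the two-sample t-statistic; the toy data is illustrative.

```python
import numpy as np

def rank_scores(X, y):
    """Per-gene ranking scores. X: genes on rows, samples on columns;
    y: +1/-1 class label per sample. Larger score = higher rank."""
    pos, neg = X[:, y == 1], X[:, y == -1]
    mp, mn = pos.mean(axis=1), neg.mean(axis=1)
    sp, sn = pos.std(axis=1, ddof=1), neg.std(axis=1, ddof=1)
    s2n = np.abs(mp - mn) / (sp + sn)                    # signal-to-noise
    fc = (mp - mn) ** 2 / (sp ** 2 + sn ** 2)            # Fisher criterion
    ts = (mp - mn) / np.sqrt(sp ** 2 / pos.shape[1] +
                             sn ** 2 / neg.shape[1])     # t-statistic
    return s2n, fc, np.abs(ts)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 12))          # 30 genes x 12 samples (toy)
y = np.array([1] * 6 + [-1] * 6)
s2n, fc, ts = rank_scores(X, y)
print(np.argsort(-s2n)[:3])            # top-3 genes of a cluster by S2N
```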

42 Aggregation with Data Fusion
Pick genes from different clusters in balance; an informative gene is more likely to survive (due to fuzzy clustering).

43
Original gene set → relevance-index-based pre-filtering → relevant gene set → fuzzy C-means clustering → gene clusters 1, 2, ..., K → correlation-based gene ranking 1, 2, ..., K within the clusters → final gene set.

44 Empirical Study
Comparison: Signal-to-Noise (S2N) (Furey, et al. 2000) vs. Fuzzy-Granular + S2N; Fisher Criterion (FC) (Pavlidis, et al. 2001) vs. Fuzzy-Granular + FC; T-Statistics (TS) (Duan, et al. 2004) vs. Fuzzy-Granular + TS.
In our experiments, comparison studies evaluate the three correlation algorithms on the whole gene set and on the gene subsets obtained after fuzzy granulation, so there are altogether 6 gene selection methods in the experiments.

45 Evaluation Methods
Metrics: accuracy, sensitivity, specificity, area under the ROC curve. Estimation: leave-one-out CV and .632 bootstrapping, where Perf = 0.368 × training perf + 0.632 × testing perf.
The performance is evaluated with four metrics: accuracy, sensitivity, specificity, and area under the ROC curve. The performance of each model is estimated with leave-one-out cross-validation. We also tried bootstrapping, because some argue that bootstrapping is better than cross-validation for estimating accuracy on small datasets such as microarray datasets. The basic bootstrapping process: given n samples, one bootstrapping round randomly selects n samples with replacement, i.e., after picking a sample from the dataset we return it before the next random selection, so a sample may be selected multiple times. On average, 63.2% of the samples are selected for training and the other 36.8% are used for testing. The performance of one bootstrapping round is calculated with the formula above. We repeat this process B times, and the average performance over the B rounds estimates the real classification performance.
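
One .632 bootstrap round, as described above, in a short sketch; the model, toy data, and B are illustrative, and the 0.368/0.632 blend follows the standard .632 estimator.

```python
import numpy as np
from sklearn.svm import SVC

def bootstrap632(model, X, y, B=100, seed=0):
    """Average of B rounds of perf = 0.368*training_perf + 0.632*testing_perf."""
    rng = np.random.default_rng(seed)
    n, perfs = len(y), []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)           # sample with replacement
        out = np.setdiff1d(np.arange(n), idx)      # ~36.8% left out on average
        if out.size == 0:
            continue
        model.fit(X[idx], y[idx])
        perfs.append(0.368 * model.score(X[idx], y[idx])
                     + 0.632 * model.score(X[out], y[out]))
    return float(np.mean(perfs))

rng = np.random.default_rng(1)
X, y = rng.normal(size=(40, 6)), rng.choice([0, 1], size=40)
print(bootstrap632(SVC(), X, y))
```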

46 Prostate Cancer Dataset
The prostate cancer dataset is used; this dataset is very high-dimensional.

47 Result analysis: prostate cancer dataset
The results show that fuzzy granulation consistently improves classification performance over the original correlation-based method without fuzzy granulation: with leave-one-out cross-validation, it improves accuracy and area under the ROC curve by about 3%-8%, and similar improvement is observed with 100 rounds of .632 bootstrapping. The results also show that fuzzy granulation plus signal-to-noise is the best method; it is better than fuzzy granulation plus the Fisher criterion and fuzzy granulation plus T-statistics.

48 Colon Cancer Dataset
Similar performance improvement is also observed on the colon cancer dataset.

49 Result analysis: colon cancer dataset
First, fuzzy granulation plus a correlation-based method is always better than directly applying the correlation-based method to the whole data. Second, fuzzy granulation plus S2N is better than FG+FC and FG+TS.

50 Conclusion
High-level data abstraction via data clustering techniques; quantitative data transformed into fuzzy discrete transactions via fuzzy interval partition; Apriori algorithm for AR mining; strong decision support for biomedical study (high accuracy and easy to interpret); more accurate cancer classification (eliminate irrelevant/redundant genes to decrease noise, select informative genes in balance).
Two algorithms have been designed in my dissertation work. The first is a general fuzzy association rule mining algorithm: FARM-DS implements high-level data abstraction by applying data clustering techniques, automatically transforms continuous data into fuzzy discrete transactions with a simple 1-in-1-out TSK model, and then uses the Apriori algorithm for association rule mining on these fuzzy discrete transactions. Our experimental results show that FARM-DS can provide strong decision support for biomedical studies because classification based on FARs is both highly accurate and easy to interpret. The second algorithm applies fuzzy granulation to the large gene sets of microarray expression data; the fuzzy-granular method improves classification accuracy mainly by eliminating irrelevant and redundant genes to decrease noise and by selecting informative genes in balance.

51 Future Work
Applying FARM-DS to other biomedical applications; integrating more intelligent data analysis techniques; cloud-computing-based fuzzy data mining algorithms for big data mining; GPU-based fuzzy data mining algorithms for big data mining.

52 References
[1] Y.C. He, Y.C. Tang, Y.-Q. Zhang, and R. Sunderraman, "Mining Fuzzy Association Rules from Microarray Gene Expression Data for Leukemia Classification," Proc. of the IEEE International Conference on Granular Computing (GrC-IEEE 2006), Atlanta, May 10-12, 2006.
[2] Y.C. He, Y.C. Tang, Y.-Q. Zhang, and R. Sunderraman, "Adaptive Fuzzy Association Rule Mining for Effective Decision Support in Biomedical Applications," International Journal of Data Mining and Bioinformatics, Vol. 1, No. 1, pp. 3-18, 2006.
[3] Y.C. He, Y.C. Tang, Y.-Q. Zhang, and R. Sunderraman, "Fuzzy-Granular Gene Selection from Microarray Expression Data," Proc. of DMB 2006, in conjunction with IEEE ICDM 2006, Hong Kong, Dec. 18, 2006 (accepted).
[4] Y.C. He, Y.C. Tang, Y.-Q. Zhang, and R. Sunderraman, "Fuzzy-Granular Methods for Identifying Marker Genes from Microarray Expression Data," in Computational Intelligence for Bioinformatics, Gary B. Fogel, David Corne, and Yi Pan (eds.), IEEE Press, 2007.

53 Acknowledgments
Thanks go to Dr. Yuchun Tang and Dr. Yuanchen He for their hard work on this research project.

54 Questions? Comments?

