
1 Granular Machine Learning Methods for Biomedical Data Classification
Yanqing Zhang
Department of Computer Science, Georgia State University, Atlanta, GA 30302-3994
Email: yzhang@cs.gsu.edu
2016-6-28, University of Georgia

2 Outline
Granular Machine Learning
Granular Support Vector Machines: basic idea, motivation, state of the art
GSVM-RU (Repetitive Undersampling): highly imbalanced classification (data granules)
GSVM-RFE (Recursive Feature Elimination): high-dimensional classification (feature granules)
Conclusions

3 Granular Machine Learning
Granular Computing (GrC) is a general computation theory for effectively using granules (classes, clusters, subsets, groups, intervals) to build efficient computational models for complex applications with huge amounts of data, information, and knowledge.
2006 IEEE International Conference on Granular Computing (IEEE-GrC2006), at Georgia State University, Atlanta, May 10-12, 2006 (Dr. Vapnik: SVM+; Dr. Zadeh: Soft Computing; Dr. Smale: Mathematical Learning; Dr. Lin: GrC; etc.).
GrC + Machine Learning => Granular Machine Learning.
Major challenge: granular data inputs => granular ML => granular data outputs.

4 Granular Machine Learning (cont.)
Our work on GML:
Granular Support Vector Machines (Tang and Zhang, 2004-).
Granular Kernel Machines (Jin and Zhang, 2005-).
Granular Neural Networks (Zhang and Reyaz, 2000-).
Major applications:
Binary biomedical data classification (cancer, etc.).
Protein secondary structure prediction.
Highly imbalanced biomedical data classification.
Main goal: design GML methods that intelligently map granular data inputs to crisp/granular data outputs, and then effectively make correct decisions in data space and feature space.

5 A Major Challenge: Optimal Data Granulation and Optimal Feature Granulation
[Table: a data matrix with rows Data 1 ... Data n, columns Feature 1 ... Feature m, plus a Class column of +1/-1 labels.]

6 Outline
Granular Machine Learning
Granular Support Vector Machines: basic idea, motivation, state of the art
GSVM-RU (Repetitive Undersampling): highly imbalanced classification (data granules)
GSVM-RFE (Recursive Feature Elimination): high-dimensional classification (feature granules)
Conclusions

7 Binary Classification
Data Mining
  Predictive Data Modeling
    Classification
      Binary Classification
      Multi-Classification
    Regression
  Descriptive Data Modeling

8 Statistical Learning: Support Vector Machines (Vapnik, 1995)
Principles: the SRM (Structural Risk Minimization) principle; kernel functions.
SVM challenges: non-i.i.d. data; noise; high dimensionality; imbalance.

9 Granular Computing
Granulation: divide-and-conquer for a huge and complicated problem; it decomposes the problem into a sequence of smaller, similar tasks.
Knowledge-oriented: makes the mining algorithms more effective and/or more efficient.

10 Granular SVM: Basic Idea
Learning:
Divide (granulation): subspace-based or subset-based; granules may overlap; one granule may turn out to be the best.
Conquer: any classification model can be used in each granule; here we pick SVMs.
Aggregation: data fusion, information fusion, decision fusion, or knowledge fusion.
Prediction.
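The divide/conquer/aggregate loop above can be sketched as a toy ensemble. This is an illustrative sketch only, not the authors' exact method: it assumes random subset granulation, scikit-learn's SVC as the per-granule learner, and majority-vote aggregation; the name granular_svm is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def granular_svm(X, y, n_granules=3, seed=0):
    """Toy divide/conquer/aggregate: split the data into granules,
    train one SVM per granule, aggregate by majority vote."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    granules = np.array_split(order, n_granules)            # divide
    models = [SVC(kernel="rbf", gamma="scale").fit(X[g], y[g])
              for g in granules]                            # conquer
    def predict(X_new):                                     # aggregate
        votes = np.stack([m.predict(X_new) for m in models])
        return np.where(votes.sum(axis=0) >= 0, 1, -1)      # majority vote
    return predict
```

With an odd number of granules, the vote sum is never zero for labels in {-1, +1}, so the tie-break branch is never exercised.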

11 Initiatives - GSVM - Efficiency
Fast: it is usually more efficient to solve a sequence of subtasks than to solve the original task.
Scalable: for massive data, the modeling in different granules is easy to parallelize for HPC.

12 Initiatives - GSVM - Interpretability
The decision process is easy to understand.
SVMs and NNs are "black boxes"; GSVM can extract a few rules or cases from each smaller granule for RBR (Rule-Based Reasoning) or CBR (Case-Based Reasoning).

13 Initiatives - GSVM - Effectiveness (1)
A hybrid model combines SVMs with other GrC-based models:
Clustering, decision trees, and association rules split the whole feature space into a set of subspaces.
Sampling, bagging, and boosting split the whole dataset into a sequence of subsets.
New prior-knowledge-based granulation methods.
A hybrid model can combine the strengths of multiple models for more reliable prediction.

14 Initiatives - GSVM - Effectiveness (2)
If A is helpful for correctly classifying B, then A and B should be in the same granule.
If C is noise that confuses a classifier on B's classification, then B and C should be in different granules.
Then effectiveness can be improved.

15 [figure-only slide]

16 Outline
Granular Machine Learning
Granular Support Vector Machines: basic idea, motivation, state of the art
GSVM-RU (Repetitive Undersampling): highly imbalanced classification (data granules)
GSVM-RFE (Recursive Feature Elimination): high-dimensional classification (feature granules)
Conclusions

17 Case Study: Highly Imbalanced Classification
Highly skewed data distribution (100:1 or even more).
Imbalance is ubiquitous.
The primary interest is to find the rare samples.

18 Effect of a Highly Imbalanced Distribution
The majority class pushes the "ideal" decision boundary toward the minority class (Wu et al., 2003).

19 GSVM - Repetitive Undersampling (Tang and Zhang, 2005, IEEE GrC 2005)

20 GSVM-RU for Imbalanced Classification
Targets: minimize the negative effect of information loss; maximize the positive effect of data cleaning.
Assumptions: the boundary is pushed toward the minority class; a single SVM is able to extract a part of the informative samples.

21 Granulation (Divide): Repetitive Undersampling with SVMs

22 Aggregation (Conquer): Discard Old Information Granules

23 Aggregation (Conquer): Combine All Information Granules

24 GSVM-RU flowchart:
TR(1) is the original training dataset.
INFO(0) is the set of all positive samples in TR(1).
NLSV(i) is the set of negative samples that are support vectors of SVMu(i).
At iteration i: train SVMu(i) on TR(i); extract NLSV(i) and add it to form INFO(i); train SVMc(i) on INFO(i).
If accuracy is improved, set TR(i) = TR(i-1) - NLSV(i-1) and continue; otherwise end and output INFO(i-1) and SVMc(i-1).
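The flowchart on this slide can be rendered as a short loop. Below is a minimal sketch assuming scikit-learn's SVC and the "combine" aggregation (all positives plus every extracted negative granule); the accuracy-based stopping rule of the slide is simplified to a fixed iteration cap, and the name gsvm_ru is illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def gsvm_ru(X, y, max_iter=5):
    """Sketch of GSVM-RU: repeatedly train an SVM on the current training
    set, pull out its negative support vectors as an information granule,
    remove them, then train a final SVM on the combined granules."""
    pos = np.where(y == 1)[0]              # INFO(0): all positive samples
    remaining = np.where(y == -1)[0]       # negatives still in TR(i)
    granules = []
    for _ in range(max_iter):
        if len(remaining) == 0:
            break
        idx = np.concatenate([pos, remaining])
        svm_u = SVC(kernel="rbf", gamma="scale").fit(X[idx], y[idx])
        sv = idx[svm_u.support_]           # support vectors of SVMu(i)
        nlsv = sv[y[sv] == -1]             # NLSV(i): negative support vectors
        if len(nlsv) == 0:
            break
        granules.append(nlsv)
        remaining = np.setdiff1d(remaining, nlsv)  # TR(i+1) = TR(i) - NLSV(i)
    final = np.concatenate([pos] + granules)
    return SVC(kernel="rbf", gamma="scale").fit(X[final], y[final])
```

The final classifier plays the role of SVMc trained on the aggregated INFO set; swapping "combine" for "discard" would mean keeping only the latest granules rather than all of them.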

25 Effectiveness Analysis: Yeast Data
8 features; 1484 samples; 51 positive (3.44%).
7x6-fold double cross-validation; G-means metric; the "discard" aggregation is used.
GSVM-RU with "discard" aggregation: 84.2±0.7; KBA: 82.2±7.1; RBF-SVM: 59.0±12.1.
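For reference, the G-means metric reported in these result slides is conventionally the geometric mean of sensitivity and specificity; a minimal sketch, assuming labels in {+1, -1}:

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (TP rate) and specificity (TN rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sens * spec)
```

Because it multiplies the two per-class rates, G-means stays near zero whenever the minority class is ignored, which is why it is preferred over plain accuracy for highly imbalanced data.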

26 Effectiveness Analysis: Abalone Data
8 features; 4177 samples; 32 positive (0.77%).
7x6-fold double cross-validation; G-means metric; the "combine" aggregation is used.
GSVM-RU with "combine" aggregation: 73.4±1.6; KBA: 57.8±5.4; RBF-SVM: 0.0±0.0.

27 Effectiveness Analysis: DMC05 Online Shopping Behavior Prediction
70 features; 30000 training samples; 1746 positive (5.82%); 20000 testing samples.
GSVM-RU ranks 19th overall (1st in the US) out of 147.

28 Effectiveness Analysis: KDDCUP04 Protein Homology Prediction
74 features; 145751 training samples; 1296 positive (0.89%); 139658 testing samples.
GSVM-RU ranks 2nd out of 107 overall now; it was ranked 1st before.
The "discard" aggregation is used: the 1st, 1st, 7th, and 10th granules are used.

29 Efficiency Analysis
By comparison, KBA needs even more time than a standard SVM (Wu et al., 2003).

30 GSVM-RU Summary
GSVM-RU is efficient due to undersampling.
GSVM-RU is effective due to the retention of informative samples and the elimination of large quantities of redundant or even noisy samples.
The improvement in effectiveness seems more significant when the imbalance degree is higher: abalone dataset (0.77% positive); KDDCUP 2004 protein homology prediction (0.89% positive).
Future work: GSVM-RU + SMOTE (Chawla et al., 2002); parallelization.

31 Outline
Granular Machine Learning
Granular Support Vector Machines: basic idea, motivation, state of the art
GSVM-RU (Repetitive Undersampling): highly imbalanced classification (data granules)
GSVM-RFE (Recursive Feature Elimination): high-dimensional classification (feature granules)
Conclusions

32 Case Study: Gene Selection and Cancer Classification on Microarray Expression Data
Extremely high dimensionality: the AML/ALL leukemia dataset has 72 samples x 7129 genes; no more than 10% of the genes are relevant (Golub et al., 1999).
Gene selection (a feature-selection problem) enables accurate classification and is helpful for cancer study.
Data are non-i.i.d., imbalanced, and noisy.
Tasks: gene subset selection; cancer classification.

33 Gene Ranking
Informative genes: really cancer-related.
Redundant genes: cancer-related, but other informative genes function similarly and more significantly for cancer classification.
Irrelevant genes: not cancer-related; their existence does not affect cancer classification.
Noisy genes: not cancer-related, but they have negative effects on cancer classification.

34 GSVM-RFE (Tang et al., IEEE BIBE 2005)
Extract multiple cancer-related gene subsets for reliable cancer classification and for constructing gene regulation networks.

35 Fine Granulation with Fuzzy C-Means Clustering
Cluster genes in the training-sample space.
Genes with similar expression patterns have similar functions.
A gene may have multiple functions, so soft (overlapping) cluster membership is appropriate.
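A minimal sketch of standard Fuzzy C-Means (textbook FCM with fuzzifier m=2 by default, not necessarily the exact variant used here); the soft membership matrix U is what lets a gene belong to multiple clusters:

```python
import numpy as np

def fuzzy_cmeans(X, k, m=2.0, n_iter=100, seed=0):
    """Standard FCM: alternate between weighted cluster centers and
    distance-based soft membership updates. Returns (U, C) where
    U[i, j] is the membership of point i in cluster j."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1
    for _ in range(n_iter):
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]   # fuzzy-weighted centers
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))            # u_ij ∝ d_ij^(-2/(m-1))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, C
```

In the GSVM-RFE setting, a gene would be assigned to every cluster where its membership exceeds some threshold, rather than only to its argmax cluster.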

36 Conquer with SVM-Based Ranking (Guyon et al., 2002)
Lower-ranked genes are removed as redundant genes.
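SVM-RFE (Guyon et al., 2002) ranks features by the squared weights of a linear SVM and recursively drops the lowest-ranked one. A minimal sketch assuming scikit-learn's SVC; eliminating one feature per round (rather than a chunk) is a simplification:

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep):
    """Recursive feature elimination with a linear SVM: at each round,
    refit, score each surviving feature by w_j^2, drop the weakest."""
    features = list(range(X.shape[1]))
    while len(features) > n_keep:
        svm = SVC(kernel="linear").fit(X[:, features], y)
        w2 = (svm.coef_ ** 2).sum(axis=0)     # ranking criterion w_j^2
        features.pop(int(np.argmin(w2)))      # eliminate lowest-ranked
    return features
```

In GSVM-RFE this elimination runs inside each FCM gene cluster, so a redundant gene only competes against functionally similar genes.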

37 Aggregation with Data Fusion
Pick genes from different clusters in a balanced way.
An informative gene is more likely to survive.

38 Flexibility
In a gene regulation network, many different gene subsets may regulate cancers in different ways.
Different runs of GSVM-RFE extract different gene subsets. Moreover, genes that survive in multiple subsets deserve higher priority for biological study.

39 GSVM-RFE flowchart:
Original Gene Set -> Relevance-Index-based pre-filtering -> Relevant Gene Set -> Fuzzy C-Means Clustering -> Gene Clusters 1..K -> SVM-based Gene Elimination per cluster -> Survived Gene Set.
If the number of surviving genes > Nt, run SVM-based Gene Elimination again; otherwise output the Final Gene Set.

40 Empirical Studies
Compared algorithms: the S2N correlation-based algorithm (Furey et al., 2000); the SVM-RFE algorithm (Guyon et al., 2002); the GSVM-RFE algorithm.

41 Evaluation Metrics
Accuracy; sensitivity; specificity; area under the ROC curve (Bradley, 1997).
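The four metrics follow directly from the confusion counts; AUC can be computed with the rank-sum (Mann-Whitney) identity. A minimal sketch, assuming labels in {0, 1}, untied classifier scores, and an illustrative function name:

```python
import numpy as np

def metrics(y_true, y_pred, scores):
    """Accuracy, sensitivity, specificity, and AUC (rank-sum identity)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = np.asarray(scores)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / len(y_true)
    sens = tp / (tp + fn)                         # true positive rate
    spec = tn / (tn + fp)                         # true negative rate
    n_pos = np.sum(y_true == 1)
    n_neg = len(y_true) - n_pos
    ranks = scores.argsort().argsort() + 1        # 1-based ranks, no ties
    auc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return acc, sens, spec, auc
```

AUC here equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, which is why it is robust to the decision threshold.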

42 Prostate Cancer Dataset
High dimensionality.
Non-i.i.d., so cross validation or a random split is not suitable; the training and testing sets were prepared under different biological experimental contexts.
Imbalanced: 4761 positive and 110 negative if thresholds of 0.5 are used for both RI metrics.

43 Results: Statistical Analysis on the Prostate Cancer Dataset

44 Biological Literature Verification: Prostate Cancer Dataset
100% leave-one-out validation accuracy on the training dataset.
100% prediction accuracy on the testing dataset.

45 AML/ALL Leukemia Dataset
Non-i.i.d., so cross validation or a random split is not suitable.
The two datasets were prepared under different biological experimental conditions.

46 Results: Statistical Analysis on the AML/ALL Leukemia Dataset

47 Biological Literature Verification: AML/ALL Leukemia Dataset
100% leave-one-out validation accuracy on the training dataset.
100% prediction accuracy on the testing dataset.

48 GSVM-RFE Summary (1)
Relevance Index filtering: removes most irrelevant genes; reduces noise effects; selects genes in a balanced way.
FCM clustering: groups genes with similar functions into clusters; can assign a gene to multiple clusters; extracts multiple informative gene subsets.
SVM ranking: removes lower-ranked redundant genes in each cluster.

49 GSVM-RFE Summary (2)
Reliable cancer classification:
Granulation yields multiple compact "perfect gene subsets".
There is selection bias in each single gene subset.
Strong decision support for further cancer study: potentially helpful for constructing gene regulation networks.

50 Outline
Granular Machine Learning
Granular Support Vector Machines: basic idea, motivation, state of the art
GSVM-RU (Repetitive Undersampling): highly imbalanced classification (data granules)
GSVM-RFE (Recursive Feature Elimination): high-dimensional classification (feature granules)
Conclusions

51 Conclusions
Granular Machine Learning methods can be used to improve the performance of biomedical data classification.
Data granulation and feature granulation are important for effective biomedical data classification.
Future work: (1) data/feature domain granulation optimization: find optimal (or near-optimal) data/feature granules; (2) design relevant GML methods; (3) biomedical data multi-classification.

52 References
Y.C. Tang, Y.-Q. Zhang and Z. Huang, "Development of Two-Stage SVM-RFE Gene Selection Strategy for Microarray Expression Data Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 365-381, July-September 2007.
Y.C. Tang and Y.-Q. Zhang, "Granular SVM with Repetitive Undersampling for Highly Imbalanced Protein Homology Prediction," Proc. of the 2006 IEEE International Conference on Granular Computing (IEEE-GrC2006), Atlanta, May 10-12, 2006.
Y.C. He, Y.C. Tang, Y.-Q. Zhang and R. Sunderraman, "Mining Fuzzy Association Rules from Microarray Gene Expression Data for Leukemia Classification," Proc. of the 2006 IEEE International Conference on Granular Computing (IEEE-GrC2006), Atlanta, pp. 461-465, May 10-12, 2006.
Y.C. Tang, Y.C. He, Y.-Q. Zhang, Z. Huang, X.H. T. Hu and R. Sunderraman, "A Hybrid CI-Based Knowledge Discovery System on Microarray Gene Expression Data," Proc. of the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB2005), San Diego, Nov. 14-15, 2005.

53 References (cont.)
Y.C. Tang, Y.-Q. Zhang and Z. Huang, "FCM-SVM-RFE Gene Feature Selection Algorithm for Leukemia Classification from Microarray Gene Expression Data," Proc. of FUZZ-IEEE 2005, pp. 97-101, Reno, May 22-25, 2005.
Y.C. Tang and Y.-Q. Zhang, "Granular Support Vector Machines with Data Cleaning for Fast and Accurate Biomedical Binary Classification," Proc. of IEEE-GrC 2005, pp. 262-265, Beijing, July 25-27, 2005.
Y.C. Tang, Y.-Q. Zhang, Z. Huang and X.H. T. Hu, "Granular SVM-RFE Gene Selection Algorithm for Reliable Prostate Cancer Classification on Microarray Expression Data," Proc. of the Fifth IEEE Symposium on Bioinformatics & Bioengineering (BIBE 2005), Minneapolis, Oct. 19-21, 2005.
G. Wu and E.Y. Chang, "KBA: Kernel Boundary Alignment Considering Imbalanced Data Distribution," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 786-795, June 2005.

54 Acknowledgments
Yuchun Tang was supported by a Molecular Basis for Disease (MBD) Doctoral Fellowship, Georgia State University.
This work was supported in part by NIH under P20 GM065762.
Thanks to Professor Ying Xu and Dr. Huiling Chen!
Thank you, everyone!

55 Questions?

