The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.

The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide2 11/01/2006 Semi-supervised learning Administrative Student presentations start from next week Send me the list of references that you use, if your choices are different from what are listed at the class webpage Speaker: Allocate enough time for background. 20 minutes for general audience 20 minutes for people working in “your area” <20 minutes for experts Keep in mind of five “w” questions: What is it? Why is it important? Who cares? Works or not? and So what?

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide3 11/01/2006 Semi-supervised learning Administrative Participants Need to ask at least one question

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide4 11/01/2006 Semi-supervised learning Overview Semi-supervised learning Semi-supervised classification Semi-supervised clustering Search based methods Cop K-mean Seeded K-mean Constrained K-mean Similarity based methods

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide5 11/01/2006 Semi-supervised learning Supervised Classification Example....

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide6 11/01/2006 Semi-supervised learning Supervised Classification Example....................

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide7 11/01/2006 Semi-supervised learning Supervised Classification Example....................

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide8 11/01/2006 Semi-supervised learning. Unsupervised Clustering Example....................

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide9 11/01/2006 Semi-supervised learning Unsupervised Clustering Example....................

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide10 11/01/2006 Semi-supervised learning Semi-Supervised Learning Combines labeled and unlabeled data during training to improve performance: Semi-supervised classification: Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier. Semi-supervised clustering: Uses small amount of labeled data to aid and bias the clustering of unlabeled data.

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide11 11/01/2006 Semi-supervised learning Semi-Supervised Classification Example....................

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide12 11/01/2006 Semi-supervised learning Semi-Supervised Classification Example....................

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide13 11/01/2006 Semi-supervised learning Semi-Supervised Classification Algorithms: Semisupervised EM [Ghahramani:NIPS94,Nigam:ML00]. Co-training [Blum:COLT98]. Transductive SVM’s [Vapnik:98,Joachims:ICML99]. Assumptions: Known, fixed set of categories given in the labeled data. Goal is to improve classification of examples into these known categories.

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide14 11/01/2006 Semi-supervised learning Semi-Supervised Clustering Example....................

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide15 11/01/2006 Semi-supervised learning Semi-Supervised Clustering Example...................

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide16 11/01/2006 Semi-supervised learning Second Semi-Supervised Clustering Example....................

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide17 11/01/2006 Semi-supervised learning Second Semi-Supervised Clustering Example...................

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide18 11/01/2006 Semi-supervised learning Semi-Supervised Clustering Can group data using the categories in the initial labeled data. Can also extend and modify the existing set of categories as needed to reflect other regularities in the data. Can cluster a disjoint set of unlabeled data using the labeled data as a “guide” to the type of clusters desired.

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide19 11/01/2006 Semi-supervised learning Problem definition Input: A set of unlabeled objects Some domain knowledge Output: A partitioning of the objects into clusters Objective: Maximum intra-cluster similarity Minimum inter-cluster similarity High consistency between the partitioning and the domain knowledge

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide20 11/01/2006 Semi-supervised learning What is Domain Knowledge? Must-link and cannot-link Class labels Ontology

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide21 11/01/2006 Semi-supervised learning Why semi-supervised clustering? Why not clustering? Could not incorporate prior knowledge into clustering process Why not classification? Sometimes there are insufficient labeled data. Potential applications Bioinformatics (gene and protein clustering) Document hierarchy construction News/email categorization Image categorization

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide22 11/01/2006 Semi-supervised learning Semi-Supervised Clustering Approaches Search-based Semi-Supervised Clustering Alter the clustering algorithm using the constraints Similarity-based Semi-Supervised Clustering Alter the similarity measure based on the constraints Combination of both

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide23 11/01/2006 Semi-supervised learning Search-Based Semi-Supervised Clustering Alter the clustering algorithm that searches for a good partitioning by: Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]. Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]. Use the labeled data to initialize clusters in an iterative refinement algorithm (kMeans, EM) [Basu:ICML02].

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide24 11/01/2006 Semi-supervised learning Unsupervised KMeans Clustering KMeans iteratively partitions a dataset into K clusters. Algorithm: Initialize K cluster centers randomly. Repeat until convergence: Cluster Assignment Step: Assign each data point x to the cluster X l, such that L 2 distance of x from (center of X l ) is minimum Center Re-estimation Step: Re-estimate each cluster center as the mean of the points in that cluster

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide25 11/01/2006 Semi-supervised learning KMeans Objective Function Locally minimizes sum of squared distance between the data points and their corresponding cluster centers: Initialization of K cluster centers: Totally random Random perturbation from global mean Heuristic to ensure well-separated centers etc.

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide26 11/01/2006 Semi-supervised learning K Means Example

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide27 11/01/2006 Semi-supervised learning K Means Example Randomly Initialize Means x x

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide28 11/01/2006 Semi-supervised learning K Means Example Assign Points to Clusters x x

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide29 11/01/2006 Semi-supervised learning K Means Example Re-estimate Means x x

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide30 11/01/2006 Semi-supervised learning K Means Example Re-assign Points to Clusters x x

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide31 11/01/2006 Semi-supervised learning K Means Example Re-estimate Means x x

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide32 11/01/2006 Semi-supervised learning K Means Example Re-assign Points to Clusters x x

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide33 11/01/2006 Semi-supervised learning K Means Example Re-estimate Means and Converge x x

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide34 11/01/2006 Semi-supervised learning Semi-Supervised K-Means Constraints (Must-link, Cannot-link) COP K-Means Partial label information is given Seeded K-Means (Basu, ICML’02) Constrained K-Means

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide35 11/01/2006 Semi-supervised learning COP K-Means COP K-Means is K-Means with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points. Initialization: Cluster centers are chosen randomly but no must-link constraints that may be violated Algorithm: During cluster assignment step in COP-K- Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort. Based on Wagstaff et al.: ICML01

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide36 11/01/2006 Semi-supervised learning COP K-Means Algorithm

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide37 11/01/2006 Semi-supervised learning Illustration x x Must-link Determine its label Assign to the red class

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide38 11/01/2006 Semi-supervised learning Illustration x x Cannot-link Determine its label Assign to the red class

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide39 11/01/2006 Semi-supervised learning Illustration x x Cannot-link Determine its label The clustering algorithm fails Must-link

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide40 11/01/2006 Semi-supervised learning Evaluation Rand index: measures the agreement between two partitions, P1 and P2, of the same data set D. Each partition is viewed as a collection of n(n-1)/2 pairwise decisions, where n is the size of D. a is the number of decisions where P1 and P2 put a pair of objects into the same cluster b is the number of decisions where two instances are placed in different clusters in both partitions. Total agreement can then be calculated using Rand(P1; P2) = (a + b)/ (n (n -1)/2):

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide41 11/01/2006 Semi-supervised learning Evaluation

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide42 11/01/2006 Semi-supervised learning Semi-Supervised K-Means Seeded K-Means: Labeled data provided by user are used for initialization: initial center for cluster i is the mean of the seed points having label i. Seed points are only used for initialization, and not in subsequent steps. Constrained K-Means: Labeled data provided by user are used to initialize K-Means algorithm. Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re- estimated. Based on Basu et al., ICML’02.

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide43 11/01/2006 Semi-supervised learning Seeded K-Means Use labeled data to find the initial centroids and then run K-Means. The labels for seeded points may change.

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide44 11/01/2006 Semi-supervised learning Seeded K-Means Example

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide45 11/01/2006 Semi-supervised learning Seeded K-Means Example Initialize Means Using Labeled Data x x

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide46 11/01/2006 Semi-supervised learning Seeded K-Means Example Assign Points to Clusters x x

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide47 11/01/2006 Semi-supervised learning Seeded K-Means Example Re-estimate Means x x

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide48 11/01/2006 Semi-supervised learning Seeded K-Means Example Assign points to clusters and Converge x x the label is changed

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide49 11/01/2006 Semi-supervised learning Constrained K-Means Use labeled data to find the initial centroids and then run K-Means. The labels for seeded points will not change.

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide50 11/01/2006 Semi-supervised learning Constrained K-Means Example

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide51 11/01/2006 Semi-supervised learning Constrained K-Means Example Initialize Means Using Labeled Data x x

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide52 11/01/2006 Semi-supervised learning Constrained K-Means Example Assign Points to Clusters x x

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide53 11/01/2006 Semi-supervised learning Constrained K-Means Example Re-estimate Means and Converge x x

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide54 11/01/2006 Semi-supervised learning Datasets Data sets: UCI Iris (3 classes; 150 instances) CMU 20 Newsgroups (20 classes; 20,000 instances) Yahoo! News (20 classes; 2,340 instances) Data subsets created for experiments: Small-20 newsgroup: random sample of 100 documents from each newsgroup, created to study effect of datasize on algorithms. Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study effect of data separability on algorithms. Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x).

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide55 11/01/2006 Semi-supervised learning Evaluation Mutual information Objective function

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide56 11/01/2006 Semi-supervised learning Results: MI and Seeding Zero noise in seeds [Small-20 NewsGroup] Semi-Supervised KMeans substantially better than unsupervised KMeans

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide57 11/01/2006 Semi-supervised learning Results: Objective function and Seeding User-labeling consistent with KMeans assumptions [Small-20 NewsGroup] Obj. function of data partition increases exponentially with seed fraction

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide58 11/01/2006 Semi-supervised learning Results: Objective Function and Seeding User-labeling inconsistent with KMeans assumptions [Yahoo! News] Objective function of constrained algorithms decreases with seeding

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide59 11/01/2006 Semi-supervised learning Similarity Based Methods Questions: given a set of points and the class labels, can we learn a distance matrix such that inter-cluster distance are minimized and intra-cluster distance are maximized?

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide60 11/01/2006 Semi-supervised learning Distance metric learning Define a new distance measure of the form: Linear transformation of the original data

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide61 11/01/2006 Semi-supervised learning Distance metric learning

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide62 11/01/2006 Semi-supervised learning Semi-Supervised Clustering Example Similarity Based

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide63 11/01/2006 Semi-supervised learning Semi-Supervised Clustering Example Distances Transformed by Learned Metric

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide64 11/01/2006 Semi-supervised learning Semi-Supervised Clustering Example Clustering Result with Trained Metric

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide65 11/01/2006 Semi-supervised learning Source: E. Xing, et al. Distance metric learning Evaluation

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide66 11/01/2006 Semi-supervised learning Source: E. Xing, et al. Distance metric learning Evaluation

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide67 11/01/2006 Semi-supervised learning Additional Readings Combining Similarity and Search-Based Semi-Supervised Clustering “Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering”, Basu, et al. Ontology based semi-supervised clustering “A framework for ontology-driven subspace clustering”, Liu et al.

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide68 11/01/2006 Semi-supervised learning Summary Semi-supervised clustering Its applications in Bioinformatics? Not fully explored

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide69 11/01/2006 Semi-supervised learning Reference UT machine learning group http://www.cs.utexas.edu/~ml/publication/unsupervised.html Semi-supervised Clustering by Seeding http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf Constrained K-means clustering with background knowledge http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf Some slides are from Jieping Ye at Arizona State

The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.

Similar presentations

Presentation on theme: "The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.

Similar presentations

Presentation on theme: "The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006."— Presentation transcript:

Similar presentations

About project

Feedback