V. Clustering
2007.2.10, Artificial Intelligence Lab, Seunghee Lee
Text: Text Mining, pp. 82-93
Outline
V.1 Clustering tasks in text analysis
V.2 The general clustering problem
V.3 Clustering algorithms
V.4 Clustering of textual data
Clustering
An unsupervised process through which objects are classified into groups called clusters. (Cf. categorization, which is a supervised process.)
Applications: data mining, document retrieval, image segmentation, pattern classification.
V.1 Clustering tasks in text analysis (1/2)
Cluster hypothesis: "Relevant documents tend to be more similar to each other than to nonrelevant ones."
If the cluster hypothesis holds for a particular document collection, then clustering the documents may help to improve search effectiveness.
Improving search recall: when a query matches a document, the document's whole cluster can be returned.
Improving search precision: by grouping the documents into a much smaller number of groups of related documents.
V.1 Clustering tasks in text analysis (2/2)
Scatter/gather browsing method
Purpose: to enhance the efficiency of human browsing of a document collection when a specific search query cannot be formulated.
Session 1: the document collection is scattered into a set of clusters.
Session 2: the selected clusters are gathered into a new subcollection, with which the process may be repeated.
Reference: http://www2.parc.com/istl/projects/ia/sg-background.html
Query-specific clustering is also possible; here hierarchical clustering is appealing.
V.2 The clustering problem (1/2)
Clustering tasks: problem representation, definition of proximity measures, actual clustering of objects, data abstraction, evaluation.
Problem representation
Basically an optimization problem. Goal: select the best among all possible groupings of objects, as judged by a similarity function (a clustering quality function).
Feature extraction / feature selection
In a vector space model, objects are vectors in a high-dimensional feature space, and the similarity function is based on the distance between the vectors in some metric.
V.2 The clustering problem (2/2)
Similarity measures
Euclidean distance
Cosine similarity (the most common measure for text)
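The two measures above can be sketched in a few lines of plain Python, treating documents as lists of feature weights (a minimal illustration, not tied to any particular library):

```python
import math

def euclidean_distance(u, v):
    # Straight-line distance between two vectors; smaller means more similar.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    # Dot product divided by the product of vector norms; 1.0 means the
    # vectors point in the same direction, 0.0 means they are orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Cosine similarity is preferred for text because it ignores document length: a document and the same document repeated twice get similarity 1.0, while their Euclidean distance is large.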
V.3 Clustering algorithms (1/9)
Flat clustering: a single partition of a set of objects into disjoint groups.
Hierarchical clustering: a nested series of partitions.
Hard clustering: every object belongs to exactly one cluster.
Soft clustering: objects may belong to several clusters, with a fractional degree of membership in each.
V.3 Clustering algorithms (2/9)
Agglomerative algorithms: begin with each object in a separate cluster and successively merge clusters until a stopping criterion is satisfied.
Divisive algorithms: begin with a single cluster containing all objects and perform splitting until a stopping criterion is satisfied.
Shuffling algorithms: iteratively redistribute objects among clusters.
V.3 Clustering algorithms (3/9)
The k-means algorithm (1/2)
A hard, flat, shuffling algorithm.
V.3 Clustering algorithms (4/9)
Example of the k-means algorithm.
V.3 Clustering algorithms (5/9)
The k-means algorithm (2/2)
Simple and efficient; complexity O(kn) per iteration.
A bad initial selection of seeds can lead to a local optimum; this suboptimality of k-means motivates the Buckshot algorithm.
The ISO-DATA algorithm maximizes a clustering quality function Q (e.g., the total similarity of objects to their cluster centroids).
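The k-means procedure described above can be sketched as follows: seeds are drawn at random, then the assignment and centroid-update steps alternate. This is a minimal pure-Python sketch (the parameter names and the fixed iteration count are illustrative choices, not from the slides):

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Hard, flat, shuffling clustering: alternate assignment and update steps."""
    rng = random.Random(seed)
    # Initial seeds; a bad draw here is what can trap k-means in a local optimum.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (O(kn) work).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, clusters
```

Each iteration costs O(kn) distance computations, matching the complexity quoted above.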
V.3 Clustering algorithms (6/9)
EM-based probabilistic clustering algorithm (1/2)
A soft, flat, probabilistic algorithm.
V.3 Clustering algorithms (7/9)
EM-based probabilistic clustering algorithm (2/2)
V.3 Clustering algorithms (8/9)
Hierarchical agglomerative clustering (HAC)
Single-link method
Complete-link method
Average-link method
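The three linkage methods differ only in how the distance between two clusters is measured. A minimal sketch of the single-link variant (which uses the minimum pairwise distance; complete-link would use `max` and average-link the mean):

```python
def single_link_hac(points, num_clusters):
    """Agglomerative clustering: start with singletons, repeatedly merge the
    closest pair of clusters under the single-link (minimum-distance) criterion."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])   # merge cluster j into cluster i
        del clusters[j]
    return clusters
```

Stopping at a target number of clusters is one common criterion; recording the merge order instead yields the full nested hierarchy (dendrogram).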
V.3 Clustering algorithms (9/9)
Other clustering algorithms
Minimal spanning tree
Nearest-neighbor clustering
Buckshot algorithm
V.4 Clustering of textual data (1/6)
Representation of the text clustering problem
Objects are very complex, with rich internal structure; documents must be converted into vectors in the feature space.
Bag-of-words document representation.
Reducing the dimensionality
Local method: delete unimportant components from individual document vectors.
Global method: latent semantic indexing (LSI).
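The bag-of-words conversion mentioned above can be sketched as follows: build a shared vocabulary, then represent each document by its word counts over that vocabulary (word order and structure are deliberately discarded; the whitespace tokenization is a simplifying assumption):

```python
from collections import Counter

def bag_of_words(documents):
    """Convert raw documents to count vectors over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    # Sorted vocabulary gives every document vector the same, stable dimensions.
    vocab = sorted(set(w for doc in tokenized for w in doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors
```

The resulting vectors live in a space with one dimension per vocabulary word, which is why the dimensionality-reduction methods above (component deletion, LSI) matter for text.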
V.4 Clustering of textual data (2/6)
Latent semantic indexing
Maps the N-dimensional feature space F onto a lower-dimensional subspace V.
LSI is based upon applying the singular value decomposition (SVD) to the term-document matrix.
V.4 Clustering of textual data (3/6)
Singular value decomposition (SVD): A = UDV^T
U: column-orthonormal m×r matrix
D: diagonal r×r matrix whose diagonal elements are the singular values of A
V: column-orthonormal n×r matrix
U^T U = V^T V = I
Dimension reduction: keep only the largest singular values.
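To make the decomposition concrete without relying on a linear-algebra library, here is a pure-Python sketch of one-sided Jacobi SVD for small matrices: pairs of columns are rotated until they are mutually orthogonal, after which the column norms are the singular values of A. This is one classical way to compute an SVD, chosen here only because it fits in a few lines; practical LSI uses optimized sparse SVD routines:

```python
import math

def svd_singular_values(A, sweeps=30):
    """One-sided Jacobi SVD sketch: returns the singular values of a small
    m x n matrix by orthogonalizing its columns with plane rotations."""
    m, n = len(A), len(A[0])
    U = [row[:] for row in A]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(U[i][p] * U[i][p] for i in range(m))
                beta = sum(U[i][q] * U[i][q] for i in range(m))
                gamma = sum(U[i][p] * U[i][q] for i in range(m))
                if abs(gamma) < 1e-12:
                    continue   # columns p and q already orthogonal
                # Rotation angle that zeroes the inner product of columns p, q.
                zeta = (beta - alpha) / (2 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1 + zeta * zeta))
                c = 1 / math.sqrt(1 + t * t)
                s = c * t
                for i in range(m):
                    up, uq = U[i][p], U[i][q]
                    U[i][p], U[i][q] = c * up - s * uq, s * up + c * uq
    # Norms of the orthogonalized columns are the singular values.
    return [math.sqrt(sum(U[i][j] ** 2 for i in range(m))) for j in range(n)]
```

For LSI, one would then keep only the k largest singular values (and the corresponding columns of U and V) to project documents into the k-dimensional subspace.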
V.4 Clustering of textual data (4/6)
Medoids: actual documents that are most similar to the centroids.
Using Naïve Bayes mixture models with the EM clustering algorithm.
V.4 Clustering of textual data (5/6)
Data abstraction in text clustering: generating a meaningful and concise description of each cluster.
Methods of generating the label automatically:
the title of the medoid document;
several words common to the cluster's documents;
a distinctive noun phrase.
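The second labeling method above (words common to the cluster's documents) can be sketched by ranking words by how many of the cluster's documents contain them. The function name and tokenization are illustrative assumptions:

```python
from collections import Counter

def label_cluster(cluster_docs, num_words=3):
    """Label a cluster with the words shared by the most of its documents
    (document frequency within the cluster, not raw term counts)."""
    counts = Counter()
    for doc in cluster_docs:
        counts.update(set(doc.lower().split()))   # count each word once per document
    return [w for w, _ in counts.most_common(num_words)]
```

Title-of-medoid and noun-phrase labeling would replace this ranking with the medoid's title or with a shallow syntactic pass, respectively.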
V.4 Clustering of textual data (6/6)
Evaluation of text clustering: how good is the result?
Purity: assume {L1, L2, ..., Ln} are the manually labeled classes of documents and {C1, C2, ..., Cm} are the clusters returned by the clustering process; the purity of a cluster Ci is the fraction of its documents that belong to its majority class, max_j |Ci ∩ Lj| / |Ci|.
Entropy and mutual information between classes and clusters can also be used.
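A minimal sketch of the purity measure: each cluster votes for its majority manual class, and overall purity is the fraction of all documents that agree with their cluster's majority class. The dictionary-based interface (object id to cluster id / class id) is an assumption for illustration:

```python
from collections import Counter, defaultdict

def purity(cluster_of, class_of):
    """Overall purity of a clustering against manually labeled classes.
    cluster_of: object id -> cluster id; class_of: object id -> class id."""
    members = defaultdict(list)
    for obj, c in cluster_of.items():
        members[c].append(class_of[obj])
    # For each cluster, count the documents in its single most common class.
    majority_total = sum(Counter(classes).most_common(1)[0][1]
                         for classes in members.values())
    return majority_total / len(cluster_of)
```

Purity is easy to compute but rewards many tiny clusters (singletons are trivially pure), which is why entropy and mutual information are used alongside it.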