V. Clustering
2007.2.10, Artificial Intelligence Lab, Seunghee Lee
Text: Text Mining, pp. 82-93
Outline
V.1 Clustering tasks in text analysis
V.2 The general clustering problem
V.3 Clustering algorithms
V.4 Clustering of textual data
Clustering
An unsupervised process through which objects are classified into groups called clusters. (Cf. categorization, which is a supervised process.)
Applications: data mining, document retrieval, image segmentation, pattern classification.
V.1 Clustering tasks in text analysis (1/2)
Cluster hypothesis: "Relevant documents tend to be more similar to each other than to nonrelevant ones."
If the cluster hypothesis holds for a particular document collection, then clustering the documents may help to improve search effectiveness.
Improving search recall: when a query matches a document, the document's whole cluster can be returned.
Improving search precision: by grouping the documents into a much smaller number of groups of related documents.
V.1 Clustering tasks in text analysis (2/2)
Scatter/gather browsing method
Purpose: to enhance the efficiency of human browsing of a document collection when a specific search query cannot be formulated.
Session 1: the document collection is scattered into a set of clusters.
Session 2: the selected clusters are gathered into a new subcollection, with which the process may be repeated.
Reference: http://www2.parc.com/istl/projects/ia/sg-background.html
Query-specific clustering is also possible; here hierarchical clustering is appealing.
V.2 The clustering problem (1/2)
Clustering tasks: problem representation, definition of proximity measures, actual clustering of objects, data abstraction, evaluation.
Problem representation
Basically an optimization problem. Goal: select the best among all possible groupings of objects, as judged by a similarity function (a clustering quality function).
Feature extraction / feature selection
In a vector space model, objects are vectors in a high-dimensional feature space, and the similarity function is based on the distance between the vectors in some metric.
V.2 The clustering problem (2/2)
Similarity measures
Euclidean distance
Cosine similarity (the most common measure for text)
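The two measures above can be sketched in a few lines of plain Python, treating documents as lists of feature weights (a minimal illustration, not tied to any particular library):

```python
import math

def euclidean_distance(u, v):
    # Straight-line distance between two vectors; smaller means more similar.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    # Dot product divided by the product of vector norms; 1.0 means the
    # vectors point in the same direction, 0.0 means they are orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Cosine similarity is preferred for text because it ignores document length: a document and the same document repeated twice get similarity 1.0, while their Euclidean distance is large.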
V.3 Clustering algorithms (1/9)
Flat clustering: a single partition of a set of objects into disjoint groups.
Hierarchical clustering: a nested series of partitions.
Hard clustering: every object belongs to exactly one cluster.
Soft clustering: objects may belong to several clusters, with a fractional degree of membership in each.
V.3 Clustering algorithms (2/9)
Agglomerative algorithms: begin with each object in a separate cluster and successively merge clusters until a stopping criterion is satisfied.
Divisive algorithms: begin with a single cluster containing all objects and perform splitting until a stopping criterion is satisfied.
Shuffling algorithms: iteratively redistribute objects among clusters.
V.3 Clustering algorithms (3/9)
The k-means algorithm (1/2)
A hard, flat, shuffling algorithm.
V.3 Clustering algorithms (4/9)
Example of the k-means algorithm.
V.3 Clustering algorithms (5/9)
The k-means algorithm (2/2)
Simple and efficient; complexity O(kn) per iteration.
A bad initial selection of seeds can lead to a local optimum; this suboptimality of k-means motivates the Buckshot algorithm.
The ISO-DATA algorithm maximizes a clustering quality function Q (e.g., the total similarity of objects to their cluster centroids).
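The k-means procedure described above can be sketched as follows: seeds are drawn at random, then the assignment and centroid-update steps alternate. This is a minimal pure-Python sketch (the parameter names and the fixed iteration count are illustrative choices, not from the slides):

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Hard, flat, shuffling clustering: alternate assignment and update steps."""
    rng = random.Random(seed)
    # Initial seeds; a bad draw here is what can trap k-means in a local optimum.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (O(kn) work).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, clusters
```

Each iteration costs O(kn) distance computations, matching the complexity quoted above.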
V.3 Clustering algorithms (6/9)
EM-based probabilistic clustering algorithm (1/2)
A soft, flat, probabilistic algorithm.
V.3 Clustering algorithms (7/9)
EM-based probabilistic clustering algorithm (2/2)
V.3 Clustering algorithms (8/9)
Hierarchical agglomerative clustering (HAC)
Single-link method
Complete-link method
Average-link method
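The three linkage methods differ only in how the distance between two clusters is measured. A minimal sketch of the single-link variant (which uses the minimum pairwise distance; complete-link would use `max` and average-link the mean):

```python
def single_link_hac(points, num_clusters):
    """Agglomerative clustering: start with singletons, repeatedly merge the
    closest pair of clusters under the single-link (minimum-distance) criterion."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])   # merge cluster j into cluster i
        del clusters[j]
    return clusters
```

Stopping at a target number of clusters is one common criterion; recording the merge order instead yields the full nested hierarchy (dendrogram).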
V.3 Clustering algorithms (9/9)
Other clustering algorithms
Minimal spanning tree
Nearest-neighbor clustering
Buckshot algorithm
V.4 Clustering of textual data (1/6)
Representation of the text clustering problem
Objects are very complex, with rich internal structure; documents must be converted into vectors in the feature space.
Bag-of-words document representation.
Reducing the dimensionality
Local method: delete unimportant components from individual document vectors.
Global method: latent semantic indexing (LSI).
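The bag-of-words conversion mentioned above can be sketched as follows: build a shared vocabulary, then represent each document by its word counts over that vocabulary (word order and structure are deliberately discarded; the whitespace tokenization is a simplifying assumption):

```python
from collections import Counter

def bag_of_words(documents):
    """Convert raw documents to count vectors over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    # Sorted vocabulary gives every document vector the same, stable dimensions.
    vocab = sorted(set(w for doc in tokenized for w in doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors
```

The resulting vectors live in a space with one dimension per vocabulary word, which is why the dimensionality-reduction methods above (component deletion, LSI) matter for text.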
V.4 Clustering of textual data (2/6)
Latent semantic indexing
Maps the N-dimensional feature space F onto a lower-dimensional subspace V.
LSI is based upon applying the singular value decomposition (SVD) to the term-document matrix.
V.4 Clustering of textual data (3/6)
Singular value decomposition (SVD): A = UDV^T
U: column-orthonormal m×r matrix
D: diagonal r×r matrix whose diagonal elements are the singular values of A
V: column-orthonormal n×r matrix
U^T U = V^T V = I
Dimension reduction: keep only the largest singular values.
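To make the decomposition concrete without relying on a linear-algebra library, here is a pure-Python sketch of one-sided Jacobi SVD for small matrices: pairs of columns are rotated until they are mutually orthogonal, after which the column norms are the singular values of A. This is one classical way to compute an SVD, chosen here only because it fits in a few lines; practical LSI uses optimized sparse SVD routines:

```python
import math

def svd_singular_values(A, sweeps=30):
    """One-sided Jacobi SVD sketch: returns the singular values of a small
    m x n matrix by orthogonalizing its columns with plane rotations."""
    m, n = len(A), len(A[0])
    U = [row[:] for row in A]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(U[i][p] * U[i][p] for i in range(m))
                beta = sum(U[i][q] * U[i][q] for i in range(m))
                gamma = sum(U[i][p] * U[i][q] for i in range(m))
                if abs(gamma) < 1e-12:
                    continue   # columns p and q already orthogonal
                # Rotation angle that zeroes the inner product of columns p, q.
                zeta = (beta - alpha) / (2 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1 + zeta * zeta))
                c = 1 / math.sqrt(1 + t * t)
                s = c * t
                for i in range(m):
                    up, uq = U[i][p], U[i][q]
                    U[i][p], U[i][q] = c * up - s * uq, s * up + c * uq
    # Norms of the orthogonalized columns are the singular values.
    return [math.sqrt(sum(U[i][j] ** 2 for i in range(m))) for j in range(n)]
```

For LSI, one would then keep only the k largest singular values (and the corresponding columns of U and V) to project documents into the k-dimensional subspace.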
V.4 Clustering of textual data (4/6)
Medoids: actual documents that are most similar to the centroids.
Using Naïve Bayes mixture models with the EM clustering algorithm.
V.4 Clustering of textual data (5/6)
Data abstraction in text clustering: generating a meaningful and concise description of each cluster.
Methods of generating the label automatically:
the title of the medoid document;
several words common to the cluster's documents;
a distinctive noun phrase.
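The second labeling method above (words common to the cluster's documents) can be sketched by ranking words by how many of the cluster's documents contain them. The function name and tokenization are illustrative assumptions:

```python
from collections import Counter

def label_cluster(cluster_docs, num_words=3):
    """Label a cluster with the words shared by the most of its documents
    (document frequency within the cluster, not raw term counts)."""
    counts = Counter()
    for doc in cluster_docs:
        counts.update(set(doc.lower().split()))   # count each word once per document
    return [w for w, _ in counts.most_common(num_words)]
```

Title-of-medoid and noun-phrase labeling would replace this ranking with the medoid's title or with a shallow syntactic pass, respectively.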
V.4 Clustering of textual data (6/6)
Evaluation of text clustering: how good is the result?
Purity: assume {L1, L2, ..., Ln} are the manually labeled classes of documents and {C1, C2, ..., Cm} are the clusters returned by the clustering process; the purity of a cluster Ci is the fraction of its documents that belong to its majority class, max_j |Ci ∩ Lj| / |Ci|.
Entropy and mutual information between classes and clusters can also be used.
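A minimal sketch of the purity measure: each cluster votes for its majority manual class, and overall purity is the fraction of all documents that agree with their cluster's majority class. The dictionary-based interface (object id to cluster id / class id) is an assumption for illustration:

```python
from collections import Counter, defaultdict

def purity(cluster_of, class_of):
    """Overall purity of a clustering against manually labeled classes.
    cluster_of: object id -> cluster id; class_of: object id -> class id."""
    members = defaultdict(list)
    for obj, c in cluster_of.items():
        members[c].append(class_of[obj])
    # For each cluster, count the documents in its single most common class.
    majority_total = sum(Counter(classes).most_common(1)[0][1]
                         for classes in members.values())
    return majority_total / len(cluster_of)
```

Purity is easy to compute but rewards many tiny clusters (singletons are trivially pure), which is why entropy and mutual information are used alongside it.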