Published by Javion Aborn. Modified over 2 years ago.

1
Albert Gatt
Corpora and Statistical Methods
Lecture 13

2
In this lecture
  Text categorisation:
    overview of clustering methods
    machine learning methods for text classification

3
Text classification
  Given:
    a set of documents
    a set of categories
  Task: sort documents by category
  Examples:
    sort news text by topic (POLITICS, SPORT, etc.)
    sort email into SPAM/NON-SPAM
    classify documents by author

4
Setup
  Typical setup:
    identify relevant features of the documents
      individual words
      n-grams (e.g. bigrams)
      …
    learn a model to classify a document
      Naïve Bayes method
      maximum entropy
      language models
      …
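The feature-identification step can be illustrated with a minimal sketch. It assumes whitespace tokenisation and lower-casing, neither of which the slides specify; the function name `extract_features` and the example sentence are hypothetical.

```python
from collections import Counter

def extract_features(text, n=2):
    """Count individual words and n-grams in a whitespace-tokenised text."""
    tokens = text.lower().split()  # assumption: naive tokenisation
    features = Counter(tokens)     # individual words
    # n-grams (bigrams by default) are stored as tuples of tokens
    ngrams = zip(*(tokens[i:] for i in range(n)))
    features.update(ngrams)
    return features

doc = "the match ended and the match report followed"
feats = extract_features(doc)
print(feats["the"], feats[("the", "match")])
```

The resulting counts are the kind of document representation a classifier or clustering method would then consume.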

5
Supervised vs unsupervised
  (cf. the un/supervised distinction for Word Sense Disambiguation; lecture 6)
  Supervised learning:
    training data is labeled
    several methods available (naïve Bayes, etc.)
  Unsupervised learning:
    training data is unlabeled
    document classes have to be “discovered”
    possible method: clustering

6
Clustering documents Part 1

7
Clustering
  Flat/non-hierarchical
    just sets of related documents
    no relationship between clusters
    very efficient algorithms exist, e.g. k-means clustering
  Hierarchical
    related documents grouped in a tree (dendrogram)
    tree branches indicate similarity (resp. distance)
    less efficient than non-hierarchical clustering: n documents need on the order of n × n similarity computations
    but more informative

8
Soft vs hard clusters
  Hard clustering:
    each document belongs to exactly 1 class
    hierarchical methods are usually hard
  Soft clustering:
    allow degrees of membership, e.g. p(c1|d1) > p(c2|d1)
    i.e. d1 belongs to c1 to a greater degree than to c2
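As a small illustration of the distinction, a soft clustering can be reduced to a hard one by assigning each document to its most probable cluster. The membership values below are invented for the example, and `harden` is a hypothetical helper name.

```python
def harden(memberships):
    """Map each document to the cluster with its highest membership degree."""
    return {doc: max(probs, key=probs.get) for doc, probs in memberships.items()}

# Soft clustering: p(c|d) for each document (made-up values)
soft = {
    "d1": {"c1": 0.7, "c2": 0.3},  # p(c1|d1) > p(c2|d1)
    "d2": {"c1": 0.2, "c2": 0.8},
}
hard = harden(soft)
print(hard)
```

The reverse direction is not possible: a hard assignment carries no information about degrees of membership.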

9
Similarity & monotonicity
  All hierarchical methods require a similarity metric:
    similarity is computed between individual documents and between clusters
    a vector-space representation for documents with cosine similarity is a common technique
  The similarity metric needs to be monotonic:
    i.e. we expect merging not to increase similarity
    otherwise, when we merge 2 clusters, their similarity to a third cluster might increase
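Cosine similarity over a vector-space representation can be sketched as follows. The sketch assumes documents are represented as sparse term-count vectors (here `Counter` objects over whitespace tokens); the three example documents are invented.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)

d1 = Counter("the cat sat".split())
d2 = Counter("the cat ran".split())
d3 = Counter("stock markets fell".split())

print(cosine(d1, d2))  # two of three terms shared
print(cosine(d1, d3))  # no terms shared
```

With non-negative counts the value lies between 0 (no shared terms) and 1 (identical direction).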

10
Agglomerative clustering algorithm
  Given:
    D = {d_1, …, d_n} (the documents)
    a similarity metric
  1. Initialise clusters C = {c_1, …, c_n} for {d_1, …, d_n}
  2. j := n+1
  3. do until |C| = 1
     a. find the most similar pair (c, c') in C
     b. create a new cluster c_j = c ∪ c'
     c. remove c, c' from C
     d. add c_j to C
     e. j := j+1
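The algorithm can be sketched in a few lines of Python. This is an illustrative version rather than the lecture's own code: it records merges in a list instead of numbering clusters with j, it defaults to a single-link cluster similarity, and the toy data (five documents placed on a line, with similarity defined as negative distance) is an assumption chosen so that the merge order matches a D1 to D5 example.

```python
from itertools import combinations

def agglomerative(docs, sim, linkage=max):
    """Greedy bottom-up clustering; returns the merged clusters in merge order."""
    clusters = [frozenset([d]) for d in docs]  # step 1: one cluster per document

    def cluster_sim(a, b):
        # linkage=max gives single link; linkage=min gives complete link
        return linkage(sim(x, y) for x in a for y in b)

    merges = []
    while len(clusters) > 1:                   # step 3: until one cluster is left
        a, b = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        clusters.remove(a)                     # 3c: remove c, c'
        clusters.remove(b)
        clusters.append(a | b)                 # 3b, 3d: add the union
        merges.append(a | b)
    return merges

# Hypothetical toy data: positions on a line, similarity = negative distance
positions = {"D1": 0.0, "D2": 1.0, "D3": 6.0, "D4": 8.0, "D5": 9.5}
sim = lambda x, y: -abs(positions[x] - positions[y])

merges = agglomerative(list(positions), sim)
for m in merges:
    print(sorted(m))
```

Each pass recomputes all pairwise cluster similarities, which keeps the sketch short but is not how an efficient implementation would do it.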

11
Agglomerative clustering - walkthrough
  Start with separate clusters for each document
  [Figure: dendrogram over D1, D2, D3, D4, D5]

12
Agglomerative clustering - walkthrough
  D1 and D2 are most similar
  [Figure: dendrogram over D1, D2, D3, D4, D5]

13
Agglomerative clustering - walkthrough
  D4 and D5 are most similar
  [Figure: dendrogram over D1, D2, D3, D4, D5]

14
Agglomerative clustering - walkthrough
  D3 and {D4,D5} are most similar
  [Figure: dendrogram over D1, D2, D3, D4, D5]

15
Agglomerative clustering - walkthrough
  Final step: merge last two clusters
  [Figure: dendrogram over D1, D2, D3, D4, D5]

16
Merging: single link strategy
  Similarity of two clusters = similarity of the two most similar members.
  Pro: good local coherence (high pairwise similarity)
  Con: “elongated” clusters (bad global coherence)
  [Figure: example with points c1 to c8; axis label: sim]

17
Merging: complete link strategy
  Similarity of two clusters = similarity of the two most dissimilar members.
  Better global coherence.
  [Figure: example with points c1 to c8; axis label: sim]
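The single-link and complete-link definitions differ only in whether the best or the worst cross-cluster pair is used. A minimal sketch, assuming clusters are lists of numbers on a line and similarity is negative distance (both are illustrative choices, not part of the slides):

```python
def single_link(a, b, sim):
    """Cluster similarity = similarity of the two MOST similar members."""
    return max(sim(x, y) for x in a for y in b)

def complete_link(a, b, sim):
    """Cluster similarity = similarity of the two MOST dissimilar members."""
    return min(sim(x, y) for x in a for y in b)

sim = lambda x, y: -abs(x - y)  # hypothetical similarity: negative distance

a = [0, 1, 2]
b = [3, 10]

print(single_link(a, b, sim))    # decided by the closest pair (2, 3)
print(complete_link(a, b, sim))  # decided by the farthest pair (0, 10)
```

Under single link these two clusters look close because one pair of members is adjacent, which is how elongated clusters arise; complete link penalises the distant member 10.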

18
Group average strategy
  Similarity of two clusters = average pairwise similarity.
  Compromise between local & global coherence.
  When using a vector-space representation with cosine similarity, the average similarity of a merged cluster C = C1 ∪ C2 can be computed from information already stored for its children C1 & C2 (e.g. their summed document vectors).
  This is much more efficient than computing the average pairwise similarity between all document pairs in C1 × C2!
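One standard way to get this efficiency, sketched here under the assumption that every document vector is normalised to unit length (so that cosine similarity is a plain dot product), is to store each cluster's sum vector s. For a cluster of size n, s·s equals n self-similarities of 1 plus twice the sum over distinct pairs, so the average pairwise similarity is (s·s - n) / (n(n - 1)). The vectors and helper names below are invented for illustration.

```python
import math
from itertools import combinations

def normalise(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def vec_sum(cluster):
    """Component-wise sum of all vectors in a cluster."""
    return [sum(col) for col in zip(*cluster)]

def avg_sim_bruteforce(cluster):
    """Average similarity over all distinct document pairs."""
    pairs = list(combinations(cluster, 2))
    return sum(dot(u, v) for u, v in pairs) / len(pairs)

def avg_sim_from_sum(sum_vec, size):
    """Same quantity from the sum vector alone (unit-length vectors assumed)."""
    return (dot(sum_vec, sum_vec) - size) / (size * (size - 1))

c1 = [normalise(v) for v in ([1, 0, 1], [2, 1, 0])]
c2 = [normalise(v) for v in ([0, 1, 1], [1, 1, 1], [0, 3, 1])]

# The merged cluster's sum vector is just the sum of the children's sum vectors
merged_sum = [a + b for a, b in zip(vec_sum(c1), vec_sum(c2))]
fast = avg_sim_from_sum(merged_sum, len(c1) + len(c2))
slow = avg_sim_bruteforce(c1 + c2)
print(fast, slow)
```

The fast version needs one vector addition and one dot product per candidate merge, whereas the brute-force version grows quadratically with cluster size.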

19
Divisive clustering
  a kind of top-down hierarchical clustering
  also a greedy algorithm
  1. start with a single cluster representing all documents
  2. iteratively divide clusters
     split the cluster which is least coherent (the cluster whose elements are least similar to each other)
     to split a cluster C into {C1, C2}, one can run agglomerative clustering over the elements of C!
     therefore, computationally more expensive than the pure agglomerative method
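A rough sketch of the divisive procedure follows. Several details are assumptions, since the slides leave them open: coherence is measured as average pairwise similarity, the split runs single-link agglomerative clustering until two clusters remain, and the loop stops at a target number of clusters k. The toy data (positions on a line, similarity as negative distance) is invented.

```python
from itertools import combinations

def coherence(cluster, sim):
    """Average pairwise similarity; singletons count as perfectly coherent."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return float("inf")
    return sum(sim(x, y) for x, y in pairs) / len(pairs)

def split_in_two(cluster, sim):
    """Run single-link agglomerative clustering until two clusters remain."""
    parts = [frozenset([d]) for d in cluster]
    while len(parts) > 2:
        a, b = max(combinations(parts, 2),
                   key=lambda p: max(sim(x, y) for x in p[0] for y in p[1]))
        parts.remove(a)
        parts.remove(b)
        parts.append(a | b)
    return parts

def divisive(docs, sim, k):
    """Top-down clustering: repeatedly split the least coherent cluster."""
    clusters = [frozenset(docs)]          # 1. one cluster with all documents
    while len(clusters) < k:              # assumption: stop at k clusters
        worst = min(clusters, key=lambda c: coherence(c, sim))
        if len(worst) < 2:
            break                         # only singletons left
        clusters.remove(worst)
        clusters.extend(split_in_two(worst, sim))
    return clusters

positions = {"D1": 0.0, "D2": 1.0, "D3": 6.0, "D4": 8.0, "D5": 9.5}
sim = lambda x, y: -abs(positions[x] - positions[y])

result = divisive(list(positions), sim, k=3)
print([sorted(c) for c in result])
```

Because every split runs a full agglomerative pass over the cluster being divided, the extra cost relative to the purely agglomerative method is visible directly in the nested loops.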
