
1 Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

2 Roadmap Clustering Motivation & Applications Clustering Approaches Evaluation

3 Clustering Task: Given a set of objects, create a set of clusters over those objects Applications:

4 Clustering Task: Given a set of objects, create a set of clusters over those objects Applications: Exploratory data analysis Document clustering Language modeling Generalization for class-based LMs Unsupervised Word Sense Disambiguation Automatic thesaurus creation Unsupervised Part-of-Speech Tagging Speaker clustering, …

5 Example: Document Clustering Input: Set of individual documents Output: Sets of document clusters Many different types of clustering:

6 Example: Document Clustering Input: Set of individual documents Output: Sets of document clusters Many different types of clustering: Category: news, sports, weather, entertainment

7 Example: Document Clustering Input: Set of individual documents Output: Sets of document clusters Many different types of clustering: Category: news, sports, weather, entertainment Genre clustering: Similar styles: blogs, tweets, newswire

8 Example: Document Clustering Input: Set of individual documents Output: Sets of document clusters Many different types of clustering: Category: news, sports, weather, entertainment Genre clustering: Similar styles: blogs, tweets, newswire Author clustering

9 Example: Document Clustering Input: Set of individual documents Output: Sets of document clusters Many different types of clustering: Category: news, sports, weather, entertainment Genre clustering: Similar styles: blogs, tweets, newswire Author clustering Language ID: language clusters

10 Example: Document Clustering Input: Set of individual documents Output: Sets of document clusters Many different types of clustering: Category: news, sports, weather, entertainment Genre clustering: Similar styles: blogs, tweets, newswire Author clustering Language ID: language clusters Topic clustering: documents on the same topic OWS, debt supercommittee, Seattle Marathon, Black Friday..

11 Example: Word Clustering Input: Words Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats Output: Word clusters

12 Example: Word Clustering Input: Words Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats Output: Word clusters Example clusters:

13 Example: Word Clustering Input: Words Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats Output: Word clusters Example clusters: (from NYT) ballot, polls, Gov, seats profit, finance, payments NFL, Reds, Sox, inning, quarterback, scored, score researchers, science Scott, Mary, Barbara, Edward

14 Questions What should a cluster represent? Due to F. Xia

15 Questions What should a cluster represent? Similarity among objects How can we create clusters? Due to F. Xia

16 Questions What should a cluster represent? Similarity among objects How can we create clusters? How can we evaluate clusters? Due to F. Xia

17 Questions What should a cluster represent? Similarity among objects How can we create clusters? How can we evaluate clusters? How can we improve NLP with clustering? Due to F. Xia

18 Similarity Between two instances

19 Similarity Between two instances Between an instance and a cluster

20 Similarity Between two instances Between an instance and a cluster Between clusters

21 Similarity Measures Given x = (x1, x2, …, xn) and y = (y1, y2, …, yn)

22 Similarity Measures Given x = (x1, x2, …, xn) and y = (y1, y2, …, yn) Euclidean distance:

23 Similarity Measures Given x = (x1, x2, …, xn) and y = (y1, y2, …, yn) Euclidean distance: Manhattan distance:

24 Similarity Measures Given x = (x1, x2, …, xn) and y = (y1, y2, …, yn) Euclidean distance: sqrt(Σi (xi − yi)²) Manhattan distance: Σi |xi − yi| Cosine similarity: (x · y) / (‖x‖ ‖y‖)
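A minimal sketch of these three measures in Python (illustrative helper names; vectors are assumed to be equal-length sequences of numbers):

    import math

    def euclidean(x, y):
        # square root of the sum of squared coordinate differences
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def manhattan(x, y):
        # sum of absolute coordinate differences
        return sum(abs(xi - yi) for xi, yi in zip(x, y))

    def cosine(x, y):
        # dot product normalized by the two vector lengths
        dot = sum(xi * yi for xi, yi in zip(x, y))
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        norm_y = math.sqrt(sum(yi * yi for yi in y))
        return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

Note that Euclidean and Manhattan are distances (smaller means more similar), while cosine is a similarity (larger means more similar).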

25 Clustering Algorithms

26 Types of Clustering Flat vs Hierarchical Clustering: Flat: partition data into k clusters

27 Types of Clustering Flat vs Hierarchical Clustering: Flat: partition data into k clusters Hierarchical: Nodes form hierarchy

28 Types of Clustering Flat vs Hierarchical Clustering: Flat: partition data into k clusters Hierarchical: Nodes form hierarchy Hard vs Soft Clustering Hard: Each object assigned to exactly one cluster

29 Types of Clustering Flat vs Hierarchical Clustering: Flat: partition data into k clusters Hierarchical: Nodes form hierarchy Hard vs Soft Clustering Hard: Each object assigned to exactly one cluster Soft: Allows degrees of membership and membership in more than one cluster Often probability distribution over cluster membership

30 Hierarchical Clustering

31 Hierarchical Vs. Flat Hierarchical clustering:

32 Hierarchical Vs. Flat Hierarchical clustering: More informative Good for data exploration Many algorithms, none good for all data Computationally expensive

33 Hierarchical Vs. Flat Hierarchical clustering: More informative Good for data exploration Many algorithms, none good for all data Computationally expensive Flat clustering:

34 Hierarchical Vs. Flat Hierarchical clustering: More informative Good for data exploration Many algorithms, none good for all data Computationally expensive Flat clustering: Fairly efficient Simple baseline algorithm: K-means Probabilistic models use EM algorithm

35 Clustering Algorithms Flat clustering: K-means clustering K-medoids clustering Hierarchical clustering: Greedy, bottom-up clustering

36 K-Means Clustering Initialize: Randomly select k initial centroids

37 K-Means Clustering Initialize: Randomly select k initial centroids Center (mean) of cluster Iterate until clusters stop changing

38 K-Means Clustering Initialize: Randomly select k initial centroids Center (mean) of cluster Iterate until clusters stop changing Assign each instance to the nearest cluster Cluster is nearest if cluster centroid is nearest

39 K-Means Clustering Initialize: Randomly select k initial centroids Center (mean) of cluster Iterate until clusters stop changing Assign each instance to the nearest cluster Cluster is nearest if cluster centroid is nearest Recompute cluster centroids Mean of instances in the cluster
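A rough sketch of this loop, assuming dense numeric vectors and Euclidean distance (illustrative only, not the assignment code):

    import math
    import random

    def kmeans(points, k, max_iters=100):
        # points: list of equal-length numeric tuples
        centroids = random.sample(points, k)                # initialize: k random instances as centroids
        for _ in range(max_iters):
            # assignment step: each instance joins the cluster with the nearest centroid
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
                clusters[nearest].append(p)
            # update step: each centroid becomes the mean of its cluster
            new_centroids = [
                tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
                for i, c in enumerate(clusters)
            ]
            if new_centroids == centroids:                   # clusters stopped changing
                break
            centroids = new_centroids
        return clusters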

40 K-Means: 1 step

41 K-Means Running time:

42 K-Means Running time: O(kn) per iteration, where n is the number of instances and k the number of clusters Converges in a finite number of steps Issues:

43 K-Means Running time: O(kn) per iteration, where n is the number of instances and k the number of clusters Converges in a finite number of steps Issues: Need to pick # of clusters k Can find only a local optimum Sensitive to outliers Requires Euclidean distance: what about enumerable classes (e.g., colors)?

44 Medoid Medoid: Element in cluster with highest average similarity to other elements in cluster

45 Medoid Medoid: Element in cluster with highest average similarity to other elements in cluster Finding the medoid: For each element p, compute f(p), its average similarity to the other elements in the cluster

46 Medoid Medoid: Element in cluster with highest average similarity to other elements in cluster Finding the medoid: For each element p, compute f(p), its average similarity to the other elements in the cluster Select the element with the highest f(p)
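One possible way to code the medoid search (a sketch; sim can be any similarity function, e.g. cosine):

    def find_medoid(cluster, sim):
        # f(p): average similarity of element p to the other elements in the cluster
        def f(i):
            others = [q for j, q in enumerate(cluster) if j != i]
            return sum(sim(cluster[i], q) for q in others) / len(others) if others else 0.0
        # the medoid is the element whose average similarity to the rest is highest
        return cluster[max(range(len(cluster)), key=f)]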

47 K-Medoids Initialize: Select k instances at random as medoids

48 K-Medoids Initialize: Select k instances at random as medoids Iterate until no changes Assign instances to cluster with nearest medoid

49 K-Medoids Initialize: Select k instances at random as medoids Iterate until no changes Assign instances to cluster with nearest medoid Recompute medoid for each cluster
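Putting the steps together, a rough K-medoids sketch that reuses the find_medoid helper sketched above (the initialization and similarity measure actually required by the assignment are specified later):

    def k_medoids(instances, k, sim, max_iters=100):
        medoids = instances[:k]                       # any k instances can seed the clusters
        for _ in range(max_iters):
            # assign each instance to the cluster with the most similar medoid
            clusters = [[] for _ in range(k)]
            for x in instances:
                best = max(range(k), key=lambda i: sim(x, medoids[i]))
                clusters[best].append(x)
            # recompute each cluster's medoid
            new_medoids = [find_medoid(c, sim) if c else medoids[i]
                           for i, c in enumerate(clusters)]
            if new_medoids == medoids:                # no change: stop
                break
            medoids = new_medoids
        return medoids, clusters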

50 Greedy, Bottom-Up Hierarchical Clustering Initialize: Make an individual cluster for each instance

51 Greedy, Bottom-Up Hierarchical Clustering Initialize: Make an individual cluster for each instance Iterate until all instances in same cluster Merge two most similar clusters
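A naive sketch of this bottom-up loop, using average pairwise similarity between clusters (one of several possible linkage choices; far from optimized, illustrative only):

    def agglomerative(instances, sim):
        # start with one singleton cluster per instance
        clusters = [[x] for x in instances]
        merges = []
        while len(clusters) > 1:
            # find the most similar pair of clusters (average-link similarity)
            best_pair, best_score = None, float("-inf")
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    score = (sum(sim(a, b) for a in clusters[i] for b in clusters[j])
                             / (len(clusters[i]) * len(clusters[j])))
                    if score > best_score:
                        best_pair, best_score = (i, j), score
            i, j = best_pair
            merges.append((list(clusters[i]), list(clusters[j])))   # record the hierarchy
            clusters[i] = clusters[i] + clusters[j]                 # merge the two clusters
            del clusters[j]
        return merges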

52 Evaluation

53 With respect to gold standard Accuracy For each cluster, assign most common label to all items Rand index F-measure Alternatives:

54 Evaluation With respect to gold standard Accuracy For each cluster, assign most common label to all items Rand index F-measure Alternatives: Extrinsic evaluation

55 Evaluation With respect to gold standard Accuracy For each cluster, assign most common label to all items Rand index F-measure Alternatives: Extrinsic evaluation Human inspection

56 Configuration Given Set of objects O = {o1, o2, …, on}

57 Configuration Given Set of objects O = {o1, o2, …, on} Partition X = {x1, …, xr} Partition Y = {y1, …, ys}

58 Configuration Given Set of objects O = {o1, o2, …, on} Partition X = {x1, …, xr} Partition Y = {y1, …, ys} Counts over pairs of objects:
                         In same set in X   In different sets in X
In same set in Y                a                    d
In different sets in Y          c                    b

59 Rand Index Measure of cluster similarity (Rand, 1971) No agreement? (Pair counts a, b, c, d as in the table above.)

60 Rand Index Measure of cluster similarity (Rand, 1971) No agreement? 0; Full agreement? (Pair counts a, b, c, d as in the table above.)

61 Rand Index Measure of cluster similarity (Rand, 1971) Rand index = (a + b) / (a + b + c + d) No agreement? 0; Full agreement? 1
                         In same set in X   In different sets in X
In same set in Y                a                    d
In different sets in Y          c                    b
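A sketch of computing the Rand index directly from two partitions, each given as a list of clusters over the same objects:

    from itertools import combinations

    def rand_index(partition_x, partition_y):
        # map each object to its cluster id in each partition
        label_x = {o: i for i, cluster in enumerate(partition_x) for o in cluster}
        label_y = {o: i for i, cluster in enumerate(partition_y) for o in cluster}
        agree = total = 0
        for o1, o2 in combinations(label_x, 2):
            same_x = label_x[o1] == label_x[o2]
            same_y = label_y[o1] == label_y[o2]
            agree += (same_x == same_y)   # counts a (same/same) and b (different/different)
            total += 1                    # total pairs = a + b + c + d
        return agree / total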

62 Precision & Recall Assume X is the gold standard partition Assume Y is the system-generated partition

63 Precision & Recall Assume X is the gold standard partition Assume Y is the system-generated partition For each pair of items in a cluster in Y Correct if they appear together in a cluster in X

64 Precision & Recall Assume X is the gold standard partition Assume Y is the system-generated partition For each pair of items in a cluster in Y Correct if they appear together in a cluster in X Can compute P, R, and F-measure

65 HW #10 Due to F. Xia

66 HW #10 Unsupervised POS tagging: word clustering by neighboring-word co-occurrence Create feature vectors: Features: counts of adjacent word occurrences E.g., L=he:10 or R=run:3 Perform clustering: K-medoids algorithm (with cosine similarity) Evaluate clusters: cluster mapping + accuracy

67 Q1 create_vectors.* training_file word_file feat_file outfile
training_file: one sentence per line: w1 w2 w3 … wn
word_file: list of words to cluster (format: word freq)
feat_file: list of words to use as features (format: feat freq)
outfile: one list per word in word_file; format: word L=he 10 L=she 5 … R=gone 2 R=run 3 …

68 Features Features are of the form (L|R)=xx freq, where xx is a word in the feat_file, L or R is the position (left or right neighbor) where the feature appeared, and freq is the number of times word xx appeared in that position in the training file. E.g., suppose 'New York' appears 540 times in the corpus: York L=New 540 … R=New 0 …
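A rough sketch of the feature-counting step (illustrative only, not the assignment solution; file reading and output formatting are omitted, and the helper name is made up):

    from collections import Counter, defaultdict

    def count_features(sentences, target_words, feature_words):
        # counts[w]["L=x"]: how often feature word x appears immediately to the left of w
        counts = defaultdict(Counter)
        targets, feats = set(target_words), set(feature_words)
        for sent in sentences:                          # each sentence: a list of tokens
            for i, w in enumerate(sent):
                if w not in targets:
                    continue
                if i > 0 and sent[i - 1] in feats:
                    counts[w]["L=" + sent[i - 1]] += 1
                if i + 1 < len(sent) and sent[i + 1] in feats:
                    counts[w]["R=" + sent[i + 1]] += 1
        return counts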

69 Vector File One line per word in word_file Lines should be ordered by word_file Features should be sorted alphabetically by feature name E.g., L=an 3 L=the 10 … R=aqua 1 R=house 5 Feature sorting aids the cosine computation (see the sketch below)
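Why the sorting helps: with features in a fixed alphabetical order, cosine similarity can be computed in a single merge-style pass over two sparse vectors. A sketch, with each vector given as a sorted list of (feature, count) pairs:

    import math

    def sparse_cosine(xs, ys):
        # xs, ys: lists of (feature_name, count), each sorted by feature_name
        dot, i, j = 0.0, 0, 0
        while i < len(xs) and j < len(ys):
            if xs[i][0] == ys[j][0]:
                dot += xs[i][1] * ys[j][1]        # feature present in both vectors
                i += 1
                j += 1
            elif xs[i][0] < ys[j][0]:
                i += 1
            else:
                j += 1
        norm_x = math.sqrt(sum(v * v for _, v in xs))
        norm_y = math.sqrt(sum(v * v for _, v in ys))
        return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0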

70 Q2 k_medoids.* vector_file num_clusters sys_cluster_file
vector_file: created by Q1
num_clusters: number of clusters to create
sys_cluster_file: output representing the clustering of vectors; format: medoid w1 w2 w3 … wn, where medoid is the medoid representing the cluster and w1 … wn are the words in the cluster

71 Q2: K-Medoids Similarity measure: cosine similarity Initial medoids: medoid i is placed at a fixed instance index determined by N and C, where N is the number of words to cluster and C is the number of clusters

72 Mapping Sys to Gold: One-to-One Find the highest number in the matrix Remove the corresponding row and column Repeat until all rows are removed s1 => g2 10 s2 => g1 7 s3 => g3 6 acc = (10+7+6)/sum Due to F. Xia
        g1   g2   g3
s1       2   10    9
s2       7    4    2
s3       0    9    6
s4       5    0    3

73 Mapping Sys to Gold: One-to-One Find the highest number in the matrix Remove the corresponding row and column Repeat until all rows are removed s1 => g2 10 s2 => g1 7 s3 => g3 6 acc = (10+7+6)/sum Due to F. Xia
        g1   g2   g3
s1       2   10    9
s2       7    4    2
s3       0    9    6
s4       5    0    3

74 Mapping Sys to Gold: Many-to-One Find the highest number in the matrix Remove the corresponding row (but not the column) Repeat until all rows are removed s1 => g2 10 s2 => g1 7 s3 => g3 9 s4 => g1 5 acc = (10+7+9+5)/sum Due to F. Xia
        g1   g2   g3
s1       2   10    9
s2       7    4    2
s3       0    9    6
s4       5    0    3
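Both mappings follow the same greedy loop; they differ only in whether the matched gold column is also removed. A sketch (the matrix is a dict of dicts of overlap counts; names are illustrative):

    def greedy_map(matrix, one_to_one=True):
        # matrix[sys_cluster][gold_cluster] = number of shared items
        remaining_rows, used_cols = set(matrix), set()
        mapping, matched = {}, 0
        while remaining_rows:
            # find the largest remaining cell
            best = max(((s, g, c) for s in remaining_rows
                        for g, c in matrix[s].items()
                        if not (one_to_one and g in used_cols)),
                       key=lambda t: t[2], default=None)
            if best is None:
                break
            s, g, c = best
            mapping[s] = g
            matched += c
            remaining_rows.remove(s)                  # each system cluster maps only once
            if one_to_one:
                used_cols.add(g)                      # one-to-one: the gold cluster is used up too
        total = sum(c for row in matrix.values() for c in row.values())
        return mapping, matched / total               # accuracy = matched counts / total counts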

75 Q3: calculate_accuracy calculate_accuracy.* sys_clust gold_clust flag map_file acc_file
sys_clust: output of Q2 (format: medoid w1 w2 …)
gold_clust: similar format, gold standard
flag: 0 = one-to-one; 1 = many-to-one
map_file: mapping of sys to gold clusters; format: sys_clust_num => gold_clust_num count
acc_file: just the overall accuracy

76 Experiments Compare different numbers of words and different feature representations Compare different mapping strategies for accuracy Tabulate results

