
1 Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

2 Roadmap Clustering Motivation & Applications Clustering Approaches Evaluation

3 Clustering Task: Given a set of objects, create a set of clusters over those objects Applications:

4 Clustering Task: Given a set of objects, create a set of clusters over those objects Applications: Exploratory data analysis Document clustering Language modeling Generalization for class-based LMs Unsupervised Word Sense Disambiguation Automatic thesaurus creation Unsupervised Part-of-Speech Tagging Speaker clustering, …

5 Example: Document Clustering Input: Set of individual documents Output: Sets of document clusters Many different types of clustering:

6 Example: Document Clustering Input: Set of individual documents Output: Sets of document clusters Many different types of clustering: Category: news, sports, weather, entertainment

7 Example: Document Clustering Input: Set of individual documents Output: Sets of document clusters Many different types of clustering: Category: news, sports, weather, entertainment Genre clustering: Similar styles: blogs, tweets, newswire

8 Example: Document Clustering Input: Set of individual documents Output: Sets of document clusters Many different types of clustering: Category: news, sports, weather, entertainment Genre clustering: Similar styles: blogs, tweets, newswire Author clustering

9 Example: Document Clustering Input: Set of individual documents Output: Sets of document clusters Many different types of clustering: Category: news, sports, weather, entertainment Genre clustering: Similar styles: blogs, tweets, newswire Author clustering Language ID: language clusters

10 Example: Document Clustering Input: Set of individual documents Output: Sets of document clusters Many different types of clustering: Category: news, sports, weather, entertainment Genre clustering: Similar styles: blogs, tweets, newswire Author clustering Language ID: language clusters Topic clustering: documents on the same topic OWS, debt supercommittee, Seattle Marathon, Black Friday..

11 Example: Word Clustering Input: Words Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats Output: Word clusters

12 Example: Word Clustering Input: Words Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats Output: Word clusters Example clusters:

13 Example: Word Clustering Input: Words Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats Output: Word clusters Example clusters: (from NYT) ballot, polls, Gov, seats profit, finance, payments NFL, Reds, Sox, inning, quarterback, scored, score researchers, science Scott, Mary, Barbara, Edward

14 Questions What should a cluster represent? Due to F. Xia

15 Questions What should a cluster represent? Similarity among objects How can we create clusters? Due to F. Xia

16 Questions What should a cluster represent? Similarity among objects How can we create clusters? How can we evaluate clusters? Due to F. Xia

17 Questions What should a cluster represent? Similarity among objects How can we create clusters? How can we evaluate clusters? How can we improve NLP with clustering? Due to F. Xia

18 Similarity Between two instances

19 Similarity Between two instances Between an instance and a cluster

20 Similarity Between two instances Between an instance and a cluster Between clusters

21 Similarity Measures Given x = (x1, x2, …, xn) and y = (y1, y2, …, yn)

22 Similarity Measures Given x = (x1, x2, …, xn) and y = (y1, y2, …, yn) Euclidean distance:

23 Similarity Measures Given x = (x1, x2, …, xn) and y = (y1, y2, …, yn) Euclidean distance: Manhattan distance:

24 Similarity Measures Given x = (x1, x2, …, xn) and y = (y1, y2, …, yn) Euclidean distance: sqrt(Σi (xi − yi)²) Manhattan distance: Σi |xi − yi| Cosine similarity: (x · y) / (‖x‖ ‖y‖)
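A minimal sketch of these three measures in Python (illustrative helper names; vectors are assumed to be equal-length sequences of numbers):

    import math

    def euclidean(x, y):
        # square root of the sum of squared coordinate differences
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def manhattan(x, y):
        # sum of absolute coordinate differences
        return sum(abs(xi - yi) for xi, yi in zip(x, y))

    def cosine(x, y):
        # dot product normalized by the two vector lengths
        dot = sum(xi * yi for xi, yi in zip(x, y))
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        norm_y = math.sqrt(sum(yi * yi for yi in y))
        return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

Note that Euclidean and Manhattan are distances (smaller means more similar), while cosine is a similarity (larger means more similar).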

25 Clustering Algorithms

26 Types of Clustering Flat vs Hierarchical Clustering: Flat: partition data into k clusters

27 Types of Clustering Flat vs Hierarchical Clustering: Flat: partition data into k clusters Hierarchical: Nodes form hierarchy

28 Types of Clustering Flat vs Hierarchical Clustering: Flat: partition data into k clusters Hierarchical: Nodes form hierarchy Hard vs Soft Clustering Hard: Each object assigned to exactly one cluster

29 Types of Clustering Flat vs Hierarchical Clustering: Flat: partition data into k clusters Hierarchical: Nodes form hierarchy Hard vs Soft Clustering Hard: Each object assigned to exactly one cluster Soft: Allows degrees of membership and membership in more than one cluster Often probability distribution over cluster membership

30 Hierarchical Clustering

31 Hierarchical Vs. Flat Hierarchical clustering:

32 Hierarchical Vs. Flat Hierarchical clustering: More informative Good for data exploration Many algorithms, none good for all data Computationally expensive

33 Hierarchical Vs. Flat Hierarchical clustering: More informative Good for data exploration Many algorithms, none good for all data Computationally expensive Flat clustering:

34 Hierarchical Vs. Flat Hierarchical clustering: More informative Good for data exploration Many algorithms, none good for all data Computationally expensive Flat clustering: Fairly efficient Simple baseline algorithm: K-means Probabilistic models use EM algorithm

35 Clustering Algorithms Flat clustering: K-means clustering K-medoids clustering Hierarchical clustering: Greedy, bottom-up clustering

36 K-Means Clustering Initialize: Randomly select k initial centroids

37 K-Means Clustering Initialize: Randomly select k initial centroids Center (mean) of cluster Iterate until clusters stop changing

38 K-Means Clustering Initialize: Randomly select k initial centroids Center (mean) of cluster Iterate until clusters stop changing Assign each instance to the nearest cluster Cluster is nearest if cluster centroid is nearest

39 K-Means Clustering Initialize: Randomly select k initial centroids Center (mean) of cluster Iterate until clusters stop changing Assign each instance to the nearest cluster Cluster is nearest if cluster centroid is nearest Recompute cluster centroids Mean of instances in the cluster
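A rough sketch of this loop, assuming dense numeric vectors and Euclidean distance (illustrative only, not the assignment code):

    import math
    import random

    def kmeans(points, k, max_iters=100):
        # points: list of equal-length numeric tuples
        centroids = random.sample(points, k)                # initialize: k random instances as centroids
        for _ in range(max_iters):
            # assignment step: each instance joins the cluster with the nearest centroid
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
                clusters[nearest].append(p)
            # update step: each centroid becomes the mean of its cluster
            new_centroids = [
                tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
                for i, c in enumerate(clusters)
            ]
            if new_centroids == centroids:                   # clusters stopped changing
                break
            centroids = new_centroids
        return clusters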

40 K-Means: 1 step

41 K-Means Running time:

42 K-Means Running time: O(kn) per iteration, where n is the number of instances and k the number of clusters Converges in a finite number of steps Issues:

43 K-Means Running time: O(kn) per iteration, where n is the number of instances and k the number of clusters Converges in a finite number of steps Issues: Need to pick # of clusters k Can find only a local optimum Sensitive to outliers Requires Euclidean distance: what about enumerable classes (e.g., colors)?

44 Medoid Medoid: Element in cluster with highest average similarity to other elements in cluster

45 Medoid Medoid: Element in cluster with highest average similarity to other elements in cluster Finding the medoid: For each element p, compute f(p), its average similarity to the other elements in the cluster

46 Medoid Medoid: Element in cluster with highest average similarity to other elements in cluster Finding the medoid: For each element p, compute f(p), its average similarity to the other elements in the cluster Select the element with the highest f(p)
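One possible way to code the medoid search (a sketch; sim can be any similarity function, e.g. cosine):

    def find_medoid(cluster, sim):
        # f(p): average similarity of element p to the other elements in the cluster
        def f(i):
            others = [q for j, q in enumerate(cluster) if j != i]
            return sum(sim(cluster[i], q) for q in others) / len(others) if others else 0.0
        # the medoid is the element whose average similarity to the rest is highest
        return cluster[max(range(len(cluster)), key=f)]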

47 K-Medoids Initialize: Select k instances at random as medoids

48 K-Medoids Initialize: Select k instances at random as medoids Iterate until no changes Assign instances to cluster with nearest medoid

49 K-Medoids Initialize: Select k instances at random as medoids Iterate until no changes Assign instances to cluster with nearest medoid Recompute medoid for each cluster
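Putting the steps together, a rough K-medoids sketch that reuses the find_medoid helper sketched above (the initialization and similarity measure actually required by the assignment are specified later):

    def k_medoids(instances, k, sim, max_iters=100):
        medoids = instances[:k]                       # any k instances can seed the clusters
        for _ in range(max_iters):
            # assign each instance to the cluster with the most similar medoid
            clusters = [[] for _ in range(k)]
            for x in instances:
                best = max(range(k), key=lambda i: sim(x, medoids[i]))
                clusters[best].append(x)
            # recompute each cluster's medoid
            new_medoids = [find_medoid(c, sim) if c else medoids[i]
                           for i, c in enumerate(clusters)]
            if new_medoids == medoids:                # no change: stop
                break
            medoids = new_medoids
        return medoids, clusters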

50 Greedy, Bottom-Up Hierarchical Clustering Initialize: Make an individual cluster for each instance

51 Greedy, Bottom-Up Hierarchical Clustering Initialize: Make an individual cluster for each instance Iterate until all instances in same cluster Merge two most similar clusters
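A naive sketch of this bottom-up loop, using average pairwise similarity between clusters (one of several possible linkage choices; far from optimized, illustrative only):

    def agglomerative(instances, sim):
        # start with one singleton cluster per instance
        clusters = [[x] for x in instances]
        merges = []
        while len(clusters) > 1:
            # find the most similar pair of clusters (average-link similarity)
            best_pair, best_score = None, float("-inf")
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    score = (sum(sim(a, b) for a in clusters[i] for b in clusters[j])
                             / (len(clusters[i]) * len(clusters[j])))
                    if score > best_score:
                        best_pair, best_score = (i, j), score
            i, j = best_pair
            merges.append((list(clusters[i]), list(clusters[j])))   # record the hierarchy
            clusters[i] = clusters[i] + clusters[j]                 # merge the two clusters
            del clusters[j]
        return merges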

52 Evaluation

53 With respect to gold standard Accuracy For each cluster, assign most common label to all items Rand index F-measure Alternatives:

54 Evaluation With respect to gold standard Accuracy For each cluster, assign most common label to all items Rand index F-measure Alternatives: Extrinsic evaluation

55 Evaluation With respect to gold standard Accuracy For each cluster, assign most common label to all items Rand index F-measure Alternatives: Extrinsic evaluation Human inspection

56 Configuration Given Set of objects O = {o1, o2, …, on}

57 Configuration Given Set of objects O = {o1, o2, …, on} Partition X = {x1, …, xr} Partition Y = {y1, …, ys}

58 Configuration Given Set of objects O = {o1, o2, …, on} Partition X = {x1, …, xr} Partition Y = {y1, …, ys} Counts over pairs of objects:
                         In same set in X   In different sets in X
In same set in Y                a                    d
In different sets in Y          c                    b

59 Rand Index Measure of cluster similarity (Rand, 1971) No agreement? (Pair counts a, b, c, d as in the table above.)

60 Rand Index Measure of cluster similarity (Rand, 1971) No agreement? 0; Full agreement? (Pair counts a, b, c, d as in the table above.)

61 Rand Index Measure of cluster similarity (Rand, 1971) Rand index = (a + b) / (a + b + c + d) No agreement? 0; Full agreement? 1
                         In same set in X   In different sets in X
In same set in Y                a                    d
In different sets in Y          c                    b
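A sketch of computing the Rand index directly from two partitions, each given as a list of clusters over the same objects:

    from itertools import combinations

    def rand_index(partition_x, partition_y):
        # map each object to its cluster id in each partition
        label_x = {o: i for i, cluster in enumerate(partition_x) for o in cluster}
        label_y = {o: i for i, cluster in enumerate(partition_y) for o in cluster}
        agree = total = 0
        for o1, o2 in combinations(label_x, 2):
            same_x = label_x[o1] == label_x[o2]
            same_y = label_y[o1] == label_y[o2]
            agree += (same_x == same_y)   # counts a (same/same) and b (different/different)
            total += 1                    # total pairs = a + b + c + d
        return agree / total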

62 Precision & Recall Assume X is the gold standard partition Assume Y is the system-generated partition

63 Precision & Recall Assume X is the gold standard partition Assume Y is the system-generated partition For each pair of items in a cluster in Y Correct if they appear together in a cluster in X

64 Precision & Recall Assume X is the gold standard partition Assume Y is the system-generated partition For each pair of items in a cluster in Y Correct if they appear together in a cluster in X Can compute P, R, and F-measure

65 HW #10 Due to F. Xia

66 HW #10 Unsupervised POS tagging: word clustering by neighboring-word co-occurrence Create feature vectors: Features: counts of adjacent word occurrences E.g., L=he:10 or R=run:3 Perform clustering: K-medoids algorithm (with cosine similarity) Evaluate clusters: cluster mapping + accuracy

67 Q1 create_vectors.* training_file word_file feat_file outfile
training_file: one sentence per line: w1 w2 w3 … wn
word_file: list of words to cluster (format: word freq)
feat_file: list of words to use as features (format: feat freq)
outfile: one list per word in word_file; format: word L=he 10 L=she 5 … R=gone 2 R=run 3 …

68 Features Features are of the form (L|R)=xx freq, where xx is a word in the feat_file, L or R is the position (left or right neighbor) where the feature appeared, and freq is the number of times word xx appeared in that position in the training file. E.g., suppose 'New York' appears 540 times in the corpus: York L=New 540 … R=New 0 …
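A rough sketch of the feature-counting step (illustrative only, not the assignment solution; file reading and output formatting are omitted, and the helper name is made up):

    from collections import Counter, defaultdict

    def count_features(sentences, target_words, feature_words):
        # counts[w]["L=x"]: how often feature word x appears immediately to the left of w
        counts = defaultdict(Counter)
        targets, feats = set(target_words), set(feature_words)
        for sent in sentences:                          # each sentence: a list of tokens
            for i, w in enumerate(sent):
                if w not in targets:
                    continue
                if i > 0 and sent[i - 1] in feats:
                    counts[w]["L=" + sent[i - 1]] += 1
                if i + 1 < len(sent) and sent[i + 1] in feats:
                    counts[w]["R=" + sent[i + 1]] += 1
        return counts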

69 Vector File One line per word in word_file Lines should be ordered by word_file Features should be sorted alphabetically by feature name E.g., L=an 3 L=the 10 … R=aqua 1 R=house 5 Feature sorting aids the cosine computation (see the sketch below)
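Why the sorting helps: with features in a fixed alphabetical order, cosine similarity can be computed in a single merge-style pass over two sparse vectors. A sketch, with each vector given as a sorted list of (feature, count) pairs:

    import math

    def sparse_cosine(xs, ys):
        # xs, ys: lists of (feature_name, count), each sorted by feature_name
        dot, i, j = 0.0, 0, 0
        while i < len(xs) and j < len(ys):
            if xs[i][0] == ys[j][0]:
                dot += xs[i][1] * ys[j][1]        # feature present in both vectors
                i += 1
                j += 1
            elif xs[i][0] < ys[j][0]:
                i += 1
            else:
                j += 1
        norm_x = math.sqrt(sum(v * v for _, v in xs))
        norm_y = math.sqrt(sum(v * v for _, v in ys))
        return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0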

70 Q2 k_medoids.* vector_file num_clusters sys_cluster_file
vector_file: created by Q1
num_clusters: number of clusters to create
sys_cluster_file: output representing the clustering of vectors; format: medoid w1 w2 w3 … wn, where medoid is the medoid representing the cluster and w1 … wn are the words in the cluster

71 Q2: K-Medoids Similarity measure: cosine similarity Initial medoids: medoid i is placed at a fixed instance index determined by N and C, where N is the number of words to cluster and C is the number of clusters

72 Mapping Sys to Gold: One-to-One Find the highest number in the matrix Remove the corresponding row and column Repeat until all rows are removed s1 => g2 10 s2 => g1 7 s3 => g3 6 acc = (10+7+6)/sum Due to F. Xia
        g1   g2   g3
s1       2   10    9
s2       7    4    2
s3       0    9    6
s4       5    0    3

73 Mapping Sys to Gold: One-to-One Find the highest number in the matrix Remove the corresponding row and column Repeat until all rows are removed s1 => g2 10 s2 => g1 7 s3 => g3 6 acc = (10+7+6)/sum Due to F. Xia
        g1   g2   g3
s1       2   10    9
s2       7    4    2
s3       0    9    6
s4       5    0    3

74 Mapping Sys to Gold: Many-to-One Find the highest number in the matrix Remove the corresponding row (but not the column) Repeat until all rows are removed s1 => g2 10 s2 => g1 7 s3 => g3 9 s4 => g1 5 acc = (10+7+9+5)/sum Due to F. Xia
        g1   g2   g3
s1       2   10    9
s2       7    4    2
s3       0    9    6
s4       5    0    3
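Both mappings follow the same greedy loop; they differ only in whether the matched gold column is also removed. A sketch (the matrix is a dict of dicts of overlap counts; names are illustrative):

    def greedy_map(matrix, one_to_one=True):
        # matrix[sys_cluster][gold_cluster] = number of shared items
        remaining_rows, used_cols = set(matrix), set()
        mapping, matched = {}, 0
        while remaining_rows:
            # find the largest remaining cell
            best = max(((s, g, c) for s in remaining_rows
                        for g, c in matrix[s].items()
                        if not (one_to_one and g in used_cols)),
                       key=lambda t: t[2], default=None)
            if best is None:
                break
            s, g, c = best
            mapping[s] = g
            matched += c
            remaining_rows.remove(s)                  # each system cluster maps only once
            if one_to_one:
                used_cols.add(g)                      # one-to-one: the gold cluster is used up too
        total = sum(c for row in matrix.values() for c in row.values())
        return mapping, matched / total               # accuracy = matched counts / total counts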

75 Q3: calculate_accuracy calculate_accuracy.* sys_clust gold_clust flag map_file acc_file
sys_clust: output of Q2 (format: medoid w1 w2 …)
gold_clust: similar format, gold standard
flag: 0 = one-to-one; 1 = many-to-one
map_file: mapping of sys to gold clusters; format: sys_clust_num => gold_clust_num count
acc_file: just the overall accuracy

76 Experiments Compare different numbers of words and different feature representations Compare different mapping strategies for accuracy Tabulate results

