Moshe Koppel and Navot Akiva

Name: Moshe Koppel and Navot Akiva
Uploaded: 2017-12-22T22:58:12+00:00
Duration: PTM25S59
Channel: Adrian Fleming
Description: Moshe Koppel and Navot Akiva

Moshe Koppel and Navot Akiva
Lecture 7: Clustering and Decomposition of a Document to Authorial Components Moshe Koppel and Navot Akiva

Clustering The idea is to define “natural” clusters
Position in space depends on feature set Need to have some notion of what you want in order to choose right feature types e.g., cluster by topic: content words e.g., cluster by author: style features With many features, clusters aren’t so elegant

How Many Clusters? Some methods decide in advance
Some methods try to optimize k based on data e.g., in previous example, would want 3 clusters Caution: some optimization criteria biased towards one giant cluster or many tiny ones

Flat Clustering: K-Means Algorithm
Select K random docs {s1, s2,… sK} as seeds. Until clustering converges or other stopping criterion: For each doc di: Assign di to the cluster cj such that dist(xi, sj) is minimal. For each cluster cj: sj = (cj) (Update the seeds to the centroid of each cluster)

K-Means Example (K=2) x x Pick seeds Reassign clusters
Compute centroids x Reassign clusters x Compute centroids Reassign clusters Converged!

Termination conditions
Several possibilities, e.g., A fixed number of iterations. Doc partition unchanged. Centroid positions don’t change. a a a a a a A a b b b B b b b c C c c c c c c a a a A a a a b b b b B b b b b c c c C c c c In the long run, centroid unchanged implies documents unchanged?

Convergence Why should the K-means algorithm ever reach a fixed point?
A state in which clusters don’t change. K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm. EM is known to converge. Number of iterations could be large.

Convergence of K-Means
Define goodness measure of cluster k as sum of squared distances from cluster centroid: Gk = Σi (di – ck) (sum over all di in cluster k) G = Σk Gk Reassignment monotonically decreases G since each vector is assigned to the closest centroid. Recomputation monotonically decreases each Gk since (mk is number of members in cluster k): Σ (di – a)2 reaches minimum for: Σ –2(di – a) = 0 Σ di = Σ a mK a = Σ di a = (1/ mk) Σ di = ck K-means typically converges quickly

Dependence on Seed Choice
Results depend on random seed selection. Some tricks for avoiding bad seeds: Choose seed far from any already chosen Try out multiple seed sets If you start with B and E as centroids you converge to {A,B,C} and {D,E,F} If you start with D and F you converge to {A,B,D,E} and {C,F}

A Problem… The methods we’ve discussed so far don’t always give very satisfying results. Have a look at this example….

Ng et al On Spectral clustering: analysis and algorithm

Spectral Methods Because of examples like that, spectral methods have been developed. Here’s the idea: Imagine weighted edges between points, where weight reflects similarity. Zero out all edges except k nearest neighbors with high weights. Now make the “cheapest” cuts that result in separate components.

Ng et al On Spectral clustering: analysis and algorithm

Spectral Clustering Properly formulated, the min Ncut problem is NP-hard. But there are some nice linear algebra tricks we can use to get reasonable approximate solutions: Pre-processing Construct a matrix representation of the dataset. Decomposition Compute eigenvalues and eigenvectors of the matrix. Map each point to a lower-dimensional representation based on one or more eigenvectors. Grouping Assign points to two or more clusters, based on the new representation.

New Problem: Separating Document Components
Often documents consist of multiple authorial components. Our object is to tease apart the components of a composite document.

Basic Idea Divide the document into natural chunks (e.g., chapters, paragraphs) Vectorize chunks using some feature set Cluster the vectors

Classic Example: The Bible
Great historic and cultural interest Much prior research on components Has been manually tagged in every conceivable way

Test Case Let’s throw the chapters of Ezekiel and Jeremiah into a hat and see if we can separate them out. Each is presumably the work of a single (primary) author. There’s no reason to think it’s the same author. They are thematically quite similar, though far from identical.

Clustering Jeremiah+Ezekiel
Chunks = chapters (no sequence info) Features = bag of words Cluster method = Ncut # clusters = 2

Results using all words
Ezekiel Jeremiah 28 29 Cluster 1 20 23 Cluster 2 Not too exciting. We must be picking up thematic or genre-related differences that cross books. Let’s try using only function words.

Results using function words
Ezekiel Jeremiah 28 34 Cluster 1 20 18 Cluster 2 Not any better. Let’s try a new approach.

A Better Idea Exploit the fact that different authors use different synonyms for the same idea (e.g., makel/mateh). It would be really convenient if it turned out that Jeremiah and Ezekiel made consistently different choices for various synsets. Note that words aren’t synonyms, rather word senses are synonyms. (For example mateh=staff is a synonym of makel, but mateh=tribe is not.)

Automatically Finding Synonyms
There are various clever methods for identifying synsets, but most are not exact enough for our purpose. Conveniently, for the Bible, we have many useful tools, including careful translations and manual sense tagging. We identify as synonyms word senses that are translated into the same English word (e.g., makel=staff and mateh=staff). (We even get sense disambiguation for free.) Due to polysemy (in English), this method overshoots. We manually delete mistakes. (This is the only manual intervention we will ever do.)

Synonym Method The usual similarity measures (e.g., cosine, inverse Euclidean distance) aren’t quite right. If two docs use the same synonym, they are similar. If two docs use opposite synonyms, they are different. If one of the docs uses one of the synonyms, but the other doesn’t, cosine would regard them as different. But are they? For measuring similarity, we only consider synsets represented in both docs.

Results using synonyms
Ezekiel Jeremiah 6 46 Cluster 1 41 7 Cluster 2 Now we’re getting somewhere. But we’re not done yet…

Core of a Cluster Some chapters are in a cluster because they really belong there; some just have to be somewhere. Let’s consider only chapters near one centroid and far from the other. These are the “cores” of the respective clusters. 36 7 4 12 3 2

Cluster cores Ezekiel Jeremiah 2 36 Cluster 1 Cluster 2

Cluster cores Ezekiel Jeremiah 2 36 Cluster 1 Cluster 2 Ezekiel 1, 10

Expanding the Core Now that we have a core, we can use supervised methods (e.g., SVM) to learn a boundary. In fact, we can use function words as our features. Using synonyms will just get us back where we started. And besides, FW are generally very reliable for supervised authorship attribution.

SVM expansion of core 1 52 47 Ezekiel Jeremiah Cluster 1 Cluster 2
Cluster 2 Two training examples are “misclassed” by SVM. Incredibly, these are Ezekiel 1 and 10, which were part of Jeremiah core, but are on Ezekiel side of optimal SVM boundary. The only exception is Ezekiel 42, a non-core chapter which lies in the SVM margin.

Decomposing Unsegmented Text
Until now, we’ve assumed that our units (chapters) were all pure Ezekiel or pure Jeremiah. That’s a major unrealistic assumption. Let’s create a munged “Jer-iel” book by alternately taking a random number of verses from each book until we run out of verses. Can we tease that apart?

Chunks Might Be Mixed xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Start with some arbitrary chunking of the munged text. (Some of these chunks will be mixed.) Now proceed exactly as above, until you have cluster cores. With a bit of luck, the chunks in the core will be the “pure” chunks.

A Munged Document xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

As above, use SVM to learn a classifier, using the core chunks as training examples. But at the last step, classify individual verses instead of chapters. Some smoothing techniques are necessary to: Deal with the sparseness of features in individual verses Take advantage of serial correlation

A Munged Document xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Results on Munged Books

Multiple Books The same method works if we create a munged book out of k>2 constituent books. BUT, we need to be given the k. So we don’t yet know how many components there are, only how to best split into a given number of components.

Open Question Can we extend this method so it can be used also for texts that don’t have canonical translations like the Bible? Can we use this method to identify plagiarized sections of a text?

Moshe Koppel and Navot Akiva

Similar presentations

Presentation on theme: "Moshe Koppel and Navot Akiva"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Moshe Koppel and Navot Akiva

Similar presentations

Presentation on theme: "Moshe Koppel and Navot Akiva"— Presentation transcript:

Similar presentations

About project

Feedback