NLP Workshop Eytan Ruppin Ben Sandbank

1 NLP Workshop Eytan Ruppin http://www.cs.tau.ac.il/~ruppin ruppin@post.tau.ac.il Ben Sandbank http://www.cs.tau.ac.il/~sandban sandban@post.tau.ac.il

2 The problem of language acquisition The argument from the poverty of the stimulus

3 Project descriptions ADIOS (“Automatic DIstillation Of Structure”) – a grammar induction algorithm developed by Zach Solan. Projects will consist of two phases: implementation of the algorithm, and some original work – applying ADIOS to new data, or developing & implementing some extensions

4 Latent Semantic Analysis Assumes a high-dimensional “semantic space” Each word/topic is a point in this space Semantic similarity is a function of distance A discourse regarding a certain topic will utilize words from neighboring regions of semantic space

5 The problem of acquisition The learner is presented with sets of words, each set used in a discourse on a single topic – for example, the words contained in the paragraphs of a text. The problem – figure out the true distances between the words in semantic space

6 A toy example

7 Bags of words
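
As a minimal sketch (not from the slides; the toy paragraphs below are made up for illustration), this is what building such a word-by-context “bag of words” count matrix looks like in code:

```python
import numpy as np

# Toy "contexts" (e.g. paragraphs); each is reduced to a bag of words.
contexts = [
    "human interface computer",
    "user interface system",
    "human system computer user",
]

vocab = sorted({w for c in contexts for w in c.split()})
word_index = {w: i for i, w in enumerate(vocab)}

# X[i, j] = number of occurrences of word i in context j
X = np.zeros((len(vocab), len(contexts)), dtype=int)
for j, c in enumerate(contexts):
    for w in c.split():
        X[word_index[w], j] += 1
```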

8 Measuring similarity Question – why not just measure correlations between word vectors? Rationale – words with similar meanings will tend to co-occur in the same contexts Answer – that just doesn’t work

9 Doesn’t it?

10 Oh. Why not, then? That two words rarely co-occur can be because – the words are truly unrelated (like carburetor/semantics); the words are related but the sample is too small (many pairwise counts will be zero); or the words are near-perfect synonyms (like human/user in the toy example)

11 The virtues of dimension reduction Proper dimension reduction helps overcome errors in measurement The three houses example Imposes global constraints that give rise to indirect effects Only true if we know the underlying dimensionality

12 Singular Value Decomposition Refers to the fact that every rectangular matrix X can be decomposed thusly – X (w × c, words × contexts) = W (w × m) S (m × m) C^T (m × c), where m = rank(X) (= min(w, c))
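
As an illustration (an assumption of this sketch: NumPy is used, and a random 9x9 matrix stands in for the toy co-occurrence matrix), the decomposition can be computed and checked directly:

```python
import numpy as np

X = np.random.rand(9, 9)                          # stand-in for the 9x9 toy matrix
W, s, Ct = np.linalg.svd(X, full_matrices=False)  # factors of X = W S C^T
S = np.diag(s)                                    # singular values, in descending order

assert np.allclose(X, W @ S @ Ct)                 # the decomposition reconstructs X
```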

13 Properties of the SVD W, C are orthonormal. S is diagonal, with values usually arranged in descending order – these are the ‘singular values’. Zeroing all but the r largest values of S will result in a matrix X’ which is the closest rank-r matrix to X! X (w × c) = W (w × m) S (m × m) C^T (m × c)
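
A short sketch of the rank-reduction step (keeping the r largest singular values and zeroing the rest), again using NumPy:

```python
import numpy as np

def rank_r_approximation(X, r):
    """Zero all but the r largest singular values of S and rebuild X'."""
    W, s, Ct = np.linalg.svd(X, full_matrices=False)
    s = s.copy()
    s[r:] = 0.0                     # keep only the r largest singular values
    return W @ np.diag(s) @ Ct      # X', the closest rank-r matrix to X

X2 = rank_r_approximation(np.random.rand(9, 9), 2)   # e.g. a 2-D reduction
```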

14 Toy example SVD

15 (9x9)

16 The original matrix again…

17 And the 2D result

18

19 Matrix preprocessing The co-occurrence matrix is subjected to a preprocessing transformation that improves LSA’s performance. Define tf_i,j – term frequency – the number of occurrences of word i in context j; df_i – document frequency – the number of documents (contexts) containing word i

20 The transformation tf_i,j → log(tf_i,j + 1) * log(N / df_i), where N is the number of contexts. The ‘log’ compresses the occurrence counts – a document containing 3 occurrences of word i matters more than one containing a single occurrence, but not 3 times as much. The second term (“Inverse Document Frequency”) is a measure of the word’s informativeness
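
A direct sketch of this transformation (assuming a word-by-context count matrix X in which every word occurs at least once, so no df is zero):

```python
import numpy as np

def preprocess(X):
    """tf -> log(tf + 1) * log(N / df), applied entry-wise."""
    N = X.shape[1]                       # number of contexts (documents)
    df = np.count_nonzero(X, axis=1)     # document frequency of each word
    idf = np.log(N / df)                 # inverse document frequency
    return np.log(X + 1) * idf[:, None]  # scale each word's log-counts by its IDF
```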

21 Low dimensional representation Let S_r be equal to S with all but the r largest values set to zero. Then W S_r gives the r-dimensional representation of the word matrix (each row corresponds to one word), and C S_r gives the r-dimensional representation of the context matrix (each row corresponds to one context). X (w × c) = W (w × m) S (m × m) C^T (m × c)
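
Putting the slide’s recipe into a minimal sketch for extracting r-dimensional word and context vectors:

```python
import numpy as np

def lsa_embeddings(X, r):
    """Word vectors are rows of W S_r; context vectors are rows of C S_r."""
    W, s, Ct = np.linalg.svd(X, full_matrices=False)
    S_r = np.diag(s[:r])                  # the r largest singular values
    word_vectors = W[:, :r] @ S_r         # one row per word
    context_vectors = Ct.T[:, :r] @ S_r   # one row per context
    return word_vectors, context_vectors
```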

22 Measuring similarity Actually, similarity is measured by the cosine of the angle between the two word/context vectors. This is equivalent to distance if the vectors are all normalized. For a context, its angle is related to its topic; its length indicates how much it has to say about it
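
The cosine itself is a one-liner, shown here as a sketch:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two LSA vectors (words or contexts)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```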

23 “Folding” new contexts We would like to be able to determine the low-dimensional representation of new contexts. Let q (w × 1) be the new context vector, and W_k (w × k) the k-dimensional word matrix. The k-dimensional embedding of the new context is given by – W_k^T q
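
A sketch of the folding-in step as stated on the slide (q is assumed to already be preprocessed with the same transformation as the training matrix):

```python
import numpy as np

def fold_in(q, W_k):
    """q: (w,) count vector of the new context; W_k: (w, k) word matrix.
    Returns the k-dimensional embedding W_k^T q."""
    return W_k.T @ q
```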

24 So how good is this thing? LSA was run on 4.6 million words taken from an online encyclopedia – 30,473 articles (= contexts), 60,768 distinct words. The LSA representation was tested on the synonym portion of the TOEFL. The model got 64.4% of the answers correct (52.5% corrected for guessing), as compared with an average result of 64.5% (52.7%) by foreign students applying to US colleges
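
A hypothetical sketch of how one TOEFL synonym item could be scored with the LSA vectors (the function and argument names are made up; 'vectors' maps each word to its LSA row):

```python
import numpy as np

def answer_synonym_item(stem, alternatives, vectors):
    """Pick the alternative whose vector has the highest cosine with the stem word."""
    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(alternatives, key=lambda w: cosine(vectors[stem], vectors[w]))
```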

25 Effects of dimensionality Without SVD, 16% of the answers were correct

26 The effects of expertise The performance of LSA was tested with regard to the amount of prior knowledge (i.e. corpus size) along two dimensions – the number of occurrences of the tested word, and the number of contexts NOT containing the tested word. A continuous discrimination measure was defined to assess LSA’s knowledge of a word –

27 The effects of expertise

28 Rate of learning LSA was trained on a corpus the size of a seventh-grader’s reading history. Adding a single paragraph and retraining made it possible to assess LSA’s rate of learning

29 Rate of learning The effect of reading a paragraph on knowledge of words contained in it was a gain of ~0.05 words; the effect on words not contained in it was ~0.15 words. Given that a seventh-grader reads about 50 paragraphs a day, this means 2.5 words are learned directly and 7.5 words are learned indirectly

30 Some more LSA applications

31 An interesting fact about psychologists LSA was trained on the text of introductory psychology textbooks and then used to answer multiple-choice course exams. It passed, albeit with a significantly lower grade than the class average

32 Judging the quality of essay answers LSA has been used to assign holistic quality scores to answers to essay questions. Several methods were used – one based on a sample of essays graded by instructors, another based on a pre-existing exemplary text. The LSA scores correlated with expert scores approximately as well as experts correlated with each other

33 Pronominal Anaphora Resolution Anaphora – a word that acquires its meaning by reference to previous words. Pronouns – she, he, it, him, himself… For example – The new medicine was released to the market a week ago. The doctor gave it to the boy.

34 Pronominal Anaphora Resolution ‘The new medicine’, ‘the market’, and ‘a week’ are all possible resolutions for the anaphora. Suggestion – compare each candidate to the phrase containing the pronoun (‘the doctor gave to the boy’). ‘The new medicine’ had the highest cosine and was selected as the correct resolution
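
A hedged sketch of the comparison described above; representing a phrase by the sum of its word vectors is an assumption here, not necessarily the exact method of the paper:

```python
import numpy as np

def resolve(candidates, host_phrase, word_vectors):
    """Return the candidate phrase closest (by cosine) to the phrase containing the pronoun."""
    def phrase_vec(words):
        # assumes every word has an LSA vector in word_vectors
        return np.sum([word_vectors[w] for w in words], axis=0)
    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    host = phrase_vec(host_phrase)
    return max(candidates, key=lambda c: cosine(phrase_vec(c), host))
```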

35 Some LSA shortcomings

36 Some quirks Inexplicable word similarities sometimes arise – ‘verbally’ and ‘sadomasochism’ had a cosine of 0.8 in the encyclopedia space. LSA sometimes seems more sensitive to contextual associations than to semantic features – ‘nurse’ is more similar to ‘physician’ (cos = 0.47) than ‘doctor’ is (cos = 0.41)

37 A limitation? The LSA representation enforces one meaning per word, but this does not always hold – ‘fly’ may mean an insect or relate to an airplane. The resulting word representation must be a compromise between all of a word’s interpretations

38 The effects of context Consider the sentence – “The player caught the high fly to left field”. This sentence has a cosine of 0.37 with ball, 0.31 with baseball, 0.27 with hit; in contrast, 0.17 with insect, 0.18 with airplane, 0.13 with bird. ‘fly’ by itself has cosines of 0.02, 0.01, -0.02 with ball, baseball and hit, and 0.69, 0.53, 0.23 with insect, airplane and bird

39 Discussion

40 Questions for the audience What’s missing from LSA? Why does it work so well without it? Attempts to integrate syntax into LSA have generally resulted in worse performance. Landauer argued at one point that syntax serves no purpose other than providing a computationally efficient way to get text into an LSA representation. Is this reasonable?

41 Bibliography and further reading “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge”, Landauer & Dumais, Psychological Review, Vol. 104, No. 2, 1997 “An Introduction to Latent Semantic Analysis”, Landauer, Foltz & Laham, Discourse Processes, 25, 1998

42 Bibliography and further reading “Using LSA for Pronominal Anaphora Resolution”, Klebanov & Wiemer-Hastings “Adding Syntactic Information to LSA”, Wiemer-Hastings

