Slide 1: SIMS 296a-4 Text Data Mining (Marti Hearst, UC Berkeley SIMS)

Slide 2: The Textbook
- Foundations of Statistical Natural Language Processing, by Chris Manning and Hinrich Schuetze
- We'll go through one chapter each week

Slide 3: Chapters to be Covered
1. Introduction (this week)
2. Linguistic Essentials
3. Mathematical Foundations
4. Mathematical Foundations (cont.)
5. Collocations
6. Statistical Inference
7. Word Sense Disambiguation
8. Markov Models
9. Text Categorization
10. Topics in Information Retrieval
11. Clustering
12. Lexical Acquisition

Slide 4: Introduction
- Scientific basis for this inquiry
- Rationalist vs. empiricist approaches to language analysis
  - Justification for the rationalist view: poverty of the stimulus
  - This argument can be countered if we assume humans have a general ability to generalize concepts

Slide 5: Introduction
- Competence vs. performance theory of grammar
  - Focus on whether or not sentences are well-formed
  - Syntactic vs. semantic well-formedness
  - Conventionality of expression breaks this notion

Slide 6: Introduction
- Categorical perception
  - Works pretty well for recognizing phonemes
  - But not for larger phenomena like syntax
  - Language change as counter-evidence to strict categorizability of language
    - "kind of" / "sort of" changed parts of speech very gradually
    - They occupied an intermediate syntactic status during the transition
  - Better to adopt a probabilistic view (of cognition as well as of language)

Slide 7: Introduction
- The ambiguity of language
  - Unlike programming languages, natural language is ambiguous unless it is understood in terms of all of its parts
    - Sometimes it is truly ambiguous as well
  - Parsing with syntax alone is harder than parsing that also uses the underlying meaning

Slide 8: Classifying Application Types

Slide 9: Word Token Distribution
- Word tokens are not uniformly distributed in text
  - The most common word types account for about 50% of all token occurrences
  - About 50% of the word types occur only once
  - Roughly 12% of the text consists of words occurring three times or fewer
- Thus it is hard to predict the behavior of many of the words in a text.
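As an illustration (not from the original slides), a minimal Python sketch that computes these proportions for an arbitrary corpus; the file name corpus.txt and the regex tokenizer are placeholder assumptions:

    from collections import Counter
    import re

    # Tokenize a plain-text corpus; "corpus.txt" is a placeholder file name.
    with open("corpus.txt", encoding="utf-8") as f:
        tokens = re.findall(r"[a-z']+", f.read().lower())

    counts = Counter(tokens)
    total_tokens = len(tokens)
    total_types = len(counts)

    # Share of all token occurrences covered by the 100 most frequent types.
    top_mass = sum(c for _, c in counts.most_common(100)) / total_tokens

    # Share of word types that occur exactly once (hapax legomena).
    hapax_share = sum(1 for c in counts.values() if c == 1) / total_types

    # Share of the running text made up of words occurring three times or fewer.
    rare_mass = sum(c for c in counts.values() if c <= 3) / total_tokens

    print(f"{total_types} types, {total_tokens} tokens")
    print(f"top 100 types cover {top_mass:.1%} of the tokens")
    print(f"{hapax_share:.1%} of the types occur only once")
    print(f"{rare_mass:.1%} of the text is words seen at most 3 times")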

Slide 10: Zipf's "Law"
- Rank (r) = a word's position when all words are ordered by frequency of occurrence
- The product of a word's frequency (f) and its rank (r) is approximately constant: f · r ≈ k, i.e. f ∝ 1/r
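A quick way to check this claim on a real corpus (a sketch under the same placeholder assumptions as the previous snippet) is to print the product f · r at a few ranks:

    from collections import Counter
    import re

    with open("corpus.txt", encoding="utf-8") as f:   # placeholder corpus file
        counts = Counter(re.findall(r"[a-z']+", f.read().lower()))

    # Rank 1 is the most frequent word; under Zipf's law, freq * rank
    # should stay roughly constant as rank grows.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        if rank in (1, 10, 100, 1000, 10000):
            print(f"rank {rank:>5}  {word!r:>12}  freq {freq:>7}  f*r = {freq * rank}")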

Slide 11: Consequences of Zipf
- There are always a few very frequent tokens that are not good discriminators.
  - Called "stop words" in Information Retrieval
  - Usually correspond to the linguistic notion of "closed-class" words
    - English examples: to, from, on, and, the, ...
    - Grammatical classes that don't take on new members
- A typical vocabulary has:
  - A few very common words
  - A middling number of medium-frequency words
  - A large number of very infrequent words
- Medium-frequency words are the most descriptive
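One way to act on this, sketched below with illustrative cutoffs (the thresholds 100 and 3 are my assumptions, not from the slides), is to derive the frequency bands directly from the corpus:

    from collections import Counter
    import re

    with open("corpus.txt", encoding="utf-8") as f:   # placeholder corpus file
        counts = Counter(re.findall(r"[a-z']+", f.read().lower()))

    ranked = counts.most_common()

    # Crude frequency bands; the cutoffs are illustrative, not prescriptive.
    stop_words = {w for w, _ in ranked[:100]}             # very common, poor discriminators
    rare_words = {w for w, c in ranked if c <= 3}         # very rare, hard to model
    medium_words = set(counts) - stop_words - rare_words  # the most descriptive band

    print(f"stop: {len(stop_words)}, medium: {len(medium_words)}, rare: {len(rare_words)}")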

Slide 12: Word Frequency vs. Resolving Power (from van Rijsbergen, 1979)
- The most frequent words are not (usually) the most descriptive.

Slide 13: Order by Rank vs. by Alphabetical Order

Slide 14: Other Zipfian "Laws"
- Conservation of speaker/hearer effort:
  - The number of meanings of a word is correlated with its frequency
  - (The speaker's ideal would be a single word covering all meanings; the hearer's ideal would be a single meaning per word)
  - The number of meanings m is roughly proportional to sqrt(f), i.e. inversely proportional to sqrt(r)
  - Important for word sense disambiguation
- Content words tend to clump together
  - Important for computing term distribution models
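The clumping effect can be seen by comparing the gaps between successive occurrences of a word; the sketch below is illustrative (the example words and the corpus.txt file name are assumptions, not from the slides):

    import re
    import statistics

    with open("corpus.txt", encoding="utf-8") as f:   # placeholder corpus file
        tokens = re.findall(r"[a-z']+", f.read().lower())

    def gap_stats(word):
        """Mean and median distance between successive occurrences of `word`."""
        positions = [i for i, t in enumerate(tokens) if t == word]
        if len(positions) < 2:
            return None
        gaps = [b - a for a, b in zip(positions, positions[1:])]
        return statistics.mean(gaps), statistics.median(gaps)

    # For a bursty content word the median gap is typically much smaller than the
    # mean gap (many short gaps inside a clump, a few huge gaps between clumps);
    # a function word behaves more like a uniform process.
    for w in ("said", "the"):   # illustrative content/function word pair
        print(w, gap_stats(w))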

Slide 15: Is Zipf a Red Herring?
- Power laws are common in natural systems
- Li (1992) shows that a Zipfian distribution of words can be generated randomly
  - An alphabet of 26 characters plus a blank
  - The blank and every other character are equally likely to be generated
  - Key insights:
    - There are 26 times more possible words of length n+1 than of length n
    - There is a constant ratio by which words of length n are more frequent than words of length n+1
- Nevertheless, the Zipf insight is important to keep in mind when working with text corpora: language modeling is hard because most words are rare.
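A minimal sketch of this random-typing model: to make the pattern visible with only a couple of million generated characters, the alphabet below is shrunk to 5 letters plus a blank rather than the 26 letters the slide describes (that down-scaling is my assumption, not Li's setup):

    import random
    from collections import Counter

    random.seed(0)
    alphabet = "abcde "   # 5 letters plus a blank, all equally likely
                          # (down-scaled from the slide's 26-letter alphabet)

    chars = random.choices(alphabet, k=2_000_000)
    words = "".join(chars).split()
    counts = Counter(words)

    # Even though the "text" is pure noise, the product freq * rank stays
    # within roughly an order of magnitude across three decades of rank.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        if rank in (1, 10, 100, 1000):
            print(f"rank {rank:>4}  {word!r:>8}  freq {freq:>7}  f*r = {freq * rank}")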

Slide 16: Collocations
- Collocation: any turn of phrase or accepted usage where the whole is perceived to have an existence beyond the sum of its parts
  - Compounds (disk drive)
  - Phrasal verbs (make up)
  - Stock phrases (bacon and eggs)
- Another definition:
  - The frequent use of a phrase as a fixed expression, accompanied by certain connotations

Slide 17: Computing Collocations
- Take the most frequent adjacent pairs (see the sketch after this slide)
  - By itself this doesn't yield interesting results
  - Need to normalize for the individual word frequencies within the corpus
- Another tack: retain only pairs with interesting syntactic categories
  - adjective + noun
  - noun + noun
- More on this later!
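A minimal sketch of one common normalization, pointwise mutual information over adjacent pairs (the slides do not name a specific measure; the corpus.txt file name and the frequency cutoff of 10 are illustrative assumptions):

    import math
    import re
    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:   # placeholder corpus file
        tokens = re.findall(r"[a-z']+", f.read().lower())

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def pmi(pair):
        """How much more often the pair co-occurs than expected under independence."""
        w1, w2 = pair
        return math.log2((bigrams[pair] / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))

    # Raw frequency mostly surfaces function-word pairs ("of the", "in the", ...);
    # ranking by PMI over pairs seen a reasonable number of times works better.
    frequent = [p for p, c in bigrams.items() if c >= 10]
    for pair in sorted(frequent, key=pmi, reverse=True)[:20]:
        print(" ".join(pair), bigrams[pair], round(pmi(pair), 2))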

Slide 18: Next Week
- Learn about linguistics!
- Decide on project participation

