SIMS 296a-4: Text Data Mining
Marti Hearst, UC Berkeley SIMS

The Textbook
• Foundations of Statistical Natural Language Processing, by Chris Manning and Hinrich Schuetze
• We'll go through one chapter each week

Chapters to be Covered
1. Introduction (this week)
2. Linguistic Essentials
3. Mathematical Foundations
4. Mathematical Foundations (cont.)
5. Collocations
6. Statistical Inference
7. Word Sense Disambiguation
8. Markov Models
9. Text Categorization
10. Topics in Information Retrieval
11. Clustering
12. Lexical Acquisition

Introduction
• Scientific basis for this inquiry
• Rationalist vs. empiricist approaches to language analysis
  – Justification for the rationalist view: the poverty of the stimulus
  – This objection can be overcome if we assume humans have a general ability to generalize concepts

Introduction
• Competence vs. performance theory of grammar
  – Focus on whether or not sentences are well-formed
  – Syntactic vs. semantic well-formedness
  – Conventionality of expression breaks this notion

Introduction
• Categorical perception
  – Works pretty well for recognizing phonemes
  – But not for larger phenomena like syntax
  – Language change as counter-evidence to strict categorizability of language
    » "kind of" / "sort of" changed parts of speech very gradually
    » They occupied an intermediate syntactic status during the transition
  – Better to adopt a probabilistic view (of cognition as well as of language)

Introduction
• The ambiguity of language
  – Unlike programming languages, natural language is ambiguous unless it is interpreted in terms of all of its parts
    » Sometimes it is truly ambiguous as well
  – Parsing with syntax alone is harder than parsing that also draws on the underlying meaning

Classifying Application Types

Word Token Distribution
• Word tokens are not uniformly distributed in text
  – A small number of very common word types account for about half of all token occurrences
  – About half of the word types occur only once (hapax legomena)
  – ~12% of the running text consists of words occurring three times or fewer
• Thus it is hard to predict the behavior of many of the words in a text.
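These proportions are easy to check on any plain-text corpus. Below is a minimal Python sketch (the function name, the crude tokenizer, and the corpus file in the usage comment are illustrative assumptions, not from the original slides):

from collections import Counter
import re

def token_distribution(text):
    """Summarize how unevenly word tokens are distributed in a text."""
    tokens = re.findall(r"[a-z]+", text.lower())   # crude tokenizer: runs of letters only
    counts = Counter(tokens)
    total = len(tokens)

    # How many of the most frequent word types cover about half of all token occurrences?
    by_freq = sorted(counts.values(), reverse=True)
    covered, top = 0, 0
    while covered < total / 2 and top < len(by_freq):
        covered += by_freq[top]
        top += 1
    print(f"{top} most frequent types cover ~50% of the {total} tokens")

    # Word types that occur exactly once (hapax legomena).
    hapaxes = sum(1 for c in counts.values() if c == 1)
    print(f"{hapaxes / len(counts):.0%} of types occur only once")

    # Running-text mass taken up by words seen three times or fewer.
    rare_mass = sum(c for c in counts.values() if c <= 3)
    print(f"{rare_mass / total:.0%} of the text is words occurring <= 3 times")

# Usage (the file name is illustrative):
# token_distribution(open("tom_sawyer.txt", encoding="utf-8").read())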

Zipf's "Law"
• Rank r = a word's position when words are ordered by frequency of occurrence
• The product of a word's frequency f and its rank r is approximately constant: f × r ≈ k
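A minimal sketch of the same check in Python: tabulate f × r at a few sample ranks and observe that the product stays in the same rough range. The sample ranks and the corpus file name are illustrative:

from collections import Counter
import re

def zipf_check(text, sample_ranks=(1, 10, 100, 1000)):
    """Print frequency * rank at a few ranks; under Zipf's law the product is roughly constant."""
    tokens = re.findall(r"[a-z]+", text.lower())
    ranked = Counter(tokens).most_common()        # [(word, freq), ...] sorted by frequency
    print(f"{'rank':>6} {'word':>12} {'freq':>8} {'f*r':>10}")
    for r in sample_ranks:
        if r <= len(ranked):
            word, freq = ranked[r - 1]
            print(f"{r:>6} {word:>12} {freq:>8} {freq * r:>10}")

# zipf_check(open("tom_sawyer.txt", encoding="utf-8").read())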

Consequences of Zipf
• There are always a few very frequent tokens that are not good discriminators.
  – Called "stop words" in Information Retrieval
  – Usually correspond to the linguistic notion of "closed-class" words
    » English examples: to, from, on, and, the, ...
    » Grammatical classes that don't take on new members
• Typically:
  – A few very common words
  – A middling number of medium-frequency words
  – A large number of very infrequent words
• Medium-frequency words are the most descriptive
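One practical consequence is that a crude stop list and an index vocabulary can be derived from frequency alone. A small sketch, under the assumption that the very top of the ranking serves as the stop list and that hapax-like words are dropped; the cutoffs 100 and 3 are arbitrary illustrative choices:

from collections import Counter
import re

def frequency_bands(text, stop_top=100, rare_max=3):
    """Split the vocabulary into high-, medium-, and low-frequency bands."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    ranked = [w for w, _ in counts.most_common()]
    stop_words = set(ranked[:stop_top])                       # very frequent: poor discriminators
    rare = {w for w, c in counts.items() if c <= rare_max}    # very infrequent: little evidence
    medium = set(counts) - stop_words - rare                  # the most descriptive index terms
    return stop_words, medium, rare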

Word Frequency vs. Resolving Power (from van Rijsbergen, 1979)
• The most frequent words are not (usually) the most descriptive.

Order by Rank vs. by Alphabetical Order

Other Zipfian "Laws"
• Conservation of speaker/hearer effort:
  – The number of meanings m of a word is correlated with its frequency f
  – (minimizing speaker effort alone would favor one word with all meanings; minimizing hearer effort alone would favor a distinct word for every meaning)
  – m is roughly proportional to sqrt(f), i.e., inversely proportional to sqrt(r)
  – Important for word sense disambiguation
• Content words tend to clump together
  – Important for computing term distribution models

Is Zipf a Red Herring?
• Power laws are common in natural systems
• Li (1992) shows that a Zipfian distribution of words can be generated by random text
  – An alphabet of 26 characters plus a blank
  – The blank and each character are equally likely to be generated
  – Key insights:
    » There are 26 times more possible words of length n+1 than of length n
    » There is a constant ratio by which words of length n are more frequent than words of length n+1
• Nevertheless, the Zipf insight is important to keep in mind when working with text corpora: language modeling is hard because most words are rare.
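A small simulation in the spirit of Li (1992): generate "monkey typing" text in which the 26 letters and the space are equally likely, then inspect f × r for the resulting "words". The corpus length, seed, and sample ranks below are arbitrary choices for illustration:

import random
import string
from collections import Counter

def random_monkey_text(n_chars=500_000, seed=0):
    """Text in which each of the 26 letters and the space is equally likely."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + " "
    return "".join(rng.choice(alphabet) for _ in range(n_chars))

def show_rank_frequency(text, sample_ranks=(1, 10, 100, 1000)):
    """Print freq * rank at a few ranks of the 'words' in the text."""
    ranked = Counter(text.split()).most_common()
    for r in sample_ranks:
        if r <= len(ranked):
            word, freq = ranked[r - 1]
            print(f"rank {r:>5}  freq {freq:>7}  f*r {freq * r:>8}")

# Even this random text shows a roughly Zipfian rank-frequency relationship.
show_rank_frequency(random_monkey_text())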

Collocations
• Collocation: any turn of phrase or accepted usage where the whole is perceived to have an existence beyond the sum of its parts
  – Compounds (disk drive)
  – Phrasal verbs (make up)
  – Stock phrases (bacon and eggs)
• Another definition:
  – The frequent use of a phrase as a fixed expression accompanied by certain connotations

Computing Collocations
• Take the most frequent adjacent pairs
  – By itself this doesn't yield interesting results
  – Need to normalize for the individual word frequencies within the corpus
• Another tack: retain only pairs with interesting syntactic categories
  » adjective noun
  » noun noun
• More on this later!
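A hedged sketch of the POS-filter approach (in the spirit of Justeson and Katz's part-of-speech patterns): count adjacent pairs and keep only adjective-noun and noun-noun patterns. It assumes NLTK is installed along with its tokenizer and tagger models; none of this code is from the original slides:

from collections import Counter
import nltk  # assumes nltk plus its tokenizer and POS-tagger models are available

# Keep only adjacent pairs whose part-of-speech pattern looks term-like.
PATTERNS = {("JJ", "NN"), ("NN", "NN")}   # adjective-noun, noun-noun

def collocation_candidates(text, top_n=20):
    """Rank adjacent word pairs by frequency, filtered by POS pattern."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    pairs = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        # Compare only the coarse two-letter tag prefix (NN also covers NNS, NNP, ...).
        if (t1[:2], t2[:2]) in PATTERNS:
            pairs[(w1.lower(), w2.lower())] += 1
    return pairs.most_common(top_n)

# collocation_candidates(open("corpus.txt", encoding="utf-8").read())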

Next Week
• Learn about linguistics!
• Decide on project participation