
1 Natural Language Processing Spring 2007 V. “Juggy” Jagannathan

2 Course book: Foundations of Statistical Natural Language Processing by Christopher Manning & Hinrich Schütze

3 Chapter 1 Introduction January 8, 2007

4 Linguistic vs. Statistical
Rationale for a statistical approach:
–Linguistic approaches that attempt to parse language purely from grammar have failed. Edward Sapir's famous quote: "All grammars leak."
–Statistical approaches have proven practical for asking "What are the common patterns that occur in language use?"

5 Rationalist vs. Empiricist
Roughly the difference between "nature" and "nurture":
–Rationalist: the innate intelligence of humans is inherited, and hence a computational system must be loaded with prior knowledge to be effective.
–Empiricist: a lot can be learned by examining actual language use, and hence statistical approaches that learn from a "corpus" are germane.
Corpus – a body of text; corpora – the plural, a collection of such bodies.

6 Scientific content: questions that linguistics should answer
–What kinds of things do people say?
–What do these things say/ask/request about the world?
Traditional linguistic approach:
–Competence grammar and grammaticality determination
–But this is hard: trying to decide whether sentences are grammatical or not.
–Some examples on page 10 – see the next slide

7 Some examples of sentences

8 Non-categorical phenomena in language
–Language usage changes over time.
–Some words defy categorization into rigid linguistic categories: "near", for example, can be an adjective, an adverb, or both simultaneously.
–Example of change: "kind of" and "sort of".
–Such usage change can be tracked better with statistical NLP approaches.

9 Language and cognition as probabilistic phenomena
–One view of the world – the Chomskyan line of thinking – is that probability and statistics are inappropriate for determining "grammaticality" and understanding the "meaning" of sentences.
–The statistical NLP viewpoint is that "grammar" is not necessarily required in order to understand language and develop practical solutions.

10 Some parses of the sentence: “Our company is training workers”
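The book illustrates this ambiguity with several syntactic analyses of the sentence. As a minimal sketch (not the book's grammar), the toy NLTK context-free grammar below is an assumption constructed so that the same five words receive two analyses: "is" as an auxiliary of the verb "training", and "is" as a main verb with the noun compound "training workers" as its complement. NLTK's availability and the grammar itself are assumptions, not part of the slides.

```python
# Hypothetical toy grammar (not from the book) under which
# "our company is training workers" is syntactically ambiguous.
import nltk

grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> Det Nom | Nom
  Nom -> N Nom | N
  VP  -> Aux V NP | V NP
  Det -> 'our'
  N   -> 'company' | 'training' | 'workers'
  V   -> 'is' | 'training'
  Aux -> 'is'
""")

parser = nltk.ChartParser(grammar)
sentence = "our company is training workers".split()

# Prints two trees: the progressive reading ("is training" + object "workers")
# and the copular reading ("is" + noun compound "training workers").
for tree in parser.parse(sentence):
    print(tree)
```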

11 The ambiguity of language: why NLP is hard
–Linguists like to parse sentences to determine things like who did what to whom.
–Parsing sentences is hard: there are 455 parses of the sentence
–"List the sales of the products produced in 1973 with the products produced in 1972"
–Hand-built AI approaches to understanding meaning have failed: they have been shown to be brittle and non-scalable.

12 Dirty Hands
–A variety of corpora are available for statistical NLP research.
–Running example: the text of Tom Sawyer.

13 Common Words in Tom Sawyer

14 Word Counts
Some statistics from Tom Sawyer:
–Number of word tokens: 71,370
–Number of word types (unique words): 8,018
–Average frequency: 71,370 / 8,018 ≈ 8.9
Some words are very common:
–12 words appear more than 700 times each
–The 100 most common words account for 50.9% of the text
Yet 49.8% of the word types appear only once in the corpus – "hapax legomena," Greek for "read only once."
How can statistics help us understand the meaning of sentences if half the word types appear only once?
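A minimal counting sketch of these statistics, assuming a plain-text copy of Tom Sawyer in a file such as `tom_sawyer.txt` (a hypothetical path) and a crude regex tokenizer, so the exact numbers will differ somewhat from the slide's 71,370 tokens and 8,018 types:

```python
# Sketch: token/type counts, average frequency, and hapax legomena for a text file.
import re
from collections import Counter

def word_stats(path):
    text = open(path, encoding="utf-8").read().lower()
    tokens = re.findall(r"[a-z']+", text)      # word tokens (crude tokenizer)
    counts = Counter(tokens)                   # word type -> frequency
    n_tokens, n_types = len(tokens), len(counts)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    print(f"tokens: {n_tokens}  types: {n_types}  "
          f"average frequency: {n_tokens / n_types:.1f}")
    print(f"hapax legomena: {hapaxes} ({100 * hapaxes / n_types:.1f}% of types)")
    return counts

# counts = word_stats("tom_sawyer.txt")   # hypothetical file name
```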

15 Frequency of frequencies [table: number of word types occurring at each frequency; 8,018 word types in total]
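A sketch of how the slide's table could be reproduced: for each frequency f, count how many word types occur exactly f times. It reuses the `counts` Counter from the word_stats sketch above (an assumption, not part of the slides).

```python
# Sketch: the "frequency of frequencies" table.
from collections import Counter

def freq_of_freqs(counts, up_to=10):
    fof = Counter(counts.values())             # frequency -> number of word types
    for f in range(1, up_to + 1):
        print(f"word types occurring {f} time(s): {fof.get(f, 0)}")
    return fof
```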

16 Zipf's Law – empirical evaluation of Zipf's law for Tom Sawyer

17–19 [No transcribed text]
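Zipf's law states that a word's frequency f is roughly proportional to 1/r, where r is its frequency rank, so the product f × r should stay roughly constant across ranks. Below is a hedged sketch of that empirical check, again reusing the `counts` Counter from the word_stats sketch (an assumption, not part of the slides).

```python
# Sketch: inspect f * r at a few ranks; under Zipf's law the product is roughly constant.
def zipf_table(counts, ranks=(1, 2, 3, 10, 100, 1000, 8000)):
    by_freq = counts.most_common()             # [(word, freq)] sorted by frequency
    for r in ranks:
        if r <= len(by_freq):
            word, f = by_freq[r - 1]
            print(f"rank {r:>5}  freq {f:>6}  f*r = {f * r:>7}  ({word})")
```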

20 Basic Insight from Power Laws
–What makes frequency-based approaches to language hard is that almost all words are rare.
–Zipf's law is a good way to encapsulate this insight.

21 Collocations – collocations in a New York Times corpus, with and without filtering
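The book's tables contrast raw bigram frequency, which is dominated by function-word pairs, with candidates filtered by part-of-speech patterns such as adjective–noun and noun–noun. As a lightweight stand-in for that POS filter, the sketch below (an assumption, not the book's method) simply drops bigrams containing common function words, which yields a similar before/after contrast on any token list such as the one returned by the word_stats sketch:

```python
# Sketch: most frequent bigrams, with and without a crude function-word filter.
from collections import Counter

STOPWORDS = {"the", "of", "to", "in", "and", "a", "is", "that", "for", "on"}

def bigram_collocations(tokens, top=10):
    bigrams = Counter(zip(tokens, tokens[1:]))
    unfiltered = bigrams.most_common(top)
    filtered = [(bg, c) for bg, c in bigrams.most_common()
                if not (set(bg) & STOPWORDS)][:top]
    return unfiltered, filtered
```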

22 Concordances Key Word In Context (KWIC)
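A minimal KWIC sketch: print every occurrence of a keyword with a fixed window of surrounding tokens, which is essentially what a concordance tool displays. The token list, keyword, and window size are assumptions.

```python
# Sketch: Key Word In Context (KWIC) display.
def kwic(tokens, keyword, window=5):
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>40}  [{tok}]  {right}")

# Example: kwic(tokens, "showed")
```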

23 [No transcribed text]

