# Statistical Natural Language Processing Advanced AI - Part II Luc De Raedt University of Freiburg WS 2005/2006 Many slides taken from Helmut Schmid.


Topic

Statistical Natural Language Processing applies Machine Learning / Statistics to Natural Language Processing.

- Learning: the ability to improve one’s behaviour at a specific task over time; it involves the analysis of data (statistics).

The course follows parts of the book:

- Foundations of Statistical Natural Language Processing (Manning and Schütze), MIT Press, 1999.

Rationalism versus Empiricism

Rationalist
- Noam Chomsky: innate language structures
- AI: hand-coded NLP
- Dominant view 1960-1985
- Cf. e.g. Steven Pinker’s The Language Instinct (popular science book)

Empiricist
- The ability to learn is innate
- AI: language is learned from corpora
- Dominant 1920-1960 and becoming increasingly important

Rationalism versus Empiricism

Noam Chomsky:
- “But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.”

Fred Jelinek (IBM, 1988):
- “Every time a linguist leaves the room the recognition rate goes up.”
- (Alternative: “Every time I fire a linguist the recognizer improves.”)

This course

Empiricist approach
- The focus will be on probabilistic models for learning natural language

No time to treat natural language in depth!
- (though this would be quite useful and interesting)
- Deserves a full course by itself

Covered in more depth in Logic, Language and Learning (SS 05, probably SS 06)

Ambiguity

NLP and Statistics

Statistical Disambiguation
- Define a probability model for the data
- Compute the probability of each alternative
- Choose the most likely alternative
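The three disambiguation steps can be sketched in a few lines of Python. This is only an illustration, not the model used in the course: the senses of the ambiguous word "bank", the naive-Bayes-style scoring, and every number in `PRIOR`, `LIKELIHOOD` and `UNSEEN` are assumptions invented for the example. A real model would estimate these values from a corpus.

```python
import math

# Step 1: a probability model. All numbers are made up for illustration.
PRIOR = {"finance": 0.7, "river": 0.3}          # P(sense)
LIKELIHOOD = {                                   # P(context word | sense)
    "finance": {"money": 0.05, "water": 0.001},
    "river": {"money": 0.002, "water": 0.04},
}
UNSEEN = 1e-4  # assumed probability for context words missing from the table


def score(sense, context):
    """Step 2: log-probability of one alternative given the context words."""
    logp = math.log(PRIOR[sense])
    for word in context:
        logp += math.log(LIKELIHOOD[sense].get(word, UNSEEN))
    return logp


def disambiguate(context):
    """Step 3: choose the most likely alternative."""
    return max(PRIOR, key=lambda sense: score(sense, context))


print(disambiguate(["water"]))
print(disambiguate(["money"]))
```

Working in log space avoids numerical underflow when many context words are multiplied together; the choice of the best alternative is unaffected because the logarithm is monotone.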

NLP and Statistics

Statistical methods deal with uncertainty. They predict the future behaviour of a system based on the behaviour observed in the past.
- Statistical methods require training data.

The data in Statistical NLP are the corpora.

Corpora

- Corpus: a text collection for linguistic purposes
- Tokens: how many words are contained in Tom Sawyer? 71,370
- Types: how many different words are contained in Tom Sawyer? 8,018
- Hapax legomena: words appearing only once
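Tokens, types and hapax legomena can be counted with a short Python sketch. The tokenizer below (lowercasing, then keeping runs of letters and apostrophes) is an assumption made for the example; the figures quoted for Tom Sawyer depend on the tokenization actually used, which the slides do not specify.

```python
import re
from collections import Counter


def corpus_stats(text):
    """Return (number of tokens, number of types, sorted hapax legomena)."""
    # Assumed tokenization: lowercase, words are runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    hapax = sorted(word for word, c in counts.items() if c == 1)
    return len(tokens), len(counts), hapax


sample = "the cat saw the dog and the dog saw a bird"
n_tokens, n_types, hapax = corpus_stats(sample)
print(n_tokens, n_types, hapax)
```

For the sample sentence this gives 11 tokens, 7 types, and the hapax legomena `a`, `and`, `bird` and `cat`.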

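Word Counts

- The most frequent words are function words

| word | freq | word | freq |
| --- | --- | --- | --- |
| the | 3332 | in | 906 |
| and | 2972 | that | 877 |
| a | 1775 | he | 877 |
| to | 1725 | I | 783 |
| of | 1440 | his | 772 |
| was | 1161 | you | 686 |
| it | 1027 | Tom | 679 |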

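Word Counts

How many words appear f times?

| f | n_f (number of words) |
| --- | --- |
| 1 | 3993 |
| 2 | 1292 |
| 3 | 664 |
| 4 | 410 |
| 5 | 243 |
| 6 | 199 |
| 7 | 172 |
| 8 | 131 |
| 9 | 82 |
| 10 | 91 |
| 11-50 | 540 |
| 51-100 | 99 |
| > 100 | 102 |

- About half of the word types occur just once.
- About half of the text consists of the 100 most common words.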
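A frequency-of-frequencies table like the one on this slide can be derived from raw counts in two steps: count each word, then count how many words share each count. The sketch below uses a toy token list; the function name `frequency_of_frequencies` is chosen for this example only.

```python
from collections import Counter


def frequency_of_frequencies(tokens):
    """Map each frequency f to the number of distinct words occurring f times."""
    word_counts = Counter(tokens)          # word -> f
    return Counter(word_counts.values())   # f -> n_f


tokens = "a b a c a b d".split()
fof = frequency_of_frequencies(tokens)
for f in sorted(fof):
    print(f, fof[f])
```

On the slide's figures, 3993 of the 8018 word types occur once, which is about 49.8% and matches the claim that roughly half of the types are hapax legomena.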

Word Counts (Brown corpus)

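Zipf’s Law

| word | f | r | f*r |
| --- | --- | --- | --- |
| the | 3332 | 1 | 3332 |
| and | 2972 | 2 | 5944 |
| a | 1775 | 3 | 5325 |
| he | 877 | 10 | 8770 |
| but | 410 | 20 | 8400 |
| be | 294 | 30 | 8820 |
| there | 222 | 40 | 8880 |
| one | 172 | 50 | 8600 |
| about | 158 | 60 | 9480 |
| more | 138 | 70 | 9660 |
| never | 124 | 80 | 9920 |
| Oh | 116 | 90 | 10440 |
| two | 104 | 100 | 10400 |
| turned | 51 | 200 | 10200 |
| you’ll | 30 | 300 | 9000 |
| name | 21 | 400 | 8400 |
| comes | 16 | 500 | 8000 |
| group | 13 | 600 | 7800 |
| lead | 11 | 700 | 7700 |
| friends | 10 | 800 | 8000 |
| begin | 9 | 900 | 8100 |
| family | 8 | 1000 | 8000 |
| brushed | 4 | 2000 | 8000 |
| sins | 2 | 3000 | 6000 |
| Could | 2 | 4000 | 8000 |
| Applausive | 1 | 8000 | 8000 |

- Here f is the frequency of a word and r is its rank in the frequency-sorted word list.
- Zipf’s Law: f ~ 1/r (f*r = const)
- Underlying principle: minimize effort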
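Zipf's law can be checked on any token list by ranking words by frequency and inspecting the product of rank and frequency. The sketch below is a minimal illustration: `rank_frequency` is a name invented for this example, and `SLIDE_ROWS` copies a few (word, f, r) rows from the slide's table rather than recomputing them from the text of Tom Sawyer.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency), most frequent first."""
    ranked = Counter(tokens).most_common()
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]


# A few (word, f, r) rows taken from the slide.
SLIDE_ROWS = [
    ("the", 3332, 1),
    ("and", 2972, 2),
    ("he", 877, 10),
    ("two", 104, 100),
    ("family", 8, 1000),
    ("Applausive", 1, 8000),
]
products = {word: f * r for word, f, r in SLIDE_ROWS}

for row in rank_frequency("a a a a b b c".split()):
    print(row)
print(products)
```

If f*r were exactly constant the products would all be equal; in the slide's data they stay within the same order of magnitude for most ranks and deviate most at the very top of the list.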

Some probabilistic models

N-grams
- Predicting the next word
  - Artificial intelligence and machine ….
  - Statistical natural language ….

Probabilistic
- Regular (Markov Models)
- Hidden Markov Models
- Conditional Random Fields
- Context-free grammars
- (Stochastic) Definite Clause Grammars
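Next-word prediction with the simplest N-gram model, a bigram model, can be sketched as follows. The three training sentences and the function names are assumptions for this example, and the estimate is plain relative frequency without smoothing, so unseen words yield no prediction.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words follow it in the training data."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model


def predict_next(model, word):
    """Return (most likely next word, estimated P(next | word)) or None."""
    followers = model.get(word.lower())
    if not followers:
        return None
    nxt, count = followers.most_common(1)[0]
    return nxt, count / sum(followers.values())


corpus = [
    "artificial intelligence and machine learning",
    "statistical machine learning",
    "machine translation is hard",
]
model = train_bigrams(corpus)
print(predict_next(model, "machine"))
```

In this toy corpus "machine" is followed by "learning" twice and by "translation" once, so the model predicts "learning" with estimated probability 2/3.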

Illustration

Wall Street Journal Corpus
- 3,000,000 words
- The correct parse tree for each sentence is known
  - Constructed by hand
  - Can be used to derive stochastic context-free grammars (SCFGs)
  - SCFGs assign probabilities to parse trees
- Compute the most probable parse tree
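An SCFG assigns a parse tree the product of the probabilities of the rules used in it. The sketch below illustrates that computation with a hypothetical toy grammar: the rules, their probabilities, and the tuple encoding of trees are all assumptions for this example and are not derived from the Wall Street Journal corpus. Picking the best tree is done here by scoring an explicit list of candidates; practical parsers use dynamic programming instead of enumerating trees.

```python
# Hypothetical toy SCFG: (left-hand side, right-hand side) -> probability.
# Probabilities of rules sharing a left-hand side sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.4,
    ("NP", ("fish",)): 0.6,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("V", ("eats",)): 1.0,
}


def tree_prob(tree):
    """Probability of a tree encoded as (label, child, ...); leaves are strings."""
    if isinstance(tree, str):
        return 1.0  # a word contributes no rule of its own
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULES.get((label, rhs), 0.0)  # rules outside the grammar get 0
    for child in children:
        prob *= tree_prob(child)
    return prob


def most_probable(trees):
    """Choose the candidate tree with the highest probability."""
    return max(trees, key=tree_prob)


tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
print(tree_prob(tree))
```

For the example tree the rules used are S → NP VP (1.0), NP → she (0.4), VP → V NP (0.7), V → eats (1.0) and NP → fish (0.6), so the product is 0.168 up to floating-point rounding.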

Conclusions

- Overview of some probabilistic and machine learning methods for NLP
- Also very relevant to bioinformatics!
  - Analogy between parsing
    - a sentence
    - a biological string (DNA, protein, mRNA, …)
