# Probabilistic Language Processing (Chapter 23)



## Probabilistic Language Models

- Goal: define a probability distribution over a set of strings.
- Unigram, bigram, n-gram models.
- Count using a corpus, but the counts need smoothing:
  - add-one (Laplace)
  - linear interpolation
- Evaluate with the perplexity measure.
- Example application: segment words without spaces (e.g. "segmentwordswithoutspaces") with the Viterbi algorithm.
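The bullets above can be sketched concretely. This is a minimal bigram model with add-one smoothing and a perplexity computation; the tiny corpus is an illustrative assumption, not from the chapter.

```python
from collections import Counter
import math

# Toy corpus (assumption for illustration)
corpus = "the cat sat on the mat the cat ate".split()
vocab = set(corpus)
V = len(vocab)

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_addone(w2, w1):
    # Add-one (Laplace) smoothed bigram probability P(w2 | w1)
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

def perplexity(words):
    # Perplexity = exp of the negative average log-probability
    # over the bigrams of the test string
    logp = sum(math.log(p_addone(w2, w1)) for w1, w2 in zip(words, words[1:]))
    return math.exp(-logp / (len(words) - 1))
```

Lower perplexity means the model finds the string less surprising; a smoothed model never assigns zero probability to an unseen bigram.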

## PCFGs

- Rewrite rules have probabilities attached.
- The probability of a string is the sum of the probabilities of its parse trees.
- Context-freedom means no lexical constraints.
- PCFGs prefer short sentences.
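To make the second bullet concrete: a parse tree's probability is the product of the probabilities of the rules it uses, and a string's probability sums this over all of its parse trees. A minimal sketch with a toy PCFG (the grammar and its rule probabilities are assumptions for illustration):

```python
# Toy PCFG: maps (head, right-hand side) to the rule's probability
pcfg = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.4,
    ("NP", ("fish",)): 0.6,
    ("VP", ("eats",)): 0.3,
    ("VP", ("eats", "fish")): 0.7,
}

def tree_prob(tree):
    """A tree is (head, children); leaves are plain strings.
    The tree's probability is the product of the rule probabilities used."""
    head, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = pcfg[(head, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

t = ("S", [("NP", ["she"]), ("VP", ["eats", "fish"])])
# P(tree) = P(S -> NP VP) * P(NP -> she) * P(VP -> eats fish)
```

For an ambiguous string, the string probability would sum `tree_prob` over every parse.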

## Learning PCFGs

- Parsed corpus: count the trees.
- Unparsed corpus:
  - Rule structure known: use EM (the inside-outside algorithm).
  - Rules unknown: assume Chomsky normal form... problems remain.

## Information Retrieval

- Goal: Google. Find documents relevant to the user's needs.
- An IR system has a document collection, a query in some language, a set of results, and a presentation of the results.
- Ideally, we would parse documents into a knowledge base... too hard.

## IR 2

- Boolean keyword model: each document is in or out.
  - Problem: a single bit of "relevance".
  - Boolean combinations are a bit mysterious to users.
- How to compute P(R=true | D, Q)? Estimate a language model for each document, then compute the probability of the query given that model.
- Can rank documents by the odds ratio P(r | D, Q) / P(¬r | D, Q).
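The language-model ranking idea can be sketched as follows: fit a unigram model to each document, score a query by its log-likelihood under that model, and rank. The two-document collection and the mixing constant `alpha` are assumptions for illustration.

```python
from collections import Counter
import math

# Hypothetical mini-collection (assumption)
docs = {
    "d1": "probabilistic models of language".split(),
    "d2": "retrieval of relevant documents".split(),
}
vocab = {w for d in docs.values() for w in d}

def query_loglik(query, doc, alpha=0.5):
    # Unigram document model, mixed with a uniform background
    # distribution so unseen words don't get zero probability
    counts = Counter(doc)
    def p(w):
        return alpha * counts[w] / len(doc) + (1 - alpha) / len(vocab)
    return sum(math.log(p(w)) for w in query)

def rank(query):
    # Rank documents by P(query | document model), highest first
    return sorted(docs, key=lambda d: query_loglik(query, docs[d]), reverse=True)
```

The uniform-background mixture is one simple smoothing choice; any of the smoothing schemes from the language-model slide would also work here.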

## IR 3

- For this, we need a model of how queries are related to documents.
- Bag of words: frequencies of words in the document, naive Bayes.
- Good worked example on pp. 842-843.

## Evaluating IR

- Precision: the proportion of returned results that are relevant.
- Recall: the proportion of relevant documents that appear in the results.
- ROC curve (there are several varieties): the standard form plots the true-positive rate against the false-positive rate; some presentations instead plot false negatives vs. false positives.
- More "practical" measures for the web: reciprocal rank of the first relevant result, or just "time to answer".

## IR Refinements

- Case folding
- Stems
- Synonyms
- Spelling correction
- Metadata, e.g. keywords

## IR Presentation

- Give the list in order of relevance; deal with duplicates.
- Cluster results into classes:
  - agglomerative clustering
  - k-means
- How to describe automatically generated clusters? A word list? The title of the centroid document?

## IR Implementation

- CSC172! (data structures)
- Lexicon with a "stop list"; an "inverted" index recording where each word occurs.
- Match with vectors: the vector of word frequencies in the document, dotted with the query terms.
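The inverted index and dot-product matching above fit in a few lines. This sketch assumes a toy stop list and two hypothetical documents:

```python
from collections import Counter, defaultdict

STOP = {"the", "a", "on", "of"}  # toy stop list (assumption)

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "a dog chased the cat".split(),
}

# Inverted index: word -> {doc id: frequency}, skipping stop words
index = defaultdict(Counter)
for doc_id, words in docs.items():
    for w in words:
        if w not in STOP:
            index[w][doc_id] += 1

def score(query):
    # Dot product of the query's term vector with each document's
    # frequency vector, computed via the inverted index: only documents
    # containing some query term are ever touched
    scores = Counter()
    for w in query:
        for doc_id, freq in index.get(w, {}).items():
            scores[doc_id] += freq
    return scores
```

The point of the inverted index is exactly this last loop: scoring is proportional to the posting lists of the query terms, not to the size of the collection.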

## Information Extraction

- Goal: create database entries from documents.
- Emphasis on massive data, speed, and stylized expressions.
- Regular-expression grammars are OK if the text is stylized enough.
- Cascaded finite-state transducers: stages of grouping and structure-finding.
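A minimal sketch of the regular-expression approach on stylized text; the sentence template and the pattern are illustrative assumptions:

```python
import re

# Hypothetical stylized stock-report text (assumption)
text = "IBM rose to $95.50 while Acme fell to $12.00."

# Pattern: a capitalized company name, a movement verb, a dollar amount
pattern = re.compile(r"([A-Z][A-Za-z]+) (?:rose|fell) to \$(\d+\.\d{2})")

# Each match becomes a (company, price) database entry
entries = [(company, float(price)) for company, price in pattern.findall(text)]
```

This only works because the text is stylized; freer prose is where the cascaded transducer stages earn their keep.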

## Machine Translation Goals

- Rough translation (e.g. p. 851)
- Restricted domain (mergers, weather)
- Pre-edited input (Caterpillar or Xerox English)
- Literary translation: not yet!
- Interlingua, or a canonical semantic representation such as Conceptual Dependency
- Basic problem: different languages, different concept categories

## MT in Practice

- Transfer: uses a database of rules for translating small units of language.
- Memory-based: memorize sentence pairs.
- Good diagram on p. 853.

## Statistical MT

- Start from a bilingual corpus; find the most likely translation given that corpus.
- argmax_F P(F | E) = argmax_F P(E | F) P(F)
  - P(F) is the language model
  - P(E | F) is the translation model
- Lots of interesting problems, e.g. fertility ("home" vs. "à la maison").
- Horribly drastic simplifications and hacks work pretty well!
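The argmax on this slide can be sketched over a tiny enumerable candidate set. All probabilities here are made-up assumptions chosen only to show the noisy-channel decision rule, not real model estimates:

```python
# Candidate French translations F for the English input E = "home".
# P(F): the language model should prefer fluent French word order.
language_model = {
    "à la maison": 0.3,
    "maison à la": 0.01,
}
# P(E | F): the translation model alone can't distinguish word orders,
# since both candidates contain the same words (the fertility issue).
translation_model = {
    "à la maison": 0.6,
    "maison à la": 0.6,
}

def best_translation():
    # argmax_F P(E | F) * P(F): the decision rule from the slide
    return max(language_model,
               key=lambda f: translation_model[f] * language_model[f])
```

The division of labor is the point: the translation model only has to get the words roughly right, because the language model P(F) supplies fluency.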

## Learning and MT

- Statistical MT needs a language model, a fertility model, a word-choice model, and an offset model.
- Millions of parameters.
- Estimate them by counting and EM.

