
1 Chapter 23: Probabilistic Language Models April 13, 2004

2 Corpus-Based Learning
–Information Retrieval
–Information Extraction
–Machine Translation

3 23.1 Probabilistic Language Models
There are several advantages:
–Can be trained from data
–Robust (accept any sentence)
–Reflect the fact that not all speakers agree on which sentences are part of a language
–Can be used for disambiguation

4 Unigram model: P(w_i)
Bigram model: P(w_i | w_{i-1})
Trigram model: P(w_i | w_{i-2}, w_{i-1})
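As a sketch, the maximum-likelihood versions of these models can be estimated by counting; this is a toy illustration, not code from the chapter, and `ngram_probs` is a hypothetical helper name:

```python
from collections import Counter

def ngram_probs(words, n):
    """Maximum-likelihood n-gram estimates P(w_i | w_{i-n+1} ... w_{i-1})."""
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    gram_counts = Counter(grams)
    context_counts = Counter(g[:-1] for g in grams)  # counts of each conditioning context
    return {g: gram_counts[g] / context_counts[g[:-1]] for g in gram_counts}

words = "the cat sat on the mat".split()
print(ngram_probs(words, 1)[("the",)])        # 2/6 under the unigram model
print(ngram_probs(words, 2)[("the", "cat")])  # 0.5: "the" occurs twice, once followed by "cat"
```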

5 Smoothing
Problem: many pairs (triples, etc.) of words never occur in the training text, so their observed counts are zero.
N: number of words in the corpus
B: number of possible bigrams
c: actual count of a bigram
Add-one smoothing: P = (c + 1) / (N + B)
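The add-one estimate can be written directly; a minimal sketch, with made-up corpus sizes:

```python
def add_one(c, N, B):
    """Add-one (Laplace) smoothed bigram probability: (c + 1) / (N + B),
    with c the bigram's count, N the words in the corpus, B the possible bigrams."""
    return (c + 1) / (N + B)

# An unseen bigram (c = 0) in a 10,000-word corpus with 100^2 possible bigrams
# still gets a small nonzero probability:
print(add_one(0, 10_000, 100 * 100))  # 5e-05
```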

6 Smoothing
Linear interpolation smoothing:
P(w_i | w_{i-2}, w_{i-1}) = c_3 P(w_i | w_{i-2}, w_{i-1}) + c_2 P(w_i | w_{i-1}) + c_1 P(w_i)
where c_1 + c_2 + c_3 = 1
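Linear interpolation is just a weighted sum of the three estimates. In this sketch the weights c_1..c_3 are illustrative defaults, not values from the chapter; in practice they are tuned on held-out data:

```python
def interpolate(p_tri, p_bi, p_uni, c3=0.6, c2=0.3, c1=0.1):
    """Linear interpolation smoothing; the weights must sum to 1."""
    assert abs((c1 + c2 + c3) - 1.0) < 1e-9
    return c3 * p_tri + c2 * p_bi + c1 * p_uni

# The trigram was never seen (estimate 0), but the bigram and unigram back it off:
print(round(interpolate(0.0, 0.04, 0.01), 4))  # 0.013
```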

7 Segmentation
The task is to find the word boundaries in a text with no spaces.
P("with") = 0.2, P("out") = 0.1, so P("with out") = 0.02 (unigram model)
P("without") = 0.05, so the single word wins
Figure 23.1: Viterbi-based segmentation algorithm
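A Viterbi-style segmenter can be sketched as a dynamic program over split points. The tiny lexicon below reuses the slide's numbers; everything else is illustrative, not the algorithm of Figure 23.1 verbatim:

```python
import math

def segment(text, probs):
    """Most probable segmentation under a unigram model, via dynamic programming.
    best[i] holds the log-probability of the best segmentation of text[:i]."""
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)  # back[i]: start index of the last word ending at i
    for i in range(1, n + 1):
        for j in range(i):
            word = text[j:i]
            if word in probs and best[j] + math.log(probs[word]) > best[i]:
                best[i] = best[j] + math.log(probs[word])
                back[i] = j
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

probs = {"with": 0.2, "out": 0.1, "without": 0.05}
print(segment("without", probs))  # ['without'], since 0.05 > 0.2 * 0.1
```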

8 Probabilistic CFG (PCFG)
N-gram models have no notion of grammar at distances greater than n.
Figure 23.2: PCFG example
Figure 23.3: PCFG parse
Problem: context-free, so rule probabilities cannot depend on context
Problem: preference for short sentences

9 Learning PCFG Probabilities
Parsed data: straightforward (count rule occurrences in the parsed corpus)
Unparsed data: two challenges
–Learning the structure of the grammar rules. A Chomsky Normal Form bias can be used (X → Y Z, X → t). Something similar to SEQUITUR can be used.
–Learning the probabilities associated with each rule (inside-outside algorithm, based on dynamic programming)

10 23.2 Information Retrieval
Components of an IR system:
–Document collection
–Query posed in a query language
–Result set
–Presentation of the result set

11 Boolean Keyword Model
Boolean queries; each word in a document is treated as a Boolean feature.
Drawbacks:
–Each word contributes only a single bit of relevance
–Boolean logic can be difficult for the average user to use correctly

12 General Framework
r: Boolean random variable that is true when the document is relevant
D: document
Q: query
Rank documents by P(r | D, Q) in decreasing order.

13 Language Modeling
P(r | D, Q) = P(D, Q | r) P(r) / P(D, Q)            (Bayes' rule)
            = P(Q | D, r) P(D | r) P(r) / P(D, Q)   (chain rule)
            = P(Q | D, r) P(r | D) P(D) / P(D, Q)   (Bayes' rule on P(D | r) P(r))
Rather than computing this directly, maximize the odds ratio P(r | D, Q) / P(¬r | D, Q).

14 Language Modeling
P(r | D, Q) / P(¬r | D, Q) = P(Q | D, r) P(r | D) / (P(Q | D, ¬r) P(¬r | D))
Eliminate P(Q | D, ¬r): if a document is irrelevant to a query, then knowing the document won't help determine the query, so this factor is the same for every document.
∝ P(Q | D, r) P(r | D) / P(¬r | D)

15 Language Modeling
P(r | D) / P(¬r | D) is a query-independent measure of document quality. It can be estimated from references to the document, the recency of the document, etc.
P(Q | D, r) = ∏_j P(Q_j | D, r), where each Q_j is a word in the query. Figure 23.4.
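The product over query words suggests a simple document scorer. This is a hedged sketch of that factorization: the add-alpha smoothing and the vocabulary size are assumptions for illustration, not details from the chapter:

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, alpha=0.5, vocab_size=10_000):
    """log P(Q | D, r) under a unigram model of the document, with add-alpha
    smoothing so unseen query words do not zero out the whole product."""
    counts = Counter(doc)
    n = len(doc)
    return sum(math.log((counts[w] + alpha) / (n + alpha * vocab_size))
               for w in query)

doc = "probabilistic language models can be trained from data".split()
# Higher (less negative) score means a better match:
print(query_log_likelihood(["language", "models"], doc) >
      query_log_likelihood(["boolean", "keyword"], doc))  # True
```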

16 Evaluating IR Systems
Precision: proportion of documents in the result set that are actually relevant.
Recall: proportion of relevant documents in the collection that appear in the result set.
Average reciprocal rank.
Time to answer: length of time for the user to find the desired answer.
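Precision and recall follow directly from the two sets; the document ids below are purely illustrative:

```python
def precision_recall(result_set, relevant):
    """Precision and recall for one query, given sets of document ids."""
    result_set, relevant = set(result_set), set(relevant)
    hits = len(result_set & relevant)
    return hits / len(result_set), hits / len(relevant)

# 4 documents returned, 2 of them relevant; 6 relevant documents exist overall:
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7, 8})
print(p, r)  # precision 0.5, recall 1/3
```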

17 IR Refinements
Stemming: can help recall, can hurt precision.
Case folding.
Synonyms: use a bigram model.
Spelling correction.
Metadata.

18 Result Sets
Relevance feedback from the user.
Document classification.
Document clustering.
–K-means clustering:
1. Pick k documents at random as category seeds.
2. Assign every document to the closest category.
3. Compute the mean of each cluster and use these means as the new seeds.
4. Go to step 2 until convergence occurs.
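The four K-means steps can be sketched directly. This toy version uses squared Euclidean distance and a fixed iteration cap instead of a convergence test; it is an illustration, not code from the chapter:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """K-means clustering following the 4 steps on the slide."""
    rng = random.Random(seed)
    means = [list(v) for v in rng.sample(vectors, k)]       # step 1: random seeds
    for _ in range(iters):                                  # step 4: repeat
        clusters = [[] for _ in range(k)]
        for v in vectors:                                   # step 2: assign to closest
            i = min(range(k), key=lambda i: sum((a - b) ** 2
                    for a, b in zip(v, means[i])))
            clusters[i].append(v)
        for i, c in enumerate(clusters):                    # step 3: recompute means
            if c:
                means[i] = [sum(col) / len(c) for col in zip(*c)]
    return means

docs = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
print(kmeans(docs, 2))  # prints the two learned cluster means
```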

19 Implementing IR Systems
Lexicon: given a word, return its location in the inverted index. Stop words are often omitted.
Inverted index: for each word, a list of (document, count) pairs.
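An inverted index of (document, count) pairs is a few lines with a dictionary; `build_index` is a hypothetical helper, and stop-word removal is omitted for brevity:

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Inverted index mapping word -> list of (doc_id, count) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for word, count in Counter(text.split()).items():
            index[word].append((doc_id, count))
    return index

index = build_index(["to be or not to be", "not a problem"])
print(index["not"])  # [(0, 1), (1, 1)]
print(index["to"])   # [(0, 2)]
```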

20 Vector Space Model
Used more often in practice than the probabilistic model.
Documents are represented as vectors of unigram word frequencies.
A query is represented as a vector of 0s and 1s, e.g. [0 1 1 0 0].
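Vector space retrieval typically ranks documents by the cosine of the angle between the document and query vectors; the cosine measure is standard practice, though not spelled out on the slide:

```python
import math

def cosine(d, q):
    """Cosine similarity between a document frequency vector and a 0/1 query vector."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

doc = [3, 1, 0, 2, 0]    # unigram counts over a 5-word vocabulary
query = [0, 1, 1, 0, 0]  # the query mentions words 1 and 2
print(round(cosine(doc, query), 3))  # 0.189
```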

