Probabilistic Language Processing Chapter 23

Probabilistic Language Models
Goal: define a probability distribution over a set of strings.
Unigram, bigram, n-gram models.
Count using a corpus, but counts need smoothing:
– add-one
– linear interpolation
Evaluate with the perplexity measure.
E.g., segment "segmentwordswithoutspaces" into words with Viterbi.
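A minimal sketch of a bigram model with add-one smoothing and perplexity; the toy corpus, the `<s>` start token, and all counts are invented for illustration:

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams from a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def prob(w_prev, w, unigrams, bigrams, vocab_size):
    """Add-one (Laplace) smoothed bigram probability P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def perplexity(sent, unigrams, bigrams, vocab_size):
    """Perplexity = exp(-average log-probability per token)."""
    toks = ["<s>"] + sent
    logp = sum(math.log(prob(a, b, unigrams, bigrams, vocab_size))
               for a, b in zip(toks, toks[1:]))
    return math.exp(-logp / len(sent))

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram(corpus)
V = len(set(uni))  # vocabulary size, including <s>
print(perplexity(["the", "cat", "sat"], uni, bi, V))
```

A sentence seen in the corpus scores a lower perplexity than a shuffled one, which is the point of the measure.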

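The run-together example above ("segmentwordswithoutspaces") can be split with Viterbi-style dynamic programming over a unigram model; the word probabilities below are hypothetical:

```python
import math

# Toy unigram probabilities (hypothetical values for illustration).
word_prob = {"segment": 0.01, "words": 0.02, "without": 0.02,
             "spaces": 0.01, "word": 0.015, "s": 0.001}

def segment(text):
    """Viterbi-style DP: best[i] = max log-prob segmentation of text[:i]."""
    best = [0.0] + [-math.inf] * len(text)
    back = [0] * (len(text) + 1)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - 12), i):  # cap word length at 12
            w = text[j:i]
            if w in word_prob and best[j] + math.log(word_prob[w]) > best[i]:
                best[i] = best[j] + math.log(word_prob[w])
                back[i] = j
    # Recover the segmentation by following back-pointers.
    words, i = [], len(text)
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

print(segment("segmentwordswithoutspaces"))
# → ['segment', 'words', 'without', 'spaces']
```

Note that "word" + "s" is also a legal split of "words", but its combined probability loses to the single word, so Viterbi picks the right path.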
PCFGs
Rewrite rules have probabilities. The probability of a string is the sum of the probabilities of its parse trees. Context-freedom means no lexical constraints. PCFGs prefer short sentences.
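Summing probabilities over parse trees can be done with a CKY-style "inside" computation; the grammar and its probabilities below are a toy invented for illustration:

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form (hypothetical probabilities).
binary = {"S": [(("NP", "VP"), 1.0)],
          "NP": [(("NP", "PP"), 0.3)],
          "VP": [(("VP", "PP"), 0.4), (("V", "NP"), 0.6)],
          "PP": [(("P", "NP"), 1.0)]}
lexical = {"NP": [("she", 0.3), ("fish", 0.2), ("forks", 0.2)],
           "V": [("eats", 1.0)],
           "P": [("with", 1.0)]}

def inside_prob(words, start="S"):
    """CKY 'inside' algorithm: sums P(tree) over all parse trees."""
    n = len(words)
    chart = defaultdict(float)  # chart[i, j, X] = P(X derives words[i:j])
    for i, w in enumerate(words):
        for head, rules in lexical.items():
            for word, p in rules:
                if word == w:
                    chart[i, i + 1, head] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for head, rules in binary.items():
                for (b, c), p in rules:
                    for k in range(i + 1, j):
                        chart[i, j, head] += p * chart[i, k, b] * chart[k, j, c]
    return chart[0, n, start]

# PP-attachment ambiguity: the total sums over both parse trees.
print(inside_prob("she eats fish with forks".split()))
```

The sentence has two parses (the PP attaches to "fish" or to "eats"), and the inside chart adds their probabilities without enumerating trees explicitly.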

Learning PCFGs
Parsed corpus: count trees.
Unparsed corpus:
– rule structure known: use EM (the inside-outside algorithm)
– rules unknown: assume Chomsky normal form... problems remain

Information Retrieval
Goal: Google. Find documents relevant to the user's needs. An IR system has a document collection, a query in some language, a set of results, and a presentation of the results. Ideally we would parse documents into a knowledge base... but that is too hard.

IR 2
Boolean keyword model: a document is either in or out. Problems: only a single bit of "relevance", and Boolean combinations are a bit mysterious to users.
How to compute P(R=true | D,Q)? Estimate a language model for each document and compute the probability of the query given that model. Documents can then be ranked by the odds P(r|D,Q)/P(¬r|D,Q).
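A sketch of this query-likelihood idea: estimate a model per document, score the query under each, smoothed by linear interpolation with a whole-collection model. The documents, query, and mixing weight `mu` are all made up:

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, mu=0.5):
    """Log P(query | doc's language model), with linear-interpolation
    (Jelinek-Mercer) smoothing against the collection model."""
    doc_counts, doc_len = Counter(doc), len(doc)
    coll_counts, coll_len = Counter(collection), len(collection)
    logp = 0.0
    for w in query:
        p_doc = doc_counts[w] / doc_len
        p_coll = coll_counts[w] / coll_len
        logp += math.log(mu * p_doc + (1 - mu) * p_coll)
    return logp

docs = [["cat", "sat", "mat"], ["dog", "ate", "bone"]]
collection = [w for d in docs for w in d]
query = ["cat", "mat"]
ranked = sorted(docs, key=lambda d: query_likelihood(query, d, collection),
                reverse=True)
print(ranked[0])  # the cat document should rank first
```

The collection model keeps the probability nonzero for query words a document lacks (it assumes every query word occurs somewhere in the collection).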

IR 3
For this we need a model of how queries are related to documents. Bag of words: word frequencies in the document, naïve Bayes. Good example pp.

Evaluating IR
Precision is the proportion of results that are relevant. Recall is the proportion of relevant documents that are in the results.
ROC curve (there are several varieties): here, plot false negatives vs. false positives (the common form plots the true-positive rate against the false-positive rate).
More "practical" measures for the web: reciprocal rank of the first relevant result, or just "time to answer".
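The three measures can be stated directly in code; the document ids and relevance judgments are hypothetical:

```python
def precision(results, relevant):
    """Proportion of returned results that are relevant."""
    return sum(1 for d in results if d in relevant) / len(results)

def recall(results, relevant):
    """Proportion of relevant documents that appear in the results."""
    return sum(1 for d in relevant if d in results) / len(relevant)

def reciprocal_rank(results, relevant):
    """1 / rank of the first relevant result (0 if none is relevant)."""
    for rank, d in enumerate(results, start=1):
        if d in relevant:
            return 1 / rank
    return 0.0

results = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
print(precision(results, relevant))        # 2 of 4 results relevant → 0.5
print(recall(results, relevant))           # 2 of 3 relevant docs found
print(reciprocal_rank(results, relevant))  # first relevant at rank 2 → 0.5
```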

IR Refinements
– Case folding
– Stemming
– Synonyms
– Spelling correction
– Metadata (keywords)

IR Presentation
Give a list in order of relevance; deal with duplicates.
Cluster results into classes:
– agglomerative
– k-means
How to describe automatically generated clusters? A word list? The title of the centroid document?
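A bare-bones k-means sketch for the clustering step; in a real system the points would be document vectors, but here they are made-up 2-D points:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on numeric tuples (squared Euclidean distance)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point goes to its nearest centroid.
        for p in points:
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                    for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(xs) / len(cl) for xs in zip(*cl)]
    return centroids, clusters

points = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 4.9)]
centroids, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # two well-separated groups → [2, 2]
```

Agglomerative clustering would instead start with each document as its own cluster and repeatedly merge the closest pair.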

IR Implementation
CSC172! A lexicon with a "stop list", plus an "inverted" index recording where each word occurs. Match with vectors: a vector of word frequencies dotted with the query terms.
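A toy inverted index with a stop list, and a frequency-vector dot product for matching; the documents and stop list are invented:

```python
from collections import Counter, defaultdict

STOP = {"the", "a", "of"}  # tiny "stop list"

docs = {1: "the cat sat on the mat",
        2: "the dog sat on the log",
        3: "a cat and a dog"}

# Build the inverted index: word -> {doc id: frequency}.
index = defaultdict(Counter)
for doc_id, text in docs.items():
    for word in text.split():
        if word not in STOP:
            index[word][doc_id] += 1

def score(query):
    """Dot product of query term counts with document frequency vectors,
    touching only the index postings for words in the query."""
    q = Counter(w for w in query.split() if w not in STOP)
    scores = Counter()
    for word, q_count in q.items():
        for doc_id, d_count in index[word].items():
            scores[doc_id] += q_count * d_count
    return scores.most_common()

print(score("cat sat"))  # doc 1 matches both terms, so it ranks first
```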

Information Extraction
Goal: create database entries from documents. Emphasis on massive data, speed, and stylized expressions. Regular-expression grammars are OK if the text is stylized enough. Cascaded finite-state transducers: stages of grouping and structure-finding.
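A sketch of regular-expression extraction on stylized text; the text, the field names, and the pattern are all hypothetical:

```python
import re

# Hypothetical stylized text, as from a press-release feed.
text = """IBM acquired Red Hat for $34 billion on July 9, 2019.
Microsoft acquired GitHub for $7.5 billion on October 26, 2018."""

# One named group per "field" of the database entry.
pattern = re.compile(
    r"(?P<buyer>[A-Z][\w ]*?) acquired (?P<target>[A-Z][\w ]*?) "
    r"for \$(?P<price>[\d.]+) billion on (?P<date>\w+ \d{1,2}, \d{4})")

entries = [m.groupdict() for m in pattern.finditer(text)]
for e in entries:
    print(e)
```

This works only because the sentences follow a fixed template; a cascaded system would run several such stages, each grouping the previous stage's output into larger structures.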

Machine Translation Goals
– Rough translation (e.g., p. 851)
– Restricted domain (mergers, weather)
– Pre-edited input (Caterpillar or Xerox English)
– Literary translation: not yet!
Interlingua: a canonical semantic representation such as Conceptual Dependency.
Basic problem: different languages, different categories.

MT in Practice
Transfer: uses a database of rules for translating small units of language.
Memory-based: memorize sentence pairs.
Good diagram p. 853.

Statistical MT
Use a bilingual corpus; find the most likely translation given the corpus.
argmax_F P(F|E) = argmax_F P(E|F) P(F)
P(F) is the language model; P(E|F) is the translation model.
Lots of interesting problems, e.g., fertility ("home" vs. "à la maison"). Horribly drastic simplifications and hacks work pretty well!

Learning and MT
Statistical MT needs a language model, a fertility model, a word-choice model, and an offset model: millions of parameters. Estimate them by counting and EM.
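The counting-and-EM idea can be sketched for the word-choice model alone (this is IBM Model 1, which omits the fertility and offset models); the parallel corpus is a toy:

```python
from collections import defaultdict

# Tiny English-French parallel corpus (hypothetical).
pairs = [(["the", "house"], ["la", "maison"]),
         (["the"], ["la"]),
         (["the", "book"], ["le", "livre"]),
         (["a", "book"], ["un", "livre"])]

e_vocab = {e for es, _ in pairs for e in es}
# t[e][f] approximates P(e | f): the word-choice model, initialized uniform.
t = {e: defaultdict(lambda: 1.0 / len(e_vocab)) for e in e_vocab}

for _ in range(10):  # EM iterations
    count = defaultdict(float)   # expected co-occurrence counts
    total = defaultdict(float)   # expected counts per French word
    for es, fs in pairs:
        for e in es:
            # E-step: spread e's count over the French words that could
            # have produced it, in proportion to the current t values.
            norm = sum(t[e][f] for f in fs)
            for f in fs:
                count[e, f] += t[e][f] / norm
                total[f] += t[e][f] / norm
    # M-step: re-estimate t from the expected counts.
    for (e, f), c in count.items():
        t[e][f] = c / total[f]

print(round(t["house"]["maison"], 3))  # "house" aligns with "maison"
```

Starting from uniform parameters, a few EM iterations already pull "house" toward "maison" and "book" toward "livre", because "the" is explained away by "la".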