Statistical Language Models for Biomedical Literature Retrieval
ChengXiang Zhai
Department of Computer Science, Institute for Genomic Biology, and Graduate School of Library & Information Science
University of Illinois, Urbana-Champaign

Motivation
Biomedical literature serves as a "complete" documentation of the biomedical knowledge discovered by scientists.
- Medline: > 10,000,000 literature abstracts (1966-)
Effective access to biomedical literature is essential for
- Understanding related existing discoveries
- Formulating new hypotheses
- Verifying hypotheses
- ...
Biologists routinely use PubMed to access the literature.

Challenges in Biomedical Literature Retrieval
Tokenization
- Many names are irregular, with special characters such as "/", "-", etc.; e.g., MIP-1-alpha, (MIP)-1alpha
- Ambiguous words: "was" and "as" can be gene names
Semi-structured queries
- It is often desirable to expand a query about a gene with synonyms of the gene; the expanded query would have several fields (original name + symbols)
- "Find the role of gene A in disease B" (3 fields)
...

TREC Genomics Track
TREC (Text REtrieval Conference):
- Started in 1992; sponsored by NIST
- Large-scale evaluation of information retrieval (IR) techniques
Genomics Track:
- Started in 2003 and still continuing
- Evaluation of IR for biomedical literature search

Typical TREC Cycle
- Feb: Application for participation
- Spring: Preliminary (training) data available
- Beginning of summer: Official test data available
- End of summer: Result submission
- Early fall: Official evaluation; results are out in October
- Nov: TREC workshop; plan for next year

UIUC Participation
- 2003: Obtained initial experience; recognized the problem of "semi-structured queries"
- 2005: Continued developing semi-structured language models
- 2006: Applied hidden Markov models to passage retrieval

Outline
- Standard IR Techniques
- Semi-structured Query Language Models
- Parameter Estimation
- Experiment Results
- Conclusions and Future Work

What is Text Retrieval (TR)?
- There exists a collection of text documents
- The user gives a query to express an information need
- A retrieval system returns relevant documents to the user
- More commonly known as "Information Retrieval" (IR)
- Known as "search technology" in industry

TR is Hard!
Under/over-specified queries
- Ambiguous: "buying CDs" (money or music?)
- Incomplete: what kind of CDs?
- What if "CD" is never mentioned in the document?
Vague semantics of documents
- Ambiguity: e.g., word-sense, structural
- Incompleteness: inferences required
Even hard for people!
- Roughly 80% agreement in human relevance judgments

TR is "Easy"!
TR CAN be easy in a particular case
- Ambiguity in the query/document is RELATIVE to the database
- So, if the query is SPECIFIC enough, just one keyword may retrieve all the relevant documents
PERCEIVED TR performance is usually better than the actual performance
- Users can NOT judge the completeness of an answer

Formal Formulation of TR
- Vocabulary of the language: V = {w_1, w_2, …, w_N}
- Query: q = q_1, …, q_m, where q_i ∈ V
- Document: d_i = d_i1, …, d_im_i, where d_ij ∈ V
- Collection: C = {d_1, …, d_k}
- Set of relevant documents: R(q) ⊆ C
  - Generally unknown and user-dependent
  - The query is a "hint" on which documents are in R(q)
- Task: compute R'(q), an approximation of R(q)

Computing R(q)
Strategy 1: Document selection
- R(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an indicator function or classifier
- The system must decide whether a document is relevant or not ("absolute relevance")
Strategy 2: Document ranking
- R(q) = {d ∈ C | f(d,q) > θ}, where f(d,q) ∈ ℝ is a relevance measure function and θ is a cutoff
- The system must decide whether one document is more likely to be relevant than another ("relative relevance")
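To make the contrast concrete, here is a minimal Python sketch (added here, not from the original slides) of the two strategies; `score` stands in for a hypothetical relevance function f(d, q).

    def select(docs, query, score, threshold=0.5):
        """Strategy 1 (absolute relevance): return the unordered set of
        documents whose score clears a fixed cutoff."""
        return {d for d in docs if score(d, query) > threshold}

    def rank(docs, query, score):
        """Strategy 2 (relative relevance): return all documents ordered
        by score; the user decides where to stop reading."""
        return sorted(docs, key=lambda d: score(d, query), reverse=True)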

Document Selection vs. Ranking
[Figure: the same pool of documents d_1 … d_9 scored by f(d,q). Document selection returns an unordered set R'(q) that only partially overlaps the true relevant set R(q); document ranking orders all documents by f(d,q) and leaves the cutoff to the user.]

Problems of Doc Selection
The classifier is unlikely to be accurate
- "Over-constrained" query (terms too specific): no relevant documents found
- "Under-constrained" query (terms too general): over-delivery
- It is extremely hard to find the right position between these two extremes
Even if it is accurate, not all relevant documents are equally relevant
Relevance is a matter of degree!

Ranking is often preferred
- Relevance is a matter of degree
- A user can stop browsing anywhere, so the boundary is controlled by the user
  - High-recall users will view more items
  - High-precision users will view only a few
- Theoretical justification: the Probability Ranking Principle [Robertson 77]

Evaluation Criteria
- Effectiveness/accuracy: precision, recall
- Efficiency: space and time complexity
- Usability: how useful for real user tasks?

Methodology: Cranfield Tradition
Laboratory testing of system components
- Precision, recall
- Comparative testing
Test collections
- Set of documents
- Set of questions
- Relevance judgments

The Contingency Table

    Doc / Action      Retrieved               Not Retrieved
    Relevant          Relevant Retrieved      Relevant Rejected
    Not relevant      Irrelevant Retrieved    Irrelevant Rejected

How to measure a ranking?
- Compute the precision at every recall point
- Plot a precision-recall (PR) curve
[Figure: two precision-recall curves on the same precision vs. recall axes, each better than the other at some recall levels. Which is better?]

Summarize a Ranking
Given that n documents are retrieved:
- Compute the precision (at rank) where each (new) relevant document is retrieved => p(1), …, p(k), if we have k relevant documents
  - E.g., if the first relevant document is at rank 2, then p(1) = 1/2
  - If a relevant document never gets retrieved, its corresponding precision is taken to be zero
- Compute the average over all the relevant documents: Average Precision = (p(1) + … + p(k)) / k
This gives (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document.
Averaging over a set of topics:
- MAP = arithmetic mean of average precision over a set of topics
- gMAP = geometric mean of average precision over a set of topics (more affected by difficult topics)

Precision-Recall Curve
[Figure: a precision-recall curve for one run, annotated with the mean average precision (MAP), the breakeven point (precision = recall), and the final recall of 3212/4728: out of 4728 relevant documents, 3212 were retrieved. A precision of about 0.55 at rank 10 means that about 5.5 of the top 10 documents are relevant.]
Worked example: the system returns 6 documents, judged D1 +, D2 +, D3 -, D4 -, D5 +, D6 -, and there are 4 relevant documents in total (so one is never retrieved).
Average Precision = (1/1 + 2/2 + 3/5 + 0) / 4 = 0.65
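A short Python sketch (added here, not from the slides) of non-interpolated average precision and MAP as defined above; the final assertion checks the slide's worked example, with a hypothetical "D7" standing in for the one relevant document that is never retrieved.

    def average_precision(ranking, relevant):
        """Non-interpolated AP: average the precision at the rank of each
        relevant document; relevant docs never retrieved contribute 0."""
        hits, precisions = 0, []
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant) if relevant else 0.0

    def mean_average_precision(topics):
        """MAP: arithmetic mean of AP over (ranking, relevant-set) topics."""
        return sum(average_precision(r, rel) for r, rel in topics) / len(topics)

    # Slide example: ranking D1+ D2+ D3- D4- D5+ D6-, 4 relevant docs in total,
    # so AP = (1/1 + 2/2 + 3/5 + 0) / 4 = 0.65
    assert abs(average_precision(["D1", "D2", "D3", "D4", "D5", "D6"],
                                 {"D1", "D2", "D5", "D7"}) - 0.65) < 1e-9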

Typical TR System Architecture
[Diagram: documents flow through a Tokenizer and Indexer into an Index (the document representation); the user's query is turned into a query representation; a Scorer matches the query representation against the index to produce results; user judgments feed back into the query representation (feedback loop).]

Tokenization
- Normalize lexical units: words with similar meanings should be mapped to the same indexing term
- Stemming: mapping all inflectional forms of words to the same root form, e.g.
  - computer -> compute
  - computation -> compute
  - computing -> compute (but king -> k?)
- Porter's stemmer is popular for English
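For illustration (an addition, using NLTK's implementation of the Porter algorithm, which the slide names but does not prescribe):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["computer", "computation", "computing", "king"]:
        print(word, "->", stemmer.stem(word))
    # Porter maps the first three words to the common root "comput";
    # "king" stays "king", so the slide's "king -> k?" aside is a caution
    # about over-stemming in general rather than actual Porter behavior.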

Relevance Feedback
[Diagram: the user issues a query to the retrieval engine over the document collection and receives results d_1 … d_k; the user judges them (d_1 +, d_2 -, d_3 +, …, d_k -), and the feedback module uses these judgments to produce an updated query, which is run again.]

Pseudo/Blind/Automatic Feedback
[Diagram: the same loop as relevance feedback, but without the user: the top 10 results are simply assumed to be relevant (d_1 +, d_2 +, d_3 +, …), and the feedback module uses them to produce the updated query.]

Traditional approach = Vector space model

Vector Space Model
Represent a document/query by a term vector
- Term: basic concept, e.g., word or phrase
- Each term defines one dimension
- N terms define a high-dimensional space
- Each element of the vector corresponds to a term weight
- E.g., d = (x_1, …, x_N), where x_i is the "importance" of term i
Measure relevance by the distance between the query vector and the document vector in the vector space

VS Model: illustration
[Figure: a three-dimensional term space with axes "Java", "Microsoft", and "Starbucks"; documents D_1 … D_11 and the query are plotted as vectors, and the question marks ask which documents lie closest to the query.]

What's a good "basic concept"?
- Orthogonal
  - Linearly independent basis vectors
  - "Non-overlapping" in meaning (no ambiguity)
- Weights can be assigned automatically and, hopefully, accurately
- Many possibilities: words, stemmed words, phrases, "latent concepts", …

How to Assign Weights?
Very important!
Why weighting?
- Query side: not all terms are equally important
- Document side: some terms carry more content
How?
- Two basic heuristics: TF (Term Frequency) = within-document frequency, and IDF (Inverse Document Frequency)
- TF normalization
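A minimal sketch (added, not from the slides) of the TF-IDF heuristic and the vector-space relevance measure; it uses raw term frequency and a plain log IDF, which is only one of many possible weighting variants.

    import math
    from collections import Counter

    def tfidf_vector(doc_tokens, doc_freq, n_docs):
        """Weight each term by its within-document frequency (TF) times
        log(N / df) (IDF): rare terms get boosted, common terms damped."""
        tf = Counter(doc_tokens)
        return {t: tf[t] * math.log(n_docs / doc_freq[t])
                for t in tf if doc_freq.get(t)}

    def cosine(u, v):
        """Relevance as the cosine of the angle between query and doc vectors."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0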

Language Modeling Approaches are becoming more and more popular…

What is a Statistical LM?
A probability distribution over word sequences, e.g.,
- p("Today is Wednesday") is relatively high
- p("Today Wednesday is") is far lower
- p("The eigenvalue is positive") depends on the domain: the distribution is context-dependent!
A statistical LM can also be regarded as a probabilistic mechanism for "generating" text, and is thus also called a "generative" model.

The Simplest Language Model (Unigram Model)
- Generate a piece of text by generating each word INDEPENDENTLY
- Thus, p(w_1 w_2 … w_n) = p(w_1) p(w_2) … p(w_n)
- Parameters: {p(w_i)}, with p(w_1) + … + p(w_N) = 1 (N is the vocabulary size)
- Essentially a multinomial distribution over words
- A piece of text can be regarded as a sample drawn according to this word distribution

Text Generation with Unigram LM
[Figure: sampling documents from unigram language models p(w|θ). Topic 1 ("Text mining"): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food (tiny probability), … generates a text-mining paper. Topic 2 ("Health"): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … generates a food-nutrition paper.]

Estimation of Unigram LM
[Figure: the reverse problem: given an observed document, estimate the unigram language model p(w|θ) that generated it. Example: a "text mining paper" with 100 words in total and counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …; the estimates are the relative frequencies p(text) = 10/100, p(mining) = 5/100, p(association) = p(database) = 3/100, p(query) = 1/100, ….]
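In code, the maximum-likelihood estimate on this slide is just relative frequency (a sketch added here; the counts are the slide's example):

    from collections import Counter

    counts = Counter({"text": 10, "mining": 5, "association": 3, "database": 3,
                      "algorithm": 2, "query": 1, "efficient": 1})  # ... other words
    doc_length = 100  # total number of words in the example paper

    # Maximum-likelihood estimate: p(w|theta) = c(w, d) / |d|
    unigram_lm = {w: c / doc_length for w, c in counts.items()}
    print(unigram_lm["text"], unigram_lm["mining"])  # 0.1, 0.05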

Language Models for Retrieval (Ponte & Croft 98)
[Figure: each document (e.g., a text-mining paper, a food-nutrition paper) is associated with its own unigram language model (text ?, mining ?, association ?, clustering ?, … vs. food ?, nutrition ?, healthy ?, diet ?, …). Given the query "data mining algorithms", documents are ranked by asking: which model would most likely have generated this query?]

Ranking Docs by Query Likelihood
[Figure: each document d_1, d_2, …, d_N has its own document language model θ_d; given query q, documents are ranked by the query likelihoods p(q|θ_d_1), p(q|θ_d_2), …, p(q|θ_d_N).]

Kullback-Leibler (KL) Divergence Retrieval Model
Unigram similarity model: rank documents by the negative KL divergence between a query model θ_Q and a document model θ_D:
  score(Q, D) = -D(θ_Q || θ_D) = Σ_w p(w|θ_Q) log p(w|θ_D) + query entropy (ignored for ranking)
Retrieval thus reduces to the estimation of θ_Q and θ_D.
Special case: θ_Q = the empirical distribution of the query q, which recovers query-likelihood ranking.
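A sketch of the KL-divergence score in code (an addition; the document model is assumed to be already smoothed so that p(w|θ_D) is never zero):

    import math

    def kl_score(query_model, doc_model, floor=1e-12):
        """Rank-equivalent score: sum_w p(w|theta_Q) * log p(w|theta_D).
        The query-entropy term is constant per query, so it is dropped."""
        return sum(p_q * math.log(doc_model.get(w, floor))
                   for w, p_q in query_model.items())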

Estimating p(w|d) (i.e., θ_D)
- Simplified Jelinek-Mercer: shrink uniformly toward p(w|C): p(w|d) = (1-λ) p_ml(w|d) + λ p(w|C)
- Dirichlet prior (Bayesian): assume μ p(w|C) pseudo counts: p(w|d) = (c(w,d) + μ p(w|C)) / (|d| + μ)
- Absolute discounting: subtract a constant δ from each observed count and redistribute the freed mass via p(w|C): p(w|d) = (max(c(w,d) - δ, 0) + δ |d|_u p(w|C)) / |d|, where |d|_u is the number of unique terms in d
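The three estimators in code (a sketch; the formulas follow the standard Zhai & Lafferty smoothing methods, and the default parameter values below are typical settings, not values taken from this talk):

    def jelinek_mercer(c_wd, doc_len, p_wc, lam=0.5):
        """Linear interpolation of the ML estimate with the collection model."""
        p_ml = c_wd / doc_len if doc_len else 0.0
        return (1 - lam) * p_ml + lam * p_wc

    def dirichlet_prior(c_wd, doc_len, p_wc, mu=2000):
        """Bayesian smoothing: add mu * p(w|C) pseudo counts to every word."""
        return (c_wd + mu * p_wc) / (doc_len + mu)

    def absolute_discount(c_wd, doc_len, p_wc, n_unique, delta=0.7):
        """Subtract delta from each seen count; the freed probability mass
        (delta * #unique terms in d) is redistributed via p(w|C)."""
        return (max(c_wd - delta, 0.0) + delta * n_unique * p_wc) / doc_len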

Estimating θ_Q (Feedback)
[Figure: the query Q gives an initial query model θ_Q; retrieval over the documents D produces results, and the feedback documents F = {d_1, d_2, …, d_n} are fed to a generative model to obtain a feedback model θ_F. The two are interpolated with a weight α: α = 0 means no feedback, α = 1 means full feedback.]

Generative Mixture Model
[Figure: each word w in the feedback documents F = {d_1, …, d_n} is generated either from the topic model p(w|θ_F), with probability 1-λ, or from the background model p(w|C), with probability λ; the topic model is fit by maximum likelihood. λ = the amount of noise (background words vs. topic words) in the feedback documents.]

How to Estimate θ_F?
[Figure: the background model p(w|C) is known (e.g., the 0.2, a 0.1, we 0.01, to 0.02, …), while the query topic model p(w|θ_F) is unknown (text = ?, mining = ?, association = ?, word = ?, …). The observed "text mining" feedback documents are assumed to be drawn from a mixture of the two components, with mixing weights of 0.7 and 0.3 in the illustration. If we knew the identity (background vs. topic) of each word, we could apply the ML estimator directly.]

Can We Guess the Identity?
- Identity ("hidden") variable for each word occurrence (e.g., in "the paper presents a text mining algorithm the paper …"): z_i ∈ {1 (background), 0 (topic)}
- Suppose the parameters are all known; what is a reasonable guess of z_i?
  - It depends on λ (why?)
  - It depends on p(w|C) and p(w|θ_F) (how?)
- E-step: guess z_i (in expectation) given the current parameters; M-step: re-estimate p(w|θ_F) given the guessed identities
- Initially, set p(w|θ_F) to some random value, then iterate …
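A compact EM sketch for this mixture model (added here as an illustration; λ is fixed and θ_F is initialized uniformly rather than randomly):

    from collections import Counter

    def estimate_feedback_model(feedback_docs, p_wc, lam=0.7, iters=50):
        """EM for the two-component mixture: each word in F comes from the
        background p(w|C) with probability lam, else from p(w|theta_F)."""
        counts = Counter(w for doc in feedback_docs for w in doc)
        p_wf = {w: 1.0 / len(counts) for w in counts}   # initial theta_F
        for _ in range(iters):
            # E-step: posterior probability that each word came from the topic
            z = {w: (1 - lam) * p_wf[w] /
                    ((1 - lam) * p_wf[w] + lam * p_wc.get(w, 1e-12))
                 for w in counts}
            # M-step: re-estimate theta_F from the expected topic counts
            norm = sum(counts[w] * z[w] for w in counts)
            p_wf = {w: counts[w] * z[w] / norm for w in counts}
        return p_wf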

Example of Feedback Query Model
[Figure: feedback query models learned with the mixture model approach for TREC topic 412, "airport security", from the top 10 documents of a Web database, shown for two settings of the mixture weight (0.9 and 0.7).]

Problem with Standard IR Methods: Semi-Structured Queries
TREC-2003 Genomics Track, Topic 1: "Find articles about the following gene:"
- OFFICIAL_GENE_NAME: activating transcription factor 2
- OFFICIAL_SYMBOL: ATF2
- ALIAS_SYMBOL: HB16
- ALIAS_SYMBOL: CREB2
- ALIAS_SYMBOL: TREB7
- ALIAS_SYMBOL: CRE-BP1
Bag-of-words representation: activating transcription factor 2, ATF2, HB16, CREB2, TREB7, CRE-BP1
Problems with the unstructured representation:
- Intuitively, matching "ATF2" should count more than matching "transcription"
- Such a query is not a natural sample of a unigram language model, violating the assumption of the language-modeling retrieval approach

Problem with Standard IR Methods: Semi-Structured Queries (cont.)
A topic in the TREC-2005 Genomics Track: "Find information about the role of the gene interferon-beta in the disease multiple sclerosis"
- 3 different fields
- Should they be weighted differently?
- What about expansion?

Semi-Structured Language Models
- Semi-structured query: a query consisting of several fields
- Semi-structured query model: a mixture of per-field unigram models, p(w|θ_Q) = Σ_i λ_i p(w|θ_i)
- Semi-structured LM estimation: fit a mixture model to pseudo-feedback documents using Expectation-Maximization (EM)
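A sketch consistent with this description (an addition; the exact estimation used in the TREC runs may differ): the query model is a weighted mixture of per-field language models.

    def semistructured_query_model(field_models, weights):
        """p(w|theta_Q) = sum_i lambda_i * p(w|theta_i), where theta_i is the
        language model of field i and the lambda_i sum to one."""
        p_q = {}
        for lam, model in zip(weights, field_models):
            for w, p in model.items():
                p_q[w] = p_q.get(w, 0.0) + lam * p
        return p_q

    # Synonym query (next slide): each field model by maximum likelihood,
    # uniform weights lambda_i = 1/k over the k fields.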

Parameter Estimation
Synonym queries:
- Each field is estimated using maximum likelihood
- Each field has equal weight: λ_i = 1/k
Aspect queries:
- Use top-ranked documents to estimate all the parameters
- Similar to the single-aspect model, but use the query as a prior and Bayesian estimation

Maximum Likelihood vs. Bayesian
Maximum likelihood estimation:
- "Best" means "the data likelihood reaches its maximum"
- Problem: small sample
Bayesian estimation:
- "Best" means being consistent with our "prior" knowledge and explaining the data well
- Problem: how to define the prior?

Illustration of Bayesian Estimation
[Figure: the prior p(θ), the likelihood p(X|θ) for observed data X = (x_1, …, x_N), and the posterior p(θ|X) ∝ p(X|θ) p(θ) plotted over θ; θ_0 marks the prior mode, θ_ml the maximum-likelihood estimate, and the posterior mode lies between the two.]

Experiment Results
[Table: MAP of the unstructured vs. semi-structured query models, with relative improvement (%), for TREC 2003 (uniform field weights) and TREC 2005 (estimated field weights).]

More Experiment Results (with a slightly different model)

Conclusions
- Standard IR techniques are effective for biomedical literature retrieval
- Modeling and exploiting the structure in a query can improve accuracy
- Overall TREC Genomics Track findings:
  - Domain-specific resources are very useful
  - Sound retrieval models and machine learning techniques are helpful

Future Work
- Use HMMs to model relevant documents
- Incorporate biomedical resources into principled statistical models