Statistical Language Models for Biomedical Literature Retrieval ChengXiang Zhai Department of Computer Science, Institute for Genomic Biology, And Graduate School of Library & Information Science University of Illinois, Urbana-Champaign

Motivation Biomedical literature serves as a “complete” documentation of the biomedical knowledge discovered by scientists Medline: > 10,000,000 literature abstracts (1966-) Effective access to biomedical literature is essential for –Understanding related existing discoveries –Formulating new hypotheses –Verifying hypotheses –… Biologists routinely use PubMed to access literature (http://www.ncbi.nlm.nih.gov/PubMed)

Challenges in Biomedical Literature Retrieval Tokenization –Many names are irregular with special characters such as “/”, “-”, etc. E.g., MIP-1-alpha, (MIP)-1alpha –Ambiguous words: “was” and “as” can be genes Semi-structured queries –It is often desirable to expand a query about a gene with synonyms of the gene; the expanded query would have several fields (original name + symbols) –“Find the role of gene A in disease B” (3 fields) …
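
One simple way to approach the irregular-name problem is to map spelling variants to a shared key before indexing. The sketch below is only an illustration, not the method used in the talk: the function name `normalize_name` and the rule "lowercase, then drop everything that is not a letter or digit" are assumptions made here.

```python
import re

def normalize_name(name: str) -> str:
    """Lowercase a name and drop every character that is not a letter or digit.

    Hypothetical normalization rule chosen for illustration only.
    """
    return re.sub(r"[^a-z0-9]", "", name.lower())

# The two spellings quoted on the slide collapse to the same indexing key.
variants = ["MIP-1-alpha", "(MIP)-1alpha"]
keys = {normalize_name(v) for v in variants}
print(keys)
```

A rule this aggressive does nothing for the second challenge on the slide (ordinary words such as “was” and “as” that are also gene names), and it may merge names that should stay distinct, so it would need to be checked against real data.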

TREC Genomics Track TREC (Text REtrieval Conference): –Started 1992; sponsored by NIST –Large-scale evaluation of information retrieval (IR) techniques Genomics Track –Started in 2003 –Still continuing –Evaluation of IR for biomedical literature search

Typical TREC Cycle Feb: Application for participation Spring: Preliminary (training) data available Beginning of Summer: Official test data available End of Summer: Result submission Early Fall: Official evaluation; results are out in Oct Nov: TREC Workshop; plan for next year

UIUC Participation 2003: Obtained initial experience; recognized the problem of “semi-structured queries” 2005: Continued developing semi-structured language models 2006: Applied hidden Markov models to passage retrieval

Outline Standard IR Techniques Semi-structured Query Language Models Parameter Estimation Experiment Results Conclusions and Future Work

What is Text Retrieval (TR)? There exists a collection of text documents User gives a query to express the information need A retrieval system returns relevant documents to users More commonly known as “Information Retrieval” (IR) Known as “search technology” in industry

TR is Hard! Under/over-specified query –Ambiguous: “buying CDs” (money or music?) –Incomplete: what kind of CDs? –What if “CD” is never mentioned in document? Vague semantics of documents –Ambiguity: e.g., word-sense, structural –Incomplete: Inferences required Even hard for people! –80% agreement in human judgments

TR is “Easy”! TR CAN be easy in a particular case –Ambiguity in query/document is RELATIVE to the database –So, if the query is SPECIFIC enough, just one keyword may get all the relevant documents PERCEIVED TR performance is usually better than the actual performance –Users can NOT judge the completeness of an answer

Formal Formulation of TR
Vocabulary $V = \{w_1, w_2, \ldots, w_N\}$ of language
Query $q = q_1, \ldots, q_m$, where $q_i \in V$
Document $d_i = d_{i1}, \ldots, d_{im_i}$, where $d_{ij} \in V$
Collection $C = \{d_1, \ldots, d_k\}$
Set of relevant documents $R(q) \subseteq C$
–Generally unknown and user-dependent
–Query is a “hint” on which doc is in $R(q)$
Task = compute $R'(q)$, an “approximate $R(q)$”

Computing R’(q)
Strategy 1: Document selection
–$R'(q) = \{d \in C \mid f(d,q) = 1\}$, where $f(d,q) \in \{0,1\}$ is an indicator function or classifier
–System must decide if a doc is relevant or not (“absolute relevance”)
Strategy 2: Document ranking
–$R'(q) = \{d \in C \mid f(d,q) > \theta\}$, where $f(d,q) \in \mathbb{R}$ is a relevance measure function; $\theta$ is a cutoff
–System must decide if one doc is more likely to be relevant than another (“relative relevance”)

Document Selection vs. Ranking [Figure: the true R(q) compared with the approximation R’(q). Doc selection: f(d,q) assigns each document 1 or 0. Doc ranking: f(d,q) assigns each document a score: 0.98 d1 +, 0.95 d2 +, 0.83 d3 -, 0.80 d4 +, 0.76 d5 -, 0.56 d6 -, 0.34 d7 -, 0.21 d8 +, 0.21 d9 - (+ relevant, - not relevant)]
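
The difference between the two strategies can be sketched with the scores shown in the ranking illustration. The scores and relevance marks are copied from the slide; the function names `select` and `rank` and the cutoff value 0.5 are assumptions made for this example.

```python
# (doc id, score f(d,q), relevant?) as shown in the ranking illustration.
scored = [
    ("d1", 0.98, True), ("d2", 0.95, True), ("d3", 0.83, False),
    ("d4", 0.80, True), ("d5", 0.76, False), ("d6", 0.56, False),
    ("d7", 0.34, False), ("d8", 0.21, True), ("d9", 0.21, False),
]

def select(scored_docs, theta):
    """Document selection: keep only docs whose score exceeds the cutoff theta."""
    return [doc for doc, score, _ in scored_docs if score > theta]

def rank(scored_docs):
    """Document ranking: return all docs, highest score first (ties keep input order)."""
    ordered = sorted(scored_docs, key=lambda item: item[1], reverse=True)
    return [doc for doc, _, _ in ordered]

selected = select(scored, theta=0.5)   # hypothetical cutoff
ranked = rank(scored)
print(selected)
print(ranked)
```

With a fixed cutoff the relevant document d8 is never shown, whereas with ranking the user decides how far down the list to read.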

Problems of Doc Selection The classifier is unlikely to be accurate –“Over-constrained” query (terms are too specific): no relevant documents found –“Under-constrained” query (terms are too general): over-delivery –It is extremely hard to find the right position between these two extremes Even if it is accurate, not all relevant documents are equally relevant Relevance is a matter of degree!

Ranking is often preferred Relevance is a matter of degree A user can stop browsing anywhere, so the boundary is controlled by the user –High recall users would view more items –High precision users would view only a few Theoretical justification: Probability Ranking Principle [Robertson 77]

Evaluation Criteria Effectiveness/Accuracy –Precision, Recall Efficiency –Space and time complexity Usability –How useful for real user tasks?

Methodology: Cranfield Tradition Laboratory testing of system components –Precision, Recall –Comparative testing Test collections –Set of documents –Set of questions –Relevance judgments

The Contingency Table
Doc \ Action | Retrieved            | Not Retrieved
Relevant     | Relevant Retrieved   | Relevant Rejected
Not relevant | Irrelevant Retrieved | Irrelevant Rejected

How to measure a ranking? Compute the precision at every recall point Plot a precision-recall (PR) curve [Figure: two plots of precision against recall, one per system] Which is better?

Summarize a Ranking Given that n docs are retrieved –Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs –E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2. –If a relevant document never gets retrieved, we assume the precision corresponding to that rel. doc to be zero Compute the average over all the relevant documents –Average precision = (p(1)+…+p(k))/k This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document Mean Average Precision (MAP) –MAP = arithmetic mean of average precision over a set of topics –gMAP = geometric mean of average precision over a set of topics (more affected by difficult topics)

Precision-Recall Curve [Figure: annotated evaluation output] Annotations: Mean Avg. Precision (MAP); Recall = 3212/4728 (out of 4728 rel docs, we’ve got 3212); Breakeven Point (prec = recall); Precision@10docs (about 5.5 docs in the top 10 docs are relevant). Example: the system returns 6 docs (D1 +, D2 +, D3 -, D4 -, D5 +, D6 -) and the total # rel docs = 4, so Average Prec = (1/1 + 2/2 + 3/5 + 0)/4
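
The six-document example above can be checked with a few lines of code. This is a minimal sketch of non-interpolated average precision as described on the previous slide; the function names and the `floor` guard used for gMAP are assumptions of this example, not part of any official evaluation tool.

```python
import math

def average_precision(relevance, total_relevant):
    """Non-interpolated average precision for one ranked list.

    relevance: booleans in rank order (True = relevant).
    total_relevant: number of relevant docs for the topic; relevant docs
    that are never retrieved contribute a precision of zero.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant doc
    return precision_sum / total_relevant

def precision_at_k(relevance, k):
    """Fraction of the top k returned docs that are relevant."""
    return sum(relevance[:k]) / k

def mean_average_precision(aps):
    """Arithmetic mean of per-topic average precision (MAP)."""
    return sum(aps) / len(aps)

def geometric_map(aps, floor=1e-5):
    """Geometric mean of per-topic average precision (gMAP).

    `floor` is an assumed guard so that a topic with AP = 0 does not
    make the logarithm undefined.
    """
    return math.exp(sum(math.log(max(ap, floor)) for ap in aps) / len(aps))

# D1 +, D2 +, D3 -, D4 -, D5 +, D6 -; four relevant docs exist in total.
ranking = [True, True, False, False, True, False]
ap = average_precision(ranking, total_relevant=4)   # (1/1 + 2/2 + 3/5 + 0)/4
print(ap, precision_at_k(ranking, 5))
```

Because gMAP averages in log space, a single topic with very low average precision pulls it down much more than it pulls down MAP, which is why it is described as more affected by difficult topics.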

Typical TR System Architecture [Figure: system diagram with the components User, query, docs, results, Tokenizer, Indexer, Index, Doc Rep (Index), Query Rep, Scorer, judgments, Feedback]

Tokenization Normalize lexical units: Words with similar meanings should be mapped to the same indexing term Stemming: Mapping all inflectional forms of words to the same root form, e.g. –computer -> compute –computation -> compute –computing -> compute (but king->k?) Porter’s Stemmer is popular for English

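Relevance Feedback [Figure: the Query goes to the Retrieval Engine, which searches the Document collection and returns Results (d1 3.5, d2 2.4, …, dk 0.5, ...); the User supplies Judgments (d1 +, d2 -, d3 +, …, dk -, ...); Feedback turns the judgments into an Updated query]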

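Pseudo/Blind/Automatic Feedback [Figure: the Query goes to the Retrieval Engine, which searches the Document collection and returns Results (d1 3.5, d2 2.4, …, dk 0.5, ...); no user is involved, and the top 10 results are treated as relevant (Judgments: d1 +, d2 +, d3 +, …, dk -, ...); Feedback turns these judgments into an Updated query]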

Traditional approach = Vector space model

Vector Space Model Represent a doc/query by a term vector –Term: basic concept, e.g., word or phrase –Each term defines one dimension –N terms define a high-dimensional space –Element of vector corresponds to term weight –E.g., d=(x 1,…,x N ), x i is “importance” of term i Measure relevance by the distance between the query vector and document vector in the vector space

VS Model: illustration [Figure: documents D1 to D11 and the Query plotted in a space with the term dimensions Java, Microsoft, and Starbucks; question marks next to some of the documents]

What’s a good “basic concept”? Orthogonal –Linearly independent basis vectors –“Non-overlapping” in meaning No ambiguity Weights can be assigned automatically and hopefully accurately Many possibilities: Words, stemmed words, phrases, “latent concept”, …

How to Assign Weights? Very important! Why weighting –Query side: Not all terms are equally important –Doc side: Some terms carry more contents How? –Two basic heuristics TF (Term Frequency) = Within-doc-frequency IDF (Inverse Document Frequency) –TF normalization
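
The two heuristics can be sketched as follows. The slides do not give a specific weighting formula, so this example assumes raw term counts for TF, `log((N + 1) / df)` for IDF, and cosine similarity as the closeness measure between query and document vectors; the three toy documents (built from the terms in the illustration) are hypothetical.

```python
import math
from collections import Counter

def idf_weights(docs):
    """IDF per term, using the assumed variant log((N + 1) / df)."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    return {term: math.log((n + 1) / count) for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector: raw term frequency times IDF."""
    tf = Counter(tokens)
    return {term: tf[term] * idf[term] for term in tf if term in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy collection.
docs = {
    "D1": "java java starbucks".split(),
    "D2": "java microsoft microsoft".split(),
    "D3": "starbucks starbucks java".split(),
}
idf = idf_weights(docs)
query_vec = tfidf_vector("java microsoft".split(), idf)
scores = {name: cosine(query_vec, tfidf_vector(tokens, idf))
          for name, tokens in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Here “java” occurs in every document and so gets a small IDF, while “microsoft” occurs in only one and dominates the match. Document length normalization (the “TF normalization” bullet) is handled only implicitly by the cosine.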

Language Modeling Approaches are becoming more and more popular…

What is a Statistical LM? A probability distribution over word sequences –p(“Today is Wednesday”) ≈ 0.001 –p(“Today Wednesday is”) ≈ 0.0000000000001 –p(“The eigenvalue is positive”) ≈ 0.00001 Context-dependent! Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model

The Simplest Language Model (Unigram Model) Generate a piece of text by generating each word INDEPENDENTLY Thus, $p(w_1 w_2 \ldots w_n) = p(w_1)\,p(w_2) \cdots p(w_n)$ Parameters: $\{p(w_i)\}$, with $p(w_1) + \cdots + p(w_N) = 1$ (N is voc. size) Essentially a multinomial distribution over words A piece of text can be regarded as a sample drawn according to this word distribution

Text Generation with Unigram LM [Figure: sampling from a (unigram) language model $\theta$, $p(w|\theta)$, generates a document. Topic 1: Text mining (… text 0.2, mining 0.1, association 0.01, clustering 0.02, … food 0.00001 …) generates a text mining paper; Topic 2: Health (… food 0.25, nutrition 0.1, healthy 0.05, diet 0.02 …) generates a food nutrition paper]

Estimation of Unigram LM [Figure: estimating a (unigram) language model $\theta$, $p(w|\theta)$ = ?, from a document. Document: a “text mining paper” (total #words = 100) with counts text 10, mining 5, association 3, database 3, algorithm 2, … query 1, efficient 1, … Estimates: text 10/100, mining 5/100, association 3/100, database 3/100, … query 1/100, …]
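
The estimation step in this illustration is just relative frequency. The sketch below uses the word counts and the document length of 100 from the slide; the function names are assumptions of this example, and the words hidden behind “…” on the slide are left out rather than guessed.

```python
import math

# Counts listed on the slide for the "text mining paper"; the remaining
# words of the 100-word document are not listed there and are omitted here.
counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
          "algorithm": 2, "query": 1, "efficient": 1}
doc_length = 100

def ml_estimate(word_counts, total):
    """Maximum likelihood unigram model: p(w) = count(w) / total words."""
    return {word: count / total for word, count in word_counts.items()}

def log_prob(words, model):
    """Log probability of a word sequence under a unigram model.

    Each word is generated independently, so the log probabilities add up.
    A word with zero probability makes the whole sequence impossible.
    """
    total = 0.0
    for word in words:
        p = model.get(word, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

model = ml_estimate(counts, doc_length)
print(model["text"], model["mining"])
print(math.exp(log_prob(["text", "mining"], model)))   # p(text) * p(mining)
print(log_prob(["food"], model))                       # unseen word
```

The last line shows the weakness of the maximum likelihood estimate: any word that does not occur in the document gets probability zero.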

Language Models for Retrieval (Ponte & Croft 98) [Figure: each document has its own language model. Text mining paper: … text ?, mining ?, association ?, clustering ?, … food ? …; food nutrition paper: … food ?, nutrition ?, healthy ?, diet ? …] Query = “data mining algorithms”. Which model would most likely have generated this query?

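Ranking Docs by Query Likelihood [Figure: each document $d_1, d_2, \ldots, d_N$ has a doc LM $\theta_{d_1}, \theta_{d_2}, \ldots, \theta_{d_N}$; the query q is scored against each one by its query likelihood $p(q|\theta_{d_1}), p(q|\theta_{d_2}), \ldots, p(q|\theta_{d_N})$]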

Kullback-Leibler (KL) Divergence Retrieval Model Unigram similarity model [equation not preserved in the transcript]; its query entropy term is ignored for ranking Retrieval ⇒ Estimation of $\theta_Q$ and $\theta_D$ Special case: $\theta_Q$ = empirical distribution of q
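
The equations on this slide did not survive extraction. As a hedged reconstruction from the standard definition of KL divergence (the notation $c(w,q)$ for the count of w in the query and $|q|$ for the query length is assumed here), documents can be scored by the negative divergence between the query model and the document model:

$$
-D(\theta_Q \,\|\, \theta_D) = \sum_{w \in V} p(w|\theta_Q) \log p(w|\theta_D) \;-\; \sum_{w \in V} p(w|\theta_Q) \log p(w|\theta_Q)
$$

The second sum, taken with its minus sign, is the entropy of the query model. It does not depend on the document, so it can be ignored for ranking. In the special case where $\theta_Q$ is the empirical distribution of the query, $p(w|\theta_Q) = c(w,q)/|q|$, the first sum becomes

$$
\sum_{w \in V} \frac{c(w,q)}{|q|} \log p(w|\theta_D) = \frac{1}{|q|} \log p(q|\theta_D),
$$

which ranks documents in the same order as the query likelihood.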

Estimating p(w|d) (i.e., $\theta_D$) Simplified Jelinek-Mercer: Shrink uniformly toward p(w|C) Dirichlet prior (Bayesian): Assume pseudo counts $\mu\,p(w|C)$ Absolute discounting: Subtract a constant $\delta$
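
The formulas for the three methods are missing from the transcript. The sketch below uses their commonly cited textbook forms and plugs them into query-likelihood ranking; the parameter names and default values (`lam`, `mu`, `delta`), the function names, and the two toy documents are assumptions of this example. The query is the one used earlier in the slides.

```python
import math
from collections import Counter

def collection_model(docs):
    """Maximum likelihood estimate of p(w|C) over all documents."""
    counts = Counter(token for tokens in docs.values() for token in tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def jelinek_mercer(word, doc_counts, doc_len, p_c, lam=0.1):
    """Interpolate the document estimate with p(w|C) using a fixed weight."""
    return (1 - lam) * doc_counts.get(word, 0) / doc_len + lam * p_c.get(word, 0.0)

def dirichlet_prior(word, doc_counts, doc_len, p_c, mu=2000.0):
    """Add mu * p(w|C) pseudo counts to the document counts."""
    return (doc_counts.get(word, 0) + mu * p_c.get(word, 0.0)) / (doc_len + mu)

def absolute_discounting(word, doc_counts, doc_len, p_c, delta=0.7):
    """Subtract delta from each seen count; give the freed mass to p(w|C)."""
    unique_terms = len(doc_counts)
    discounted = max(doc_counts.get(word, 0) - delta, 0.0) / doc_len
    return discounted + delta * unique_terms / doc_len * p_c.get(word, 0.0)

def query_log_likelihood(query, doc_tokens, p_c, smooth):
    """log p(q | theta_d); assumes every query word occurs in the collection."""
    doc_counts = Counter(doc_tokens)
    return sum(math.log(smooth(word, doc_counts, len(doc_tokens), p_c))
               for word in query)

# Hypothetical toy collection.
docs = {
    "text_mining": "text mining algorithms for text data".split(),
    "nutrition": "food nutrition and healthy diet data".split(),
}
p_c = collection_model(docs)
query = "data mining algorithms".split()
for smooth in (jelinek_mercer, dirichlet_prior, absolute_discounting):
    scores = {name: query_log_likelihood(query, tokens, p_c, smooth)
              for name, tokens in docs.items()}
    print(smooth.__name__, max(scores, key=scores.get))
```

All three variants give unseen query words a nonzero probability, so a document is no longer ruled out just because it misses one query term.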

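Estimating $\theta_Q$ (Feedback) [Figure: Query Q and Document D produce Results; the Feedback Docs F = {$d_1, d_2, \ldots, d_n$} go through a generative model, and the outcome is combined with the query model using a weight $\alpha$] $\alpha = 0$: No feedback; $\alpha = 1$: Full feedback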

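Generative Mixture Model [Figure: each word w in F = {$d_1, \ldots, d_n$} is generated from one of two sources: topic words from $P(w|\theta)$ with P(source) = $1-\lambda$, or background words from $P(w|C)$ with P(source) = $\lambda$; $\theta$ is fit by Maximum Likelihood] $\lambda$ = Noise in feedback documents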

How to Estimate $\theta_F$? [Figure: the observed doc(s) are generated by a mixture of two models. Known background p(w|C), weight $\lambda$ = 0.7: the 0.2, a 0.1, we 0.01, to 0.02, … text 0.0001, mining 0.00005, … Unknown query topic p(w|$\theta_F$) = ? for “Text mining”, weight $1-\lambda$ = 0.3: … text = ?, mining = ?, association = ?, word = ?, …] Suppose we know the identity of each word... ML Estimator

Can We Guess the Identity? Identity (“hidden”) variable: $z_i \in \{1 \text{ (background)}, 0 \text{ (topic)}\}$, one per word of the observed text: the paper presents a text mining algorithm the paper... Suppose the parameters are all known, what’s a reasonable guess of $z_i$? –depends on $\lambda$ (why?) –depends on p(w|C) and p(w|$\theta_F$) (how?) E-step and M-step [equations not preserved in the transcript] Initially, set p(w|$\theta_F$) to some random value, then iterate …
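
The E-step and M-step equations are missing from the transcript, so the following is a sketch of the standard EM updates for a two-component mixture with a fixed background model, not necessarily the exact form shown in the talk. The background probabilities for “the”, “a”, “text”, and “mining” come from the earlier slide; those for “paper”, “presents”, and “algorithm” are made up for this example. The topic model starts from a uniform distribution rather than a random one so that the run is repeatable.

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, p_background, lam=0.7, iterations=50):
    """EM for p(w | theta_F) with a known background model and mixing weight lam.

    Assumes every word in the feedback docs has nonzero background probability.
    """
    counts = Counter(token for tokens in feedback_docs for token in tokens)
    vocab = list(counts)
    p_topic = {w: 1.0 / len(vocab) for w in vocab}   # uniform start (assumption)
    for _ in range(iterations):
        # E-step: probability that an occurrence of w came from the background (z = 1).
        p_z_background = {
            w: lam * p_background[w]
               / (lam * p_background[w] + (1 - lam) * p_topic[w])
            for w in vocab
        }
        # M-step: re-estimate the topic model from the counts credited to the topic.
        topic_counts = {w: counts[w] * (1 - p_z_background[w]) for w in vocab}
        total = sum(topic_counts.values())
        p_topic = {w: topic_counts[w] / total for w in vocab}
    return p_topic

p_background = {
    "the": 0.2, "a": 0.1, "text": 0.0001, "mining": 0.00005,   # from the slide
    "paper": 0.001, "presents": 0.001, "algorithm": 0.001,      # hypothetical
}
feedback_docs = ["the paper presents a text mining algorithm the paper".split()]
p_topic = estimate_feedback_model(feedback_docs, p_background, lam=0.7)
for word, prob in sorted(p_topic.items(), key=lambda item: -item[1]):
    print(f"{word:10s} {prob:.4f}")
```

Words that the background model already explains well, such as “the” and “a”, end up with less topic probability than equally frequent content words such as “paper” and “text”, which is the intended effect of treating the background as noise.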

Example of Feedback Query Model TREC topic 412: “airport security” Mixture model approach, Web database, top 10 docs [Figure: estimated feedback query models for $\lambda$ = 0.9 and $\lambda$ = 0.7]

Problem with Standard IR Methods: Semi-Structured Queries
TREC-2003 Genomics Track, Topic 1: Find articles about the following gene:
OFFICIAL_GENE_NAME activating transcription factor 2
OFFICIAL_SYMBOL ATF2
ALIAS_SYMBOL HB16
ALIAS_SYMBOL CREB2
ALIAS_SYMBOL TREB7
ALIAS_SYMBOL CRE-BP1
Bag-of-word Representation: activating transcription factor 2, ATF2, HB16, CREB2, TREB7, CRE-BP1
Problems with unstructured representation
–Intuitively, matching “ATF2” should be counted more than matching “transcription”
–Such a query is not a natural sample of a unigram language model, violating the assumption of the language modeling retrieval approach

Problem with Standard IR Methods: Semi-Structured Queries (cont.) A topic in TREC-2005 Genomics Track: “Find information about the role of the gene interferon-beta in the disease multiple sclerosis” 3 different fields Should they be weighted differently? What about expansion?

Semi-Structured Language Models Semi-structured query and semi-structured query model [equations not preserved in the transcript] Semi-structured LM estimation: Fit a mixture model to pseudo feedback documents using Expectation-Maximization (EM)

Parameter Estimation Synonym queries: –Each field is estimated using maximum likelihood –Each field has equal weight: $\lambda_i = 1/k$ Aspect queries: –Use top-ranked documents to estimate all the parameters –Similar to single-aspect model, but use query as prior and Bayesian estimation
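
For synonym queries, the recipe above (a maximum likelihood model per field, mixed with equal weights) can be sketched on the ATF2 topic shown earlier. Treating each name or symbol line of the topic as its own field is an assumption of this example, as are the function names; the exact model equations are not preserved in the transcript.

```python
from collections import Counter

def ml_model(tokens):
    """Maximum likelihood unigram model of one field."""
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

def semi_structured_query_model(fields):
    """Mix the per-field models with equal weights 1/k."""
    k = len(fields)
    mixed = {}
    for tokens in fields:
        for word, prob in ml_model(tokens).items():
            mixed[word] = mixed.get(word, 0.0) + prob / k
    return mixed

# One field per name/symbol line of TREC-2003 Genomics Topic 1 (assumed split).
fields = [
    ["activating", "transcription", "factor", "2"],
    ["ATF2"], ["HB16"], ["CREB2"], ["TREB7"], ["CRE-BP1"],
]
structured = semi_structured_query_model(fields)
unstructured = ml_model([token for tokens in fields for token in tokens])

print(structured["ATF2"], structured["transcription"])      # 1/6 vs 1/24
print(unstructured["ATF2"], unstructured["transcription"])  # both 1/9
```

In the bag-of-words model “ATF2” and “transcription” get the same weight, while in the field-based model the one-word symbol fields keep their full share. This matches the intuition that matching “ATF2” should count more than matching “transcription”.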

Maximum Likelihood vs. Bayesian Maximum likelihood estimation –“Best” means “data likelihood reaches maximum” –Problem: small sample Bayesian estimation –“Best” means being consistent with our “prior” knowledge and explaining data well –Problem: how to define prior?

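Illustration of Bayesian Estimation Prior: $p(\theta)$ Likelihood: $p(X|\theta)$, $X = (x_1, \ldots, x_N)$ Posterior: $p(\theta|X) \propto p(X|\theta)\,p(\theta)$ [Figure: prior, likelihood, and posterior curves over $\theta$, marking $\theta_0$: prior mode, $\theta_{ml}$: ML estimate, $\theta$: posterior mode]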

Experiment Results
Query Model | TREC 2003 (Uniform weights)      | TREC 2005 (Estimated weights)
            | Unstruct | Semi-struct | Imp.   | Unstruct | Semi-struct | Imp.
MAP         | 0.16     | 0.185       | +13.5% | 0.242    | 0.258       | +6.6%
Pr@10docs   | 0.14     | 0.154       | +10%   | 0.382    | 0.412       | +7.8%

More Experiment Results (with slightly different model)

Conclusions Standard IR techniques are effective for biomedical literature retrieval Modeling and exploiting the structure in a query can improve accuracy Overall TREC Genomics Track findings –Domain-specific resources are very useful –Sound retrieval models and machine learning techniques are helpful

Future Work Using HMMs to model relevant documents Incorporate biomedical resources into principled statistical models
