Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst Computer Science Division and SIMS University of California, Berkeley Supported by NSF DBI and a gift from Genentech
Plan Overview Noun compound (NC) bracketing Problems with Web Counts Layers of annotation Applying LQL to NC bracketing Evaluation
Overview Motivation: Need to re-use results of NLP processing: for additional processing for end applications: data mining etc. Proposed solution: Layers of annotations over text Illustration: Application to noun compound bracketing
Plan Overview Noun compound (NC) bracketing Problems with Web Counts Layers of annotation Applying LQL to NC bracketing Evaluation
Noun Compound Bracketing (a)[ [ liver cell ] antibody ] (left bracketing) (b)[ liver [cell line] ] (right bracketing) In (a), the antibody targets the cell line. In (b), the cell line is derived from the liver.
Related Work Pustejosky et al. (1993) adjacency model:Pr(w 1 |w 2 ) vs. Pr(w 2 |w 3 ) Lauer (1995) dependency model:Pr(w 1 |w 3 ) vs. Pr(w 2 |w 3 ) Keller & Lapata (2004): use the Web unigrams and bigrams Nakov & Hearst (2005): will be presented at coNLL! use the Web, Chi-squared n-grams paraphrases surface features
Nakov & Hearst (2005) Web page hits: proxy for n-gram frequencies Sample surface features amino-acid sequence left brain stem’s cell left brain’s stem cell right Majority vote to combine different models Accuracy 89.34%
Plan Overview Noun compound (NC) bracketing Problems with Web Counts Layers of annotation Applying LQL to NC bracketing Evaluation
Web Counts: Problems The Web lacks linguistic annotation Pr(health|care) = #(“health care”) / #(care) “health”: returns nouns “care”: returns both verbs and nouns can be adjacent by chance can come from different sentences Cannot find: stem cells VERB PREPOSITION brain protein synthesis’ inhibition Page hits are inaccurate
Plan Overview Noun compound (NC) bracketing Problems with Web Counts Layers of annotation Applying LQL to NC bracketing Evaluation
Solution: MEDLINE+LQL MEDLINE: ~13 million abstracts We annotated: 1.4 million abstracts ~10 million sentences ~320 million annotations Layered Query Language: demo at ACL!
The System Built on top of an RDBMS system Supports layers of annotations over text hierarchical, overlapping cannot be represented by a single-file XML Specialized query language LQL (Layered Query Language)
Annotated Example
Plan Overview Noun compound (NC) bracketing Problems with Web Counts Layers of annotation Applying LQL to NC bracketing Evaluation
Noun Compound Extraction (1) FROM [layer=’shallow_parse’ && tag_type=’NP’ ˆ [layer=’pos’ && tag_type="noun"] [layer=’pos’ && tag_type="noun"] [layer=’pos’ && tag_type="noun"] $ ] AS compound SELECT compound.content layers’ beginnings should match layers’ endings should match
Noun Compound Extraction (2) SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq FROM BEGIN_LQL FROM [layer=’shallow_parse’ && tag_type=’NP’ ˆ [layer=’pos’ && tag_type="noun"] [layer=’pos’ && tag_type="noun"] [layer=’pos’ && tag_type="noun"] $ ] AS compound SELECT compound.content END_LQL GROUP BY lc ORDER BY freq DESC
Noun Compound Extraction (3) SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq FROM BEGIN_LQL FROM [layer=’shallow_parse’ && tag_type=’NP’ ˆ ( { ALLOW GAPS } ![layer=’pos’ && tag_type="noun"] ( [layer=’pos’ && tag_type="noun"] [layer=’pos’ && tag_type="noun"] [layer=’pos’ && tag_type="noun"] ) $ ) $ ] AS compound SELECT compound.content END_LQL GROUP BY lc ORDER BY freq DESC layer negation artificial range
Finding Bigram Counts SELECT COUNT(*) AS freq FROM BEGIN_LQL FROM [layer=’shallow_parse’ && tag_type=’NP’ [layer=’pos’ && tag_type="noun“ && content="immunodeficiency"] AS word1 [layer=’pos’ && tag_type="noun“ && (content="virus"||content="viruses")] ] ] SELECT word1.content END_LQL GROUP BY lc ORDER BY freq DESC
Paraphrases Types of paraphrases (Warren,1978): Prepositional immunodeficiency virus in humans right Verbal virus causing human immunodeficiency left immunodeficiency virus found in humans left Copula immunodeficiency virus that is human right
Prepositional Paraphrases SELECT LOWER(prep.content) lp, COUNT(*) AS freq FROM BEGIN_LQL FROM [layer=’sentence’ [layer=’pos’ && tag_type="noun" && content = "immunodeficiency"] [layer=’pos’ && tag_type="noun" && content IN ("virus","viruses")] [layer=’pos’ && tag_type=’IN’] AS prep ?[layer=’pos’ && tag_type=’DT’ && content IN ("the","a","an")] [layer=’pos’ && tag_type="noun" && content IN ("human", "humans")] ] SELECT prep.content END_LQL GROUP BY lp, ORDER BY freq DESC optional layer
Plan Overview Noun compound (NC) bracketing Problems with Web Counts Layers of annotation Applying LQL to NC bracketing Evaluation
obtained 418,678 noun compounds (NCs) annotated the top 232 NCs (after cleaning) agreement 88% kappa.606 baseline (left): 83.19% n-grams: Pr, #, χ 2 prepositional paraphrases for inflections, we used UMLS
Results correct N/Awrong
Discussion Semantics of bone marrow cells top verbal paraphrases cells derived from bone marrow (22 instances) cells isolated from bone marrow (14 instances) top prepositional paraphrases cells in bone marrow (456 instances) cells from bone marrow (108 instances) Finding hard examples for NC bracketing w 1 w 2 w 3 such that both w 1 w 2 and w 2 w 3 are MeSH terms
The End Thank you!