Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst Computer.


1 Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst Computer Science Division and SIMS University of California, Berkeley http://biotext.berkeley.edu Supported by NSF DBI-0317510 and a gift from Genentech

2 Plan Overview Noun compound (NC) bracketing Problems with Web Counts Layers of annotation Applying LQL to NC bracketing Evaluation

3 Overview Motivation: need to re-use the results of NLP processing, both for additional processing and for end applications (data mining, etc.). Proposed solution: layers of annotations over text. Illustration: application to noun compound bracketing.

4 Plan Overview Noun compound (NC) bracketing Problems with Web Counts Layers of annotation Applying LQL to NC bracketing Evaluation

5 Noun Compound Bracketing (a) [ [ liver cell ] antibody ] (left bracketing); (b) [ liver [ cell line ] ] (right bracketing). In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.

6 Related Work Pustejovsky et al. (1993), adjacency model: Pr(w1|w2) vs. Pr(w2|w3). Lauer (1995), dependency model: Pr(w1|w2) vs. Pr(w1|w3). Keller & Lapata (2004): use the Web; unigrams and bigrams. Nakov & Hearst (2005), to be presented at CoNLL: use the Web; Chi-squared; n-grams; paraphrases; surface features.
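
The two models differ only in which word pairs they compare. In the usual formulation, the adjacency model compares the association of w1 with w2 against that of w2 with w3, while the dependency model compares w1 with w2 against w1 with w3. The Python sketch below illustrates that decision rule; the `bracket` function, the `assoc` callback, and the left-on-tie default are assumptions made for illustration, not code from the presentation.

```python
from typing import Callable


def bracket(w1: str, w2: str, w3: str,
            assoc: Callable[[str, str], float],
            model: str = "dependency") -> str:
    """Return 'left' or 'right' for the three-word compound w1 w2 w3.

    `assoc(a, b)` is any association score for the pair (a, b), for
    example a bigram count or a conditional probability estimate.
    """
    if model == "adjacency":
        left, right = assoc(w1, w2), assoc(w2, w3)
    elif model == "dependency":
        left, right = assoc(w1, w2), assoc(w1, w3)
    else:
        raise ValueError(f"unknown model: {model}")
    # Assumption: ties go to 'left', the more frequent class.
    return "left" if left >= right else "right"


# Hypothetical pair counts, invented for this example.
toy_counts = {
    ("liver", "cell"): 50,
    ("cell", "antibody"): 5,
    ("liver", "antibody"): 2,
}


def toy_assoc(a: str, b: str) -> float:
    return toy_counts.get((a, b), 0)


decision = bracket("liver", "cell", "antibody", toy_assoc, "dependency")
```

With these toy counts both models choose left bracketing, matching reading (a) of the liver cell antibody example.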

7 Nakov & Hearst (2005) Web page hits: proxy for n-gram frequencies. Sample surface features: amino-acid sequence → left; brain stem's cell → left; brain's stem cell → right. Majority vote to combine different models. Accuracy: 89.34%.
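
Combining models by majority vote can be sketched in a few lines of Python. The function name, the convention that a model returns `None` when it has no evidence, and the left default on ties are all assumptions for illustration; the slide states only that a majority vote is used.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_vote(votes: Iterable[Optional[str]], default: str = "left") -> str:
    """Combine per-model predictions ('left', 'right', or None) by majority."""
    # Models that abstain (None) or return anything else are ignored.
    tally = Counter(v for v in votes if v in ("left", "right"))
    if tally["left"] == tally["right"]:
        return default  # assumption: fall back to a fixed default on ties
    return "left" if tally["left"] > tally["right"] else "right"


# Example: three models vote, one abstains.
combined = majority_vote(["left", "right", "left", None])
```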

8 Plan Overview Noun compound (NC) bracketing Problems with Web Counts Layers of annotation Applying LQL to NC bracketing Evaluation

9 Web Counts: Problems The Web lacks linguistic annotation. Pr(health|care) = #("health care") / #(care): "health" returns nouns; "care" returns both verbs and nouns; the two words can be adjacent by chance or can come from different sentences. Cannot find: stem cells VERB PREPOSITION brain; protein synthesis' inhibition. Page hits are inaccurate.
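
A small worked example shows why the missing part-of-speech information matters for Pr(health|care). All counts below are hypothetical and chosen only to make the arithmetic easy to follow; the helper name `conditional_prob` is likewise an assumption.

```python
def conditional_prob(bigram_count: int, head_count: int) -> float:
    """Estimate Pr(w1|w2) as #("w1 w2") / #(w2)."""
    if head_count <= 0:
        raise ValueError("head_count must be positive")
    return bigram_count / head_count


# Hypothetical counts, invented for illustration only.
HEALTH_CARE = 300    # occurrences of the bigram "health care"
CARE_AS_NOUN = 600   # occurrences of "care" tagged as a noun
CARE_AS_VERB = 400   # occurrences of "care" tagged as a verb

# A search engine cannot separate noun and verb uses of "care".
web_style = conditional_prob(HEALTH_CARE, CARE_AS_NOUN + CARE_AS_VERB)

# With POS annotation the denominator can be restricted to nouns.
pos_aware = conditional_prob(HEALTH_CARE, CARE_AS_NOUN)
```

Under these invented counts the unannotated estimate is 300/1000 and the POS-aware estimate is 300/600, so the verb uses of "care" dilute the probability.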

10 Plan Overview Noun compound (NC) bracketing Problems with Web Counts Layers of annotation Applying LQL to NC bracketing Evaluation

11 Solution: MEDLINE+LQL MEDLINE: ~13 million abstracts. We annotated: 1.4 million abstracts; ~10 million sentences; ~320 million annotations. Layered Query Language: demo at ACL! http://biotext.berkeley.edu/lql/

12 The System Built on top of an RDBMS. Supports layers of annotations over text: hierarchical and overlapping, so they cannot be represented by a single-file XML. Specialized query language: LQL (Layered Query Language).
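
One way to picture layered annotations in a relational database is a single table of character spans, each tagged with its layer. The sketch below uses Python's built-in `sqlite3` module; the table name, column names, and sample sentence are hypothetical and are not the schema of the system described in the slides.

```python
import sqlite3

TEXT = "liver cell antibody binds"

# (layer, tag_type, start, end) with end-exclusive character offsets.
ANNOTATIONS = [
    ("pos", "noun", 0, 5),
    ("pos", "noun", 6, 10),
    ("pos", "noun", 11, 19),
    ("pos", "verb", 20, 25),
    ("shallow_parse", "NP", 0, 19),
    ("shallow_parse", "VP", 20, 25),
]


def build_db() -> sqlite3.Connection:
    """Create an in-memory database holding the sample annotations."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation ("
        "layer TEXT, tag_type TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?)", ANNOTATIONS)
    return conn


def nouns_inside_np(conn: sqlite3.Connection) -> list:
    """Return the text of every noun token contained in an NP span."""
    rows = conn.execute(
        """
        SELECT tok.start_pos, tok.end_pos
        FROM annotation AS np
        JOIN annotation AS tok
          ON tok.start_pos >= np.start_pos AND tok.end_pos <= np.end_pos
        WHERE np.layer = 'shallow_parse' AND np.tag_type = 'NP'
          AND tok.layer = 'pos' AND tok.tag_type = 'noun'
        ORDER BY tok.start_pos
        """
    ).fetchall()
    return [TEXT[s:e] for s, e in rows]


nouns = nouns_inside_np(build_db())
```

Because spans from different layers live in the same table, overlapping or nested annotations need no special treatment; containment is just a comparison of offsets.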

13 Annotated Example

14 Plan Overview Noun compound (NC) bracketing Problems with Web Counts Layers of annotation Applying LQL to NC bracketing Evaluation

15 Noun Compound Extraction (1)
    FROM
      [layer='shallow_parse' && tag_type='NP'
        ^ [layer='pos' && tag_type="noun"]
          [layer='pos' && tag_type="noun"]
          [layer='pos' && tag_type="noun"] $
      ] AS compound
    SELECT compound.content
  Callouts: ^ means the layers' beginnings should match; $ means the layers' endings should match.
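
The effect of this query can be imitated outside LQL. The Python sketch below keeps only noun phrases that consist of exactly three noun tokens, which is what anchoring the three nouns to the beginning and end of the NP amounts to. The token and chunk representations are assumptions for illustration and do not reflect how the LQL engine evaluates the query.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag)
Chunk = Tuple[int, int]  # an NP as [start, end) token indices


def three_noun_compounds(tokens: List[Token], nps: List[Chunk]) -> List[str]:
    """Return the text of every NP made up of exactly three nouns."""
    compounds = []
    for start, end in nps:
        span = tokens[start:end]
        # Beginning and end anchors: the three nouns must cover the whole NP.
        if len(span) == 3 and all(tag == "noun" for _, tag in span):
            compounds.append(" ".join(word for word, _ in span))
    return compounds


sample_tokens = [
    ("human", "noun"), ("immunodeficiency", "noun"), ("virus", "noun"),
    ("infects", "verb"),
    ("the", "DT"), ("liver", "noun"), ("cell", "noun"),
]
sample_nps = [(0, 3), (4, 7)]  # "human immunodeficiency virus", "the liver cell"

found = three_noun_compounds(sample_tokens, sample_nps)
```

The second NP is rejected because its first token is a determiner, so only "human immunodeficiency virus" is returned.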

16 Noun Compound Extraction (2)
    SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
    FROM
    BEGIN_LQL
      FROM
        [layer='shallow_parse' && tag_type='NP'
          ^ [layer='pos' && tag_type="noun"]
            [layer='pos' && tag_type="noun"]
            [layer='pos' && tag_type="noun"] $
        ] AS compound
      SELECT compound.content
    END_LQL
    GROUP BY lc
    ORDER BY freq DESC
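
The outer SQL wrapper lower-cases each extracted compound, groups identical strings, and sorts by frequency. A rough Python equivalent, assuming the compounds have already been extracted into a list (the function name and sample data are hypothetical):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def compound_frequencies(compounds: Iterable[str]) -> List[Tuple[str, int]]:
    """Lower-case, group, and sort compounds by descending frequency."""
    return Counter(c.lower() for c in compounds).most_common()


freqs = compound_frequencies(
    ["Bone marrow cells", "bone marrow cells", "T cell line"]
)
```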

17 Noun Compound Extraction (3)
    SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
    FROM
    BEGIN_LQL
      FROM
        [layer='shallow_parse' && tag_type='NP'
          ^ ( { ALLOW GAPS }
              ![layer='pos' && tag_type="noun"]
              ( [layer='pos' && tag_type="noun"]
                [layer='pos' && tag_type="noun"]
                [layer='pos' && tag_type="noun"] ) $
            ) $
        ] AS compound
      SELECT compound.content
    END_LQL
    GROUP BY lc
    ORDER BY freq DESC
  Callouts: ! marks layer negation; ( ... ) marks an artificial range.

18 Finding Bigram Counts
    SELECT COUNT(*) AS freq
    FROM
    BEGIN_LQL
      FROM
        [layer='shallow_parse' && tag_type='NP'
          [layer='pos' && tag_type="noun" && content="immunodeficiency"] AS word1
          [layer='pos' && tag_type="noun" && (content="virus" || content="viruses")]
        ]
      SELECT word1.content
    END_LQL

19 Paraphrases Types of paraphrases (Warren, 1978). Prepositional: immunodeficiency virus in humans → right. Verbal: virus causing human immunodeficiency → left; immunodeficiency virus found in humans → right. Copula: immunodeficiency virus that is human → right.

20 Prepositional Paraphrases
    SELECT LOWER(prep.content) lp, COUNT(*) AS freq
    FROM
    BEGIN_LQL
      FROM
        [layer='sentence'
          [layer='pos' && tag_type="noun" && content = "immunodeficiency"]
          [layer='pos' && tag_type="noun" && content IN ("virus","viruses")]
          [layer='pos' && tag_type='IN'] AS prep
          ?[layer='pos' && tag_type='DT' && content IN ("the","a","an")]
          [layer='pos' && tag_type="noun" && content IN ("human", "humans")]
        ]
      SELECT prep.content
    END_LQL
    GROUP BY lp
    ORDER BY freq DESC
  Callout: ? marks an optional layer.
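
The same pattern can be sketched over a POS-tagged sentence in Python: two adjacent nouns, a preposition, an optional determiner, and the third noun. The function name, the token representation, and the sample sentence are assumptions for illustration; the tags mirror those used in the query above.

```python
from collections import Counter
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def prepositional_paraphrases(sentence: List[Token], w2: str,
                              w3_forms: Set[str], w1_forms: Set[str]) -> Counter:
    """Count prepositions in patterns 'w2 w3 PREP (DT) w1'."""
    preps: Counter = Counter()
    words = [(w.lower(), t) for w, t in sentence]
    for i in range(len(words) - 3):
        if words[i] != (w2, "noun"):
            continue
        if words[i + 1][1] != "noun" or words[i + 1][0] not in w3_forms:
            continue
        prep_word, prep_tag = words[i + 2]
        if prep_tag != "IN":
            continue
        j = i + 3
        # Optional determiner, mirroring the optional layer in the query.
        if words[j][1] == "DT" and words[j][0] in {"the", "a", "an"}:
            j += 1
        if j < len(words) and words[j][1] == "noun" and words[j][0] in w1_forms:
            preps[prep_word] += 1
    return preps


sample = [("Immunodeficiency", "noun"), ("viruses", "noun"),
          ("in", "IN"), ("the", "DT"), ("human", "noun")]
counts = prepositional_paraphrases(
    sample, "immunodeficiency", {"virus", "viruses"}, {"human", "humans"}
)
```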

21 Plan Overview Noun compound (NC) bracketing Problems with Web Counts Layers of annotation Applying LQL to NC bracketing Evaluation

22 obtained 418,678 noun compounds (NCs); annotated the top 232 NCs (after cleaning): agreement 88%, kappa .606; baseline (left): 83.19%; n-grams: Pr, #, χ²; prepositional paraphrases; for inflections, we used UMLS.
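
Raw agreement and kappa measure different things: kappa discounts the agreement two annotators would reach by chance, which is high when one label dominates. The sketch below computes Cohen's kappa for two label sequences. Whether the reported figure is Cohen's kappa is an assumption, and the labels in the example are invented, not the annotation data from the evaluation.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's label distribution.
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Invented labels: 8 of 10 items agree, but most labels are 'left'.
ann_a = ["left"] * 8 + ["right"] * 2
ann_b = ["left"] * 7 + ["right", "right", "left"]
kappa = cohens_kappa(ann_a, ann_b)
```

In this toy case the observed agreement is 0.8 and the chance agreement is 0.68, so kappa is far lower than the raw agreement suggests.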

23 Results (figure not reproduced; categories: correct, N/A, wrong)

24 Discussion Semantics of bone marrow cells. Top verbal paraphrases: cells derived from bone marrow (22 instances); cells isolated from bone marrow (14 instances). Top prepositional paraphrases: cells in bone marrow (456 instances); cells from bone marrow (108 instances). Finding hard examples for NC bracketing: w1 w2 w3 such that both w1 w2 and w2 w3 are MeSH terms.
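
The hard-example criterion is easy to state in code: a three-word compound is hard when both of its bigrams are known terms, so neither bracketing is ruled out by the vocabulary. The Python sketch below assumes the term list is available as a plain set of strings; the function name and the small term set are illustrative stand-ins, not an actual MeSH lookup.

```python
from typing import Iterable, List, Set


def hard_examples(compounds: Iterable[str], terms: Set[str]) -> List[str]:
    """Return compounds w1 w2 w3 where both 'w1 w2' and 'w2 w3' are terms."""
    known = {t.lower() for t in terms}
    hard = []
    for nc in compounds:
        words = nc.lower().split()
        if len(words) != 3:
            continue
        w1, w2, w3 = words
        if f"{w1} {w2}" in known and f"{w2} {w3}" in known:
            hard.append(nc)
    return hard


# Illustrative term set standing in for a real MeSH vocabulary.
toy_terms = {"bone marrow", "marrow cells", "liver cell"}
hard = hard_examples(["bone marrow cells", "liver cell antibody"], toy_terms)
```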

25 The End Thank you!

