Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst Computer.

Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst Computer Science Division and SIMS University of California, Berkeley Supported by NSF DBI and a gift from Genentech

Plan
- Overview
- Noun compound (NC) bracketing
- Problems with Web counts
- Layers of annotation
- Applying LQL to NC bracketing
- Evaluation

Overview
- Motivation: the results of NLP processing need to be reusable
  - for additional processing
  - for end applications: data mining, etc.
- Proposed solution: layers of annotations over text
- Illustration: application to noun compound bracketing

Noun Compound Bracketing
(a) [ [ liver cell ] antibody ] (left bracketing)
(b) [ liver [ cell line ] ] (right bracketing)
In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.

Related Work
- Pustejovsky et al. (1993), adjacency model: Pr(w1|w2) vs. Pr(w2|w3)
- Lauer (1995), dependency model: Pr(w1|w2) vs. Pr(w1|w3)
- Keller & Lapata (2004): use the Web; unigrams and bigrams
- Nakov & Hearst (2005), to be presented at CoNLL: use the Web; chi-squared; n-grams, paraphrases, surface features
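The two models differ only in which word pairs they compare. Following Lauer's (1995) formulation, the adjacency model compares the w1-w2 association with the w2-w3 association, while the dependency model compares w1-w2 with w1-w3. A minimal sketch, assuming a hypothetical association table `ASSOC` whose scores are invented for illustration:

```python
# Hypothetical association scores between word pairs (made-up numbers).
ASSOC = {
    ("liver", "cell"): 0.40,
    ("cell", "antibody"): 0.10,
    ("liver", "antibody"): 0.05,
}


def assoc(a, b):
    """Look up the association score of an ordered word pair (0.0 if unseen)."""
    return ASSOC.get((a, b), 0.0)


def bracket_adjacency(w1, w2, w3):
    """Adjacency model: compare (w1, w2) against (w2, w3)."""
    return "left" if assoc(w1, w2) >= assoc(w2, w3) else "right"


def bracket_dependency(w1, w2, w3):
    """Dependency model: compare (w1, w2) against (w1, w3)."""
    return "left" if assoc(w1, w2) >= assoc(w1, w3) else "right"
```

Breaking ties in favor of "left" is an assumption of this sketch, chosen because left bracketing is the majority class in the evaluation.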

Nakov & Hearst (2005)
- Web page hits: proxy for n-gram frequencies
- Sample surface features
  - amino-acid sequence → left
  - brain stem's cell → left
  - brain's stem cell → right
- Majority vote to combine different models
- Accuracy: 89.34%
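The surface features and the majority vote can be sketched as follows. This is an illustration only: the function names, the use of `None` for a model that abstains, and the tie-breaking default are assumptions, not details taken from the slides.

```python
from collections import Counter


def surface_feature(snippet, w1, w2, w3):
    """Return 'left', 'right', or None based on the three sample features."""
    s = snippet.lower()
    if f"{w1}-{w2} {w3}" in s:      # hyphen joins w1 and w2
        return "left"
    if f"{w1} {w2}'s {w3}" in s:    # possessive on w2 groups w1 with w2
        return "left"
    if f"{w1}'s {w2} {w3}" in s:    # possessive on w1 separates it from w2 w3
        return "right"
    return None


def majority_vote(predictions, default="left"):
    """Combine per-model predictions; models that abstain contribute None."""
    votes = Counter(p for p in predictions if p in ("left", "right"))
    if votes["left"] == votes["right"]:
        return default  # assumed tie-break: fall back to the default class
    return "left" if votes["left"] > votes["right"] else "right"
```

For example, `majority_vote([surface_feature("brain's stem cell", "brain", "stem", "cell"), "right", "left"])` combines one surface-feature vote with two votes from other hypothetical models.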

Web Counts: Problems
- The Web lacks linguistic annotation
  - Pr(health|care) = #("health care") / #(care)
    - "health": returns nouns
    - "care": returns both verbs and nouns
    - the two words can be adjacent by chance
    - the two words can come from different sentences
- Cannot find:
  - stem cells VERB PREPOSITION brain
  - protein synthesis' inhibition
- Page hits are inaccurate
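To see why the missing part-of-speech annotation matters, consider the estimate Pr(health|care) = #("health care") / #(care). The sketch below uses invented counts: if half of the occurrences of "care" are verbs, the unannotated denominator is twice as large as it should be and the estimate is halved.

```python
def cond_prob(bigram_count, unigram_count):
    """Estimate Pr(w1|w2) as #(w1 w2) / #(w2); 0.0 when w2 is unseen."""
    if unigram_count == 0:
        return 0.0
    return bigram_count / unigram_count


# Hypothetical counts, for illustration only.
HEALTH_CARE = 600      # occurrences of the bigram "health care"
CARE_ANY_POS = 2000    # "care" as noun or verb (what a raw Web count sees)
CARE_NOUN_ONLY = 1000  # "care" tagged as a noun (needs POS annotation)

web_style_estimate = cond_prob(HEALTH_CARE, CARE_ANY_POS)
annotated_estimate = cond_prob(HEALTH_CARE, CARE_NOUN_ONLY)
```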

Solution: MEDLINE + LQL
- MEDLINE: ~13 million abstracts
- We annotated:
  - 1.4 million abstracts
  - ~10 million sentences
  - ~320 million annotations
- Layered Query Language (LQL): demo at ACL!

The System
- Built on top of an RDBMS
- Supports layers of annotations over text
  - hierarchical, overlapping
  - cannot be represented by a single-file XML
- Specialized query language: LQL (Layered Query Language)
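One way to picture layered annotations is as stand-off spans over character offsets, each tagged with the layer it belongs to. The sketch below is an assumption about the general idea, not the system's actual schema; the `Annotation` class and the `term` layer are hypothetical. It shows two annotations that overlap without nesting, which is the case a single XML tree cannot express.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str   # e.g. 'pos', 'shallow_parse', or a hypothetical 'term' layer
    tag: str
    start: int   # inclusive character offset
    end: int     # exclusive character offset


def within(inner, outer):
    """True if `inner` lies entirely inside `outer` (hierarchical case)."""
    return outer.start <= inner.start and inner.end <= outer.end


def overlaps_without_nesting(a, b):
    """True if the spans cross each other and neither contains the other."""
    crosses = a.start < b.end and b.start < a.end
    return crosses and not within(a, b) and not within(b, a)


TEXT = "liver cell line"
np_span = Annotation("shallow_parse", "NP", 0, 15)
liver_cell = Annotation("term", "candidate", 0, 10)   # "liver cell"
cell_line = Annotation("term", "candidate", 6, 15)    # "cell line"
```

Here `liver_cell` and `cell_line` both sit inside `np_span`, yet they cross each other, so no single element hierarchy can hold both.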

Annotated Example

Noun Compound Extraction (1)

    FROM
      [layer='shallow_parse' && tag_type='NP'
        ^ [layer='pos' && tag_type="noun"]
          [layer='pos' && tag_type="noun"]
          [layer='pos' && tag_type="noun"] $
      ] AS compound
    SELECT compound.content

- ^ : layers' beginnings should match
- $ : layers' endings should match
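The intent of this query, as read here, is to return noun phrases that consist of exactly three nouns, with the first noun starting where the NP starts and the last noun ending where the NP ends. A rough Python analogue over in-memory data is sketched below; the token and chunk representations are assumptions made for illustration and say nothing about how LQL evaluates the query.

```python
def exact_three_noun_nps(tokens, np_chunks):
    """Return NPs made of exactly three nouns.

    tokens:    list of (word, pos) pairs for one sentence
    np_chunks: list of (start, end) token index ranges, end exclusive
    """
    compounds = []
    for start, end in np_chunks:
        span = tokens[start:end]
        # Both boundaries must coincide, so the NP is exactly three tokens long.
        if len(span) == 3 and all(pos == "noun" for _, pos in span):
            compounds.append(" ".join(word for word, _ in span))
    return compounds


SENTENCE = [
    ("the", "det"), ("liver", "noun"), ("cell", "noun"),
    ("line", "noun"), ("grew", "verb"),
]
```

With this sentence, an NP chunk covering "liver cell line" matches, while a chunk that also includes the determiner does not, because its beginning no longer coincides with the first noun.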

Noun Compound Extraction (2)

    SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
    FROM
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_type='NP'
            ^ [layer='pos' && tag_type="noun"]
              [layer='pos' && tag_type="noun"]
              [layer='pos' && tag_type="noun"] $
          ] AS compound
        SELECT compound.content
      END_LQL
    GROUP BY lc
    ORDER BY freq DESC

Noun Compound Extraction (3)

    SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
    FROM
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_type='NP'
            ^ ( { ALLOW GAPS }
                ![layer='pos' && tag_type="noun"]
                ( [layer='pos' && tag_type="noun"]
                  [layer='pos' && tag_type="noun"]
                  [layer='pos' && tag_type="noun"] ) $
              ) $
          ] AS compound
        SELECT compound.content
      END_LQL
    GROUP BY lc
    ORDER BY freq DESC

- ! : layer negation
- ( ... ) : artificial range
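One reading of this query is: inside an NP, skip any leading material, require a token that is not a noun, and then require exactly three nouns that run to the end of the NP; the results are lowercased, counted, and sorted by frequency. The sketch below mirrors that reading in Python. The data layout and function names are hypothetical, and the interpretation of the gap and negation operators is an assumption.

```python
from collections import Counter


def trailing_three_nouns(np_tokens):
    """Return the last three words of an NP if they are nouns that are
    immediately preceded by a non-noun token; otherwise None."""
    if len(np_tokens) < 4:
        return None
    head, tail = np_tokens[:-3], np_tokens[-3:]
    if head[-1][1] == "noun":          # negated layer: preceding token must not be a noun
        return None
    if not all(pos == "noun" for _, pos in tail):
        return None
    return " ".join(word for word, _ in tail)


def compound_frequencies(noun_phrases):
    """Lowercase, count, and sort compounds by descending frequency."""
    counts = Counter()
    for np_tokens in noun_phrases:
        compound = trailing_three_nouns(np_tokens)
        if compound is not None:
            counts[compound.lower()] += 1
    return counts.most_common()


NOUN_PHRASES = [
    [("the", "det"), ("Liver", "noun"), ("cell", "noun"), ("line", "noun")],
    [("a", "det"), ("liver", "noun"), ("cell", "noun"), ("line", "noun")],
    [("bone", "noun"), ("marrow", "noun"), ("cell", "noun"), ("line", "noun")],
    [("new", "adj"), ("health", "noun"), ("care", "noun"), ("personnel", "noun")],
]
```

The third phrase is rejected because its last three nouns are the tail of a longer noun sequence, which is the case the negated layer is meant to exclude.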

Finding Bigram Counts

    SELECT COUNT(*) AS freq
    FROM
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_type='NP'
            [layer='pos' && tag_type="noun" && content="immunodeficiency"] AS word1
            [layer='pos' && tag_type="noun" && (content="virus" || content="viruses")]
          ]
        SELECT word1.content
      END_LQL

Paraphrases
Types of paraphrases (Warren, 1978):
- Prepositional
  - immunodeficiency virus in humans → right
- Verbal
  - virus causing human immunodeficiency → left
  - immunodeficiency virus found in humans → right
- Copula
  - immunodeficiency virus that is human → right

Prepositional Paraphrases

    SELECT LOWER(prep.content) lp, COUNT(*) AS freq
    FROM
      BEGIN_LQL
        FROM
          [layer='sentence'
            [layer='pos' && tag_type="noun" && content = "immunodeficiency"]
            [layer='pos' && tag_type="noun" && content IN ("virus","viruses")]
            [layer='pos' && tag_type='IN'] AS prep
            ?[layer='pos' && tag_type='DT' && content IN ("the","a","an")]
            [layer='pos' && tag_type="noun" && content IN ("human", "humans")]
          ]
        SELECT prep.content
      END_LQL
    GROUP BY lp
    ORDER BY freq DESC

- ? : optional layer
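The pattern in this query (modifier noun, head noun, preposition, optional determiner, object noun) can be mimicked over POS-tagged sentences. The sketch below assumes Penn Treebank style tags (`NN`, `NNS`, `IN`, `DT`) and a list-of-pairs sentence format; both are assumptions for illustration rather than properties of the annotated MEDLINE collection.

```python
from collections import Counter

DETERMINERS = {"the", "a", "an"}


def count_prepositions(sentences, modifier, head_forms, object_forms):
    """Count prepositions in 'modifier head PREP (DT)? object' patterns.

    sentences: list of sentences, each a list of (word, pos) pairs
    """
    counts = Counter()
    for sent in sentences:
        for i in range(len(sent) - 3):
            (t0, p0), (t1, p1), (t2, p2) = sent[i], sent[i + 1], sent[i + 2]
            if not (p0.startswith("NN") and t0.lower() == modifier):
                continue
            if not (p1.startswith("NN") and t1.lower() in head_forms):
                continue
            if p2 != "IN":
                continue
            j = i + 3
            # Optional layer: skip one determiner if present.
            if sent[j][1] == "DT" and sent[j][0].lower() in DETERMINERS:
                j += 1
            if (j < len(sent) and sent[j][1].startswith("NN")
                    and sent[j][0].lower() in object_forms):
                counts[t2.lower()] += 1
    return counts.most_common()


SENTENCES = [
    [("immunodeficiency", "NN"), ("virus", "NN"), ("in", "IN"), ("humans", "NNS")],
    [("The", "DT"), ("immunodeficiency", "NN"), ("viruses", "NNS"),
     ("in", "IN"), ("the", "DT"), ("human", "NN")],
    [("immunodeficiency", "NN"), ("virus", "NN"), ("of", "IN"),
     ("a", "DT"), ("human", "NN")],
]
```

Calling `count_prepositions(SENTENCES, "immunodeficiency", {"virus", "viruses"}, {"human", "humans"})` returns the prepositions ordered by frequency, in the same spirit as the `GROUP BY lp ORDER BY freq DESC` clause.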

Evaluation
- Obtained 418,678 noun compounds (NCs)
- Annotated the top 232 NCs (after cleaning)
  - agreement: 88%
  - kappa: .606
- Baseline (left): 83.19%
- n-grams: Pr, #, χ²
- Prepositional paraphrases
- For inflections, we used UMLS
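The agreement and kappa figures can be related through the usual definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below assumes the reported statistic is Cohen's kappa for two annotators, which the slide does not state, and solves for the chance agreement implied by the two reported numbers.

```python
def cohen_kappa(p_observed, p_expected):
    """Cohen's kappa from observed and chance-expected agreement."""
    return (p_observed - p_expected) / (1.0 - p_expected)


def implied_chance_agreement(p_observed, kappa):
    """Invert the kappa formula to recover the chance agreement p_e."""
    return (p_observed - kappa) / (1.0 - kappa)


P_OBSERVED = 0.88   # agreement reported on the slide
KAPPA = 0.606       # kappa reported on the slide

p_expected = implied_chance_agreement(P_OBSERVED, KAPPA)
```

Under that assumption the implied chance agreement is approximately 0.695, which is plausible for a two-way decision in which one class (left bracketing) dominates.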

Results
[Table with columns: correct, N/A, wrong; the values are not preserved in the transcript]

Discussion
- Semantics of bone marrow cells
  - top verbal paraphrases
    - cells derived from bone marrow (22 instances)
    - cells isolated from bone marrow (14 instances)
  - top prepositional paraphrases
    - cells in bone marrow (456 instances)
    - cells from bone marrow (108 instances)
- Finding hard examples for NC bracketing
  - w1 w2 w3 such that both w1 w2 and w2 w3 are MeSH terms
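The hard-example criterion is simple to state in code: a three-word compound is hard when both of its bigrams are themselves known terms, so a lexicon lookup cannot decide the bracketing. A minimal sketch, where `KNOWN_TERMS` is a small hypothetical stand-in for a MeSH lookup and not actual MeSH content:

```python
# Hypothetical stand-in for a MeSH term lookup (illustrative entries only).
KNOWN_TERMS = {"liver cell", "cell line", "bone marrow"}


def hard_examples(compounds, terms):
    """Return three-word compounds whose two bigrams are both known terms."""
    hard = []
    for compound in compounds:
        words = compound.lower().split()
        if len(words) != 3:
            continue
        w1, w2, w3 = words
        if f"{w1} {w2}" in terms and f"{w2} {w3}" in terms:
            hard.append(compound)
    return hard


CANDIDATES = ["liver cell line", "bone marrow cells", "liver cell antibody"]
```

With these inputs only "liver cell line" qualifies, since both "liver cell" and "cell line" appear in the stand-in term set.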

The End Thank you!