Unambiguous + Unlimited = Unsupervised
Using the Web for Natural Language Processing Problems
Marti Hearst, School of Information, UC Berkeley
Joint work with Preslav Nakov
BYU CS Colloquium, Dec 6, 2007
This research supported in part by NSF DBI

Natural Language Processing
- The ultimate goal: write programs that read and understand stories and conversations.
- This is too hard! Instead we tackle sub-problems.
- There have been notable successes lately:
  - Machine translation is vastly improved
  - Speech recognition is decent in limited circumstances
  - Text categorization works with some accuracy

How can a machine understand these differences?
- Get the cat with the gloves.

How can a machine understand these differences?
- Get the sock from the cat with the gloves.
- Get the glove from the cat with the socks.

How can a machine understand these differences?
- Decorate the cake with the frosting.
- Decorate the cake with the kids.
- Throw out the cake with the frosting.
- Throw out the cake with the kids.

Why is this difficult?
- Same syntactic structure, different meanings.
- Natural language processing algorithms have to deal with the specifics of individual words.
- Enormous vocabulary sizes:
  - The average English speaker's vocabulary is around 50,000 words,
  - many of these can be combined with many others,
  - and they mean different things when they do!

How to tackle this problem?
- The field was stuck for quite some time:
  - hand-enter all semantic concepts and relations.
- A new approach started around 1990:
  - get large text collections,
  - compute statistics over the words in those collections.
- There are many different algorithms.

Size Matters
- Recent realization: bigger is better than smarter!
- Banko and Brill '01: "Scaling to Very, Very Large Corpora for Natural Language Disambiguation", ACL.

Example Problem
- Grammar checker example: which word to use, principal or principle?
- Solution: use well-edited text and look at which words surround each use:
  - I am in my third year as the principal of Anamosa High School.
  - School-principal transfers caused some upset.
  - This is a simple formulation of the quantum mechanical uncertainty principle.
  - Power without principle is barren, but principle without power is futile. (Tony Blair)

Using Very, Very Large Corpora
- Keep track of which words are the neighbors of each spelling in well-edited text, e.g.:
  - Principal: "high school"
  - Principle: "rule"
- At grammar-check time, choose the spelling best predicted by the surrounding words.
- Surprising results:
  - Log-linear improvement even to a billion words!
  - Getting more data is better than fine-tuning algorithms!
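
As an illustration, here is a minimal sketch of the neighbor-count idea. The tiny count table and the choose_spelling helper are purely hypothetical; a real system would gather neighbor counts from a very large, well-edited corpus.

```python
from collections import Counter

# Hypothetical neighbor counts, as if gathered from a large edited corpus.
NEIGHBOR_COUNTS = {
    "principal": Counter({"school": 120, "high": 95, "vice": 40}),
    "principle": Counter({"uncertainty": 80, "basic": 60, "moral": 55}),
}

def choose_spelling(candidates, context_words):
    """Pick the candidate whose corpus neighbors best match the context."""
    def score(cand):
        counts = NEIGHBOR_COUNTS.get(cand, Counter())
        return sum(counts[w] for w in context_words)
    return max(candidates, key=score)

print(choose_spelling(["principal", "principle"], ["high", "school"]))  # -> principal
```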

The Effects of LARGE Datasets
(Learning-curve figure from Banko & Brill '01.)

How to Extend this Idea?
- This is an exciting result...
- BUT it relies on having huge amounts of text that has been appropriately annotated!

How to Avoid Manual Labeling?
- "Web as a baseline" (Lapata & Keller 04, 05)
- Main idea: apply web-determined counts to every problem imaginable.
- Example: for each candidate t in { }:
  - compute f(w−1, t, w+1);
  - the largest count wins.
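
A small sketch of this recipe, with a stand-in for the web-count lookup; the counts dictionary below is hypothetical, and a real system would query a search engine or a web-scale n-gram collection.

```python
# Hypothetical page-hit counts for the pattern "w-1 t w+1".
WEB_COUNTS = {
    "eat the spaghetti": 50_000,
    "eat a spaghetti": 3_000,
    "eat an spaghetti": 40,
}

def web_count(phrase: str) -> int:
    # Stand-in for a search-engine or n-gram-corpus lookup.
    return WEB_COUNTS.get(phrase, 0)

def pick_candidate(w_prev, candidates, w_next):
    """Return the candidate t with the largest count for 'w_prev t w_next'."""
    return max(candidates, key=lambda t: web_count(f"{w_prev} {t} {w_next}"))

print(pick_candidate("eat", ["the", "a", "an"], "spaghetti"))  # -> "the"
```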

Web as a Baseline
- Works very well in some cases:
  - machine translation candidate selection
  - article generation
  - noun compound interpretation
  - noun compound bracketing
  - adjective ordering
- But lacking in others:
  - spelling correction
  - countability detection
  - prepositional phrase attachment
- How to push this idea further?
(Slide callouts: "Significantly better than the best supervised algorithm." / "Not significantly different from the best supervised.")

Using Unambiguous Cases
- The trick: look for unambiguous cases to start.
- Use these to improve the results beyond what co-occurrence statistics indicate.
- An early example: Hindle and Rooth, "Structural Ambiguity and Lexical Relations", ACL '90, Comp Ling '93.
- Problem: prepositional phrase attachment.
  - I eat/v spaghetti/n1 with/p a fork/n2.
  - I eat/v spaghetti/n1 with/p sauce/n2.
- Question: does n2 attach to v or to n1?

Using Unambiguous Cases
- How to do this with unlabeled data?
- First try:
  - Parse some text into phrase structure.
  - Then compute certain co-occurrences: f(v, n1, p), f(n1, p), f(v, n1).
  - Problem: results not accurate enough.
- The trick: look for unambiguous cases:
  - Spaghetti with sauce is delicious. (pre-verbal)
  - I eat with a fork. (no direct object)
- Use these to improve the results beyond what co-occurrence statistics indicate.

Using Unambiguous Cases
- Hindle & Rooth, final algorithm:
  - Parse text into phrase structure.
  - Create bigram counts (v, p) and (n1, p) as follows:
    - First, use unambiguous cases to populate the bigram table.
    - Then, for the ambiguous cases:
      - Compute a Lexical Association score comparing (v1, n1, p) to (n1, p, n2).
      - If this is greater than a threshold, update the bigram table with the assumed attachment.
      - Else split the score and assign it to both attachments.
  - The bigram table is used for further computations of the Lexical Association score.
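
A hedged sketch of a Hindle & Rooth-style lexical association score, using the common log-ratio formulation; the exact score and threshold in the original paper differ in details, and the smoothing constant here is an assumption.

```python
import math

def lexical_association(f_v_p, f_v, f_n1_p, f_n1, smooth=0.5):
    """Compare how strongly preposition p associates with the verb vs. with n1.

    Counts f(v, p), f(v), f(n1, p), f(n1) would be populated first from
    unambiguous cases (pre-verbal NPs, verbs with no direct object).
    """
    p_given_v = (f_v_p + smooth) / (f_v + smooth)
    p_given_n1 = (f_n1_p + smooth) / (f_n1 + smooth)
    return math.log2(p_given_v / p_given_n1)

# Well above a threshold -> verb attachment; strongly negative -> noun
# attachment; otherwise split the credit between both attachments.
score = lexical_association(f_v_p=120, f_v=1000, f_n1_p=5, f_n1=400)
print(round(score, 2))  # clearly positive -> verb attachment
```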

Unambiguous + Unlimited = Unsupervised
- Apply the unambiguous-case idea to the very, very large corpora idea.
- The potential of these approaches is not fully realized.
- Our work (with Preslav Nakov):
  - Structural ambiguity decisions:
    - PP attachment
    - noun compound bracketing
    - coordination grouping
  - Semantic relation acquisition:
    - hypernym (ISA) relations
    - verbal relations between nouns
    - SAT analogy problems

Structural Ambiguity Problems
- Apply the U + U = U idea to structural ambiguity:
  - noun compound bracketing
  - prepositional phrase attachment
  - noun phrase coordination
- Motivation: the BioText project
  - In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).
  - Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.
  - BimL protein interact with Bcl-2 or Bcl-XL, or Bcl-w proteins (Immuno-precipitation (anti-Bcl-2 OR Bcl-XL or Bcl-w)) followed by Western blot (anti-EEtag) using extracts human 293T cells co-transfected with EE-tagged BimL and (bcl-2 or bcl-XL or bcl-w) plasmids)

Applying U + U = U to Structural Ambiguity
- We introduce the use of (nearly) unambiguous features:
  - surface features
  - paraphrases
- Combined with n-grams from very, very large corpora.
- Achieve state-of-the-art results without labeled examples.

Noun Compound Bracketing
(a) [ [ liver cell ] antibody ]  (left bracketing)
(b) [ liver [ cell line ] ]  (right bracketing)
In (a), the antibody targets the liver cell.
In (b), the cell line is derived from the liver.

Dependency Model (w1 w2 w3)
- Right bracketing: [ w1 [ w2 w3 ] ]
  - w2 w3 is a compound (modified by w1): home health care
  - or w1 and w2 independently modify w3: adult male rat
- Left bracketing: [ [ w1 w2 ] w3 ]
  - only one modificational choice possible: law enforcement officer

Related Work
- Marcus (1980), Pustejovsky et al. (1993), Resnik (1993)
  - adjacency model: Pr(w1|w2) vs. Pr(w2|w3)
- Lauer (1995)
  - dependency model: Pr(w1|w2) vs. Pr(w1|w3)
- Keller & Lapata (2004)
  - use the Web
  - unigrams and bigrams
- Girju et al. (2005)
  - supervised model
  - bracketing in context
  - requires WordNet senses to be given
- Our approach: the Web as data
  - χ², n-grams, paraphrases, surface features

Our U + U + U Algorithm
- Compute bigram estimates.
- Compute estimates from surface features.
- Compute estimates from paraphrases.
- Combine these scores with a voting algorithm to choose left or right bracketing.
- We use the same general approach for two other structural ambiguity problems.
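
To make the combination step concrete, here is a minimal majority-vote sketch. The abstention handling and tie behavior are assumptions, not the exact combination rule used in the published experiments.

```python
def majority_vote(predictions):
    """Combine per-source predictions ('left', 'right', or None for abstain)."""
    votes = [p for p in predictions if p in ("left", "right")]
    left, right = votes.count("left"), votes.count("right")
    if left > right:
        return "left"
    if right > left:
        return "right"
    return None  # tie: a real system would fall back to a default model

# e.g. bigram model and paraphrases say left, one surface feature says right
print(majority_vote(["left", None, "left", "right"]))  # -> "left"
```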

Using n-grams to make predictions
- Say we are trying to distinguish:
  - [home health] care
  - home [health care]
- Main idea: compare these co-occurrence probabilities:
  - "home health" vs.
  - "health care"

Computing Bigram Statistics
- Dependency model, frequencies:
  - compare #(w1, w2) to #(w1, w3)
- Dependency model, probabilities:
  - Pr(left) = Pr(w1→w2 | w2) · Pr(w2→w3 | w3)
  - Pr(right) = Pr(w1→w3 | w3) · Pr(w2→w3 | w3)
  - So we compare Pr(w1→w2 | w2) to Pr(w1→w3 | w3).

Using n-grams to estimate probabilities
- Using page hits as a proxy for n-gram counts:
  - Pr(w1→w2 | w2) = #(w1, w2) / #(w2)
  - #(w2): word frequency; query for "w2"
  - #(w1, w2): bigram frequency; query for "w1 w2"
  - smoothed by 0.5
- Use χ² to determine if w1 is associated with w2 (thus indicating left bracketing), and likewise for w1 with w3.
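
A sketch of the dependency comparison with page hits as bigram proxies; the hit counts below are made up for illustration, and a real system would read them off exact-phrase search queries.

```python
HITS = {  # hypothetical page-hit counts
    "law enforcement": 20_000_000, "law officer": 300_000,
    "enforcement": 25_000_000, "officer": 60_000_000,
}

def prob(bigram, unigram, smooth=0.5):
    """Estimate Pr(w_i -> w_j | w_j) = #(w_i, w_j) / #(w_j), smoothed by 0.5."""
    return (HITS.get(bigram, 0) + smooth) / (HITS.get(unigram, 0) + smooth)

def bracket(w1, w2, w3):
    left = prob(f"{w1} {w2}", w2)    # Pr(w1 -> w2 | w2)
    right = prob(f"{w1} {w3}", w3)   # Pr(w1 -> w3 | w3)
    return "left" if left > right else "right"

print(bracket("law", "enforcement", "officer"))  # -> "left"
```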

Association Models: χ² (Chi Squared)
- A = #(wi, wj)
- B = #(wi) − #(wi, wj)
- C = #(wj) − #(wi, wj)
- D = N − (A + B + C)
- N = 8 trillion (= A + B + C + D): 8 billion Web pages × 1,000 words
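
For completeness, here is the standard 2×2 chi-squared score computed from those four cells; the formula itself is not spelled out on the slide, so treat this as the usual textbook form.

```python
def chi_squared(cooc, f_i, f_j, n=8e12):
    """2x2 chi-squared association between w_i and w_j from web counts."""
    a = cooc              # A = #(w_i, w_j)
    b = f_i - cooc        # B = #(w_i) - #(w_i, w_j)
    c = f_j - cooc        # C = #(w_j) - #(w_i, w_j)
    d = n - (a + b + c)   # D = N - (A + B + C)
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# A larger score for (w1, w2) than for (w1, w3) favors left bracketing.
print(chi_squared(cooc=20_000_000, f_i=45_000_000, f_j=25_000_000))
```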

Our U + U + U Algorithm
- Compute bigram estimates.
- Compute estimates from surface features.
- Compute estimates from paraphrases.
- Combine these scores with a voting algorithm to choose left or right bracketing.

Web-derived Surface Features
- Authors often disambiguate noun compounds using surface markers, e.g.:
  - amino-acid sequence → left
  - brain stem's cell → left
  - brain's stem cell → right
- The enormous size of the Web makes these frequent enough to be useful.

Web-derived Surface Features: Dash (hyphen)
- Left dash: cell-cycle analysis → left
- Right dash: donor T-cell → right
- Double dash: T-cell-depletion → unusable

Web-derived Surface Features: Possessive Marker
- Attached to the first word: brain's stem cell → right
- Attached to the second word: brain stem's cell → left
- Combined features: brain's stem-cell → right

Web-derived Surface Features: Capitalization
- anycase – lowercase – uppercase:
  - Plasmodium vivax Malaria → left
  - plasmodium vivax Malaria → left
- lowercase – uppercase – anycase:
  - brain Stem cell → right
  - brain Stem Cell → right
- Disable this on:
  - Roman numerals
  - single-letter words, e.g. vitamin D deficiency

Web-derived Surface Features: Embedded Slash
- Left embedded slash: leukemia/lymphoma cell → right

Web-derived Surface Features: Parentheses
- Single-word:
  - growth factor (beta) → left
  - (brain) stem cell → right
- Two-word:
  - (growth factor) beta → left
  - brain (stem cell) → right

Web-derived Surface Features: Comma, Dot, Semicolon
- Following the first word:
  - home. health care → right
  - adult, male rat → right
- Following the second word:
  - health care, provider → left
  - lung cancer: patients → left

Web-derived Surface Features: Dash to External Word
- External word to the left: mouse-brain stem cell → right
- External word to the right: tumor necrosis factor-alpha → left

Web-derived Surface Features: Problems & Solutions
- Problem: search engines ignore punctuation in queries
  - "brain-stem cell" does not work
- Solution:
  - query for "brain stem cell",
  - obtain 1,000 document summaries,
  - scan for the features in these summaries.
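
A sketch of that snippet-scanning step for a few of the features above (dash and possessive). The regular expressions and voting scheme are illustrative assumptions; the real system recognizes many more feature types.

```python
import re

def surface_votes(w1, w2, w3, snippets):
    """Count left/right cues for 'w1 w2 w3' in a list of snippet strings."""
    cues = {
        "left": [
            rf"\b{w1}-{w2}\s+{w3}\b",      # left dash: cell-cycle analysis
            rf"\b{w1}\s+{w2}'s\s+{w3}\b",  # possessive on second word: brain stem's cell
        ],
        "right": [
            rf"\b{w1}\s+{w2}-{w3}\b",      # right dash: donor T-cell
            rf"\b{w1}'s\s+{w2}\s+{w3}\b",  # possessive on first word: brain's stem cell
        ],
    }
    votes = {"left": 0, "right": 0}
    for label, patterns in cues.items():
        for pat in patterns:
            rx = re.compile(pat, re.IGNORECASE)
            votes[label] += sum(len(rx.findall(s)) for s in snippets)
    return votes

print(surface_votes("brain", "stem", "cell",
                    ["the brain stem's cell bodies", "a brain's stem cell line"]))
```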

Other Web-derived Features: Possessive Marker
- We can also query directly for possessives:
  - Yes, "brain stem's cell" sort of works.
- Search engines:
  - drop the possessive marker,
  - but the s is kept.
- Still, we cannot query for "brain stems' cell".

Other Web-derived Features: Abbreviation
- After the second word: tumor necrosis factor (NF) → right
- After the third word: tumor necrosis (TN) factor → right
- We query for, e.g., "tumor necrosis tn factor"
- Problems:
  - Roman numerals: IV, VI
  - states: CA
  - short words: me

Other Web-derived Features: Concatenation
- Consider health care reform:
  - healthcare: 79,500,000
  - carereform: 269
  - healthreform: 812
- Adjacency model: healthcare vs. carereform
- Dependency model: healthcare vs. healthreform
- Triples: "healthcare reform" vs. "health carereform"

Other Web-derived Features: Using Google's * Operator
- Each * allows a one-word wildcard.
- Single star:
  - "health care * reform" → left
  - "health * care reform" → right
- More stars and/or reverse order:
  - "care reform * * health" → right

Other Web-derived Features: Reorder
- Reorders for "health care reform":
  - "care reform health" → right
  - "reform health care" → left

Other Web-derived Features: Internal Inflection Variability
- Vary the inflection of the second word:
  - tyrosine kinase activation
  - tyrosine kinases activation

Other Web-derived Features: Switch the First Two Words
- Predict right if we can reorder adult male rat as male adult rat.

Our U + U + U Algorithm
- Compute bigram estimates.
- Compute estimates from surface features.
- Compute estimates from paraphrases.
- Combine these scores with a voting algorithm to choose left or right bracketing.

Paraphrases
- The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978).
- Prepositional:
  - stem cells in the brain → right
  - cells from the brain stem → left
- Verbal:
  - virus causing human immunodeficiency → left
- Copula:
  - office building that is a skyscraper → right

Paraphrases
- Lauer (1995), Keller & Lapata (2003), and Girju et al. (2005) predict NC semantics by choosing the most likely preposition: of, for, in, at, on, from, with, about, (like).
- This could be problematic when more than one preposition is possible.
- In contrast:
  - we try to predict syntax, not semantics;
  - we do not disambiguate, we just add up all counts:
    - cells in (the) bone marrow → left
    - cells from (the) bone marrow → left

Paraphrases
- Prepositional paraphrases: we use ~150 prepositions.
- Verbal paraphrases: we use associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to, and used by/in/for.
- Copula paraphrases: we use is/was and that/which/who.
- Optional elements:
  - articles: a, an, the
  - quantifiers: some, every, etc.
  - pronouns: this, these, etc.
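
A small sketch of how prepositional paraphrase queries might be generated for a compound w1 w2 w3. The preposition list is a tiny illustrative subset, the query templates with a fixed article are simplified assumptions, and the naive "+s" pluralization stands in for the morphological tools mentioned later.

```python
PREPS = ["of", "for", "in", "at", "on", "from", "with", "about"]

def right_paraphrases(w1, w2, w3):
    # [w1 [w2 w3]]: "w2 w3 PREP the w1", e.g. "stem cells in the brain"
    return [f"{w2} {w3}s {p} the {w1}" for p in PREPS]

def left_paraphrases(w1, w2, w3):
    # [[w1 w2] w3]: "w3 PREP the w1 w2", e.g. "cells from the brain stem"
    return [f"{w3}s {p} the {w1} {w2}" for p in PREPS]

# Each string would be issued as an exact-phrase query; the bracketing whose
# paraphrases collect more total hits gets that vote.
print(right_paraphrases("brain", "stem", "cell")[:3])
print(left_paraphrases("brain", "stem", "cell")[:3])
```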

Paraphrases: pattern (1)
(1) v n1 p n2 → v n2 n1  (noun)
- Can we turn "n1 p n2" into a noun compound "n2 n1"?
  - meet/v demands/n1 from/p customers/n2 → meet/v the customer/n2 demands/n1
- Problem: ditransitive verbs like give:
  - gave/v an apple/n1 to/p him/n2 → gave/v him/n2 an apple/n1
- Solution:
  - no determiner before n1,
  - a determiner before n2 is required,
  - the preposition cannot be to.

Paraphrases: pattern (2)
(2) v n1 p n2 → v p n2 n1  (verb)
- If "p n2" is an indirect object of v, then it could be switched with the direct object n1.
  - had/v a program/n1 in/p place/n2 → had/v in/p place/n2 a program/n1
- A determiner before n1 is required to prevent "n2 n1" from forming a noun compound.

Paraphrases: pattern (3)
(3) v n1 p n2 → p n2 * v n1  (verb)
- "*" indicates a wildcard position (up to three intervening words are allowed).
- Looks for cases where the PP has moved in front of the verb, e.g.:
  - I gave/v an apple/n1 to/p him/n2 → to/p him/n2 I gave/v an apple/n1

Paraphrases: pattern (4)
(4) v n1 p n2 → n1 p n2 v  (noun)
- Looks for cases where "n1 p n2" has moved in front of v:
  - shaken/v confidence/n1 in/p markets/n2 → confidence/n1 in/p markets/n2 shaken/v

Paraphrases: pattern (5)
(5) v n1 p n2 → v PRONOUN p n2  (verb)
- n1 being a pronoun indicates verb attachment (Hindle & Rooth, 93).
- Pattern (5) substitutes n1 with a dative pronoun (him or her), e.g.:
  - put/v a client/n1 at/p odds/n2 → put/v him at/p odds/n2

Paraphrases: pattern (6)
(6) v n1 p n2 → BE n1 p n2  (noun)
- BE is typically used with a noun attachment.
- Pattern (6) substitutes v with a form of to be (is or are), e.g.:
  - eat/v spaghetti/n1 with/p sauce/n2 → is spaghetti/n1 with/p sauce/n2

Our U + U + U Algorithm
- Compute bigram estimates.
- Compute estimates from surface features.
- Compute estimates from paraphrases.
- Combine these scores with a voting algorithm to choose left or right bracketing.

Evaluation: Datasets
- Lauer set:
  - 244 noun compounds (NCs)
  - from Grolier's encyclopedia
  - inter-annotator agreement: 81.5%
- Biomedical set:
  - 430 NCs
  - from MEDLINE
  - inter-annotator agreement: 88% (κ = 0.606)

Evaluation: Experiments
- Exact phrase queries
- Limited to English
- Inflections:
  - Lauer set: Carroll's morphological tools
  - Biomedical set: UMLS Specialist Lexicon

Co-occurrence Statistics
(Results tables for the Lauer set and the Bio set.)

Paraphrase and Surface Features Performance
(Results tables for the Lauer set and the Biomedical set.)

Individual Surface Features Performance: Bio
(Results table.)

Individual Surface Features Performance: Bio (continued)
(Results table.)

Results: Lauer Set
(Results figure.)

Results: Comparing with Others
(Comparison figure.)

Results: Biomedical Set
(Results figure.)

Results for Noun Compound Bracketing
- Introduced search engine statistics that go beyond the n-gram (applicable to other tasks):
  - surface features
  - paraphrases
- Obtained new state-of-the-art results on NC bracketing:
  - more robust than Lauer (1995)
  - more accurate than Keller & Lapata (2004)

Prepositional Phrase Attachment
Problem:
(a) Peter spent millions of dollars. (noun attach)
(b) Peter spent time with his family. (verb attach)
Which attachment for the quadruple (v, n1, p, n2)?
Results:
- Much simpler than other algorithms.
- As good as or better than the best unsupervised approaches, and better than some supervised approaches.

Related Work
Supervised:
- (Brill & Resnik, 94): transformation-based learning, WordNet classes, P=82%
- (Ratnaparkhi et al., 94): maximum entropy, word classes (MI), P=81.6%
- (Collins & Brooks, 95): back-off, P=84.5%
- (Stetina & Nagao, 97): decision trees, WordNet, P=88.1%
- (Toutanova et al., 04): morphology, syntax, WordNet, P=87.5%
Unsupervised:
- (Hindle & Rooth, 93): partially parsed corpus, lexical associations over subsets of (v, n1, p), P=80%, R=80%
- (Ratnaparkhi, 98): POS-tagged corpus, unambiguous cases for (v, n1, p), (n1, p, n2), classifier: P=81.9%
- (Pantel & Lin, 00): collocation database, dependency parser, large corpus (125M words), P=84.3% (unsupervised state of the art)

PP-attachment: Our Approach
- Unsupervised
- (v, n1, p, n2) quadruples, Ratnaparkhi test set
- Google and MSN Search
- Exact phrase queries
- Inflections: WordNet 2.0
- Adding determiners where appropriate
- Models:
  - n-gram association models
  - Web-derived surface features
  - paraphrases

N-gram Models
- (i) Pr(p|n1) vs. Pr(p|v)
- (ii) Pr(p, n2|n1) vs. Pr(p, n2|v)
  - I eat/v spaghetti/n1 with/p a fork/n2.
  - I eat/v spaghetti/n1 with/p sauce/n2.
- Pr or # (frequency)
- Smoothing as in (Hindle & Rooth, 93)
- Back-off from (ii) to (i)
- N-grams are unreliable if n1 or n2 is a pronoun.
- MSN Search: no rounding of n-gram estimates.
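
A sketch of model (i), with the back-off step omitted; the hit counts are hypothetical and only illustrate the comparison of Pr(p|v) with Pr(p|n1).

```python
HITS = {  # hypothetical exact-phrase hit counts
    "eat with": 8_000_000, "eat": 90_000_000,
    "spaghetti with": 300_000, "spaghetti": 9_000_000,
}

def attach(v, n1, p, smooth=0.5):
    """Model (i): compare Pr(p|v) with Pr(p|n1) from web counts."""
    p_given_v = (HITS.get(f"{v} {p}", 0) + smooth) / (HITS.get(v, 0) + smooth)
    p_given_n1 = (HITS.get(f"{n1} {p}", 0) + smooth) / (HITS.get(n1, 0) + smooth)
    return "verb" if p_given_v > p_given_n1 else "noun"

print(attach("eat", "spaghetti", "with"))  # -> "verb" with these made-up counts
```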

Web-derived Surface Features
- Example features (precision, recall for each):
  - open the door / with a key → verb (100.00%, 0.13%)
  - open the door (with a key) → verb (73.58%, 2.44%)
  - open the door – with a key → verb (68.18%, 2.03%)
  - open the door, with a key → verb (58.44%, 7.09%)
  - eat Spaghetti with sauce → noun (100.00%, 0.14%)
  - eat ? spaghetti with sauce → noun (83.33%, 0.55%)
  - eat, spaghetti with sauce → noun (65.77%, 5.11%)
  - eat : spaghetti with sauce → noun (64.71%, 1.57%)
- Summing achieves high precision, low recall.

Paraphrases
v n1 p n2 →
- v n2 n1  (noun)
- v p n2 n1  (verb)
- p n2 * v n1  (verb)
- n1 p n2 v  (noun)
- v PRONOUN p n2  (verb)
- BE n1 p n2  (noun)
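
A sketch of instantiating the six paraphrase patterns for a quadruple (v, n1, p, n2). Each string would be sent as an exact-phrase query (the "*" pattern using the search engine's wildcard); patterns marked "noun"/"verb" vote for that attachment when found. The fixed pronoun and copula forms are simplifications.

```python
def paraphrase_queries(v, n1, p, n2):
    return [
        (f"{v} {n2} {n1}",        "noun"),  # (1) v n2 n1
        (f"{v} {p} {n2} {n1}",    "verb"),  # (2) v p n2 n1
        (f"{p} {n2} * {v} {n1}",  "verb"),  # (3) p n2 * v n1
        (f"{n1} {p} {n2} {v}",    "noun"),  # (4) n1 p n2 v
        (f"{v} him {p} {n2}",     "verb"),  # (5) v PRONOUN p n2
        (f"is {n1} {p} {n2}",     "noun"),  # (6) BE n1 p n2
    ]

for query, label in paraphrase_queries("eat", "spaghetti", "with", "sauce"):
    print(label, ":", query)
```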

Evaluation
Ratnaparkhi dataset:
- 3097 test examples, e.g.:
  - prepare dinner for family → V
  - shipped crabs from province → V
- n1 or n2 is a bare determiner: 149 examples (a problem for unsupervised methods)
  - left chairmanship of the → N
  - is the of kind → N
  - acquire securities for an → N
- Special symbols (%, /, &, etc.): 230 examples (a problem for Web queries)
  - buy % for 10 → V
  - beat S&P-down from % → V
  - is 43%-owned by firm → N

Results
(Results table.)
- Simpler but not significantly different from 84.3% (Pantel & Lin, 00).
- For prepositions other than of (of → noun attachment).
- Models in bold are combined in a majority vote.

Noun Phrase Coordination
- (Modified) real sentence:
  - The Department of Chronic Diseases and Health Promotion leads and strengthens global efforts to prevent and control chronic diseases or disabilities and to promote health and quality of life.

NC Coordination: Ellipsis
- Ellipsis: car and truck production means car production and truck production.
- No ellipsis: president and chief executive.
- All-way coordination: Securities and Exchange Commission.

NC Coordination: Ellipsis
- Quadruple (n1, c, n2, h)
- Penn Treebank annotations:
  - ellipsis: (NP car/NN and/CC truck/NN production/NN)
  - no ellipsis: (NP (NP president/NN) and/CC (NP chief/NN executive/NN))
  - all-way: can be annotated either way
- This is a problem a parser must deal with. Collins' parser always predicts ellipsis, but other parsers (e.g. Charniak's) try to solve it.

Results
- 428 examples from the Penn Treebank.
(Results table.)

New Application: Machine Translation
- Main idea: use syntactic paraphrases of source sentences to create more training examples for the same target translation.
- Still working on this; starting to get measurable improvements.

Semantic Relation Detection
- Goal: automatically augment a lexical database.
- Many potential relation types:
  - ISA (hypernymy/hyponymy)
  - Part-Of (meronymy)
- Idea: find unambiguous contexts which (nearly) always indicate the relation of interest.

Lexico-Syntactic Patterns
(Two slides of example patterns, shown as figures.)

Adding a New Relation
(Example shown as a figure.)

Semantic Relation Detection
- Lexico-syntactic patterns:
  - should occur frequently in text,
  - should (nearly) always suggest the relation of interest,
  - should be recognizable with little pre-encoded knowledge.
- These patterns have been used extensively by other researchers.
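
As a concrete illustration of one such pattern family ("NP such as NP, NP, ... and NP" for hypernymy), here is a rough sketch; the regular-expression approximation of noun phrases is an assumption, since a real system would match over chunked or parsed text.

```python
import re

# "HYPERNYM such as HYPO1, HYPO2, ... and HYPOn"
SUCH_AS = re.compile(r"(\w+) such as ([\w ]+(?:, [\w ]+)*(?:,? (?:and|or) [\w ]+)?)")

def extract_isa(text):
    """Return (hyponym, hypernym) pairs found via the 'such as' pattern."""
    pairs = []
    for hyper, hypo_list in SUCH_AS.findall(text):
        hypos = re.split(r",\s*|\s+(?:and|or)\s+", hypo_list)
        pairs.extend((h.strip(), hyper) for h in hypos if h.strip())
    return pairs

print(extract_isa("injuries such as bruises, wounds and broken bones"))
# -> [('bruises', 'injuries'), ('wounds', 'injuries'), ('broken bones', 'injuries')]
```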

Semantic Relation Detection
- What relationship holds between two nouns?
  - olive oil – oil comes from olives
  - machine oil – oil used on machines
- Assigning the meaning relations between these terms has been seen as a very difficult problem.
- Our solution: use clever queries against the web to figure out the relations.

Queries for Semantic Relations
- Convert the noun-noun compound into a query of the form: noun2 that * noun1
  - "oil that * olive(s)"
- This returns search result snippets containing interesting verbs.
- In this case:
  - come from
  - be obtained from
  - be extracted from
  - made from
  - ...
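
A rough sketch of this query-and-mine step; the snippet list, the naive word-grabbing regex, and the plural handling are all stand-ins (a real system would call a search API and POS-tag the snippets to keep only verb groups).

```python
from collections import Counter
import re

def relation_query(n1, n2):
    """Build the wildcard query for a compound 'n1 n2', e.g. 'olive oil'."""
    return f'"{n2} that * {n1}" OR "{n2} that * {n1}s"'

def mine_relation_words(n1, n2, snippets):
    """Collect what appears between 'n2 that' and 'n1' in snippet strings."""
    pat = re.compile(rf"\b{n2} that ([\w ]+?) {n1}s?\b", re.IGNORECASE)
    words = Counter()
    for s in snippets:
        for m in pat.finditer(s):
            words[m.group(1).lower()] += 1
    return words.most_common()

print(relation_query("olive", "oil"))
print(mine_relation_words("olive", "oil",
    ["an oil that is extracted from olives", "oil that comes from the olive"]))
```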

Uncovering Semantic Relations
- More examples:
  - migraine drug → treat, be used for, reduce, prevent
  - wrinkle drug → treat, be used for, reduce, smooth
  - printer tray → hold, come with, be folded, fit under, be inserted into
  - student protest → be led by, be sponsored by, pit, be, be organized by

Conclusions
- Unambiguous + Unlimited = Unsupervised.
- The enormous size of the web opens new opportunities for text analysis:
  - There are many words, but they are more likely to appear together in a huge dataset.
  - This allows us to do word-specific analysis.
- To counter the labeled-data roadblock, we start with unambiguous features that we can find naturally.
- We've applied this to structural and semantic language problems.
- These are stepping stones towards sophisticated language understanding.

Thank you! Supported in part by NSF DBI