The Role of Machine Learning in NLP (or: Confessions of an Addict: Machine Learning as the Steroids of NLP)
Eduard Hovy
USC Information Sciences Institute

Lesson 1: Banko and Brill, HLT-01
Task: confusion set disambiguation: {you're | your}, {to | too | two}, {it's | its}
5 algorithms: n-gram table, winnow, perceptron, transformation-based learning, decision trees
Training: from 1 million up to 1 billion words
Lessons:
– All methods improved to almost the same point
– A simple method can end up above a complex one
– Don't waste your time with algorithms and optimization
You don't need a smart algorithm, you just need enough training data
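To make the "n-gram table" baseline concrete, here is a minimal sketch of confusion-set disambiguation by local n-gram counts. The toy corpus, confusion sets, and back-off weight are invented for illustration; this is not Banko and Brill's code.

```python
from collections import Counter
from itertools import tee

# Hypothetical toy corpus; Banko and Brill trained on up to a billion words of real text.
corpus = "i want to go too and two dogs want to go".split()

def bigrams(tokens):
    a, b = tee(tokens)
    next(b, None)
    return zip(a, b)

counts = Counter(bigrams(corpus))
unigrams = Counter(corpus)

CONFUSION_SETS = [{"to", "too", "two"}, {"your", "you're"}, {"its", "it's"}]

def disambiguate(prev_word, candidate, next_word):
    """Pick the confusion-set member whose local n-gram context is most frequent."""
    for cs in CONFUSION_SETS:
        if candidate in cs:
            return max(cs, key=lambda w: counts[(prev_word, w)] + counts[(w, next_word)]
                                         + 0.001 * unigrams[w])   # tiny unigram back-off
    return candidate

print(disambiguate("want", "two", "go"))   # -> 'to'
```

With more training text, the counts simply get more reliable, which is exactly the point of the lesson.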

Lesson 2: Och, ACL-02
Best MT system in the world (Arabic → English, by BLEU and NIST, 2002–2005): Och's work
Method: learn n-gram correspondence patterns (alignment templates) using MaxEnt (a log-linear translation model), trained to maximize the BLEU score
Approximately: EBMT + Viterbi search
Lesson: the more you store, the better your MT
You don't have to be smart, you just need enough storage
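A minimal sketch of the log-linear combination at the heart of such models; the feature names, weights, and values below are invented placeholders, not Och's actual model.

```python
def log_linear_score(weights, feature_values):
    """score(e | f) = sum_i lambda_i * h_i(e, f); decoding picks the argmax e."""
    return sum(weights[name] * value for name, value in feature_values.items())

# Invented example features for one candidate translation e of a source sentence f.
weights = {"log_p_lm": 0.5, "log_p_tm": 0.3, "length_penalty": -0.2}
candidate = {"log_p_lm": -12.4, "log_p_tm": -9.1, "length_penalty": 8.0}

print(log_linear_score(weights, candidate))
# In training, the weights themselves are tuned to maximize BLEU on a
# development set, as the slide above describes.
```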

Storage needs
Unigram translation table: bilingual dictionary – 200K words each side (2 MB if each word is 10 chars)
Bigram translation table (every bigram):
– Lexicon: 200K ≈ 2^18 words
– Table entries: [200K × 200K word pairs + translations] ≈ 4 × 10^10 entries
– Each entry: 4 words × 18 bits ≈ 9 bytes
– 4 × 10^10 entries × 9 bytes ≈ 3.6 × 10^11 bytes ≈ 0.4 TB (under $1000 at today's prices!)
Trigram translation table (every trigram): on the order of 10^4–10^5 TB – ready in 2008?
Better: store only attested n-grams (up to 5? 7? 9?), fall back to shorter ones when not in the table… Carbonell et al.'s MT system stores all 8-grams of English
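The arithmetic above can be checked directly. This small sketch uses the vocabulary size and 9-byte entry size from the slide; extending the same entry size to the trigram table is an assumption the slide does not state.

```python
VOCAB = 200_000            # words per side, from the slide
ENTRY_BITS = 4 * 18        # 4 words at 18 bits each
ENTRY_BYTES = ENTRY_BITS / 8   # = 9 bytes

bigram_entries = VOCAB ** 2
bigram_bytes = bigram_entries * ENTRY_BYTES
print(f"bigram table: {bigram_bytes / 1e12:.2f} TB")    # ~0.36 TB

# Assuming the same 9-byte entries for a full trigram table (an assumption,
# not a figure from the slide), the size explodes by several orders of magnitude:
trigram_entries = VOCAB ** 3
print(f"trigram table: {trigram_entries * ENTRY_BYTES / 1e12:.0f} TB")  # ~72,000 TB
```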

Lesson 3: Fleischman and Hovy, ACL-03
Text mining: classify locations and people in free-form text into fine-grained classes
– Simple appositive IE patterns (Quarterback_ROLE Joe Smith_PER)
– 2+ million examples, collapsed into 1 million instances (avg: 2 mentions/instance, 40+ for George W. Bush)
Test: QA on "Who is X?":
– 100 questions from AskJeeves
– System 1: table of instances
– System 2: ISI's TextMap QA system
– The table system scored 25% better
– Over half of the questions that TextMap got wrong could have benefited from information in the concept-instance pairs
– This method took 10 seconds; TextMap took ~9 hours
You don't have to reason, you just need to collect the knowledge beforehand
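A highly simplified sketch of mining such concept-instance pairs with an appositive pattern. The regex, example sentences, and function below are invented for illustration; the actual system used POS tags and NE labels rather than raw regular expressions.

```python
import re
from collections import defaultdict

# Hypothetical pattern: a capitalized role/title immediately preceding a
# capitalized two-word person name ("Quarterback Joe Smith").
PATTERN = re.compile(r"\b([A-Z][a-z]+(?: [a-z]+)*) ([A-Z][a-z]+ [A-Z][a-z]+)\b")

def mine_concept_instances(sentences):
    """Collect person -> roles concept-instance pairs from raw text."""
    table = defaultdict(set)
    for s in sentences:
        for role, person in PATTERN.findall(s):
            table[person].add(role)
    return table

docs = ["Quarterback Joe Smith threw for 300 yards.",
        "Senator Jane Doe spoke on Tuesday."]
table = mine_concept_instances(docs)
print(table["Joe Smith"])   # answering "Who is Joe Smith?" becomes a table lookup
```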

Lesson 4: Chiang et al., HLT-09
"11,001 New Features for Statistical MT." David Chiang, Kevin Knight, Wei Wang. Proc. NAACL HLT 2009. Best paper award
Learn English–Chinese MT rules: NP-C(x0:NPB PP(IN(of) x1:NPB)) → x1 de x0
Featurize everything:
– Several hundred count features: reward frequent rules; punish rules that overlap; punish rules that insert "is", "the", etc. into the English
– 10,000 word context features: for each triple (f, e, f+1), a feature that counts the number of times that f is aligned to e and f+1 occurs to the right of f; and similarly for triples (f, e, f−1) with f−1 occurring to the left of f. Restrict words to the 100 most frequent in the training data
You don't have to know anything, you just need enough features
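To make the word-context features concrete, here is a minimal sketch of how such counts could be collected from word-aligned sentence pairs. The function, toy sentence pair, and alignment are invented for illustration; this is not Chiang et al.'s code.

```python
from collections import Counter

def context_features(f_sent, e_sent, alignment, frequent_words):
    """Count (f, e, f+1) and (f, e, f-1) triples over word alignments,
    restricted to the most frequent words, as the slide describes."""
    feats = Counter()
    for i, j in alignment:                      # f_sent[i] is aligned to e_sent[j]
        f, e = f_sent[i], e_sent[j]
        if f not in frequent_words or e not in frequent_words:
            continue
        if i + 1 < len(f_sent) and f_sent[i + 1] in frequent_words:
            feats[(f, e, f_sent[i + 1], "right")] += 1
        if i - 1 >= 0 and f_sent[i - 1] in frequent_words:
            feats[(f, e, f_sent[i - 1], "left")] += 1
    return feats

# Invented toy example: one sentence pair with alignment links (source index, target index).
f_sent = ["das", "haus", "ist", "klein"]
e_sent = ["the", "house", "is", "small"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(context_features(f_sent, e_sent, alignment, set(f_sent) | set(e_sent)))
```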

Four lessons
You don't need a smart algorithm, you just need enough training data
You don't have to be smart, you just need enough memory
You don't have to be smart, you just need to collect the knowledge beforehand
You don't have to be smart, you just need enough features
Conclusion: the web has all you need, memory gets cheaper, computers get faster…
We are moving to a new world: NLP as table lookup; copy features from everyone

So what?

Performance ceilings
Reliable surface-level preprocessing (POS tagging, word segmentation, NE extraction, etc.): 94%+
Shallow syntactic parsing: 93%+ for English (Charniak, Stanford, Lin) and deeper analysis (Hermjakob)
IE: ~0.4–0.7 F-score for easy topics (MUC, ACE)
Speech: ~80% word-correct rate (large vocabulary); 20%+ (open vocabulary, noisy input)
IR: 0.45–0.6 F-score (TREC)
MT: ~70%, depending on what you measure
Summarization: ? (~0.6 F-score for extracts; DUC, TAC)
QA: ? (~60% for factoids; TREC)

Why we're stuck
Just need better learning algorithms? New algorithms do do better, but only asymptotically
More data? Even Google with all its data can't crack MT
Better and deeper representations / features? The best MT now uses syntax; the best QA uses inference
A need for semantics, discourse, pragmatics…?

The danger of steroids
In NLP, we have grown lazy: when we asymptote toward a performance ceiling, we don't think, we just look for the next sexy ML algorithm

What have we learned about NLP?
Most NLP is notation transformation:
– (Eng) sentence → (Chi) sentence (MT)
– (Eng) string → parse tree → frame
– case frame → (Eng) string (NLG)
– sound waves → text string (ASR)
– long text → short text (Summ, QA)
…with some information added:
– Labels: POS, syntactic, semantic, other
– Brackets
– Other associated docs
FIRST, you need theorizing: designing the types, notation, and model (level and formalism)
And THEN you need engineering: selecting and tuning the learning machinery for performance, in a (rapid) build-evaluate-build cycle

A hierarchy of transformations
Analysis climbs from direct replacement toward deeper levels; generation comes back down:
– Direct: simple replacement
– Small changes: demorphing, etc.
– Adding info: POS tags, etc.
– Mid-level changes: syntax
– Adding more: semantic features
– Shallow semantics: frames
– Deep semantics: ?
Transformations at the abstract level: filter, match parts, etc.
Some transforms are deeper than others
Each layer of abstraction defines classes/types of behavioral regularity
These types solve the data sparseness problem

More phenomena of semantics
Somewhat easier:
– Word sense selection (incl. copula)
– NP structure: genitives, modifiers…
– Entity identification and coreference
– Pronoun classification (referential, bound, event, generic, other)
– Temporal relations (incl. discourse and aspect)
– Manner relations
– Spatial relations
– Comparatives
– Quotation and reported speech
– Opinions and other judgments
– Event identification and coreference
– Bracketing (scope) of predications
More difficult / deeper:
– Quantifier phrases and numerical expressions
– Concept structure (incl. frames and thematic roles)
– Coordination
– Information structure (theme/rheme, focus)
– Discourse structure
– Modals and other adverbials (epistemic modals, evidentials)
– Concepts: ontology definition
– Pragmatics and speech acts
– Polarity/negation
– Presuppositions
– Metaphors

The better and more refined the representation levels we introduce, the better the quality of the output… and the more challenges and opportunities for machine learning

So, what to do?
Some NLP people need to kick the habit: no more steroids, just hard thought
Other NLP people can continue to play around with algorithms
For them, you machine learning guys are the pushers and the pimps!

So you may be happy with this, but I am not…
I want to understand what's going on in language and thought
We have no theory of language, or even of language processing, in NLP
Chasing after another algorithm that will be hot for 2 or 4 years is not really productive
How can one inject understanding?

The role of corpus creation
Basic methodological assumptions of NLP:
– Statistical NLP: the process is (somewhat) nondeterministic; probabilities predict the likelihood of products
– Underlying assumption: as long as annotator consistency can be achieved, there is systematicity, and systems will learn to find it
Theory creation (and testing!) through corpus annotation:
– But we (still) have to manually identify generalizations (= equivalence classes of individual instances of phenomena) to obtain expressive generality/power
– This is the theory
– (And we need to understand how to do annotation properly)
A corpus lasts 20 years; an algorithm lasts 3 years

Annotation!
1. Preparation – choose the corpus; build the interfaces (which corpus? interface design issues)
2. Instantiating the theory – create the annotation choices; test-run them for stability (how to remain true to the theory?)
3. Annotation – annotate; reconcile among annotators (how many annotators? which procedure?)
4. Validation – measure inter-annotator agreement; possibly adjust the theory instantiation (which measures?)
5. Delivery – wrap the result
All of this is annotation science
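One concrete piece of step 4, measuring inter-annotator agreement: a minimal sketch of Cohen's kappa for two annotators. The label set and toy annotations below are invented; the slide does not prescribe a particular agreement measure.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected)

# Two annotators labeling the same ten pronouns (invented toy data).
a = ["ref", "ref", "bound", "event", "ref", "generic", "ref", "bound", "ref", "event"]
b = ["ref", "bound", "bound", "event", "ref", "generic", "ref", "ref", "ref", "event"]
print(cohens_kappa(a, b))   # 1.0 = perfect agreement, 0.0 = chance level
```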

A fruitful cycle
Each part of the cycle influences the others, and different people like different kinds of work:
– Analysis, theorizing, annotation (linguists, psycholinguists, cognitive linguists…) produce an annotated corpus
– Machine learning of transformations (current NLP researchers) turns the corpus into an automated creation method
– Storage in large tables, optimization, commercialization (NLP companies) put it to use
– Evaluation exposes problems (low performance), which feed back into analysis and theorizing

How can you ML guys help?
Don't: give us yet another cool algorithm (all they do is another feature-based clustering)
Do: help us build corpora; tell us when to use which algorithm

It's all about features
Feature design:
– Traditionally the concern of the domain and task expert
Feature ranking & selection:
– The traditional main focus of ML

What would be really helpful
Input: the dimensions of choice, for each new problem
– Training data: values (numerical/continuous or categorical); skewedness (dependency on balanced/representative samples); granularity (delicacy of differentiation in feature space)
– Training session: storage and processing requirements; speed of convergence
Output: expected accuracies, for different amounts of training data, for each learning algorithm
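What that "output" might look like in practice: a small sketch that reports expected accuracy at different training-set sizes for a few off-the-shelf learners. scikit-learn is assumed, the 20 newsgroups task and classifier choices are placeholders, and the dataset is downloaded on first use; this is an illustration, not the tool the slide asks for.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import learning_curve

# Placeholder task: 2-class text categorization.
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TfidfVectorizer(max_features=5000).fit_transform(data.data)
y = data.target

for clf in (MultinomialNB(), LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    sizes, _, test_scores = learning_curve(clf, X, y,
                                           train_sizes=np.linspace(0.1, 1.0, 5), cv=3)
    # Mean cross-validated accuracy at each training-set size.
    print(type(clf).__name__, [f"{s:.2f}" for s in test_scores.mean(axis=1)])
```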

Help me…
Help me know what algorithm to use
Help me recognize when something is seriously amiss with the algorithm, not just with my data… then I can contact you
Help my students kick the steroid habit and learn the value of thinking!

Thank you! NO PIMPS

Some readings
Feature design:
– Rich Caruana and Alexandru Niculescu-Mizil. 2006. An Empirical Comparison of Supervised Learning Algorithms. Proceedings of the 23rd International Conference on Machine Learning (ICML '06).
– Alexandru Niculescu-Mizil and Rich Caruana. 2005. Predicting Good Probabilities With Supervised Learning. Proceedings of the 22nd International Conference on Machine Learning (ICML '05). Distinguished student paper award.
– Rich Caruana and Alexandru Niculescu-Mizil. 2004. Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria. Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining (KDD '04).
– David B. Skalak, Alexandru Niculescu-Mizil, and Rich Caruana. 2007. Classifier Loss under Metric Uncertainty. Proceedings of the 18th European Conference on Machine Learning (ECML '07).