Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011.


Roadmap Course Overview Tokenization Homework #1

Course Overview

Course Information Course web page: Syllabus: Schedule and readings Links to other readings, slides, links to class recordings Slides posted before class, but may be revised Catalyst tools: GoPost discussion board for class issues CollectIt Dropbox for homework submission and TA comments Gradebook for viewing all grades

GoPost Discussion Board Main venue for course-related questions, discussion What not to post: Personal, confidential questions; Homework solutions What to post: Almost anything else course-related Can someone explain…? Is this really supposed to take this long to run? Key location for class participation Post questions or answers Your discussion space: Sanghoun & I will not jump in often

GoPost Emily’s 5-minute rule: If you’ve been stuck on a problem for more than 5 minutes, post to the GoPost! Mechanics: Please use your UW NetID as your user id Please post early and often! Don’t wait until the last minute Notifications: Decide how you want to receive GoPost postings

Email Should be used only for personal or confidential issues Grading issues, extended absences, other problems General questions/comments go on GoPost Please send from your UW account Include Ling570 in the subject If you don’t receive a reply in 24 hours, please follow up

Homework Submission All homework should be submitted through CollectIt: tar cvf hw1.tar hw1_dir Homework due 11:45 Wednesdays Late homework receives 10%/day penalty (incremental) Most major programming languages accepted C/C++/C#, Java, Python, Perl, Ruby If you want to use something else, please check first Please follow naming, organization guidelines in HW Expect to spend hours/week, including HW docs

Grading Assignments: 90% Class participation: 10% No midterm or final exams Grades in Catalyst Gradebook TA feedback returned through CollectIt Incomplete: only if all work is completed up to the last two weeks, per UW policy

Recordings All classes will be recorded Links to recordings appear in syllabus Available to all students, DL and in class Please remind me to: Record the meeting (look for the red dot) Repeat in-class questions Note: Instructor’s screen is projected in class Assume that chat window is always public

Contact Info Gina: Office hour: Fridays: (before Treehouse meeting) Location: Padelford B-201 Or by arrangement Available by Skype or Adobe Connect All DL students should arrange a short online meeting TA: Sanghoun Song: Office hour: Time: TBD, see GoPost Location:

Online Option Please check that you are registered for the correct section CLMS online: Section A State-funded: Section B CLMS in-class: Section C NLT/SCE online (or in-class): Section D Online attendance for in-class students Not more than 3 times per term (e.g. missed bus, ice) Please enter the meeting room 5-10 minutes before the start of class Try to stay online throughout class

Online Tip If you see: You are not logged into Connect. The problem is one of the following: the permissions on the resource you are trying to access are incorrectly set. Please contact your instructor/Meeting Host/etc. you do not have a Connect account but need to have one. For UWEO students: If you have just created your UW NetID or just enrolled in a course ….. Clear your cache, close and restart your browser

Course Description

Course Prerequisites Programming Languages: Java/C++/Python/Perl/… Operating Systems: Basic Unix/Linux CS 326 (Data structures) or equivalent Lists, trees, queues, stacks, hash tables, … Sorting, searching, dynamic programming, … Automata, regular expressions, … Stat 391 (Probability and statistics): random variables, conditional probability, Bayes’ rule, …

Textbook Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition, 2008 Available from UW Bookstore, Amazon, etc. Reference: Manning and Schütze, Foundations of Statistical Natural Language Processing

Topics in Ling570 Unit #1: Formal Languages and Automata (2-3 weeks) Formal languages Finite-state Automata Finite-state Transducers Morphological analysis Unit #2: Ngram Language Models and HMMs Ngram Language Models and Smoothing Part-of-speech (POS) tagging: HMM and n-gram taggers

Topics in Ling570 Unit #3: Classification (2-3 weeks) Intro to classification POS tagging with classifiers Chunking Named Entity (NE) recognition Other topics (2 weeks) Intro, tokenization Clustering Information Extraction Summary

Roadmap Motivation: Applications Language and Thought Knowledge of Language Cross-cutting themes Ambiguity, Evaluation, & Multi-linguality Course Overview

Motivation: Applications Applications of Speech and Language Processing Call routing Information retrieval Question-answering Machine translation Dialog systems Spam tagging Spell-, Grammar- checking Sentiment Analysis Information extraction….

Shallow vs Deep Processing Shallow processing (Ling 570) Usually relies on surface forms (e.g., words) Less elaborate linguistic representations E.g. Part-of-speech tagging; Morphology; Chunking Deep processing (Ling 571) Relies on more elaborate linguistic representations Deep syntactic analysis (Parsing) Rich spoken language understanding (NLU)

Shallow or Deep? Applications of Speech and Language Processing Call routing Information retrieval Question-answering Machine translation Dialog systems Spam tagging Spell-, Grammar- checking Sentiment Analysis Information extraction….

Building on Many Fields Linguistics: Morphology, phonology, syntax, semantics,.. Psychology: Reasoning, mental representations Formal logic Philosophy (of language) Theory of Computation: Automata,.. Artificial Intelligence: Search, Reasoning, Knowledge representation, Machine learning, Pattern matching Probability..

Language & Intelligence Turing Test (1950): Operationalize intelligence Two contestants: human, computer Judge: human Test: Interact via text questions Question: Can you tell which contestant is human? Crucially requires language use and understanding

Limitations of Turing Test ELIZA (Weizenbaum 1966) Simulates Rogerian therapist User: You are like my father in some ways ELIZA: WHAT RESEMBLANCE DO YOU SEE User: You are not very aggressive ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE... Passes the Turing Test!! (sort of) “You can fool some of the people....” Simple pattern matching technique Very shallow processing
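That "simple pattern matching technique" can be sketched in a few lines of Python. This is an illustrative toy only, not Weizenbaum's actual rule set: the real ELIZA used a large script of ranked decomposition/reassembly templates (and also rewrites words, e.g. dropping "very"):

```python
import re

def eliza_reply(utterance):
    """Toy ELIZA-style reflection via regex pattern matching.

    Very shallow processing: no understanding, just matching
    "you are ..." patterns and echoing the captured text back.
    """
    # More specific pattern first: "you are like X"
    m = re.search(r"you are like (.+)", utterance, re.IGNORECASE)
    if m:
        return "WHAT RESEMBLANCE DO YOU SEE"
    # General pattern: "you are X" -> reflect X back
    m = re.search(r"you are (.+)", utterance, re.IGNORECASE)
    if m:
        return "WHAT MAKES YOU THINK I AM " + m.group(1).upper()
    # Fallback when nothing matches
    return "PLEASE GO ON"

print(eliza_reply("You are like my father in some ways"))
# WHAT RESEMBLANCE DO YOU SEE
print(eliza_reply("You are not very aggressive"))
# WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
```

Note the second reply keeps "very", where the slide's ELIZA transcript drops it; handling that would need the word-substitution rules of the real script.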

Turing Test Revived “On the web, no one knows you’re a….” Problem: ‘bots’ Automated agents swamp services Challenge: Prove you’re human Test: Something human can do, ‘bot can’t Solution: CAPTCHAs Distorted images: trivial for human; hard for ‘bot Key: Perception, not reasoning

Knowledge of Language What does HAL (of 2001: A Space Odyssey) need to know to converse? Dave: Open the pod bay doors, HAL. HAL: I'm sorry, Dave. I'm afraid I can't do that.

Phonetics & Phonology (Ling 450/550) Sounds of a language, acoustics Legal sound sequences in words

Morphology (Ling 570) Recognize, produce variation in word forms Singular vs. plural: door + sg -> door; door + pl -> doors Verb inflection: be + 1st person, sg, present -> am

Part-of-speech tagging (Ling 570) Identify word use in sentence Bay (Noun) -- not verb or adjective

Syntax (Ling 566: analysis; Ling 570: chunking; Ling 571: parsing) Order and group words in sentence I’m I do, sorry that afraid Dave I can’t.

Semantics (Ling 571) Word meaning: individual (lexical), combined (compositional) ‘Open’: AGENT cause THEME to become open; ‘pod bay doors’: (pod bay) doors

Pragmatics/Discourse/Dialogue (Ling 571, maybe) Interpret utterances in context Speech act: Dave’s line is a request; HAL’s reply is a statement Reference resolution: I = HAL; that = ‘open doors’ Politeness: I’m sorry, I’m afraid I can’t

Language Processing Pipeline Shallow Processing Deep Processing

Cross-cutting Themes Ambiguity How can we select among alternative analyses? Evaluation How well does this approach perform: On a standard data set? When incorporated into a full system? Multi-linguality Can we apply this approach to other languages? How much do we have to modify it to do so?

Ambiguity “I made her duck” Means.... I caused her to duck down I made the (carved) duck she has I cooked duck for her I cooked the duck she owned I magically turned her into a duck

Ambiguity: POS “I made her duck” The readings turn on part of speech: duck can be V (duck down) or N (the bird or carving); her can be Pron (cooked duck for her) or Poss (her duck)

Ambiguity: Syntax “I made her duck” Means.... I made the (carved) duck she has: (VP (V made) (NP (POSS her) (N duck))) I cooked duck for her: (VP (V made) (NP (PRON her)) (NP (N duck)))

Ambiguity Pervasive Pernicious Particularly challenging for computational systems Problem we will return to again and again in class

Tokenization Given input text, split into words or sentences Tokens: words, numbers, punctuation Example input: Sherwood said reaction has been "very positive." Tokenized: Sherwood said reaction has been " very positive . " Why tokenize? Identify basic units for downstream processing

Tokenization Proposal 1: Split on white space Good enough? No Why not? Multi-linguality: Languages without white space delimiters: Chinese, Japanese Agglutinative languages (Finnish) Compounding languages (German) E.g. Lebensversicherungsgesellschaftsangestellter “Life insurance company employee” Even with English, misses punctuation
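Proposal 1 really is a one-liner, and running it makes the limits concrete (the quoted sentence reuses the Sherwood example from the tokenization slide; the Japanese string is a stand-in for any language without whitespace delimiters):

```python
# Proposal 1 as code: split on whitespace. Trivial, but punctuation
# sticks to adjacent words, and text without space delimiters
# comes back as one giant token.
def ws_tokenize(text):
    return text.split()

print(ws_tokenize('Sherwood said reaction has been "very positive."'))
# ['Sherwood', 'said', 'reaction', 'has', 'been', '"very', 'positive."']

print(ws_tokenize("生命保険会社"))  # no spaces to split on: a single token
```

The quote and the final period stay glued to their words, which is exactly the punctuation problem the next slide raises.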

Tokenization - again Proposal 2: Split on white space and punctuation For English Good enough? No Problems: Splitting at non-splitting punctuation: 1.23 → 1 . 23 ✗; 1,23 → 1 , 23 ✗; don’t → don ’ t ✗; e-mail → e - mail ✗; etc. Problems: non-splitting whitespace Names: New York; Collocations: pick up What’s a word?
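A regex sketch patches the worst of these cases: regex alternatives are tried left to right, so decimals and internal apostrophes/hyphens are claimed before the catch-all single-punctuation case. This is an illustrative sketch, not the homework solution:

```python
import re

# One alternative per token class, tried in order: internal-punctuation
# numbers stay whole, then words (incl. contractions and hyphenations),
# then any leftover single non-space character.
TOKEN_RE = re.compile(r"""
    \d+[.,]\d+          # numbers with internal punctuation: 1.23, 1,23
  | \w+(?:[-']\w+)*     # words, incl. don't and e-mail
  | [^\w\s]             # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    # findall skips characters (like spaces) that match no alternative
    return TOKEN_RE.findall(text)

print(tokenize("I paid $1.23 but don't regret it."))
# ['I', 'paid', '$', '1.23', 'but', "don't", 'regret', 'it', '.']
```

The non-splitting-whitespace problem remains: New York and pick up still come out as two tokens each, which no purely character-level rule can fix.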

Sentence Segmentation Similar issues Proposal: Split on ‘.’ Problems? Other punctuation: ?, !, etc. Non-boundary periods: 1.23 Mr. Sherwood m.p.g. … Solutions? Rules, dictionaries (esp. abbreviations) Marked-up text + machine learning
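The rules-plus-dictionary idea can be sketched as follows. ABBREVS is a tiny illustrative list; a real system needs a much larger dictionary, or boundaries learned from marked-up text as the slide suggests:

```python
# Toy abbreviation dictionary (lowercased); real systems use far more.
ABBREVS = {"mr.", "mrs.", "dr.", "m.p.g.", "e.g.", "i.e."}

def split_sentences(text):
    """Rule-based splitter: a token-final '.', '!' or '?' ends a
    sentence unless the token is a known abbreviation. Internal
    periods (as in 1.23) never end a whitespace token, so they
    never trigger the check."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token.lower() not in ABBREVS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Sherwood got 30.5 m.p.g. on the highway. Impressive!"))
# ['Mr. Sherwood got 30.5 m.p.g. on the highway.', 'Impressive!']
```

Even this sketch shows why dictionaries only go so far: an abbreviation at a genuine sentence end (…got 30 m.p.g. The next car…) is still missed.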

Homework #1

General Notes and Naming All code must run on the CL cluster under Condor Each programming question needs a corresponding Condor command file, named e.g. hw1_q1.cmd, and should run with: $ condor_submit hw1_q1.cmd General comments should appear in hw1.(pdf|txt) Your code may be tested on new data

Q1-Q2: Building and Testing a Rule-Based English Tokenizer Q1: eng_tokenize.* Condor file must contain Input = Output = You may use additional arguments The i-th input line should correspond to the i-th output line Don’t waste too much time making it perfect!! Q2: Explain how your tokenizer handles different phenomena – numbers, hyphenated words, etc. – and identify remaining problems.

Q3: make_vocab.* Given the input text: The teacher bought the bike. The bike is expensive. The output should be:
The 2
the 1
teacher 1
bought 1
bike. 1
bike 1
is 1
expensive. 1
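A minimal sketch of what make_vocab might do over untokenized text (a Counter of whitespace tokens; the actual output ordering and format are whatever the assignment spec requires). Note how The/the and bike./bike count as separate vocabulary entries, which is exactly the effect Q4 asks you to investigate:

```python
from collections import Counter

def make_vocab(text):
    # Count whitespace-separated tokens; with no tokenization,
    # 'bike.' and 'bike' are distinct entries, as are 'The'/'the'.
    return Counter(text.split())

counts = make_vocab("The teacher bought the bike. The bike is expensive.")
for word, n in counts.items():
    print(word, n)
```

Running the same counting after Q1's tokenizer (and perhaps lowercasing) merges those entries, which is the comparison Q4 is after.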

Q4: Investigating Tokenization Run programs from Q1, Q3 Compare vocabularies with and without tokenization

Next Time Probability Formal languages Finite-state automata