
Fall 2001 EE669: Natural Language Processing. Lecture 4: Corpus-Based Work (Chapter 4 of Manning and Schütze). Wen-Hsiang Lu (盧文祥), Department of Computer Science and Information Engineering, National Cheng Kung University. 2008/10/13. (Slides from Dr. Mary P. Harper.)

What is a Corpus?
(1) A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse.
(2) In linguistics and lexicography, a body of texts, utterances, or other specimens considered more or less representative of a language, and usually stored as an electronic database. Currently, computer corpora may store many millions of running words, whose features can be analyzed by means of tagging and the use of concordancing programs. [from The Oxford Companion to the English Language, ed. McArthur & McArthur 1992]

Corpus-Based Work
Text corpora are usually big, often representative samples of some population of interest. For example, the Brown Corpus collected by Kučera and Francis was designed as a representative sample of written American English. Balance of subtypes (e.g., genre) is often desired.
Corpus work involves collecting a large number of counts from corpora that need to be accessed quickly.
There exists some software for processing corpora (see useful links on the course homepage).

Taxonomies of Corpora
Media: printed, electronic text, digitized audio, video, OCR text, etc.
Raw (plain text) vs. annotated (use a markup scheme to add codes to the file, e.g., part-of-speech tags).
Language variables:
– monolingual vs. multilingual
– original vs. translation

Major Suppliers of Corpora
Linguistic Data Consortium (LDC)
European Language Resources Association (ELRA)
Oxford Text Archive (OTA)
Child Language Data Exchange System (CHILDES)
International Computer Archive of Modern English (ICAME)


Software
Text editors: e.g., emacs.
Regular expressions: to identify patterns in text (equivalent to a finite-state machine; can process text in linear time).
Programming languages: C, C++, Java, Perl, Prolog, etc.
Programming techniques:
– Data structures like hash tables are useful for mapping words to numbers.
– Need counts to calculate probabilities (two-pass approach: emit tokens first, then count them later, e.g., the CMU-Cambridge Statistical Language Modeling toolkit).
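The hash-table-for-counting idea above can be sketched in a few lines; this is an illustrative example (the tokenizer here is just whitespace splitting, not a full tokenizer):

```python
from collections import Counter

def count_tokens(text):
    """Two-pass counting: first emit tokens, then tally them
    in a hash table (Python's dict-based Counter)."""
    tokens = text.lower().split()   # pass 1: emit tokens
    return Counter(tokens)          # pass 2: count occurrences

counts = count_tokens("the cat sat on the mat")
# counts["the"] == 2, counts["cat"] == 1
```

From such counts, relative-frequency probability estimates are a single division away.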

Challenges for Corpus Building
Low-level formatting issues: dealing with junk and case.
What is a word? Tokenization.
To stem or not to stem? tokenization → token (or maybe toke).
What is a sentence, and how can we detect sentence boundaries?

Low-Level Formatting Issues
Junk formatting/content: examples include document headers and separators, typesetter codes, tables and diagrams, and garbled data in the file. Problems arise if the data was obtained using OCR (unrecognized words). May need to remove junk content before any processing begins.
Uppercase and lowercase: should we keep the case or not? The, the, and THE should all be treated as the same token, but White in George White and white in white snow should be treated as distinct tokens. What about sentence-initial capitalization (to downcase or not to downcase)?

Tokenization: What is a Word?
Early in processing, we must divide the input text into meaningful units called tokens (e.g., words, numbers, punctuation).
Tokenization is the process of breaking input from a text character stream into tokens to be normalized and saved (see Sampson's 1995 book English for the Computer, Oxford University Press, for a carefully designed and tested set of tokenization rules).
A graphic word token (Kučera and Francis):
– A string of contiguous alphanumeric characters with space on either side, which may include hyphens and apostrophes, but no other punctuation marks.
– Problems: Microsoft or :-)
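The graphic-word definition above translates almost directly into a regular expression; a minimal sketch (note that, as the slide warns, it drops tokens like the smiley ":-)" entirely):

```python
import re

# "Graphic word" per Kucera & Francis: contiguous alphanumeric runs,
# optionally joined by internal hyphens or apostrophes.
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def tokenize(text):
    return GRAPHIC_WORD.findall(text)

tokenize("The dog's so-called friend")
# → ["The", "dog's", "so-called", "friend"]
tokenize(":-)")
# → []  (pure punctuation is lost)
```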

Some of the Problems: Periods
Words are not always separated from other tokens by white space. For example, periods may signal an abbreviation (do not separate) or the end of a sentence (separate?).
– Abbreviations (haplology): etc. St. Dr.; a single capital followed by a period, e.g., A. B. C.; a sequence of letter-period-letter-period's such as U.S., m.p.h.; Mt. St. Wash.
– End of sentence? I live on Burt St.

Some of the Problems: Apostrophes
How should contractions and clitics be regarded? One token or two?
– I'll or I 'll
– The dog's food or The dog 's food
– The boys' club
From the perspective of parsing, I'll needs to be separated into two tokens because there is no category that combines nouns and verbs together.
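Splitting a clitic off its host is usually done with a short suffix-rule list; a toy sketch in the Penn-Treebank-style convention (real tokenizers use much longer rule lists and handle exceptions):

```python
import re

# Split common English clitics off the host word, PTB-style,
# so "I'll" → I + 'll and "can't" → ca + n't.
CLITIC = re.compile(r"(.+?)('ll|'re|'ve|n't|'s|'d|'m)$")

def split_clitics(token):
    m = CLITIC.match(token)
    return [m.group(1), m.group(2)] if m else [token]

split_clitics("I'll")   # → ["I", "'ll"]
split_clitics("dog's")  # → ["dog", "'s"]
```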

Some of the Problems: Hyphens
How should we deal with hyphens? Are hyphenated words comprised of one token or multiple tokens? Usage:
1. Typographical: to improve the right margin of a document. Typically these hyphens should be removed, since breaks occur at syllable boundaries; however, the hyphen may be part of the word too.
2. Lexical hyphens: inserted before or after small word formatives (e.g., co-operate, so-called, pro-university).
3. Word grouping: take-it-or-leave-it, once-in-a-lifetime, text-based, etc.
How many lexemes will you allow?
– Data base, data-base, database
– Cooperate, co-operate
– Mark-up, mark up

Some of the Problems: Hyphens (cont.)
Authors may not be consistent with hyphenation; e.g., cooperate and co-operate may appear in the same document.
Dashes can be used as punctuation without separating them from words with space: I am happy-Bill is not.

Different Formats in Text Pattern

Some of the Problems: Homographs
In some cases, lexemes have overlapping forms (homographs), as in:
– I saw the dog.
– When you saw the wood, please wear safety goggles.
– The saw is sharp.
These forms will need to be distinguished for part-of-speech tagging.

Some of the Problems: No Spaces between Words
There are no separators between words in languages like Chinese, so English tokenization methods are irrelevant. (Example sentence: Waterloo is located in the south of Canada.)
Compounds in German: Lebensversicherungsgesellschaftsangestellter (life insurance company employee).
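For languages written without spaces, a classic baseline is greedy maximum matching against a lexicon: repeatedly take the longest dictionary word starting at the current position. A minimal sketch (the lexicon here is illustrative, and real segmenters use statistical models precisely because greedy matching makes mistakes):

```python
def max_match(text, lexicon):
    """Greedy longest-first (maximum-matching) segmentation."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):       # try longest span first
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])        # fall back to 1 char
                i = j
                break
    return tokens

lex = {"the", "table", "down", "there"}
max_match("thetabledownthere", lex)
# → ["the", "table", "down", "there"]
```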

Some of the Problems: Spaces within Words
Sometimes spaces occur in the middle of something that we would prefer to call a single token:
– Phone numbers
– Names: Mr. John Smith, New York, U. S. A.
– Verb plus particle: work out, make up

Some of the Problems: Multiple Formats
Numbers (format plus ambiguous separator):
– English: 123,456.78 → [0-9](([0-9]+[,])*)([.][0-9]+)
– French: 123 456,78 → [0-9](([0-9]+[ ])*)([,][0-9]+)
There are also multiple formats for:
– Dates
– Phone numbers
– Addresses
– Names
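The two number conventions can be captured with a pair of regular expressions; a sketch (these patterns are my reconstruction of the slide's idea, insisting on strict three-digit grouping):

```python
import re

# English: comma groups thousands, period is the decimal point.
# French:  space groups thousands, comma is the decimal point.
EN_NUMBER = re.compile(r"\d{1,3}(?:,\d{3})*(?:\.\d+)?$")
FR_NUMBER = re.compile(r"\d{1,3}(?: \d{3})*(?:,\d+)?$")

EN_NUMBER.match("123,456.78")   # matches
FR_NUMBER.match("123 456,78")   # matches
```

The ambiguity the slide points at is visible here: "123,456" is a single English number but a French number followed by a decimal fraction, so the separator alone cannot decide.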

Morphology: What Should I Put in My Dictionary?
Should all word forms be stored in the lexicon? Probably OK for English (little morphology) but not for Czech or German (lots of forms!).
Stemming: strip off affixes and leave the stem (lemma).
– Not that helpful in English (from an IR point of view).
– Perhaps more useful for other languages or in other contexts.
Treating multi-word tokens as a single word token can help.
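Affix stripping can be as simple as a suffix list tried longest-first; a toy sketch (the suffix list is illustrative, and real stemmers such as Porter's use cascaded, condition-guarded rules):

```python
# Crude longest-first suffix stripping; the length check keeps
# very short residues like "s" → "" from being produced.
SUFFIXES = ("ization", "ation", "ness", "ing", "es", "s")

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

stem("tokenization")  # → "token"
stem("walking")       # → "walk"
```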

What is a Sentence?
Something ending with a '.', '?' or '!'. True in 90% of the cases.
– Sentences may be split up by other punctuation marks (e.g., : ; --).
– Sentences may be broken up, as in: "You should be here," she said, "before I know it!"
– Quote marks may be at the very end of the sentence.
Identifying sentence boundaries can involve hand-coded heuristic methods. Some effort to automate the sentence-boundary process has also been made.

Heuristic Algorithm
Place putative sentence boundaries after all occurrences of . ? !
Move the boundary after following quotation marks, if any.
Disqualify a period boundary in the following circumstances:
– If it is preceded by a known abbreviation of a sort that does not normally occur word-finally but is commonly followed by a capitalized proper name, such as Prof. or vs.

Fall 2001EE669: Natural Language Processing23 –If it is preceded by a known abbreviation and not followed by an uppercase word. This will deal correctly with most usage of abbreviations like etc. or Jr. which can occur sentence medially or finally. Disqualify a boundary with a ? or ! If: –It is followed by a lowercase letter (or a known name) Regard other putative sentence boundaries as sentence boundaries. Heuristic Algorithm (cont.)

Adaptive Sentence Boundary Detection
The group included Dr. J. M. Freeman and T. Boone Pickens Jr.
David D. Palmer and Marti A. Hearst, Adaptive Sentence Boundary Disambiguation, Technical Report 97/94, UC Berkeley: 98-99% correct.
The part-of-speech probabilities of the tokens surrounding a punctuation mark are input to a feed-forward neural network, and the network's output activation value indicates the role of the punctuation.

Adaptive Sentence Boundary Detection (cont.)
To avoid a circular dependency (POS tagging itself needs sentence boundaries), instead of assigning a single POS to each word, the algorithm uses the prior probabilities of all POS categories (20) for that word.
Input: k × 20 units, where k is the number of words of context surrounding an instance of end-of-sentence punctuation.
Hidden layer: k units with a sigmoid squashing activation function.
Output: 1 unit indicating the role of the punctuation mark.

Marking up Data: Mark-up Schemes
Plain text corpora are useful, but more can be learned if information is added:
– Boundaries for sentences, paragraphs, etc.
– Lexical tags
– Syntactic structure
– Semantic representation
– Semantic class
Different mark-up schemes:
– COCOA format (header information in texts, e.g., author, date, title): uses angle brackets, with the first letter indicating the broad semantics of the field.
– Standard Generalized Markup Language, or SGML (related: HTML, TEI, XML).

SGML Examples
An element with start and end tags: <p>This book does not delve very deeply into SGML.</p>
In XML, empty elements may be specifically marked by ending the tag name with a forward slash character, e.g., <br/>.
SGML can be very useful.
Character and entity codes: begin with an ampersand and end with a semicolon.
– &lt; is the less-than symbol: &lt; → <
– r&eacute;sum&eacute; → résumé
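Decoding such character and entity references is a standard library call in most languages; in Python, `html.unescape` resolves the HTML named-reference set, which includes the common SGML character entities shown above:

```python
from html import unescape

# Entity references begin with "&" and end with ";".
unescape("&lt; is the less than symbol")  # → "< is the less than symbol"
unescape("r&eacute;sum&eacute;")          # → "résumé"
```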

Marking up Data: Grammatical Coding
Tagging corresponds to indicating the various conventional parts of speech. Tagging can be done automatically (we will talk about that in a later lecture).
Different tag sets have been used, e.g., the Brown tag set, the University of Lancaster tag set, the Penn Treebank tag set, the British National Corpus (CLAWS), the Czech National Corpus.
The design of a tag set:
– Target features: useful information on the grammatical class.
– Predictive features: useful for predicting the behavior of other words in context (e.g., distinguish modals and auxiliary verbs from regular verbs).

Penn Treebank Tag Set
– Pronoun: PRP, PRP$, WP, WP$, EX
– Verb: VB, VBP, VBZ, VBD, VBG, VBN (have, be, and do are not distinguished)
– Infinitive marker (to): TO
– Preposition to: TO
– Other prepositions: IN
– Punctuation: . ; , - $ ( ) `` ''
– Other: FW, SYM, LS
– Adjective: JJ, JJR, JJS
– Cardinal: CD
– Adverb: RB, RBR, RBS, WRB
– Conjunction: CC, IN (subordinating and that)
– Determiner: DT, PDT, WDT
– Noun: NN, NNS, NNP, NNPS (no distinction for adverbial)

Tag Sets
General definition:
– Tags can be represented as a vector: (c1, c2, ..., cn)
– Or thought of as a flat list T = {ti}, i = 1..n, with some assumed 1:1 mapping T ↔ (C1, C2, ..., Cn)
English tag sets:
– Penn Treebank (45) (VBZ: verb, present, 3rd person, singular; JJR: adjective, comparative)
– Brown Corpus (87), CLAWS C5 (62), London-Lund (197)

Tag Sets for Other Languages
Differences:
– Larger number of tags
– Categories covered (POS, number, case, negation, ...)
– Level of detail
– Presentation (short names vs. structured ("positional"))
Example (Czech): AGFS3----1A----
Positions: POS SUBPOS GENDER NUMBER CASE POSSG POSSN PERSON TENSE DCOMP NEG VOICE VAR
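A positional tag decodes by position alone, so reading it is just zipping the tag string against the category names. A sketch using the slide's position names; the example tag has 15 characters while the slide lists 13 names, so two reserve positions are added here as an assumption:

```python
# One category name per character position; "-" means "does not apply".
FIELDS = ["POS", "SUBPOS", "GENDER", "NUMBER", "CASE", "POSSG", "POSSN",
          "PERSON", "TENSE", "DCOMP", "NEG", "VOICE",
          "RESERVE1", "RESERVE2", "VAR"]  # reserve slots are assumed

def decode_tag(tag):
    """Map each character of a positional tag to its category name."""
    return dict(zip(FIELDS, tag))

t = decode_tag("AGFS3----1A----")
# t["POS"] == "A", t["CASE"] == "3", t["NEG"] == "A"
```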

Sentence Length Distribution
