Statistical NLP: Lecture 6

Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

Corpus-Based Work
• Text corpora are usually big. They also need to be representative samples of the population of interest.
• Corpus-based work involves collecting a large number of counts from corpora, and those counts need to be accessed quickly.
• Some software exists for processing corpora (see the useful links on the course homepage).
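
As a minimal sketch of what "collecting counts" can look like, the snippet below tallies unigram and bigram counts in hash-based tables, which give constant-time lookup on average. The toy corpus and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter

# Toy corpus (an assumption for illustration); real corpora are far larger.
corpus = "the cat sat on the mat . the cat slept ."
tokens = corpus.split()  # naive whitespace tokenization

# Counter is a dict subclass, so lookups are hash-table lookups.
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(unigram_counts["the"])            # 3
print(bigram_counts[("the", "cat")])    # 2
print(bigram_counts[("cat", "on")])     # 0: unseen pairs count as zero
```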

Corpora
• Linguistically marked up or not
• Representative sample of the population of interest
  – American English vs. British English
  – Written vs. spoken
  – Areas
• The performance of a system depends heavily on the entropy
• Text categorization
• Balanced corpus vs. all text available

Software/Coding
• Software
  – Text editor
  – Regular expressions
  – Programming languages: C/C++, Perl, awk, Python, Prolog, Java
• Coding
  – Mapping words to numbers
  – Hashing
  – CMU-Cambridge Statistical Language Modeling toolkit
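
One way to picture "mapping words to numbers" with hashing is a two-way vocabulary table. The sketch below is only an illustration: the class name `Vocabulary` and its methods are hypothetical and are not taken from the CMU-Cambridge toolkit or any other library.

```python
class Vocabulary:
    """Hypothetical helper: maps each distinct word to a small integer ID."""

    def __init__(self):
        self.word_to_id = {}   # dict is a hash table: average O(1) lookup
        self.id_to_word = []   # list index is the ID, for the reverse mapping

    def add(self, word):
        """Return the ID of word, assigning the next free ID if it is new."""
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.id_to_word)
            self.id_to_word.append(word)
        return self.word_to_id[word]


vocab = Vocabulary()
ids = [vocab.add(w) for w in "the cat saw the other cat".split()]
print(ids)                   # [0, 1, 2, 0, 3, 1]
print(vocab.id_to_word[3])   # other
```

Storing integer IDs instead of strings keeps count tables compact, and the reverse list recovers the word when results are printed.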

Looking at Text (I): Low-Level Formatting Issues
• Mark-up of a text: formatting mark-up or explicit mark-up
• Junk formatting/content. Examples: document headers and separators, typesetter codes, tables and diagrams, garbled data in the computer file. Text retrieved through OCR brings further problems (unrecognized words). Often one needs a filter to remove junk content before any processing begins.
• Uppercase and lowercase: should we keep the case or not? “The”, “the” and “THE” should all be treated the same, but “brown” in “George Brown” and in “brown dog” should be treated separately.
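
A rough sketch of one possible case-folding heuristic: lowercase sentence-initial tokens and all-caps tokens, and leave other capitalized tokens alone on the assumption that they are names. The function name `fold_case` and the exact rules are assumptions for illustration, not a method prescribed by the lecture.

```python
def fold_case(tokens):
    """Lowercase sentence-initial and all-caps tokens; keep other capitals."""
    folded = []
    sentence_start = True
    for tok in tokens:
        if sentence_start or (tok.isupper() and len(tok) > 1):
            folded.append(tok.lower())
        else:
            folded.append(tok)   # mid-sentence capital: probably a name
        sentence_start = tok in {".", "?", "!"}
    return folded


result = fold_case("The brown dog bit George Brown . THE END".split())
print(result)
# ['the', 'brown', 'dog', 'bit', 'George', 'Brown', '.', 'the', 'end']
```

The heuristic is imperfect: a name that opens a sentence is lowercased as well, so some information is lost either way.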

Looking at Text (II): Tokenization
What is a Word?
• An early step of processing is to divide the input text into units called tokens, where each is either a word or something else, such as a number or a punctuation mark.
• Periods: haplologies or end of sentence? Examples: etc.; Korean “먹었다 하였다.” (“ate”, “did”); 6.7; 3.1절 (“Section 3.1”)
• White space
• Single apostrophes: isn’t, I’ll → 2 words? 1 word?
• Hyphenation: text-based, co-operation, e-mail, A-1-plus paper, “take-it-or-leave-it”, the 90-cent-an-hour raise, mark up → mark-up → mark(ed) up
• Homographs → two lexemes: “saw”
• Other examples: 26.3$, www.hyowon.pusan.ac.kr, MicroSoft, :-), Korean “책, ‘그’ 책” (“book, ‘that’ book”)
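
The decisions listed above can be made concrete with a small regular-expression tokenizer. This is a sketch under stated assumptions: it keeps apostrophe and hyphen compounds as single tokens, keeps decimal numbers and URLs whole, and splits off all other punctuation. Other choices are equally defensible, and the pattern name `TOKEN_RE` is hypothetical.

```python
import re

# Alternatives are tried left to right, so the order below matters.
TOKEN_RE = re.compile(r"""
    (?:https?://|www\.)\S*[A-Za-z0-9/]   # URLs, without trailing punctuation
  | \w+(?:[-'’]\w+)+                     # compounds: isn't, e-mail, 90-cent-an-hour
  | \d+(?:[.,]\d+)*                      # numbers: 6.7, 1,000
  | (?:[A-Za-z]\.){2,}                   # dotted abbreviations: e.g., U.S.
  | \w+                                  # plain words
  | \S                                   # any other single character
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens matched by TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Isn't the 90-cent-an-hour raise 6.7% at www.hyowon.pusan.ac.kr?")
print(tokens)
# ["Isn't", 'the', '90-cent-an-hour', 'raise', '6.7', '%', 'at',
#  'www.hyowon.pusan.ac.kr', '?']
```

Emoticons such as :-) are still broken into three tokens by this pattern, which shows how quickly hand-written rules run into exceptions.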

Looking at Text (III): Tokenization
What is a Word (Cont’d)?
• Word segmentation in other languages: no whitespace ==> word segmentation is hard
• Whitespace that does not indicate a word break: New York, data base, the New York-New Haven railroad
• Variant coding of information of a certain semantic type, e.g. phone numbers: +45 43 48 60 60, (202) 522-2230, 33 1 34 43 32 26, (44.171) 830 1007
• Speech corpora: er, um
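
To see why segmentation without whitespace is hard, consider greedy longest-match segmentation (often called MaxMatch). This technique is brought in here purely as an illustration and is not named on the slide; the tiny lexicon is an assumption, and English text with the spaces removed stands in for an unsegmented language.

```python
def max_match(text, lexicon):
    """Greedy longest-match segmentation of text that has no spaces."""
    longest = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(text), i + longest), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


lexicon = {"the", "table", "down", "there", "theta", "bled", "own"}
segments = max_match("thetabledownthere", lexicon)
print(segments)   # ['theta', 'bled', 'own', 'there']
```

The greedy choice yields "theta bled own there" rather than the intended "the table down there", even though every intended word is in the lexicon.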

Morphology
• Stemming: strips off affixes (sit, sits, sat)
• Lemmatization: transforms a word into its base form (lemma, lexeme)
  – Disambiguation
• Not always helpful in English (from an IR point of view), which has very little morphology
  – Stemming does not help the performance of classical IR!!
  – business → busy
  – Perhaps more useful in other contexts
• Multiple words → a morpheme ???
• Richer inflectional and derivational systems
  – Bantu language KiHaya: akabimu’ha (a-ka-bi-mu’-ha, 1SG-PAST-3PL-3SG-give), “I gave them to him.”
  – Finnish: millions of inflected forms for each verb
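
A toy suffix stripper makes both points above visible: it conflates "business" with "busy", and it cannot relate the irregular form "sat" to "sit". The rule list is invented for this illustration and is not the Porter stemmer or any other published algorithm.

```python
# Ordered (suffix, replacement) rules: a toy rule set, not a real stemmer.
RULES = [("iness", "y"), ("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def toy_stem(word):
    """Apply the first matching suffix rule, keeping at least 2 stem letters."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["sits", "sat", "business", "busy"]:
    print(w, "->", toy_stem(w))
# sits -> sit
# sat -> sat
# business -> busy
# busy -> busy
```

Mapping "sat" to "sit" takes a lexicon of irregular forms, which is what separates lemmatization from plain affix stripping.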

Sentences: What is a Sentence?
• Something ending with a ‘.’, ‘?’ or ‘!’. True in 90% of the cases.
• Sometimes, however, sentences are split up by other punctuation marks or quotes.
• Solutions often involve heuristic methods, and these heuristics are hand-coded. Some efforts to automate sentence-boundary detection have also been made.
• Example: “You remind me,” she remarked, “of your mother.”
• Korean is even harder!!!
  – Sometimes there is no period → split after a sentence-final ending?
  – Some endings are both connective and sentence-final
  – Quotation marks

End-of-Sentence Detection (I)
• Place EOS after all . ? ! (maybe ; : -)
• Move EOS after quotation marks, if any
• Disqualify a period boundary if:
  – Preceded by a known abbreviation that is followed by an upper-case letter and is not normally sentence-final: e.g., Prof. vs. Mr.

End-of-Sentence Detection (II)
• Disqualify a period boundary if (cont’d):
  – Preceded by a known abbreviation and not followed by an upper-case letter: e.g., “Jr.” and “etc.” (abbreviations that can be sentence-final or medial)
• Disqualify a sentence boundary with ? or ! if followed by a lower-case letter (or a known name)
• Keep all the rest as EOS
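
The heuristic on these two slides can be sketched in a few lines of Python. Several things here are assumptions: the two abbreviation lists are tiny stand-ins for real ones, the input is split on whitespace, the optional ; : - marks are ignored, and the "known name" test for ? and ! is left out.

```python
TITLE_ABBREVS = {"Prof.", "Mr.", "vs."}   # assumed list: rarely sentence-final
OTHER_ABBREVS = {"Jr.", "etc."}           # assumed list: final or medial
CLOSERS = "\"'”’)"                        # marks the boundary moves past
OPENERS = "\"'“‘("


def split_sentences(text):
    """Split text into sentences using the hand-coded EOS heuristic."""
    words = text.split()
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        core = word.rstrip(CLOSERS)           # EOS goes after closing quotes
        if not core or core[-1] not in ".?!":
            continue                          # no putative boundary here
        nxt = words[i + 1].lstrip(OPENERS) if i + 1 < len(words) else ""
        if core[-1] == ".":
            if core in TITLE_ABBREVS:
                continue                      # e.g. "Prof. Smith"
            if core in OTHER_ABBREVS and nxt and not nxt[0].isupper():
                continue                      # e.g. "Jr. in May"
        elif nxt[:1].islower():
            continue                          # "?" or "!" then lower case
        sentences.append(" ".join(current))   # keep all the rest as EOS
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = 'Prof. Smith met Mr. Jones Jr. in May. "Is it late?" he asked. They left.'
sentences = split_sentences(text)
for s in sentences:
    print(s)
```

On this input the sketch produces three sentences, and the quoted example “You remind me,” she remarked, “of your mother.” stays in one piece.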

Marked-Up Data I: Mark-up Schemes
• Schemes developed to mark up the structure of text
• Different mark-up schemes:
  – COCOA format (older, and rather ad hoc)
  – SGML [other related encodings: HTML, TEI, XML]
• DTD, XML Schema
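
As a small illustration of working with explicit mark-up, the snippet below parses an XML fragment with Python's standard `xml.etree.ElementTree` module and recovers the plain text. The document and its element names are made up for this example and do not follow TEI or any particular DTD.

```python
import xml.etree.ElementTree as ET

# A made-up document; the element names are illustrative only.
doc = "<text><head>Corpus-Based Work</head><p>Corpora are <hi>big</hi>.</p></text>"

root = ET.fromstring(doc)
heading = root.find("head").text
paragraph = "".join(root.find("p").itertext())   # text with inline tags dropped

print(heading)     # Corpus-Based Work
print(paragraph)   # Corpora are big.
```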

Marked-Up Data II: Grammatical Coding
• Tagging indicates the various conventional parts of speech. Tagging can be done automatically (we will talk about that in Week 9).
• Different tag sets have been used: e.g., Brown Tag Set, Penn Treebank Tag Set.
  – Explain Tables 4.4 and 4.5
• The design of a tag set: target features versus predictive features.
  – Explain Korean tag sets (ETRI, KAIST, …), using the distinction between auxiliary and main predicates as an example
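
Tagged text is often stored as word/TAG pairs in plain text. The sketch below reads one such line into (word, tag) tuples; the sample sentence is invented, the tags are from the Penn Treebank tag set, and the helper name `parse_tagged` is hypothetical.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        word, _, tag = item.rpartition("/")   # split on the last slash only
        pairs.append((word, tag))
    return pairs


tagged = "The/DT dog/NN barks/VBZ ./."
pairs = parse_tagged(tagged)
print(pairs)
# [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('.', '.')]
```

Splitting on the last slash keeps words that themselves contain a slash intact.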