
Statistical NLP: Lecture 6


1 Statistical NLP: Lecture 6
Corpus-Based Work (Ch 4)

2 Corpus-Based Work
Text corpora are usually big. They also need to be representative samples of the population of interest. Corpus-based work involves collecting a large number of counts from corpora, and these counts need to be accessed quickly. Some software exists for processing corpora (see useful links on the course homepage).

3 Corpora
Linguistically marked-up or not
Representative sample of the population of interest: American English vs. British English; written vs. spoken; subject areas
The performance of a system depends heavily on the entropy of the text (e.g., in text categorization)
Balanced corpus vs. all text available

4 Software/Coding
Software: text editors, regular expressions, programming languages (C/C++, Perl, awk, Python, Prolog, Java)
Coding: mapping words to numbers, hashing
CMU-Cambridge Statistical Language Modeling toolkit
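The "mapping words to numbers" idea above can be sketched in a few lines. This is a minimal illustration (not the CMU-Cambridge toolkit itself): a vocabulary class that assigns each word an integer ID via a hash map, so corpus counts can be stored and accessed quickly by integer code.

```python
class Vocabulary:
    """Map words to small integer IDs, assigned in order of first appearance."""

    def __init__(self):
        self.word2id = {}   # word -> integer code (hash map)
        self.id2word = []   # integer code -> word

    def encode(self, word):
        # Assign the next free integer the first time a word is seen.
        if word not in self.word2id:
            self.word2id[word] = len(self.id2word)
            self.id2word.append(word)
        return self.word2id[word]


vocab = Vocabulary()
ids = [vocab.encode(w) for w in "the cat saw the dog".split()]
```

Counting over integer IDs rather than strings is the usual trick for making large corpus counts fast and memory-friendly.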

5 Looking at Text (I): Low-Level Formatting Issues
Mark-up of a text: formatting mark-up or explicit mark-up
Junk formatting/content. Examples: document headers and separators, typesetter codes, tables and diagrams, garbled data in the computer file. There are also other problems if the data was retrieved through OCR (unrecognized words). Often one needs a filter to remove junk content before any processing begins.
Uppercase and lowercase: should we keep the case or not? "The", "the" and "THE" should all be treated the same, but "brown" in "George Brown" and "brown dog" should be treated separately.
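The case question above can be attacked with a crude heuristic. The sketch below is an assumption-laden toy, not a full truecaser: it lowercases the sentence-initial token and all-caps tokens (so "The"/"THE" join "the"), but keeps mid-sentence capitalized words such as "Brown" in "George Brown" untouched.

```python
def normalize_case(tokens):
    """Crude case normalization: lowercase sentence-initial and ALL-CAPS
    tokens, keep other capitalized (likely proper-noun) tokens as-is."""
    out = []
    for i, tok in enumerate(tokens):
        if i == 0 or tok.isupper():
            out.append(tok.lower())
        else:
            out.append(tok)
    return out
```

A real system would need more evidence (e.g., whether the word also occurs lowercase elsewhere in the corpus) before deciding.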

6 Looking at Text (II): Tokenization. What is a Word?
An early step of processing is to divide the input text into units called tokens, where each token is either a word or something else such as a number or a punctuation mark.
Periods: abbreviation marker or end of sentence? (haplology: a sentence-final abbreviation gets only one period) e.g., etc.; Korean 먹었다 ('ate'), 하였다 ('did'); decimals such as 6.7; 3.1절 ('Section 3.1')
Whitespace
Single apostrophes: isn't, I'll → 2 words or 1 word?
Hyphenation: text-based, co-operation, A-1-plus paper, "take-it-or-leave-it", the 90-cent-an-hour raise; mark up → mark-up → mark(ed) up
Homographs → two lexemes: "saw"
Other cases: $26.3, MicroSoft, :-), 책 ('book'), '그' 책 ('that' book)
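A small regex tokenizer can illustrate some of the decisions above. This is only a sketch of one possible policy (the splits chosen, e.g. isn't → is + n't, are assumptions, not the book's definitive answer): clitics are split off first, then decimals, hyphenated words, words, and punctuation are scanned in that order.

```python
import re


def tokenize(text):
    """Toy tokenizer: split clitics, keep decimals and hyphenated words whole."""
    # Split clitics off first: isn't -> is n't, I'll -> I 'll
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    text = re.sub(r"(\w)'(ll|re|ve|s|d|m)\b", r"\1 '\2", text)
    # Alternation order matters: decimals before words, n't before \w+,
    # hyphenated compounds as one token, then single punctuation marks.
    return re.findall(r"\d+\.\d+|n't|\w+(?:-\w+)*|'\w+|[^\w\s]", text)
```

For example, `tokenize("She isn't sure I'll pay 6.7 dollars.")` keeps 6.7 as one token while treating the sentence-final period as its own token; `tokenize("take-it-or-leave-it deal")` keeps the hyphenated compound whole.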

7 Looking at Text (III): Tokenization. What is a Word (Cont'd)?
Word segmentation in other languages: no whitespace ==> word segmentation is hard
Whitespace not indicating a word break: New York, data base, the New York-New Haven railroad
Variant coding of information of a certain semantic type, e.g., phone numbers: (202) …, (44.171) …
Speech corpora: er, um, …
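For scripts without whitespace, a classic baseline is greedy maximum matching against a lexicon. The sketch below is one simple such heuristic (the lexicon contents are made-up for illustration); it takes the longest known word at each position and falls back to a single character when nothing matches.

```python
def max_match(text, lexicon):
    """Greedy longest-match segmentation of unspaced text.

    `lexicon` is a set of known words; unknown material falls back to
    single-character tokens.
    """
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest span first
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


lex = {"data", "base", "database", "new", "york"}
```

Note that greedy matching also shows the ambiguity mentioned above: with this lexicon, "databasenewyork" comes out as database + new + york, even though data + base would also be possible.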

8 Morphology
Stemming: strips off affixes. sit, sits, sat
Lemmatization: transforms a word into its base form (lemma, lexeme); requires disambiguation
Not always helpful in English (from an IR point of view), which has very little morphology. Stemming does not help the performance of classical IR: it conflates unrelated words, e.g., business → busy
Perhaps more useful in other contexts: multiple words may map to one morpheme(?); languages with richer inflectional and derivational systems
Bantu language (KiHaya): akabimu'ha (a-ka-bi-mu'-ha, 1SG-PAST-3PL-3SG-give) 'I gave them to him.'
Finnish: millions of inflected forms for each verb
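A toy suffix stripper makes the conflation problem above concrete. This is nothing like a full Porter stemmer, just an assumed list of suffixes with a minimum-stem-length guard; it shows how "business" loses its "-ness" and collides with "busy", and why irregular forms like "sat" are untouched (stemming is not lemmatization).

```python
def crude_stem(word):
    """Strip one suffix from an assumed list, keeping at least 3 stem chars."""
    for suffix in ("ness", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

So crude_stem("sits") gives "sit", but crude_stem("sat") stays "sat", and crude_stem("business") gives "busi", the classic business ~ busy conflation.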

9 Sentences: What is a Sentence?
Something ending with a '.', '?' or '!'. True in about 90% of the cases.
Sometimes, however, sentences are split up by other punctuation marks or quotes: "You remind me," she remarked, "of your mother."
Often, solutions involve heuristic methods; however, these solutions are hand-coded. Some efforts to automate sentence-boundary detection have also been made.
Korean is even harder: the period is sometimes omitted → break after a sentence-final ending? Some verb endings are both connective and sentence-final, and quotation marks add further ambiguity.

10 End-of-Sentence Detection (I)
Place EOS after all . ? ! (and maybe ; : -)
Move EOS after quotation marks, if any
Disqualify a period boundary if:
– preceded by a known abbreviation that is followed by an upper-case letter and is not normally sentence-final, e.g., Prof., Mr.

11 End-of-Sentence Detection (II)
– preceded by a known abbreviation not followed by an upper-case letter, e.g., Jr., etc. (abbreviations that can be sentence-final or medial)
Disqualify a ? or ! boundary if followed by a lower-case letter (or a known name)
Keep all the rest as EOS
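The rules on these two slides can be sketched as a small token-level splitter. The abbreviation lists below are tiny placeholders (a real system needs a much larger lexicon), and the quotation-mark rule is omitted for brevity; otherwise the logic follows the heuristic above.

```python
# Assumed toy abbreviation lists (a real lexicon would be much larger).
SENT_MEDIAL = {"Prof.", "Mr.", "Dr.", "e.g."}   # never end a sentence
SENT_FINAL_OR_MEDIAL = {"Jr.", "etc."}          # end only before upper case


def split_sentences(tokens):
    """Heuristic EOS detection over a token list; returns a list of sentences."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in SENT_MEDIAL:
            continue                              # disqualify: Prof., Mr.
        if tok in SENT_FINAL_OR_MEDIAL and not nxt[:1].isupper():
            continue                              # disqualify: Jr., etc. + lower case
        if tok.endswith((".", "?", "!")):
            if tok[-1] in "?!" and nxt[:1].islower():
                continue                          # disqualify: ?/! + lower case
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

With this sketch, "He met Prof. Smith Jr. Then he left ." splits into two sentences: the period in "Prof." never ends a sentence, while "Jr." does here because the next token is capitalized.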

12 Marked-Up Data I: Mark-up Schemes
Schemes developed to mark up the structure of text.
Different mark-up schemes:
– COCOA format (older, and rather ad hoc)
– SGML [other related encodings: HTML, TEI, XML]; DTDs, XML Schema
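As a small taste of working with XML-style mark-up, the snippet below reads an invented two-sentence document with Python's standard library. The `<doc>`/`<s n='…'>` element names are made up for illustration; real TEI or SGML corpora need a fuller toolchain, but the access pattern is the same.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus fragment: sentences marked with <s> and numbered.
doc = "<doc><s n='1'>Hello world .</s><s n='2'>Second sentence .</s></doc>"

root = ET.fromstring(doc)
# Collect (sentence number, sentence text) pairs from the mark-up.
sentences = [(s.get("n"), s.text) for s in root.iter("s")]
```

This is the payoff of explicit mark-up: sentence boundaries come from the encoding rather than from heuristics like those on the previous slides.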

13 Marked-Up Data II: Grammatical Coding
Tagging indicates the various conventional parts of speech. Tagging can be done automatically (we will talk about that in Week 9).
Different tag sets have been used: e.g., Brown Tag Set, Penn Treebank Tag Set (see Tables 4.4 and 4.5).
The design of a tag set: target features versus predictive features.
Korean tag sets (ETRI, KAIST, …), e.g., distinguishing auxiliary verbs from main verbs.
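Grammatically coded text is often distributed in a flat "word/TAG" form; the snippet below shows one common way to read such Penn-Treebank-style pairs (the example sentence itself is made up). `rsplit("/", 1)` splits on the last slash so that a tagged slash token like "./." still parses.

```python
from collections import Counter

# Hypothetical tagged text in Penn-Treebank-style word/TAG notation.
tagged = "The/DT dog/NN saw/VBD the/DT cat/NN ./."

# Split each token on its last slash into (word, tag).
pairs = [tok.rsplit("/", 1) for tok in tagged.split()]
tag_counts = Counter(tag for _, tag in pairs)
```

Counts like these over a tagged corpus are the raw material for the statistical taggers discussed in Week 9.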

