Corpus Processing and NLP
Madrid 2010 Kilgarriff: Corpus Processing and NLP
What is NLP?
Natural Language Processing: natural language vs. computer languages
Other names:
  Computational Linguistics: emphasizes the scientific, not the technological
  Language Engineering: official European Union term, ca
  Human Language Technology (HLT): preferred EU and US Government term
  Language Technology
NLP and linguistics
Linguistics (LING) to NLP: supply ideas, interpret results
NLP to linguistics: test theories, expose gaps, plus turn into technology
Example: regular morphology
LINGUISTICS:
  Rules: stems -> inflected forms
NLP:
  program the rules
  apply rules to a lexicon of stems
  Is the output correct? Errors? Refine the theory
Needed for: web search, spell-checkers, machine translation, speech recognition systems, etc.
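A minimal sketch of "program the rules, apply them to a lexicon of stems". The three rules, the function name inflect_regular_verb and the toy lexicon are assumptions made for illustration, not rules taken from the slides.

```python
def inflect_regular_verb(stem):
    """Apply three toy rules for regular English verbs (illustrative only)."""
    if stem.endswith("e"):
        # arrive -> arrives, arrived, arriving
        return {"3sg": stem + "s", "past": stem + "d", "ing": stem[:-1] + "ing"}
    if stem.endswith(("s", "sh", "ch", "x", "z")):
        # watch -> watches, watched, watching
        return {"3sg": stem + "es", "past": stem + "ed", "ing": stem + "ing"}
    # help -> helps, helped, helping
    return {"3sg": stem + "s", "past": stem + "ed", "ing": stem + "ing"}


# Hypothetical lexicon of stems
LEXICON = ["help", "arrive", "watch", "stop"]

for stem in LEXICON:
    print(stem, inflect_regular_verb(stem))
```

Checking the output exposes a gap: the rules give "stoped" for "stop" because they know nothing about consonant doubling. That is the "Errors? Refine the theory" step.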
Application areas
web search
  basic search
  filtering results
spelling and grammar checking
machine translation (MT)
talking to computers
  speech processing as well
information extraction (IE)
  finding facts in a database of documents; populating a database; answering questions
How can NLP make better dictionaries?
By pre-processing a corpus:
  tokenization
  sentence splitting
  lemmatization
  POS-tagging
  parsing
Each step builds on its predecessors
Tokenization
“identifying the words”
from: He didn't arrive.
to:
  He
  did
  n’t
  arrive
  .
Automatic tokenization
Western writing systems
  easy! space is separator
Chinese, Japanese, and some other writing systems
  do not use a word separator
  hard, like POS-tagging (below)
Why isn't space=separator enough (even for English)?
What is a space?
  linebreaks, paragraph breaks, tabs
Punctuation
  characters do not form parts of words but may be attached to words (with no spaces)
  brackets, quotation marks
Hyphenation
  is co-op one word or two?
  is well-managed?
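The points above can be sketched as a small regular-expression tokenizer. This is one possible set of choices, not the method any particular tool uses: all whitespace (spaces, tabs, linebreaks) separates tokens, each punctuation character is a token of its own, hyphenated words are kept whole, and the clitic n't is split off. The names TOKEN_RE and tokenize are hypothetical.

```python
import re

# Alternatives are tried in order, so the n't cases come before plain words.
TOKEN_RE = re.compile(r"""
    [A-Za-z]+(?=n['’]t\b)       # the part of a word just before the clitic
  | n['’]t\b                    # the clitic itself (straight or curly apostrophe)
  | [A-Za-z]+(?:-[A-Za-z]+)*    # ordinary words; hyphenated words stay whole
  | \d+                         # runs of digits
  | [^\w\s]                     # any other single punctuation character
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found in text; whitespace is discarded."""
    return TOKEN_RE.findall(text)


print(tokenize("He didn't arrive."))
print(tokenize("Is the (well-managed)\tco-op\nopen?"))
```

Treating co-op and well-managed as one token each is a design decision; a different project might reasonably split them.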
Sentence splitting
“identifying the sentences”
from: He didn't arrive.
to:
  <s>
  He
  did
  n’t
  arrive
  .
  </s>
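A naive sentence splitter, as a sketch only: a sentence ends at a word whose last character is ".", "!" or "?", unless the word is in a small abbreviation list. The list and the function names split_sentences and mark_up are assumptions made for illustration; real splitters need far more care.

```python
# Hypothetical, deliberately tiny abbreviation list
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "etc.", "e.g.", "i.e."}


def split_sentences(text):
    """Split text into sentences at sentence-final punctuation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".!?" and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words with no final punctuation
        sentences.append(" ".join(current))
    return sentences


def mark_up(text):
    """Wrap each sentence in <s> ... </s>, as on the slide."""
    return " ".join(f"<s> {s} </s>" for s in split_sentences(text))


print(mark_up("He didn't arrive. Dr. Smith waited."))
```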
Lemmatization
Mapping from text-word to lemma: help (verb)
text-word    lemma
help         help (v)
helps        help (v)
helping      help (v)
helped       help (v)
Lemmatization
Mapping from text-word to lemma: help (verb), help (noun), helping (noun)
text-word    lemma
help         help (v), help (n)
helps        help (v), help (n)**
helping      help (v), helping (n)
helped       help (v)
helpings     helping (n)
**help (n): usually a mass noun, but part of the compound "home help", which is a count noun, taking the "s" ending.
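The mapping above can be held as a simple lookup table. In this sketch the table contents come from the slide; the names LEMMA_TABLE and lemmas_of are hypothetical.

```python
# text-word -> list of (lemma, part of speech) pairs, as on the slide
LEMMA_TABLE = {
    "help":     [("help", "v"), ("help", "n")],
    "helps":    [("help", "v"), ("help", "n")],
    "helping":  [("help", "v"), ("helping", "n")],
    "helped":   [("help", "v")],
    "helpings": [("helping", "n")],
}


def lemmas_of(text_word):
    """Return every possible lemma for a text-word (empty list if unknown)."""
    return LEMMA_TABLE.get(text_word.lower(), [])


print(lemmas_of("Helping"))
```

A text-word can map to several lemmas; choosing between them is the job of the later POS-tagging step.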
Lemmatization
Dictionary entries are for lemmas, so lemmatization is required for a match between text-word and dictionary-word.
Lemmatization
Searching by lemma
  English: little inflection
  French: 36 forms per verb
  Finno-Ugric: 2000
Not always wanted. English:
  royalty, singular: kings and queens
  royalties, plural: payments to authors
Automatic lemmatization
Write rules:
  if word ends in "ing", delete "ing"; if the remainder is a verb lemma, add it to the list of possible lemmas
If a detailed grammar is available, use it
A full lemma list is also required
  often available from dictionary companies
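The "ing" rule can be sketched as follows. The verb lemma list here is a toy stand-in for the full lemma list the slide says is required, and the extra rules (restoring a final "e", stripping "ed" and "s") are assumptions added for illustration.

```python
# Hypothetical verb lemma list; a real one would come from a dictionary
VERB_LEMMAS = {"help", "visit", "arrive", "sit"}

# (suffix to delete, string to put in its place)
SUFFIX_RULES = [("ing", ""), ("ing", "e"), ("ed", ""), ("ed", "e"), ("s", "")]


def candidate_verb_lemmas(word):
    """Return the verb lemmas that the suffix rules propose for word."""
    word = word.lower()
    candidates = []
    if word in VERB_LEMMAS:  # the word may already be a lemma
        candidates.append(word)
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            remainder = word[:-len(suffix)] + replacement
            # keep the remainder only if the lemma list confirms it
            if remainder in VERB_LEMMAS and remainder not in candidates:
                candidates.append(remainder)
    return candidates


for w in ["helping", "arriving", "king", "sitting"]:
    print(w, candidate_verb_lemmas(w))
```

The lemma list is what stops "king" being analysed as "k" + "ing". "sitting" gets no candidate because these rules ignore consonant doubling, which is where a detailed grammar would help.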
Part-of-speech (POS) tagging
“identifying parts of speech”
from: He didn't arrive.
to:
  <s>
  He      PNP   personal pronoun
  did     VVD   past tense verb
  n’t     XNOT  not
  arrive  VV    base form of verb
  .       C     punctuation
  </s>
Tagsets
The set of part-of-speech tags to choose between
  Basic: noun, verb, pronoun …
  Advanced: examples from the CLAWS English tagset
    NN2  plural noun
    VVG  -ing form of lexical verb
Based on the linguistics of the language.
POS-tagging: why?
Use grammar when searching
  Nouns modified by buckle
  Verbs that buckle is object of
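A rough sketch of both searches over a tiny hand-tagged corpus. The corpus, the CLAWS-style tags on it, the function names and the three-token window are all assumptions made for illustration; this is a crude approximation of "object of", not how any particular corpus tool does it.

```python
# Hypothetical tagged corpus: (word, tag) pairs
TAGGED = [
    ("she", "PNP"), ("wore", "VVD"), ("buckle", "NN1"), ("shoes", "NN2"), (".", "PUN"),
    ("he", "PNP"), ("fastened", "VVD"), ("the", "AT0"), ("buckle", "NN1"), (".", "PUN"),
]


def nouns_modified_by(word, tagged):
    """Nouns that directly follow `word` when `word` is itself tagged as a noun."""
    return [w2 for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
            if w1 == word and t1.startswith("NN") and t2.startswith("NN")]


def verbs_with_object(word, tagged, window=3):
    """Nearest lexical verb within `window` tokens to the left of noun `word`."""
    hits = []
    for i, (w, t) in enumerate(tagged):
        if w != word or not t.startswith("NN"):
            continue
        # skip cases where `word` only modifies a following noun
        if i + 1 < len(tagged) and tagged[i + 1][1].startswith("NN"):
            continue
        for j in range(i - 1, max(i - window, 0) - 1, -1):
            if tagged[j][1].startswith("VV"):
                hits.append(tagged[j][0])
                break
    return hits


print(nouns_modified_by("buckle", TAGGED))
print(verbs_with_object("buckle", TAGGED))
```

Neither search is possible on untagged text, since "buckle" can also be a verb.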
POS-tagging: how?
Big topic for computational linguistics
  well understood
  taggers available for major languages
Some taggers use lemmatized input, others do not
Methods
  Constraint-based: a set of rules of the form
    if previous word is "the" and VERB is one of the possibilities, delete VERB
  Statistical:
    machine learning from a tagged corpus
    various methods
Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
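The constraint-based rule quoted above can be sketched directly. Each word starts with every tag its lexicon entry allows, and the rule then deletes readings. The lexicon and the name apply_the_constraint are hypothetical; the guard that never deletes a word's last remaining reading is an added assumption, not something stated on the slide.

```python
# Hypothetical lexicon: word -> set of possible basic tags
LEXICON = {
    "the": {"DET"},
    "run": {"NOUN", "VERB"},
    "was": {"VERB"},
    "long": {"ADJ"},
}


def apply_the_constraint(words):
    """If the previous word is "the" and VERB is one possibility, delete VERB."""
    readings = [set(LEXICON.get(w.lower(), {"UNKNOWN"})) for w in words]
    for i in range(1, len(words)):
        previous_is_the = words[i - 1].lower() == "the"
        # never remove the only reading a word has left
        if previous_is_the and "VERB" in readings[i] and len(readings[i]) > 1:
            readings[i].discard("VERB")
    return readings


print(apply_the_constraint(["The", "run", "was", "long"]))
```

A full constraint-based tagger applies many such rules until, ideally, one reading per word remains.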
Parsing
Find the structure:
  Phrase structure (trees): The cat sat on the mat
  Dependency structure (links): The cat sat on the mat
Automatic parsing
Big topic
  see Jurafsky and Martin or another NLP textbook
Many methods are too slow for large corpora
Sketch Engine usually uses “shallow parsing”
  patterns of POS-tags
  regular expressions
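Shallow parsing as "patterns of POS-tags" can be illustrated with an ordinary regular expression run over the tag sequence. This is a sketch of the general idea only and does not reproduce Sketch Engine's own grammar formalism; the tagged sentence, the noun-phrase pattern and the function name are assumptions.

```python
import re

# Hypothetical tagged sentence with CLAWS-style tags
TAGGED = [("the", "AT0"), ("black", "AJ0"), ("cat", "NN1"), ("sat", "VVD"),
          ("on", "PRP"), ("the", "AT0"), ("mat", "NN1")]

# Noun phrase: optional article, any number of adjectives, one or more nouns.
# Every tag is followed by one space, so spaces mark token boundaries.
NP_PATTERN = re.compile(r"(?<!\S)(?:AT0 )?(?:AJ0 )*(?:NN[12] )+")


def noun_phrases(tagged):
    """Return the word sequences whose tags match NP_PATTERN."""
    tag_string = "".join(tag + " " for _, tag in tagged)
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        # the number of spaces before an offset is the token index there
        start = tag_string[:match.start()].count(" ")
        end = tag_string[:match.end()].count(" ")
        phrases.append(" ".join(word for word, _ in tagged[start:end]))
    return phrases


print(noun_phrases(TAGGED))
```

Matching flat tag patterns is much cheaper than building a full tree, which is why it scales to large corpora.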
Regular expressions
Search for any pattern
Very useful in lots of places
Exercises
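As a small example of searching for a pattern, the regular expression below finds "help" and its regular inflected forms in running text. The pattern, the names HELP_FORMS and find_help_forms, and the sample sentence are made up for illustration.

```python
import re

# "help" optionally followed by s, ed or ing, as a whole word
HELP_FORMS = re.compile(r"\bhelp(?:s|ed|ing)?\b", re.IGNORECASE)


def find_help_forms(text):
    """Return every form of "help" found in text, in order."""
    return HELP_FORMS.findall(text)


print(find_help_forms("She helps; he helped. Helping is helpful."))
```

The word boundaries (\b) keep "helpful" out of the results.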
Summary
What is NLP?
How can it help?
  Tokenizing
  Sentence splitting
  Lemmatizing
  POS-tagging
  Parsing
Exercise
A sentence of your language
A tagset of your language
Tokenize
For each word, decide
  What is the lemma (doesn’t apply in Chinese)
  Which tag applies
Word       Lemma     Tag
Visiting   visit     VVG
relatives  relative  NN2
…