Madrid 2010. Kilgarriff: Corpus Processing and NLP


1 Corpus Processing and NLP

2 What is NLP?
Natural Language Processing
  – natural language vs. computer languages
Other names
  – Computational Linguistics: emphasizes scientific, not technological
  – Language Engineering: official European Union term, ca
  – Human Language Technology (HLT): preferred EU and US Government term
  – Language Technology

3 NLP and linguistics
LING: supply ideas, interpret results
NLP: test theories, expose gaps
plus: turn into technology

4 Example: regular morphology
LINGUISTICS:
  – Rules: stems -> inflected forms
NLP:
  – program the rules
  – apply rules to a lexicon of stems
  – Is the output correct? Errors?
LINGUISTICS:
  – refine the theory
Needed for: web search, spell-checkers, machine translation, speech recognition systems, etc.
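
The loop on this slide (program the rules, apply them to a lexicon of stems, check the output) can be sketched as follows. This is only an illustration: the rules are a naive guess at regular English verb inflection, and `LEXICON` and `GOLD` are hypothetical toy data, not taken from the slides.

```python
def inflect(stem):
    """Apply naive regular rules to an English verb stem (illustrative only)."""
    if stem.endswith("e"):
        # arrive -> arrives, arrived, arriving
        return {"3sg": stem + "s", "past": stem + "d", "ing": stem[:-1] + "ing"}
    # walk -> walks, walked, walking
    return {"3sg": stem + "s", "past": stem + "ed", "ing": stem + "ing"}


# Hypothetical lexicon of stems and hand-checked correct forms.
LEXICON = ["walk", "arrive", "help", "go"]
GOLD = {"go": {"3sg": "goes", "past": "went", "ing": "going"}}


def errors(lexicon, gold):
    """Return (stem, slot, produced, expected) for every form the rules get wrong."""
    wrong = []
    for stem in lexicon:
        produced = inflect(stem)
        for slot, expected in gold.get(stem, {}).items():
            if produced[slot] != expected:
                wrong.append((stem, slot, produced[slot], expected))
    return wrong


for stem in LEXICON:
    print(stem, inflect(stem))
print("errors:", errors(LEXICON, GOLD))
```

The errors reported for "go" are the kind of gap that goes back to the linguist for a refined theory (here, a list of irregular verbs).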

5 Application areas
web search
  – Basic search
  – Filtering results
spelling and grammar checking
machine translation (MT)
talking to computers
  – speech processing as well
information extraction (IE)
  – finding facts in a database of documents; populating a database, answering questions

6 How can NLP make better dictionaries?
By pre-processing a corpus:
  tokenization
  sentence splitting
  lemmatization
  POS-tagging
  parsing
Each step builds on its predecessors

7 Tokenization
“identifying the words”
from: he didn't arrive.
to: He did n’t arrive .

8 Automatic tokenization
Western writing systems
  – easy! space is separator
Chinese, Japanese, some other writing systems
  – do not use word-separator
  – hard, like POS-tagging (below)

9 Why isn't space=separator enough (even for English)?
What is a space?
  – linebreaks, paragraph breaks, tabs
Punctuation
  – characters do not form parts of words but may be attached to words (with no spaces)
      brackets, quotation marks
Hyphenation
  – is co-op one word or two? is well-managed?
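
A minimal tokenizer sketch that addresses the three problems above. The design choices are assumptions for illustration, not the method used in the talk: any whitespace counts as a separator, each punctuation character becomes its own token, hyphenated words are kept whole, and the clitic "n't" is split off as in the earlier tokenization example.

```python
import re

# A word (internal hyphens and apostrophes kept), or one punctuation character.
# \s in the negated class means tabs and line breaks separate tokens like spaces.
TOKEN_RE = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

# Host word followed by the clitic n't (straight or curly apostrophe).
CLITIC_RE = re.compile(r"(\w+)(n['’]t)")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    tokens = []
    for tok in TOKEN_RE.findall(text):
        match = CLITIC_RE.fullmatch(tok)
        if match:
            # didn't -> did + n't
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(tok)
    return tokens


print(tokenize("He didn't arrive."))
print(tokenize('(a well-managed\tco-op)\n"ok"'))
```

Keeping "co-op" and "well-managed" as single tokens is one possible answer to the hyphenation question; a different project could reasonably split them.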

10 Sentence splitting
“identifying the sentences”
from: he didn't arrive.
to: He did n’t arrive .
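
Sentence splitting can be sketched with a simple rule: a word ending in ".", "!" or "?" closes a sentence unless it is a known abbreviation. The abbreviation list below is a small hypothetical sample, and the rule is an assumption for illustration rather than the splitter used in the talk.

```python
import re

# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

# Sentence-final punctuation, optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r"""[.!?]["')\]]*$""")


def split_sentences(text):
    """Group whitespace-separated words into sentences."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if SENTENCE_END.search(word) and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing words without final punctuation still form a sentence.
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("He didn't arrive. Dr. Smith waited. Why?"))
```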

11 Lemmatization
Mapping from text-word to lemma: help (verb)
text-word   lemma
help        help (v)
helps       help (v)
helping     help (v)
helped      help (v)

12 Lemmatization
Mapping from text-word to lemma: help (verb), help (noun), helping (noun)
text-word   lemma
help        help (v), help (n)
helps       help (v), help (n)**
helping     help (v), helping (n)
helped      help (v)
helpings    helping (n)
**help (n): usually a mass noun, but part of the compound "home help", which is a count noun and takes the "s" ending.

13 Lemmatization
Dictionary entries are for lemmas, so lemmatization is required for a match between text-word and dictionary-word.

14 Lemmatization
Searching by lemma
  – English: little inflection
  – French: 36 forms per verb
  – Finno-Ugric:
Not always wanted:
  – English royalty
      singular: kings and queens
      plural royalties: payments to authors

15 Automatic lemmatization
Write rules:
  – if word ends in "ing", delete "ing";
  – if the remainder is a verb lemma, add to list of possible lemmas
If detailed grammar available, use it
Full lemma list is also required
  – Often available from dictionary companies
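
The rule on this slide (strip a suffix, then keep the remainder only if it appears in the lemma list) might look like the sketch below. `LEMMAS` stands in for the full lemma list and `RULES` for the rule set; both are hypothetical toy data chosen to reproduce the "help" table from the earlier slides.

```python
# Hypothetical lemma list, keyed by part of speech.
LEMMAS = {
    "v": {"help", "arrive", "visit"},
    "n": {"help", "helping", "relative"},
}

# (suffix to delete, string to add back, part of speech of the candidate lemma)
RULES = [
    ("", "", "v"), ("", "", "n"),         # the word is already a lemma
    ("ing", "", "v"), ("ing", "e", "v"),  # helping -> help, arriving -> arrive
    ("ed", "", "v"), ("d", "", "v"),      # helped -> help, arrived -> arrive
    ("s", "", "v"), ("s", "", "n"),       # helps -> help (v), help (n)
]


def lemmatize(word):
    """Return every (lemma, pos) that some rule derives and the lemma list confirms."""
    word = word.lower()
    found = []
    for suffix, replacement, pos in RULES:
        if not word.endswith(suffix):
            continue
        candidate = word[: len(word) - len(suffix)] + replacement
        if candidate in LEMMAS[pos] and (candidate, pos) not in found:
            found.append((candidate, pos))
    return found


for w in ["help", "helps", "helping", "helped", "helpings", "king"]:
    print(w, lemmatize(w))
```

The lemma list is what stops "king" from being analysed as a form of a verb "k": the rule fires, but the remainder is not a known lemma.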

16 Part-of-speech (POS) tagging
“identifying parts of speech”
from: he didn't arrive.
to: ….
to:
  He      PNP   pers pronoun
  did     VVD   past tense verb
  n’t     XNOT  not
  arrive  VV    base form of verb
  .       C     punctuation

17 Tagsets
The set of part-of-speech tags to choose between
  – Basic: noun, verb, pronoun …
  – Advanced: examples from the CLAWS English tagset
      NN2  plural noun
      VVG  -ing form of lexical verb
Based on linguistics of the language.

18 POS-tagging: why?
Use grammar when searching
  – Nouns modified by buckle
  – Verbs that buckle is object of

19 POS-tagging: how?
Big topic for computational linguistics
  – well understood
  – taggers available for major languages
Some taggers use lemmatized input, others do not
Methods
  – constraint-based: set of rules of the form
      if previous word is "the" and VERB is one of the possibilities, delete VERB
  – statistical: machine learning from tagged corpus; various methods
Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press 1999.
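
The constraint-based rule quoted on this slide can be sketched as below. The lexicon of possible tags per word is hypothetical, as is the choice to treat unknown words as nouns; a real constraint-based tagger would have many such rules and a much larger lexicon.

```python
# Hypothetical lexicon: every tag each word could take.
LEXICON = {
    "the": {"DET"},
    "they": {"PRON"},
    "buckle": {"NOUN", "VERB"},
    "broke": {"VERB"},
}


def tag(tokens):
    """Look up possible tags, then apply the single rule from the slide."""
    result = []
    previous = None
    for token in tokens:
        word = token.lower()
        # Assumption: unknown words are treated as nouns.
        tags = set(LEXICON.get(word, {"NOUN"}))
        # Rule: if previous word is "the" and VERB is one of the possibilities, delete VERB.
        if previous == "the" and "VERB" in tags and len(tags) > 1:
            tags.discard("VERB")
        result.append((token, sorted(tags)))
        previous = word
    return result


print(tag(["the", "buckle", "broke"]))
print(tag(["they", "buckle"]))
```

In the second call "buckle" stays ambiguous, because one rule is not enough; further constraints, or a statistical model, would be needed to pick VERB there.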

20 Parsing
Find the structure:
  – Phrase structure (trees)
      The cat sat on the mat
  – Dependency structure (links)
      The cat sat on the mat

21 Automatic parsing
Big topic
  – see Jurafsky and Martin or other NLP textbook
Many methods too slow for large corpora
Sketch Engine usually uses “shallow parsing”
  – Patterns of POS-tags
  – Regular expressions
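
Shallow parsing as "patterns of POS-tags" matched by regular expressions can be illustrated with a small noun-phrase finder. This is a generic sketch, not Sketch Engine's actual grammar: the tag names, the one-letter encoding in `TAG_CODES` and the pattern itself are assumptions.

```python
import re

# Encode each tag as one letter so a tag sequence becomes a plain string.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "VERB": "V", "PREP": "P"}

# Noun phrase: optional determiner, any number of adjectives, one or more nouns.
NP_PATTERN = re.compile(r"D?A*N+")


def noun_phrases(tagged):
    """Return the word sequences whose tags match NP_PATTERN."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    phrases = []
    for match in NP_PATTERN.finditer(codes):
        # One code letter per token, so match offsets are token indices.
        words = [word for word, _ in tagged[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


sentence = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
            ("on", "PREP"), ("the", "DET"), ("mat", "NOUN")]
print(noun_phrases(sentence))
```

Because it is a single regular-expression pass over the tag string, this kind of matching stays fast on large corpora, which is the motivation the slide gives for preferring it to full parsing.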

22 Regular expressions
Search for any pattern
Very useful in lots of places
Exercises
  – http://www.sketchengine.co.uk/exercises/regex

23 Summary
What is NLP?
How can it help?
  – Tokenizing
  – Sentence splitting
  – Lemmatizing
  – POS-tagging
  – Parsing

24 Exercise
A sentence of your language
A tagset of your language
Tokenize
For each word, decide
  – What is the lemma (doesn’t apply in Chinese)
  – Which tag applies
Word       Lemma     Tag
Visiting   visit     VVG
relatives  relative  NN2
…

