1
Learning Bit by Bit Class 3 – Stemming and Tokenization
2
Morphology: the study of the way words are constructed from smaller components
3
Morphology: the study of the way words are constructed from smaller components
– Stems: “talk”
– Affixes: “ing”
4
Morphology:
– Orthographic Rules: general
– Morphological Rules: specific
5
Parsing: analyzing a text in pieces
6
Parsing:
– Morphological Parsing: decomposing a word into its constituent morphemes
– Foxes -> fox + es
7
Morphological Parsing:
– Must recognize proper words: “spelling”
– Must not recognize improper words: “computering”
8
Morphological Parsing: should not require a list of all possible words
9
Morphological Parsing:
– Web search
– Spell check, grammar check
– Machine translation
– Sentiment analysis
10
Computational Lexicon:
– Stems
– Affixes
– Rules
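The three parts of a lexicon can be sketched as a toy parser. This is only an illustration, not the course's implementation: the stem entries, the part-of-speech tags, the affix table, and the function names `surface_suffix` and `parse` are all assumptions made for the example. It accepts “spelling” and “foxes” but rejects “computering” without storing a list of every word form.

```python
# Toy computational lexicon: stems, affixes, and rules (all illustrative).
STEMS = {"fox": "N", "cat": "N", "computer": "N", "talk": "V", "spell": "V"}

# Morphological rules: which affixes each part of speech may take.
AFFIXES = {"N": ["s"], "V": ["s", "ed", "ing"]}


def surface_suffix(stem, affix):
    """Orthographic rule: plural/3rd-person 's' becomes 'es' after s, x, z, ch, sh."""
    if affix == "s" and stem.endswith(("s", "x", "z", "ch", "sh")):
        return "es"
    return affix


def parse(word):
    """Return every (stem, suffix) split licensed by the lexicon, or [] if none."""
    analyses = []
    for stem, pos in STEMS.items():
        if word == stem:
            analyses.append((stem, ""))
        for affix in AFFIXES[pos]:
            suffix = surface_suffix(stem, affix)
            if word == stem + suffix:
                analyses.append((stem, suffix))
    return analyses


print(parse("foxes"))        # fox + es
print(parse("spelling"))     # spell + ing
print(parse("computering"))  # no analysis: nouns do not take "ing" here
```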
11
Computational Lexicon
13
Finite State Transducer: an FSA that maps an input string to an output string, defining a relation between the two
14
Finite State Transducer: c:c a:a t:t +N:ε +PL:s
– Input: cat +N +PL
– Output: cats
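The arc sequence on this slide can be written out as a small transition table. This is a minimal sketch under assumptions: the state numbers, the names `ARCS`, `FINAL_STATES`, and `transduce`, and the choice to accept only the plural path are invented for the example. Each arc reads one input symbol and emits one output symbol, with ε modeled as the empty string.

```python
EPS = ""  # epsilon: the arc consumes input but emits nothing

# (state, input symbol) -> (next state, output symbol)
ARCS = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "t"): (3, "t"),
    (3, "+N"): (4, EPS),
    (4, "+PL"): (5, "s"),
}
FINAL_STATES = {5}  # only the path shown on the slide is accepted


def transduce(symbols):
    """Map a sequence of input symbols to an output string, or None if rejected."""
    state, output = 0, []
    for symbol in symbols:
        if (state, symbol) not in ARCS:
            return None
        state, emitted = ARCS[(state, symbol)]
        output.append(emitted)
    return "".join(output) if state in FINAL_STATES else None


print(transduce(["c", "a", "t", "+N", "+PL"]))
```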
15
Porter Stemmer: returns the stem of each word
– Input: cats, output: cat
– Input: positivity, output: posit (IVITI is first rewritten to IVE, giving positive; a later step then strips IVE)
– Input: pitted, output: pit
16
Porter Stemmer:
– ATIONAL : ATE (relational -> relate)
– ING : ε (motoring -> motor)
– SSES : SS (grasses -> grass)
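The three rewrite rules above can be tried out with a few lines of code. This sketch is not the Porter algorithm: it applies only these three rules, in an order chosen for the example, and it leaves out the conditions the real stemmer places on the remaining stem (so it would wrongly reduce “sing” to “s”). The names `RULES` and `apply_rules` are assumptions.

```python
# (suffix, replacement) pairs taken from the slide; "" stands for epsilon.
RULES = [
    ("ational", "ate"),
    ("sses", "ss"),
    ("ing", ""),
]


def apply_rules(word):
    """Rewrite the first matching suffix; return the word unchanged if none match."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for example in ["relational", "motoring", "grasses"]:
    print(example, "->", apply_rules(example))
```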
17
Porter Stemmer: errors
– Organization -> organ
– Doing -> do
– Policy -> polici
18
Tokenization: breaking a text into words or sentences
19
Tokenization:
Mrs. Wilson’s reaction to the damage was “quite positive.” She asked for $15.55.
20
Tokenization: the simplest tokenizer is regex-based
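A regex-based tokenizer can be as short as one pattern. The pattern below is an assumption chosen for illustration (runs of letters and digits, or any single non-space symbol), not the one used in class. Running it on the example sentence from the previous slide shows where such a tokenizer struggles: “Mrs.” loses its period, “Wilson’s” splits at the apostrophe, and “$15.55” breaks into four tokens.

```python
import re

# A token is a run of letters/digits/underscore, or one non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_PATTERN.findall(text)


sentence = ("Mrs. Wilson’s reaction to the damage was “quite positive.” "
            "She asked for $15.55.")
print(tokenize(sentence))
```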
21
Tokenization: IndoEuropean Tokenizer
– General purpose, alphabetic
– Token = letters + numbers
– Splits on whitespace, punctuation, special characters
22
Sentence Tokenization: what is the challenge?
23
Sentence Tokenization: treat it as binary classification (does this punctuation mark end a sentence, or not?)
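The yes/no decision can be illustrated with hand-written rules standing in for a trained classifier. Everything here is an assumption made for the sketch: the abbreviation list, the three features (abbreviation before, digit after, capital letter after), and the names `is_boundary` and `split_sentences`. A real classifier would learn such features from labeled text.

```python
import re

ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "inc", "co", "corp"}  # illustrative only
CLOSERS = "\"'”’)"  # closing quotes/brackets that stay with the sentence


def is_boundary(text, i):
    """Binary decision: does the punctuation mark at text[i] end a sentence?"""
    before = re.search(r"(\w+)$", text[:i])
    after = text[i + 1:]
    if before and before.group(1).lower() in ABBREVIATIONS:
        return False  # e.g. the period in "Mrs."
    if after[:1].isdigit():
        return False  # e.g. the period in "15.55"
    rest = after.lstrip(CLOSERS + " ")
    return rest == "" or rest[0].isupper()


def split_sentences(text):
    """Split text at every punctuation mark classified as a boundary."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        if is_boundary(text, match.start()):
            end = match.end()
            while end < len(text) and text[end] in CLOSERS:
                end += 1
            sentences.append(text[start:end].strip())
            start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences


example = ("Mrs. Wilson’s reaction to the damage was “quite positive.” "
           "She asked for $15.55.")
for sentence in split_sentences(example):
    print(sentence)
```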
24
Stop List: a list of words to remove [the, a, an…]
25
Stop List (EnglishStopTokenizerFactory): “a, be, had, it, only, she, was, about, because, has, its, of, some, we, after, been, have, last, on, such, were, all, but, he, more, one, than, when, also, by, her, most, or, that, which, an, can, his, mr, other, the, who, any, co, if, mrs, out, their, will, and, corp, in, ms, over, there, with, are, could, inc, mz, s, they, would, as, for, into, no, so, this, up, at, from, is, not, says, to”
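Applying a stop list is a filter over the token stream. The sketch below is a starting illustration only and does not reproduce any library's behavior: it uses a handful of words from the list above, lowercases the text before matching (an assumption), and reuses a simple regex tokenizer. The names `STOP_WORDS` and `stop_tokenize` are hypothetical.

```python
import re

# A small subset of the stop list above, for illustration.
STOP_WORDS = {"a", "an", "the", "to", "for", "was", "she"}


def stop_tokenize(text):
    """Tokenize text, lowercase it, and drop any token found in STOP_WORDS."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(stop_tokenize("She asked for the damage report"))
```

Whether to lowercase, and whether punctuation tokens should also be dropped, are design choices worth comparing when building a full stop list.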
26
Homework:
– Program a stop list tokenizer (you can use my example as a starting point)
– Blog about what makes a good stop list, how major search engines use them, and how yours compares