Learning Bit by Bit Class 3 – Stemming and Tokenization
Morphology: the study of the way words are constructed from smaller components. Stems: “talk”. Affixes: “ing”.
Morphology: orthographic rules are general spelling rules; morphological rules are specific to particular words or word classes.
Parsing: analyzing a text in pieces.
Parsing: morphological parsing means decomposing a word into its constituent morphemes (foxes -> fox + es).
Morphological Parsing: must recognize proper words (“spelling”); must not recognize improper words (“computering”).
Morphological Parsing: should not require a list of all possible words.
Morphological Parsing, applications: web search; spell check and grammar check; machine translation; sentiment analysis.
Computational Lexicon: stems, affixes, rules.
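To make the three lexicon parts concrete, here is a minimal sketch of a lexicon-driven parser. The stem list, the part-of-speech tags, and the table of which affix attaches to which class are all made up for illustration; a real lexicon would be far larger and would also encode orthographic rules.

```python
# Hypothetical toy lexicon: each stem is tagged with a word class.
STEMS = {"fox": "noun", "cat": "noun", "computer": "noun",
         "spell": "verb", "talk": "verb"}

# Hypothetical affix table: which word classes each affix may attach to.
AFFIXES = {"s": {"noun", "verb"}, "es": {"noun"},
           "ing": {"verb"}, "ed": {"verb"}}


def parse(word):
    """Return (stem, affix) if the lexicon licenses the word, else None."""
    word = word.lower()
    if word in STEMS:
        return (word, "")
    for affix, allowed_classes in AFFIXES.items():
        if word.endswith(affix):
            stem = word[: -len(affix)]
            # Accept only if the stem exists and its class takes this affix.
            if STEMS.get(stem) in allowed_classes:
                return (stem, affix)
    return None


print(parse("foxes"))        # ('fox', 'es')
print(parse("spelling"))     # ('spell', 'ing')
print(parse("computering"))  # None
```

The parser never stores “foxes” or “spelling” as whole words, which is the point of not needing a list of all possible words. Because no orthographic rules are modeled, this sketch would also accept a misspelling such as “foxs”.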
Finite State Transducer: a finite-state automaton (FSA) that maps an input to an output, defining a relationship between the two.
Finite State Transducer: arcs c:c, a:a, t:t, +N:ε, +PL:s. Input: cat +N +PL. Output: cats.
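The cat example can be sketched as a small transition table. The state numbering and the choice of a single accepting state are assumptions made for this illustration; only the five input:output arcs come from the slide.

```python
EPSILON = ""  # the empty output, written as ε on the slide

# (current state, input symbol) -> (output symbol, next state)
# State numbers are an assumption; the arcs are the ones on the slide.
TRANSITIONS = {
    (0, "c"): ("c", 1),
    (1, "a"): ("a", 2),
    (2, "t"): ("t", 3),
    (3, "+N"): (EPSILON, 4),
    (4, "+PL"): ("s", 5),
}
START = 0
ACCEPTING = {5}


def transduce(symbols):
    """Map a sequence of input symbols to an output string, or None if rejected."""
    state, output = START, []
    for symbol in symbols:
        arc = TRANSITIONS.get((state, symbol))
        if arc is None:
            return None  # no arc for this symbol: input is rejected
        emitted, state = arc
        output.append(emitted)
    return "".join(output) if state in ACCEPTING else None


print(transduce(["c", "a", "t", "+N", "+PL"]))  # cats
```

Reversing each arc (output:input) would give the parsing direction, cats -> cat +N +PL.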
Porter Stemmer: returns the stem of each word. Input: cats, output: cat. Input: positivity, output: posit. Input: pitted, output: pit.
Porter Stemmer rules: ATIONAL -> ATE (relational -> relate); ING -> ε (motoring -> motor); SSES -> SS (grasses -> grass).
Porter Stemmer errors: organization -> organ; doing -> do; policy -> polici.
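The three rewrite rules above can be sketched as plain suffix replacement. This is not the Porter algorithm itself: the real stemmer runs several ordered steps and attaches conditions to each rule (for example, ING is only stripped when the remaining stem contains a vowel). The function name and rule list here are illustrative only.

```python
# The three suffix rules from the slide, as (suffix, replacement) pairs.
RULES = [("ational", "ate"), ("sses", "ss"), ("ing", "")]


def apply_rules(word):
    """Apply the first matching suffix rule; return the word unchanged if none match."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


print(apply_rules("relational"))  # relate
print(apply_rules("motoring"))    # motor
print(apply_rules("grasses"))     # grass
print(apply_rules("sing"))        # s
```

The last call shows why the real rules carry conditions: unconditional stripping reduces “sing” to “s”.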
Tokenization: breaking a text into words or sentences.
Tokenization example: Mrs. Wilson’s reaction to the damage was “quite positive.” She asked for $15.55.
Tokenization: the simplest tokenizer is regex-based.
Tokenization: IndoEuropean Tokenizer. General purpose, alphabetic. Token = letters + numbers. Splits on whitespace, punctuation, and special characters.
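A minimal regex tokenizer in the spirit of the description above might look like the sketch below. The pattern is an assumption chosen for illustration (runs of letters and digits form a token, every other non-space character is its own token); it is not a reproduction of any library's tokenizer.

```python
import re

# A token is a run of word characters, or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("She asked for $15.55."))
# ['She', 'asked', 'for', '$', '15', '.', '55', '.']
```

The output shows the weakness of the simple approach: the price $15.55 is broken into four tokens, and “Mrs.” would likewise lose its period.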
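Sentence Tokenization: what is the challenge?
Sentence Tokenization: treat it as a binary classification problem, deciding for each candidate punctuation mark whether or not it ends a sentence.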
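The yes/no decision can be sketched with two hand-written rules standing in for a trained classifier. The abbreviation list, the features, and the function names are assumptions for illustration; a real system would learn the decision from labeled data.

```python
import re

# Hypothetical abbreviation list; a real system would use a larger one.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "inc", "corp", "co"}
CLOSERS = "\"'”’)"  # closing quotes/brackets that stay with the sentence


def is_boundary(text, i):
    """Binary decision: does the '.', '!' or '?' at index i end a sentence?"""
    match = re.search(r"(\w+)$", text[:i])
    previous_word = match.group(1).lower() if match else ""
    if previous_word in ABBREVIATIONS:
        return False  # e.g. the period in "Mrs."
    if i + 1 < len(text) and text[i + 1].isdigit():
        return False  # e.g. the period in "15.55"
    return True


def split_sentences(text):
    """Split text at every punctuation mark classified as a boundary."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".!?" and is_boundary(text, i):
            end = i + 1
            while end < len(text) and text[end] in CLOSERS:
                end += 1  # keep a closing quote attached to its sentence
            sentences.append(text[start:end].strip())
            start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = 'Mrs. Wilson said it was "quite positive." She asked for $15.55.'
print(split_sentences(text))
# ['Mrs. Wilson said it was "quite positive."', 'She asked for $15.55.']
```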
Stop List: a list of words to remove [the, a, an…]
Stop List EnglishStopTokenizerFactory: “a, be, had, it, only, she, was, about, because, has, its, of, some, we, after, been, have, last, on, such, were, all, but, he, more, one, than, when, also, by, her, most, or, that, which, an, can, his, mr, other, the, who, any, co, if, mrs, out, their, will, and, corp, in, ms, over, there, with, are, could, inc, mz, s, they, would, as, for, into, no, so, this, up, at, from, is, not, says, to”
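Filtering against a stop list comes down to a set lookup per token. The sketch below uses a small subset of the list above and a hypothetical function name; it is written in Python for brevity and is not the class example.

```python
# A small subset of the stop list above, stored as a set for fast lookup.
STOP_WORDS = {"a", "an", "the", "was", "to", "for",
              "she", "of", "and", "in", "is", "it"}


def remove_stop_words(tokens):
    """Drop every token whose lowercase form is on the stop list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


print(remove_stop_words(["She", "asked", "for", "the", "damage"]))
# ['asked', 'damage']
```

Lowercasing before the lookup matters here: without it, “She” at the start of a sentence would slip past the filter.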
Homework: program a stop list tokenizer (you can use my example as a starting point). Blog about what makes a good stop list, how major search engines use them, and how yours compares.