Learning Bit by Bit Class 3 – Stemming and Tokenization.

1 Learning Bit by Bit Class 3 – Stemming and Tokenization

2 Morphology The study of the way words are constructed from smaller components

3 Morphology The study of the way words are constructed from smaller components Stems – “talk” Affixes – “ing”

4 Morphology Orthographic Rules – General; Morphological Rules – Specific

5 Parsing Analyzing a text in pieces

6 Parsing Morphological Parsing – decomposing a word into its constituent morphemes Foxes -> fox + es

7 Morphological Parsing Must recognize proper words “spelling” Must not recognize improper words “computering”

8 Morphological Parsing Should not require a list of all possible words

9 Morphological Parsing Web Search Spell check, grammar check Machine translation Sentiment analysis

10 Computational Lexicon Stems Affixes Rules
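A minimal sketch of how a lexicon of stems and affixes can drive morphological parsing without a list of every possible word. The stem set, affix set, and the `parse` function are hypothetical and deliberately tiny; they are not taken from the slides.

```python
# Hypothetical toy lexicon; a real one would be far larger.
STEMS = {"fox", "cat", "talk", "spell"}
AFFIXES = {"s", "es", "ing", "ed"}


def parse(word):
    """Return (stem, affix) if the word is a known stem plus an optional
    known affix, otherwise None."""
    word = word.lower()
    if word in STEMS:
        return (word, "")
    # Try longer affixes first so "es" is preferred over "s".
    for affix in sorted(AFFIXES, key=len, reverse=True):
        if word.endswith(affix):
            stem = word[: -len(affix)]
            if stem in STEMS:
                return (stem, affix)
    return None


print(parse("foxes"))        # ('fox', 'es')
print(parse("spelling"))     # ('spell', 'ing')
print(parse("computering"))  # None
```

This sketch has no rules component, so it also accepts improper forms such as "cates". Restricting which affix may attach to which stem is the job of the orthographic and morphological rules.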

11 Computational Lexicon

12

13 Finite State Transducer An FSA that maps an input to an output, i.e. it encodes a relationship between two strings

14 Finite State Transducer c:c a:a t:t +N:ε +PL:s Input – cat +N +PL Output – cats
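The transducer on this slide can be sketched as a transition table in which every arc consumes one input symbol and emits one output symbol (the empty string stands in for ε). The state numbering and the `transduce` function are assumptions made for illustration.

```python
# (state, input symbol) -> (next state, output symbol); "" plays the role of epsilon.
TRANSITIONS = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "t"): (3, "t"),
    (3, "+N"): (4, ""),
    (4, "+PL"): (5, "s"),
}
FINAL_STATES = {5}


def transduce(symbols):
    """Walk the transducer over the input symbols and return the output
    string, or None if the input is not accepted."""
    state = 0
    output = []
    for symbol in symbols:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return None
        state, emitted = TRANSITIONS[key]
        output.append(emitted)
    return "".join(output) if state in FINAL_STATES else None


print(transduce(["c", "a", "t", "+N", "+PL"]))  # cats
```

This toy machine accepts only the single path shown on the slide; a practical lexicon transducer would share states across many stems and affixes.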

15 Porter Stemmer Returns the stem of each word Input: cats, output: cat Input: positivity, output: posit Input: pitted, output: pit

16 Porter Stemmer ATIONAL : ATE (relational -> relate) ING : ε (motoring -> motor) SSES : SS (grasses -> grass)
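The three rewrite rules on this slide can be sketched as plain suffix replacement. This is not the Porter algorithm itself, which runs several ordered steps and attaches conditions to each rule; the `RULES` table and `apply_rules` function are hypothetical names used only to show the idea.

```python
# (suffix, replacement) pairs taken from the slide; "" stands in for epsilon.
RULES = [
    ("ational", "ate"),
    ("sses", "ss"),
    ("ing", ""),
]


def apply_rules(word):
    """Apply the first matching suffix rule; return the word unchanged
    if no rule matches."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


print(apply_rules("relational"))  # relate
print(apply_rules("motoring"))    # motor
print(apply_rules("grasses"))     # grass
```

Because the conditions are missing, this sketch would reduce "sing" to "s". The real stemmer strips ING only when the remaining stem contains a vowel.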

17 Porter Stemmer Errors: organization -> organ; doing -> do; policy -> polici

18 Tokenization Breaking a text into words or sentences

19 Tokenization Mrs. Wilson’s reaction to the damage was “quite positive.” She asked for $15.55.

20 Tokenization Simplest tokenizer is regex-based

21 Tokenization IndoEuropean Tokenizer General purpose alphabetic Token = letters + numbers Splits on whitespace, punctuation, special characters
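A minimal regex-based tokenizer along the lines described here: a token is a run of letters and digits, and every other non-whitespace character becomes a token of its own. The pattern and the `tokenize` function are assumptions for illustration and do not claim to reproduce any particular library's tokenizer.

```python
import re

# A run of ASCII letters/digits, or any single non-space character that is neither.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split text into alphanumeric tokens and single-character
    punctuation tokens, discarding whitespace."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("She asked for $15.55."))
# ['She', 'asked', 'for', '$', '15', '.', '55', '.']
```

The output shows why tokenization is harder than it looks: the price is split into four tokens, and the decimal point is indistinguishable from the sentence-final period.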

22 Sentence Tokenization What is the challenge?

23 Sentence Tokenization Binary Classifier
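Sentence tokenization can be framed as a yes/no decision at each candidate punctuation mark: does it end a sentence or not? The sketch below uses two hand-written rules as a stand-in for a trained binary classifier; the abbreviation list and both function names are hypothetical.

```python
import re

# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr"}


def is_boundary(text, i):
    """Binary decision: does the punctuation mark at text[i] end a sentence?"""
    previous_word = re.search(r"(\w+)$", text[:i])
    if text[i] == "." and previous_word and previous_word.group(1).lower() in ABBREVIATIONS:
        return False  # period after a known abbreviation, e.g. "Mrs."
    if i + 1 < len(text) and text[i + 1].isdigit():
        return False  # period inside a number, e.g. "15.55"
    return True


def split_sentences(text):
    """Split text at every mark that is_boundary accepts."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        i = match.start()
        if is_boundary(text, i):
            end = i + 1
            # Keep closing quotes or brackets with the sentence they close.
            while end < len(text) and text[end] in "\"')":
                end += 1
            sentences.append(text[start:end].strip())
            start = end
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences


print(split_sentences('Mrs. Wilson was "quite positive." She asked for $15.55.'))
# ['Mrs. Wilson was "quite positive."', 'She asked for $15.55.']
```

A learned classifier would replace `is_boundary` with a model trained on labelled boundaries, using features such as the surrounding words and their capitalization.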

24 Stop List List of words to remove [the, a, an…]

25 Stop List EnglishStopTokenizerFactory: “a, be, had, it, only, she, was, about, because, has, its, of, some, we, after, been, have, last, on, such, were, all, but, he, more, one, than, when, also, by, her, most, or, that, which, an, can, his, mr, other, the, who, any, co, if, mrs, out, their, will, and, corp, in, ms, over, there, with, are, could, inc, mz, s, they, would, as, for, into, no, so, this, up, at, from, is, not, says, to”
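A stop list is applied as a filter over the token stream. The sketch below uses a small subset of the words listed on this slide; the `remove_stop_words` function is a hypothetical name, and lower-casing before lookup is an assumption about how matching should work.

```python
# A small subset of the stop words listed on the slide.
STOP_WORDS = {"a", "an", "the", "was", "to", "for", "she"}


def remove_stop_words(tokens):
    """Drop every token whose lower-cased form is in the stop list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


print(remove_stop_words(["She", "asked", "for", "the", "damage"]))
# ['asked', 'damage']
```

Using a set keeps each lookup constant-time, which matters once the filter runs over every token in a large collection.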

26 Homework Program a stop list tokenizer (you can use my example as a starting point). Blog about what makes a good stop list, how major search engines use them, and how yours compares.

