CH.4 PROBABILITY AND TEXT SAMPLING 2011.10.19. Data mining LAB 이아람.


1 CH.4 PROBABILITY AND TEXT SAMPLING 2011.10.19. Data mining LAB 이아람

2 4.5 THE BAG-OF-WORDS MODEL Only word frequencies are analyzed. Word order is irrelevant.
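
To make the bag-of-words idea concrete, here is a minimal Python sketch. The function name `bag_of_words` and the simple regular-expression tokenizer are assumptions made for illustration; they are not taken from the slides. The point is only that two sentences with the same words in a different order produce the same bag.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map each lowercased word to its frequency; word order is discarded."""
    # Assumption: a word is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

bag_a = bag_of_words("The cat ate the bird.")
bag_b = bag_of_words("The bird ate the cat.")

# Different word order, identical word frequencies.
print(bag_a == bag_b)
print(bag_a["the"])
```

Under this model the two sentences are indistinguishable, even though their meanings differ.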

3 4.6 THE EFFECT OF SAMPLE SIZE How the number of types is related to the number of tokens as the sample size increases.

4 4.6.1 TOKENS vs TYPES Tokens: every word is counted, including repetitions. Types: repetitions are ignored. Example: "The cat ate the bird." has 5 tokens but only 4 types, because "the" occurs twice (ignoring case).
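
The distinction can be checked with a few lines of Python. This is a sketch under two assumptions that the slide does not state: words are lowercased before counting, and punctuation is dropped by a simple regular expression (the helper `tokenize` is a hypothetical name).

```python
import re

def tokenize(text):
    """Split text into lowercased word tokens (assumed tokenization rule)."""
    return re.findall(r"[a-z']+", text.lower())

tokens = tokenize("The cat ate the bird.")
types = set(tokens)  # a set keeps each distinct word once

print(len(tokens), "tokens")
print(len(types), "types")
```

If case were not folded, "The" and "the" would count as separate types, so the tokenization rule affects the type count.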

5 Notation N = the size of the text sample (the number of tokens); V(N) = the number of types in a sample of size N; w_i = the i-th labeled word; f(w_i, N) = the frequency of the word w_i in a text of size N
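
The notation maps directly onto two small functions. The sketch below assumes that "a text of size N" means the first N tokens of a longer text; the names `num_types` and `frequency` and the toy sentence are illustrative, not from the slides.

```python
def num_types(tokens, n):
    """V(N): number of distinct words among the first n tokens."""
    return len(set(tokens[:n]))

def frequency(word, tokens, n):
    """f(w_i, N): number of occurrences of `word` among the first n tokens."""
    return tokens[:n].count(word)

# Toy text of 9 tokens (made up for illustration).
tokens = "the cat ate the bird and the bird sang".split()

v5, v9 = num_types(tokens, 5), num_types(tokens, 9)
f5, f9 = frequency("the", tokens, 5), frequency("the", tokens, 9)
print(v5, v9)
print(f5, f9)
```

Both V(N) and f(w_i, N) can only stay the same or grow as N increases, which is why the curves on the following slides never fall.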

6 TOKENS vs TYPES

7 Tokens vs Types Figure 4.5

8 Tokens vs Tokens/Types Figure 4.6

9 Tokens vs Tokens/Types (2) Figure 4.7 "The Black Cat": 3.17 tokens per type. "The Unparalleled Adventure of One Hans Pfaall": 5.61 tokens per type. N from 1,000 to 4,000
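
A curve like the ones in these figures can be produced by tracking the ratio N / V(N) as the sample grows. The following is a rough sketch, not the code behind the figures: the function name `tokens_per_type_curve`, the `step` parameter, and the toy token list are all assumptions.

```python
def tokens_per_type_curve(tokens, step):
    """Return (N, N / V(N)) pairs, recorded every `step` tokens."""
    seen = set()   # types observed so far
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            curve.append((n, n / len(seen)))
    return curve

# Toy data; a real run would use the tokens of a full story.
toy_tokens = ["a", "b", "a", "c", "a", "b", "a", "d"]
curve = tokens_per_type_curve(toy_tokens, step=2)
for n, ratio in curve:
    print(n, round(ratio, 2))
```

For a real text, `step` could be set to something like 100 and the pairs passed to a plotting library.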

10 Size of sample In corpus linguistics, a common approach is to take samples of equal size. Each sample is smaller than the text it comes from, so all texts can be analyzed in a similar fashion. Some corpora use this approach, e.g. the Brown Corpus.
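
One simple way to get equal-size samples is to keep the same number of tokens from every text. The sketch below takes the first `size` tokens, which is an assumption made for illustration; it is not a description of how the Brown Corpus or any other corpus was actually sampled. The function name and toy texts are hypothetical.

```python
def equal_size_samples(texts, size):
    """Cut every tokenized text down to its first `size` tokens."""
    samples = {}
    for name, tokens in texts.items():
        if len(tokens) < size:
            raise ValueError(f"{name!r} has fewer than {size} tokens")
        samples[name] = tokens[:size]
    return samples

# Toy texts of different lengths (made up for illustration).
texts = {
    "short": "a b c d e f".split(),
    "long": "a a b a c a a d a e a f".split(),
}
samples = equal_size_samples(texts, size=6)
type_counts = {name: len(set(tokens)) for name, tokens in samples.items()}
print(type_counts)
```

With N held fixed, differences in the number of types reflect the texts themselves rather than their lengths.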

