Carnegie Mellon: Words. What constitutes a word? Does it matter? Word tokens vs. word types; type-token curves. Zipf’s law, Mandelbrot’s law; explanation.


1 Words
• What constitutes a word? Does it matter?
• Word tokens vs. word types; type-token curves
• Zipf’s law, Mandelbrot’s law; explanation
• Heterogeneity of language: written vs. spoken, period, genre, register, domain, topic (hierarchy), speaker, audience
• “Uncertainty principle of language modeling”
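A type-token curve plots the number of distinct words (types) seen so far against the number of running words (tokens) read. The following is a minimal sketch, assuming whitespace tokenization; the function name `type_token_curve` and the toy sentence are illustrative and do not come from the slides.

```python
def type_token_curve(tokens, step=1):
    """Return (tokens_seen, types_seen) pairs, sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)  # a type is counted once, however often it recurs
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


# Toy input (an assumption for illustration, not corpus data).
sample = "the cat sat on the mat and the dog sat on the cat".split()
curve = type_token_curve(sample)
print(curve[-1])  # (total tokens, total types)
```

Plotting the second coordinate against the first, on linear or log axes, gives curves of the kind shown on the following slides.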

2 Sub-language Example 1
• “Wall Street Journal” Corpus (WSJ): newspaper articles; written English; rich vocabulary (leaning towards finance)
• “Switchboard” Corpus (SWB): transcribed spoken conversations over the telephone; prescribed topic (one of 70); 1990s
• “Broadcast News” Corpus (BN): transcribed TV/radio news programs; spoken, but somewhat scripted

3 Unigram Type-Token Curve – BN vs. SWB

4 Unigram Type-Token Curve – BN vs. SWB (log scale)

5 Unigram Type-Token Curve – BN vs. SWB vs. WSJ

6 Unigram Type-Token Curve – BN vs. SWB vs. WSJ (log scale)

7 Bigram Type-Token Curve – BN vs. SWB

8 Bigram Type-Token Curve – BN vs. SWB (log scale)

9 Trigram Type-Token Curve – BN vs. SWB

10 Trigram Type-Token Curve – BN vs. SWB (log scale)
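The bigram and trigram curves count n-gram types and tokens in place of single words. A hedged sketch of that counting follows; `ngrams` and `count_types_and_tokens` are hypothetical helper names, and the toy sentence stands in for a real corpus.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_types_and_tokens(tokens, n):
    """Return (number of distinct n-grams, number of n-gram occurrences)."""
    grams = ngrams(tokens, n)
    return len(set(grams)), len(grams)


# Toy input (an assumption for illustration, not corpus data).
sample = "the cat sat on the mat the cat ran".split()
for n in (1, 2, 3):
    types, tokens = count_types_and_tokens(sample, n)
    print(n, types, tokens)
```

Even on this tiny input the share of distinct n-grams rises with n (6 of 9, 7 of 8, 7 of 7), which is the effect the higher-order curves make visible at corpus scale.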

11 Head of Word Frequency List (counts per 1,000 tokens)

WSJ             BN              SWB
THE        49   ?          62   I          38
?          42   THE        49   AND        34
TO         24   TO         27   ?          31
OF         24   AND        25   THE        28
A          22   A           ?   YOU        26
AND        19   OF         21   UH         26
IN         19   IN         17   A          24
THAT        9   ?          16   TO         23
FOR         9   IS         13   THAT       20
IS          8   YOU        12   IT         17
ONE         7   I          12   OF         17
ON          6   IT         10   KNOW       16
POINT       5   FOR         8   YEAH       14
AS          5   THIS        8   IN         12
SAID        5   ON          7   +NOISE+    12
WITH        5   HAVE        6   THEY       10
IT          5   ARE         6   UH-HUH     10
FIVE        5   WE          6   HAVE       10
TWO         5   THEY        6   BUT         9
DOLLARS     5   BE          6   SO          8
AT          5   WITH        6   IT’S        8
MR.         5   BUT         5   IS          8
BY          5   WAS         5   WE          8
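A list like the one above can be produced by counting tokens and scaling each count to a rate per 1,000 tokens. This is a minimal sketch under assumed conventions (rounding to the nearest integer, upper-cased tokens); `head_per_thousand` is a hypothetical name and the input is made up.

```python
from collections import Counter


def head_per_thousand(tokens, k=5):
    """Top-k words with their frequency expressed per 1,000 tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, round(1000 * count / total))
            for word, count in counts.most_common(k)]


# Toy input (an assumption for illustration, not corpus data).
sample = ["THE"] * 5 + ["OF"] * 3 + ["A"] * 2
head = head_per_thousand(sample, k=2)
print(head)
```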

12 Tail of Word Frequency List: Count=1 (“Singletons”)

WSJ           BN            SWB
ZEN           ZEROS         YEARBOOK
ZENKER        ZHA           YEARS”
ZEOLITE       ZHIVAGOS      YELLER
ZEROS’        ZIANGSHING    YELLOWISH
ZEROED        ZILLIONS      YELLS
ZEROS         ZIMBABBWE’S   YIELD
ZESTY         ZINGA         YIP
ZEUS’S        ZION          YOGURT
ZHI           ZIONLIST      YORKER
ZHONGTIAN     ZOG           YOUNT
ZIGZAG        ZOIST         YOURSELFER
ZIGZAGGING    ZOO’S         YUPPISH
ZILLION       ZOOMED        ZACK
ZIONIST       ZUCKERMAN     ZAK’S
ZIP           ZULU          ZALES
ZIPPER        ZUICH         ZANTH
ZIPPY         ZWEIMAR       ZEALAND
ZOO           ZWICK’S       ZEROED
ZOOKEEPER     ZWINKELS      ZIRCONIUHS
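Singletons are the word types that occur exactly once in a corpus. A short sketch of how such a list might be extracted follows; the function name `singletons` and the toy input are assumptions, and the alphabetical sort is chosen only to resemble the listing above.

```python
from collections import Counter


def singletons(tokens):
    """Word types with a corpus count of exactly 1, sorted alphabetically."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)


# Toy input (an assumption for illustration, not corpus data).
sample = "ZOO ZIP THE THE ZEN OF OF".split()
once = singletons(sample)
print(once)
```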

13 Sub-language Example 2
• The Diabetes set includes 9 diabetes-related journals, totaling 4.5M tokens and 95K types.
• The Veterinary science set includes 11 journals, totaling 3.2M tokens and 87K types.
• All journals were extracted from PubMed in October 2010 and include everything available from those journals up to that date.
• This example was provided by Dana Movshovitz-Attias.
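As a rough worked comparison, the corpus sizes above can be turned into types per 1,000 tokens. This is only a sketch: `types_per_thousand` is a hypothetical helper and the inputs are the rounded figures quoted on the slide. Ratios from corpora of different sizes are not directly comparable, so the full type-token curves remain the better comparison.

```python
def types_per_thousand(n_types, n_tokens):
    """Distinct word types per 1,000 running tokens."""
    return 1000 * n_types / n_tokens


# Rounded figures from the slide: 95K types / 4.5M tokens, 87K types / 3.2M tokens.
diabetes = types_per_thousand(95_000, 4_500_000)
veterinary = types_per_thousand(87_000, 3_200_000)
print(round(diabetes, 1), round(veterinary, 1))
```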

14 Diabetes vs. Veterinary: Type-Token Curve

15 Diabetes vs. Veterinary: Type-Token Curve (log scale)

16 Head of Word Frequency List (counts per 1,000 tokens)

Diabetes    Count   Veterinary   Count
THE         42      THE          57
OF          35      OF           39
AND         31      AND          30
IN          29      IN           29
TO          16      TO           17
WITH        13      A            14
A           13      WERE         11
FOR         10      WAS          10
WAS         10      FOR          10
WERE         9      WITH          9
DIABETES     7      FROM          7
THAT         7      ?             6
BY           6      IS            6
?            6      AS            6
2            6      BY            6
AS           5      ON            5
INSULIN      5      AT            5
OR           5      1             4
GLUCOSE      5      BE            4
1            5      THIS          4

17 Tail of Word Frequency List: Count=1 (“Singletons”)

Diabetes               Veterinary
QUESTIONNAIRE-BASED    MOLARITIES
CAPACITY-CONSTRAINED   LIDOCAIN
DND                    MULTIORGAN
MICROGLIA-MEDIATED     ?
ENZYME-INHIBITOR       NALYSIS
ALVEOLUS-CAPILLARY     10702
KUZUYA                 BLUE-DNA
$6054                  HAIR-LOSS
SENTENCING             POPULATION-DYNAMICAL
PAPER-AND-PENCIL       STATE-TRANSITION

18 Zipf’s Law – Frequency vs. Rank (Brown Corpus)

19 Zipf’s Law – Frequency vs. Rank (Brown Corpus) (log scale)

20 Zipf’s Law – Frequency vs. Rank (Brown Corpus) (log scale) + theoretical Zipf distribution
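Zipf’s law says that the frequency of the word at rank r is roughly proportional to 1/r; Mandelbrot’s generalization has the form f(r) = C / (r + b)^a. The sketch below compares observed rank-frequency counts with a pure Zipf prediction. How the slides fitted their theoretical curve is not stated, so anchoring the prediction at the top-ranked count is an assumption, as are the names `rank_frequency` and `zipf_prediction` and the toy input.

```python
from collections import Counter


def rank_frequency(tokens):
    """Observed counts ordered by rank (most frequent first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_prediction(top_count, n_ranks, exponent=1.0):
    """Predicted count at each rank r: top_count / r**exponent."""
    return [top_count / (r ** exponent) for r in range(1, n_ranks + 1)]


# Toy input (an assumption for illustration, not the Brown Corpus).
sample = ["a"] * 6 + ["b"] * 3 + ["c"] * 2 + ["d"]
observed = rank_frequency(sample)
predicted = zipf_prediction(observed[0], len(observed))
for rank, (obs, pred) in enumerate(zip(observed, predicted), start=1):
    print(rank, obs, pred)
```

On log-log axes the prediction is a straight line with slope equal to minus the exponent, which is why the log-scale plots are the natural place to judge the fit.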

