Carnegie Mellon Words  What constitutes a word? Does it matter?  Word tokens vs. word types; type-token curves  Zipf’s law, Mandelbrot’s law; explanation.


Carnegie Mellon Words
 What constitutes a word? Does it matter?
 Word tokens vs. word types; type-token curves
 Zipf’s law, Mandelbrot’s law; explanation
 Heterogeneity of language: written vs. spoken; period, genre, register, domain, topic (hierarchy), speaker, audience
 The “uncertainty principle of language modeling”
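For reference, the standard formulations of the two laws named above (a minimal sketch; r is a word’s frequency rank, f(r) its frequency, and C, B, s are corpus-dependent constants):

    Zipf’s law:        f(r) = C / r
    Mandelbrot’s law:  f(r) = C / (r + B)^s

Mandelbrot’s generalization flattens the very head of the curve (via the shift B) and lets the tail decay with an exponent other than 1, which fits empirical rank-frequency data more closely.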

Carnegie Mellon Sub-language Example 1
 “Wall Street Journal” corpus (WSJ): newspaper articles; written English; rich vocabulary (leaning towards finance)
 “Switchboard” corpus (SWB): transcribed spoken conversations over the telephone; prescribed topic (one of 70); 1990s
 “Broadcast News” corpus (BN): transcribed TV/radio news programs; spoken, but somewhat scripted

Carnegie Mellon Unigram Type-Token Curve – BN vs. SWB

Carnegie Mellon Unigram Type-Token Curve – BN vs. SWB (log scale)

Carnegie Mellon Unigram Type-Token Curve – BN vs. SWB vs. WSJ

Carnegie Mellon Unigram Type-Token Curve – BN vs. SWB vs. WSJ (log scale)

Carnegie Mellon Bigram Type-Token Curve – BN vs. SWB

Carnegie Mellon Bigram Type-Token Curve – BN vs. SWB (log scale)

Carnegie Mellon Trigram Type-Token Curve – BN vs. SWB

Carnegie Mellon Trigram Type-Token Curve – BN vs. SWB (log scale)
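Curves like the ones above are produced by scanning a corpus once and recording, after every batch of tokens, how many distinct n-grams have been seen so far. A minimal sketch in Python (whitespace tokenization and the corpus file names are assumptions, not part of the slides):

    def type_token_curve(tokens, n=1, step=10000):
        """Record (n-gram tokens seen, distinct n-gram types seen) every `step` tokens."""
        seen = set()
        curve = []
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
            if (i + 1) % step == 0:
                curve.append((i + 1, len(seen)))
        return curve

    # Hypothetical usage, comparing two corpora:
    # bn_curve = type_token_curve(open("bn.txt").read().split(), n=2)
    # swb_curve = type_token_curve(open("swb.txt").read().split(), n=2)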

Carnegie Mellon Head of Word Frequency List (counts per 1,000 tokens)

    WSJ           BN            SWB
    THE      49   THE      62   I        38
    TO       24   TO       27   AND      34
    OF       24   AND      25   THE      28
    A        22   A         –   YOU      26
    AND      19   OF       21   UH       26
    IN       19   IN       17   A        24
    THAT      9   THAT     16   TO       23
    FOR       9   IS       13   THAT     20
    IS        8   YOU      12   IT       17
    ONE       7   I        12   OF       17
    ON        6   IT       10   KNOW     16
    POINT     5   FOR       8   YEAH     14
    AS        5   THIS      8   IN       12
    SAID      5   ON        7   +NOISE+  12
    WITH      5   HAVE      6   THEY     10
    IT        5   ARE       6   UH-HUH   10
    FIVE      5   WE        6   HAVE     10
    TWO       5   THEY      6   BUT       9
    DOLLARS   5   BE        6   SO        8
    AT        5   WITH      6   IT’S      8
    MR.       5   BUT       5   IS        8
    BY        5   WAS       5   WE        8
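A table like this is just raw unigram counts scaled to a common corpus size. A sketch of the computation (again assuming a pre-tokenized corpus; the per-1,000 scaling matches the table heading):

    from collections import Counter

    def head_per_1000(tokens, k=22):
        """Top-k words, with counts normalized to occurrences per 1,000 tokens."""
        counts = Counter(tokens)
        total = len(tokens)
        return [(word, round(1000 * c / total)) for word, c in counts.most_common(k)]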

Carnegie Mellon Tail of Word Frequency List: Count=1 (“Singletons”)

    WSJ          BN            SWB
    ZEN          ZEROS         YEARBOOK
    ZENKER       ZHA           YEARS’
    ZEOLITE      ZHIVAGOS      YELLER
    ZEROS’       ZIANGSHING    YELLOWISH
    ZEROED       ZILLIONS      YELLS
    ZEROS        ZIMBABBWE’S   YIELD
    ZESTY        ZINGA         YIP
    ZEUS’S       ZION          YOGURT
    ZHI          ZIONLIST      YORKER
    ZHONGTIAN    ZOG           YOUNT
    ZIGZAG       ZOIST         YOURSELFER
    ZIGZAGGING   ZOO’S         YUPPISH
    ZILLION      ZOOMED        ZACK
    ZIONIST      ZUCKERMAN     ZAK’S
    ZIP          ZULU          ZALES
    ZIPPER       ZUICH         ZANTH
    ZIPPY        ZWEIMAR       ZEALAND
    ZOO          ZWICK’S       ZEROED
    ZOOKEEPER    ZWINKELS      ZIRCONIUHS
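The singletons (“hapax legomena”) at the tail of the list fall out of the same counts:

    from collections import Counter

    def singletons(tokens):
        """All words that occur exactly once, in alphabetical order."""
        counts = Counter(tokens)
        return sorted(word for word, c in counts.items() if c == 1)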

Carnegie Mellon Sub-language Example 2
 The Diabetes set comprises 9 diabetes-related journals, with a total of 4.5M tokens and 95K types.
 The Veterinary Science set comprises 11 journals, with 3.2M tokens and 87K types.
 All journals were extracted from PubMed in October 2010 and include everything those journals had made available up to that point.
 This example was provided by Dana Movshovitz-Attias.

Carnegie Mellon Diabetes vs. Veterinary: Type-Token Curve

Carnegie Mellon Diabetes vs. Veterinary: Type-Token Curve (log scale)

Carnegie Mellon Head of Word Frequency List (counts per 1,000 tokens)

    Diabetes    count   Veterinary   count
    THE         42      THE          57
    OF          35      OF           39
    AND         31      AND          30
    IN          29      IN           29
    TO          16      TO           17
    WITH        13      A            14
    A           13      WERE         11
    FOR         10      WAS          10
    WAS         10      FOR          10
    WERE         9      WITH          9
    DIABETES     7      FROM          7
    THAT         7      –             6
    BY           6      IS            6
    –            6      AS            6
    2            6      BY            6
    AS           5      ON            5
    INSULIN      5      AT            5
    OR           5      1             4
    GLUCOSE      5      BE            4
    1            5      THIS          4

Carnegie Mellon Tail of Word Frequency List: Count=1 (“Singletons”)

    Diabetes               Veterinary
    QUESTIONNAIRE-BASED    MOLARITIES
    CAPACITY-CONSTRAINED   LIDOCAIN
    DND                    MULTIORGAN
    MICROGLIA-MEDIATED     –
    ENZYME-INHIBITOR       NALYSIS
    ALVEOLUS-CAPILLARY     10702
    KUZUYA                 BLUE-DNA
    $6054                  HAIR-LOSS
    SENTENCING             POPULATION-DYNAMICAL
    PAPER-AND-PENCIL       STATE-TRANSITION

Carnegie Mellon Zipf’s Law – Frequency vs. Rank (Brown Corpus)

Carnegie Mellon Zipf’s Law – Frequency vs. Rank (Brown Corpus) (log scale)

Carnegie Mellon Zipf’s Law – Frequency vs. Rank (Brown Corpus) (log scale) + theoretical Zipf distribution
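A rank-frequency plot like the ones above, with a theoretical Zipf reference line, can be generated as follows (a sketch: matplotlib is assumed, and loading the Brown corpus through NLTK is likewise an assumption, not part of the slides):

    import matplotlib.pyplot as plt
    from collections import Counter

    def zipf_plot(tokens):
        """Log-log plot of word frequency vs. rank, with a C/r reference line."""
        freqs = sorted(Counter(tokens).values(), reverse=True)
        ranks = range(1, len(freqs) + 1)
        plt.loglog(ranks, freqs, label="observed")
        # Theoretical Zipf curve f(r) = C/r, anchored at the most frequent word
        plt.loglog(ranks, [freqs[0] / r for r in ranks], "--", label="Zipf: C/r")
        plt.xlabel("rank")
        plt.ylabel("frequency")
        plt.legend()
        plt.show()

    # Hypothetical usage (requires nltk and its 'brown' corpus download):
    # from nltk.corpus import brown
    # zipf_plot([w.lower() for w in brown.words()])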