Rebalancing corpora: Disentangling effects of unstratified sampling and multiple variables in corpus data. Sean Wallis, Survey of English Usage, University College London.

Presentation transcript:

Rebalancing corpora Disentangling effects of unstratified sampling and multiple variables in corpus data Sean Wallis Survey of English Usage University College London

Motivating questions
What is meant by the phrase ‘a balanced corpus’?
– How do sampling decisions made by corpus builders affect the type of research questions that may be asked of the data?
Examples: ICE-GB and DCPSE
– Should the data have been more sociolinguistically representative, by social class and region?
– Should texts have been stratified: sampled so that speakers of all categories of gender and age were (equally) represented in each genre?
Can we compensate for sampling problems in our data analysis?

ICE-GB
British Component of ICE
Corpus of speech and writing ( )
– 60% spoken, 40% written; 1 million words; orthographically transcribed speech, marked up, tagged and fully parsed
Sampling principles
– International sampling scheme, including broad range of spoken and written categories
– But:
    Adults who had completed secondary education
    ‘British corpus’ geographically limited
    – speakers mostly from London / SE UK (or sampled there)

DCPSE
Diachronic Corpus of Present-day Spoken English (late 1950s - early 1990s)
– 800,000 words (nominal)
– London-Lund component annotated as ICE-GB
    orthographically transcribed and fully parsed
Created from subsamples of LLC and ICE-GB
– Matching numbers of texts in text categories
– Not sampled over equal duration
    LLC ( ), ICE-GB ( )
– Text passages in LLC larger than ICE-GB
    LLC (5,000 words), ICE-GB (2,000 words)
But text passages may include subtexts
– telephone calls and newspaper articles are frequently short

DCPSE
Representative?
– Text categories of unequal size
– Broad range of text types sampled
– Not balanced by speaker demography

A balanced corpus?
Corpora are reusable experimental datasets
– Data collection (sampling) should avoid limiting future research goals
– Samples should be representative
What are they representative of?
Quantity vs. quality
– Large/lighter annotation vs. small/richer
– Are larger corpora more (easily) representative?
Problems for historical corpora
– Can we add samples to make the corpus more representative?

“Representativeness”
Do we mean representative...
– of the language? (“random sample”)
    A sample in the corpus is a genuine random sample of the type of text in the language
– of text types? (“broad”)
    Effort made to include examples of all types of language “text types” (including speech contexts)
– of speaker types? (“stratified”)
    Sampling decisions made to include equal numbers (by gender, age, geography, etc.) of participants in each text category
    Should subdivide data independently (stratification)

Stratified sampling
Ideal
– Corpus independently subdivided by each variable
– Equal subdivisions? Not required
Independent variables = constant probability in each subset
– e.g. proportion of words spoken by women not affected by text genre
– e.g. same ratio of women:men in age groups, etc.
What is the reality?
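
The "constant probability in each subset" criterion can be checked directly from word counts. The following is a minimal sketch in Python; the function names and all counts are hypothetical and are not taken from ICE-GB or DCPSE. It shows that subsets of unequal size can still be stratified, as long as the proportion is the same in each.

```python
def proportions(counts):
    """Share of words by women in each subset.

    counts maps a subset name to (female_words, male_words).
    """
    return {name: f / (f + m) for name, (f, m) in counts.items()}


def is_stratified(counts, tolerance=1e-9):
    """True if the female share is (near) constant across all subsets."""
    shares = list(proportions(counts).values())
    return max(shares) - min(shares) <= tolerance


# Hypothetical word counts for illustration only.
unequal_but_stratified = {"spoken": (300, 600), "written": (100, 200)}
unstratified = {"spoken": (300, 600), "written": (50, 250)}

print(proportions(unequal_but_stratified))
print(is_stratified(unequal_but_stratified), is_stratified(unstratified))
```

In practice a tolerance of zero is unrealistic for sampled data, which is why the slides that follow turn to a statistical test of homogeneity rather than an exact comparison.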

ICE-GB: gender / written-spoken
Proportion of words in each category spoken by women and men
– The authors of some texts are unspecified
– Some written material may be jointly authored
– female/male ratio varies slightly (φ = 0.02)
[Figure: proportion p of words by female and male speakers/writers in TOTAL, spoken and written]

ICE-GB: gender / spoken genres
Gender variation in spoken subcategories
[Figure: proportion p of words by female and male speakers in each spoken category]
TOTAL spoken
– dialogue
    private: direct conversations, telephone calls
    public: broadcast discussions, broadcast interviews, business transactions, classroom lessons, legal cross-examinations, parliamentary debates
– mixed: broadcast news
– monologue
    scripted: broadcast talks, non-broadcast speeches
    unscripted: demonstrations, legal presentations, spontaneous commentaries, unscripted speeches

ICE-GB: gender / written genres
Gender variation in written genres
[Figure: proportion p of words by female and male writers in each written category]
TOTAL written
– non-printed
    correspondence: business letters, social letters
    non-professional writing: student examination scripts, untimed student essays
– printed
    academic writing: humanities, natural sciences, social sciences, technology
    creative writing: novels/stories
    instructional writing: administrative/regulatory, skills/hobbies
    non-academic writing: humanities, natural sciences, social sciences, technology
    persuasive writing: press editorials
    reportage: press news reports

ICE-GB
Sampling was not stratified across variables
– Women contribute 1/3 of corpus words
– Some genres are all male (where specified)
    speech: spontaneous commentary, legal presentation
    academic writing: technology, natural sciences
    non-academic writing: technology, social science
– Is this representative?
– When we compare
    technology writing with creative writing
    academic writing with student essays
– are we also finding gender effects?
– Difficult to compensate for absent data in analysis!

Disentangling variables
When we compare
– technology writing with creative writing
are we also finding gender effects?
Rebalancing the corpus
– Subsample the corpus on stratified lines, or mathematically rescale corpus
    reduces the amount of data
    what do we do about missing data?
Rebalancing the dataset
Test contribution of interacting variables
– Evaluate each independent variable and their interaction in predicting the dependent variable (DV)
– cf. analysis of covariance (ANCOVA) but for categorical variables

Rebalancing corpora
Aim: equalise the ratios
– spoken:written (across m/f)
– male:female (across sp/w)
Drawback:
– throws away information
– problems with empty subsets
Methods:
– random subsampling
– rescaling
    counting instances as <1 item
[Figure: 2 × 2 grid of female (f) / male (m) by spoken (sp) / written (w)]
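
One possible reading of "rescaling, counting instances as <1 item" is sketched below in Python. The weighting scheme is an assumption made for illustration, not a method confirmed by the slides: each cell is weighted towards the size it would have if the two variables were independent (row total × column total / N), and the weights are then normalised so that no instance counts as more than 1. The function name and the counts are hypothetical.

```python
def rescale_weights(cells):
    """Per-cell weights (<= 1) that equalise ratios across a 2D grid.

    cells maps (row_label, col_label) to a word count. Every combination
    must be present and non-zero: an empty subset cannot be rescaled.
    """
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    n = sum(cells.values())
    row_tot = {r: sum(cells[r, c] for c in cols) for r in rows}
    col_tot = {c: sum(cells[r, c] for r in rows) for c in cols}

    raw = {}
    for (r, c), count in cells.items():
        if count == 0:
            raise ValueError(f"empty subset {(r, c)} cannot be rescaled")
        expected = row_tot[r] * col_tot[c] / n  # size under independence
        raw[r, c] = expected / count

    top = max(raw.values())  # normalise so the largest weight is exactly 1
    return {key: w / top for key, w in raw.items()}


# Hypothetical word counts: gender (f/m) by mode (sp/w).
cells = {("f", "sp"): 200, ("f", "w"): 100, ("m", "sp"): 400, ("m", "w"): 300}
weights = rescale_weights(cells)
effective = {key: weights[key] * cells[key] for key in cells}
print(weights)
print(effective)
```

After weighting, the spoken:written ratio is the same for both genders, at the cost of a smaller effective sample, which is the drawback the slide names. Random subsampling would achieve the same target sizes by discarding texts rather than down-weighting them.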

Rebalancing datasets
Attempting to obtain a balanced corpus is good practice in data-collection
– avoid zero speakers for each sociolinguistic combination
But different research questions are likely to obtain different ratios
– Tensed VP density in DCPSE (Bowie et al 2013)
[Figure: tensed VP density by text category (formal face-to-face, informal face-to-face, telephone, broadcast discussions, broadcast interviews, commentary, parliament, legal cross-examination, assorted spontaneous, prepared speech, Total), 1960s vs. 1990s]

Accounting for interaction
Another way of considering the problem
– We cannot be sure that we are seeing independent effects of two variables
– Or that the two variables are essentially the same
– In the worst case the two variables measure the same thing (e.g. m = sp, f = w)
[Diagram: variables A, B and C]

Testing for interaction
A statistical test checks whether ratios are constant (homogeneity)
– 2 × 2 chi-square: χ² = 0
– Cramér’s φ = √(χ² / kN) = 0
    k = diagonal − 1
Can we use χ² to see if an uneven distribution causes the variables to interact?
– Assume A, B and C are binary variables for simplicity
[Figure: 2 × 2 grid of female (f) / male (m) by spoken (sp) / written (w)]
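
The homogeneity check above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical counts. It takes k as the smaller table dimension minus 1, which is 1 for a 2 × 2 table. When the row ratios are constant, both χ² and φ come out as 0. When the two variables coincide completely, φ reaches 1.

```python
import math


def chi_square(table):
    """Pearson chi-square for a table of observed counts (list of rows).

    Assumes every row and column total is non-zero.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_phi(table):
    """Cramér's phi = sqrt(chi2 / (k * N)), with k = min(rows, cols) - 1."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square(table) / (k * n))


# Hypothetical counts: rows f/m, columns sp/w.
constant_ratio = [[20, 40], [10, 20]]   # same sp:w ratio for f and m
same_variable = [[30, 0], [0, 30]]      # worst case: f = sp, m = w

print(chi_square(constant_ratio), cramers_phi(constant_ratio))
print(chi_square(same_variable), cramers_phi(same_variable))
```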

Testing for interaction
We can use χ² to test
– A × C: χ² = 6.99, φ = 0.33
– B × C: χ² = 9.91, φ = 0.40
[Figure: 2 × 2 tables of values of A against values of C, and values of B against values of C]

Testing for interaction
We can use χ² to test
– A × C and B × C
Now use χ² to test
– A × B × C
Method
– Create a 3D table
    1 2D ‘layer’ for each value of C
– Define expected distribution e_abc = n_ab × n_c / N
– expected = no variation across C
– compensates for uneven sample
– Calculate χ² = Σ(o – e)² / e
    test has single degree of freedom
– χ² = 13.79, φ = 0.47
– BUT this tests A or B
    Subtract χ²(A) and χ²(B)
– result non-significant (or < 0) → no interaction
[Figure: 3D table with axes A, B and C, marking n_ab and n_c (uneven sample)]
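
The method steps above can be sketched as follows in Python. This is an illustration under stated assumptions, not the author's implementation. The slide's χ²(A) and χ²(B) are read here as the two-way tests of A × C and B × C. The function names and all counts are hypothetical, and every n_ab and n_c total is assumed to be non-zero.

```python
def chi_square_2d(table):
    """Pearson chi-square for a 2D table of observed counts (list of rows)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return sum(
        (o - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
        for i, row in enumerate(table)
        for j, o in enumerate(row)
    )


def chi_square_3d(obs):
    """Chi-square for obs[a][b][c] against e_abc = n_ab * n_c / N."""
    n_ab = [[sum(cell) for cell in row] for row in obs]
    n_c = [sum(layer) for layer in zip(*(cell for row in obs for cell in row))]
    n = sum(n_c)
    chi2 = 0.0
    for a, row in enumerate(obs):
        for b, cell in enumerate(row):
            for c, o in enumerate(cell):
                e = n_ab[a][b] * n_c[c] / n  # no variation across C
                chi2 += (o - e) ** 2 / e
    return chi2


def residual_interaction(obs):
    """Three-way chi-square minus the A x C and B x C chi-squares."""
    a_by_c = [[sum(col) for col in zip(*row)] for row in obs]
    b_by_c = [[sum(col) for col in zip(*cells)] for cells in zip(*obs)]
    return chi_square_3d(obs) - chi_square_2d(a_by_c) - chi_square_2d(b_by_c)


# Hypothetical counts indexed as obs[a][b][c].
# C does not vary with A or B: every layer has the same 50:50 split.
no_effect = [[[5, 5], [10, 10]], [[15, 15], [20, 20]]]
# C depends only on the combination of A and B (neither alone predicts it).
pure_interaction = [[[10, 0], [0, 10]], [[0, 10], [10, 0]]]

print(chi_square_3d(no_effect), residual_interaction(no_effect))
print(chi_square_3d(pure_interaction), residual_interaction(pure_interaction))
```

In the second dataset both two-way tests give χ² = 0, so the whole three-way score survives the subtraction. If A or B alone accounted for the pattern, the remainder would shrink towards zero or below.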

Conclusions
Ideal would be that:
– the corpus was “representative” in all 3 ways:
    a genuine random sample
    a broad range of text types
    a stratified sampling of speakers
– But these principles are unlikely to be compatible
    e.g. speaker age and utterance context
Some compensatory approaches may be employed at research (data analysis) stage
– what about absent or atypical data?
– what if we have few speakers/writers?

Conclusions
Data-collection is important
– Pay attention to stratification in selecting texts/speakers
    consider replacing texts in outlying categories
– Justify and document non-inclusion of stratum by evidence
    e.g. “there are no published articles attributable to authors of this age in this time period”
But a stratified corpus does not guarantee a stratified dataset
– need to disentangle effects of variables

Conclusions
Testing for interaction
– χ² can measure degree to which combination of A and B affects the choice of C
    use uneven sampling for expected distribution
– Cramér’s φ is derived from χ²
Analysis of covariance
– Subtracting χ² for A and B allows us to test if remaining interaction is significant
    a significant result means
    – the variables interact to obtain a new result
    no effect means
    – the variables may be dependent (measure the same thing)

References
Bowie, J., Wallis, S.A., and Aarts, B. 2013. Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Marín-Arrese, J.I., Carretero, M., Arús H.J. and van der Auwera, J. (eds.) English Modality, Berlin: De Gruyter, 57–94.

DCPSE: gender / genre
DCPSE has a simpler genre categorisation
– also divided by time
[Figure: proportion p of words by female and male speakers in TOTAL and in each genre: face-to-face conversations (formal, informal), telephone conversations, broadcast discussions, broadcast interviews, spontaneous commentary, parliamentary language, legal cross-examination, assorted spontaneous, prepared speech]

DCPSE: gender / time
DCPSE has a simpler genre categorisation
– also divided by time
[Figure: proportion p against time; note the gap]

DCPSE: genre / time
Proportion in each spoken genre, over time
– sampled by matching LLC and ICE-GB overall
    this is a ‘stratified sample’ (but only LLC:ICE-GB)
    uneven sampling over 5-year periods (within LLC)
[Figure: proportion p over time for informal face-to-face, formal face-to-face, spontaneous commentary, telephone conversations and prepared speech; ICE-GB marked as target for LLC]

DCPSE
LLC sampling not stratified
– Issue not considered, data collected over extended period
– Some data was surreptitiously recorded
DCPSE matched samples by ‘genre’
– Same text category sizes in ICE-GB and LLC
– But problems in LLC (and ICE) percolate
No stratification by speaker
– Result: difficult and sometimes impossible to separate out speaker-demographic effects from text category