1 The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition. Getting to know your variables Jane E. Miller, PhD

2 Overview Why “get to know” your variables? Unit of analysis Restrictions on your analytic sample Information about each variable – Level of measurement – Missing values – Valid range of values – Substantive interpretation of values – Distribution of observed values in your data set

3 Why is it important to get to know your variables? Each variable measures – A specific concept Numeric values have particular meanings that differ depending on the nature of that concept – In a particular context When, where, to whom do those numbers pertain? – Collected with a specific study design Need to understand why some values are missing – By design – Due to non-response

4 Example of failing to get to know variables In a nationally representative survey sample from a developing country circa 2002, observed birth weights in grams ranged up to 9999. – Data set downloaded from a research data web site; not cleaned or evaluated before use. – Mean birth weight over 8000 grams in the sample. First red flag: Implausible as an actual birth weight, given its meaning and units. 9,999 grams ≈ 22 lbs. – 9999 was a code for missing value. Lesson learned: Must become familiar with what a particular value means for that concept and context.

5 Second red flag 2/3 of sample had a birth weight value of 9999 – Very high value for a substantial share of the sample, unlikely to be explained solely by outliers or data entry errors. Lesson learned: Look at study documentation and questionnaire to find out why this distribution was observed. – Occurred due to a skip pattern designed to minimize recall bias in birth weight reporting.
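
The two red flags above can be reproduced with a small sketch. The birth weights below are made up for illustration; only the 9999 missing-value code comes from the example. It shows how an unrecoded missing-value code inflates the mean, and how recoding it before analysis restores a plausible value.

```python
import math
import statistics

MISSING_CODE = 9999  # missing-value code for birth weight, as in the example above

# Hypothetical birth weights in grams; two thirds carry the missing-value code.
birth_weights = [3200, 9999, 9999, 2950, 9999, 9999, 3400, 9999, 9999]

# Naive mean treats 9999 as if it were a real birth weight.
naive_mean = statistics.mean(birth_weights)

# Recode the missing-value code to NaN, then keep only the valid values.
recoded = [math.nan if w == MISSING_CODE else w for w in birth_weights]
valid = [w for w in recoded if not math.isnan(w)]

valid_mean = statistics.mean(valid)
share_missing = 1 - len(valid) / len(birth_weights)

print(f"naive mean: {naive_mean:.0f} g")
print(f"mean of valid values: {valid_mean:.0f} g")
print(f"share missing: {share_missing:.0%}")
```

A large share of cases sitting exactly at the maximum value is itself a cue to check the codebook for a missing-value code.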

6 Resources needed for this exercise Documentation on the data source – Description of study design – Questionnaire – Codebook for electronic data file Electronic file of database Statistical software Your research question Articles, books, web sites etc. on your topic – Dependent and key independent variables

7 Time needed for this exercise This is not an assignment that can be done overnight Multi-step process involving multiple resources – About data source – About topic Results from early steps will inform later steps in the exercise May also need feedback from – Mentors – Colleagues – Persons involved in original data collection

8 Getting to know your variables is project-specific Several issues that inform this assignment are specific to research question and data set. – Unit of analysis – Restrictions on your analytic sample – Roles of different variables in your analysis Dependent, independent, control, filter Even experienced researchers should complete this assignment when undertaking a project with a new topic or data set.

9 Valuable information for all parts of a well-written research paper Reading the literature on your topic will provide information needed for – Introduction – Literature review – Discussion Detailed knowledge of study design and variables from documentation, questionnaire and codebook will provide information for – A comprehensive data section – Appropriate model specification – Interpretation of statistical results

10 Unit of analysis Do data pertain to – Individual person? – Family? – Census tract? – Institution? Knowing unit of analysis helps ascertain plausible range of values, e.g., – mean number of family members will be much lower than population of a census tract or a school

11 Analytic sample Before you acquaint yourself with the range of values for each variable in your analysis, impose any limits related to your research question. Exclude cases – to whom the topic does not pertain. – that are part of a group with too few cases – for whom a key variable was not collected

12 Restrictions on your analytic sample Limit your analytic sample to cases that meet certain criteria, related to your research question. e.g., – particular demographic traits – minimum test scores – a specific disease Exclude subgroups that don’t meet minimum sample size if – there aren’t enough cases in one or more subgroups of a key variable to provide sufficient statistical power – it would not be theoretically sensible to combine them with other subgroups used in your analysis
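
A minimal sketch of imposing these restrictions, assuming a hypothetical data set of person records. The field names, the adult-only criterion, and the minimum subgroup size are illustrative choices, not part of the original material.

```python
from collections import Counter

# Hypothetical threshold; in practice it depends on statistical power needs.
MIN_GROUP_SIZE = 2

# Hypothetical records: each dict is one case.
cases = [
    {"id": 1, "age": 34, "region": "north"},
    {"id": 2, "age": 15, "region": "north"},
    {"id": 3, "age": 52, "region": "south"},
    {"id": 4, "age": 41, "region": "south"},
    {"id": 5, "age": 29, "region": "islands"},
    {"id": 6, "age": 47, "region": "north"},
]

# Step 1: keep only cases to whom the topic pertains (here, adults).
adults = [c for c in cases if c["age"] >= 18]

# Step 2: drop subgroups of a key variable that have too few cases.
group_sizes = Counter(c["region"] for c in adults)
analytic_sample = [c for c in adults if group_sizes[c["region"]] >= MIN_GROUP_SIZE]

print([c["id"] for c in analytic_sample])
```

Applying the restrictions first means every later check of ranges and distributions describes the sample that will actually be analyzed.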

13 Attributes of each variable to familiarize yourself with prior to analysis

14 Labeling, coding, and missing value information for your variables To help you create a comprehensive record of information on each of the variables in your analysis, fill out a grid like this one. – An electronic version can be found online in the “getting to know your variables” assignment.

| Variable name | Variable label (e.g. acronym on your data set) | Type of variable (nominal, ordinal, interval, or ratio) | Coding (for categorical variables) OR Units (for continuous variables) | Plausible range of values (excluding missing values) | Missing value codes (if any) | Skip pattern?* (e.g., conditions under which variable not collected) | Original or created variable?† |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Saw doctor last year | DOCLY | Nominal | 1 = yes; 2 = no | 1, 2 | 7 = refused; 8 = don’t know; 9 = missing | None for this variable | Original |
| Birth weight | BWGRMS | Ratio | Grams | 0–6000 | 9999 = missing | Asked only about children under age 5 years at time of survey. | Original |

15 “Old” and “new” variables Familiarize yourself with all variables to be used in your analysis – Variables analyzed in the same form in which they appeared in the original data set – Variables you created from those variables, e.g., Dummy variables created from categorical variables Categorical versions of continuous variables Aggregated variables, e.g., – income calculated from several sources – scales that combine responses to multiple items Calculated variables – E.g., body mass index calculated from weight and height Transformed variables (logged, standardized)
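
The kinds of created variables listed above can be sketched as follows. The function names, the income cutoffs, and the example person are hypothetical; the body mass index formula (weight in kilograms divided by height in meters squared) is the standard definition.

```python
import math


def body_mass_index(weight_kg, height_m):
    """Calculated variable: BMI from weight (kg) and height (m)."""
    return weight_kg / height_m ** 2


def income_category(income):
    """Categorical version of a continuous variable (cutoffs are hypothetical)."""
    if income < 20000:
        return "low"
    if income < 60000:
        return "middle"
    return "high"


def dummies(value, categories):
    """Dummy variables: one 0/1 indicator per category."""
    return {f"is_{c}": int(value == c) for c in categories}


# Hypothetical case with original variables only.
person = {"weight_kg": 70.0, "height_m": 1.75, "income": 45000}

# Created variables, stored next to the originals so both can be checked.
person["bmi"] = body_mass_index(person["weight_kg"], person["height_m"])
person["income_cat"] = income_category(person["income"])
person["log_income"] = math.log(person["income"])  # transformed variable
person.update(dummies(person["income_cat"], ["low", "middle", "high"]))

print(person)
```

Each created variable has its own units, categories, and plausible range, so it needs its own row in the grid rather than inheriting the row of the original variable.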

16 Organizing variables within the grid Using major row headings, label sections for each of the following, based on their role in your analysis – Dependent variable(s) – Key independent variables – Control variables – Sampling weights – Filter questions (e.g., used to restrict sample)

17 Variable names and labels For each variable, fill in – Variable name: a short (up to ~8 character) acronym used to identify the variable in the software program you are using – Variable label: a descriptive phrase of up to 40 characters that helps convey the meaning of the variable If you rename an item with a more informative variable name (e.g., “gender” instead of Q117), include the original question name in the variable label

18 Level of measurement Categorical variables are those that are classified into ranges or categories. Continuous variables – Measured in numeric units, but not grouped. – Two types of continuous variables: Interval – Zero is not lowest possible value – e.g., temperature °Fahrenheit Ratio – Zero is lowest possible value – e.g., temperature °Kelvin, height, weight Helps to anticipate limits on range of values

19 Categorical variables Nominal variables No inherent order to the categories Numeric value labels have NO mathematical interpretation Examples: – Gender – Race – Geographic region Ordinal variables Categories have an inherent numeric order Examples: – Letter grades – Age group – Likert scale items E.g., from strongly disagree to strongly agree

20 Units of measurement System of measurement – E.g., Metric or British or other? income in dollars or euros or pesos? Level of aggregation – E.g., income per hour or per week or per year? Scale – E.g., income in dollars or thousands of dollars or millions of dollars? See also podcast on “reporting one number”
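
The three aspects of units above (system, level of aggregation, scale) each correspond to a simple conversion. The sketch below uses hypothetical helper names and assumes a 52-week year; the grams-per-pound factor is the exact definition of the international pound.

```python
GRAMS_PER_POUND = 453.59237  # exact, by definition of the international pound
WEEKS_PER_YEAR = 52          # simplifying assumption for illustration


def pounds_to_grams(pounds):
    """System of measurement: British units to metric."""
    return pounds * GRAMS_PER_POUND


def weekly_to_annual(weekly_amount):
    """Level of aggregation: per week to per year."""
    return weekly_amount * WEEKS_PER_YEAR


def to_thousands(amount):
    """Scale: single units to thousands of units."""
    return amount / 1000


annual_income = weekly_to_annual(800)   # dollars per year
print(pounds_to_grams(22))              # grams
print(annual_income, to_thousands(annual_income))
```

Recording which of these conversions were applied, and to which variables, makes later discrepancies with the codebook or the literature much easier to trace.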

21 Missing values Missing values on a variable can occur because they are – Not applicable – Missing by design – Non-response See chapters 4 and 13 of The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition for more on missing values and missing by design.

22 Not applicable Some questions are not asked of specific respondents because they don’t pertain. – E.g., if someone reports that they are unemployed, they wouldn’t be asked about their current job type or earnings. Look – At the questionnaire or form used to collect the data for a filter question or a skip pattern – At the codebook for one or more missing value codes for non-response

23 Missing by design Some topics are not asked of specific subgroups due to concern about the accuracy of their responses. – E.g., to minimize recall bias, mothers asked birth weight only for children under age 10 years. Surveys sometimes administer specialized topic modules only to a randomly selected subsample of respondents. – Used to obtain a smaller representative sample that meets statistical power requirements while reducing study costs. Read the study design documentation to find out whether these pertain to the question you are using.

24 Item non-response Another reason for missing values is when a respondent does not answer a question that was asked of them. Item non-response is particularly common for – stigmatized topics – questions that require complex or detailed answers – unclear instructions about number of allowed responses Examples: Respondent was asked – to report income, but didn’t know it – immigration status, but had concerns about deportation

25 Types of non-response Don’t know Refused to answer the question Didn’t answer the question (unspecified reason) Other – marked too many answers to a single response question – wrote an illegible answer Look up the pertinent missing value codes for each of your variables in the codebook for your data set.
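
Once the codes have been looked up, tabulating missing values by reason is straightforward. This sketch reuses the codes shown for the “saw doctor last year” item in the grid on slide 14 (1 = yes, 2 = no, 7 = refused, 8 = don’t know, 9 = missing); the responses themselves are invented for illustration.

```python
from collections import Counter

# Codes taken from the grid example for DOCLY; other variables will differ.
VALID_LABELS = {1: "yes", 2: "no"}
MISSING_LABELS = {7: "refused", 8: "don't know", 9: "missing"}

# Hypothetical responses for ten cases.
responses = [1, 2, 2, 7, 1, 8, 9, 2, 1, 7]


def tabulate(values):
    """Count valid answers and each reason for a missing value."""
    counts = Counter()
    for value in values:
        if value in VALID_LABELS:
            counts["valid"] += 1
        elif value in MISSING_LABELS:
            counts[MISSING_LABELS[value]] += 1
        else:
            # A code that is not in the codebook deserves a closer look.
            counts["undocumented code"] += 1
    return counts


print(tabulate(responses))
```

Keeping a separate bucket for undocumented codes surfaces values that match neither a valid category nor a known missing-value code.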

26 Valid range of values Definitional limits Conceptually plausible range Context of measurement Observed range Watch for numeric values for missing values – Label them in your electronic database, so they are treated correctly during analysis.

27 Definitional limits on values Some variables by definition have limits on the range of values they can assume: – A percentage share of a whole must fall between 0 and 100 Likewise, a proportion must fall between 0 and 1 – However, a percentage change can be Negative (<0) Greater than 100 – Variables at the ratio level of measurement cannot take on negative values – Other topic- or field-specific variables also have such restrictions E.g., a Gini coefficient must fall between 0 and 1
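
The definitional limits above can be encoded as simple range checks. The dictionary keys and function names below are hypothetical; the limits themselves are the ones stated on the slide. The second function shows why a percentage change is not subject to the 0 to 100 limit.

```python
# Definitional limits stated above, as (lowest, highest) allowed values.
LIMITS = {
    "percentage_share": (0, 100),
    "proportion": (0, 1),
    "gini_coefficient": (0, 1),
}


def violates_limits(kind, value):
    """Return True if the value falls outside the definitional limits."""
    low, high = LIMITS[kind]
    return not (low <= value <= high)


def percentage_change(old, new):
    """Percentage change from old to new; may be negative or exceed 100."""
    return (new - old) / old * 100


print(violates_limits("proportion", 1.2))
print(violates_limits("percentage_share", 45))
print(percentage_change(50, 150))
print(percentage_change(100, 80))
```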

28 Plausible range of values for the concept being measured A value of 10,000 Makes sense in at least some contexts for – Annual family income in dollars – Population of a census tract – An annual death rate per 100,000 persons Does NOT make sense for – Hourly income in dollars – Birth weight in grams – Number of persons in a family – A Likert scale item – A proportion – An annual death rate per 1,000 persons

29 Another example of plausible range of values A value of –1 Makes sense in at least some contexts for – Temperature in degrees Fahrenheit or Celsius – Change in rating on a 5 point scale – Change in death rate – Percentage change in income Does NOT make sense for – Temperature in degrees Kelvin – Number of persons in a family – Death rate – A Likert scale item – A proportion

30 Context of measurement When, where, who, e.g., family income will be – Higher now than it was 200 years ago in a given place and group – Higher in a currently developed than developing country – Higher in a sample of all households than in a sample of low-income households Remember to take into account restrictions you imposed on the analytic sample to fit your research question and data.

31 Descriptive statistics on your variables After you have – Imposed restrictions on your analytic sample – Filled in missing value codes for each variable Complete a grid like the one below, with descriptive statistics on each of the variables in your analysis. An electronic version of this grid can be found online in the “getting to know your variables” assignment.

| Variable name | Variable label (e.g. acronym on your data set) | Number of valid cases for that variable (excl. missing values) | Observed minimum | Observed maximum | Observed mean (for continuous variables) | Observed modal value | Reference value from external source | Values & range consistent w/ codebook? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
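
One row of such a grid could be computed along these lines. This is a sketch with invented data and a hypothetical `describe` helper; in practice the same numbers come from the descriptive statistics procedures of whatever statistical software you use.

```python
import statistics


def describe(values, missing_codes):
    """Compute the observed-value columns of the grid for one variable."""
    valid = [v for v in values if v not in missing_codes]
    return {
        "n_valid": len(valid),
        "minimum": min(valid),
        "maximum": max(valid),
        "mean": statistics.mean(valid),   # meaningful for continuous variables
        "mode": statistics.mode(valid),
    }


# Hypothetical birth weights in grams, with 9999 as the missing-value code.
birth_weights = [3200, 2950, 9999, 3400, 3200, 9999]
row = describe(birth_weights, missing_codes={9999})
print(row)
```

The reference value and the consistency check against the codebook are filled in by hand from the literature and documentation.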

32 Familiarizing yourself with the concepts under study To identify plausible ranges of values for each of your variables, read the literature on your dependent and key independent variables. Read for – how each concept is operationalized in the data set – standards, cutoffs, or transformations commonly used for that variable in your field – range of values observed but pay attention to differences in who, when, where studied

33 Check each distribution against the codebook for the original source Codebooks for some data sets provide information on – frequency distribution of categorical variables – range and/or mean values for continuous variables – number of cases with missing values, by reason for missing value (not applicable, refused, etc.) Check the distribution of values observed in your analytic sample for each variable against the codebook for your data set. If any distributions are inconsistent, do NOT analyze the data until you have resolved the discrepancies!
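
A range comparison against the codebook can be automated as a first pass. The function name and message wording below are hypothetical; the 0 to 6000 gram range is the plausible range given for birth weight in the grid on slide 14.

```python
def check_against_codebook(observed_min, observed_max, codebook_min, codebook_max):
    """Return a list of discrepancies between observed and documented range."""
    problems = []
    if observed_min < codebook_min:
        problems.append(
            f"observed minimum {observed_min} is below codebook minimum {codebook_min}"
        )
    if observed_max > codebook_max:
        problems.append(
            f"observed maximum {observed_max} is above codebook maximum {codebook_max}"
        )
    return problems


# Birth weight in grams: an unrecoded 9999 shows up as an out-of-range maximum.
print(check_against_codebook(2950, 9999, 0, 6000))
# After recoding the missing-value code, the range is consistent.
print(check_against_codebook(2950, 3400, 0, 6000))
```

An empty list only shows that the range is consistent; frequency distributions and counts of missing values by reason still need to be compared separately.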

34 Identify reasons for inconsistencies Review your answers to the previous steps in this exercise to identify possible reasons for discrepancies between your statistics and the codebook, such as: – Units of analysis, e.g., family instead of individual – Restrictions on your analytic sample, e.g., excluding a subgroup that is included in the statistics shown in the overall codebook – Scale, e.g., grams instead of kilograms – Transformations you have made to the variables, e.g., logged values or multiples of standard deviations rather than original units

35 Check each distribution against the literature on similar variables Track down information in the published literature on each of your main variables for a similar population. Check the distribution of values for each variable in your data set against the values from the external source of information about that variable. Again, if the values of your data are substantially different from those used in other studies of the same concepts, do NOT analyze the data until you have resolved the discrepancies!

36 Identify reasons for inconsistencies Review your answers to the previous steps in this assignment to explain possible reasons for discrepancies between your data and other similar data sets, such as: – Population studied, e.g., substantially different time, place, and/or subgroup – Units of analysis, e.g., family instead of individual – Units of measurement, e.g., metric instead of British units – Scale, e.g., grams instead of kilograms – Transformations of the variables, e.g., percentiles instead of original value

37 Summary Before you conduct your analyses, it is critical that you and other members of your research team become familiar with the following for each variable – Levels of measurement – Units and categories – Plausible and observed ranges of values – Missing values and their reasons Compare observed values against – Documentation for the data set you use in your analysis – The published literature on your topic

38 Summary, continued These attributes are essential information for – Data preparation Inclusion criteria for your analytic sample Creation of new variables – Choice of pertinent descriptive and multivariate statistics – Design of correct charts and tables – Writing correct prose descriptions for the data and methods and results sections of your paper. Even experienced researchers should complete this assignment when they undertake a project with a new topic or data set.

39 Reasons for getting to know your variables, redux Exercises in this podcast are time-consuming but very valuable for generating in-depth knowledge needed for your paper Reading the literature on your topic will yield information needed for the introduction, literature review and discussion sections. Detailed knowledge of study design and variables from documentation, questionnaire and codebook will yield information for the data and methods and results sections.

40 Resources on your topic Articles, books, reports, or web sites related to the main independent and dependent variables in your data – Definitions of concepts under study – Operationalization (how those concepts are actually measured in a particular data set) – Observed distributions in populations similar to those from which your sample is drawn – Commonly used transformations of those variables prior to analysis

41 Resources on your data set Documentation on study design – Context (who, when, where) – Sampling – Unit of analysis Questionnaire or other data collection instrument – Modules – Wording of questions – Skip patterns Codebook – Levels of measurement, categories, units – Missing value codes – Distribution of observed values for the study sample

42 Suggested readings Miller, J. E. 2013. The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition. – chapter 4 on levels of measurement, units, standards and cutoffs – chapters 7 and 10 on choice of contrasts to suit the variable – chapter 13 on data and methods – chapters 4 and 13 on missing values and missing by design Chambliss, Daniel F., and Russell K. Schutt. 2012. Making Sense of the Social World: Methods of Investigation, 4th Edition. Thousand Oaks, CA: Sage Publications, or other research methods book for information on – study design, conceptualization, and measurement

43 Suggested online resources Podcasts on – Reporting one number (re: units) – Comparing two numbers or series of numbers (re: levels of measurement) – Defining the Goldilocks problem

44 Suggested practice exercises Study guide to The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition. – Problem sets for chapter 4, questions #6 and 13; chapter 10, questions #1 and 3 – Suggested course extensions for chapter 4: “Reviewing” exercise #1; “Estimating statistics and writing” exercises #1 and 2 – Suggested course extensions for chapter 10: “Reviewing” exercise #1; “Estimating statistics and writing” exercises #1 and 2; “Revising” exercises #1, 2 and 3

45 Contact information Jane E. Miller, PhD jmiller@ifh.rutgers.edu Online materials available at http://press.uchicago.edu/books/miller/multivariate/index.html

