StatLab Workshop Fall 2011 Brian Fried and Jeremy Green.

Outline of Workshop
Part I: Causation and Statistics. What is Causation? Correlation? Why Statistics? Threats to Inference.
Part II: Gathering and Using Data. Gathering Data; Managing Data.
Part III: Writing with Statistics. A General Outline, with an example.

Causation vs. Correlation Causation… …correlation

Why Statistics Probabilistic Relationships (see previous graph) Multivariate Relationships We can analyze the relationships between multiple variables at the same time. (e.g. education, age, gender, income …. -> voting) What is a regression?

When not to use statistics Insufficient observations. Observations are not easily comparable. …. Other methods may still be appropriate (historical, interpretative, qualitative, constructivist to name a few in the social sciences).

Threats to Inference Endogeneity (vs exogeneity of errors) Autocorrelation (time series) Homo/Heteroskedasticity Internal vs. external validity Probably the most important step in research design; advanced techniques can often compensate.

Part II: Data Think about analyses early! (Ideal vs. Possible) What's Possible? What's Convincing? Experimental Ideal Practical Data Limitations Collecting Your Own Data Using Other Data Some data sources: StatLab Webpage (http://statlab.stat.yale.edu) Advisors/Professional Contacts Yale StatCat (http://ssrs.yale.edu/statcat/) ICPSR (http://www.icpsr.umich.edu) Reference Librarian (Julie Linden) Public sources (generally government or academic)

Data Types and Uses Dependent Variable (response, outcome, criterion) Independent Variables (explanatory or predictor variables) Control / Confounding Variables Categorical and Continuous Variables Individual vs. Aggregate Data Remember: The types of variables we choose determine the statistics we use. Qualitative knowledge always helps!

Once You've Found or Collected Your Data Download the data and documentation StatTransfer (StatLab) Determine data file type Often a text file (.txt, .dat, .raw) Other formats (.csv, .sas, .xls) Converting text & delimited files Choose a statistical software program SPSS is a good place to start for basic analysis
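As a rough illustration of "converting text & delimited files", here is a minimal sketch in Python using only the standard library. The tab-delimited contents and the column names are hypothetical stand-ins for a downloaded .txt file; a tool such as StatTransfer or your statistical package's import dialog does the same job.

```python
import csv
import io

def convert_delimited(src, dst, in_delim="\t", out_delim=","):
    """Copy rows from one delimited text stream to another; return the row count."""
    reader = csv.reader(src, delimiter=in_delim)
    writer = csv.writer(dst, delimiter=out_delim, lineterminator="\n")
    n_rows = 0
    for row in reader:
        writer.writerow(row)
        n_rows += 1
    return n_rows

# Hypothetical tab-delimited data standing in for a downloaded .txt file.
raw_text = "municipality\tcoverage\nA\t0.95\nB\t1.10\n"
source = io.StringIO(raw_text)
target = io.StringIO()

rows_written = convert_delimited(source, target)
csv_text = target.getvalue()
print(csv_text)
```

With real files, the two `io.StringIO` objects would be replaced by files opened with `open(..., newline="")`.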

Managing your data Back up all Master Data Files Codebook Merging Data Adding variables, cases, computing new variables Keep a roadmap Keep a log of all analyses with what you have done Save syntax files
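The merging, computed-variable, and logging steps above can be sketched as a short script. This is only an illustration under assumed inputs: both small files, the `muni_id` key, and the column names are hypothetical, and the master data are never modified, only read.

```python
import csv
import io

# Hypothetical master files: recipients and poverty estimates per municipality.
recipients_csv = "muni_id,recipients\n1,950\n2,1100\n3,400\n"
poverty_csv = "muni_id,est_poor\n1,1000\n2,1000\n"

def read_keyed(text, key):
    """Read CSV text into a dict of rows indexed by the given key column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}

recipients = read_keyed(recipients_csv, "muni_id")
poverty = read_keyed(poverty_csv, "muni_id")

merged = []  # the working (merged) data set
log = []     # a running record of what was done and why

for muni_id, rec in recipients.items():
    pov = poverty.get(muni_id)
    if pov is None:
        # Record unmatched cases instead of dropping them silently.
        log.append(f"muni_id {muni_id}: no poverty estimate, dropped from merge")
        continue
    row = {**rec, **pov}
    # Compute a new variable from two existing ones.
    row["coverage"] = int(row["recipients"]) / int(row["est_poor"])
    merged.append(row)

log.append(f"merged {len(merged)} of {len(recipients)} municipalities")
print(merged)
print(log)
```

Saving a script like this alongside the data serves as both the roadmap and the syntax file: rerunning it reproduces the merged data set from the untouched master files.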

Syntax Files What are they? Text files used to enter commands in bulk Why? You will make mistakes, need to make changes Many statistical programs let you use pull-down menus How do I know what to write? Program manuals (help screens, online documentation, hard copy books) provide the underlying commands

Part III: Writing Introduction Theory (Lit Review) Data Description Analysis/Results Conclusion

Introduction Question What is the question you want to answer? Why should we care? Audience Hypothesis Succinctly state your claim Context & Summary

Motivation Broad Question: Are politics becoming more programmatic in Brazil? Testable hypothesis/question: Is Bolsa Familia, a conditional cash transfer (CCT) program that benefits a quarter of Brazil's population, programmatic? An Illustrative Example: Bolsa Familia

Programa Bolsa Família: key facts
Conditional cash transfer (CCT) program, launched in October 2003.
This was not the first CCT program in Brazil; some existing programs (like Bolsa Escola) were incorporated into Bolsa Familia.
Benefits families with per capita income below US\$78.
12 million poor families (almost 50 million people) currently receive support across all 5,564 Brazilian municipalities.
Size of stipend: between US\$13 and US\$114, depending on the family's size and poverty level.
Average amount: US\$54 per family.
2009 budget: US\$10.5 billion (0.4% of Brazil's GDP).

Theory/Lit. Review What does existing theory say? What do you believe? Position yourself within theoretical debates. Identify Testable Hypotheses Choose Method Best Suited to Testing Your Hypothesis Do you need statistics after all? Quantitative vs. Qualitative research

Research Question Do political criteria explain the variation in Bolsa Familia's coverage across municipalities? There are theoretical (Cox and McCubbins 1986, Dixit and Londregan 1996, Lindbeck and Weibull 1987) and empirical (Ames 1987, Levitt and Snyder 1995, Schady 2000, Dahlberg and Johansson 2002, Stokes 2004, Kitschelt 2010) reasons to believe that political spending is often targeted, especially given Brazil's history with clientelism and pork.

How do politicians target? Core; Swing; Mobilization

Descriptive Statistics Variables Dependent Variable(s) Explanatory (Independent) Variable(s) Control (Independent) Variable(s) Graphs Summary Statistics on Key Variables Number, Mean, Minimum, Maximum, Standard Deviation Cross-Tabs
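The summary statistics and cross-tabs listed above can be produced in any statistical package. As a minimal sketch in Python's standard library, with entirely hypothetical values (`None` marks a missing observation):

```python
import statistics
from collections import Counter

def describe(values):
    """Summary statistics for one variable; None marks a missing observation."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "mean": statistics.mean(present),
        "sd": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "max": max(present),
        "missing": len(values) - len(present),
    }

# Hypothetical coverage ratios for five municipalities, one of them missing.
coverage = [0.8, 1.0, None, 1.2, 1.0]
summary = describe(coverage)
print(summary)

# A cross-tab of two categorical variables is a count of each combination.
# Both variables below are hypothetical.
mayor_party = ["PT", "Other", "PT", "Other", "Other"]
close_election = [True, False, False, False, True]
crosstab = Counter(zip(mayor_party, close_election))
print(crosstab)
```

Reporting the number of missing observations next to each variable makes it easier for readers to see how many cases actually enter the later analysis.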

Descriptive Statistics

| Variable | Role | Mean | Std. Dev. | Min | Max | Missing |
|---|---|---|---|---|---|---|
| Coverage in 2009 | Dependent | 0.976 | 0.229 | 0.018 | 6.276 | 12 |
| PT Vote Share for Deputado Federal | Explanatory | 0.060 | 0.048 | 0.000 | 0.326 | 345 |
| PT Vote Share for President | Explanatory | 0.470 | 0.107 | 0.110 | 0.826 | 18 |

Descriptive Figures

Descriptive Statistics (continued)

| Variable | Role | Mean | Std. Dev. | Min | Max | Missing |
|---|---|---|---|---|---|---|
| PT Mayor in 2008 | Explanatory | 0.098 | 0.297 | 0 | 1 | 0 |
| Base Mayor in 2008 | Explanatory | 0.609 | 0.488 | 0 | 1 | 0 |
| Change in Support for PT Presidential Candidate | Explanatory | 0.055 | 0.080 | 0 | 0.603 | 18 |
| Close Presidential Election in 2006 | Explanatory | 0.190 | 0.392 | 0 | 1 | 0 |

Key Variables
Coverage in 2009: This continuous variable is the ratio of recipients to the number estimated to be poor in each municipality in November 2009.
PT Vote Share for Deputado Federal: This continuous variable captures a core targeting strategy and measures the average PT vote share for federal deputy across the 2002 and 2006 elections.
PT Vote Share for President: This continuous variable captures a core targeting strategy and measures the average PT vote share for president across the 2002 and 2006 elections.

So, how do I analyze my data? Correlational design Correlation allows you to quantify relationships between variables (r, r-squared) Correlation, partial correlation Regression allows you to predict scores on one variable from subjects' scores on one or more other variables Group differences t-test & ANOVA Chi-square for categorical and frequency data Significance vs. effect size Simulations
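To make r and r-squared concrete, here is a minimal sketch that computes the Pearson correlation by hand in Python. The two variables and their values are hypothetical; statistical packages report the same quantities directly.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(x, y)
r_squared = r ** 2  # share of variance in y associated with x
print(round(r, 3), round(r_squared, 3))
```

Here r is about 0.775 and r-squared is 0.6. The correlation measures how strongly the two variables move together, which is a separate question from whether the relationship is statistically significant.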

We discussed this in Part I, but one generally devotes a section, placed before the results section, to explaining how one will identify a causal relationship.

$$\text{Coverage} = \beta_0 + \beta_1 (\text{political criteria}) + \beta_X X + e$$
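A model of this form is usually estimated by ordinary least squares (OLS). The sketch below solves the normal equations in plain Python. It is an illustration only, not the authors' estimation code: the data are hypothetical, the "political criterion" and "control" columns are invented for the example, and the outcome is constructed without an error term so the known coefficients can be checked.

```python
def ols(X, y):
    """OLS coefficients (intercept first) for predictor rows X and outcomes y."""
    rows = [[1.0] + list(r) for r in X]  # prepend a constant for the intercept
    k = len(rows[0])
    # Augmented matrix for the normal equations (X'X) b = X'y.
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_val = A[col][col]
        A[col] = [v / pivot_val for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Hypothetical data: (political criterion, control) for five municipalities.
X = [(0.1, 0.0), (0.2, 1.0), (0.3, 0.0), (0.4, 1.0), (0.5, 1.0)]
# Outcome built from known coefficients and no error term, so OLS recovers them.
y = [1.0 - 0.5 * a + 0.2 * b for a, b in X]

beta = ols(X, y)
print([round(b, 3) for b in beta])
```

In practice you would rely on your statistical package's regression command, which also reports standard errors and significance tests.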

Results: Explaining Coverage in 2009

| Explanatory Variable | Indicator Type | Regression Coefficient |
|---|---|---|
| PT Vote Share for Deputado Federal | Core | -0.473*** |
| PT Vote Share for President | Core | -0.0972*** |
| PT Mayor | Core | -0.0241** |
| Base Mayor | Core | -0.0208*** |
| Change in Support for PT Presidential Candidate | Swing | -0.175*** |
| Close Presidential Election | Swing | 0.00651 |

Effect of Standard Deviation Shift of Explanatory Variables on Coverage in 2009
Shift Explained by Political Criteria

| Explanatory Variable | Effect of Shift in Support |
|---|---|
| PT Vote Share for Deputado Federal | -0.023 |
| PT Vote Share for President | -0.010 |
| PT Mayor* | -0.024 |
| Base Mayor* | -0.021 |
| Change in Support for PT Presidential Candidate | -0.014 |
| Close Presidential Election in 2006* | 0.007 |
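These effects appear to follow directly from figures reported earlier in the slides: each regression coefficient multiplied by the variable's standard deviation. The sketch below reproduces that arithmetic. One assumption is labeled in the code: the slides do not define the asterisk, and it is read here as marking binary variables, whose "shift" is a change from 0 to 1, so the effect is simply the coefficient.

```python
# Regression coefficients as reported on the results slide.
coefficients = {
    "PT Vote Share for Deputado Federal": -0.473,
    "PT Vote Share for President": -0.0972,
    "PT Mayor": -0.0241,
    "Base Mayor": -0.0208,
    "Change in Support for PT Presidential Candidate": -0.175,
    "Close Presidential Election": 0.00651,
}

# Standard deviations of the continuous variables, from the descriptive statistics.
std_devs = {
    "PT Vote Share for Deputado Federal": 0.048,
    "PT Vote Share for President": 0.107,
    "Change in Support for PT Presidential Candidate": 0.080,
}

# ASSUMPTION: asterisked variables are binary, so the shift is from 0 to 1.
binary = {"PT Mayor", "Base Mayor", "Close Presidential Election"}

effects = {}
for name, coef in coefficients.items():
    shift = 1.0 if name in binary else std_devs[name]
    effects[name] = round(coef * shift, 3)

for name, effect in effects.items():
    print(f"{name}: {effect:+.3f}")
```

For example, -0.473 × 0.048 ≈ -0.023: a one-standard-deviation increase in PT vote share for federal deputy is associated with coverage about 0.023 lower. That is small next to the standard deviation of coverage itself (0.229).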

Identify Threats to Inference! (Do I have any?)

Robustness Check: Relationship between Coverage in 2004 and Prior Elections
Shift Explained by Political Criteria

| Explanatory Variable | Effect of Shift in Support |
|---|---|
| PT Vote Share for Deputado Federal in 2002 | 0.018 |
| PT Vote Share for President in 2002 | 0.034 |
| PT Mayor in 2000* | 0.002 |
| Base Mayor in 2000* | 0.005 |
| Change in Support for PT Presidential Candidate (1998 to 2002) | -0.003 |
| Close Presidential Election in 2002* | -0.016 |

Putting Output into a Paper
Cut and Paste
Graphs: Cut and paste into a word processing document; save as a .jpeg or .tif file.
Tables: Cut and paste; format in the word processing document; import into Excel, format, and then place in Word.

More Advanced Analysis Multivariate techniques are only a start; they do help to account for confounding factors, allow for testing change over time and more complex hypotheses … 1) Be honest about your abilities. 2) Ask for help 3) Best off including techniques that you fully understand, but may be worth learning something new!

Take Away Messages 1) Begin by thinking about what question interests you. 2) Look for data and consider appropriate methods; identify what hypotheses are actually testable. 3) Design and run analysis; keep a codebook/syntax files! 4) Back up data. 5) Ask for help, especially when choosing a method, and seek feedback on research design. 6) Research and writing are an iterative process.

Feedback Please give us feedback! (Go to the StatLab workshop page, click on "we welcome feedback", and select "Introductory Research Design". Or go to: statlab.stat.yale.edu/workshops/feedback.jsp)