1. Working with the ECLS-K Datasets: Weights and other issues. Information is courtesy of the Institute of Education Sciences, National Center for Education Statistics, and is used in their training seminars.
2. Sampling Weights: What are sampling weights and why are they important? How are weights used? What weights are on the ECLS-K data files and when should they be used?
3. What is a “Weight”? A weight is used to indicate the relative strength of an observation. In the simplest case, each observation is counted equally. For example, if we have five observations and wish to calculate the mean, we just add up the values and divide by 5.
4. How are Weights Used? Dataset with 5 cases.
Value:  4 2 1 5 2
Weight: 1 2 4 1 2
Sample mean = (4 + 2 + 1 + 5 + 2) / 5 = 2.8
Weighted mean = [(4*1) + (2*2) + (1*4) + (5*1) + (2*2)] / sum of weights = (4 + 4 + 4 + 5 + 4) / 10 = 2.1
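The arithmetic on this slide can be sketched in Python. This is an illustration only, not part of the original deck; the weights 1, 2, 4, 1, 2 are read off the weighted-mean formula, and the function name `weighted_mean` is an assumption.

```python
def weighted_mean(values, weights):
    """Return sum(weight * value) / sum(weight)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


values = [4, 2, 1, 5, 2]
weights = [1, 2, 4, 1, 2]  # read off the slide's worked example

unweighted = sum(values) / len(values)     # every case counts equally
weighted = weighted_mean(values, weights)  # cases count in proportion to their weight

print(unweighted, weighted)  # 2.8 2.1
```

The case with value 1 carries the largest weight (4), which is why the weighted mean falls below the unweighted one.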
5. What is the Difference Between Weighted and Unweighted Data? With unweighted data, each case is counted equally. Unweighted data represent only those in the sample who provide data. With weighted data, each case is counted relative to its representation in the population. Weights allow analyses that represent the target population.
6. ECLS-K and Weights: The ECLS-K is a sample, i.e., the entire population was not surveyed. The ECLS-K is not a simple random sample (SRS). That is, not all schools, teachers, and children had an equal probability of selection. Not all schools, teachers, and children participated.
7. Why Use Weights in the ECLS-K? The ECLS-K weights allow you to make statements about the population of U.S. children who were in kindergarten in [...] or in first grade in [...]. Without using weights, estimates are not nationally representative. Weights adjust for differential selection probabilities and reduce bias associated with nonresponse by adjusting for differential nonresponse.
8. Examples of Weighted vs. Unweighted Data: Base Year. Columns: unweighted %; weighted % for sampling design (base weight); weighted % for sampling design and non-response (C1CW0).
Race/Ethnicity. White: 57, 56, 58. Black: 15, 16. Hispanic: 18, 20, 19. Asian: 6, 3.
School Type. Public: 78, 87, 85. Private: 22, 13.
9. Examples of Weighted vs. Unweighted Data: First Grade. Columns: unweighted %; weighted % for sampling design (base weight); weighted % for sampling design and non-response (C4PW0).
Household SES. Bottom 20%: 17, 19, 20. Middle 60%: 59, 60. Highest 20%: 24, 21.
Family Type. Two parents: 78, 76, 73. Single parent: 22. Other: 2.
10. Types of Weights on the ECLS-K: Weights vary according to level of analysis (child, teacher, or school; only child-level after base year), round(s) of data (cross-sectional or longitudinal), and source(s) of data (child assessment, parent interview, and/or teacher questionnaires).
11. Level of Analysis – Base Year: The first element in a weight variable name indicates the level of analysis. Weights for school-level analyses begin with “S”. Weights for teacher-level analyses begin with “B”. Weights for cross-sectional child-level analyses begin with “C”. Weights for longitudinal child-level analyses begin with “BY”.
12. Level of Analysis – 1st, 3rd and 5th Grades: Weights for child-level analyses (cross-sectional and longitudinal) begin with “C”. One exception: weight Y2COMW0 is for child-level analyses of assessment data from rounds 1, 2 and 4 and parent and/or teacher data from spring of first grade, and one or more base year rounds of parent and/or teacher data.
13. Data Round(s): The second element in a weight variable name indicates the round(s) of data. Weights for cross-sectional analyses have a single round number: 1, 2, 3, 4, 5 or 6. Weights for longitudinal analyses have 2 or more numbers, for example: “45” for rounds 4 and 5; “124” for rounds 1, 2 and 4 (exception in Y2COMW0); “1_4” for rounds 1, 2, 3 and 4; “1_6F” for rounds 1, 2, 4, 5, 6 (F = full sample); “1_5S” for rounds 1, 2, 3, 4, 5 (S = subsample).
14. Source of the Data: The third element in a weight variable name indicates the source(s) of data. Weights for analyses using data from child assessments (alone or in conjunction with any combination of a limited set of child characteristics, e.g., age, sex, race/ethnicity) have a “C”. Weights for parent interview data (with or without child data) have a “P”. Weights for child AND parent AND teacher data have a “CPT”. In 5th grade, the “CPT” is followed by either “R”, “M” or “S” for reading, math or science teacher.
15. Sources of the Data: Two exceptions. BYCOMW0: Child assessment data from fall AND spring kindergarten in conjunction with one or more rounds of parent and/or teacher base year data. Y2COMW0: Child direct assessment data from fall AND spring kindergarten AND spring first grade, in conjunction with parent and/or teacher data from spring first grade, AND one or more base year rounds of parent and/or teacher data.
16. Source of the Data: Sources that do not affect choice of weight: school administrator questionnaire; facilities checklist; teacher questionnaire C; special education questionnaires; student record abstract data; Head Start data; salary and benefits data.
17. Example C23PW0: “C” for child-level analysis. “23” for analysis of data from rounds 2 and 3. “P” for analysis of parent interview data.
18. Example C6CPTM0: “C” for child-level analysis. “6” for analysis of data from round 6. “CPTM” for analysis of child, parent, and math teacher data.
19. Cross-sectional Examples:
C1PW0 -- Child-level analyses from round 1, parent interview data (with or without child assessment data).
B1TW0 -- Teacher-level analyses (teacher data) from round 1.
S2SAQW0 -- School-level analysis (SAQ data) from round 2.
C6CW0 -- Child assessment data from round 6.
C5CPTW0 -- Child-level analyses from round 5 with child, parent AND teacher data.
20. Longitudinal Examples: All longitudinal weights are for child-level analyses.
BYPW0 -- Round 1 and 2 parent interview data.
BYCOMW0 -- Round 1 and 2 assessment data and some other parent and teacher data.
C24PW0 -- Round 2 and 4 parent interview data.
C245CW0 -- Round 2, 4 and 5 assessment data.
C1_6FC0 -- Round 1, 2, 4, 5 and 6 assessment data.
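The naming convention described on the preceding slides (level, then rounds, then source) can be sketched as a small Python parser. This is a hypothetical helper, not an NCES tool: the function name, the regular expressions, and the treatment of the trailing "W0" or "0" are assumptions inferred from the example names in this deck, and only the round ranges spelled out on the slides are expanded.

```python
import re

LEVELS = {"S": "school", "B": "teacher", "C": "child"}
# Only the round ranges spelled out on the slides are expanded here.
ROUND_RANGES = {"1_4": [1, 2, 3, 4], "1_6F": [1, 2, 4, 5, 6], "1_5S": [1, 2, 3, 4, 5]}
# The two "COM" weights are documented exceptions, so they are looked up directly.
EXCEPTIONS = {
    "BYCOMW0": ("child", [1, 2], "COM"),
    "Y2COMW0": ("child", [1, 2, 4], "COM"),
}
# Assumed shape: level letter, round digits or range, source letters, optional "W", "0".
GENERAL = re.compile(r"^([SBC])(\d_\d[FS]?|\d+)([A-Z]+?)W?0$")
BASE_YEAR = re.compile(r"^BY([A-Z]+?)W?0$")


def parse_weight_name(name):
    """Split a weight variable name into (level, rounds, source)."""
    name = name.upper()
    if name in EXCEPTIONS:
        return EXCEPTIONS[name]
    match = BASE_YEAR.match(name)
    if match:
        # "BY" marks a base-year longitudinal child weight (rounds 1 and 2).
        return ("child", [1, 2], match.group(1))
    match = GENERAL.match(name)
    if not match:
        raise ValueError(f"unrecognized weight name: {name}")
    level, rounds, source = match.groups()
    if "_" in rounds:
        if rounds not in ROUND_RANGES:
            raise ValueError(f"unknown round range: {rounds}")
        round_list = ROUND_RANGES[rounds]
    else:
        round_list = [int(digit) for digit in rounds]
    return (LEVELS[level], round_list, source)


for example in ["C23PW0", "C6CPTM0", "S2SAQW0", "BYPW0"]:
    print(example, parse_weight_name(example))
```

A parser like this is only a reading aid for the naming scheme; the user's manuals remain the authority on which weight applies to a given analysis.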
21. Third and Fifth-Grade Weights: Unlike the first grade sample, the ECLS-K sample was not freshened in third and fifth grade. The ECLS-K sample does not represent all third graders in [...] or fifth graders in [...]. These samples represent all children who began kindergarten in 1998 or began first grade in 1999.
22. How to Use Weights: In SAS, use the “WEIGHT” statement. In SPSS, use the “WEIGHT BY” statement. Key fact: All ECLS-K weights sum to population totals.
23. Weights in SAS: SAS uses the WEIGHT statement in various PROCedures.
PROC FREQ data = test;
  Tables Age Gender Score;
  Weight weightvar;
Run;
24. Weights in SPSS:
LIST VARIABLES = age to weightvar.
Frequencies variables = age, score /sta=default.
weight by weightvar.
frequencies variables = age, score /sta=default.
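25. Weights in STATA:
clear
use "c:\temp\test1.dta"
* tabulate takes at most two variables and does not allow pweights; aweights give the same weighted percentages
tabulate score [aweight=weightvar]
tabulate age [aweight=weightvar]
tabulate gender [aweight=weightvar]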
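What the weighted frequency commands in the three packages compute can be sketched in plain Python. The data below are hypothetical toy values, and the names `weighted_freq`, `gender` and `weightvar` are assumptions used only for illustration.

```python
from collections import defaultdict


def weighted_freq(categories, weights):
    """Return {category: (weighted count, weighted percent)}.

    Cases without a positive weight are skipped.
    """
    totals = defaultdict(float)
    for category, weight in zip(categories, weights):
        if weight > 0:
            totals[category] += weight
    grand_total = sum(totals.values())
    return {cat: (tot, 100.0 * tot / grand_total) for cat, tot in totals.items()}


# Hypothetical toy data: five sampled children and their weights.
gender = ["F", "M", "F", "M", "M"]
weightvar = [200.0, 150.0, 250.0, 100.0, 300.0]

table = weighted_freq(gender, weightvar)
for category, (count, percent) in sorted(table.items()):
    print(category, count, percent)
```

Unweighted, the sample is 40% female (2 of 5 cases); weighted, it is 45% female, because the weighted counts estimate population totals rather than sample counts.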
26. Weights for HLM Users: ECLS-K weights are adjusted for nonresponse. ECLS-K weights are not normalized (they sum to the population N rather than the sample n). A within-school child-level weight can be approximated by dividing a regular child-level weight by the school-level weight. If the analysis includes children who stayed in the same school at each round of the analysis, the school weight (S2SAQW0) can be used as a school-level weight.
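The within-school approximation described on this slide amounts to one division per child. A minimal sketch follows, with hypothetical records and an assumed function name:

```python
def within_school_weight(child_weight, school_weight):
    """Approximate a within-school child weight as child weight / school weight."""
    if school_weight <= 0:
        raise ValueError("school weight must be positive")
    return child_weight / school_weight


# Hypothetical records: (child-level weight, weight of the child's school).
records = [(240.0, 60.0), (180.0, 60.0), (300.0, 75.0)]
level1_weights = [within_school_weight(child, school) for child, school in records]
print(level1_weights)  # [4.0, 3.0, 4.0]
```

In a two-level model the resulting values would serve as level-1 weights, with the school weight itself used at level 2.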
27. Other Frequently Asked Questions: When selecting a weight, do I have to subset my dataset? What happens to cases where there is no positive weight? What weights do I use if analyzing a subsample of cases? What if I’m running a regression: what weights do I use?
28. Summary about Weights: Weights should be used when analyzing data from the ECLS-K. The appropriate weight should be selected based on level of analysis, round(s) of data, and source(s) of data. There may not be a “perfect” weight for some analyses. The best weight can be determined with some descriptive analysis.
29. Variance, Calculating Standard Errors: Why are standard errors important? Why not use standard errors that assume a simple random sample (SRS)? How to use “exact” methods for estimating standard errors. How to use approximation methods for estimating standard errors.
30. Why are Standard Errors Important? Standard errors are produced for estimates from sample surveys. They are a measure of the variance in the estimates associated with the selected sample being one of many possible samples. Standard errors are used to test hypotheses and to study group differences when making inferences to a population. Using inaccurate standard errors can lead to identification of statistically significant results where none are present and vice versa.
31. Important Considerations: All weights on the ECLS-K data files sum to population totals and not sample totals. The ECLS-K has a complex sample design and is not a simple random sample.
32. The ECLS-K Sample Design: Oversampling. The ECLS-K includes oversamples of private schools and private school children. The ECLS-K also oversamples Asian and Pacific Islander children.
33. The ECLS-K Sample Design: Clustering. Sample children were clustered within primary sampling units (PSUs) to reduce field costs. Children were in closer geographical proximity than would occur in a simple random sample. Children in a clustered sample tend to be more alike than those in a simple random sample.
34. Complex Samples and Standard Errors: The usual standard error formula assumes a simple random sample. Standard errors for estimates from a complex sample must account for the within-cluster and across-cluster variation. Special software can make the adjustment, or this adjustment can be approximated using the design effect.
35. Options: Exact methods, such as the Taylor series and replication techniques. Approximation method.
36. Exact Methods: Taylor Series. Extract PSU and strata IDs from the data file. Software available: SUDAAN, STATA (using SVY commands), and SAS (using PROC SURVEY commands).
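To show what the Taylor series option does with the PSU and strata IDs, here is a minimal Python sketch of the standard linearization variance for a weighted mean, treating PSUs as sampled with replacement within strata. It is a simplified illustration with hypothetical toy data, not the exact routine any of the listed packages runs; the function and variable names are assumptions.

```python
import math
from collections import defaultdict


def taylor_se_of_mean(y, w, strata, psu):
    """Weighted mean and its linearization standard error.

    Each case gets a linearized value w_i * (y_i - mean) / sum(w); these are
    summed to PSU totals, and the between-PSU variation of those totals is
    accumulated within each stratum.
    """
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, j in zip(y, w, strata, psu):
        psu_totals[h][j] += wi * (yi - mean) / total_w

    variance = 0.0
    for totals in psu_totals.values():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in totals.values())
    return mean, math.sqrt(variance)


# Hypothetical toy data: 2 strata, 2 PSUs per stratum, 2 children per PSU.
y = [1, 2, 3, 4, 5, 6, 7, 8]
w = [1.0] * 8
strata = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 1, 1, 2, 2]

mean, se = taylor_se_of_mean(y, w, strata, psu)
print(mean, se)
```

Only variation between PSU totals within a stratum enters the estimate, which is how clustering is reflected in the standard error.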
37. Exact Methods: Replication Techniques. Extract replication weights (90 of them). ECLS-K replication weights use jackknife 2 (JK2) methods. Software: WESVAR replication series (JK2), AM (JK2), and SAS-callable SUDAAN.
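The replication idea can be sketched in Python: re-estimate the statistic once with each replicate weight and sum the squared deviations from the full-sample estimate, which is the usual JK2 variance formula. The toy data below use 2 replicate weights rather than the 90 on the ECLS-K files, and all names are assumptions for illustration.

```python
import math


def weighted_mean(y, w):
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def jk2_standard_error(y, full_weight, replicate_weights):
    """Full-sample weighted mean and its JK2 standard error.

    Variance = sum over replicates of (replicate estimate - full estimate)^2.
    """
    full_estimate = weighted_mean(y, full_weight)
    variance = sum(
        (weighted_mean(y, rep) - full_estimate) ** 2 for rep in replicate_weights
    )
    return full_estimate, math.sqrt(variance)


# Hypothetical toy data: 4 cases in 2 variance strata of 2 units each.
y = [1, 2, 3, 4]
full_weight = [1.0, 1.0, 1.0, 1.0]
# In each replicate, one unit of one stratum is dropped and its partner doubled.
replicate_weights = [
    [2.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0, 0.0],
]

estimate, se = jk2_standard_error(y, full_weight, replicate_weights)
print(estimate, se)
```

With real data the same loop would run over all 90 replicate weights that accompany the chosen full-sample weight.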
38. Approximation Method: Two stages. First, normalize weights so the standard error is based on actual sample size rather than population size. Then, use the design effect (DEFF) to account for the complex sampling design.
39. 1) Normalizing Weights*: Weights on the ECLS-K sum to the population totals. Calculate a new weight that sums to the sample size. Normalized weight = (ECLS-K weight) * (sample n / population N). *SAS users do not need this step since estimates are produced based on the actual sample size.
40. Example – Normalizing Weights: Weight to be normalized: C2PW0. Sum of weights: 3,865,946. Total number of cases with a positive weight: 18,950. Normalized weight = C2PW0 * (18,950 / 3,865,946).
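The normalization step can be sketched in Python. The function name and the toy weights are assumptions; the two totals come from the C2PW0 example on this slide.

```python
def normalize_weights(weights):
    """Rescale weights so the positive ones sum to the number of positive cases."""
    positive = [w for w in weights if w > 0]
    factor = len(positive) / sum(positive)  # sample n / population N
    return [w * factor for w in weights]


# Scaling factor for C2PW0, using the totals quoted on the slide.
c2pw0_factor = 18950 / 3865946
print(round(c2pw0_factor, 7))

# Hypothetical toy weights; the zero-weight case stays at zero.
toy = [100.0, 200.0, 0.0, 300.0]
normalized = normalize_weights(toy)
print(normalized, sum(normalized))
```

After normalization the weights keep their relative sizes but sum to the sample size, so software that treats the sum of weights as n no longer sees a sample of several million.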
41. 2) Adjusting for Complex Design: The ECLS-K has a complex sample design; it is not a simple random sample. Software packages designed for simple random samples tend to underestimate the standard errors for complex sample designs. Special methods are required for complex designs.
42. Using Design Effects (DEFF): What is a design effect (DEFF)? It’s the ratio of the variance found in the actual (complex) sample design to the variance expected in a simple random sample of the same sample size.
43. Using Design Effects (DEFF): DEFT = the square root of DEFF = (design standard error / simple random sample standard error). Example for fall-kindergarten reading scores: SE (SRS) = 0.063; SE (Design) = 0.156; DEFF = [SE (Design)]^2 / [SE (SRS)]^2 = 6.15; DEFT = 0.156 / 0.063 = square root of 6.15 = 2.48.
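The relationship between the two standard errors, DEFT and DEFF can be checked with a few lines of Python. The function name is an assumption; the two standard errors are the rounded values quoted on this slide.

```python
def design_effect(se_design, se_srs):
    """Return (DEFF, DEFT): DEFT is the ratio of standard errors, DEFF its square."""
    deft = se_design / se_srs
    return deft ** 2, deft


deff, deft = design_effect(se_design=0.156, se_srs=0.063)
print(round(deff, 2), round(deft, 2))  # about 6.13 and 2.48
```

With the rounded standard errors shown on the slide the ratio of variances works out to about 6.13 rather than 6.15; the slide's figure presumably reflects unrounded inputs, though that is an assumption.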
44. 3 Ways of Using the DEFF: (1) Multiply the SRS (simple random sample) standard error produced by statistical software (when using normalized weights) by the square root of the DEFF (DEFT). Or (2) adjust the t-statistic by dividing it by the square root of the design effect (DEFT), or adjust the F-statistic by dividing it by the DEFF. Or (3) adjust the weight such that an adjusted standard error is produced.
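The first two adjustments are simple rescalings and can be sketched in Python. The function names are assumptions, and the t and F values are hypothetical; the standard error and DEFF are the values from the fall-kindergarten reading example.

```python
import math


def adjust_se(se_srs, deff):
    """Inflate an SRS standard error by DEFT = sqrt(DEFF)."""
    return se_srs * math.sqrt(deff)


def adjust_t(t_stat, deff):
    """Deflate a t-statistic by DEFT = sqrt(DEFF)."""
    return t_stat / math.sqrt(deff)


def adjust_f(f_stat, deff):
    """Deflate an F-statistic by DEFF."""
    return f_stat / deff


deff = 6.15
print(round(adjust_se(0.063, deff), 3))  # close to the design SE of 0.156
print(round(adjust_t(4.96, deff), 2))    # hypothetical t of 4.96 shrinks to about 2.0
print(round(adjust_f(12.3, deff), 2))    # hypothetical F of 12.3 shrinks to about 2.0
```

The third approach, adjusting the weight itself, builds the same correction into the data so that unmodified software output already reflects it.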
45. Using a DEFF-Adjusted Weight: First, create a weight that sums to the sample size (normalized weight). Second, divide this normalized weight by the DEFF. Third, use this weight for analyses. The standard errors produced will approximate the standard errors obtained using “exact” methods.
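The three steps can be sketched in Python as a single function. The function name, the toy weights and the DEFF of 2.0 are hypothetical; in practice the DEFF would be taken from the tables in the user's manuals.

```python
def deff_adjusted_weights(weights, deff):
    """Normalize weights to the sample size, then divide by the design effect."""
    if deff <= 0:
        raise ValueError("DEFF must be positive")
    positive = [w for w in weights if w > 0]
    factor = len(positive) / sum(positive)    # step 1: normalize to sample n
    return [w * factor / deff for w in weights]  # step 2: divide by DEFF


# Hypothetical toy weights and design effect.
toy = [100.0, 200.0, 300.0, 400.0]
adjusted = deff_adjusted_weights(toy, deff=2.0)
print(adjusted, sum(adjusted))
```

The adjusted weights sum to n / DEFF, so software that reads the sum of weights as the sample size computes standard errors as if the sample were that much smaller.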
46. Where to find ECLS-K DEFFs: Training material: “ECLS-K Specifications for Computing Standard Errors”. ECLS-K users’ manuals: Base Year (Kindergarten): Table 4.12. First Grade: Tables 4.13 and 9.4. Third Grade: Tables 4.14 and 9.2. Fifth Grade: Tables 4.19 and 9.2.
47. For SAS Users: SAS base procedures such as PROC REG, PROC FREQ, and PROC MEANS do account for the actual sample size but not for complex sampling. SAS procedures such as PROC SURVEYMEANS and PROC SURVEYREG (and other procedures that begin with “Survey”) use the Taylor series method to account for complex sampling and provide exact estimates of the standard errors.
48. PROC SURVEYREG Example: Example using ECLS-K data, spring kindergarten and spring first grade variables.
proc surveyreg data = fscores;
  model c4r3mscl = c2r3mscl lowkread t4learn;
  cluster c24cpsu;
  strata c24cstr;
  weight c24cw0;
  where lowkmath = 0;
run;
49. PROC SURVEYLOGISTIC Example: Example using ECLS-K data, spring kindergarten and spring first grade variables.
proc surveylogistic data = fscores;
  model lowkread (desc) = c2r3mscl t4learn;
  cluster c24cpsu;
  strata c24cstr;
  weight c24cw0;
  where lowkmath = 0;
run;
50. PROC SURVEYFREQ Example: Example using ECLS-K data, spring kindergarten and spring first grade variables.
proc surveyfreq data = fscores;
  tables lowkread c2r3mscl t4learn;
  cluster c24cpsu;
  strata c24cstr;
  weight c24cw0;
run;
51. STATA Code for Complex Design: Logistic Regression Example, 3rd Grade Data.
svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR)
svy, subpop(male): logit highbmi white
52. STATA Code for Complex Design: Regression Example, 3rd Grade Data.
svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR)
svy, subpop(male): regress highbmi white
53. STATA Code for Complex Design: Means Example, 3rd Grade Data.
svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR)
svy, subpop(male): mean highbmi female
54. SPSS for Complex Sample Design: Use the SPSS add-on called SPSS Complex Samples™. Complex Samples Logistic Regression (CSLOGISTIC): Performs binary logistic regression analysis, as well as multiple logistic regression (MLR) analysis, for samples drawn by complex sampling methods. The procedure estimates variances by taking into account the sample design used to select the sample, including equal probability and PPS methods, and WR and WOR sampling procedures. Optionally, CSLOGISTIC performs analyses for subpopulations. (Courtesy of SPSS)
55. Regression Analysis: Use appropriate software such as AM, WESVAR, SUDAAN or SAS (SURVEYREG procedure). For SAS (PROC REG procedure), use DEFF-adjusted weights. For SPSS, use normalized, DEFF-adjusted weights.
56. Summary: All statistical tests should be based on standard errors that are calculated to account for the complex sample design of the ECLS-K. Preferred: Use software that incorporates JK2 replication methods, or use software that incorporates the Taylor series method. Last resort: Make approximate adjustments based on design effects.
57. ECLS-K Data Availability: Base Year (Kindergarten) through 5th Grade restricted use and public use datasets have been released. The 8th Grade restricted use dataset should be released in the winter of 2008 and the public datasets should be released in March 2009.
58. Differences in Restricted Use and Public Use ECLS-K Datasets. Here’s a short explanation from the NCES: Chapter 7 in the ECLS-K 5th Grade User’s Guide has Tables 7-15 and 7-16 that describe the differences in the public and restricted datasets. The User’s Guide can be found online at: