# Working with the ECLS-K Datasets: Weights and Other Issues

Information is courtesy of the Institute of Education Sciences, National Center for Education Statistics, and is used in their training seminars.

Sampling Weights
- What are sampling weights and why are they important?
- How are weights used?
- What weights are on the ECLS-K data files and when should they be used?

What is a “Weight”?
A weight is used to indicate the relative strength of an observation. In the simplest case, each observation is counted equally. For example, if we have five observations and wish to calculate the mean, we just add up the values and divide by 5.

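How are Weights Used?
Dataset with 5 cases:

| Value | Weight |
|---|---|
| 4 | 1 |
| 2 | 2 |
| 1 | 4 |
| 5 | 1 |
| 2 | 2 |

Sample mean:

$$\bar{x} = \frac{4 + 2 + 1 + 5 + 2}{5} = \frac{14}{5} = 2.8$$

Weighted mean:

$$\bar{x}_w = \frac{(4 \cdot 1) + (2 \cdot 2) + (1 \cdot 4) + (5 \cdot 1) + (2 \cdot 2)}{\text{sum of weights}} = \frac{21}{10} = 2.1$$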
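The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `weighted_mean` is illustrative, and the weights are the ones implied by the slide's weighted-mean calculation.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


values = [4, 2, 1, 5, 2]
weights = [1, 2, 4, 1, 2]  # weights implied by the slide's weighted-mean arithmetic

unweighted = sum(values) / len(values)      # every case counts equally
weighted = weighted_mean(values, weights)   # each case counts in proportion to its weight

print(f"unweighted mean = {unweighted:.1f}, weighted mean = {weighted:.1f}")
```

The case with value 1 carries the largest weight, which is why the weighted mean falls below the unweighted one.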

What is the Difference Between Weighted and Unweighted Data?
With unweighted data, each case is counted equally. Unweighted data represent only those in the sample who provide data. With weighted data, each case is counted relative to its representation in the population. Weights allow analyses that represent the target population.

ECLS-K and Weights
- The ECLS-K is a sample, i.e., the entire population was not surveyed.
- The ECLS-K is not a simple random sample (SRS). That is, not all schools, teachers, and children had an equal probability of selection.
- Not all schools, teachers, and children participated.

Why Use Weights in the ECLS-K?
The ECLS-K weights allow you to make statements about the population of U.S. children who were in kindergarten in 1998-99 or in first grade in 1999-2000. Without using weights, estimates are not nationally representative. Weights adjust for differential selection probabilities and reduce bias associated with non-response by adjusting for differential nonresponse.

Examples of Weighted vs. Unweighted Data
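Base Year. Values for each characteristic are listed in this order: unweighted %, weighted % for sampling design (base weight), weighted % for sampling design and non-response (C1CW0). Rows with fewer than three values are incomplete in this transcript.

Race/Ethnicity
- White: 57, 56, 58
- Black: 15, 16
- Hispanic: 18, 20, 19
- Asian: 6, 3

School Type
- Public: 78, 87, 85
- Private: 22, 13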

Examples of Weighted vs. Unweighted Data
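First Grade. Values for each characteristic are listed in this order: unweighted %, weighted % for sampling design (base weight), weighted % for sampling design and non-response (C4PW0). Rows with fewer than three values are incomplete in this transcript.

Household SES
- Bottom 20%: 17, 19, 20
- Middle 60%: 59, 60
- Highest 20%: 24, 21

Family Type
- Two parents: 78, 76, 73
- Single parent: 22
- Other: 2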

Types of Weights on the ECLS-K
Weights vary according to:
- Level of analysis: child, teacher, or school (only child-level after base year).
- Round(s) of data: cross-sectional or longitudinal.
- Source(s) of data: child assessment, parent interview, and/or teacher questionnaires.

Level of Analysis – Base Year
The first element in a weight variable name indicates the level of analysis.
- Weights for school-level analyses begin with “S”.
- Weights for teacher-level analyses begin with “B”.
- Weights for child-level analyses begin with “C” (cross-sectional).
- Weights for child-level analyses begin with “BY” (longitudinal).

Level of Analysis – 1st, 3rd and 5th Grades
Weights for Child-level analyses (cross sectional and longitudinal) begin with “C”. One exception: weight Y2COMW0 is for child-level analyses of assessment data from rounds 1, 2 and 4 and parent and/or teacher data from spring of first grade, and one or more base year rounds of parent and/or teacher data.

Data Round(s)
The second element in a weight variable name indicates the round(s) of data.
- Weights for cross-sectional analyses have a single round number: 1, 2, 3, 4, 5 or 6.
- Weights for longitudinal analyses have 2 or more numbers, for example:
  - “45” for rounds 4 and 5.
  - “124” for rounds 1, 2 and 4 (exception in Y2COMW0).
  - “1_4” for rounds 1, 2, 3 and 4.
  - “1_6F” for rounds 1, 2, 4, 5 and 6 (F = full sample).
  - “1_5S” for rounds 1, 2, 3, 4 and 5 (S = subsample).

Source of the Data
The third element in a weight variable name indicates the source(s) of data. Weights for analyses using data from:
- Child assessments (alone or in conjunction with any combination of a limited set of child characteristics, e.g. age, sex, race/ethnicity) have a “C”.
- Parent interview (with or without child data) have a “P”.
- Child AND parent AND teacher have a “CPT”. In 5th grade, the “CPT” is followed by either “R”, “M” or “S” for reading, math or science teacher.

Sources of the Data
Two exceptions:
- BYCOMW0: Child assessment data from fall AND spring kindergarten in conjunction with one or more rounds of parent and/or teacher base year data.
- Y2COMW0: Child direct assessment data from fall AND spring kindergarten AND spring first grade, in conjunction with parent and/or teacher data from spring first grade, AND one or more base year rounds of parent and/or teacher data.

Sources that do not affect choice of weight
- School administrator questionnaire
- Facilities checklist
- Teacher questionnaire C
- Special education questionnaires
- Student record abstract data
- Head Start data
- Salary and benefits data

Example C23PW0
- “C” for child level analysis.
- “23” for analysis of data from rounds 2 and 3.
- “P” for analysis of parent interview data.

Example C6CPTM0
- “C” for child level analysis.
- “6” for analysis of data from round 6.
- “CPTM” for analysis of child, parent, and math teacher data.

Cross-sectional Examples:
- C1PW0: Child-level analyses from round 1, parent interview data (with or without child assessment data).
- B1TW0: Teacher-level analyses (teacher data) from round 1.
- S2SAQW0: School-level analysis (SAQ data) from round 2.
- C6CW0: Child assessment data from round 6.
- C5CPTW0: Child-level analyses from round 5 with child, parent AND teacher data.

Longitudinal Examples
All longitudinal weights are for child-level analyses.
- BYPW0: Round 1 and 2 parent interview data.
- BYCOMW0: Round 1 and 2 assessment data and some other parent and teacher data.
- C24PW0: Round 2 and 4 parent interview data.
- C245CW0: Round 2, 4 and 5 assessment data.
- C1_6FC0: Round 1, 2, 4, 5 and 6 assessment data.
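The naming convention described above (level, then rounds, then source) can be sketched as a small parser. This is an illustration rather than an official NCES tool: the function name `parse_weight_name` and the regular expression are assumptions, and the special cases that begin with “BY” or “Y2” are deliberately left unparsed.

```python
import re

# Level codes as described in the slides: C = child, B = teacher, S = school.
LEVELS = {"C": "child", "B": "teacher", "S": "school"}

# level letter, round element (digits, or forms such as 1_4 / 1_6F / 1_5S),
# source letters, and a trailing "W0" or "0".
_PATTERN = re.compile(
    r"^(?P<level>[CBS])(?P<rounds>\d_\d[FS]?|\d+)(?P<source>[A-Z]+?)(?:W0|0)$"
)


def parse_weight_name(name):
    """Split a weight name into level, rounds and source; None if it does not fit."""
    match = _PATTERN.match(name.upper())
    if match is None:
        return None  # e.g. BYPW0, BYCOMW0, Y2COMW0 follow their own rules
    return {
        "level": LEVELS[match.group("level")],
        "rounds": match.group("rounds"),
        "source": match.group("source"),
    }


for weight in ["C23PW0", "C6CPTM0", "B1TW0", "S2SAQW0", "C245CW0", "BYCOMW0"]:
    print(weight, parse_weight_name(weight))
```

The round element is returned as written in the name because forms such as “1_6F” do not expand to a simple range of rounds.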

Unlike in first grade, the ECLS-K sample was not freshened in third or fifth grade. The ECLS-K sample therefore does not represent all third graders in 2001-02 or fifth graders in 2003-04. These samples represent all children who began kindergarten in 1998 or began first grade in 1999.

How to Use Weights
- In SAS, use the “WEIGHT” statement.
- In SPSS, use the “WEIGHT BY” statement.

Key Fact: All ECLS-K weights sum to population totals.

Weights in SAS
SAS uses the WEIGHT statement in various PROCedures.

    proc freq data = test;
      tables age gender score;
      weight weightvar;
    run;

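Weights in SPSS

    LIST VARIABLES = age TO weightvar.
    * Unweighted frequencies.
    FREQUENCIES VARIABLES = age, score /STATISTICS = DEFAULT.
    * Turn weighting on, then rerun for weighted frequencies.
    WEIGHT BY weightvar.
    FREQUENCIES VARIABLES = age, score /STATISTICS = DEFAULT.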

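Weights in STATA
The tabulate command does not accept pweights, so the weight is declared with svyset and the tables are requested through the svy prefix.

    clear
    use "c:\temp\test1.dta"
    svyset [pweight=weightvar]
    svy: tabulate score
    svy: tabulate age
    svy: tabulate gender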

Weights for HLM Users
- ECLS-K weights are adjusted for nonresponse.
- ECLS-K weights are not normalized (they sum to the population N rather than the sample n).
- A within-school child-level weight can be approximated by dividing a regular child-level weight by the school-level weight.
- If the analysis includes children who stayed in the same school at each round of the analysis, the school weight (S2SAQW0) can be used as a school-level weight.
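The within-school approximation mentioned above is a simple ratio. The sketch below uses made-up numbers for three children in one school; the function name and the values are hypothetical and only illustrate the division.

```python
def within_school_weights(child_weights, school_weight):
    """Approximate within-school weights: child-level weight / school-level weight."""
    if school_weight <= 0:
        raise ValueError("school weight must be positive")
    return [w / school_weight for w in child_weights]


# Hypothetical child-level weights for three children sampled in the same school,
# and a hypothetical school-level weight for that school.
child_weights = [180.0, 240.0, 210.0]
school_weight = 60.0

print(within_school_weights(child_weights, school_weight))
```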

When selecting a weight, do I have to subset my dataset? What happens to cases where there is no positive weight? What weights do I use if analyzing a subsample of cases? What if I’m running a regression – what weights do I use?

Summary about Weights
- Weights should be used when analyzing data from the ECLS-K.
- The appropriate weight should be selected based on: level of analysis, round(s) of data, and source(s) of data.
- There may not be a “perfect” weight for some analyses. The best weight can be determined with some descriptive analysis.

Variance, Calculating Standard Errors
Why are standard errors important? Why not use standard errors that assume a simple random sample (SRS)? How to use “exact” methods for estimating standard errors. How to use approximation methods for estimating standard errors.

Why are Standard Errors Important?
Standard errors are produced for estimates from sample surveys. They are a measure of the variance in the estimates associated with the selected sample being one of many possible samples. Standard errors are used to test hypotheses and to study group differences when making inferences to a population. Using inaccurate standard errors can lead to identification of statistically significant results where none are present and vice versa.

Important Considerations
All weights on the ECLS-K data files sum to population totals and not sample totals. The ECLS-K has a complex sample design and is not a simple random sample.

The ECLS-K Sample Design: Oversampling
The ECLS-K includes oversamples of private schools, and private school children. The ECLS-K also oversamples Asian and Pacific Islander children.

The ECLS-K Sample Design: Clustering
Sample children were clustered within primary sampling units (PSUs) to reduce field costs. Children were in closer geographical proximity than would occur in a simple random sample. Children in a clustered sample tend to be more alike than those in a simple random sample.

Complex Samples and Standard Errors
The usual standard error formula assumes a simple random sample. Standard errors for estimates from a complex sample must account for the within cluster/across cluster variation. Special software can make the adjustment, or this adjustment can be approximated using the design effect.

Options
- Exact methods, such as the Taylor series and replication techniques.
- Approximation method.

Exact Methods: Taylor Series
Extract PSU and strata IDs from the data file.
Software available: SUDAAN, STATA (using SVY commands), and SAS (using PROC SURVEY commands).

Exact Methods Replication Techniques
Extract replication weights (90 of them). ECLS-K replication weights use jackknife 2 (JK2) methods. Software: WESVAR replication series (JK2), AM (JK2), and SAS callable SUDAAN.
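A minimal sketch of the replication idea, assuming the usual JK2 variance estimator: the sum, over replicates, of the squared difference between the replicate estimate and the full-sample estimate. The toy data, the two replicate weight vectors, and the function names are hypothetical; an actual ECLS-K analysis would use the 90 replicate weights that accompany the chosen full-sample weight.

```python
import math


def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def jk2_standard_error(values, full_weights, replicate_weights):
    """Return (estimate, standard error) using JK2-style replicate weights."""
    estimate = weighted_mean(values, full_weights)
    # Each replicate re-estimates the statistic with its own weight vector.
    variance = sum(
        (weighted_mean(values, rep) - estimate) ** 2 for rep in replicate_weights
    )
    return estimate, math.sqrt(variance)


# Toy example: four cases forming two variance strata of two cases each.
values = [4.0, 2.0, 1.0, 5.0]
full_weights = [1.0, 1.0, 1.0, 1.0]
replicate_weights = [
    [2.0, 0.0, 1.0, 1.0],  # stratum 1: one case doubled, its pair dropped
    [1.0, 1.0, 2.0, 0.0],  # stratum 2: one case doubled, its pair dropped
]

estimate, se = jk2_standard_error(values, full_weights, replicate_weights)
print(f"estimate = {estimate:.3f}, JK2 standard error = {se:.3f}")
```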

Approximation Method
Two stages:
1. First, normalize weights so the standard error is based on actual sample size rather than population size.
2. Then, use the design effect (DEFF) to account for the complex sampling design.

1) Normalizing Weights *
Weights on the ECLS-K sum to the population totals. Calculate a new weight that sums to the sample size. Normalized weights = (ECLS-K weight) * (sample n/population N). *SAS users do not need this step since estimates are produced based on the actual sample size.

Example – Normalizing Weights
- Weight to be normalized: C2PW0
- Sum of weights: 3,865,946
- Total number of cases with a positive weight: 18,950
- Normalized weight = C2PW0 * (18,950 / 3,865,946)
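The normalization step can be sketched as follows. The function name and the toy weights are hypothetical; the only figures taken from the slide are the sum of weights and the count of cases with a positive weight for C2PW0.

```python
def normalize_weights(weights):
    """Rescale weights so the positive weights sum to the number of positive-weight cases."""
    positive = [w for w in weights if w > 0]
    if not positive:
        raise ValueError("at least one positive weight is required")
    factor = len(positive) / sum(positive)  # sample n / population N
    # Cases without a positive weight stay at zero and drop out of the analysis.
    return [w * factor if w > 0 else 0.0 for w in weights]


# Toy weights (hypothetical): three usable cases and one case with a zero weight.
toy_weights = [150.0, 250.0, 0.0, 600.0]
normalized = normalize_weights(toy_weights)
print(normalized, sum(normalized))

# Multiplier implied by the slide's C2PW0 example: 18,950 cases / 3,865,946 sum of weights.
c2pw0_factor = 18_950 / 3_865_946
print(f"C2PW0 normalization factor = {c2pw0_factor:.6f}")
```

After normalization the weights keep their relative sizes, but their total equals the number of cases that contribute to the analysis.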

The ECLS-K has a complex sample design; it is not a simple random sample. Software packages designed for simple random samples tend to underestimate the standard errors for complex sample designs. Special methods are required for complex designs.

Using Design Effects (DEFF)
What is a design effect (DEFF)? It’s the ratio of the variance found in actual (complex) sample design to the variance expected in a simple random sample of the same sample size.

Using Design Effects (DEFF)
DEFT = the square root of DEFF = (design standard error / simple random sample standard error).

Example for fall-kindergarten reading scores:
- SE (SRS) = 0.063
- SE (Design) = 0.156

$$\text{DEFF} = \frac{\text{SE}_{\text{Design}}^2}{\text{SE}_{\text{SRS}}^2} = 6.15$$

$$\text{DEFT} = \frac{0.156}{0.063} = \sqrt{6.15} = 2.48$$
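The relationship between the two standard errors, DEFF and DEFT can be checked numerically. This is a sketch with illustrative function names. Note that the rounded standard errors shown on the slide give a DEFF of about 6.13, so the reported 6.15 was presumably computed from unrounded values.

```python
import math


def design_effect(se_design, se_srs):
    """DEFF: ratio of the design-based variance to the SRS variance."""
    return (se_design / se_srs) ** 2


se_srs = 0.063      # standard error assuming a simple random sample
se_design = 0.156   # standard error that accounts for the complex design

deff = design_effect(se_design, se_srs)
deft = math.sqrt(deff)  # equals se_design / se_srs

print(f"DEFF = {deff:.2f}, DEFT = {deft:.2f}")
```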

3 Ways of Using the DEFF
1. Multiply the SRS (simple random sample) standard error produced by statistical software (when using normalized weights) by the square root of the DEFF (DEFT).
2. Adjust the t-statistic by dividing it by the square root of the design effect (DEFT), or adjust the F-statistic by dividing it by the DEFF.
3. Adjust the weight such that an adjusted standard error is produced:
   - First, create a weight that sums to the sample size (normalized weight).
   - Second, divide this normalized weight by the DEFF.
   - Third, use this weight for analyses. The standard errors produced will approximate the standard errors obtained using “exact” methods.
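The three adjustments lead to the same result, which a short sketch can show. It assumes a simplified standard-error formula in which the software treats the sum of the weights as the sample size; real packages may differ in detail (for example in degrees-of-freedom corrections). The data, the normalized weights, and the function name are hypothetical, and the DEFF of 6.15 is the value quoted in the slides.

```python
import math


def srs_se_of_weighted_mean(values, weights):
    """SE of a weighted mean, treating the sum of weights as the sample size (assumption)."""
    total_w = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total_w
    variance = sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / total_w
    return mean, math.sqrt(variance / total_w)


deff = 6.15
deft = math.sqrt(deff)

values = [4.0, 2.0, 1.0, 5.0, 2.0]
normalized_w = [0.5, 1.0, 2.0, 0.5, 1.0]  # hypothetical weights that sum to n = 5

mean, se_srs = srs_se_of_weighted_mean(values, normalized_w)

# Way 1: inflate the SRS standard error by DEFT.
se_way1 = se_srs * deft

# Way 2: deflate the t-statistic by DEFT.
t_adjusted = (mean / se_srs) / deft

# Way 3: divide the normalized weights by DEFF and rerun the same SRS formula.
adjusted_w = [w / deff for w in normalized_w]
_, se_way3 = srs_se_of_weighted_mean(values, adjusted_w)

print(f"SRS SE = {se_srs:.4f}, way 1 = {se_way1:.4f}, way 3 = {se_way3:.4f}")
print(f"adjusted t = {t_adjusted:.4f}")
```

Dividing the weights by the DEFF shrinks the effective sample size by that factor, which is why the third approach reproduces the first.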

Where to find ECLS-K DEFF’s
- Training material: “ECLS-K Specifications for Computing Standard Errors”
- ECLS-K users’ manuals:
  - Base Year (Kindergarten): Table 4.12
  - First Grade: Tables 4.13 and 9.4
  - Third Grade: Tables 4.14 and 9.2
  - Fifth Grade: Tables 4.19 and 9.2

For SAS Users
- SAS base procedures such as PROC REG, PROC FREQ, and PROC MEANS do account for the actual sample size but not for complex sampling.
- SAS procedures such as PROC SURVEYMEANS and PROC SURVEYREG (and other procedures that begin with “SURVEY”) use the Taylor series method to account for complex sampling and provide exact estimates of the standard errors.

PROC SURVEYREG Example
Example using ECLS-K data, spring kindergarten and spring first grade variables.

    proc surveyreg data = fscores;
      model c4r3mscl = c2r3mscl lowkread t4learn;
      strata c24cstr;    /* sampling stratum ID */
      cluster c24cpsu;   /* primary sampling unit (PSU) ID */
      weight c24cw0;
      where lowkmath = 0;
    run;

PROC SURVEYLOGISTIC Example
Example using ECLS-K data, spring kindergarten and spring first grade variables.

    proc surveylogistic data = fscores;
      model lowkread (desc) = c2r3mscl t4learn;
      strata c24cstr;    /* sampling stratum ID */
      cluster c24cpsu;   /* primary sampling unit (PSU) ID */
      weight c24cw0;
      where lowkmath = 0;
    run;

PROC SURVEYFREQ Example
Example using ECLS-K data, spring kindergarten and spring first grade variables.

    proc surveyfreq data = fscores;
      tables lowkread c2r3mscl t4learn;
      strata c24cstr;    /* sampling stratum ID */
      cluster c24cpsu;   /* primary sampling unit (PSU) ID */
      weight c24cw0;
    run;

STATA Code for Complex Design
Logistic Regression Example, 3rd Grade Data

    svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR)
    svy, subpop(male): logit highbmi white

STATA Code for Complex Design
Regression Example, 3rd Grade Data

    svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR)
    svy, subpop(male): regress highbmi white

STATA Code for Complex Design
Means Example, 3rd Grade Data

    svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR)
    svy, subpop(male): mean highbmi female

SPSS for Complex Sample Design
Use the SPSS add-on called SPSS Complex Samples™.
Complex Samples Logistic Regression (CSLOGISTIC): Performs binary logistic regression analysis, as well as multinomial logistic regression (MLR) analysis, for samples drawn by complex sampling methods. The procedure estimates variances by taking into account the sample design used to select the sample, including equal probability and PPS methods, and WR and WOR sampling procedures. Optionally, CSLOGISTIC performs analyses for subpopulations. (Courtesy of SPSS)

Regression Analysis
- Use appropriate software such as AM, WESVAR, SUDAAN or SAS (SURVEYREG procedure).
- For SAS (PROC REG procedure), use DEFF-adjusted weights.
- For SPSS, use normalized, DEFF-adjusted weights.

Summary
All statistical tests should be based on standard errors that are calculated to account for the complex sample design of the ECLS-K.
- Preferred: use software that incorporates JK2 replication methods, or
- use software that incorporates the Taylor series method.
- Last resort: make approximate adjustments based on design effects.

ECLS-K Data Availability
Base Year (Kindergarten) through 5th Grade restricted use and Public Use datasets have been released. 8th Grade restricted use dataset should be released in the winter of 2008 and the public datasets should be released in March 2009.

Differences in Restricted Use and Public Use ECLS-K Datasets.
Here’s a short explanation from the NCES: Chapter 7 in the ECLS-K, 5th Grade User’s Guide has Tables 7-15 and 7-16 that describe the differences in the public and restricted datasets. The User’s Guide can be found online at: