Working with the ECLS-K Datasets: Weights and other issues. Information is courtesy of the Institute of Education Sciences, National Center for Education Statistics, and is used in their training seminars.
Sampling Weights What are sampling weights and why are they important? How are weights used? What weights are on the ECLS-K data files and when should they be used?
What is a Weight? A weight is used to indicate the relative strength of an observation. In the simplest case, each observation is counted equally. For example, if we have five observations and wish to calculate the mean, we just add up the values and divide by 5.
How are Weights Used? Dataset with 5 cases. Values: 4, 2, 1, 5, 2. Weights: 1, 2, 4, 1, 2. Sample mean = (4 + 2 + 1 + 5 + 2)/5 = 2.8. Weighted mean = [(4*1) + (2*2) + (1*4) + (5*1) + (2*2)]/(sum of weights) = (4 + 4 + 4 + 5 + 4)/10 = 2.1.
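The five-case arithmetic can be checked with a short Python sketch. The function names are illustrative and are not part of any ECLS-K tooling; the values and weights are the ones on the slide.

```python
def unweighted_mean(values):
    """Plain sample mean: every case counts equally."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: each value counts in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


values = [4, 2, 1, 5, 2]
weights = [1, 2, 4, 1, 2]

print(unweighted_mean(values))         # 14 / 5
print(weighted_mean(values, weights))  # 21 / 10
```

The case with value 1 carries a weight of 4, which is why the weighted mean is pulled below the unweighted one.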
What is the Difference Between Weighted and Unweighted Data? With unweighted data, each case is counted equally. Unweighted data represent only those in the sample who provide data. With weighted data, each case is counted relative to its representation in the population. Weights allow analyses that represent the target population.
ECLS-K and Weights The ECLS-K is a sample, i.e. the entire population was not surveyed. The ECLS-K is not a simple random sample (SRS). That is, not all schools, teachers, and children had an equal probability of selection. Not all schools, teachers, and children participated.
Why Use Weights in the ECLS-K? The ECLS-K weights allow you to make statements about the population of U.S. children who were in kindergarten in 1998-99 or in first grade in 1999-2000. Without weights, estimates are not nationally representative. Weights adjust for differential selection probabilities and reduce the bias associated with nonresponse by adjusting for differential nonresponse.
Examples of Weighted vs. Unweighted Data: Base Year. Each characteristic lists three percentages in this order: unweighted; weighted for sampling design (base weight); weighted for sampling design and nonresponse (C1CW0).
Race/Ethnicity: White 57, 56, 58; Black 15, 16 (third value missing in the source); Hispanic 18, 20, 19; Asian 6, 3, 3.
School Type: Public 78, 87, 85; Private 22, 13, 15.
Examples of Weighted vs. Unweighted Data: First Grade. Each characteristic lists three percentages in this order: unweighted; weighted for sampling design (base weight); weighted for sampling design and nonresponse (C4PW0).
Household SES: Bottom 20% 17, 19, 20; Middle 60% 59, 60 (third value missing in the source); Highest 20% 24, 21, 20.
Family Type: Two parents 78, 76, 73; Single parent 20, 22, 24; Other 2, 2, 2.
Types of Weights on the ECLS-K Weights vary according to: Level of analysis: child, teacher, or school (only child-level after base year). Round(s) of data: cross-sectional or longitudinal. Source(s) of data: child assessment, parent interview, and/or teacher questionnaires.
Level of Analysis – Base Year The first element in a weight variable name indicates the level of analysis. Weights for school-level analyses begin with S. Weights for teacher-level analyses begin with B. Weights for child-level analyses begin with C (cross-sectional). Weights for child-level analyses begin with BY (longitudinal).
Level of Analysis – 1st, 3rd, and 5th Grades Weights for child-level analyses (cross-sectional and longitudinal) begin with C. One exception: weight Y2COMW0 is for child-level analyses of assessment data from rounds 1, 2, and 4 and parent and/or teacher data from spring of first grade, and one or more base year rounds of parent and/or teacher data.
Data Round(s) The second element in a weight variable name indicates the round(s) of data. Weights for cross-sectional analyses have a single round number: 1, 2, 3, 4, 5, or 6. Weights for longitudinal analyses have 2 or more numbers, for example: 45 for rounds 4 and 5; 124 for rounds 1, 2, and 4 (exception in Y2COMW0); 1_4 for rounds 1, 2, 3, and 4; 1_6F for rounds 1, 2, 4, 5, and 6 (F = full sample); 1_5S for rounds 1, 2, 3, 4, and 5 (S = subsample).
Source of the Data The third element in a weight variable name indicates the source(s) of data. Weights for analyses using data from: child assessments (alone or in conjunction with any combination of a limited set of child characteristics, e.g. age, sex, race/ethnicity) have a C; the parent interview (with or without child data) have a P; child AND parent AND teacher have a CPT. In 5th grade, the CPT is followed by R, M, or S for the reading, math, or science teacher.
Sources of the Data Two exceptions: BYCOMW0: Child assessment data from fall AND spring kindergarten in conjunction with one or more rounds of parent and/or teacher base year data. Y2COMW0: Child direct assessment data from fall AND spring kindergarten AND spring first grade, in conjunction with parent and/or teacher data from spring first grade, AND one or more base year rounds of parent and/or teacher data.
Source of the Data Sources that do not affect the choice of weight: school administrator questionnaire; facilities checklist; teacher questionnaire C; special education questionnaires; student record abstract data; Head Start data; salary and benefits data.
Example C23PW0 C for child-level analysis. 23 for analysis of data from rounds 2 and 3. P for analysis of parent interview data.
Example C6CPTM0 C for child-level analysis. 6 for analysis of data from round 6. CPTM for analysis of child, parent, and math teacher data.
Cross-sectional Examples: C1PW0 -- Child-level analyses from round 1, parent interview data (with or without child assessment data). B1TW0 -- Teacher level analyses (teacher data) from round 1. S2SAQW0 -- School-level analysis (SAQ data) from round 2. C6CW0 -- Child assessment data from round 6. C5CPTW0 -- Child-level analyses from round 5 with child, parent AND teacher data.
Longitudinal Examples BYPW0 – Round 1 and 2 parent interview data. BYCOMW0 – Round 1 and 2 assessment data and some other parent and teacher data. C24PW0 – Round 2 and 4 parent interview data. C245CW0 – Round 2, 4, and 5 assessment data. C1_6FC0 – Round 1, 2, 4, 5, and 6 assessment data. All longitudinal weights are for child-level analyses.
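The three-element naming convention can be sketched as a small Python parser. This is an illustration only: `parse_weight_name`, the regular expression, and the returned dictionary layout are assumptions made here, they cover just the name patterns that appear on these slides, and the two documented exceptions (BYCOMW0 and Y2COMW0) are handled as special cases rather than parsed.

```python
import re

# First element: level of analysis (BY = base-year longitudinal, child level).
LEVELS = {"C": "child", "B": "teacher", "S": "school", "BY": "child"}

# Documented exceptions that do not follow the three-element pattern.
EXCEPTIONS = {"BYCOMW0", "Y2COMW0"}

# level, optional round code (e.g. 2, 23, 1_4, 1_6F), source letters, trailing 0
NAME_RE = re.compile(r"^(BY|[CBS])(\d+(?:_\d+[FS]?)?)?([A-Z]+)0$")


def parse_weight_name(name):
    """Split a weight variable name into level, round code, and source."""
    name = name.upper()
    if name in EXCEPTIONS:
        return {"level": "child", "rounds": None,
                "source": "COM", "longitudinal": True}
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized weight name: {name}")
    level, rounds, source = match.groups()
    if source.endswith("W"):  # drop the trailing W that precedes the 0
        source = source[:-1]
    longitudinal = level == "BY" or (rounds is not None and len(rounds) > 1)
    return {"level": LEVELS[level], "rounds": rounds,
            "source": source, "longitudinal": longitudinal}


for example in ["C23PW0", "C6CPTM0", "B1TW0", "S2SAQW0", "BYPW0", "C1_6FC0"]:
    print(example, parse_weight_name(example))
```

The round code is returned as written (for example `1_6F`) rather than expanded, since the mapping from codes to rounds is listed on the Data Round(s) slide and is not regular.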
Third and Fifth-Grade Weights Unlike the first grade sample, the ECLS-K sample was not freshened in third and fifth grade. The ECLS-K sample does not represent all third graders in 2001-02 or fifth graders in 2003-04. These samples represent all children who began kindergarten in 1998 or began first grade in 1999.
How to Use Weights In SAS, use the WEIGHT statement. In SPSS, use the WEIGHT BY statement. Key Fact: All ECLS-K weights sum to population totals.
Weights in SAS SAS uses the WEIGHT statement in various PROCedures. PROC FREQ data = test; Tables Age Gender Score; Weight weightvar; Run;
Weights in SPSS LIST VARIABLES = age to weightvar. Frequencies variables = age, score /sta=default. weight by weightvar. frequencies variables = age, score /sta=default.
Weights in STATA clear use "c:\temp\test1.dta" tabulate score age gender [pweight=weightvar]
Weights for HLM Users ECLS-K weights are adjusted for nonresponse. ECLS-K weights are not normalized (they sum to the population N rather than the sample n). A within-school child-level weight can be approximated by dividing a regular child-level weight by the school-level weight. If the analysis includes children who stayed in the same school at each round of the analysis, the school weight (S2SAQW0) can be used as a school-level weight.
Other Frequently Asked Questions When selecting a weight, do I have to subset my dataset? What happens to cases where there is no positive weight? What weights do I use if analyzing a subsample of cases? What if I'm running a regression: what weights do I use?
Summary about Weights Weights should be used when analyzing data from the ECLS-K. The appropriate weight should be selected based on: Level of analysis, Round(s) of data, and Source(s) of data. There may not be a perfect weight for some analyses. The best weight can be determined with some descriptive analysis.
Variance, Calculating Standard Errors Why are standard errors important? Why not use standard errors that assume a simple random sample (SRS)? How to use exact methods for estimating standard errors. How to use approximation methods for estimating standard errors.
Why are Standard Errors Important? Standard errors are produced for estimates from sample surveys. They are a measure of the variance in the estimates associated with the selected sample being one of many possible samples. Standard errors are used to test hypotheses and to study group differences when making inferences to a population. Using inaccurate standard errors can lead to identification of statistically significant results where none are present and vice versa.
Important Considerations All weights on the ECLS-K data files sum to population totals and not sample totals. The ECLS-K has a complex sample design and is not a simple random sample.
The ECLS-K Sample Design: Oversampling The ECLS-K includes oversamples of private schools and private school children. The ECLS-K also oversamples Asian and Pacific Islander children.
The ECLS-K Sample Design: Clustering Sample children were clustered within primary sampling units (PSUs) to reduce field costs. Children were in closer geographical proximity than would occur in a simple random sample. Children in a clustered sample tend to be more alike than those in a simple random sample.
Complex Samples and Standard Errors The usual standard error formula assumes a simple random sample. Standard errors for estimates from a complex sample must account for the within cluster/across cluster variation. Special software can make the adjustment, or this adjustment can be approximated using the design effect.
Options Exact Methods such as the TAYLOR series and REPLICATION techniques. Approximation Method
Exact Methods Taylor series Extract PSU and strata IDs from data file. Software available: SUDAAN, STATA (using SVY commands), and SAS (using PROC SURVEY commands).
Exact Methods Replication Techniques Extract replication weights (90 of them). ECLS-K replication weights use jackknife 2 (JK2) methods. Software: WESVAR replication series (JK2), AM (JK2), and SAS-callable SUDAAN.
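To show what the replication software does with those weights, here is a minimal Python sketch of a replicate-weight standard error for a weighted mean. It assumes the usual JK2 variance estimator, the sum of squared deviations of the replicate estimates from the full-sample estimate with a multiplier of 1; confirm this against the user's manual before relying on it. The function names and the tiny data set with two replicate weights are invented for illustration, whereas the ECLS-K files carry 90 replicate weights per full-sample weight.

```python
def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jk2_standard_error(values, full_weights, replicate_weights):
    """Return (estimate, standard error) using JK2-style replication.

    The estimate uses the full-sample weight. The variance is the sum,
    over replicates, of the squared difference between each replicate
    estimate and the full-sample estimate.
    """
    estimate = weighted_mean(values, full_weights)
    variance = sum(
        (weighted_mean(values, rep) - estimate) ** 2
        for rep in replicate_weights
    )
    return estimate, variance ** 0.5


# Invented toy data: 4 cases, one full-sample weight, 2 replicate weights.
scores = [1.0, 2.0, 3.0, 4.0]
full = [1.0, 1.0, 1.0, 1.0]
replicates = [
    [2.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0, 0.0],
]

estimate, se = jk2_standard_error(scores, full, replicates)
print(estimate, se)
```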
Approximation Method Two stages: First, normalize weights so standard error is based on actual sample size rather than population size. Then, use design effect (DEFF) to account for complex sampling design.
1) Normalizing Weights * Weights on the ECLS-K sum to the population totals. Calculate a new weight that sums to the sample size. Normalized weights = (ECLS-K weight) * (sample n/population N). *SAS users do not need this step since estimates are produced based on the actual sample size.
Example – Normalizing Weights Weight to be normalized: C2PW0 Sum of weights: 3,865,946 Total number of cases with a positive weight: 18,950 Normalized weight = C2PW0 * (18,950 / 3,865,946)
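The normalization step can be sketched in Python as follows. The function name and the toy weights are illustrative assumptions; only the C2PW0 totals come from the slide. Cases with a zero weight are left at zero and are not counted in the sample size, matching the slide's "cases with a positive weight".

```python
def normalize_weights(weights):
    """Rescale weights so the positive weights sum to their own count."""
    positive = [w for w in weights if w > 0]
    factor = len(positive) / sum(positive)  # sample n / population N
    return [w * factor for w in weights]


# Factor for C2PW0 using the totals quoted on the slide.
c2pw0_factor = 18950 / 3865946
print(c2pw0_factor)

# Toy weights (invented): three respondents and one zero-weight case.
toy = [150.0, 250.0, 200.0, 0.0]
normalized = normalize_weights(toy)
print(normalized, sum(normalized))
```

After normalization the weights sum to the number of positive-weight cases, so software that reads the sum of weights as the sample size no longer treats the data as if millions of children had been observed.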
2) Adjusting for Complex Design The ECLS-K has a complex sample design; it is not a simple random sample. Software packages designed for simple random samples tend to underestimate the standard errors for complex sample designs. Special methods are required for complex designs.
Using Design Effects (DEFF) What is a design effect (DEFF)? It's the ratio of the variance found in the actual (complex) sample design to the variance expected in a simple random sample of the same sample size.
Using Design Effects (DEFF) DEFT = the square root of DEFF = (design standard error / simple random sample standard error). Example for fall-kindergarten reading scores: SE (SRS) = 0.063; SE (Design) = 0.156; DEFF = 0.156^2 / 0.063^2 = 6.15; DEFT = 0.156 / 0.063 = square root of 6.15 = 2.48.
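A short Python sketch of the same calculation, using the two standard errors quoted on the slide. The function names are illustrative. With the rounded inputs shown, the squared ratio comes out near 6.13 rather than 6.15; the slide's figure presumably reflects unrounded standard errors or squaring the rounded DEFT of 2.48.

```python
def deft(se_design, se_srs):
    """DEFT: ratio of the design-based SE to the SRS SE."""
    return se_design / se_srs


def deff(se_design, se_srs):
    """DEFF: ratio of the variances, i.e. DEFT squared."""
    return deft(se_design, se_srs) ** 2


se_srs = 0.063     # SE assuming a simple random sample (from the slide)
se_design = 0.156  # SE accounting for the complex design (from the slide)

print(round(deft(se_design, se_srs), 2))
print(round(deff(se_design, se_srs), 2))
```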
3 Ways of Using the DEFF Multiply the SRS (simple random sample) standard error produced by statistical software (when using normalized weights) by the square root of the DEFF (DEFT). Or Adjust the t-statistic by dividing it by the square root of the design effect (DEFT) or adjust the F-statistic by dividing it by the DEFF. Or Adjust the weight such that an adjusted standard error is produced.
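The first two adjustments amount to one line of arithmetic each, sketched below in Python. The function names are illustrative, and the inputs are assumed to be SRS-based statistics obtained with normalized weights.

```python
import math


def adjust_standard_error(se_srs, deff):
    """Inflate an SRS standard error by DEFT = sqrt(DEFF)."""
    return se_srs * math.sqrt(deff)


def adjust_t_statistic(t_srs, deff):
    """Deflate an SRS t-statistic by DEFT = sqrt(DEFF)."""
    return t_srs / math.sqrt(deff)


def adjust_f_statistic(f_srs, deff):
    """Deflate an SRS F-statistic by DEFF."""
    return f_srs / deff


# Using the fall-kindergarten reading example: SE (SRS) = 0.063, DEFF = 6.15.
print(adjust_standard_error(0.063, 6.15))  # close to the design SE of 0.156
print(adjust_t_statistic(4.0, 6.15))       # 4.0 is an invented t value
print(adjust_f_statistic(12.0, 6.15))      # 12.0 is an invented F value
```

A t-statistic that looks clearly significant under SRS assumptions can fall below conventional thresholds once it is divided by a DEFT of about 2.5.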
Using a DEFF-Adjusted Weight First, create a weight that sums to the sample size (the normalized weight). Second, divide this normalized weight by the DEFF. Third, use this weight for analyses. The standard errors produced will approximate the standard errors obtained using exact methods.
Where to find ECLS-K DEFFs Training material: ECLS-K Specifications for Computing Standard Errors. ECLS-K user's manuals: Base Year (Kindergarten): Table 4.12. First Grade: Tables 4.13 and 9.4. Third Grade: Tables 4.14 and 9.2. Fifth Grade: Tables 4.19 and 9.2.
For SAS Users SAS base procedures such as PROC REG, PROC FREQ, and PROC MEANS do account for the actual sample size but not for complex sampling. SAS procedures such as PROC SURVEYMEANS and PROC SURVEYREG (and other procedures that begin with SURVEY) use the Taylor series method to account for complex sampling and provide exact estimates of the standard errors.
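The weight adjustment can be sketched in Python as below. The function name and toy weights are illustrative assumptions; the DEFF of 6.15 is the fall-kindergarten reading example, and in practice the DEFF would be looked up for the specific estimate being analyzed.

```python
def deff_adjusted_weights(weights, deff):
    """Normalize weights to the sample size, then divide by DEFF."""
    positive = [w for w in weights if w > 0]
    factor = len(positive) / sum(positive)  # sample n / population N
    return [w * factor / deff for w in weights]


# Toy weights (invented): three respondents and one zero-weight case.
toy = [150.0, 250.0, 200.0, 0.0]
adjusted = deff_adjusted_weights(toy, deff=6.15)
print(adjusted, sum(adjusted))
```

The adjusted weights sum to n / DEFF, the effective sample size, so SRS-based software reports standard errors that are larger by roughly a factor of DEFT.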
PROC SURVEYREG Example Example using ECLS-K data, spring kindergarten and spring first grade variables. proc surveyreg data = fscores; model c4r3mscl = c2r3mscl lowkread t4learn; strata c24cstr; cluster c24cpsu; weight c24cw0; where lowkmath = 0; run;
PROC SURVEYLOGISTIC Example Example using ECLS-K data, spring kindergarten and spring first grade variables. proc surveylogistic data = fscores; model lowkread (desc) = c2r3mscl t4learn; strata c24cstr; cluster c24cpsu; weight c24cw0; where lowkmath = 0; run;
PROC SURVEYFREQ Example Example using ECLS-K data, spring kindergarten and spring first grade variables. proc surveyfreq data = fscores; tables lowkread c2r3mscl t4learn; strata c24cstr; cluster c24cpsu; weight c24cw0; run;
STATA Code for Complex Design Logistic Regression Example, 3rd Grade Data svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR) svy, subpop(male): logit highbmi white
STATA Code for Complex Design Regression Example, 3rd Grade Data svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR) svy, subpop(male): reg highbmi white
STATA Code for Complex Design Means Example, 3rd Grade Data svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR) svy, subpop(male): mean highbmi female
SPSS for Complex Sample Design Use the SPSS add-on called SPSS Complex Samples. Complex Samples Logistic Regression (CSLOGISTIC): Performs binary logistic regression analysis, as well as multiple logistic regression (MLR) analysis, for samples drawn by complex sampling methods. The procedure estimates variances by taking into account the sample design used to select the sample, including equal probability and PPS methods, and WR and WOR sampling procedures. Optionally, CSLOGISTIC performs analyses for subpopulations. Courtesy of SPSS
Regression Analysis Use appropriate software such as AM, WESVAR, SUDAAN or SAS (SURVEYREG procedure). For SAS (PROC REG procedure), use DEFF-adjusted weights. For SPSS, use normalized, DEFF-adjusted weights.
Summary Preferred: Use software that incorporates JK2 replication methods, or Use software that incorporates Taylor series method, or Last resort: Make approximate adjustments based on design effects. All statistical tests should be based on standard errors that are calculated to account for the complex sample design of the ECLS-K.
ECLS-K Data Availability Base Year (Kindergarten) through 5th Grade restricted-use and public-use datasets have been released. The 8th Grade restricted-use dataset should be released in the winter of 2008 and the public-use datasets should be released in March 2009.
Differences in Restricted Use and Public Use ECLS-K Datasets. Here's a short explanation from the NCES: http://nces.ed.gov/ecls/kinderfaq.asp?faq=1 Chapter 7 in the ECLS-K 5th Grade User's Guide has Tables 7-15 and 7-16, which describe the differences between the public and restricted datasets. The User's Guide can be found online at: http://sodapop.pop.psu.edu/codebooks/ecls/k5userpart2.pdf