Creating Something from Nothing: Working with Synthetic Files

1 Creating Something from Nothing: Working with Synthetic Files
Bo Wandschneider University of Guelph DLI Training: April 2004, Kingston 13/12/2019

2 Outline
NLSCY background; types of microdata files; which microdata file to use; providing services for synthetic files.
This presentation is a modification of a workshop that Chuck Humphrey and I presented at the May 2003 National DLI Training and a presentation Chuck gave at the Accoleds DLI training, 2003.

3 NLSCY
The National Longitudinal Survey of Children and Youth (NLSCY) is a long-term study of Canadian children that follows their development and well-being from birth to early adulthood. The NLSCY began in 1994 and is jointly conducted by Statistics Canada and Human Resources Development Canada.

4 NLSCY
There are 4 cycles. There are 8 different files. 2 of these are available as a PUMF: Primary and Self-Reporting. The rest include Secondary and those based on people reporting about the child (teacher, principal…).

5 Types of Microdata Files
Confidential microdata products: master files; share files.
Public access microdata products: public use anonymized microdata files (PUMFs); synthetic files.

6 Microdata Products
Microdata: raw data organized in a file where the records or lines in the file are observations of a specific unit of analysis and the information on each line is the values of variables. Microdata require some form of processing or analysis to be used.
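
To make the definition concrete, here is a minimal sketch in Python. The records, variable names, and values are all made up for illustration; they are not taken from any Statistics Canada file. Each record is one observation of the unit of analysis, and a small amount of processing (a frequency count) is needed before the data say anything.

```python
from collections import Counter

# Hypothetical microdata: each record is one observation of the unit of
# analysis (here, a child); each key is a variable. All values are invented.
records = [
    {"child_id": 1, "age": 4, "province": "ON"},
    {"child_id": 2, "age": 7, "province": "QC"},
    {"child_id": 3, "age": 4, "province": "ON"},
]


def frequency(rows, variable):
    """Count how often each value of `variable` occurs across the records."""
    return Counter(row[variable] for row in rows)


province_counts = frequency(records, "province")
print(dict(province_counts))
```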

7 Microdata Products
NLSCY cycle 3, primary

8 Confidential Microdata
Master Files: These files contain the fullness of detail captured about the unit of observation. The information in these files could identify the individual who provided the original information, so the files are considered confidential.

9 Confidential Microdata
Master File – Example

10 Confidential Microdata
Master File – detailed identifiers

11 Confidential Microdata
Master File – geography

12 Confidential Microdata
Master File – fullness of data

13 Confidential Microdata
Master File – fullness of data

14 Confidential Microdata
Master File – fullness of data

15 Confidential Microdata
Share Files: these are confidential files for which the respondents have signed a consent form permitting Statistics Canada to allow access to their information for approved research. Used with the NPHS and NLSCY.

16 Public Access Microdata
Anonymized Microdata: these microdata are specially prepared to minimize the possibility of disclosing or identifying any of the cases or observations. The original data from the master file are edited to create a public use microdata file.

17 Public Access Microdata
Steps in Anonymizing Microdata: remove all personal identifiers; include only gross levels of geography; collapse detailed information into fewer general categories or cap values; suppress the values of a variable.
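
The four steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the records, the variable names, the age groups, and the income cap are all invented, and this is not how Statistics Canada actually prepares a PUMF.

```python
# Hypothetical master-file records; every name and value here is invented.
master = [
    {"name": "A. Smith", "postal_code": "N1G 2W1", "province": "ON",
     "age": 34, "income": 250000, "occupation": "surgeon"},
    {"name": "B. Roy", "postal_code": "H3A 0G4", "province": "QC",
     "age": 67, "income": 41000, "occupation": "teacher"},
]

IDENTIFIERS = {"name"}            # step 1: personal identifiers to remove
FINE_GEOGRAPHY = {"postal_code"}  # step 2: keep only gross geography (province)
INCOME_CAP = 100000               # step 3: illustrative cap (top-code) value
SUPPRESSED = {"occupation"}       # step 4: variables whose values are suppressed


def age_group(age):
    """Step 3: collapse single years of age into a few broad categories."""
    if age < 25:
        return "under 25"
    if age < 65:
        return "25-64"
    return "65+"


def anonymize(record):
    """Apply the four steps to one record and return a new dict."""
    dropped = IDENTIFIERS | FINE_GEOGRAPHY
    public = {k: v for k, v in record.items() if k not in dropped}
    public["age"] = age_group(record["age"])
    public["income"] = min(record["income"], INCOME_CAP)
    for variable in SUPPRESSED:
        public[variable] = None  # the variable stays, its value is blanked
    return public


pumf = [anonymize(r) for r in master]
```

Each step loses information in a different way, which is why the choice of file matters later: a researcher who needs postal codes or exact incomes cannot recover them from the anonymized records.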

18 Public Access Microdata
Statistics Canada PUMFs: only available for select social surveys that undergo a review by the Data Release Committee, an internal Statistics Canada committee; no ‘enterprise’ public use microdata.

19 Public Access Microdata
Statistics Canada PUMFs: almost all are cross-sectional, that is, they represent data collected at one point in time; longitudinal data are difficult to anonymize while maintaining any useful information.

20 Public Access Microdata
PUMFs – personal identifiers

21 Public Access Microdata
PUMFs – gross geography

22 Public Access Microdata
PUMFs – collapsed data

23 Public Access Microdata
PUMFs – suppressed variables. Note: from the MASTER file, NOT the PUMF.

24 Public Access Microdata
Synthetic Files: These microdata do not contain actual ‘real’ cases but are pseudo-cases that, for some surveys, provide aggregate results close to those from the ‘real’ cases.

25 Public Access Microdata
Synthetic Files: They have been prepared so that analysis runs for the master file can be created without the possibility of disclosing or identifying any of the cases.

26 Public Access Microdata
Synthetic Files: The results are not to be reported, but are strictly to be used to prepare analyses of master files. Usually associated with longitudinal files.

27 Public Access Microdata
Steps in creating Synthetic Files: observations are transformed; no records actually exist; the fullness of variable description is kept; how the files are made is kept confidential.

28 Public Access Microdata
Synthetic Files – NLSCY

29 Public Access Microdata
Synthetic Files – NPHS 1999 General File, PUMF vs. Synthetic. Obs: 49046. Var: 176 (PUMF), 400 (Synthetic).

30 Implications for Analysis
What are the implications in doing analysis with these different types of microdata files?

31 Implications for Analysis
Master File: all observations; has the most variables with the most detail; lots of geography and personal characteristics; little grouping or capping of categories.

32 Implications for Analysis
Master File: restricted access, only available to authorized Statistics Canada employees, which includes ‘deemed employees’; use of the analysis is controlled through a contract.

33 Implications for Analysis
Master File: includes linkage variables across files within a study, e.g., NLSCY linkage among the files for different units of analysis (kids, parents, teachers…).

34 Implications for Analysis
Public Use Microdata (PUMF): valuable content for a tremendous amount of research. Suppressed observations; suppressed variables; suppressed content; gross geography; collapsed categories; capped variables. Issues arise when smaller-area geography is desired, rare subpopulations are being studied, or the variables that are needed have been used to anonymize respondents.

35 Implications for Analysis
Public Use Microdata (PUMF): licensed product, users agree to certain terms of use; no linkage to multiple units of analysis, with a few exceptions (e.g., GSS Time Use and Family).

36 Implications for Analysis
Synthetic Files: “Looks like a duck and quacks like a duck”, but it isn’t a duck or any other type of fowl.

37 Implications for Analysis
Synthetic Files: look like master files; lots of observations (maybe); lots of variables; little grouping or capping of categories; lots of geographic detail.

38 Synthetic Files Precautions
Results not authentic, but may be close in the aggregate for some synthetic files; use for testing analysis setups only; still need the master files for publishable results.

39 Where do we get Access?
Master File: restricted access governed under the Statistics Act. Remote Job Submission (a.k.a. RDA). Research Data Centres: apply to SSHRC for peer review of the proposal and to STC for security clearance.

40 Where do we get Access?
Public Use Microdata Files (PUMF): get from DLI; analyze where it is convenient; can use a variety of analysis software, including SAS, SPSS, Stata, HLM, LISREL, etc.

41 Where do we get Access?
Synthetic Files: author divisions ‘may’ create them. Most relevant when dealing with new panel data, but not necessarily, e.g., the Census has potential. NLSCY, NPHS & CCHS synthetic files are on the DLI FTP site.

42 Where do we get Access?
Synthetic files: work locally with the file; build SAS and SPSS setups.

43 Which File is Appropriate?
The first stop is still the PUMF. This file has the easiest access for us; it probably meets the needs of most patrons; it is not as administratively burdensome as a synthetic or master file; and it is perfect for clients just looking for ‘data’, e.g., for courses in quantitative analysis.

44 Which File is Appropriate?
If more detail is needed, refer to the Master File documentation; inform patrons that the cost of use is higher, both in terms of accessibility and analytical requirements; interest is most likely to come from grad students and ‘experienced’ researchers.

45 Which File is Appropriate?
Download the synthetic files from DLI; make patrons aware of problems with synthetic files (RESULTS ARE NOT PUBLISHABLE); encourage them to submit an application for RDC access, since there is a time lag.

46 Which File is Appropriate?
RDC

47 Which File is Appropriate?
Some of you may work with a patron using synthetic files before passing her/him off to an RDC.

48 Services for Synthetic Files
DLI Contacts can provide four basic services with synthetic files: build SPSS and SAS system files from the raw synthetic data files that are distributed through DLI; provide information about the use of Remote Job Submission and RDCs;

49 Services for Synthetic Files
assist with finding variables in the synthetic files; provide instruction about ways of capturing SPSS or SAS code from “dummy” analysis runs with the synthetic files. It is this code that is submitted to STC through Remote Job Submission.

50 Services for Synthetic Files
1. Building SPSS and SAS system files for synthetic data. The NLSCY synthetic data are distributed as a raw ASCII file with accompanying command files for SPSS and SAS. Separate synthetic data files exist for each component of the NLSCY; not all components have PUMFs.

51 Services for Synthetic Files
1. Building SPSS and SAS system files for synthetic data. The synthetic data file for the NLSCY cycle 3 primary component has 948 variables and 6,393 fabricated cases. Creating the SPSS and SAS system files from this file is not difficult, but it does take time. DLI Contacts may wish to create these products for their patrons.
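
What the accompanying command files do with the raw ASCII file can be sketched in Python. The sketch assumes a fixed-width record layout, and the variable names and column positions below are hypothetical, not real NLSCY positions; the real layout comes from the SPSS and SAS command files distributed with the data.

```python
import io

# Hypothetical layout: (variable name, start column, end column), 1-based and
# inclusive, in the style of a record layout. These are NOT real NLSCY positions.
LAYOUT = [
    ("PERSID", 1, 4),
    ("AGE", 5, 6),
    ("PROV", 7, 8),
]


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of stripped string values."""
    return {name: line[start - 1:end].strip() for name, start, end in layout}


def read_fixed_width(handle, layout=LAYOUT):
    """Read every non-blank line of a raw ASCII file into a list of records."""
    return [parse_line(line.rstrip("\n"), layout) for line in handle if line.strip()]


# Two made-up records standing in for the raw ASCII data file.
sample = io.StringIO("0001 435\n00021224\n")
cases = read_fixed_width(sample)
```

With 948 variables the layout table is long, which is where the time goes; the reading step itself stays this simple.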

52 Services for Synthetic Files
2. Information about Remote Job Submission (RJS). The author divisions supporting RJS have established their own guidelines and have different operating procedures. Not all divisions supporting longitudinal surveys currently support RJS (e.g., SLID). Therefore, there is a need to track down this information for our patrons.

53 Services for Synthetic Files
2. Information about Remote Job Submission (RJS). For example, the sources for information about RJS include the Centre for Education Statistics:

54

55 Services for Synthetic Files
2. Information about Remote Job Submission (RJS). Where do you find this information? Ask the DLI Team via the DLI List. The EAC has asked for a description of RJS on the DLI website, which should be on the DLI Team’s to-do list.

56 Services for Synthetic Files
2. Information about Research Data Centres. The collection of master files available through RDCs is listed on the STC website for RDCs. Each RDC has its own website describing its services.

57

58 Services for Synthetic Files
3. Data Reference for the content of the synthetic files. Helping researchers identify variables over longitudinal files is an important service; need to keep the unit of analysis straight; need to understand the mnemonic naming convention for variables over cycles; develop indexing aids for you and your patrons.
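
One possible indexing aid is sketched below in Python. It rests on an assumption that should be checked against the survey documentation: that the first character of a variable name identifies the cycle and the remainder identifies the item. The variable names used here are made up for illustration.

```python
from collections import defaultdict

# Assumed convention (check the survey documentation): the first character
# of a variable name identifies the cycle, and the remainder identifies the item.
CYCLE_LETTERS = {"A": 1, "B": 2, "C": 3, "D": 4}


def build_index(variable_names):
    """Map each item stem to the cycles in which it appears."""
    index = defaultdict(dict)
    for name in variable_names:
        if not name:
            continue
        cycle = CYCLE_LETTERS.get(name[0].upper())
        if cycle is None:
            continue  # names outside the assumed pattern are skipped
        index[name[1:].upper()][cycle] = name
    return dict(index)


# Made-up variable names for illustration only.
names = ["AHLTQ01", "BHLTQ01", "CHLTQ01", "CEDUQ05", "PERSRUK"]
index = build_index(names)
```

An index like this lets a patron see at a glance which items were asked in every cycle and which appear only once, which matters when planning a longitudinal analysis.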

59 Services for Synthetic Files
4. Provide helpful tips for preserving the code from “dummy” analysis runs in SPSS and SAS. Researchers will run analyses on the synthetic file to generate the code that they will subsequently submit through Remote Job Submission. Providing information about how to do this easily will be helpful to your patrons.

60 Exercises

