Creating Something from Nothing: Working with Synthetic Files ACCOLEDS /DLI Training: December 2003 Chuck Humphrey University of Alberta.

1 Creating Something from Nothing: Working with Synthetic Files ACCOLEDS /DLI Training: December 2003 Chuck Humphrey University of Alberta

2 Outline: Types of microdata files; Which microdata file to use; Providing services for synthetic files. This presentation is a modification of a workshop that Bo Wandschneider and I presented at the May 2003 National DLI Training program.

3 Types of Microdata Files. Confidential Microdata Products: Master Files; Share Files. Public Access Microdata Products: Public use anonymized microdata files (PUMFs); Synthetic Files

4 Microdata Products. Microdata: raw data organized in a file where the records or lines in the file are observations of a specific unit of analysis and the information on the lines are the values of variables; requires some form of processing or analysis to be used

5 Microdata Products CCHS 0000015959922220611230721241433296101121222222112222222223060.75021.6221102296010400009600960400 000000002266662666666666666666666601166666666631114222222122226622612222266966226662122222221213 666666666666666666666999666615212222222222266666666666666666666666666666666666666666666666666666 6666666666666966666666666666666966666666666666666666666666666666666666000.4001.0000.0000.0000.10 00.1001.7112222222222222222222222699799966996699669966996699669966996699669966996699669966996699 6699669966996699669966996699660101300.0100032396969696966666662966666662696969666666666662696011 111101101.00096969696669619222210339699699696669699669605996666666662666611112222222105011000001 00000000000000166666666666666610020002000.006666969669966996669669666666607101122666296666666969 609669696000.009696662621112441100412102119630401161245060522333200224.17 0000023535951221521330523226642101103266666666266666666619045.90999.6622100296040501020300960000 000000001221222666666666666666666606126666666611413211122112226622622222266966226662222222222111 666666666666666666666999666611666666666666615221222222222222266666666666666666666662151222222222 2221222212226026666666666666666966666666666666116666666666666666666666000.1001.0001.0001.0001.00 01.0005.1222222122222222222222122699669966996699669966060299669966996699669966996699669966996699 6699669966996699660032996699660101100.8102112301960705066666662966666662696969666666666662696021 141201100.45996969696669696132229639699699696669699669606996666666662666622266662222296966996996 99699699699699612666666666666639969962000.006666969669966996669669666666696969696666296666666969 609669696000.009696662612631340000312696669669966663234040122333200317.04
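As a rough illustration of the processing that raw records like these need before analysis, here is a minimal Python sketch that slices fixed-width records into named variables. The layout, variable names, and the two sample records are hypothetical and invented for the example; they are not the CCHS record layout, which comes from the survey's codebook.

```python
# Hypothetical record layout: (variable name, start column, end column),
# 1-based and inclusive, in the style of a codebook. NOT the real CCHS layout.
LAYOUT = [
    ("case_id", 1, 6),
    ("age", 7, 8),
    ("sex", 9, 9),
    ("weight", 10, 15),
]

def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of raw string values."""
    return {name: line[start - 1:end] for name, start, end in layout}

# Two made-up records, 15 characters each (illustration only).
records = ["000001421067.25", "000002352112.50"]
rows = [parse_record(r) for r in records]
```

Each value is still a string at this point; converting to numbers and attaching labels is the job of the SPSS or SAS command file.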

6 Confidential Microdata Master Files These files contain the fullness of detail captured about the unit of observation. The information in these files could identify the individual who provided the original information and, therefore, are considered confidential.

7 Confidential Microdata Master File – Example

8 Confidential Microdata Master File – geography

9 Confidential Microdata Master File - fullness of data

10 Confidential Microdata Master File - fullness of data

11 Confidential Microdata Master File - fullness of data

12 Confidential Microdata Share Files: these are confidential files for which the respondents have signed a consent form permitting Statistics Canada to allow access to their information for approved research; used with NPHS and NLSCY

13 Public Access Microdata. Anonymized Microdata: these microdata are specially prepared to minimize the possibility of disclosing or identifying any of the cases or observations; the original data from the master file are edited to create a public use microdata file

14 Public Access Microdata. Steps in Anonymizing Microdata: removal of all personal identifiers; include only gross levels of geography; collapse detailed information into fewer general categories or cap values; suppress the values of a variable
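The four anonymizing steps can be sketched on a single record as follows. This is only an illustration under stated assumptions: the field names, the 10-year age groups, and the income cap are hypothetical, and the sketch does not describe Statistics Canada's actual procedures.

```python
def anonymize(record, income_cap=100_000):
    """Apply the four anonymizing steps to one record (hypothetical fields)."""
    out = dict(record)  # work on a copy; leave the "master" record untouched

    # 1. Remove personal identifiers.
    for key in ("name", "health_number"):
        out.pop(key, None)

    # 2. Keep only gross geography: drop the postal code, keep the province.
    out.pop("postal_code", None)

    # 3. Collapse detail into fewer categories, and cap extreme values.
    age = out.pop("age")
    low = (age // 10) * 10
    out["age_group"] = f"{low}-{low + 9}"
    out["income"] = min(out["income"], income_cap)

    # 4. Suppress the values of a variable entirely.
    out["occupation"] = None
    return out

master = {
    "name": "Jane Doe", "health_number": "123", "postal_code": "T6G 2J8",
    "province": "AB", "age": 47, "income": 250_000, "occupation": "Librarian",
}
public = anonymize(master)
```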

15 Public Access Microdata Statistics Canada PUMFs only available for select social surveys that undergo a review of the Data Release Committee, an internal Statistics Canada committee; no ‘enterprise’ public use microdata;

16 Public Access Microdata Statistics Canada PUMFs almost all are cross-sectional, that is, represent data collected at one point in time; longitudinal data are difficult to anonymize while maintaining any useful information.

17 Public Access Microdata PUMFs – personal identifiers

18 Public Access Microdata PUMFs – collapsed data

19 Public Access Microdata PUMFs – suppressed data

20 Public Access Microdata Synthetic Files: These microdata do not contain actual ‘real’ cases but are pseudo-cases that, for some surveys, provide aggregate results close to those from the ‘real’ cases

21 Public Access Microdata Synthetic Files: They have been prepared so that analysis runs for the master file can be set up without the possibility of disclosing or identifying any of the cases

22 Public Access Microdata Synthetic Files The results are not to be reported, but are strictly to be used to prepare analyses of master files; Usually associated with longitudinal files.

23 Public Access Microdata. Steps in creating Synthetic Files: Observations are transformed; No records actually exist; Keep fullness of variable description; How the files are made is kept confidential

24 Public Access Microdata Synthetic Files – CCHS Cycle 1.1
        PUMF      Synthetic
Obs     130880    65101
Lrecl   841       1778
Var     614       1164

25 Implications for Analysis What are the implications in doing analysis with these different types of microdata files?

26 Implications for Analysis Master File: All observations; Has the most variables with the most detail; Lots of geography and personal characteristics; Little grouping or capping of categories

27 Implications for Analysis Master File Restricted access: only available to authorized Statistics Canada employees, which includes ‘deemed employees’; Use of the analysis is controlled through a contract;

28 Implications for Analysis Master File Includes linkage variables across files within a study, e.g., NLSCY linkage among the files for different units of analysis (kids, parents, teachers).

29 Implications for Analysis Public Use Microdata (PUMF): Valuable content for a tremendous amount of research; Issues arise when smaller-area geography is desired, rare subpopulations are being studied, or the variables that are needed were altered to anonymize respondents;

30 Implications for Analysis Public Use Microdata (PUMF): Licensed product: agree to certain terms of use; No linkage to multiple units of analysis, with a few exceptions (e.g., GSS Time Use and Family);

31 Implications for Analysis Synthetic Files “Looks like a duck and quacks like a duck”, but it isn’t a duck or any other type of fowl.

32 Implications for Analysis Synthetic Files: Looks like master files; Lots of observations; Lots of variables; Little grouping or capping of categories; Lots of geographic detail

33 Synthetic Files Precautions Results not authentic – but may be close in the aggregate for some synthetic files; Use for testing analysis setups only; Still need the master files for publishable results.

34 Where do we get Access? Master File: Restricted access governed under the Statistics Act; Remote Job Submission (a.k.a. RDA); Research Data Centres: apply to SSHRC for peer review of the proposal and to STC for security clearance.

35 Where do we get Access? Public Use Microdata Files (PUMF) Get from DLI Analyze where it is convenient Can use a variety of analysis software, including SAS, SPSS, Stata, HLM, LISREL, etc.

36 Where do we get Access? Synthetic Files: Author Divisions ‘may’ create them; Most relevant when dealing with new panel data, but not necessarily (e.g., the Census has potential); NPHS & CCHS synthetic files are on the DLI FTP site

37 Where do we get Access? Synthetic files: Work locally with the file; Build SAS and SPSS setups

38 Which File is Appropriate? 1 st stop is still the PUMF; This file has the easiest access for us; Probably meets the needs of most patrons; Not as administratively burdensome as synthetic or master file; Perfect for clients just looking for ‘data’ – courses in quantitative analysis;

39 Which File is Appropriate? If more detail is needed, refer to the Master File Documentation; Inform patrons that the cost of use is higher, both in terms of accessibility and analytical requirements; Interest most likely to come from grad students and ‘experienced’ researchers

40 Which File is Appropriate? Download the synthetic files from DLI; Make patrons aware of the problems with synthetic files: RESULTS ARE NOT PUBLISHABLE; Encourage them to submit an application for RDC access, since there is a time lag

41 Which File is Appropriate? Some of you may work with a patron using synthetic files before passing her/him on to an RDC.

42 Services for Synthetic Files. DLI Contacts can provide four basic services with synthetic files: Build SPSS and SAS system files from the raw synthetic data files that are distributed through DLI; Provide information about the use of Remote Job Submission and RDCs;

43 Services for Synthetic Files. Assist with finding variables in the synthetic files; Provide instruction about ways of capturing SPSS or SAS code from “dummy” analysis runs with the synthetic files. It is this code that is submitted to STC through remote job submission.

44 Services for Synthetic Files. 1. Building SPSS and SAS system files for synthetic data: The CCHS synthetic data are distributed as a raw ASCII file with accompanying command files for SPSS and SAS; Separate synthetic data files exist for the master file setup and for bootstrapping analysis

45 Services for Synthetic Files. 1. Building SPSS and SAS system files for synthetic data: The synthetic data for the CCHS Cycle 1.1 have 1,164 variables and 65,101 fabricated cases. Creating the SPSS and SAS system files from this file is not difficult, but it does take time. DLI Contacts may wish to create these products for their patrons.
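Before spending time on building system files, it can help to confirm that the downloaded raw file has the documented shape. The sketch below is one possible check, assuming one record per line; the figures 1778 (Lrecl) and 65,101 (cases) come from this presentation, while the function name and the file name in the usage note are hypothetical.

```python
def check_raw_lines(lines, lrecl, expected_cases):
    """Count records and flag any line whose length differs from lrecl.

    Assumes one record per line; line endings are ignored.
    """
    n_records = 0
    bad_length_lines = []
    for n_records, line in enumerate(lines, start=1):
        if len(line.rstrip("\r\n")) != lrecl:
            bad_length_lines.append(n_records)
    return {
        "records": n_records,
        "count_ok": n_records == expected_cases,
        "bad_length_lines": bad_length_lines,
    }

# Tiny demonstration with made-up 5-character records; line 2 is too short.
report = check_raw_lines(["abcde\n", "abcd\n", "abcde\n"], lrecl=5, expected_cases=3)
```

For the real file the call might look like `check_raw_lines(open("cchs_synthetic.txt"), 1778, 65101)`, where the file name is a placeholder for whatever the DLI distribution uses.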

46 Services for Synthetic Files. 2. Information about Remote Job Submission (RJS): The author divisions supporting RJS have established their own guidelines and have different operating procedures. Not all divisions supporting longitudinal surveys currently support RJS (e.g., SLID). Therefore, there is a need to track down this information for our patrons.

47 Services for Synthetic Files. 2. Information about Remote Job Submission (RJS): For example, the sources for information about RJS include the Centre for Education Statistics: http://www.statcan.ca/english/edu/rda/index.htm

48 Services for Synthetic Files. 2. Information about Remote Job Submission (RJS): Where do you find this information? Ask the DLI Team via the DLI List; The EAC has asked for a description of RJS on the DLI website, which should be on the DLI Team’s to-do list

49 Services for Synthetic Files. 2. Information about Research Data Centres: The collection of master files available through RDCs is listed on the STC website for RDCs; Each RDC has its own website describing its services. http://www.statcan.ca/english/rdc/index.htm

50 Services for Synthetic Files. 3. Data Reference for the content of the synthetic files: Helping researchers identify variables over longitudinal files is an important service; Need to keep the unit of analysis straight; Need to understand the mnemonic naming convention for variables over cycles; Develop indexing aids for you and your patrons
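One possible indexing aid is a lookup that groups variable names across cycles. The sketch below assumes a hypothetical naming convention in which a single character at a fixed position in the variable name identifies the cycle; the position and the sample names are illustrative only, so check the survey documentation for the actual convention.

```python
from collections import defaultdict

def build_index(variable_names, cycle_pos=3):
    """Group variable names by stem, listing the cycles each stem appears in.

    cycle_pos is the 0-based position of the cycle character (an assumption).
    The stem replaces that character with "?" so names line up across cycles.
    """
    index = defaultdict(set)
    for name in variable_names:
        stem = name[:cycle_pos] + "?" + name[cycle_pos + 1:]
        index[stem].add(name[cycle_pos])
    return {stem: sorted(cycles) for stem, cycles in index.items()}

# Hypothetical names: the 4th character (A, B) marks the cycle.
index = build_index(["GENA_01", "GENB_01", "SMKA_02"])
```

A patron can then see at a glance which variables were carried across cycles and which appear in only one.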

51 Services for Synthetic Files. 4. Provide helpful tips for preserving the code from “dummy” analysis runs in SPSS and SAS: Researchers will run analyses on the synthetic file to generate the code that they will subsequently email for Remote Job Submission; Providing information about how to do this easily will be helpful to your patrons

