Presentation is loading. Please wait.

Presentation is loading. Please wait.

Two Issues on Remote Data Access

Similar presentations


Presentation on theme: "Two Issues on Remote Data Access"— Presentation transcript:

1 Two Issues on Remote Data Access
Two Issues on Remote Data Access - Automation of Disclosure Limitation Review - Memory Allocation Peter Jacobebbinghaus 5th German Stata User's Group Meeting April 2nd, 2007 at RWI-Essen

2 Research Data Centre of the BA in the IAB in Nuremberg (FDZ)
Provides access to micro data of the Institute for Employment Research (IAB) Federal Employment Agency (BA) Data surveys administrational (databases > 200 GB, samples ~ 2 GB) Data protection and data access anonymization (scientific use files) on-site use remote data access

3 Disclosure limitation review
Remote data access Researchers send do-files – we return log-files 90% use Stata Disclosure limitation review Data have detailed information on persons and firms Neither firms nor persons may be identifiable by the contents of the log-files One heuristic: delete all information that is based on few observations

4 Example . table sex edu ---------------------------------- | Education
Sex | low medium high male | female |

5 Example . table sex edu ---------------------------------- | Education
Sex | low medium high male | / female |

6 Example . table sex edu, c(freq mean wage)
| Education Sex | low medium high male | | | female | |

7 Example . table sex edu, c(freq mean wage)
| Education Sex | low medium high male | / | / | female | |

8 Disclosure limitation review
Problems a lot of work if done by hand! Pearl scripts don‘t get all number of firms not visible weights Solutions use other software ;-) „fdz-ados“ -tablefdz-

9 tablefdz sxntax very similar to -table- version 9 error messages
program define tablefdz, byable(recall) version 9 syntax varlist(min=1 max=2) [if] [in] [pweight /] ,FIRMID(string) [Contents(string) Format(string)] error messages preserve drop all needless variables and observations

10 by `vars' `firmid': gen `dum' = ( _n == 1 )
by `vars': egen `anzbetr' = sum(`dum') by `vars': egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { by `vars': egen `sum_gew' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew' / `summe' } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) by `vars': keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -2 if `anzbetr' <= `minoffirms' replace `summe' = -2 if `anzbetr' <= `minoffirms' save `cell'

11 tablefdz calculation done for inner table, row, col and total
results for inner table, row, col and total are appended tabdisp restore

12 Results of tablefdz . tablefdz sex edu, firmid(firmid)
| Education Sex | low medium high Total male | female | | Total |

13 . tablefdz sex edu, firmid(firmid) c(mean wage sum wage)
| Education Sex | low medium high Total male | | | | female | | | Total | | |

14 . tablefdz sex edu [pw= weight], c(mean wage sum wage) firmid(firmid)
| Education Sex | low medium high Total male | | | | female | | | Total | | |

15 tablefdz deleted values can somtimes be recalculated very easily more deletion and rounding by Pearl still necessary but now it is feasible tablefdz takes about twice the time as table Questions ideas to improve? (computation time) similar programs around?

16 Topic 2: Memory usage 16 gigabyte memory in Unix server, 15 gigabyte are free How much memory is the optimum we should allocate? 13? 15? dependant on matsize? rule-of-thumb? Automation for running Stata-jobs?

17 * Program to produce tables with automatically deleted values that are based on few observations
* by Stefan Seth and Peter Jacobebbinghaus *! version mar2007 program define tablefdz, byable(recall) version 9 syntax varlist(min=1 max=2) [if] [in] [pweight /] , FIRMID(string) [Contents(string) Format(string)] tempvar anzbetr sv1 sv2 sv3 sv4 sv5 dum svar1 svar2 svar3 svar4 svar5 eins summe tempvar sum_gew1 sum_gew2 sum_gew3 sum_gew4 sum_gew // used when calculating weighted averages tempfile dat cross row col scol local minoffirms = // min. number of firms per statistic local anzvar = wordcount("`varlist'") local fmt "format(`format')" preserve if _by() { marksample touse // take care of by-group } quietly {

18 if "`contents'" == "" { local anz_stats = 0 } if "`contents'" != "" { if mod( wordcount( "`contents'" ), 2 ) != 0 { di in red "Invalid statistics specification; please use the Contents option as follows:" di in red "contents( statistic variable )" di in red "You may specify up to 5 statistic/variable-pairs, e.g.: " di in red "tabfdz var1 var2, contents( mean var3 sum var4 ) exit tokenize "`contents'" local stats "`1' `3' `5' `7' `9'" local svar "`2' `4' `6' `8' `10'" tokenize "`stats'" local anz_stats = wordcount( "`stats'" ) // list of statistics forvalues i = 1/`anz_stats' { local stat`i' "``i''" tokenize "`svar'" // list of variables local svar`i' "``i''"

19 *------------------------------------------------------------ error messages
if "`weight'" != "" { forv i = 1/`anz_stats' { if inlist( "`stat`i''", "sum", "mean" ) == 0 { di in red "weights may only be used with 'sum' and 'mean'" exit } if "`weight'" != "pweight" { di in red "only pweights allowed" forvalues i = 1/`anz_stats' { if inlist( "`stat`i''", "sum", "mean", "sd", "median", "min", "max" ) == 0 { di in red "Statistics sum mean sd median min max allowed only"

20 *------------------------------------------------------------ reduce size of data
if _by() { keep if `touse' // restrict to by-group } capture keep `if' `in' // reduction of observations keep `firmid' `varlist' `svar' `exp' // reduction of variables gen byte `eins' = 1 tokenize `varlist' // dropping missings drop if missing( `1' ) if `anzvar' == 2 { drop if missing( `2' )

21 if "`weight'" != "" { replace `eins' = `exp' // multiply statistics variables with weights... forv i = 1 / `anz_stats' { local jmax = `i' - 1 local seen = // ... but only once forv j = 1 / `jmax' { if "`svar`i''" == "`svar`j''" { local seen = 1 } if `seen' == 0 { replace `svar`i'' = `svar`i'' * `exp' save `dat'

22 *------------------------------------------------------------ computation of statistics
local vars "`varlist'" // inner part of the table sort `vars' `firmid' by `vars' `firmid': gen `dum' = ( _n == 1 ) // mark obs with new firm by `vars': egen `anzbetr' = sum(`dum') // number of firms per cell by `vars': egen `summe' = sum( `eins' ) // (weighted) number of persons per cell forv i = 1 / `anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { // mean with weights per hand by `vars': egen `sum_gew`i'' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew`i''/`summe' // mean = sum(y_i * h_i)/sum(h_i) } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) // other statistics per person by `vars': keep if _n == // keep one line per cell replace `sv`i'' = -1 if `anzbetr' <= `minoffirms' // erase statistics based on few firms replace `summe' = -1 if `anzbetr' <= `minoffirms' // erase number of cases based on few firms save `cross' // save results for the inner part

23 if `anzvar' == { use `dat', clear // the same for rows tokenize `varlist' local vars "`1'" replace `2' = . sort `vars' `firmid' by `vars' `firmid': gen `dum' = ( _n == 1 ) by `vars': egen `anzbetr' = sum(`dum') by `vars': egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { by `vars': egen `sum_gew`i'' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew`i'' / `summe' } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) by `vars': keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -2 if `anzbetr' <= `minoffirms' replace `summe' = -2 if `anzbetr' <= `minoffirms' save `row'

24 use `dat', clear // the same for columns
tokenize `varlist' replace `1' = . local vars "`2'" sort `vars' `firmid' by `vars' `firmid': gen `dum' = ( _n == 1 ) by `vars': egen `anzbetr' = sum( `dum' ) by `vars': egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { by `vars': egen `sum_gew`i'' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew`i'' / `summe' } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) by `vars': keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -3 if `anzbetr' <= `minoffirms' replace `summe' = -3 if `anzbetr' <= `minoffirms' save `col'

25 use `dat', clear // the same for the total
tokenize `varlist' replace `1' = . if `anzvar' == { replace `2' = . } sort `firmid' by `firmid': gen `dum' = ( _n == 1 ) egen `anzbetr' = sum(`dum') egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { egen `sum_gew`i'' = sum( `svar`i'' ) gen `sv`i'' = `sum_gew`i'' / `summe' else { egen `sv`i'' = `stat`i''( `svar`i'' ) keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -4 if `anzbetr' <= `minoffirms' replace `summe' = -4 if `anzbetr' <= `minoffirms' save `scol'

26 use `cross', clear // appending the results
append using `scol' if `anzvar' == { append using `col' append using `row' } * format output forvalues i = 1/`anz_stats' { // headers local svlist = "`svlist'" + " `" + "sv`i'" + "'" if `anzvar' == 1 & "`contents'" != "" { label variable `sv1' "`stat1'(`svar1')" cap label variable `sv2' "`stat2'(`svar2')" cap label variable `sv3' "`stat3'(`svar3')" cap label variable `sv4' "`stat4'(`svar4')" cap label variable `sv5' "`stat5'(`svar5')" label variable `summe' "Freq." local linesize = c(linesize) set linesize 255 format `summe' %9.0g } // end of quietly

27 *------------------------------------------------------------ output
di "/*§BEGIN_tabfdz*/" tabdisp `varlist', cellvar(`summe' `svlist' ) totals `fmt' csepwidth(2) missing di "/*§END_tabfdz*/" restore set linesize `linesize' end


Download ppt "Two Issues on Remote Data Access"

Similar presentations


Ads by Google