
1 Developing and using common tools for processing statistical micro data at SORS
Tomaž Špeh, Rudi Seljak, Statistical Office of the Republic of Slovenia

2 Content of the presentation
Problem description
Design and implementation guidelines
Example: common tools for statistical micro data editing
Other common tools for processing statistical data
Current state, future plans and lessons learned
Discussion

3 Problem description
Our statistical IT landscapes are becoming more and more complex, containing systems developed over decades and built from multiple vendor components and technologies. Current and future statistical business needs require flexible and agile IT support. Statistical micro data processing has always been one of the most important parts of the statistical process; however, due to the complexity of the related methodology, generally acceptable solutions are rare. The main aim is to create a modular platform that enables statisticians to design the process and to process statistical micro data. The presentation covers the current state and the experience gained in developing and using common tools for processing statistical micro data, as well as how these common tools are used in the data editing process.

4 Data processing target workflow
[Workflow diagram: a statistician submits a request for data processing; a methodologist defines the statistical methodology; the statistician chooses and connects building blocks from the repository of building blocks; the statistician runs the process.]
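As a rough illustration of this workflow (not part of the SORS system; all block and function names are hypothetical), a short Python sketch of connecting building blocks chosen from a repository into an executable process:

    # Hedged sketch: compose a process by chaining reusable building blocks
    # chosen from a repository. All block names are hypothetical placeholders.
    from functools import reduce

    def run_checks(data):      return data   # placeholder building block
    def impute_missing(data):  return data   # placeholder building block
    def integrate(data):       return data   # placeholder building block

    REPOSITORY = {"checks": run_checks, "impute": impute_missing, "integrate": integrate}

    def build_process(step_names):
        """Connect the chosen building blocks, in order, into one callable process."""
        steps = [REPOSITORY[name] for name in step_names]
        return lambda data: reduce(lambda d, step: step(d), steps, data)

    process = build_process(["checks", "impute", "integrate"])
    edited = process([{"income": 1200}, {"income": None}])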

5 Design and implementation guidelines
Standardization: the solution enforces standards
Flexibility: incremental development, agile maintenance
Reuse: build processes from reusable, well-defined and well-tested building blocks; layered architecture model
Resources: reduction of resources, less manual work
Reliability: manual work is error prone
Separation of design and execution
Process quality and product quality
Platform/technology independent solutions
GSBPM aware

6 Streamlining statistical micro data editing and processing
[Figure: scheme of the target data flow at SORS — from primary (field survey) and secondary (administrative) sources through data collection, data integration, data preparation and the input database, to micro editing, statistical processing, aggregations, seasonal adjustment and macro data editing, and on to dissemination (standard and tailor-made tables, electronic releases, printed publications, international reporting, microdata for researchers) and archiving; the flow is supported by the KLASJE and METIS metadata systems, the statistical register, the data warehouse and the dissemination server, behind a firewall.]

7

8 Creation of the input database
Data integration, followed by data editing, is the crucial part of the process; the quality of the final data depends to a large extent on this part of the process.
[Table: data sources (field survey data, other SORS data, administrative data) cross-classified with data problems such as missing data and inconsistent data.]
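A minimal, hypothetical Python sketch of the integration step (column and source names are illustrative, not the actual SORS input database):

    # Combine field survey data with administrative data on a common unit
    # identifier and flag missing values; illustrative data only.
    import pandas as pd

    survey = pd.DataFrame({"unit_id": [1, 2], "income": [1200.0, None]})
    admin = pd.DataFrame({"unit_id": [1, 2], "tax_income": [1180.0, 950.0]})

    integrated = survey.merge(admin, on="unit_id", how="outer")
    integrated["income_missing"] = integrated["income"].isna()   # missing-data flag
    print(integrated)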

9 Data editing
Main steps in the process:
1. Logical checks for a particular data set
2. Outlier detection for a particular data source
3. Corrections and missing data imputation for a particular data source
4. Data integration and derived variables calculation
5. Logical checks on the integrated data
6. Additional corrections in the particular data source
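To make one of these steps concrete, a hedged Python sketch of simple outlier detection for a single data source, using an interquartile-range rule (the rule and threshold are illustrative, not the SORS methodology):

    import pandas as pd

    def flag_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
        """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as suspicious."""
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        return (values < q1 - k * iqr) | (values > q3 + k * iqr)

    turnover = pd.Series([100, 110, 95, 105, 5000])   # the last unit looks suspicious
    print(flag_outliers(turnover))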

10 Metadata driven, content independent and reusable building blocks
[Figure, from the implementation point of view: each data source passes through the same reusable blocks (checks, then corrections and imputations); the sources are then integrated, and the same checks and corrections/imputations blocks are applied again to the integrated database.]

11 Checks – metadata table
[Screenshot of the checks metadata table used in processing.]
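A hedged sketch of how such a metadata table might drive the checks (the table layout and the expressions are assumptions for illustration, not the actual SORS metadata model):

    import pandas as pd

    # Checks stored as metadata rows: identifier, logical expression, severity.
    checks = pd.DataFrame([
        {"check_id": "C01", "expression": "age >= 0",                "severity": "fatal"},
        {"check_id": "C02", "expression": "income <= 10 * turnover", "severity": "query"},
    ])

    data = pd.DataFrame({"age": [34, -2], "income": [1200, 900], "turnover": [300, 50]})

    # Evaluate every check against the micro data and report failing units.
    for _, chk in checks.iterrows():
        failed = data.index[~data.eval(chk["expression"])]
        print(chk["check_id"], chk["severity"], "failed units:", list(failed))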

12 Corrections – metadata table
Individual data corrections (e.g. heating costs, inter-household cash transfers)
Systematic data corrections (e.g. apples produced for own consumption)
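A hedged Python sketch of the two kinds of corrections (variable names such as heating_costs and own_apples_kg are illustrative only):

    import pandas as pd

    data = pd.DataFrame({"unit_id": [1, 2],
                         "heating_costs": [None, 450.0],
                         "own_apples_kg": [12000.0, 30.0]})

    # Individual correction: a manually confirmed value for one reporting unit.
    individual = [{"unit_id": 1, "variable": "heating_costs", "new_value": 380.0}]
    for corr in individual:
        data.loc[data["unit_id"] == corr["unit_id"], corr["variable"]] = corr["new_value"]

    # Systematic correction: a rule applied to all affected units, e.g. quantities
    # apparently reported in grams instead of kilograms are rescaled.
    mask = data["own_apples_kg"] > 1000
    data.loc[mask, "own_apples_kg"] = data.loc[mask, "own_apples_kg"] / 1000
    print(data)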

13 Imputations – metadata table
24 different methods with different parameterizations can be used at the moment (hot-deck, regression, logical imputations, etc.).
[Screenshot: parameterization of the imputation for the mortgage installment variable.]
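A hedged sketch of one such method, a simple random hot-deck imputation within an imputation class (class and variable names are illustrative, not the actual SORS parameterization):

    import pandas as pd

    def hot_deck(df: pd.DataFrame, variable: str, by: str, seed: int = 1) -> pd.DataFrame:
        """Replace missing values by values drawn from donors in the same class."""
        out = df.copy()
        for _, group in out.groupby(by):
            donors = group[variable].dropna()
            recipients = group.index[group[variable].isna()]
            if len(donors) and len(recipients):
                out.loc[recipients, variable] = donors.sample(
                    n=len(recipients), replace=True, random_state=seed).to_numpy()
        return out

    data = pd.DataFrame({"region": ["A", "A", "A", "B"],
                         "mortgage_installment": [500.0, None, 620.0, 700.0]})
    print(hot_deck(data, "mortgage_installment", by="region"))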

14 Data editing – software application
The application is designed as a metadata-driven (MDD) system. All the information referring to a specific survey execution is provided through the metadata tables. Currently the application uses the following software:
SAS + Banff as stored procedures
SAS EG as the standard interface
ORACLE (metadata repository)
ORACLE (data storage)
* Plan: a technology- and platform-independent environment for the automated execution of processes, other development environments

15 Existing MDD tools
Logical checks
Corrections
Imputations
Aggregation
Standard error estimation
Tabulation

16 Planned MDD tools
Sampling
Weighting
Calculation of quality indicators
Disclosure control
Macro editing

17 Current state – main advantages
The subject-matter personnel can run the process independently of the IT sector.
All information about the data processing is transparently available through the metadata tables.
The process can be easily adjusted for different executions of the survey.
Every change to the data in the process is systematically flagged, which makes it easier to calculate quality indicators and produce the quality report.
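A hedged illustration of how such flags feed quality indicators (flag codes are illustrative only):

    import pandas as pd

    edited = pd.DataFrame({
        "unit_id": [1, 2, 3, 4],
        "income": [1200, 380, 950, 700],
        "income_flag": ["original", "imputed_hotdeck", "corrected", "original"],
    })

    # With an edit flag stored next to each value, indicators such as the
    # imputation rate and the overall edit rate are simple aggregations.
    imputation_rate = edited["income_flag"].str.startswith("imputed").mean()
    edit_rate = (edited["income_flag"] != "original").mean()
    print(f"imputation rate: {imputation_rate:.0%}, edit rate: {edit_rate:.0%}")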

18 Management of the system
[Figure: ad hoc procedures prepare the input and output data; the micro data database and the edited data are linked, through a standard SAS application, to the production metadata repository of surveys and instances.]
All the processes are programmed as SAS (+ Banff) macros. The user can run and control the processes through the stored process interface.

19 Implementation approach
Design and development of the editing system:
First, an automated data editing system based on predefined checks (Fellegi, Holt; sketched below) – complex implementation and maintenance, and statisticians want more control over the edited data
Then a semi-automated system where statisticians control checks and edit rules through interactive metadata
Development of an improved system for metadata management
Development of an architecture for supporting the execution of statistical business processes
Solutions for visualisation of the workflow
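A minimal sketch of the Fellegi-Holt idea mentioned above: change as few fields as possible so that every failed edit involves at least one field selected for change (a set-cover simplification of error localisation; not the SORS/Banff implementation):

    from itertools import combinations

    record = {"age": 15, "marital_status": "married", "hours_worked": 60}

    # Each edit: (fields it involves, predicate that must hold for a valid record).
    edits = [
        ({"age", "marital_status"},
         lambda r: not (r["age"] < 16 and r["marital_status"] == "married")),
        ({"age", "hours_worked"},
         lambda r: not (r["age"] < 16 and r["hours_worked"] > 0)),
    ]

    failed = [fields for fields, ok in edits if not ok(record)]
    all_fields = set().union(*failed) if failed else set()

    # Brute force: the smallest set of fields that touches every failed edit.
    for size in range(len(all_fields) + 1):
        hit = next((set(c) for c in combinations(sorted(all_fields), size)
                    if all(set(c) & f for f in failed)), None)
        if hit is not None:
            print("fields to change:", sorted(hit))   # here: ['age']
            break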

20 Challenges for the future
Improve the procedure of metadata management; at the moment there is a high risk of syntax errors (see the sketch below)
Improve the management of the system, especially the running of different parts of the process
Develop additional MDD building blocks
Platform-independent components and integration
Enable separation of process design from process execution
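One possible way to reduce the syntax-error risk, sketched under the assumption that check expressions are written in a pandas-style syntax (the actual SORS metadata syntax may differ):

    import pandas as pd

    def validate_expression(expression: str, sample: pd.DataFrame) -> str:
        """Try a check expression on a sample record before storing it as metadata."""
        try:
            result = sample.eval(expression)
        except Exception as exc:          # malformed syntax, unknown variable, ...
            return f"rejected: {exc}"
        if getattr(result, "dtype", None) != bool:
            return "rejected: expression does not evaluate to true/false"
        return "accepted"

    sample = pd.DataFrame({"age": [30], "income": [1000]})
    print(validate_expression("(age >= 0) & (income >= 0)", sample))   # accepted
    print(validate_expression("age >=", sample))                       # rejected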

21 Conclusions
The current solution for processing statistical micro data has proven to work satisfactorily for the processing of several micro data collections (Population Census, Agriculture Census, EU-SILC, SES).
The development of an improved system for metadata management is the next phase.
During the development of the applications, other (international) solutions will be considered:
sharing IT tools and building blocks
harmonizing IT infrastructure
metadata definitions and their exchange

