Model of transformation of administrative data into statistical data. Data used in the Population and Housing Census 2011 – examples. Janusz Dygaszewicz and Paweł Murawski, Central Statistical Office, Poland
Outline 1. Purpose of the work on administrative sources 2. Data quality 3. Extract data 4. Transform data 5. Summary
Registers - data acquisition. Data owners: Ministry of Finance, Ministry of Interior and Administration, Ministry of Justice, Agricultural Social Insurance Fund, National Health Fund, Agency for Restructuring and Modernisation of Agriculture, Agricultural and Food Quality Inspection, Agency for Geodesy and Cartography, State Fund for Rehabilitation of Disabled Persons, County Offices, Commune Offices, Regional Offices, Telecoms, Energy Suppliers, Office for Foreigners, Social Insurance Institution, Housing Managers
Purpose of the work on administrative data: obtaining a sufficiently complete data set (subjective and objective completeness) that corresponds to classification standards, definitions and basic categories, and thus enables the effective use of administrative data
Data quality -measures-
1. Measuring the quality of administrative registers
– timeliness of data
– methodological compatibility
– completeness
– identification standards used in the registry
– usefulness
– compatibility of data in administrative sources with data obtained in the study/survey
2. Measuring the quality of register data processing
– over-coverage error rate
– under-coverage error rate
– subjective indicator of completeness
– objective indicator of completeness
– imputation rate
– data correction index
– index of data integration from various sources
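As a rough illustration of the excessive (over-) and incomplete (under-) coverage error rates listed above, the sketch below compares register unit identifiers with a reference population. The definitions used here and all names are assumptions for illustration, not the formulas applied in the census.

```python
def coverage_rates(register_ids, reference_ids):
    """Return (over_coverage_rate, under_coverage_rate) as fractions.

    Assumed definitions (not taken from the presentation):
    - over-coverage: register units that are not in the reference
      population, divided by the number of register units;
    - under-coverage: reference units missing from the register,
      divided by the number of reference units.
    """
    register = set(register_ids)
    reference = set(reference_ids)
    over = len(register - reference) / len(register) if register else 0.0
    under = len(reference - register) / len(reference) if reference else 0.0
    return over, under


# Hypothetical identifiers: unit 5 is only in the register,
# unit 4 is only in the reference population.
over, under = coverage_rates([1, 2, 3, 5], [1, 2, 3, 4])
print(f"over-coverage: {over:.2%}, under-coverage: {under:.2%}")
```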
Extract data: consolidation of data from various source systems in different data formats; extraction of data into the production environment based on SAS software; conversion of data into one format suitable for processing (SAS tables); validation of the imported data structure is an integral part of this process.
Extract data -examples-
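The extraction step can be sketched outside SAS as well. The minimal Python example below reads delimited text from two differently formatted sources, converts both into one common row format and validates the imported structure. The expected column list and the function name are hypothetical, and the production process described in the slides uses SAS tables rather than this code.

```python
import csv
import io

# Hypothetical target structure that every imported source must provide.
EXPECTED_COLUMNS = ["ID", "NAME", "CITY"]


def load_and_validate(text, delimiter):
    """Read delimited text, normalize column names and check the structure."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = [name.strip().upper() for name in (reader.fieldnames or [])]
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for raw in reader:
        # Skip surplus fields (key None) and treat short rows as empty values.
        rows.append({
            key.strip().upper(): (value or "").strip()
            for key, value in raw.items()
            if key is not None
        })
    return rows


# Two hypothetical sources with different delimiters and header spelling.
source_a = "id;name;city\n1;Jan;Warszawa\n"
source_b = "ID,Name,City\n1, Jan ,Warszawa\n"
print(load_and_validate(source_a, ";") == load_and_validate(source_b, ","))
```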
Transform data: data processing in the production environment, consisting of: profiling (creating a report on data quality), unification/standardization of data, parsing (separation) or combining of variables, standardization with schemes, conversion, validation, deduplication, data integration.
Transform data - profiling-
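A profiling report of the kind mentioned above could look roughly like the sketch below, which counts total, missing and distinct values per variable. The chosen statistics and the function name are assumptions; the actual profiling reports were produced in the SAS environment and may contain different measures.

```python
from collections import Counter


def profile(rows, variables):
    """Build a simple per-variable data quality report."""
    report = {}
    for var in variables:
        values = [row.get(var) for row in rows]
        # Values that are absent or blank after trimming count as missing.
        present = [
            str(v).strip()
            for v in values
            if v is not None and str(v).strip() != ""
        ]
        counts = Counter(present)
        report[var] = {
            "total": len(values),
            "missing": len(values) - len(present),
            "distinct": len(counts),
            "most_common": counts.most_common(3),
        }
    return report


# Hypothetical rows: inconsistent spelling and missing values in CITY.
rows = [{"CITY": "Warszawa"}, {"CITY": "warszawa"}, {"CITY": ""}, {"CITY": None}]
print(profile(rows, ["CITY"]))
```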
Transform data - standardization and parsing examples-
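One possible parsing and standardization step is sketched below: a free-text street entry is separated into PREFIX and STREET variables, and the prefix is mapped to one standard form. The prefix list, the output spelling and the function name are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical street-type prefixes and their standardized forms.
PREFIXES = {
    "ul": "ul.", "ulica": "ul.",
    "al": "al.", "aleja": "al.",
    "pl": "pl.", "plac": "pl.",
}


def parse_street(raw):
    """Split a raw street entry into standardized PREFIX and STREET parts."""
    text = " ".join(raw.split())  # collapse repeated whitespace
    match = re.match(r"^([A-Za-z]+)\.?\s+(.*)$", text)
    if match and match.group(1).lower() in PREFIXES:
        return {
            "PREFIX": PREFIXES[match.group(1).lower()],
            "STREET": match.group(2).title(),
        }
    # No known prefix found: keep the whole entry as the street name.
    return {"PREFIX": "", "STREET": text.title()}


print(parse_street("UL.  marszałkowska"))
print(parse_street("Nowy Świat"))
```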
Transform data - schemes examples-
Transform data - examples: data cleaning report -
Columns: group of variables; variable; before cleaning (total, incorrect total, incorrect in %); after cleaning (total, incorrect total, incorrect in %)
Address of permanent residence
  COMMUNITY  ,92%  ,69%
  CITY       ,77%  ,99%
  STREET     ,34%  ,65%
  PREFIX     ,00%
Address of residence
  COMMUNITY  ,57%  ,58%
  CITY       ,13%  ,86%
  STREET     ,90%  ,55%
  PREFIX     ,00%
Corresponding address
  COMMUNITY  ,59%  ,50%
  CITY       ,84%  ,07%
  STREET     ,17%  ,50%
  PREFIX     ,00%
Personal data
  NAME       ,17%  ,13%
Transform data - conversion: gender variable - female, male; source codes: F, M, 1, 2
Transform data - conversion: marital status variable - 3 – married (M); 503 – married (M); ZNY – married (M); 3 – married (M); 1 – bachelor; KWR – bachelor; 502 – bachelor; 1 – bachelor
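The marital status conversion above amounts to a lookup from register-specific codes to one unified value. A minimal sketch, using only the codes listed on the slide (the function name and the handling of unmapped codes are assumptions):

```python
# Source codes as listed on the slide, mapped to one unified label.
MARITAL_STATUS = {
    "3": "married",
    "503": "married",
    "ZNY": "married",
    "1": "bachelor",
    "502": "bachelor",
    "KWR": "bachelor",
}


def convert_marital_status(code):
    """Return the unified label, or None when the code is not mapped."""
    key = str(code).strip().upper()
    return MARITAL_STATUS.get(key)


print(convert_marital_status(503))
print(convert_marital_status(" kwr "))
```

In practice the same code may mean different things in different registers, so a separate mapping per register may be needed; a single shared table is used here only to keep the sketch short.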
Transform data -validation- checking the data; correcting abnormal values according to the algorithms prepared by methodologists; where necessary, excluding from further processing those records that cannot be corrected.
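The check, correct or exclude pattern described above might be sketched as follows. The postal code rule is a hypothetical stand-in for the methodologists' algorithms, which are not given in the presentation.

```python
import re


def fix_postal_code(value):
    """Hypothetical rule: return the code as NN-NNN, or None if unrepairable."""
    digits = re.sub(r"\D", "", str(value))
    if len(digits) != 5:
        return None
    return f"{digits[:2]}-{digits[2:]}"


def validate(records, field, rule):
    """Apply a correction rule; exclude records the rule cannot repair."""
    kept, excluded = [], []
    for record in records:
        fixed = rule(record.get(field, ""))
        if fixed is None:
            excluded.append(record)
        else:
            kept.append({**record, field: fixed})
    return kept, excluded


# Hypothetical records: one correctable, one already valid, one unrepairable.
records = [
    {"id": 1, "POSTAL": "00950"},
    {"id": 2, "POSTAL": "00-950"},
    {"id": 3, "POSTAL": "unknown"},
]
kept, excluded = validate(records, "POSTAL", fix_postal_code)
print(len(kept), len(excluded))
```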
Transform data - deduplication - removal of repeated units; requires detailed analysis, including analysis of the legal acts specific to each register; result of deduplication – one record with all the possible and unique information.
Transform data -example of deduplication process-
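A simplified view of deduplication, in which repeated units are collapsed into one record that keeps every available piece of information, is sketched below. Matching on a single key and taking the first non-empty value per field are assumptions; the real process depends on register-specific analysis.

```python
def deduplicate(records, key):
    """Merge records sharing the same key into one record per unit.

    For each field the first non-empty value encountered is kept.
    """
    merged = {}
    for record in records:
        target = merged.setdefault(record[key], {})
        for field, value in record.items():
            if value not in (None, "") and target.get(field) in (None, ""):
                target[field] = value
    return list(merged.values())


# Hypothetical duplicates: two partial records for unit A1.
records = [
    {"ID": "A1", "NAME": "Jan", "CITY": ""},
    {"ID": "A1", "NAME": "", "CITY": "Kraków"},
    {"ID": "B2", "NAME": "Anna", "CITY": "Gdańsk"},
]
print(deduplicate(records, "ID"))
```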
Transform data -data integration- process of selecting the best, most current and correct value from several or a dozen or so registers; used to create a statistical register, which will be available for use by analysts.
Transform data -integration process – scheme- Register A, Register B, Register C; linking (one ID, multiple identifiers, alternative linking keys); selecting (algorithms selecting the best values, data completeness); statistical register
Transform data - data integration: example of algorithm - target variable kraj_ur_kod, built from three conditions, each with a TRUE and a FALSE branch: kraj_ur_kod_KEP not null - select kraj_ur_kod_KEP; msce_ur_kod_POBYT not null - select kraj_ur_kod_POBYT; kraj_ur_kod_GZM not null - select kraj_ur_kod_GZM
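The selection algorithm on this slide can be read as a priority chain: take the value from the first source whose condition is met. The sketch below follows the field names on the slide; the priority order (KEP, then POBYT, then GZM), the pairing of each condition with its selection, and the function name are assumptions drawn from the order in which the conditions appear.

```python
# (field checked for "not null", field selected when the check is TRUE),
# in the assumed priority order KEP, POBYT, GZM.
RULES = [
    ("kraj_ur_kod_KEP", "kraj_ur_kod_KEP"),
    ("msce_ur_kod_POBYT", "kraj_ur_kod_POBYT"),
    ("kraj_ur_kod_GZM", "kraj_ur_kod_GZM"),
]


def select_kraj_ur_kod(record):
    """Return the value for kraj_ur_kod from the first source that qualifies."""
    for checked, selected in RULES:
        if record.get(checked) is not None:
            return record.get(selected)
    return None  # no source provides a value


# Hypothetical linked record: KEP is empty, so POBYT is used.
record = {
    "kraj_ur_kod_KEP": None,
    "msce_ur_kod_POBYT": "1465",
    "kraj_ur_kod_POBYT": "PL",
    "kraj_ur_kod_GZM": "DE",
}
print(select_kraj_ur_kod(record))
```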
Data integration -example of process-
Summary
Common difficulties:
- poor quality data, missing values, duplicates,
- conflicting data,
- technical: size of the registers, time-consuming process.
Benefits:
- obtaining relevant, useful, accurate data,
- improving the quality of the output data,
- selection of the best variables from multiple registers.