4.1 Do you speak VTL? Validation and Transformation Language


1 4.1 Do you speak VTL? Validation and Transformation Language
Q2018 Conference
Training course on Data Integration and Validation
Krakow, 26 June 2018
Eurostat, Unit B1

2 VTL: the origin
- Based on a generic information model that can be used with different standards: SDMX, DDI, GSIM or others
- VTL is maintained by the VTL Task Force, composed of members of Eurostat, ECB, ILO, INEGI, Bank of Italy, ISTAT
- The VTL Task Force works under the umbrella of the SDMX Technical Working Group

3 SDMX implementation: in progress
- Mapping of SDMX and VTL artefacts
- Messages for exchanging VTL rules
- Registry for storing VTL rules
- Web services for retrieving VTL rules
SDMX, which stands for Statistical Data and Metadata eXchange, is an international initiative that aims at standardising and modernising ("industrialising") the mechanisms and processes for the exchange of statistical data and metadata among international organisations and their member countries. SDMX is sponsored by seven international organisations: the Bank for International Settlements (BIS), the European Central Bank (ECB), Eurostat (Statistical Office of the European Union), the International Monetary Fund (IMF), the Organisation for Economic Co-operation and Development (OECD), the United Nations Statistical Division (UNSD), and the World Bank.

4 VTL – purposes
- provide an unambiguous language to communicate validation rules between different statistical organisations
- provide a high-level language to document the data transformations
- provide an efficient language for implementing data validation services
- provide an efficient language for implementing data transformations

5 Versions of VTL
- VTL 1.0 published in March 2015
- Collection of comments (public review)
- VTL 1.1 published in November 2016
- VTL 2.0 published in April 2018
SDMX web site:

6 VTL – main principles
- Most of the VTL operators operate on datasets
- A dataset is described by dimensions, measures and attributes
Example: ds_bop_1
REF_AREA   PARTNER   TIME   OBS_VALUE   OBS_STATUS
EU25       CA        2010   20          D
BG                          1           P
RO
EU27                        23
REF_AREA, PARTNER and TIME are dimensions, OBS_VALUE is the measure and OBS_STATUS is an attribute.
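To make the three component roles concrete, here is a minimal Python sketch of a dataset structure. The class `DatasetStructure` and its `key` method are illustrative names, not part of VTL or SDMX; the data point is the first row of ds_bop_1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetStructure:
    """Illustrative description of a dataset: which columns play which role."""
    dimensions: tuple  # identify a data point
    measures: tuple    # hold the observed values
    attributes: tuple  # qualify the observation (e.g. a status flag)

    def key(self, data_point: dict) -> tuple:
        # The dimension values together identify one data point.
        return tuple(data_point[d] for d in self.dimensions)


bop_structure = DatasetStructure(
    dimensions=("REF_AREA", "PARTNER", "TIME"),
    measures=("OBS_VALUE",),
    attributes=("OBS_STATUS",),
)

# First row of ds_bop_1 as shown on the slide.
data_point = {
    "REF_AREA": "EU25",
    "PARTNER": "CA",
    "TIME": 2010,
    "OBS_VALUE": 20,
    "OBS_STATUS": "D",
}
key = bop_structure.key(data_point)
```

The tuple returned by `key` is what a dataset-level operator would use to match data points across datasets.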

7 VTL – main principles
Example of a typical VTL operation:
    ds3 := ds1 + ds2
Operations carried out by VTL:
- join the data points of ds1 and ds2 using the dimension values
- apply the scalar function "+" to all pairs of numeric measures of ds1 and ds2 having the same name
- if desired, execute an attribute propagation function defined by the user (e.g. concatenate the "flag" attribute of the two data points)
- create a temporary dataset containing the resulting data points
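The four steps above can be sketched in Python. This is only an illustration of the described behaviour, not a VTL engine: the function name `add_datasets`, the list-of-dicts representation, the sample values, and the assumption that unmatched data points are dropped (an inner join) are all choices made for this sketch.

```python
def add_datasets(ds1, ds2, dimensions, measures, propagate=None):
    """Sketch of ds3 := ds1 + ds2 on datasets stored as lists of dicts."""
    # Step 1: index ds2 by its dimension values so data points can be joined.
    index2 = {tuple(p[d] for d in dimensions): p for p in ds2}
    result = []
    for p1 in ds1:
        key = tuple(p1[d] for d in dimensions)
        p2 = index2.get(key)
        if p2 is None:
            continue  # assumption: data points without a match are dropped
        out = {d: p1[d] for d in dimensions}
        # Step 2: apply "+" to every pair of measures with the same name.
        for m in measures:
            out[m] = p1[m] + p2[m]
        # Step 3: optional user-defined attribute propagation.
        if propagate is not None:
            out.update(propagate(p1, p2))
        # Step 4: collect the resulting data points.
        result.append(out)
    return result


# Hypothetical sample data.
ds1 = [
    {"REF_AREA": "BG", "TIME": 2010, "OBS_VALUE": 1, "OBS_STATUS": "P"},
    {"REF_AREA": "RO", "TIME": 2010, "OBS_VALUE": 5, "OBS_STATUS": "D"},
]
ds2 = [
    {"REF_AREA": "BG", "TIME": 2010, "OBS_VALUE": 2, "OBS_STATUS": "D"},
]


def concat_flags(p1, p2):
    # Example propagation rule: concatenate the two status flags.
    return {"OBS_STATUS": p1["OBS_STATUS"] + p2["OBS_STATUS"]}


ds3 = add_datasets(ds1, ds2, ("REF_AREA", "TIME"), ("OBS_VALUE",), concat_flags)
```

Only the BG data point appears in `ds3`, because RO has no counterpart in `ds2`.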

8 Example of VTL validation rules
- Hierarchical validation rules
- Data point validation rules
- Time-series rules
- Boolean conditions
Example of a Boolean condition:
    check ( ds1#obs_value >= 0 )
If obs_value is the only measure, this can be simplified to:
    check ( ds1 >= 0 )

9 VTL - hierarchical ruleset
Hierarchical ruleset: hr_euro_agg
N.   Antecedent variables: time    Rule variables: ref_area
1                                  EU15 = AT + BE + LU + DE + ES + FI + FR + EL + IE + IT + NL + PT + DK + UK + SE
2                                  EU25 = EU15 + CY + CZ + EE + HU + LT + LV + MT + PL + SK + SI
3                                  EU27 = EU25 + BG + RO
4                                  EU28 = EU27 + HR
5    time between 1995 and 2003    EU = EU15
6    time between 2004 and 2005    EU = EU25
7    time between 2006 and 2012    EU = EU27
8    time >= 2013                  EU = EU28
VTL syntax:
    define hierarchical ruleset hr_euro_agg ( valuedomain condition time rule ref_area ) is
        EU15 = AT + BE + LU + DE + ES + FI + FR + EL + IE + IT + NL + PT + DK + UK + SE ;
        EU25 = EU15 + CY + CZ + EE + HU + LT + LV + MT + PL + SK + SI ;
        EU27 = EU25 + BG + RO ;
        EU28 = EU27 + HR ;
        when between ( time, 1995, 2003 ) then EU = EU15 ;
        when between ( time, 2004, 2005 ) then EU = EU25 ;
        when between ( time, 2006, 2012 ) then EU = EU27 ;
        when time >= 2013 then EU = EU28 ;
    end hierarchical ruleset
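As a rough illustration of how such a ruleset could be evaluated, the Python sketch below encodes a few of the rules and reports the aggregates that do not match the sum of their components. The function `check_hierarchy_sketch`, the rule encoding, the sample values, and the choice to skip rules whose codes are missing from the data are all assumptions of this sketch, not the behaviour of the VTL operator.

```python
# Each rule: (condition on time, aggregate code, component codes).
RULES = [
    (lambda t: True, "EU27", ["EU25", "BG", "RO"]),
    (lambda t: True, "EU28", ["EU27", "HR"]),
    (lambda t: 2006 <= t <= 2012, "EU", ["EU27"]),
    (lambda t: t >= 2013, "EU", ["EU28"]),
]


def check_hierarchy_sketch(values, time, rules=RULES):
    """values maps a ref_area code to its obs_value for one time period.

    Returns (aggregate, reported value, computed sum) for every failed rule.
    """
    failures = []
    for condition, total, parts in rules:
        if not condition(time):
            continue  # the antecedent does not hold for this period
        if total not in values or any(p not in values for p in parts):
            continue  # assumption: rules with missing codes are skipped
        expected = sum(values[p] for p in parts)
        if values[total] != expected:
            failures.append((total, values[total], expected))
    return failures


# Hypothetical values for the year 2010.
values_2010 = {"EU25": 20, "BG": 1, "RO": 1, "EU27": 23, "EU": 23}
failures = check_hierarchy_sketch(values_2010, 2010)
```

With these hypothetical values only the EU27 rule fails (23 reported against a computed 22); the conditional rule EU = EU27 holds for 2010, and the EU28 rule is skipped because no EU28 value is present.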

10 VTL – datapoint validation ruleset
    define datapoint ruleset dr_flow_positive ( variable flow, obs_value ) is
        when flow = "IMP" or flow = "EXP" then obs_value > 0 ;
    end datapoint ruleset
The datapoint ruleset:
- is defined on the variables flow and obs_value
- verifies that in each data point of the dataset to be validated (not shown here) the component obs_value is greater than zero when the flow is "IMP" or "EXP"
The above syntax creates a ruleset (a permanent object) named "dr_flow_positive".
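The logic of the rule above can be mirrored in a few lines of Python. This is a sketch of the "when ... then ..." semantics only: the helper `check_datapoint_sketch`, the sample data, and the flow code "BAL" are hypothetical.

```python
def dr_flow_positive(point):
    """Return True when the data point satisfies the rule."""
    if point["flow"] in ("IMP", "EXP"):
        return point["obs_value"] > 0
    return True  # the antecedent is false, so the rule does not apply


def check_datapoint_sketch(dataset, rule):
    # Keep only the data points that violate the rule.
    return [p for p in dataset if not rule(p)]


# Hypothetical data points.
dataset = [
    {"ref_area": "BG", "flow": "IMP", "obs_value": 1},
    {"ref_area": "RO", "flow": "IMP", "obs_value": 0},
    {"ref_area": "EU27", "flow": "BAL", "obs_value": -3},
]
failing = check_datapoint_sketch(dataset, dr_flow_positive)
```

The negative "BAL" value is not reported, because the rule constrains obs_value only for the flows "IMP" and "EXP".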

11 VTL – checking boolean conditions
    ds_result := check ( ds1 > 1000 errorcode ( "Value should be greater than 1000" ) errorlevel ( "Error" ) )
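One way to picture the result of such a check is a copy of the input in which each data point carries the outcome of the condition and, on failure, the error code and level. The Python sketch below follows that idea; the function `check_sketch`, the column names `bool_var`, `errorcode` and `errorlevel`, and the sample data are assumptions of this illustration, and the exact shape of the VTL result is not shown on the slide.

```python
def check_sketch(dataset, condition, errorcode, errorlevel):
    """Evaluate a Boolean condition on every data point of a dataset."""
    result = []
    for point in dataset:
        ok = condition(point)
        row = dict(point)  # copy, so the input dataset is left untouched
        row["bool_var"] = ok
        # Error information is attached only to failing data points.
        row["errorcode"] = None if ok else errorcode
        row["errorlevel"] = None if ok else errorlevel
        result.append(row)
    return result


# Hypothetical data points.
ds1 = [
    {"ref_area": "BG", "obs_value": 1500},
    {"ref_area": "RO", "obs_value": 900},
]
ds_result = check_sketch(
    ds1,
    lambda p: p["obs_value"] > 1000,
    "Value should be greater than 1000",
    "Error",
)
invalid = [row["ref_area"] for row in ds_result if not row["bool_var"]]
```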

12 Exercise 1
VTL code:
    ds_result := check ( between ( ds_bop#time_period, 2008, 2015 ) errorcode ( "_____" ) errorlevel ( "Error" ) ) ;
ds_bop is the dataset containing the data to be validated.
Question: What is the correct text (error message) to be inserted in _____ ?
Answer: A correct text for an error message could be "Time Period should be between 2008 and 2015".

13 Exercise 2
ds_bop1
REF_AREA   PARTNER   TIME   OBS_VALUE   OBS_STATUS
EU25       CA        2010   20          D
BG                          1           P
RO
EU27                        23
VTL code:
    ds_result := check_hierarchy ( ds_bop1, hr_euro_agg ) ;
hr_euro_agg is the hierarchical ruleset described in slide 9.
Question: What is the data point contained in the ds_result dataset?
Answer: The data point contained in the ds_result dataset is the EU27 record, because the check_hierarchy operator returns the erroneous records. The EU27 record is erroneous because its value (23) does not equal the sum EU25 + BG + RO; see the formula in slide 9 (VTL - hierarchical ruleset).

14 Exercise 3
ds_bop1
REF_AREA   PARTNER   FLOW   TIME   OBS_VALUE   OBS_STATUS
EU25       CA        IMP    2010   20          D
BG                                 1           P
RO
EU27                               23
VTL code:
    ds_result := check_datapoint ( ds_bop1, dr_flow_positive ) ;
dr_flow_positive is the datapoint ruleset described in slide 10.
Question: What is the data point contained in the ds_result dataset?
Answer: The data point contained in the ds_result dataset is the record related to RO, because it is an erroneous record (with value zero) according to the formula in slide 10 (VTL – datapoint validation ruleset). That formula requires a positive value for the flows "IMP" and "EXP".

15 VTL – assessment of usability
Assessment of usability by statisticians:
- Covering several domains: Animal Production, Asylum, International Trade in Services, National Accounts, Short Term Statistics
- Participation of 8 countries + Eurostat
Some conclusions:
- Rules in VTL are useful as a complement to rules in plain English (to limit the risk of ambiguity)
- Examples of bad/good data are also useful to understand the rules

16 Development of VTL tools
IT tools and services under development:
- ECB: VTL parser
- Norway: Java API based on JSON-stat format
- Poland: VTL to SQL translator (UNECE paper)
- Istat: VTL Editor
- Eurostat: Compiler (part of the Validation Service)
- Eurostat: Validation Rule Manager
- Eurostat: Sandbox (simple GUI + VTL translator to SQL)

17 Some uses of VTL
- ECB BIRD portal
- Continuous Capture of Metadata: VTL is used to document the data validations and transformations of the statistical process
- There is a proposal to use VTL as a common language to describe data transformations

18 Thank you for your attention!
Any questions?

