Presentation on theme: "Workshop ESS NET ON MICRO DATA LINKING AND DATA WAREHOUSING IN STATISTICAL PRODUCTION 22 & 23 SEPTEMBER 2011 “Mapping the GSBPM on a SDW architecture”"— Presentation transcript:
Workshop ESS NET ON MICRO DATA LINKING AND DATA WAREHOUSING IN STATISTICAL PRODUCTION 22 & 23 SEPTEMBER 2011 “Mapping the GSBPM on a SDW architecture” Antonio Laureti Palma IT - Structural Business Statistics Unit National Institute of Statistics – Italy
Overview The aim of this study is to define and contextualize a statistical data warehouse, in order to provide a framework that assists the development and definition of “data warehousing and data linking”. The data warehousing architecture presented can be considered the IT conclusion of the activities of the first year of the ESSnet, while the proposed modelling approach indicates the roadmap for the future IT representation of the context. The presentation covers:
- Data Warehousing as a Single Coherent Statistical Production System
- Statistical Data Warehousing: an Architecture schema
- Modeling the Business Domain - Designer’s view of the GSBPM on the DWA schema
- Modeling the Data/Metadata Domain
- Conclusion
The Data Warehouse IT definition: In computing, a data warehouse is a database used for reporting. …the concept of data warehousing dates back to the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse" (from Wikipedia). ...as Bill Inmon says - “the data warehouse is at the center of the corporate information factory, which provides a logical framework for decision support environments and business management capabilities”. ...in essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to delivering business intelligence.
Data Warehousing for Enterprise [Diagram labels: ENTERPRISE PRODUCTION LINE (MARKETING, PRODUCTION, SALES, RESOURCES, DISTRIBUTION), ETL, DATA WAREHOUSE, DSS (Decision Support System), MIS (Management Information System).] The centrality of the DW in an enterprise is obtained through an IT infrastructure that is transversal to all the operational systems. Data from the operational systems are Extracted, Transformed and Loaded (ETL) into the DW, where they become available to the DSS and MIS.
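The Extract, Transform, Load flow described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SalesRecord` type, the `product;amount` row format and the list standing in for the warehouse are all hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass
class SalesRecord:
    """Hypothetical warehouse record (not taken from the slides)."""
    product: str
    amount_eur: float


def extract(raw_rows):
    """Extract: read non-empty rows from an operational system export."""
    return [row for row in raw_rows if row.strip()]


def transform(rows):
    """Transform: parse each 'product;amount' row into a typed record."""
    records = []
    for row in rows:
        product, amount = row.split(";")
        records.append(SalesRecord(product.strip().upper(), float(amount)))
    return records


def load(records, warehouse):
    """Load: append the records to the warehouse table (here, a plain list)."""
    warehouse.extend(records)
    return warehouse


# Assumed export format of one operational system: 'product;amount'
raw = ["widget; 10.5", "", "gadget;3"]
warehouse = []
load(transform(extract(raw)), warehouse)
```

Once loaded, the same `warehouse` list could be read by any reporting component, which is the role the slide assigns to the DSS and MIS.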
Data Warehousing for Statistics [Diagram labels: STATISTICAL PRODUCTION LINE (SURVEYS, ADMIN DATA, RESOURCES, REGULATIONS, ELABORATION, OUTPUT), ETL, DATA WAREHOUSE, DSS (Decision Support System), MIS (Management Information System).] In an NSI, if the DW is mainly used for improving production efficiency, as in an enterprise, it is transversal to the statistical production line.
Data Warehousing for Statistics [Diagram labels: STATISTICAL PRODUCTION LINE (SURVEYS, ADMIN DATA, RESOURCES, REGULATIONS), ETL, DATA WAREHOUSE, DSS, MIS, SD (Statistical Dissemination).] In an NSI, if the DW is used both for “improving the production efficiency” (DSS-MIS) and for “creating the statistical product” (SD), then the DW is part of the production line. In this case, the DW can be considered a single logical repository, the center of the information factory, for all information generated by the NSI.
From the survey, two issues arise:
- Single coherent system (questions 6 to 13): 15 countries declare they do not have a single coherent system, although 11 of them are planning to change this... the situation will probably change substantially in the next five years... Current output requirements are not integrated into data systems for 10 countries, and the situation will probably change for half of them... Those who have a single coherent system do not want to change it; metadata and data input are totally integrated in the data system, as are admin data.
- Motivation to start a DW (question 14): The main motivations are linked to the ways to (re)use data, improved efficiency and process integration in business statistics production... Additional motivations are integrating the project into the organization's processing model, reducing the burden (cost and time) on survey respondents, and increasing consistency and quality.
Disadvantages of a stove-pipe-like production In a stove-pipe production system, every production line corresponds to a specific domain of statistics, together with its own production system. For each domain, the whole production process, from survey design to dissemination, takes place independently of other domains, and each has its own data suppliers and user groups. [Diagram labels: administrative data, survey data, Business Register, data integration, elaboration, statistical output; domains: Structural Business Statistics (SBS), Short Term business Statistics (STS), Information Society (IS), Science Technology Innovation (STI), I/O, ….]
Data Warehousing as a Single Coherent System In an NSI, a single coherent Data Warehousing System (DWSys) aims to improve production efficiency and to create the statistical products in a fully integrated way. From this view, the DWSys becomes the “effective” Information System of the full statistical production line. The term DWSys should therefore refer to the interaction between People, Business Processes, Data and Technology. The Statistical Data Warehouse (SDW) can then be seen as a central statistical data store, regardless of the data’s source, for managing all available data of interest, improving the NSI’s ability to:
- (re)use data to create new data/new outputs;
- perform reporting;
- execute analysis;
- produce the necessary information.
DWSys Architectural description A DWSys Architecture (DWA) for statistics is a rigorous description of the structure of the NSI production, which comprises DWSys components (business entities or sub-processes), the externally visible properties of those components, and the relationships (e.g. the behavior) between them. The DWA should be a framework for an NSI which defines how to organize the DWSys:
- provide the mechanisms for communicating information about the relationships that are important in the architecture;
- provide the discipline to gather and organize the data and construct the views in a way that helps ensure integrity, accuracy and completeness;
- support the application of methods and the use of tools.
Layers of the enterprise architecture In the context of creating an enterprise architecture, it is common to recognize four types of architecture, each corresponding to a particular architectural domain.
DWA – Business Domain To provide a DWA as detailed as possible, in the context of statistics production, we can articulate the business domain in four functional layers: data source layer, integration layer, interpretation and data analysis layer, access layer. Each layer has its own data domain structure:
- operational data, used for data warehousing;
- metadata, the descriptive data of the SDW, usually used to manage, describe and monitor the information systems.
DWA layered business architecture [Diagram labels: layers SOURCE, INTEGRATION, INTERPRETATION & DATA ANALYSIS, ACCESS; elements SURVEYS 1 to n, ADMIN DATA 1 to n, RESOURCES, REGULATIONS, STAGING AREA, BUSINESS REGISTER, PRIMARY DATA MART, DATA MART, DSS, MIS, STATISTICAL DISSEMINATION, META DATA MANAGEMENT.]
DWA - functional Layer Source Database Layer: This level is responsible for storing, physically or virtually, the data from internal (surveys) or external (archives) sources for statistical purposes. Typical data sources in the context of business statistics are specific surveys, such as STS, ICT, CIS and SBS, and archives such as those of the Customs Agency, the Revenue Agency, the Chambers of Commerce and the National Social Security Institute.
DWA - functional Layer Integration layer: It is used for all integration and reconciliation activities on the data sources. This layer contains the set of applications that perform the main ETL, which manages:
- inconsistent coding for the same object: consistency is obtained through the coding defined by the data warehouse;
- adjustment of different units of measurement and inconsistent formats;
- alignment of inconsistent labels, where the same object is named differently. Usually the data are identified according to the definitions contained in the metadata of the system;
- incomplete or incorrect data: in this case the operation may require human intervention to resolve issues that cannot be predicted a priori;
- data linking, in which different sources enable the creation of extended, or new, units of analysis.
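The integration tasks listed on this slide (recoding, unit adjustment, flagging incomplete data and data linking) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for the example: the `CODE_MAP` and `UNIT_FACTOR` tables, the field names and the sample records are hypothetical, not part of the ESSnet material.

```python
# Hypothetical code list defined by the warehouse: source code -> warehouse code
CODE_MAP = {"IT-ROMA": "RM", "Rome": "RM", "MI": "MI"}
# Hypothetical factors to convert turnover into euros
UNIT_FACTOR = {"EUR": 1.0, "kEUR": 1000.0}


def harmonize(record):
    """Recode and convert units; return None when human review is needed."""
    code = CODE_MAP.get(record["province"])
    factor = UNIT_FACTOR.get(record["unit"])
    if code is None or factor is None or record["turnover"] is None:
        return None  # incomplete or unrecognized data: flag for manual handling
    return {
        "id": record["id"],
        "province": code,
        "turnover_eur": record["turnover"] * factor,
    }


def link(survey, admin):
    """Data linking: join survey and admin records on a shared unit id."""
    admin_by_id = {r["id"]: r for r in admin}
    return [
        {**s, "employees": admin_by_id[s["id"]]["employees"]}
        for s in survey
        if s["id"] in admin_by_id
    ]


survey_raw = [
    {"id": "A1", "province": "Rome", "unit": "kEUR", "turnover": 2.5},
    {"id": "B2", "province": "??", "unit": "EUR", "turnover": 100.0},
]
admin = [{"id": "A1", "employees": 12}]

clean = [h for h in map(harmonize, survey_raw) if h is not None]
needs_review = [r["id"] for r in survey_raw if harmonize(r) is None]
linked = link(clean, admin)
```

Returning `None` instead of guessing a value mirrors the slide's point that some issues cannot be resolved automatically and need human intervention.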
DWA - functional Layer Interpretation and data analysis layer: The basic functions performed at this level are advanced analysis and interpretation of data elaborations, both based on statistical algorithms. Here “statistical expert users” operate to produce information of strategic value, working with the data at maximum granularity. Only a small number of users are allowed to access the data, in order to avoid degrading server performance. This is a strategy of “process of information delivery”, in which the demand for new statistical information does not involve the construction of new statistical production lines, but rather the creation of other data marts. The results of these activities are unplanned aggregate data for the next access layer, or software rules to be developed for the next iteration, delivered through data marts, which are regarded as subsets of the DW, usually oriented to a specific business line or team.
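As a rough illustration of a data mart as a subset of the DW, the Python sketch below filters micro-data for one business line and aggregates it for the access layer. The records, field names and the two helper functions are hypothetical and only serve the example.

```python
from collections import defaultdict

# Hypothetical micro-data held in the warehouse at maximum granularity
warehouse = [
    {"unit": "A1", "domain": "SBS", "nace": "C", "turnover": 100.0},
    {"unit": "B2", "domain": "SBS", "nace": "C", "turnover": 50.0},
    {"unit": "C3", "domain": "STS", "nace": "G", "turnover": 70.0},
]


def build_data_mart(rows, domain):
    """A data mart as a subset of the warehouse for one business line."""
    return [r for r in rows if r["domain"] == domain]


def aggregate(rows, key, value):
    """Sum `value` by `key`, producing aggregate data for the access layer."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)


sbs_mart = build_data_mart(warehouse, "SBS")
sbs_by_nace = aggregate(sbs_mart, "nace", "turnover")
```

A new information demand would then mean calling `build_data_mart` with a different selection, rather than building a new production line.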
DWA - functional Layer Access Layer: It is the layer for the final presentation of the information sought, addressed to a wide range of users, not necessarily experts in business statistics or in IT tools. The tools available are:
- Specialized Business Intelligence tools: this category is extensive in terms of solutions on the market, and includes tools to build queries and navigational tools (OLAP viewers), including Web browsers;
- Graphics and publishing tools: the Business Intelligence tools are able to generate graphs and tables for their users; this solution consists essentially of just a couple of steps, to avoid inefficiency;
- Office Automation tools: this is a reassuring solution for users who come to the data warehouse context for the first time, as they are not forced to learn new complex instruments. The problem is that this solution, while adequate with regard to productivity and efficiency, is very restrictive in the use of the data warehouse, since these instruments have significant architectural and functional limitations.
DWA – Modeling the Business Domain The designer's view of the business is also known as the analytical view, and there are various standards for modeling this view. One of the most commonly used modeling standards is the Generic Statistical Business Process Model (GSBPM). The GSBPM definition by UNECE (version 4) is: “The original intention was for the GSBPM to provide a basis for statistical organizations to agree on standard terminology to aid their discussions on developing statistical metadata systems and processes. The GSBPM should therefore be seen as a flexible tool to describe and define the set of business processes needed to produce official statistics”. So, in order to define a general and comprehensive architecture for statistical production, it may be useful to identify and locate the different phases of a generic statistical production process on the different functional levels of the DWA.
Generic Statistical Business Process Model (GSBPM)
DWA - Mapping the GSBPM on DWA The location of the sub-processes on an SDW architecture is represented graphically in the next slides, with the SDW functional layers on the horizontal axis and the nine GSBPM phases on the vertical axis. Each element inside the graph is a sub-process; we consider the 4th to the 7th GSBPM phases. This is only an example of Model Processing. Each case must be validated and discussed in its own operational context; this is just a basis for setting up and starting the modelling work for the next two years of the ESSnet. In this context, each sub-process must be regarded from a methodological, planning, technological or operational point of view. Blank sub-processes are related to methodological, or planning, metadata definitions, while brown sub-processes are related to operational, or technological, functions for data elaboration.
Designer's view - Mapping the GSBPM on DWA Sub-processes of the GSBPM allocated on the functional layers of the DWA. [Diagram columns: Source Layer, Integration Layer, Interpretation and analysis Layer, Access Layer.]
- 6 Analyze: 6.1-prepare draft output; 6.2-validate outputs; 6.3-scrutinize and explain; 6.4-apply disclosure control; 6.5-finalize outputs
- 7 Disseminate: 7.1-update output systems; 7.2-produce dissemination; 7.3-manage release of dissemination products; 7.4-promote dissemination; 7.5-manage user support
Designer's view - Mapping the GSBPM on DWA Sub-processes of the GSBPM allocated on the functional layers of the DWA. [Diagram columns: Source Layer, Integration Layer, Interpretation and analysis Layer, Access Layer.]
- 4 Collect: 4.1-select sample; 4.2-set up collection; 4.3-run collection; 4.4-finalize collection
- 5 Process: 5.1-integrate data; 5.2-classify & code; 5.3-review, validate & edit; 5.4-impute; 5.5-derive new variables and statistical units; 5.6-calculate weights; 5.7-calculate aggregate; 5.8-finalize data files
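The phase-by-layer matrix used on the two mapping slides could be held in a simple lookup structure, sketched here in Python. The allocation of sub-processes to layers below is purely illustrative: the positions shown in the original diagrams are not recoverable from this transcript, so each entry in `ALLOCATION` is an assumption that would need to be validated case by case.

```python
from enum import Enum


class Layer(Enum):
    SOURCE = "Source Layer"
    INTEGRATION = "Integration Layer"
    INTERPRETATION = "Interpretation and analysis Layer"
    ACCESS = "Access Layer"


# Illustrative allocation only (an assumption, not read from the slides).
# A sub-process may sit on more than one layer, hence the sets.
ALLOCATION = {
    "4.3-run collection": {Layer.SOURCE},
    "5.1-integrate data": {Layer.INTEGRATION},
    "5.2-classify & code": {Layer.INTEGRATION},
    "6.3-scrutinize and explain": {Layer.INTERPRETATION},
    "7.2-produce dissemination": {Layer.ACCESS},
}


def sub_processes_on(layer):
    """Return the sub-processes allocated to a layer, sorted by name."""
    return sorted(name for name, layers in ALLOCATION.items() if layer in layers)


def phase_of(sub_process):
    """Return the GSBPM phase number encoded in a sub-process name."""
    return int(sub_process.split(".")[0])


integration_work = sub_processes_on(Layer.INTEGRATION)
```

Storing a set of layers per sub-process keeps the structure open to sub-processes that span several layers, which the diagrams may well show.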
Graphic scheme of layered architecture with a focus on “statistical data”: Designer's view – Modeling the Data Domain
SDA – Modeling the Meta Data Domain Our purpose is to refer to an IT infrastructure for the SDW, so we consider only structured metadata, articulated as:
- Structural Metadata (SM): used for the description, identification and retrieval of statistical and quality information. They can also link the different components of the SDW;
- Process Metadata (PM): used to store information on data usage and on the maintenance of process administration, as well as the information needed for the automatic execution of workflows or management systems.
Both can be Active, when they enable operational use, manual or automated, by one or more processes, or Passive in all other uses.
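The two classifications on this slide (Structural versus Process, Active versus Passive) can be expressed as a small data structure. The Python sketch below is an assumption about one possible representation: the `MetadataItem` class, its `used_by` field and the sample items are hypothetical, and "active" is modelled simply as "used operationally by at least one process".

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    STRUCTURAL = "SM"
    PROCESS = "PM"


@dataclass(frozen=True)
class MetadataItem:
    """Hypothetical metadata entry; field names are illustrative."""
    name: str
    kind: Kind
    used_by: tuple = ()  # processes that use this item operationally

    @property
    def is_active(self):
        # Active when it enables operational use by one or more processes
        return len(self.used_by) > 0


items = [
    MetadataItem("nace_code_list", Kind.STRUCTURAL, ("5.2-classify & code",)),
    MetadataItem("turnover_definition", Kind.STRUCTURAL),
    MetadataItem("nightly_load_schedule", Kind.PROCESS, ("5.1-integrate data",)),
]

active = [item.name for item in items if item.is_active]
passive = [item.name for item in items if not item.is_active]
```

Deriving Active or Passive from usage, rather than storing it as a flag, means an item changes status as soon as a process starts or stops relying on it.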
Graphic scheme of layered architecture with a focus on “meta data”: Designer's view - Modeling the Meta Data Domain
Conclusion We have contextualized statistical production in a Data Warehousing Architecture, and have thereby introduced a general Enterprise Architecture vision for an SDW production system. We have shown how the GSBPM representation can be used for modelling the business domain of the SDW layered architecture, giving a complete operational view for the deployment of statistical production cases. Finally, we have shown the corresponding four-level data domain of the architecture for a Statistical Data Warehouse.