
the Need for Data Integration



Presentation on theme: "the Need for Data Integration"— Presentation transcript:

1 the Need for Data Integration
Data, Data everywhere: the Need for Data Integration
Nicolas Spyratos, Professor Emeritus, University of Paris South, France
«Data, data everywhere»: The Economist, February 25, 2010

2 the relevant questions
What is data integration? Collecting and combining information from multiple sources into a single information source.
Why is it needed? To get more informative answers to important questions and/or to analyse the information in order to make decisions.
How is it done? By following a well-disciplined approach, not necessarily using computers.
What are the technical problems when using computers to do it? Many, difficult and costly.
What are the prerequisites for data integration? Datasets should be open and preferably linked.

3 Example: writing a summary report on rice-production, transportation and commercialization
(using available information from Japan and Vietnam)
[Diagram: datasets (Japan, in Japanese; Vietnam, in Vietnamese) → translation into Thai → integration (in Thai) → decision making (Thai minister of agriculture)]
One important difficulty, though: the data sources are autonomous, heterogeneous and geographically dispersed.

4 sharing the integrated information
[Diagram: the integrated information from Japan and Vietnam is shared among the minister of agriculture, the minister of transport and the minister of commerce]
This is the concept of data integration, independently of whether we use computers or not.

5 let’s summarize (before going to computer-assisted integration)
[Diagram: datasets (Japan, Vietnam) → translators → integrator → specialists → decision makers]
Translators need to know the language of the dataset and the language of the integrator.
We can now replace all intermediate activities with software modules, and either store the knowledge of the integrator in a database (called a data warehouse) or “simulate” it with software (called a mediator).

6 using computers for data integration – the data warehouse approach (data in advance)
[Diagram: dataset-1 ... dataset-n → Translator-1 ... Translator-n → Integrator → data warehouse (database plus metadata) → Data Marts → decision makers and analysts. Analogy with a supply chain: production of “goods” → processing/transport → wholesaler → distribution (to retailers) → consumption]
datasets: databases, file systems, tweet sets, XML docs, etc.
translators: extract/transform
integrator: filters/loads
data warehouse: stores integrated data and answers queries
specialists: filter/answer
decision makers and analysts
Real-world example: the Walmart data warehouse contains 2.5 petabytes of data.
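The warehouse pipeline above (translators extract and transform, the integrator loads, analysts query in read-only mode) can be sketched roughly as follows. This is a minimal illustration, not the architecture of any real warehouse: the two source datasets, their schemas, the table name `rice_production` and all values are made-up assumptions, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Two toy source datasets with different schemas (all values are made up).
SOURCE_A = [{"year": 2009, "rice_tonnes": 120.0}, {"year": 2010, "rice_tonnes": 130.0}]
SOURCE_B = [("2009", 45000.0), ("2010", 50000.0)]  # (year as text, rice in kilograms)


def translate_a(rows):
    """Translator for source A: values are already in tonnes, only reshape."""
    return [("A", r["year"], r["rice_tonnes"]) for r in rows]


def translate_b(rows):
    """Translator for source B: cast the year and convert kilograms to tonnes."""
    return [("B", int(year), kg / 1000.0) for year, kg in rows]


def load_warehouse(conn, translated_batches):
    """Integrator: load every translated batch into one common table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rice_production "
        "(source TEXT, year INTEGER, tonnes REAL)"
    )
    for batch in translated_batches:
        conn.executemany("INSERT INTO rice_production VALUES (?, ?, ?)", batch)
    conn.commit()


def total_by_year(conn):
    """A read-only analytical query, of the kind a data mart might expose."""
    cur = conn.execute(
        "SELECT year, SUM(tonnes) FROM rice_production GROUP BY year ORDER BY year"
    )
    return cur.fetchall()


def build_demo_warehouse():
    """Run the whole pipeline against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load_warehouse(conn, [translate_a(SOURCE_A), translate_b(SOURCE_B)])
    return conn
```

Calling `total_by_year(build_demo_warehouse())` returns one row per year with the production of both sources summed in a common unit. The point of the sketch is that the data is integrated in advance: the query never touches the sources.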

7 a few remarks about data warehouses
A data warehouse is above all a database, but of a specific nature:
its users are mainly analysts and decision makers (i.e. non-computer specialists)
it is accessed in read-only mode (usually through data marts)
updates happen only at the source datasets and are propagated to the data warehouse periodically
it stores mostly historical data (usually records), therefore the data volumes are orders of magnitude higher than in traditional databases (the Walmart data warehouse stores 2.5 petabytes of data, i.e. 167 times the information contained in all the books in the US Library of Congress)

8 using computers for data integration – the mediator approach (data on demand)
[Diagram: dataset-1 ... dataset-n → Translator-1 ... Translator-n → mediator (software module) → decision makers and analysts]
datasets: databases, file systems, tweet sets, XML docs, etc.
translators: extract/transform
mediator: query decomposition, synthesis of answers
decision makers and analysts
Real-world example: mediating a car dealers network
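In contrast to a warehouse, a mediator stores nothing and fetches data on demand. The sketch below illustrates the idea on a toy car dealers network; the dealer data, the function names (`wrap_dealer_1`, `wrap_dealer_2`, `mediate`) and the single-parameter query are all hypothetical simplifications, with in-memory lists standing in for remote sources.

```python
# Two toy sources standing in for car dealers (all names and values are made up).
DEALER_1 = [{"model": "alpha", "price_eur": 9000}, {"model": "beta", "price_eur": 15000}]
DEALER_2 = [("gamma", 12000.0), ("alpha", 8500.0)]  # (model, price in euros)


def wrap_dealer_1(max_price):
    """Translator 1: answer the sub-query against dealer 1's own format."""
    return [
        {"dealer": "dealer-1", "model": r["model"], "price": float(r["price_eur"])}
        for r in DEALER_1
        if r["price_eur"] <= max_price
    ]


def wrap_dealer_2(max_price):
    """Translator 2: same sub-query, different source format."""
    return [
        {"dealer": "dealer-2", "model": model, "price": price}
        for model, price in DEALER_2
        if price <= max_price
    ]


def mediate(max_price, translators=(wrap_dealer_1, wrap_dealer_2)):
    """Mediator: decompose the query per source, then synthesize one answer."""
    answers = []
    for translator in translators:  # query decomposition: one sub-query per source
        answers.extend(translator(max_price))
    return sorted(answers, key=lambda a: a["price"])  # synthesis: merge and rank
```

For example, `mediate(12000)` asks both dealers for cars costing at most 12000 and returns the offers merged into one list, cheapest first. Every call goes back to the sources, so answers are always current, at the cost of querying each source each time.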

9 a few remarks on mediators
A mediator is not a database but a software module that allows querying multiple sources:
its users are mainly analysts and decision makers (i.e. non-computer specialists)
it answers the queries of its users
users cannot update through the mediator (as is also the case with data warehouses)
it does not store data, it just answers queries
the translators are complex pieces of software, and writing generic translators is hard

10 prerequisites for data integration
(whether in data warehousing or mediating)
A minimal requirement for data integration is that the datasets should be collections of data, published or curated by a single agent, and available for access or download in one or more formats.
Example of such a dataset: the Credit Institutions Register. It is published by the European Banking Authority (EBA) and contains a list of credit institutions to which authorization has been granted to operate within the European Union and European Economic Area (EEA) countries.
If the datasets to be integrated are also linked and open, then integration can release social and commercial value (e.g. through data mining in integrated datasets).

11 linked data
Linked data is about publishing and connecting structured data on the Web, using standard Web technologies (such as HTTP, RDF and URIs) to make the connections readable by computers. This enables data from different sources to be connected and queried, allowing for better interpretation and analysis.
An open dataset is a collection of data that can be freely used, modified, and shared by anyone for any purpose.
Most datasets of the web are linked (e.g. DBpedia).
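To make the role of URIs concrete, here is a small sketch in plain Python of how two independently published datasets can be connected when they use the same identifier for the same thing. It models RDF-style triples as tuples and does not use an RDF library; all URIs are illustrative placeholders under example.org, and the function name is hypothetical.

```python
# Triples are (subject, predicate, object). All URIs below are illustrative
# placeholders under example.org, not real published identifiers.
EX = "http://example.org/"

# Dataset A: names of countries, keyed by a country URI.
DATASET_A = [
    (EX + "country/JP", EX + "prop/name", "Japan"),
    (EX + "country/VN", EX + "prop/name", "Vietnam"),
]

# Dataset B: reports that point to the same country URIs.
DATASET_B = [
    (EX + "report/1", EX + "prop/about", EX + "country/JP"),
    (EX + "report/2", EX + "prop/about", EX + "country/VN"),
    (EX + "report/3", EX + "prop/about", EX + "country/JP"),
]


def reports_by_country_name(names, reports):
    """Join the two datasets on the shared country URI."""
    name_of = {s: o for s, p, o in names if p == EX + "prop/name"}
    result = {}
    for s, p, o in reports:
        if p == EX + "prop/about" and o in name_of:
            result.setdefault(name_of[o], []).append(s)
    return result
```

`reports_by_country_name(DATASET_A, DATASET_B)` groups the report URIs under the country names. No translator is needed for the join itself, because both datasets already agree on the identifier of each country.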

12 open data
A dataset is called open if it can be freely used, modified, and shared by anyone for any purpose.
Most datasets of the web are not open (or, if they are, their quality is low).
The following site contains a list of open datasets, most of which have been closed!
However, within controlled user communities, openness is extremely useful (e.g. collaborative working environments, big companies, government agencies).

13 concluding remarks
Data integration is a basic tool in a large number of social and commercial activities (e.g. hotel or airplane bookings, e-learning, digital libraries, e-Government, etc.).
Data warehouses and mediators constitute the common supporting technology for data integration.
Data integration is especially important to governments, where large amounts of data reside in isolated information silos.
Linking, integrating and opening government data can help drive the creation of innovative businesses and services that deliver social and commercial value.

14 thank you for your attention

