Data, Data everywhere: the Need for Data Integration

Presentation transcript:

Data, Data everywhere: the Need for Data Integration
Nicolas Spyratos, Professor Emeritus, University of Paris South, France
«Data, data everywhere»: The Economist, February 25, 2010

the relevant questions
What is data integration? Collecting and combining information from multiple sources into a single information source.
Why is it needed? To get more informative answers to important questions and/or to analyse the combined information to make decisions.
How is it done? By following a well-disciplined approach, not necessarily using computers.
What are the technical problems when using computers to do it? Many, difficult and costly.
What are the prerequisites for data integration? Datasets should be open and preferably linked.

Example: writing a summary report on rice production, transportation and commercialization (using available information from Japan and Vietnam)
[Diagram: datasets (Japan, Vietnam) → translation (Japanese to Thai, Vietnamese to Thai) → integration (in Thai) → decision making (minister of agriculture)]
One important difficulty, though: the data sources are autonomous, heterogeneous and geographically dispersed.

sharing the integrated information
[Diagram: the information integrated from the Japan and Vietnam datasets is shared among the minister of agriculture, the minister of transport and the minister of commerce]
This is the concept of data integration, independently of whether we use computers or not.

let’s summarize (before going to computer-assisted integration)
[Diagram: datasets (Japan, Vietnam) → translators → integrator → specialists → decision makers]
Translators need to know the language of the dataset and the language of the integrator.
We can now replace all intermediate activities with software modules and either store the knowledge of the integrator in a database (called a data warehouse) or “simulate” it with software (called a mediator).

using computers for data integration: the data warehouse approach (data in advance)
[Diagram: dataset-1 ... dataset-n → Translator-1 ... Translator-n → Integrator → database (with metadata) → Data Marts → decision makers and analysts]
Analogy with a supply chain: production of “goods”, processing/transport, wholesaler, distribution (to retailers), consumption.
datasets: databases, file systems, tweet sets, XML docs, etc.
translators: extract/transform
integrator: filters/loads
data warehouse: stores integrated data and answers queries
specialists: filter/answer
decision makers and analysts
Real-world example: the Walmart data warehouse contains 2.5 petabytes of data.
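The "data in advance" pipeline on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the architecture of any real warehouse: the two source datasets, their field names, the `production` table and all the numbers are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Two hypothetical source datasets with different field names and units.
# All values are made up for illustration.
SOURCE_A = [{"year": 2008, "rice_tonnes": 100.0}, {"year": 2009, "rice_tonnes": 110.0}]
SOURCE_B = [{"yr": "2008", "rice_kg": 50000.0}, {"yr": "2009", "rice_kg": 70000.0}]


def translate_a(rows):
    """Translator for source A: fields already match, so just tag the origin."""
    return [("A", r["year"], r["rice_tonnes"]) for r in rows]


def translate_b(rows):
    """Translator for source B: rename fields, cast the year, convert kg to tonnes."""
    return [("B", int(r["yr"]), r["rice_kg"] / 1000.0) for r in rows]


def load_warehouse(conn, translated_batches):
    """Integrator: filter out unusable rows and load the rest into one table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS production (source TEXT, year INTEGER, tonnes REAL)"
    )
    for batch in translated_batches:
        clean = [row for row in batch if row[2] >= 0]  # simple filter step
        conn.executemany("INSERT INTO production VALUES (?, ?, ?)", clean)
    conn.commit()


def total_by_year(conn):
    """A read-only analytical query answered from the integrated data."""
    cur = conn.execute(
        "SELECT year, SUM(tonnes) FROM production GROUP BY year ORDER BY year"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # stands in for the data warehouse
load_warehouse(conn, [translate_a(SOURCE_A), translate_b(SOURCE_B)])
totals = total_by_year(conn)
print(totals)
```

The point of the sketch is the ordering: data is translated and loaded before any question is asked, so the query at the end touches only the integrated table and never the sources.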

a few remarks about data warehouses
A data warehouse is above all a database, but of a specific nature:
its users are mainly analysts and decision makers (i.e. not computer specialists)
it is accessed in read-only mode (usually through data marts)
updates happen only at the source datasets and are propagated to the data warehouse periodically
it stores mostly historical data (usually records), so data volumes are orders of magnitude higher than in traditional databases (the Walmart data warehouse stores 2.5 petabytes of data, i.e. 167 times the information contained in all the books in the US Library of Congress)
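As a quick sanity check on the comparison quoted above, the two figures from the slide imply a size for the book collection. The sketch below assumes decimal units (1 petabyte = 1000 terabytes); the variable names are illustrative only.

```python
# Figures taken from the slide; the unit convention is an assumption.
WAREHOUSE_PB = 2.5        # size of the data warehouse, in petabytes
LIBRARY_MULTIPLE = 167    # "167 times the information in all the books"
TB_PER_PB = 1000          # assumption: decimal units

implied_library_tb = WAREHOUSE_PB * TB_PER_PB / LIBRARY_MULTIPLE
print(f"Implied size of the book collection: about {implied_library_tb:.1f} TB")
```

Under that assumption, the slide's ratio corresponds to a book collection of roughly 15 terabytes.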

using computers for data integration: the mediator approach (data on demand)
[Diagram: dataset-1 ... dataset-n → Translator-1 ... Translator-n → mediator (software module) → decision makers and analysts]
datasets: databases, file systems, tweet sets, XML docs, etc.
translators: extract/transform
mediator: query decomposition, synthesis of answers
decision makers and analysts
Real-world example: mediating a car dealers network.
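The "data on demand" idea can be sketched as follows, loosely following the car dealers example. Everything here is hypothetical: the two dealer sources, their schemas, the prices and the function names are invented for illustration, and a real mediator would also have to handle query languages, source failures and schema mapping.

```python
# Hypothetical dealer sources, each with its own schema. Values are made up.
DEALER_1 = [
    {"model": "hatchback", "price_eur": 9000},
    {"model": "sedan", "price_eur": 15000},
]
DEALER_2 = [{"type": "sedan", "price_cents": 1400000}]


def translator_1(max_price_eur):
    """Run the sub-query on dealer 1 and return rows in the common schema."""
    return [
        {"dealer": "dealer-1", "model": c["model"], "price_eur": c["price_eur"]}
        for c in DEALER_1
        if c["price_eur"] <= max_price_eur
    ]


def translator_2(max_price_eur):
    """Run the sub-query on dealer 2, converting cents to euros on the way out."""
    return [
        {"dealer": "dealer-2", "model": c["type"], "price_eur": c["price_cents"] / 100}
        for c in DEALER_2
        if c["price_cents"] <= max_price_eur * 100
    ]


def mediate(max_price_eur, translators):
    """Send one sub-query per source, then synthesize a single answer."""
    partial_answers = [t(max_price_eur) for t in translators]  # query decomposition
    merged = [row for answer in partial_answers for row in answer]  # synthesis
    return sorted(merged, key=lambda row: row["price_eur"])


offers = mediate(15000, [translator_1, translator_2])
for offer in offers:
    print(offer)
```

Unlike the warehouse sketch, nothing is stored: each call to `mediate` goes back to the sources, so answers reflect their current content at the cost of querying every source each time.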

a few remarks on mediators
A mediator is not a database but a software module that allows querying multiple sources:
its users are mainly analysts and decision makers (i.e. not computer specialists)
it answers the queries of its users
users cannot update the sources through the mediator (as is also the case with data warehouses)
it does not store data; it just answers queries
the translators are complex pieces of software, and writing generic translators is hard

prerequisites for data integration (whether in data warehousing or mediating)
A minimal requirement for data integration is that each dataset should be a collection of data, published or curated by a single agent, and available for access or download in one or more formats.
Example of such a dataset: the Credit Institutions Register. It is published by the European Banking Authority (EBA) and contains a list of credit institutions that have been granted authorization to operate within the European Union and European Economic Area (EEA) countries.
If the datasets to be integrated are also linked and open, then integration can release social and commercial value (e.g. through data mining in integrated datasets).

linked data
Linked data is about publishing and connecting structured data on the Web, using standard Web technologies (such as HTTP, RDF and URIs) to make the connections readable by computers. This enables data from different sources to be connected and queried, allowing for better interpretation and analysis.
An open dataset is a collection of data that can be freely used, modified, and shared by anyone for any purpose.
Most datasets of the web are linked (ex: DBPedia).
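A minimal sketch of why shared URIs make connection easy: data is modelled as RDF-style triples (subject, predicate, object) held in plain Python sets, without any RDF library. The `http://example.org/` URIs, the property names and the two sources are all hypothetical.

```python
# Hypothetical namespace; none of these URIs refer to real published data.
EX = "http://example.org/"

# Triples (subject, predicate, object) from two independent publishers.
SOURCE_1 = {
    (EX + "country/VN", EX + "prop/name", "Vietnam"),
    (EX + "country/JP", EX + "prop/name", "Japan"),
}
SOURCE_2 = {
    (EX + "report/1", EX + "prop/about", EX + "country/VN"),
    (EX + "report/1", EX + "prop/topic", "rice production"),
}


def objects(graph, subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def report_country_names(graph, report):
    """Follow report -> country -> name across what used to be two sources."""
    names = set()
    for country in objects(graph, report, EX + "prop/about"):
        names |= objects(graph, country, EX + "prop/name")
    return names


# Both sources use the same URI for Vietnam, so merging is a plain set union.
graph = SOURCE_1 | SOURCE_2
names = report_country_names(graph, EX + "report/1")
print(names)
```

The link in `SOURCE_2` points at a URI that `SOURCE_1` describes, so a query can cross from one publisher's data to the other's without any translation step.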

open data
A dataset is called open if it can be freely used, modified, and shared by anyone for any purpose.
Most datasets of the web are not open (or, if they are, their quality is low).
The following site contains a list of open datasets, most of which have been closed: https://blog.bigml.com/list-of-public-data-sources-fit-for-machine-learning/
However, within controlled user communities, openness is extremely useful (e.g. collaborative working environments, big companies, government agencies).

concluding remarks
Data integration is a basic tool in a large number of social and commercial activities (e.g. hotel or airplane bookings, e-learning, digital libraries, e-Government, etc.).
Data warehouses and mediators constitute the common supporting technology for data integration.
Data integration is especially important to governments, where large amounts of data reside in isolated information silos.
Linking, integrating and opening government data can help drive the creation of innovative businesses and services that deliver social and commercial value.

thank you for your attention