PRINCIPLES OF DATA INTEGRATION
AnHai Doan, Alon Halevy, Zachary Ives
Chapter 10: Data Warehousing & Caching

Data Warehousing and Materialization
- We have mostly focused on techniques for virtual data integration (see Ch. 1)
  - Queries are composed with mappings on the fly and data is fetched on demand
  - This represents one extreme point
- In this chapter, we consider cases where data is transformed and materialized "in advance" of the queries
- The main scenario: the data warehouse

What Is a Data Warehouse?
- In many organizations, we want a central "store" of all of our entities, concepts, metadata, and historical information
  - For doing data validation, complex mining, analysis, prediction, …
- This is the data warehouse
- To this point we've focused on scenarios where the data "lives" in the sources; here we may have a "master" version (and archival version) in a central database
  - For performance reasons, availability reasons, archival reasons, …

In the Rest of this Chapter…
- The data warehouse as the master data instance
- Data warehouse architectures, design, loading
- Data exchange: declarative data warehousing
- Hybrid models: caching and partial materialization
- Querying externally archived data

Outline  The data warehouse  Motivation: Master data management  Physical design  Extract/transform/load  Data exchange  Caching & partial materialization  Operating on external data

Master Data Management
- One of the "modern" uses of the data warehouse is not only to support analytics but to serve as a reference to all of the entities in the organization
- A cleaned, validated repository of what we know
  - … which can be linked to by data sources
  - … which may help with data cleaning
  - … and which may be the basis of data governance (processes by which data is created and modified in a systematic way, e.g., to comply with government regulations)
- There is an emerging field called master data management around the process of creating these repositories

Data Warehouse Architecture
- At the top: a centralized database
  - Generally configured for queries and appends, not transactions
  - Many indices, materialized views, etc.
- Data is loaded and periodically updated via Extract/Transform/Load (ETL) tools
[Figure: ETL pipelines feed the central data warehouse from sources such as RDBMSs, HTML pages, and XML files]

ETL Tools
- ETL tools are the equivalent of schema mappings in virtual integration, but are more powerful
- Arbitrary pieces of code that take data from a source and convert it into data for the warehouse:
  - import filters: read and convert from data sources
  - data transformations: join, aggregate, filter, convert data
  - de-duplication: finds multiple records referring to the same entity and merges them (a small sketch follows below)
  - profiling: builds tables, histograms, etc. to summarize data
  - quality management: tests against master values, known business rules, constraints, etc.
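As a concrete illustration of one of these steps, here is a minimal de-duplication sketch in Python. The record fields ("name", "email"), the normalization key, and the merge rule are all hypothetical choices of ours, not the book's:

```python
# Minimal de-duplication sketch: group records that refer to the same
# entity by a normalized key, then merge each group into one record.
# Field names and the merge rule are hypothetical.
from collections import defaultdict

def normalize(name: str) -> str:
    return " ".join(name.lower().split())

def dedup(records):
    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec["name"])].append(rec)
    merged = []
    for group in groups.values():
        base = dict(group[0])
        for dup in group[1:]:
            for k, v in dup.items():
                base.setdefault(k, v)   # fill missing fields from duplicates
        merged.append(base)
    return merged

print(dedup([{"name": "Ann  Smith"}, {"name": "ann smith", "email": "a@x.org"}]))
# -> one merged Ann Smith record with the email filled in
```

Real ETL suites offer far richer matching (edit distance, learned matchers, blocking); this only shows the group-then-merge shape of the operation.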

Example ETL Tool Chain
- This is an example for e-commerce loading
- Note multiple stages of filtering (using selection or join-like operations), logging bad records, before we group and load
[Figure: invoice line items are split into date-time and item fields; invalid dates/times and invalid items are filtered out and logged; the rest are joined with item records; non-matching customers are filtered out and logged; finally the records are grouped by customer to update customer balances and customer records]
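A hedged rendering of that chain as straight-line Python follows; every field name, the in-memory "tables", and the logging target are hypothetical stand-ins for what a real ETL tool would configure:

```python
# Sketch of the chain above: filter invalid records (logging them),
# join against item records, filter non-matching customers, then
# group by customer and load. All names are hypothetical.
from datetime import datetime
from collections import defaultdict

items = {"i1": {"price": 9.99}}          # item records
customers = {"c1": {"balance": 0.0}}     # customer records
invalid_log = []                          # bad records land here

def load_invoice_lines(lines):
    balances = defaultdict(float)
    for line in lines:
        try:
            datetime.fromisoformat(line["when"])   # filter invalid dates/times
        except (KeyError, ValueError):
            invalid_log.append(("bad date/time", line)); continue
        item = items.get(line.get("item"))          # join with item records
        if item is None:
            invalid_log.append(("bad item", line)); continue
        cust = line.get("customer")
        if cust not in customers:                   # filter non-matching customers
            invalid_log.append(("bad customer", line)); continue
        balances[cust] += item["price"] * line.get("qty", 1)
    for cust, total in balances.items():            # group by customer, load
        customers[cust]["balance"] += total

load_invoice_lines([{"when": "2012-01-05", "item": "i1", "customer": "c1", "qty": 2}])
print(customers)   # -> {'c1': {'balance': 19.98}}
```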

Basic Data Warehouse: Summary
- Two aspects:
  - A central DBMS optimized for appends and querying
    - The "master data" instance
    - Or the instance for doing mining, analytics, and prediction
  - A set of procedural ETL "pipelines" to fetch, transform, filter, clean, and load data
- Often these tools are more expressive than standard conjunctive queries (as in Chapters 2-3)
  - … but not always! This raises a question: can we do warehousing with declarative mappings?

Outline
- The data warehouse
- Data exchange
- Caching & partial materialization
- Operating on external data

Data Exchange
- Intuitively, a declarative setup for data warehousing
  - Declarative schema mappings as in Ch. 2-3
  - Materialized database as in the previous section
- Also allows for unknown values when we map from source to target (warehouse) instance
  - If we know a professor teaches a student, then there must exist a course C that the student took and the professor taught, but we may not know which…

Data Exchange Formulation
A data exchange setting (S, T, M, C_T) has:
- S, a source schema representing all of the source tables jointly
- T, a target schema
- M, a set of mappings or tuple-generating dependencies relating S and T
- C_T, a set of constraints over the target (equality-generating dependencies)
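In LaTeX, the two kinds of dependencies have the standard logical shape (this rendering is ours, not reproduced from the slide):

```latex
% Tuple-generating dependency (tgd): a conjunction of source atoms
% implies a conjunction of target atoms, possibly with existentials.
\[ \forall \bar{x}\; \big( \phi_S(\bar{x}) \rightarrow \exists \bar{y}\; \psi_T(\bar{x}, \bar{y}) \big) \]

% Equality-generating dependency (egd): a conjunction of target atoms
% forces two of its variables to be equal.
\[ \forall \bar{x}\; \big( \phi_T(\bar{x}) \rightarrow x_i = x_j \big) \]
```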

An Example
Source schema S has:
- Teaches(prof, student)
- Adviser(adviser, student)

Target schema T has:
- Advise(adviser, student)
- TeachesCourse(prof, course)
- Takes(course, student)

(On the slide, existential variables in the mappings represent unknowns.)
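The mapping formulas themselves appear only as an image on the original slide; a plausible reconstruction, consistent with the instances shown on the next two slides (hedged, our own rendering), is:

```latex
% Every adviser fact is copied to the target.
\[ Adviser(a, s) \rightarrow Advise(a, s) \]

% If a professor teaches a student, there must exist some course
% that the professor taught and the student took.
\[ Teaches(p, s) \rightarrow \exists c\; \big( TeachesCourse(p, c) \wedge Takes(c, s) \big) \]
```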

The Data Exchange Solution
- The goal of data exchange is to compute an instance of the target schema, given a data exchange setting D = (S, T, M, C_T) and an instance I(S)
- An instance J of schema T is a data exchange solution for D and I if:
  1. the pair (I, J) satisfies the schema mapping M, and
  2. J satisfies the constraints C_T

Back to the Example, Now with Data

Instance I(S) has:
  Teaches:  (Ann, Bob), (Chloe, David)
  Adviser:  (Ellen, Bob), (Felicia, David)

Instance J(T) has:
  Advise:        (Ellen, Bob), (Felicia, David)
  TeachesCourse: (Ann, C1), (Chloe, C2)
  Takes:         (C1, Bob), (C2, David)

Variables, or labeled nulls (C1, C2), represent unknown values.

This Is Also a Solution

Instance I(S) has:
  Teaches:  (Ann, Bob), (Chloe, David)
  Adviser:  (Ellen, Bob), (Felicia, David)

Instance J(T) has:
  Advise:        (Ellen, Bob), (Felicia, David)
  TeachesCourse: (Ann, C1), (Chloe, C1)
  Takes:         (C1, Bob), (C1, David)

This time the labeled nulls are all the same!

Universal Solutions
- Intuitively, the first solution should be better than the second
  - The second solution uses the same variable for the course taught by Ann and by Chloe, asserting they are the same course
  - But this was not specified in the original mappings!
- We formalize this through the notion of the universal solution, which must not lose any information

Formalizing the Universal Solution
First we define instance homomorphism:
- Let J1, J2 be two instances of schema T
- A mapping h: J1 → J2 is a homomorphism from J1 to J2 if
  - h(c) = c for every constant c ∈ C, and
  - for every tuple R(a1, …, an) ∈ J1, the tuple R(h(a1), …, h(an)) ∈ J2
- J1, J2 are homomorphically equivalent if there are homomorphisms h: J1 → J2 and h′: J2 → J1

Definition (universal solution): given a data exchange setting D = (S, T, M, C_T) and an instance I of S, a data exchange solution J for D and I is a universal solution if, for every other data exchange solution J′ for D and I, there exists a homomorphism h: J → J′

Computing Universal Solutions
- The standard process is to use a procedure called the chase
- Informally:
  - Consider every formula r of M in turn:
    - If there is a variable substitution for the left-hand side (lhs) of r for which the right-hand side (rhs) is not yet in the solution, add it
    - If we create a new tuple, substitute a fresh new variable for every existential variable in the rhs
- See Algorithm 10 in Chapter 10 for the full pseudocode
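A toy, naive chase in Python follows; the representation (atoms as (relation, terms) pairs, lowercase strings as variables) and the "N1, N2, …" null naming are our own conventions, and egds and chase termination are ignored, so this is a sketch rather than the book's Algorithm 10:

```python
# Naive chase: repeatedly find a substitution satisfying a tgd's lhs
# but not (yet) its rhs, and add the missing tuples, inventing fresh
# labeled nulls for existential variables.
import itertools

fresh = itertools.count(1)

def is_var(term):
    return isinstance(term, str) and term.islower()

def matches(atoms, instance, sub=None):
    """Yield substitutions mapping every atom into the instance."""
    sub = sub or {}
    if not atoms:
        yield dict(sub)
        return
    rel, terms = atoms[0]
    for tup in instance.get(rel, set()):
        s = dict(sub)
        if all(s.setdefault(t, v) == v if is_var(t) else t == v
               for t, v in zip(terms, tup)):
            yield from matches(atoms[1:], instance, s)

def chase(instance, tgds):
    changed = True
    while changed:
        changed = False
        for lhs, rhs in tgds:
            for sub in list(matches(lhs, instance)):
                if any(True for _ in matches(rhs, instance, sub)):
                    continue                        # rhs already satisfied
                for rel, terms in rhs:              # fire the tgd
                    for t in terms:
                        if is_var(t) and t not in sub:
                            sub[t] = f"N{next(fresh)}"   # fresh labeled null
                    instance.setdefault(rel, set()).add(
                        tuple(sub[t] if is_var(t) else t for t in terms))
                changed = True
    return instance

I = {"Teaches": {("Ann", "Bob"), ("Chloe", "David")},
     "Adviser": {("Ellen", "Bob"), ("Felicia", "David")}}
tgds = [
    ([("Adviser", ("a", "s"))], [("Advise", ("a", "s"))]),
    ([("Teaches", ("p", "s"))],
     [("TeachesCourse", ("p", "c")), ("Takes", ("c", "s"))]),
]
print(chase(I, tgds))
```

On the running example this produces exactly the first (universal) solution shown earlier, up to renaming of the labeled nulls.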

Core Universal Solutions
- Universal solutions may be of arbitrary size
- The core universal solution is the minimal universal solution; it is unique up to renaming of the labeled nulls

Data Exchange and Querying
- As with the data warehouse, all queries are posed directly over the target database; no reformulation is necessary
- However, we typically assume certain answers semantics
- To get the certain answers (which are the same as in the virtual integration setting with GLAV/tgd mappings): compute the query answers, then drop any tuples containing labeled nulls (variables)
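Continuing the toy conventions from the chase sketch above (labeled nulls spelled "N1", "N2", …, a convention of ours), certain-answer filtering is just a tuple filter over the query result:

```python
# Drop any answer tuple that contains a labeled null ("N1", "N2", ...).
# The null-spotting convention matches the toy chase sketch above.
def is_null(v):
    return isinstance(v, str) and v.startswith("N") and v[1:].isdigit()

def certain_answers(tuples):
    return {t for t in tuples if not any(is_null(v) for v in t)}

print(certain_answers({("Ann", "N1"), ("Ellen", "Bob")}))
# -> {('Ellen', 'Bob')}
```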

Data Exchange vs. Warehousing
- From an external perspective, exchange and warehousing are essentially equivalent
- But there are different trade-offs between procedural and declarative mappings:
  - Procedural: more expressive
  - Declarative: easier to reason about, compose, invert, create materialized views for, etc. (see Chapter 6)

Outline
- The data warehouse
- Data exchange
- Caching & partial materialization
- Operating on external data

The Spectrum of Materialization
- Many real EII systems compute and maintain materialized views, or cache results
- A "hybrid" point between the fully virtual and fully materialized approaches
[Figure: a spectrum from virtual integration (EII, only sources materialized), through caching or partial materialization (some views materialized), to data exchange / data warehouse (all mediated relations materialized)]

Possible Techniques for Choosing What to Materialize
- Cache the results of prior queries (see the sketch after this list)
  - Take the results of each query and materialize them
  - Use answering queries using views to reuse them
  - Expire using a time-to-live… may not always be fresh!
- Administrator-selected views
  - Someone manually specifies views to compute and maintain, as with a relational DBMS
  - The system automatically maintains them
- Automatic view selection
  - Using the query workload and update frequencies, a view-materialization wizard chooses what to materialize
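A toy query-result cache with time-to-live expiry, to make the first option concrete; keying on the exact query string (rather than reusing results via answering queries using views) and the eviction policy are simplifications of ours:

```python
# Toy query-result cache with TTL expiry. A real system would match
# new queries against cached views, not just exact query strings.
import time

class QueryCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.entries = {}          # query text -> (expiry time, results)

    def get(self, query: str):
        hit = self.entries.get(query)
        if hit and hit[0] > time.monotonic():
            return hit[1]          # still fresh
        self.entries.pop(query, None)
        return None                # miss or expired; caller re-executes

    def put(self, query: str, results):
        self.entries[query] = (time.monotonic() + self.ttl, results)

cache = QueryCache(ttl_seconds=60.0)
cache.put("SELECT * FROM Advise", [("Ellen", "Bob")])
print(cache.get("SELECT * FROM Advise"))   # fresh hit within 60 seconds
```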

Outline
- The data warehouse
- Data exchange
- Caching & partial materialization
- Operating on external data

Many "Integration-Like" Scenarios over Historical Data
- Many Web scenarios where we have large logs of data accesses, created by the server
- Goal: put these together and query them!
- Looks like a very simple data integration scenario: external data, but a single schema
- A common approach: use programming environments like MapReduce (or SQL layers above it) to query the data on a cluster
  - MapReduce reliably runs large jobs across hundreds or thousands of "shared-nothing" nodes in a cluster

MapReduce Basics
- MapReduce is essentially a template for writing distributed programs, corresponding to a single SQL SELECT..FROM..WHERE..GROUP BY..HAVING block with user-defined functions
- The MapReduce runtime calls a set of functions:
  - map is given a tuple and outputs 0 or more tuples in response
    - roughly like the WHERE clause
  - shuffle is a stage for doing sort-based grouping on a key (specified by the map)
  - reduce is an aggregate function called over the set of tuples with the same grouping key
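A toy single-process rendering of the map → shuffle → reduce template; the example job (counting successful page accesses per URL in a server log) is our invention, chosen to echo the log-analysis scenario above:

```python
# Single-process sketch of the map -> shuffle -> reduce template.
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Like a WHERE clause plus projection: emit 0+ (key, value) pairs.
    url, status = record
    if status == 200:
        yield (url, 1)

def reduce_fn(key, values):
    # Aggregate over all tuples sharing the grouping key.
    return (key, sum(values))

def mapreduce(records, map_fn, reduce_fn):
    pairs = [kv for rec in records for kv in map_fn(rec)]
    pairs.sort(key=itemgetter(0))            # the "shuffle": sort-based grouping
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))]

log = [("/a", 200), ("/b", 404), ("/a", 200)]
print(mapreduce(log, map_fn, reduce_fn))      # -> [('/a', 2)]
```

In a real deployment the map and reduce calls run on different workers and the shuffle moves data across the network; the dataflow shape is the same.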

MapReduce Dataflow "Template"
- Tuples → map "workers" → shuffle → reduce "workers"
[Figure: input tuples flow through parallel map workers, which emit tuples; the shuffle routes them by key to parallel reduce workers, which emit aggregate results]

MapReduce as ETL
- Some people use MapReduce to take data, transform it, and load it into a warehouse
  - … which is basically what ETL tools do!
- The dividing line between DBMSs, EII, and MapReduce is blurring as of the writing of this book:
  - SQL ↔ MapReduce
  - MapReduce over SQL engines
  - Shared-nothing DBMSs
  - NoSQL

Warehousing & Materialization Wrap-up
- There are benefits to centralizing & materializing data:
  - Performance, especially for analytics / mining
  - Archival
  - Standardization / canonicalization
- Data warehouses typically use procedural ETL tools to extract, transform, load (and clean) data
- Data exchange replaces ETL with declarative mappings (where feasible)
- Hybrid schemes exist for partial materialization
- Increasingly we are integrating via MapReduce and its cousins