Chapter 11 Information Integration

Spring 2001, Prof. Sang Ho Lee, School of Computing, Soongsil Univ.

How to integrate information, which is usually scattered physically?
- This is an unavoidable question for all of us
- Approaches
  - (Homogeneous) distributed DBMSs (1980s)
  - Federated databases, multidatabases, remote data access (1990s)
  - Data warehouses, mediators (late 1990s)

Why Information Integration is Difficult (1)
- Heterogeneous sources
- Example (Aardvark Automobile Co.)
  - 1000 dealers
  - Each dealer maintains a database of its cars in stock
  - Aardvark wants to create an integrated database
  - The 1000 dealers do not all use the same database schema
    Dealer 1: Cars(serialNo, model, color, autoTrans, cdPlayer, …)
    Dealer 2: Autos(serial, model, color)
              Options(serial, option)

Why Information Integration is Difficult (2)
Furthermore …
- Data type differences: serial numbers might be represented by character strings or by integers
- Value differences: the color black might be represented by an integer code, the string BLACK, or the code BL
- Semantic differences: one dealer distinguishes station wagons from minivans, while another doesn't
- Missing values: a source does not record information that all or most of the other sources provide
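The data type and value differences above are the kind of thing an integration layer has to normalize. A minimal sketch in Python, assuming a hypothetical mapping table: only the encodings BLACK and BL come from the slide, while the integer code 1 and the function names are made up for illustration.

```python
# Hypothetical color encodings. Only "BLACK" and "BL" come from the slide;
# the integer code 1 is an assumed example of "an integer code".
COLOR_CODES = {"BLACK": "black", "BL": "black", 1: "black"}

def normalize_serial(serial):
    """Represent every serial number as a string, whatever the source type."""
    return str(serial)

def normalize_color(value):
    """Map a source-specific color encoding to one canonical name.

    Returns None for a missing or unknown value rather than guessing.
    """
    if value is None:
        return None
    key = value.upper() if isinstance(value, str) else value
    return COLOR_CODES.get(key)
```

Semantic differences (station wagon vs. minivan) cannot be repaired by a lookup table like this one when a source never recorded the distinction.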

Modes of Information Integration
- Federated databases
  - The sources are independent, but one source can call on others to supply information
- Warehousing
  - Copies of data from several sources are stored in a single database, called a (data) warehouse
- Mediation
  - A mediator is a software component that supports a virtual database, which the user may query as if it were materialized
  - The mediator stores no data of its own

Federated Database Systems
- A federated database system is a federation of existing database systems (called local database systems, LDBSs) that gives applications a uniform means of access to data managed by more than one of these systems
- In theory, local databases should preserve local autonomy

Local Autonomy (1)
- Design autonomy
  - Ability of an LDBS to make its own design decisions on any matter, including data model, query language, constraints, system functions, semantic interpretation of data, …
- Execution autonomy
  - Ability of an LDBS to execute local operations without interference from external operations and to decide the order in which to schedule external operations

Local Autonomy (2)
- Communication autonomy
  - Ability of an LDBS to decide whether and when to communicate with other database systems
- Association autonomy
  - Ability of an LDBS to decide whether and how much to share its functionality and resources with others
  - For example, an LDBS may export only part of its database to external users, or even dissociate itself from the federation for its own reasons

Federated Database Example
- A federated collection of four local databases
[Figure: DB1, DB2, DB3, DB4]

Federated Database
- If each of n databases needs to talk to the n – 1 other databases, we must write n(n – 1) pieces of code to support queries between systems
- This approach is easy to implement in some circumstances

Query Translation Example
Dealer 1: NeededCars(model, color, autoTrans)
Dealer 2: Autos(serial, model, color), Options(serial, option)

    /* Dealer 1 queries Dealer 2 for the cars it needs */
    for (each tuple (:m, :c, :a) in NeededCars) {
        if (:a = TRUE) {  /* automatic transmission wanted */
            SELECT Autos.serial
            FROM Autos, Options
            WHERE Autos.serial = Options.serial AND
                  Options.option = 'autoTrans' AND
                  Autos.model = :m AND
                  Autos.color = :c;
        } else {  /* automatic transmission not wanted */
            SELECT serial
            FROM Autos
            WHERE Autos.model = :m AND
                  Autos.color = :c AND
                  NOT EXISTS (
                      SELECT * FROM Options
                      WHERE serial = Autos.serial AND
                            option = 'autoTrans'
                  );
        }
    }
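The translation loop can be tried out against a toy copy of Dealer 2's schema. This is only a sketch using Python's sqlite3 module: the function name `find_needed_cars` and the sample rows are assumptions, not part of the slides.

```python
import sqlite3

def find_needed_cars(conn, needed):
    """For each (model, color, auto_trans) request, query Dealer 2's schema."""
    with_auto = """
        SELECT Autos.serial FROM Autos, Options
        WHERE Autos.serial = Options.serial
          AND Options.option = 'autoTrans'
          AND Autos.model = ? AND Autos.color = ?"""
    without_auto = """
        SELECT Autos.serial FROM Autos
        WHERE Autos.model = ? AND Autos.color = ?
          AND NOT EXISTS (SELECT * FROM Options
                          WHERE Options.serial = Autos.serial
                            AND Options.option = 'autoTrans')"""
    results = []
    for model, color, auto_trans in needed:
        sql = with_auto if auto_trans else without_auto
        rows = conn.execute(sql, (model, color)).fetchall()
        results.append([row[0] for row in rows])
    return results

# Made-up sample data for Dealer 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Autos(serial INTEGER, model TEXT, color TEXT);
    CREATE TABLE Options(serial INTEGER, option TEXT);
    INSERT INTO Autos VALUES (1, 'Gobi', 'red'), (2, 'Gobi', 'red');
    INSERT INTO Options VALUES (1, 'autoTrans'), (2, 'cdPlayer');
""")
found = find_needed_cars(conn, [("Gobi", "red", True), ("Gobi", "red", False)])
```

With this data the first request matches serial 1 (it has the autoTrans option) and the second matches serial 2 (it does not).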

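Mediators
- A mediator supports a virtual view, or a collection of views
- It does not store any data of its own
[Figure: the user sends a query to the mediator and gets a result back; the mediator sends queries to two wrappers and receives their results; each wrapper queries its own source (Source 1, Source 2) and receives a result]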

Mediator Example (1)
- A view that is a single relation
    AutosMed(serialNo, model, color, autoTrans, dealer)
- A query to the mediator
    SELECT serialNo, model
    FROM AutosMed
    WHERE color = 'red'
- The mediator can forward the same query to each of the two wrappers
- The translation work can be done by the wrappers alone

Mediator Example (2)
- A suitable translation for Dealer 1
    Cars(serialNo, model, color, autoTrans, cdPlayer, …)
    SELECT serialNo, model
    FROM Cars
    WHERE color = 'red';
- A suitable translation for Dealer 2
    Autos(serial, model, color), Options(serial, option)
    SELECT serial, model
    FROM Autos
    WHERE color = 'red';
- The wrappers return to the mediator a set of serialNo-model pairs and a set of serial-model pairs, respectively
- The mediator can take the union of these sets and return the result to the user
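The mediator-and-wrappers flow can be sketched in a few lines of Python with sqlite3. Everything here is illustrative: the two in-memory databases stand in for the dealers' sources, and the names `wrapper1`, `wrapper2`, and `mediator_query` as well as the sample rows are assumptions.

```python
import sqlite3

def make_source(script):
    """Create an in-memory database standing in for one dealer's source."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(script)
    return conn

dealer1 = make_source("""
    CREATE TABLE Cars(serialNo INTEGER, model TEXT, color TEXT, autoTrans INTEGER);
    INSERT INTO Cars VALUES (101, 'Gobi', 'red', 1), (102, 'Gobi', 'blue', 0);
""")
dealer2 = make_source("""
    CREATE TABLE Autos(serial INTEGER, model TEXT, color TEXT);
    INSERT INTO Autos VALUES (201, 'Gobi', 'red'), (202, 'Gobi', 'green');
""")

def wrapper1(color):
    """Translate the mediator query into Dealer 1's schema."""
    sql = "SELECT serialNo, model FROM Cars WHERE color = ?"
    return dealer1.execute(sql, (color,)).fetchall()

def wrapper2(color):
    """Translate the mediator query into Dealer 2's schema."""
    sql = "SELECT serial, model FROM Autos WHERE color = ?"
    return dealer2.execute(sql, (color,)).fetchall()

def mediator_query(color):
    """Answer SELECT serialNo, model FROM AutosMed WHERE color = :color.

    The mediator stores nothing; it only unions the wrappers' answers.
    """
    result = set()
    for wrapper in (wrapper1, wrapper2):
        result |= set(wrapper(color))
    return sorted(result)

red_cars = mediator_query("red")
```

Note that the mediator itself never touches a table; all source-specific SQL lives in the wrappers.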

Wrappers in Mediator-Based Systems
- Sources could be DBMSs (in various models), file systems, Web servers, …
- A wrapper handles all connection and query-translation problems peculiar to its source
- Mediator systems require more complex wrappers than do most warehouse systems
- Techniques
  - Wrapper generators
  - Templates
  - Filters
  - Etc.

Templates for Query Patterns
- Templates are queries with parameters that represent constants
- Example
    SELECT *
    FROM AutosMed
    WHERE color = '$c'
  =>
    SELECT serialNo, model, color, autoTrans, 'dealer1'
    FROM Cars
    WHERE color = '$c';
- In general there would be 2^n templates if we have the option of specifying n attributes
- The number of templates could grow unreasonably large
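The 2^n growth is easy to see by enumerating one template per subset of constrainable attributes. The sketch below assumes three queryable attributes and uses `$model`, `$color`, `$autoTrans` as parameter names instead of the slide's `$c`; the function name is hypothetical.

```python
from itertools import combinations

ATTRIBUTES = ["model", "color", "autoTrans"]  # assumed queryable attributes

def all_templates(attributes):
    """Build one Dealer 1 template per subset of attributes a query may fix."""
    base = "SELECT serialNo, model, color, autoTrans, 'dealer1' FROM Cars"
    templates = {}
    for k in range(len(attributes) + 1):
        for subset in combinations(attributes, k):
            # Each chosen attribute gets a parameter named after it.
            where = " AND ".join(f"{a} = '${a}'" for a in subset)
            templates[frozenset(subset)] = base + (f" WHERE {where}" if where else "")
    return templates

templates = all_templates(ATTRIBUTES)
```

For n = 3 this yields 8 templates; each additional attribute doubles the count, which is why the slide warns that the number can grow unreasonably large.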

Wrapper Generators
- The wrapper generator is the software that creates the wrapper
- It produces a table that holds the various query patterns contained in the templates
[Figure: templates feed the wrapper generator, which produces the table; a driver uses the table to turn queries from the mediator into queries on the source and passes the results back]

Filters
- It is not always realistic to write a template for every possible form of query
- Another approach to supporting more queries is to have the wrapper filter the results of queries

Filters Example
- Suppose the only template we have is the one that finds cars given a color
- The mediator needs to find blue 'Gobi' model cars
    SELECT *
    FROM AutosMed
    WHERE color = 'blue' AND model = 'Gobi'
- A possible way to answer the query
  - Use the template (with $c = 'blue')
  - Store the result in a temporary relation TempAutos
  - Select from TempAutos the Gobis
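The three steps of the filter approach can be sketched with sqlite3. The sample rows, the model name 'Sahara', and the function names are made up for illustration; only the color template and the TempAutos relation come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cars(serialNo INTEGER, model TEXT, color TEXT, autoTrans INTEGER);
    CREATE TEMP TABLE TempAutos(serialNo, model, color, autoTrans, dealer);
    INSERT INTO Cars VALUES
        (1, 'Gobi', 'blue', 1), (2, 'Gobi', 'red', 0), (3, 'Sahara', 'blue', 1);
""")

def run_color_template(color):
    """Steps 1 and 2: run the only template and store its result in TempAutos."""
    conn.execute("DELETE FROM TempAutos")
    conn.execute(
        "INSERT INTO TempAutos "
        "SELECT serialNo, model, color, autoTrans, 'dealer1' "
        "FROM Cars WHERE color = ?",
        (color,),
    )

def find_by_color_and_model(color, model):
    """Step 3: filter the template result by the condition it could not express."""
    run_color_template(color)
    rows = conn.execute(
        "SELECT serialNo FROM TempAutos WHERE model = ?", (model,)
    ).fetchall()
    return [row[0] for row in rows]

blue_gobis = find_by_color_and_model("blue", "Gobi")
```

The cost of this approach is that the source ships every blue car to the wrapper, including the ones the filter then discards.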

Data Warehousing
- Growing industry since the mid-1990s
- Ranges from desktop to huge
- Lots of buzzwords, hype
  - Slice & dice, rollup, MOLAP, pivot, …

Information as a Competitive Weapon
Organizations have collected large amounts of data. Now it is time to use it to their advantage.

Can You Easily Answer These Questions?
- What are Personnel Services costs across all departments for all funding sources?
- What is the correlation between expenditures and collection of delinquent taxes?
- What are the effects of outsourcing specific services?
- What is the impact on revenues and expenditures of changing the operating hours of the Dept. of Motor Vehicles?
- What is the economic impact of the small business initiative in our district?

What is a Warehouse (1)
- Collection of diverse data
  - Subject oriented
  - Aimed at executives and decision makers
  - Often a copy of operational data
  - With value-added data (e.g., summaries, history)
  - Integrated
  - Time-varying
  - Non-volatile
- AND …

What is a Warehouse (2)
- Collection of tools
  - Gathering data
  - Cleansing, integrating
  - Querying, reporting, analysis
  - Data mining
  - Monitoring, administering the warehouse

Warehouse Architecture
[Figure: clients run query and analysis tools against the warehouse and its metadata; an integration component loads the warehouse from several sources]

Motivating Examples
- Forecasting
- Comparing performance of units
- Monitoring, detecting fraud
- Visualization

Why a Warehouse?
- Two approaches
  - Query-driven (lazy)
  - Warehouse (eager)

Query-driven approach
[Figure: clients query a mediator, which reaches each source through a wrapper]

Advantages of Query-driven
- No need to copy data
  - Less storage
  - No need to purchase data
- More up-to-date data
- Query needs can be unknown
- Only query interface needed at sources
- May be less draining on sources

Advantages of Warehousing
- High query performance
- Queries not visible outside warehouse
- Local processing at sources unaffected
- Can operate when sources are unavailable
- Can query data not stored in a DBMS
- Extra information at warehouse
  - Modify, summarize (store aggregates)
  - Add historical information

OLTP vs. OLAP
- OLTP (On-Line Transaction Processing)
  - Describes processing at operational sites
- OLAP (On-Line Analytical Processing)
  - Describes processing at the warehouse

OLTP vs. OLAP
- OLTP
  - Mostly updates
  - Many small transactions
  - MB-TB of data
  - Current snapshot
  - Raw data
  - Clerical users
  - Consistency, recoverability critical
- Warehouse (OLAP)
  - Mostly reads
  - Queries are long and complex
  - GB-TB of data
  - History
  - Summarized, consolidated data
  - Decision-makers, analysts as users

OLAP Example
- The schema for the warehouse
    Sales(serialNo, date, dealer, price)
    Autos(serialNo, model, color)
    Dealers(name, city, state, phone)
- A typical decision-support query
    SELECT state, AVG(price)
    FROM Sales, Dealers
    WHERE Sales.dealer = Dealers.name AND
          date >= ' '
    GROUP BY state;
- A common OLTP query
  - "Find the price at which the auto with serial number 123 was sold"
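To contrast the two query styles, here is a small sqlite3 sketch over the warehouse schema. The rows, the dealer 'Honest Hal', the city and phone values, and the cutoff date `CUTOFF` are all made up; the slide's own date constant is not recoverable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales(serialNo INTEGER, date TEXT, dealer TEXT, price REAL);
    CREATE TABLE Dealers(name TEXT, city TEXT, state TEXT, phone TEXT);
    INSERT INTO Dealers VALUES
        ('Friendly Fred', 'CityA', 'CA', '555-0100'),
        ('Honest Hal', 'CityB', 'NY', '555-0101');
    INSERT INTO Sales VALUES
        (123, '2001-02-01', 'Friendly Fred', 20000),
        (124, '2001-03-01', 'Friendly Fred', 30000),
        (125, '2001-03-05', 'Honest Hal', 18000),
        (126, '2000-12-31', 'Honest Hal', 99000);
""")

CUTOFF = "2001-01-01"  # assumed cutoff date, not taken from the slide

# Decision-support (OLAP) query: touches many rows and aggregates them.
avg_by_state = conn.execute(
    """SELECT state, AVG(price)
       FROM Sales, Dealers
       WHERE Sales.dealer = Dealers.name AND date >= ?
       GROUP BY state
       ORDER BY state""",
    (CUTOFF,),
).fetchall()

# OLTP query: touches one row.
price_of_123 = conn.execute(
    "SELECT price FROM Sales WHERE serialNo = 123"
).fetchone()[0]
```

ISO-formatted date strings compare correctly as text, which is why the `date >= ?` condition works here without a date type.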

Warehouse Models and Operations
- Data models
  - Relations
  - Stars and snowflakes
  - Cubes
- Operations
  - Slice and dice
  - Roll-up, drill-down
  - Pivoting
  - Other

Star Schemas
- Star schema = fact table + dimension tables
[Figure: star schema; label: Dependent attributes]

Example-1 (1)
[Figure: cube with axes date, dealer, car]
    Sales(serialNo, date, dealer, price)
    Autos(serialNo, model, color)
    Dealers(name, city, state, phone)
- Sales is a fact table
  - serialNo, date, dealer are dimensions
  - The one dependent attribute is price, which is what OLAP queries will typically request in an aggregation
- The Autos and Dealers relations are dimension tables
  - Attribute serialNo in the fact table is a foreign key referencing serialNo of dimension table Autos
- Joins between the fact table and the dimension tables are frequent

Example-1 (2)
- A time dimension table
    Days(day, week, month, year)
- Analysts frequently want to group by various time units
- It therefore helps to build a notion of time into the database, as if there were a time dimension table such as the one above

Example-2 (1)

Example-2 (2)

Slicing and Dicing
- Dicing
  - Partitions the data along a dimension through the "group by" clause
  - For example, in the time dimension, we might partition according to days, weeks, months, years, or not partition at all
  - Partitioning is also possible for cars and dealers
- Slicing
  - Through the "where" clause, a query focuses on particular partitions along one or more dimensions
[Figure: cube with axes dealer, date, car]

Example 1
- A query in which we ask for a slice in one dimension (the date) and dice in two other dimensions (car and dealer)
- The date is divided into four groups, …
[Figure: cube with axes car, date, dealer]

More Examples
    SELECT color, SUM(price)
    FROM Sales NATURAL JOIN Autos
    WHERE model = 'Gobi'
    GROUP BY color;
- This query dices by color and then slices by model
    SELECT dealer, month, SUM(price)
    FROM (Sales NATURAL JOIN Autos) JOIN Days ON date = day
    WHERE model = 'Gobi' AND color = 'red'
    GROUP BY month, dealer;
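Both queries can be run on a toy star schema with sqlite3. This is a sketch only: the rows, the model 'Sahara', and the dealer 'Honest Hal' are invented, and the second query groups by month and dealer so that the grouping matches its select list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales(serialNo INTEGER, date TEXT, dealer TEXT, price INTEGER);
    CREATE TABLE Autos(serialNo INTEGER, model TEXT, color TEXT);
    CREATE TABLE Days(day TEXT, week INTEGER, month INTEGER, year INTEGER);
    INSERT INTO Autos VALUES
        (1, 'Gobi', 'red'), (2, 'Gobi', 'blue'),
        (3, 'Gobi', 'red'), (4, 'Sahara', 'red');
    INSERT INTO Sales VALUES
        (1, '2001-05-21', 'Friendly Fred', 20000),
        (2, '2001-05-21', 'Friendly Fred', 21000),
        (3, '2001-06-02', 'Honest Hal', 22000),
        (4, '2001-06-02', 'Honest Hal', 30000);
    INSERT INTO Days VALUES
        ('2001-05-21', 21, 5, 2001), ('2001-06-02', 22, 6, 2001);
""")

# Slice on model (WHERE), dice by color (GROUP BY).
by_color = conn.execute("""
    SELECT color, SUM(price)
    FROM Sales NATURAL JOIN Autos
    WHERE model = 'Gobi'
    GROUP BY color
    ORDER BY color""").fetchall()

# Slice on model and color, dice by month and dealer via the time dimension.
by_month_dealer = conn.execute("""
    SELECT dealer, month, SUM(price)
    FROM Sales NATURAL JOIN Autos JOIN Days ON date = day
    WHERE model = 'Gobi' AND color = 'red'
    GROUP BY month, dealer
    ORDER BY month, dealer""").fetchall()
```

The Days table is what lets the second query dice by month even though Sales only records individual dates.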

How to support cube-structured data for OLAP
- ROLAP, or Relational OLAP
  - Data may be stored in relations with a specialized structure called a "star schema"
- MOLAP, or Multidimensional OLAP
  - A specialized structure, the "data cube", is used to hold the data

Data cubes
- An alternative to executing decision-support queries ad hoc is to pre-compute all possible aggregates in a systematic way
- The amount of extra storage needed is often tolerable
- We shall continue to call the points of the data cube the "fact table"

Cube Example
[Figure: the same data shown as a fact table and as a multi-dimensional cube; dimensions = 2]

3-D Cube Example
[Figure: fact table view and multi-dimensional cube with one layer per day (day 1, day 2); dimensions = 3]

The Cube Operator
- Given a fact table F, we can define an augmented table CUBE(F) that adds an additional value, denoted *, to each dimension
- The * represents aggregation along the dimension in which it appears
- A tuple of CUBE(F) that has * in one or more dimensions holds, for each dependent attribute, the aggregate over all tuples of F that match it on the remaining dimensions

The Cube Operator Example
- Sales(model, color, date, dealer, val, cnt)
  - "val" denotes the total price; "cnt" denotes the total number of automobiles
- Possible tuples
    ('Gobi', 'red', ' ', 'Friendly Fred', 45000, 2)
    ('Gobi', *, ' ', 'Friendly Fred', , 7)
    ('Gobi', *, ' ', *, , 100)
    ('Gobi', *, *, *, , 58000)
    (*, *, *, *, , )

Another Example
- Consider
    SELECT color, AVG(price)
    FROM Sales
    WHERE model = 'Gobi'
    GROUP BY color;
- The query is answered by looking for all tuples of CUBE(Sales) of the form
    ('Gobi', c, *, *, v, n)
  where c is any specific color
- The tuple asked for by the query is (c, v/n)
- The answer is the set of (c, v/n) pairs from all ('Gobi', c, *, *, v, n) tuples
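A brute-force sketch of the cube operator in Python makes the lookup concrete. It materializes every starred pattern for a handful of invented sales (dates abbreviated to 'd1' and 'd2', 'Sahara' and 'Honest Hal' made up) and then answers the AVG query from the cube alone. Real systems compute the cube far more cleverly; the function name `cube` is an assumption.

```python
from collections import defaultdict
from itertools import product

def cube(facts, n_dims):
    """Build CUBE(F) as a dict: dimension pattern (with '*') -> (val, cnt)."""
    agg = defaultdict(lambda: [0, 0])
    for row in facts:
        dims, price = row[:n_dims], row[n_dims]
        # Each fact contributes to every pattern obtained by replacing
        # any subset of its dimension values with '*'.
        for mask in product([False, True], repeat=n_dims):
            key = tuple("*" if starred else d for d, starred in zip(dims, mask))
            agg[key][0] += price
            agg[key][1] += 1
    return {key: tuple(total) for key, total in agg.items()}

# (model, color, date, dealer, price): made-up individual sales
facts = [
    ("Gobi", "red", "d1", "Friendly Fred", 20000),
    ("Gobi", "red", "d2", "Friendly Fred", 25000),
    ("Gobi", "blue", "d1", "Honest Hal", 21000),
    ("Sahara", "red", "d1", "Honest Hal", 30000),
]
sales_cube = cube(facts, 4)

# SELECT color, AVG(price) FROM Sales WHERE model = 'Gobi' GROUP BY color
# answered from tuples of the form ('Gobi', c, *, *, v, n) as (c, v/n).
avg_by_color = {
    color: val / cnt
    for (model, color, date, dealer), (val, cnt) in sales_cube.items()
    if model == "Gobi" and color != "*" and date == "*" and dealer == "*"
}
```

Storing both val and cnt is what allows an average to be recovered from the cube; an average alone could not be re-aggregated.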

Aggregates
- Add up amounts by day
- In SQL:
    SELECT date, SUM(amt)
    FROM SALE
    GROUP BY date

Rollup vs. Drill-down
- Add up amounts by day, product
- In SQL:
    SELECT date, prodId, SUM(amt)
    FROM SALE
    GROUP BY date, prodId
- Rollup goes from the (date, prodId) grouping to the coarser date grouping; drill-down goes the other way
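A small sqlite3 sketch shows that rolling up the finer grouping reproduces the coarser one. The SALE rows below are made-up amounts, and the Python-side re-aggregation is just one way to illustrate the relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SALE(prodId TEXT, date INTEGER, amt INTEGER);
    INSERT INTO SALE VALUES
        ('p1', 1, 12), ('p2', 1, 11), ('p1', 1, 50),
        ('p2', 1, 8),  ('p1', 2, 44), ('p1', 2, 4);
""")

# Coarse grouping: amounts by day.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM SALE GROUP BY date ORDER BY date"
).fetchall()

# Drill-down: amounts by day and product.
by_day_prod = conn.execute(
    "SELECT date, prodId, SUM(amt) FROM SALE "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Rollup: summing the finer result over products gives the coarser one back.
rolled_up = {}
for date, _prod, total in by_day_prod:
    rolled_up[date] = rolled_up.get(date, 0) + total
```

This works because SUM is distributive; a median, for instance, could not be rolled up from per-product medians.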

Aggregates
- Operators: sum, count, max, min, median, avg
- "Having" clause
- Using dimension hierarchy
  - Average by region (within store)
  - Maximum by month (within date)

Cube Aggregation
- Example: computing sums
[Figure: sums computed from a cube with layers day 1 and day 2, ending in the total 129; rollup and drill-down arrows connect the levels]

Cube Operators
[Figure: aggregates read off the cube: sale(c1,*,*), sale(c2,p2,*), sale(*,*,*); total 129]

Extended Cube
[Figure: the cube for day 1 and day 2 extended with a * entry in each dimension; example cell sale(*,p2,*)]

The Lattice of Views
- It helps to think of a lattice of possible groupings for each dimension of the cube
- A path from some node P2 down to P1 means that P1 <= P2
[Figure: two lattices. Time: All, Years, Quarters, Months, Weeks, Days. Dealers: All, State, City, Dealer]

Aggregation Using Hierarchies
[Figure: cube for day 1 and day 2 aggregated along the hierarchy customer, region, country (customer c1 in Region A; customers c2, c3 in Region B)]

Data Mining
- Knowledge discovery
  - To find surprising facts in existing databases
- Techniques from DBMS, machine learning, statistics, …

Decision Tree
- Each interior node has an attribute and a value that serves as a threshold
- The children of a node are either other interior nodes or a decision: accept or reject
- A given tuple is passed down the tree, going left or right at each step according to the value the tuple has, until a decision node is reached
- The tree is constructed from a training set of tuples whose outcomes are known
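Passing a tuple down such a tree can be sketched as follows. The attributes `age` and `income`, the thresholds, and the convention that "left" means "below the threshold" are all assumptions for illustration; the sketch covers classification only, not construction of the tree from a training set.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Node:
    """Interior node: an attribute and a threshold value."""
    attribute: str
    threshold: float
    left: Union["Node", str]   # taken when the tuple's value < threshold
    right: Union["Node", str]  # taken when the tuple's value >= threshold

def classify(tree, tup):
    """Walk from the root to a decision ('accept' or 'reject')."""
    node = tree
    while isinstance(node, Node):
        if tup[node.attribute] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node

# Hypothetical tree over made-up attributes.
tree = Node(
    "age", 30,
    left=Node("income", 50000, left="reject", right="accept"),
    right="accept",
)
```

For example, `classify(tree, {"age": 25, "income": 40000})` goes left at both nodes and reaches the decision `"reject"`.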

Example (weather vs. tennis play)
[Figure: decision tree rooted at outlook (sunny, overcast, rainy), with interior nodes humidity (high, normal) and windy (false, true) and leaves yes/no]

Clustering
- To group data items into a small number of groups such that the members of each group have something substantial in common
- Example: clustering of Web pages in Web search engines

Association-Rule Mining Example
- Market-basket data
  - A customer approaches the checkout with a "market basket" full of the items he or she has selected
  - The cash register records all of these items as part of a single transaction
- Claim: people who buy diapers are unusually likely also to buy beer
- Schema: Baskets(basket, item)

Data-Mining Applications: Association-Rule Mining
- Naive way to find all high-support pairs of items
    SELECT I.item, J.item, COUNT(I.basket)
    FROM Baskets I, Baskets J
    WHERE I.basket = J.basket AND
          I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= s;

The A-Priori Algorithm
- Basic observation
  - If a set of items X has support s, then each subset of X must also have support at least s
  - If a pair of items, say {i, j}, appears in, say, 1000 baskets, then we know there are at least 1000 baskets with item i and at least 1000 baskets with item j
- Strategy
  - First find the set of "OK" items: those that appear in a sufficient number of baskets by themselves
  - Then run the pair query on only the items in the OK set

The A-Priori Algorithm
    INSERT INTO OkBaskets
        SELECT *
        FROM Baskets
        WHERE item IN (
            SELECT item
            FROM Baskets
            GROUP BY item
            HAVING COUNT(*) >= s
        );

    SELECT I.item, J.item, COUNT(I.basket)
    FROM OkBaskets I, OkBaskets J
    WHERE I.basket = J.basket AND
          I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(*) >= s;
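The two-step A-Priori queries can be checked against the naive pair query on a tiny basket table. This sqlite3 sketch uses invented baskets and a support threshold of 2; the function names are assumptions.

```python
import sqlite3

PAIRS = """
    SELECT I.item, J.item, COUNT(I.basket)
    FROM {table} I, {table} J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(*) >= ?
    ORDER BY I.item, J.item"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Baskets(basket INTEGER, item TEXT);
    CREATE TABLE OkBaskets(basket INTEGER, item TEXT);
    INSERT INTO Baskets VALUES
        (1, 'beer'), (1, 'diapers'), (1, 'milk'),
        (2, 'beer'), (2, 'diapers'),
        (3, 'beer'), (3, 'milk'),
        (4, 'diapers'), (4, 'bread');
""")

def naive_pairs(s):
    """Count every pair of items directly on Baskets."""
    return conn.execute(PAIRS.format(table="Baskets"), (s,)).fetchall()

def apriori_pairs(s):
    """Keep only items with support >= s, then count pairs among them."""
    conn.execute("DELETE FROM OkBaskets")
    conn.execute(
        """INSERT INTO OkBaskets
           SELECT * FROM Baskets
           WHERE item IN (SELECT item FROM Baskets
                          GROUP BY item HAVING COUNT(*) >= ?)""",
        (s,),
    )
    return conn.execute(PAIRS.format(table="OkBaskets"), (s,)).fetchall()

frequent = apriori_pairs(2)
ok_rows = conn.execute("SELECT COUNT(*) FROM OkBaskets").fetchone()[0]
```

Both functions return the same pairs; the difference is that the second join runs over a smaller relation, because 'bread' (support 1) was filtered out first.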

How Good the A-Priori Algorithm is
- Assumptions (Example 11.20)
  - 10,000 different items; the average market basket has 20 items in it
  - 1,000,000 baskets, so the Baskets relation has 20,000,000 tuples
- The naive algorithm
  - The join has 190,000,000 pairs
  - The 190,000,000 tuples must all be grouped and counted
- The A-Priori algorithm
  - Suppose that s is 10,000, i.e., 1% of the baskets
  - No more than 2000 (= 20,000,000 / 10,000) items can appear in at least 10,000 baskets
  - In practice the subquery produces many fewer than 2000 items
  - Assume that a basket in OkBaskets has 10 items on average
  - The join is then less than 1/4 the size of the join on Baskets, so the running time drops to roughly 1/4 of the naive algorithm's
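The figures on this slide follow from counting unordered pairs per basket. A short Python check of the arithmetic, using only the numbers stated above (the variable names are mine):

```python
from math import comb

baskets = 1_000_000
items_per_basket = 20
s = 10_000  # support threshold: 1% of the baskets

# Size of the Baskets relation: one tuple per (basket, item).
baskets_tuples = baskets * items_per_basket

# Naive join: every unordered pair of items within each basket.
naive_pairs = baskets * comb(items_per_basket, 2)

# Upper bound on items with support >= s: each uses up at least s tuples.
max_ok_items = baskets_tuples // s

# Slide's assumption: 10 items per basket survive in OkBaskets.
ok_items_per_basket = 10
apriori_pairs = baskets * comb(ok_items_per_basket, 2)

ratio = apriori_pairs / naive_pairs
```

Each basket contributes C(20, 2) = 190 pairs to the naive join but only C(10, 2) = 45 pairs after filtering, so the ratio is 45/190, a little under 1/4.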
