Presentation is loading. Please wait.

Presentation is loading. Please wait.

Chapter 111 Chapter 11 Information Integration Spring 2001 Prof. Sang Ho Lee School of Computing, Soongsil Univ.

Similar presentations


Presentation on theme: "Chapter 111 Chapter 11 Information Integration Spring 2001 Prof. Sang Ho Lee School of Computing, Soongsil Univ."— Presentation transcript:

1 Chapter 111 Chapter 11 Information Integration Spring 2001 Prof. Sang Ho Lee School of Computing, Soongsil Univ. shlee@computing.soongsil.ac.kr

2 Chapter 112 How to integrate information, which is usually scattered physically This is an unavoidable question to all of us. Approaches –(homogenous) Distributed DBMS (80’s) –Federated databases, Multidatabases, remote data access (90’s) –Data warehouse, mediator (late 90’s)

3 Chapter 113 Why Information Integration is Difficult (1) Heterogeneous sources Examples (Aardvark Automobile Co.) –1000 dealers –Each dealer maintains a database of their cars in stock –Aardvark wants to create an integrated database –1000 dealers do not all use the same database schema –Dealer 1: Cars(serialNo, model, color, autoTrans, cdPlayer, …) –Dealer 2 Autos(serial, model, color) Options(serial, option)

4 Chapter 114 Why Information Integration is Difficult (2) Furthermore … –Data type differences: Serial numbers might be represented by character strings or integers –Value differences: The color black might be represented by an integer code, the string BLACK, or the code BL –Semantic differences: One dealer distinguish station wagon from minivans, while another doesn’t –Missing values: A source does not record information that all or most of the other sources provide

5 Chapter 115 Modes of Information Integration Federated databases –The sources are independent, but one source can call on others to supply information Warehousing –Copies of data from several sources are stored in a single database, called a (data) warehouse Mediation –A mediator is a software component that supports a virtual database, which the user may query as if it were materialized –The mediator stores no data of its own

6 Chapter 116 Federated Database Systems A federated database system is a federation of existing databases systems (called local database systems, LDBS) and provides applications with a uniform means of access to data that are managed by more than one of these database systems In theory, local databases should preserve local autonomy

7 Chapter 117 Local Autonomy (1) Design autonomy –Ability of an LDBS to choose its own design decisions wrt any matter, including data model, query language, constraints, system functions, semantic interpretation of data, … Execution autonomy –Ability of an LDBS to execute local operations without interference from external operations and to decide the order in which to schedule external operations

8 Chapter 118 Local Autonomy (2) Communication autonomy –Ability of an LDBS to decide whether and when to communicate with other database systems Association autonomy –Ability of an LDBS to decide whether and how much to share its functionality and resources with others. For example, an LDBS may export only part of its database to external users or even disassociate itself from an LDBS for some reasons.

9 Chapter 119 Federated Database Example A federated collection of four local databases DB1DB2

10 Chapter 1110 Federated Database If n databases each need to talk to the n – 1 other databases, then we should write n(n – 1) pieces of code to support queries between systems This approach is easy to implement in some circumstances !!!

11 Chapter 1111 Query Translation Example Dealer 1: NeededCars(model, color, autoTrans) Dealer 2: Autos(serial, model, color), Options(serial, option) /* Dealer 1 queries Dealer 2 for needed car For (each tuple (:m, :c, :a) in NeededCars) { if ( :a = TRUE) { /* automatic transmission wanted */ SELECT serial FROM Autos, Options WHERE Autos.serial = Options.serial AND Options.option = ‘autoTrans’ AND Autos.model = :m AND Autos.color = :c; } else { /* automatic transmission not wanted */ SELECT serial FROM Autos WHERE Autos.model = :m AND Autos.color = :c AND NOT EXISTS ( SELECT * FROM Options WHERE serial = Autos.serial AND option = ‘autoTrans’ ); } }

12 Chapter 1112 Mediators A mediator supports a virtual view or collection of view Don’t store any data of its own Source 1Source 2 Wrapper Mediator query result query result query result

13 Chapter 1113 Mediator Example (1) –A view that is a single relation AutosMed(serialNo, model, color, autoTrans, dealer) –A query to the mediator SELECT serialNo, model FROM AutoMed WHERE color = ‘red’ –The mediator can forward the same query to each of the two wrappers –The translation work can be done by the wrappers alone

14 Chapter 1114 Mediator Example (2) –A suitable translation for Dealer 1 Cars(serialNo, model, color, autoTrans, cdPlayer, …) SELECT serialNo, model FROM Cars WHERE color = ‘red’; –A suitable translation for Dealer 2 Autos(serial, model, color), Options(serial, option) SELECT serial, model FROM Autos WHERE color = ‘red’; –Each wrapper returns to the mediator a serialNo-model pairs and serial-model pairs, respectively –The mediator can take the union of these sets and return the result to the user

15 Chapter 1115 Wrappers in Mediator-Based Systems Sources could be DBMSs (in various models), file systems, Web servers, … Handles all connection/query-translation problems peculiar to sources Mediator systems require more complex wrappers than do most warehouse systems Techniques –Wrapper generator –Template-based –Filter techniques –Etc.

16 Chapter 1116 Templates for Query Patterns Templates are queries with parameters that represent constants –Example SELECT *SELECT serialNo, model, color FROM AutosMed =>autoTrans, ‘dealder1’ WHERE color = ‘$c’ FROM Cars WHERE color = ‘$c’; In general there would be 2 n templates if we have the option of specifying n attributes The number of templates could grow unreasonably large

17 Chapter 1117 Wrapper Generators Wrapper generator –The software that creates the wrapper –A table that holds the various query patterns contained in the templates Wrapper generator Templates Driver Source Table Queries from mediator Results Queries

18 Chapter 1118 Filters It is not always realistic to write a template for every possible from of query Another approach to supporting more queries is to have the wrapper filter the results of queries

19 Chapter 1119 Filters Example –Suppose the only template we have is the one that finds cars given a color –The mediator needs to find blue ‘Gobi’ model cars SELECT * FROM autosMed WHERE color = ‘blue’ and model = ‘Gobi’ –A possible way to answer the query Use the template (with $c = ‘blue’) Store the result in a temporary relation Select from TempAutos the Gobi’s

20 Chapter 1120 Data Warehousing Growing industry since mid 90’s Ranges from desktop to huge Lots of buzzwords, hype –Slice & dice, rollup, MOLAP, pivot, …

21 Chapter 1121 Information as a Competitive Weapon Organizations have collected large amounts of data. Now it is time to use it to their advantage.

22 Can You Easily Answer These Questions? What are Personnel Services costs across all departments for all funding sources? What are the effects of outsourcing specific services? What is the correlation between expenditures and collection of delinquent taxes? What is the impact on revenues and expenditures of changing the operating hours of the Dept. of Motor Vehicles? What is the economic impact of the small business initiative in our district?

23 Chapter 1123 What is a Warehouse (1) Collection of diverse data –Subject oriented –Aimed at executive, decision maker –Often a copy of operational data –With value-added data (e.g., summaries, history) –Integrated –Time-varying –Non-volatile AND …

24 Chapter 1124 What is a Warehouse (2) Collection of tools –Gathering data –Cleansing, integrating –Querying, reporting, analysis –Data mining –Monitoring, administering warehouse

25 Chapter 1125 Warehouse Architecture Client Warehouse Source Query & Analysis Integration Metadata

26 Chapter 1126 Motivating Examples Forecasting Comparing performance of units Monitoring, detecting fraud Visualization

27 Chapter 1127 Why a Warehouse Two approaches: –Query-driven (lazy) –Warehouse (eager) Source ?

28 Chapter 1128 Query-driven approach Client Wrapper Mediator Source

29 Chapter 1129 Advantages of Query-driven No need to copy data –Less storage –No need to purchase data More up-to-date data Query needs can be unknown Only query interface needed at sources May be less draining on sources

30 Chapter 1130 Advantages of Warehousing High query performance Queries not visible outside warehouse Local processing at sources unaffected Can operate when sources unavailable Can query data not stored in a DBMS Extra information at warehouse –Modify, summarizes (store aggregates) –Add historical information

31 Chapter 1131 OLTP vs. OLAP OLTP (On-Line Transaction Processing) –Describes processing at operational sites OLAP (On-Line Analytical Processing) –Describes processing at warehouse

32 Chapter 1132 OLTP vs. OLAP OLTP –Mostly updates –Many small transactions –Mb-Tb of data –Current snapshot –Raw data –Clerical users –Consistency, recoverability critical Warehouse –Mostly reads –Queries are long and complex –Gb-Tb of data –History –Summarized, consolidated data –Decision-makers, analysts as users

33 Chapter 1133 OLAP Example The schema for the warehouse –Sales(serialNo, date, dealer, price) –Autos(serialNo, model, color) –Dealers(name, city,state,phone) A typical decision-support query –SELECT state, AVG(price) FROM Sales, Dealers WHERE Sales.dealer = Dealers.name AND date >= ‘1999- 01-04’ GROUP BY state; Common OLTP query –“Find the price at which the auto with serial number 123 was sold”

34 Chapter 1134 Warehouse Models and Operations Data models –Relations –Stars and snowflakes –Cubes Operations –Slice and dice –Roll-up, drill-down –Pivoting –other

35 Chapter 1135 Star Schemas Star schema = fact table + dimension tables Fact table Dimension table Dependent attributes

36 Chapter 1136 Example-1 (1) Sales(serialNo, date, dealer, price) Autos(serialNo, model, color) Dealers(name, city, state, phone) Sales is a fact table –serialNo, date, dealer are dimensions –The one dependent attribute is price, which is what OLAP queries will typically request in an aggregation Autos relation and Dealer relation are dimension tables –Attribute serialNo in the fact table is a foreign key, referencing serialNo of dimension table Autos Join between fact table and dimension tables, is frequently done date dealer car

37 Chapter 1137 Example-1 (2) A time dimension table Days (day, week, month, year) –Since grouping by various time units is frequently desired by analysts –It helps to build into the database a notion of time, as if there were a time dimension table such as above

38 Chapter 1138 Example-2 (1)

39 Chapter 1139 Example-2 (2)

40 Chapter 1140 Slicing and Dicing Dicing –For example, in the time dimension, we might partition (“group by” clause) according to days, weeks, months, years, or not partition at all –Partitioning is also possible for cars and dealers Slicing –Through the “where” clause, a query focuses on partitions along one or more dimensions dealer date car

41 Chapter 1141 Example 1 A query in which we ask for a slice in one dimension (the date), and dice in two other dimensions (car and dealer) The date is divided into four groups, … car date dealer

42 Chapter 1142 More Examples SELECT color, SUM(price) FROM Sales NATURAL JOIN Autos WHERE model = ‘Gobi’ GROUP BY color; –This query dices by color and then slices by model SELECT dealer, month, SUM(price) FROM (Sales NATURAL JOIN Autos) JOIN Days on date = day WHERE model = ‘Gobi’ and color = ‘red’ GROUP BY color;

43 Chapter 1143 How to support cube-structured data for OLAP ROLAP, or Relational OLAP –Data may be stored in relations with a specialize structure called a “star schema” MOLAP, or Multidimensional OLAP –A specialized structure, the “data cube”, is used to hold the data


Download ppt "Chapter 111 Chapter 11 Information Integration Spring 2001 Prof. Sang Ho Lee School of Computing, Soongsil Univ."

Similar presentations


Ads by Google