1 Advanced Database Systems: DBS CB, 2 nd Edition Data Warehouse, OLAP, Data Mining Ch. 21.1-2, Ch. 22.

1 Advanced Database Systems: DBS CB, 2 nd Edition Data Warehouse, OLAP, Data Mining Ch. 21.1-2, Ch. 22

2 Outline What is a data warehouse? Why a data warehouse? Data Warehouse Models & operations Query & Analysis Tools Implementing a data warehouse Future directions

333 What is a Data Warehouse

4 Collection of diverse data  subject oriented  aimed at executive, decision maker  often a copy of operational data  with value-added data (e.g., summaries, history)  integrated  time-varying  non-volatile What is a Data Warehouse more Collection of tools gathering data cleansing, integrating,... querying, reporting, analysis data mining monitoring, administering warehouse

5 Data Warehouse Architecture What is a Data Warehouse Client Warehouse Source Query & Analysis Integration Metadata

6 Motivation Examples: Forecasting Comparing performance of units Monitoring, detecting fraud Visualization What is a Data Warehouse

777 Why a Data Warehouse

8 Two Approaches to Extract Information in the Enterprise: Query-Driven (Lazy) Warehouse (Eager) Why a Data Warehouse

9 Query Driven Approach (Federation): Why a Data Warehouse Client Wrapper Mediator Source

10 Warehouse Advantages Queries not visible outside warehouse Local processing at sources unaffected Can operate when sources unavailable Can query data not stored in a DBMS Extra information at warehouse  Modify, summarize (store aggregates)  Add historical information Why a Data Warehouse

11 Query-driven Advantages No need to copy data  less storage  no need to purchase data More up-to-date data – real-time Query needs can be unknown Only query interface needed at sources May be less draining on sources Why a Data Warehouse

12 OLTP vs. OLAP OLTP: On Line Transaction Processing  Describes processing at operational sites OLAP: On Line Analytical Processing  Describes processing at warehouse What is a Data Warehouse

13 OLTP vs. OLAP What is a Data Warehouse Mostly updates Many small transactions Mb-Tb of data Raw data Clerical users Up-to-date data Consistency, recoverability critical Mostly reads Queries long, complex GB-TB of data Summarized, consolidated data Decision-makers, analysts as users OLTP OLAP

14 Data Marts Smaller warehouses Spans part of organization  e.g., marketing (customers, products, sales) Do not require enterprise-wide consensus  but long term integration problems? What is a Data Warehouse

15 Data Warehouse Models & Operations

16 Data Models  relations  stars & snowflakes  cubes Operators  slice & dice  roll-up, drill down  pivoting  other Data Warehouse Models & Operations

17 Star Schema Data Warehouse Models & Operations

Star Schema 18 Data Warehouse Models & Operations Terms: Fact table: PK is the aggregate of the dimensions’ PK - 2 nd NF Dimension tables: “by” condition - one view for how the data will be analyzed – 3 rd NF Measures: aggregates – sales/day, sales/quarter, etc. Summary Tables: pre- summarized data to meet 90% of user queries. Use triggers/stored procedures to maintain summary tables current on insert/delete product prodId name price customer custId name address city store storeId city sale orderId date custId prodId storeId qty amt

19 Star Schema - Dimension Hierarchies Data Warehouse Models & Operations store sType cityregion

20 S nowflakes schema, a dimension table have the hierarchy broken into set of tables – more complicated and slower queries Data Warehouse Models & Operations time_key day day_of_the_week month quarter year time location_key street city_key location Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_key item branch_key branch_name branch_type branch supplier_key supplier_type supplier city_key city state_or_province country city

21 Fact Constellations schema, multiple fact tables that share the set of dimensional tables; viewed as a collection of stars Data Warehouse Models & Operations time_key day day_of_the_week month quarter year time location_key street city_key location Sales Fact Table t item_key branch_key location_key units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_key item branch_key branch_name branch_type branch supplier_key supplier_type supplier city_key city state_or_province country city time_key

22 Data Warehouse Models & Operations product prodId name price customer custId name address city store storeId city sale orderId date custId prodId storeId qty amt Fact table view: Multi-dimensional cube: dimensions = 2 Star Schema

23 3D Cube Data Warehouse Models & Operations day 2 day 1 dimensions = 3 Multi-dimensional cube:Fact table view:

24 ROLAP vs. MOLAP  ROLAP: Relational On-Line Analytical Processing  MOLAP: Multi-Dimensional On-Line Analytical Processing Aggregates  Add up amounts for day 1  In SQL: SELECT sum(amt) FROM SALE WHERE date = 1; Data Warehouse Models & Operations 81

25 Aggregates (Contd.)  Add up amounts by day  In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date; Data Warehouse Models & Operations

26 Aggregates (Contd.)  Add up amounts by day, product  In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId; Data Warehouse Models & Operations drill-down rollup

27 Aggregates (Contd.)  Operators: sum, count, max, min, median, ave  “Having” clause  Using dimension hierarchy average by region (within store) maximum by month (within date) Data Warehouse Models & Operations

28 Cube Aggregations Data Warehouse Models & Operations day 2 day 1 129... Example: computing sums drill-down rollup

29 Cube Operators Data Warehouse Models & Operations day 2 day 1 129... sale(c1,*,*) sale(*,*,*) sale(c2,p2,*)

30 Extended Cube Data Warehouse Models & Operations day 2 day 1 * sale(*,p2,*)

31 Aggregation Using Hierarchies Data Warehouse Models & Operations day 2 day 1 customer region country (customer c1 in Region A; customers c2, c3 in Region B)

32 Pivoting Data Warehouse Models & Operations day 2 day 1 Multi-dimensional cube: Fact table view:

33 Query & Analysis Tools

34 Query Building Report Writers (comparisons, growth, graphs,…) Spreadsheet Systems Web Interfaces Data Mining Query & Analysis Tools

35 Other Operations Time functions  e.g., time average Computed Attributes  e.g., commission = sales * rate Text Queries  e.g., find documents with words X AND B  e.g., rank documents by frequency of words X, Y, Z Query & Analysis Tools

36 Data Mining  Decision Trees  Clustering  Association Rules (i) Decision Tree:  Conducted survey to see what customers were interested in new model car  Want to select customers for advertising campaign Query & Analysis Tools training set

37 One possibility: Query & Analysis Tools age<30 city=sfcar=van likely unlikely YY Y N N N

38 Another Possibility: Query & Analysis Tools car=taurus city=sfage<45 likely unlikely YY Y N N N

39 Decision Tree Issues: Decision tree cannot be “too deep” would not have statistically significant amounts of data for lower decisions Need to select tree that most reliably predicts outcomes Query & Analysis Tools

40 (ii) Clustering: Query & Analysis Tools age income education

41 Another Example: Text Each document is a vector  e.g., contains words 1,4,5,... Clusters contain “similar” documents Useful for understanding, searching documents Query & Analysis Tools international news sports business

42 Clustering Issues: Given desired number of clusters? Finding “best” clusters Are clusters semantically meaningful?  e.g., “yuppies’’ cluster? Using clusters for disk storage Query & Analysis Tools

43 (iii) Association Rules Mining: Query & Analysis Tools customer id products bought sales records: Trend: Products p5, p8 often bought together Trend: Customer 12 likes product p9 market-basket data

44 Association Rule Rule: {p 1, p 3, p 8 } Support: number of baskets where these products appear High-support set: support  threshold s Problem: find all high support sets Query & Analysis Tools

45 Finding High-Support Pairs Baskets(basket, item) SELECT I.item, J.item, COUNT(I.basket) FROM Baskets I, Baskets J WHERE I.basket = J.basket AND I.item = s; Query & Analysis Tools WHY?

46 Example: Query & Analysis Tools check if count  s

47 Association Rule Issues: Performance for size 2 rules Performance for size k rules Query & Analysis Tools big even bigger!

48 Implementing a Data Warehouse

49 Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing, indexing,... Managing: Metadata, Design,... Implementing a Data Warehouse

50 Monitoring: Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … Incremental vs. Refresh Implementing a Data Warehouse new

51 Monitoring Techniques: Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Screen scraping Application level monitoring Implementing a Data Warehouse  Advantages & Disadvantages!!

52 Monitoring Issues: Frequency  periodic: daily, weekly, …  triggered: on “big” change, lots of changes,... Data transformation  convert data to uniform format  remove & add fields (e.g., add date to get history) Standards (e.g., ODBC) Gateways Implementing a Data Warehouse

53 Integration Data Cleaning Data Loading Derived Data Implementing a Data Warehouse Client Warehouse Source Query & Analysis Integration Metadata

54 Data Cleaning: Migration (e.g., yen  dollars) Scrubbing: use domain-specific knowledge (e.g., social security numbers) Fusion (e.g., mail list, customer merging) Auditing: discover rules & relationships (like data mining) Implementing a Data Warehouse billing DB service DB customer1(Joe) customer2(Joe) merged_customer(Joe)

55 Loading Data: Incremental vs. refresh Off-line vs. on-line Frequency of loading  At night, 1x a week/month, continuously Parallel/Partitioned load Implementing a Data Warehouse

56 Derived Data: Derived Warehouse Data  indexes  aggregates  materialized views (next slide) When to update derived data? Incremental vs. refresh Implementing a Data Warehouse

57 Materialized Views: Define new warehouse relations using SQL expressions Implementing a Data Warehouse does not exist at any source

Processing ROLAP servers vs. MOLAP servers Index Structures What to Materialize? Algorithms 58 Implementing a Data Warehouse Client Warehouse Source Query & Analysis Integration Metadata

ROLAP Server: Relational OLAP Server 59 Implementing a Data Warehouse relational DBMS ROLAP server tools utilities Special indices, tuning; Schema is “denormalized”

MOLAP Server: Multi-Dimensional OLAP Server 60 Implementing a Data Warehouse multi- dimensional server M.D. tools utilities could also sit on relational DBMS Product City Date 1 2 3 4 milk soda eggs soap A B Sales

Index Structures Traditional Access Methods  B-trees, hash tables, R-trees, grids, … Popular in Warehouses  inverted lists  bit map indexes  join indexes  text indexes 61 Implementing a Data Warehouse

Inverted Lists 62 Implementing a Data Warehouse... age index inverted lists data records

Using Inverted Lists: Query:  Get people with age = 20 and name = “fred” List for age = 20: r4, r18, r34, r35 List for name = “fred”: r18, r52 Answer is intersection of “list for name” & “list for age”: r18 63 Implementing a Data Warehouse

Bit Map Index: 64 Implementing a Data Warehouse... age index bit maps data records

Using Bit Map Index: Query:  Get people with age = 20 and name = “fred” List for age = 20: 1101100000 List for name = “fred”: 0100000001 Answer is intersection: 010000000000 Good if domain cardinality small Bit vectors can be compressed 65 Implementing a Data Warehouse

Join “Combine” SALE, PRODUCT relations In SQL: SELECT * FROM SALE, PRODUCT; 66 Implementing a Data Warehouse

Join Index 67 Implementing a Data Warehouse join index

What to Materialize Store in warehouse results useful for common queries Example: 68 Implementing a Data Warehouse day 2 day 1 129... total sales materialize

Materialization Factors: Type/frequency of queries Query response time Storage cost Update cost 69 Implementing a Data Warehouse

Cube Aggregates Lattice: 70 Implementing a Data Warehouse city, product, date city, productcity, dateproduct, date cityproductdate all day 2 day 1 129 use greedy algorithm to decide what to materialize

Dimension Hierarchies: 71 Implementing a Data Warehouse all state city

Dimension Hierarchies: 72 Implementing a Data Warehouse city, product city, product, date city, date product, date city product date all state, product, date state, date state, product state not all arcs shown...

Interesting Hierarchy: 73 Implementing a Data Warehouse all years quarters months days weeks conceptual dimension table

Algorithms: Query Optimization Parallel Processing Data Mining 74 Implementing a Data Warehouse

Managing: Metadata Warehouse Design Tools 75 Implementing a Data Warehouse Client Warehouse Source Query & Analysis Integration Metadata

Metadata: Administrative  definition of sources, tools,...  schemas, dimension hierarchies, …  rules for extraction, cleaning, …  refresh, purging policies  user profiles, access control,... 76 Implementing a Data Warehouse

Metadata (Contd.): Business  business terms & definition  data ownership, charging Operational  data lineage  data currency (e.g., active, archived, purged)  use stats, error reports, audit trails 77 Implementing a Data Warehouse

Design: What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index? 78 Implementing a Data Warehouse

Tools: Development  design & edit: schemas, views, scripts, rules, queries, reports Planning & Analysis  what-if scenarios (schema changes, refresh rates), capacity planning Warehouse Management  performance monitoring, usage patterns, exception reporting System & Network Management  measure traffic (sources, warehouse, clients) Workflow Management  “reliable scripts” for cleaning & analyzing data 79 Implementing a Data Warehouse

Current State of Industry: Extraction and integration done off-line  Usually in large, time-consuming, batches Everything copied at warehouse  Not selective about what is stored  Query benefit vs. storage & update cost Query optimization aimed at OLTP  High throughput instead of fast response  Process whole query before displaying anything 80 Implementing a Data Warehouse

81 Future Directions

Better performance Larger warehouses Easier to use What are companies & research labs working on? 82 Future Directions

Research: Incremental Maintenance Data Consistency Data Expiration Recovery Data Quality Error Handling (Back Flush) 83 Future Directions

Research: Rapid Monitor Construction Temporal Warehouses Materialization & Index Selection Data Fusion Data Mining Integration of Text & Relational Data 84 Future Directions

85 END

1 Advanced Database Systems: DBS CB, 2 nd Edition Data Warehouse, OLAP, Data Mining Ch. 21.1-2, Ch. 22.

Similar presentations

Presentation on theme: "1 Advanced Database Systems: DBS CB, 2 nd Edition Data Warehouse, OLAP, Data Mining Ch. 21.1-2, Ch. 22."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Advanced Database Systems: DBS CB, 2 nd Edition Data Warehouse, OLAP, Data Mining Ch. 21.1-2, Ch. 22.

Similar presentations

Presentation on theme: "1 Advanced Database Systems: DBS CB, 2 nd Edition Data Warehouse, OLAP, Data Mining Ch. 21.1-2, Ch. 22."— Presentation transcript:

Similar presentations

About project

Feedback