Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.

Similar presentations


Presentation on theme: "Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11."— Presentation transcript:

1 Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

2 CS 245Notes11 2 Warehousing l Growing industry: $8 billion in 1998 l Range from desktop to huge: u Walmart: 900-CPU, 2,700 disk, 23TB Teradata system l Lots of buzzwords, hype u slice & dice, rollup, MOLAP, pivot,...

3 CS 245Notes11 3 Outline l What is a data warehouse? l Why a warehouse? l Models & operations l Implementing a warehouse l Future directions

4 CS 245Notes11 4 What is a Warehouse? l Collection of diverse data u subject oriented u aimed at executive, decision maker u often a copy of operational data u with value-added data (e.g., summaries, history) u integrated u time-varying u non-volatile more

5 CS 245Notes11 5 What is a Warehouse? l Collection of tools u gathering data u cleansing, integrating,... u querying, reporting, analysis u data mining u monitoring, administering warehouse

6 CS 245Notes11 6 Warehouse Architecture Client Warehouse Source Query & Analysis Integration Metadata

7 CS 245Notes11 7 Motivating Examples l Forecasting l Comparing performance of units l Monitoring, detecting fraud l Visualization

8 CS 245Notes11 8 Why a Warehouse? l Two Approaches: u Query-Driven (Lazy) u Warehouse (Eager) Source ?

9 CS 245Notes11 9 Query-Driven Approach Client Wrapper Mediator Source

10 CS 245Notes11 10 Advantages of Warehousing l High query performance l Queries not visible outside warehouse l Local processing at sources unaffected l Can operate when sources unavailable l Can query data not stored in a DBMS l Extra information at warehouse u Modify, summarize (store aggregates) u Add historical information

11 CS 245Notes11 11 Advantages of Query-Driven l No need to copy data u less storage u no need to purchase data l More up-to-date data l Query needs can be unknown l Only query interface needed at sources l May be less draining on sources

12 CS 245Notes11 12 OLTP vs. OLAP l OLTP: On Line Transaction Processing u Describes processing at operational sites l OLAP: On Line Analytical Processing u Describes processing at warehouse

13 CS 245Notes11 13 OLTP vs. OLAP l Mostly updates l Many small transactions l Mb-Tb of data l Raw data l Clerical users l Up-to-date data l Consistency, recoverability critical l Mostly reads l Queries long, complex l Gb-Tb of data l Summarized, consolidated data l Decision-makers, analysts as users OLTP OLAP

14 CS 245Notes11 14 Data Marts l Smaller warehouses l Spans part of organization u e.g., marketing (customers, products, sales) l Do not require enterprise-wide consensus u but long term integration problems?

15 CS 245Notes11 15 Warehouse Models & Operators l Data Models u relations u stars & snowflakes u cubes l Operators u slice & dice u roll-up, drill down u pivoting u other

16 CS 245Notes11 16 Star

17 CS 245Notes11 17 Star Schema

18 CS 245Notes11 18 Terms l Fact table l Dimension tables l Measures

19 CS 245Notes11 19 Dimension Hierarchies store sType cityregion  snowflake schema  constellations

20 CS 245Notes11 20 Cube Fact table view: Multi-dimensional cube: dimensions = 2

21 CS 245Notes11 21 3-D Cube day 2 day 1 dimensions = 3 Multi-dimensional cube:Fact table view:

22 CS 245Notes11 22 ROLAP vs. MOLAP l ROLAP: Relational On-Line Analytical Processing l MOLAP: Multi-Dimensional On-Line Analytical Processing

23 CS 245Notes11 23 Aggregates Add up amounts for day 1 In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 81

24 CS 245Notes11 24 Aggregates Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

25 CS 245Notes11 25 Another Example Add up amounts by day, product In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId drill-down rollup

26 CS 245Notes11 26 Aggregates l Operators: sum, count, max, min, median, ave l “Having” clause l Using dimension hierarchy u average by region (within store) u maximum by month (within date)

27 CS 245Notes11 27 Cube Aggregation day 2 day 1 129... drill-down rollup Example: computing sums

28 CS 245Notes11 28 Cube Operators day 2 day 1 129... sale(c1,*,*) sale(*,*,*) sale(c2,p2,*)

29 CS 245Notes11 29 Extended Cube day 2 day 1 * sale(*,p2,*)

30 CS 245Notes11 30 Aggregation Using Hierarchies day 2 day 1 customer region country (customer c1 in Region A; customers c2, c3 in Region B)

31 CS 245Notes11 31 Pivoting day 2 day 1 Multi-dimensional cube: Fact table view:

32 CS 245Notes11 32 Query & Analysis Tools l Query Building l Report Writers (comparisons, growth, graphs,…) l Spreadsheet Systems l Web Interfaces l Data Mining

33 CS 245Notes11 33 Other Operations l Time functions u e.g., time average l Computed Attributes u e.g., commission = sales * rate l Text Queries u e.g., find documents with words X AND B u e.g., rank documents by frequency of words X, Y, Z

34 CS 245Notes11 34 Data Mining l Decision Trees l Clustering l Association Rules

35 CS 245Notes11 35 Decision Trees Example: Conducted survey to see what customers were interested in new model car Want to select customers for advertising campaign training set

36 CS 245Notes11 36 One Possibility age<30 city=sfcar=van likely unlikely YY Y N N N

37 CS 245Notes11 37 Another Possibility car=taurus city=sfage<45 likely unlikely YY Y N N N

38 CS 245Notes11 38 Issues l Decision tree cannot be “too deep” è would not have statistically significant amounts of data for lower decisions l Need to select tree that most reliably predicts outcomes

39 CS 245Notes11 39 Clustering age income education

40 CS 245Notes11 40 Another Example: Text l Each document is a vector u e.g., contains words 1,4,5,... l Clusters contain “similar” documents l Useful for understanding, searching documents international news sports business

41 CS 245Notes11 41 Issues l Given desired number of clusters? l Finding “best” clusters l Are clusters semantically meaningful? u e.g., “yuppies’’ cluster? l Using clusters for disk storage

42 CS 245Notes11 42 Association Rule Mining transaction id customer id products bought sales records: Trend: Products p5, p8 often bough together Trend: Customer 12 likes product p9 market-basket data

43 CS 245Notes11 43 Association Rule l Rule: {p 1, p 3, p 8 } l Support: number of baskets where these products appear l High-support set: support  threshold s l Problem: find all high support sets

44 CS 245Notes11 44 Finding High-Support Pairs l Baskets(basket, item) l SELECT I.item, J.item, COUNT(I.basket) FROM Baskets I, Baskets J WHERE I.basket = J.basket AND I.item = s; WHY?

45 CS 245Notes11 45 Example check if count  s

46 CS 245Notes11 46 Issues l Performance for size 2 rules big even bigger! l Performance for size k rules

47 CS 245Notes11 47 Implementing a Warehouse l Monitoring: Sending data from sources l Integrating: Loading, cleansing,... l Processing: Query processing, indexing,... l Managing: Metadata, Design,...

48 CS 245Notes11 48 Monitoring l Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … l Incremental vs. Refresh new

49 CS 245Notes11 49 Monitoring Techniques l Periodic snapshots l Database triggers l Log shipping l Data shipping (replication service) l Transaction shipping l Polling (queries to source) l Screen scraping l Application level monitoring  Advantages & Disadvantages!!

50 CS 245Notes11 50 Monitoring Issues l Frequency u periodic: daily, weekly, … u triggered: on “big” change, lots of changes,... l Data transformation u convert data to uniform format u remove & add fields (e.g., add date to get history) l Standards (e.g., ODBC) l Gateways

51 CS 245Notes11 51 Integration l Data Cleaning l Data Loading l Derived Data Client Warehouse Source Query & Analysis Integration Metadata

52 CS 245Notes11 52 Data Cleaning Migration (e.g., yen  dollars) l Scrubbing: use domain-specific knowledge (e.g., social security numbers) l Fusion (e.g., mail list, customer merging) l Auditing: discover rules & relationships (like data mining) billing DB service DB customer1(Joe) customer2(Joe) merged_customer(Joe)

53 CS 245Notes11 53 Loading Data l Incremental vs. refresh l Off-line vs. on-line l Frequency of loading u At night, 1x a week/month, continuously l Parallel/Partitioned load

54 CS 245Notes11 54 Derived Data l Derived Warehouse Data u indexes u aggregates u materialized views (next slide) l When to update derived data? l Incremental vs. refresh

55 CS 245Notes11 55 Materialized Views l Define new warehouse relations using SQL expressions does not exist at any source

56 CS 245Notes11 56 Processing l ROLAP servers vs. MOLAP servers l Index Structures l What to Materialize? l Algorithms Client Warehouse Source Query & Analysis Integration Metadata

57 CS 245Notes11 57 ROLAP Server l Relational OLAP Server relational DBMS ROLAP server tools utilities Special indices, tuning; Schema is “denormalized”

58 CS 245Notes11 58 MOLAP Server l Multi-Dimensional OLAP Server multi- dimensional server M.D. tools utilities could also sit on relational DBMS Product City Date 1 2 3 4 milk soda eggs soap A B Sales

59 CS 245Notes11 59 Index Structures l Traditional Access Methods u B-trees, hash tables, R-trees, grids, … l Popular in Warehouses u inverted lists u bit map indexes u join indexes u text indexes

60 CS 245Notes11 60 Inverted Lists... age index inverted lists data records

61 CS 245Notes11 61 Using Inverted Lists l Query: u Get people with age = 20 and name = “fred” l List for age = 20: r4, r18, r34, r35 l List for name = “fred”: r18, r52 l Answer is intersection: r18

62 CS 245Notes11 62 Bit Maps... age index bit maps data records

63 CS 245Notes11 63 Using Bit Maps l Query: u Get people with age = 20 and name = “fred” l List for age = 20: 1101100000 l List for name = “fred”: 0100000001 l Answer is intersection: 010000000000 l Good if domain cardinality small l Bit vectors can be compressed

64 CS 245Notes11 64 Join “Combine” SALE, PRODUCT relations In SQL: SELECT * FROM SALE, PRODUCT

65 CS 245Notes11 65 Join Indexes join index

66 CS 245Notes11 66 What to Materialize? l Store in warehouse results useful for common queries l Example: day 2 day 1 129... total sales materialize

67 CS 245Notes11 67 Materialization Factors l Type/frequency of queries l Query response time l Storage cost l Update cost

68 CS 245Notes11 68 Cube Aggregates Lattice city, product, date city, productcity, dateproduct, date cityproductdate all day 2 day 1 129 use greedy algorithm to decide what to materialize

69 CS 245Notes11 69 Dimension Hierarchies all state city

70 CS 245Notes11 70 Dimension Hierarchies city, product city, product, date city, date product, date city product date all state, product, date state, date state, product state not all arcs shown...

71 CS 245Notes11 71 Interesting Hierarchy all years quarters months days weeks conceptual dimension table

72 CS 245Notes11 72 Algorithms l Query Optimization l Parallel Processing l Data Mining

73 CS 245Notes11 73 Example: Association Rules l How do we perform rule mining efficiently? l Observation: If set X has support t, then each X subset must have at least support t l For 2-sets: u if we need support s for {i, j} u then each i, j must appear in at least s baskets

74 CS 245Notes11 74 Algorithm for 2-Sets (1) Find OK products u those appearing in s or more baskets (2) Find high-support pairs using only OK products

75 CS 245Notes11 75 Algorithm for 2-Sets l INSERT INTO okBaskets(basket, item) SELECT basket, item FROM Baskets GROUP BY item HAVING COUNT(basket) >= s; l Perform mining on okBaskets SELECT I.item, J.item, COUNT(I.basket) FROM okBaskets I, okBaskets J WHERE I.basket = J.basket AND I.item = s;

76 CS 245Notes11 76 Counting Efficiently l One way: sort count & remove threshold = 3

77 CS 245Notes11 77 Counting Efficiently l Another way: remove scan & count keep counter array in memory threshold = 3

78 CS 245Notes11 78 Yet Another Way (1) scan & hash & count in-memory hash table threshold = 3 (2) scan & remove (4) remove (3) scan& count in-memory counters false positive

79 CS 245Notes11 79 Discussion l Hashing scheme: 2 (or 3) scans of data l Sorting scheme: requires a sort! l Hashing works well if few high-support pairs and many low-support ones item-pairs ranked by frequency frequency threshold iceberg queries

80 CS 245Notes11 80 Managing l Metadata l Warehouse Design l Tools Client Warehouse Source Query & Analysis Integration Metadata

81 CS 245Notes11 81 Metadata l Administrative u definition of sources, tools,... u schemas, dimension hierarchies, … u rules for extraction, cleaning, … u refresh, purging policies u user profiles, access control,...

82 CS 245Notes11 82 Metadata l Business u business terms & definition u data ownership, charging l Operational u data lineage u data currency (e.g., active, archived, purged) u use stats, error reports, audit trails

83 CS 245Notes11 83 Design l What data is needed? l Where does it come from? l How to clean data? l How to represent in warehouse (schema)? l What to summarize? l What to materialize? l What to index?

84 CS 245Notes11 84 Tools l Development u design & edit: schemas, views, scripts, rules, queries, reports l Planning & Analysis u what-if scenarios (schema changes, refresh rates), capacity planning l Warehouse Management u performance monitoring, usage patterns, exception reporting l System & Network Management u measure traffic (sources, warehouse, clients) l Workflow Management u “reliable scripts” for cleaning & analyzing data

85 CS 245Notes11 85 Current State of Industry l Extraction and integration done off-line u Usually in large, time-consuming, batches l Everything copied at warehouse u Not selective about what is stored u Query benefit vs storage & update cost l Query optimization aimed at OLTP u High throughput instead of fast response u Process whole query before displaying anything

86 CS 245Notes11 86 Future Directions l Better performance l Larger warehouses l Easier to use l What are companies & research labs working on?

87 CS 245Notes11 87 Research (1) l Incremental Maintenance l Data Consistency l Data Expiration l Recovery l Data Quality l Error Handling (Back Flush)

88 CS 245Notes11 88 Research (2) l Rapid Monitor Construction l Temporal Warehouses l Materialization & Index Selection l Data Fusion l Data Mining l Integration of Text & Relational Data

89 CS 245Notes11 89 Conclusions l Massive amounts of data and complexity of queries will push limits of current warehouses l Need better systems: u easier to use u provide quality information


Download ppt "Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11."

Similar presentations


Ads by Google