Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS 345: Topics in Data Warehousing Tuesday, September 28, 2004.

Similar presentations


Presentation on theme: "CS 345: Topics in Data Warehousing Tuesday, September 28, 2004."— Presentation transcript:

1 CS 345: Topics in Data Warehousing Tuesday, September 28, 2004

2 Outline of Today’s Class What is data warehousing? Transaction processing vs. data analysis Course logistics Data integration

3 A Brief History of Information Technology The “dark ages”: paper forms in file cabinets Computerized systems emerge –Initially for big projects like Social Security –Same functionality as old paper-based systems The “golden age”: databases are everywhere –Most activities tracked electronically –Stored data provides detailed history of activity The next step: use data for decision-making –The focus of this course! –Made possible by omnipresence of IT –Identify inefficiencies in current processes –Quantify likely impact of decisions

4 Databases for Decision Support 1 st phase: Automating existing processes makes them more efficient. –Automation → Lots of well-organized, easily accessed data 2 nd phase: Data analysis allows for better decision-making. –Analyze data → better understanding –Better understanding → better decisions “Data Entry” vs. “Thinking” –Data analysts are decision-makers: managers, executives, etc.

5 OLTP vs. OLAP OLTP: On-Line Transaction Processing –Many short transactions (queries + updates) –Examples: Update account balance Enroll in course Add book to shopping cart –Queries touch small amounts of data (one record or a few records) –Updates are frequent –Concurrency is biggest performance concern OLAP: On-Line Analytical Processing –Long transactions, complex queries –Examples: Report total sales for each department in each month Identify top-selling books Count classes with fewer than 10 students –Queries touch large amounts of data –Updates are infrequent –Individual queries can require lots of resources

6 Why OLAP & OLTP don’t mix (1) Transaction processing (OLTP): –Fast response time important (< 1 second) –Data must be up-to-date, consistent at all times Data analysis (OLAP): –Queries can consume lots of resources –Can saturate CPUs and disk bandwidth –Operating on static “snapshot” of data usually OK OLAP can “crowd out” OLTP transactions –Transactions are slow → unhappy users Example: –Analysis query asks for sum of all sales –Acquires lock on sales table for consistency –New sales transaction is blocked Different performance requirements

7 Why OLAP & OLTP don’t mix (2) Transaction processing (OLTP): –Normalized schema for consistency –Complex data models, many tables –Limited number of standardized queries and updates Data analysis (OLAP): –Simplicity of data model is important Allow semi-technical users to formulate ad hoc queries –De-normalized schemas are common Fewer joins → improved query performance Fewer tables → schema is easier to understand Different data modeling requirements

8 Why OLAP & OLTP don’t mix (3) An OLTP system targets one specific process –For example: ordering from an online store OLAP integrates data from different processes –Combine sales, inventory, and purchasing data –Analyze experiments conducted by different labs OLAP often makes use of historical data –Identify long-term patterns –Notice changes in behavior over time Terminology, schemas vary across data sources –Integrating data from disparate sources is a major challenge Analysis requires data from many sources

9 Data Warehouses Doing OLTP and OLAP in the same database system is often impractical –Different performance requirements –Different data modeling requirements –Analysis queries require data from many sources Solution: Build a “data warehouse” –Copy data from various OLTP systems –Optimize data organization, system tuning for OLAP –Transactions aren’t slowed by big analysis queries –Periodically refresh the data in the warehouse

10 Course Logistics Course web site: http://cs345.stanford.edu http://cs345.stanford.edu Course format will be lecture-based –As opposed to a paper-reading course Prerequisite: –Knowledge of SQL

11 Assigned Work Five homework assignments –One problem set –Four programming assignments Not a lot of code to write Emphasis will be on interacting with Oracle Course project –Open-ended –Focus on a topic of your choosing –Any of these types: Research project, or… Programming project, or… Survey of research literature –May be done individually or in groups of two Final Exam

12 High-Level Course Outline Logical Database Design –How should the data be modeled? –Designing the data warehouse schema Query Processing –Analysis queries are hard to answer efficiently –What techniques are available to the DBMS? Physical Database Design –How should the data be organized on disk? –What data structures should be used? Data Mining –What use is all this data? –Which questions should we ask our data warehouse?

13 Additional Topics Related topics to be touched on briefly: –Data integration –Data cleaning –Approximate query answering –Data lineage –Data visualization –Incremental maintenance of materialized views –Answering queries using views –Indexing special data types (spatial, text, geographic) –Metadata management Projects can be done in these areas

14 The Textbook “The Data Warehouse Toolkit” by Ralph Kimball and Margy Ross Written by well-known data warehouse designer Clearly written and readable Lots of generic but realistic examples Semi-technical (no math!) Business-focused We’ll use it for the first one-third of the course Get the second edition!

15 Course Objectives Gain practical understanding of how data warehouses are built and used Gain exposure to data modeling “best practices” Learn techniques used to process complex queries over very large data sets Understand the performance trade-offs that come from alternative data structures Learn commonly-used methods for mining and analysis of large data sets Become familiar with current research directions in data warehousing and related areas

16 Loading the Data Warehouse Source Systems Data Staging AreaData Warehouse (OLTP) Data is periodically extracted Data is cleansed and transformed Users query the data warehouse

17 Data Integration is Hard Data warehouses combine data from multiple sources Data must be translated into a consistent format Data integration represents ~80% of effort for a typical data warehouse project! Some reasons why it’s hard: –Metadata is poor or non-existent –Data quality is often bad Missing or default values Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California) –Inconsistent semantics What is an airline passenger?

18 Federated Databases An alternative to data warehouses Data warehouse –Create a copy of all the data –Execute queries against the copy Federated database –Pull data from source systems as needed to answer queries “lazy” vs. “eager” data integration Data WarehouseFederated Database Query Answer Query Extraction Rewritten Queries Answer Source Systems Source Systems Warehouse Mediator

19 Warehouses vs. Federation Advantages of federated databases: –No redundant copying of data –Queries see “real-time” view of evolving data –More flexible security policy Disadvantages of federated databases: –Analysis queries place extra load on transactional systems –Query optimization is hard to do well –Historical data may not be available –Complex “wrappers” needed to mediate between analysis server and source systems Data warehouses are much more common in practice –Better performance –Lower complexity –Slightly out-of-date data is acceptable


Download ppt "CS 345: Topics in Data Warehousing Tuesday, September 28, 2004."

Similar presentations


Ads by Google