Data Warehouses. Kathy S. Schwaig.


1 Data Warehouses Kathy S. Schwaig

2 Outline
- Data Explosion
- Data Warehouses
- Multi-dimensional databases
Portions of this presentation are adapted from © J. Han, Simon Fraser University, Canada, 2000

3 Now that we have gathered so much data, what do we do with it?
“I never waste memory on things that can easily be stored and retrieved from elsewhere.” - Albert Einstein

4 Data Explosion Problem
- Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases.
- We are drowning in data, but starving for knowledge!

5 What is a Data Warehouse? An integrated and consistent store of subject-oriented data, structured for query and retrieval in order to support management decision making.

6 A data warehouse is where the information systems department puts data to be turned into information. One cannot just dump masses of data into a disk drive and expect it to be usable.

7 Goal of Data Warehousing
- Resolve enormous data access difficulties:
  - Unavailable data hidden in transaction systems
  - Delays as underpowered systems try to perform huge, complex queries
  - Complex, user-hostile interfaces
  - Difficulties in discovering patterns in large amounts of data
  - Competition for computer resources between transaction systems and decision support systems

8 On-line Transaction Processing (OLTP)
- Traditional database management systems (DBMS) are used for on-line transaction processing (OLTP).
  - Order entry: update status field of order 445522
  - Banking: transfer $100 from account 55779 to account 99321
- Characteristics:
  - detailed, up-to-date data
  - structured, repetitive tasks
  - short transactions
  - read and/or update a few records
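The banking example above can be sketched as one short transaction that touches two records. This is a minimal illustration using Python's standard `sqlite3` module and an in-memory database; the `accounts` table, its columns, the starting balances, and the `transfer` helper are assumptions made for the example, not part of the presentation.

```python
import sqlite3


def transfer(conn, source, target, amount):
    """Move `amount` between two accounts as one short transaction."""
    # `with conn` commits on success and rolls back if an exception is raised.
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, source, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("unknown source account or insufficient funds")
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, target),
        )
        if cur.rowcount != 1:
            raise ValueError("unknown target account")


# Hypothetical table and starting balances, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [(55779, 500), (99321, 200)]
)
conn.commit()

# "Transfer $100 from account 55779 to account 99321."
transfer(conn, 55779, 99321, 100)
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

The transaction reads and updates only two rows and either completes fully or not at all, which is the usage pattern the slide attributes to OLTP.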

9 An OLTP example
- You call retailer Lands' End, where you have done business before. The exchange might be:
  “Hi, this is Mr. Smith. I’d like to place an order.”
  “Your phone number, please?”
  “555-555-1212”
  (Pulls up your file) “Yes, Mr. Smith. What can I help you with?”
  “I’d like to order merchandise number 2222.”
  “I see you were a little late last year in getting your Christmas presents ordered. Would you like some suggestions to get the process started earlier?”
  “Sure, why not?”
  “Last year, you bought your Aunt Jennifer a scarf. We have a lovely pair of gloves to match -- they are on special for only $19. Should I add those to your order?”
  “Uh...sure.”
  “And would you like the card to say the same as last year?”
  “Yes, please.”

10 Decision Support versus Transaction Processing
Characteristics and usage patterns of operational systems (transaction processing systems) used to automate business processes and those of a Decision Support System are fundamentally different but linked. Why?

11 What is a Data Warehouse?
- Facility for integrating data
- Organizes and stores data for analytical processing from historical perspective
- Maintained separately from organization’s operational database

12 Data Warehouse Architecture
[Figure: data sources (operational DBs, other sources) feed the data warehouse through Extract, Transform, Load, and Refresh; the warehouse and its metadata serve data marts and a DSS server; tools: analysis, query, reports, data mining.]

13 Characteristics of a Data Warehouse
- Subject-oriented
- Integrated
- Non-volatile
- Time-variant

14 1. Subject Oriented
- Oriented to the major subject areas of the corporation
  - E.g. insurance company: customer, product, transaction, policy, claim, account
- Operational database and applications may be organized differently
  - E.g. based on type of insurance: auto, life, medical, fire.

15 2. Integrated
- Inconsistencies in encoding and naming conventions exist among data sources. Why?
- Data converted
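As a rough sketch of what "data converted" can mean in practice, the snippet below maps differently named fields and differently encoded values from three sources onto one warehouse convention. The source names, field names, and gender codes are all hypothetical, chosen only to illustrate the kind of inconsistency the slide describes.

```python
# Hypothetical per-source field names mapped to one warehouse naming convention.
FIELD_NAMES = {
    "billing": {"cust_id": "customer_id", "sex": "gender"},
    "claims": {"CUSTNO": "customer_id", "GENDER_CD": "gender"},
    "web": {"customerId": "customer_id", "gender": "gender"},
}

# Hypothetical per-source encodings of the same attribute.
GENDER_CODES = {
    "billing": {"m": "M", "f": "F"},
    "claims": {"1": "M", "0": "F"},
    "web": {"male": "M", "female": "F"},
}


def integrate(source, record):
    """Rename fields and recode values into the single warehouse convention."""
    names = FIELD_NAMES[source]
    out = {names[field]: value for field, value in record.items()}
    raw = str(out["gender"]).strip().lower()
    out["gender"] = GENDER_CODES[source][raw]
    return out


unified = [
    integrate("billing", {"cust_id": 1, "sex": "f"}),
    integrate("claims", {"CUSTNO": 2, "GENDER_CD": 1}),
    integrate("web", {"customerId": 3, "gender": "Male"}),
]
```

After conversion, every record uses the same field names and the same codes, whichever source it came from.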

16 3. Non-Volatile
- Operational data regularly accessed and manipulated a record at a time. Update performed in operational environment.
- Warehouse data loaded and accessed.
- Update of data does not occur in the data warehouse environment.

17 4. Time Variant... A data warehouse is a “time-variant” collection of data, meaning time is a variable in accessing the data.

18 Time Variant
- Time horizon for data longer than that of operational systems.
- Operational database contains current value data.
- Data warehouse data is a sophisticated series of snapshots.
- The key structure of operational data may or may not contain some element of time. The key structure of the data warehouse always contains some element of time.
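The contrast between current-value data and a series of snapshots can be sketched with two dictionaries. This is an illustrative toy, not a warehouse design: the account number is borrowed from the OLTP slide, and the balances, dates, and function names are assumptions.

```python
from datetime import date

# Operational store: keyed by account only, holds the current value.
operational = {}

# Warehouse: the key always includes an element of time.
warehouse = {}


def update_operational(account_id, balance):
    """Overwrite the current value; the previous value is lost."""
    operational[account_id] = balance


def load_snapshot(snapshot_date):
    """Copy the current operational values into the warehouse as a snapshot."""
    for account_id, balance in operational.items():
        warehouse[(account_id, snapshot_date)] = balance


update_operational(55779, 500)
load_snapshot(date(2000, 1, 31))
update_operational(55779, 400)
load_snapshot(date(2000, 2, 29))

# The operational store knows only the latest balance;
# the warehouse can answer "what was the balance at each month end?"
history = sorted(
    (day, balance)
    for (account_id, day), balance in warehouse.items()
    if account_id == 55779
)
```

Because the snapshot date is part of the key, loading a new snapshot adds rows instead of replacing earlier ones.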

19 Data Mart A data mart is a smaller version of a data warehouse, typically containing data related to a single functional area of the firm or having limited scope in some other way. It can be a useful first step to a full-scale data warehouse.

20 Data: The Critical Issue
- Users need to gather, analyze, report on business information to help organizations gain competitive advantage.
- Most companies have a wealth of legacy data. Worthless if:
  - existence unknown
  - cannot be found
  - cannot be understood
  - incorrect

21 Data Transformation
- Simple transformation -- e.g. change data type of field from integer to character
- Cleansing & scrubbing -- consistent format, valid values
- Integration -- take data from multiple sources and map it field by field into the data warehouse
- Aggregation / summarization
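The four kinds of transformation listed above can be sketched as a tiny pipeline in plain Python. The two source extracts, their field names, and the helper functions are hypothetical; real transformation tooling is far more involved, so treat this only as an illustration of the steps.

```python
from collections import defaultdict

# Hypothetical extracts from two source systems with different field names.
store_sales = [
    {"sku": 101, "region": " west ", "amount": "19.00"},
    {"sku": 102, "region": "WEST", "amount": "5.50"},
]
web_sales = [
    {"item": "101", "area": "East", "total": "7.25"},
    {"item": "103", "area": "west", "total": "n/a"},  # invalid value
]


def simple_transform(sku):
    """Simple transformation: change the field's type from integer to character."""
    return str(sku)


def cleanse(region, amount):
    """Cleansing: consistent region format, reject non-numeric amounts."""
    try:
        value = float(amount)
    except ValueError:
        return None
    return region.strip().title(), value


def integrate():
    """Integration: map each source's fields onto (sku, region, amount)."""
    rows = [(r["sku"], r["region"], r["amount"]) for r in store_sales]
    rows += [(r["item"], r["area"], r["total"]) for r in web_sales]
    cleaned = []
    for sku, region, amount in rows:
        result = cleanse(region, amount)
        if result is not None:
            cleaned.append((simple_transform(sku), *result))
    return cleaned


def aggregate(rows):
    """Aggregation: total sales amount per region."""
    totals = defaultdict(float)
    for _sku, region, amount in rows:
        totals[region] += amount
    return dict(totals)


warehouse_rows = integrate()
sales_by_region = aggregate(warehouse_rows)
```

In this sketch the row with the invalid amount is dropped during cleansing; a real pipeline might instead route such rows to a reject file for review.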

22 Sample Operations
- Roll up -- summarize data
  - total sales volume last year by product category by region
- Roll down, drill down, drill through -- go from higher level summary to lower level summary or detailed data
  - For a particular product category, find the detailed sales data for each salesperson by date
- Slice and dice
  - Sales of beverages in the West over the last 6 months
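These operations can be sketched over a small list of sales facts. The rows, the field layout, and the `roll_up` and `slice_dice` helpers below are assumptions made for illustration; OLAP tools expose the same ideas through their own interfaces.

```python
from collections import defaultdict

# Hypothetical fact rows: (category, region, salesperson, month, amount).
FIELDS = ("category", "region", "salesperson", "month", "amount")
facts = [
    ("Beverages", "West", "Ann", 1, 100),
    ("Beverages", "West", "Bob", 2, 50),
    ("Beverages", "East", "Cy", 1, 70),
    ("Snacks", "West", "Ann", 1, 30),
]


def roll_up(rows, *dims):
    """Summarize the amount over the chosen dimensions."""
    positions = [FIELDS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[p] for p in positions)] += row[-1]
    return dict(totals)


def slice_dice(rows, **conditions):
    """Keep only the rows that match every dimension=value condition."""
    return [
        row
        for row in rows
        if all(row[FIELDS.index(dim)] == value for dim, value in conditions.items())
    ]


# Roll up: total sales by product category by region.
by_category_region = roll_up(facts, "category", "region")

# Drill down: for one category, detail per salesperson and month.
beverage_detail = roll_up(
    slice_dice(facts, category="Beverages"), "salesperson", "month"
)

# Slice and dice: sales of beverages in the West.
west_beverages = slice_dice(facts, category="Beverages", region="West")
```

Drill down here is simply a roll up over finer-grained dimensions after restricting to one category, which is why the two helpers compose.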

23 Why Multi-Dimensional Databases?
- No single "best" data structure for all applications within an enterprise.
- Need good conceptual fit with the way end-users visualize business data
- Most business people already think about their businesses in multidimensional terms
- Managers tend to ask questions about product sales in different markets over specific time periods
Adapted from Arun Rai 1999

24 What is a Multi-Dimensional Database?
A multidimensional database (MDD) is a computer software system designed for the efficient and convenient storage and retrieval of large volumes of data that are:
(1) intimately related
(2) stored, viewed and analyzed from different perspectives.
These perspectives are called dimensions.

25 Contrasting Relational and Multi-Dimensional Models: An Example

26

27 Multidimensional Representation

28 View Data – An Example
- Assume that each dimension has 10 positions, as shown in the cube above
- How many records would there be in a relational table?
- Implications for viewing data from an end-user standpoint?
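The record count can be worked out directly. The sketch below assumes the cube has three dimensions (the figure is not reproduced in this transcript) and uses product, market, and time as dimension names, taken from the question managers are said to ask on slide 23. With 10 positions per dimension, a fully populated cube flattens to 10 × 10 × 10 = 1,000 relational records.

```python
from itertools import product

POSITIONS = 10
# Assumed dimension names; the original cube figure is not available.
DIMENSIONS = ("product", "market", "time")

# A fully populated cube: one cell per combination of positions.
cube = {
    cell: 0 for cell in product(range(POSITIONS), repeat=len(DIMENSIONS))
}

# Flattened to a relational table: one record per cell,
# with one column per dimension plus the measure.
relational_rows = [cell + (value,) for cell, value in cube.items()]
record_count = len(relational_rows)

# A two-dimensional view (one fixed time position) is a 10 x 10 grid.
one_time_slice = [row for row in relational_rows if row[2] == 0]
```

An end user looking at the relational form scrolls through 1,000 rows, while the cube presents the same cells as 10 grids of 10 by 10, one per position of the third dimension.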

29 Data Warehousing and The World Wide Web
- Access and transfer large amounts of data relatively easily and economically
- Integration of external data into data warehouse
- Issues of data integrity, accuracy, quality
- Quality rating versus price

30 Applications
- Data Mining
- Data Visualization (Coming Next)

31 Summary
- Data versus Information
- Data Warehouse Architecture
- Characteristics
- Applications

32 Appendix: Operational Data Store and Data Warehouse Characteristic
How is it built?
  Operational Data Store: One application or subject area at a time.
  Data Warehouse: Typically multiple subject areas at a time.
User requirements
  Operational Data Store: Well defined prior to logical design.
  Data Warehouse: Often vague and conflicting.
Area of support
  Operational Data Store: Day-to-day business operations.
  Data Warehouse: Decision support for managerial activities.
Type of access
  Operational Data Store: Relatively small number of records retrieved via a single query.
  Data Warehouse: Large data sets scanned to retrieve results from either single or multiple queries.
Frequency of access
  Operational Data Store: Tuned for frequent access to small amounts of data.
  Data Warehouse: Tuned for infrequent access to large amounts of data.
Volume of data
  Operational Data Store: Similar to typical daily volume of operational transactions.
  Data Warehouse: Much larger than typical daily transaction volume.

33 Appendix: Operational Data Store and Data Warehouse Characteristic (cont’d)
Retention period
  Operational Data Store: Retained as necessary to meet daily operating requirements.
  Data Warehouse: Retention period is indeterminate and must support historical reporting, comparison, and analysis.
Currency of data
  Operational Data Store: Up-to-the-minute; real time.
  Data Warehouse: Typically represents a static point in time.
Availability of data
  Operational Data Store: High and immediate availability may be required.
  Data Warehouse: Immediate availability is less critical.
Typical unit of analysis
  Operational Data Store: Small, manageable, transaction-level units.
  Data Warehouse: Large, unpredictable, variable units.
Design focus
  Operational Data Store: High-performance, limited flexibility.
  Data Warehouse: High flexibility, high-performance.

34 Appendix: Characteristics of a Data Warehouse
- Subject orientation. Data are organized based on how the users refer to them.
- Integrated. All inconsistencies regarding naming convention and value representations are removed.
- Nonvolatile. Data are stored in read-only format and do not change over time.
- Time variant. Data are not current but normally time-series.
- Summarized. Operational data are mapped into a decision-usable format.
- Large volume. Time-series data sets are normally quite large.
- Not normalized. DW data can be, and often are, redundant.
- Metadata. Data about data are stored.
- Data sources. Data come from internal and external unintegrated operational systems.

