CPSC-608 Database Systems Fall 2011 Instructor: Jianer Chen Office: HRBB 315C Phone: 845-4259 Notes #15.



Brief Overview
Data/information integration (data warehouse)
Data mining

Data Warehouse (Overview)
A data warehouse is the main repository of an organization's historical data, its corporate memory. It contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems. [Wikipedia]

What is a Warehouse?
Collection of (possibly diverse) data:
– subject oriented
– aimed at executives, decision makers, analysts
– often a copy of operational data
– with value-added data (e.g., summaries, history)
– integrated schema
– time-varying
– non-volatile

What is a Warehouse?
Collection of tools/services:
– gathering data
– cleansing, integrating, ...
– querying, reporting, aggregation, analysis
– data mining
– monitoring, administration

Why a Warehouse?
Ship and integrate data from different sources to the analyst
Three approaches:
– Database federations (legacy)
– Query-driven (lazy)
– Warehouse (eager)

Database Federations
An application program for each connection
Simple; good if communication between the databases is limited
Requires writing many application programs

Warehouse Architecture
Each source has a wrapper/extractor that consists of a collection of predefined queries on the source, plus communication mechanisms.

Query-Driven Approach
Each source has a wrapper, which classifies queries into templates and translates them into queries for the source. The wrapper can be generated from templates using modern compiler techniques.

Advantages of Query-Driven
No need to copy data
– less storage
– no need to purchase data
More up-to-date data
Query needs can be unknown
Only query interface needed at sources
May be less draining on sources

Advantages of Warehousing
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse
– Modify, summarize (store aggregates)
– Add historical information

OLTP vs. OLAP
OLTP: On-Line Transaction Processing
– Describes processing at operational sites (sources)
OLAP: On-Line Analytical Processing
– Describes processing at warehouse

OLTP vs. OLAP
OLTP:
– Mostly updates
– Many small transactions
– Megabytes to terabytes of data
– Raw data
– Up-to-date data
– Consistency, recoverability critical
– Clerical users
OLAP:
– Mostly reads
– Queries long, typically complex aggregations
– Gigabytes to terabytes of data
– Summarized, consolidated data
– Decision-makers, analysts as users
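The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration: the sales table, its columns, and the values are assumptions made up for the example, and a real warehouse would run the analytical query on a separate copy of the data rather than on the operational database.

```python
import sqlite3

# Hypothetical in-memory "sales" table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, product TEXT, amount REAL)"
)

# OLTP-style work: many small transactions, each touching a few rows.
rows = [("sf", "p5", 10.0), ("sf", "p8", 4.5), ("la", "p5", 12.0), ("la", "p9", 7.5)]
for store, product, amount in rows:
    with conn:  # each insert commits as its own short transaction
        conn.execute(
            "INSERT INTO sales (store, product, amount) VALUES (?, ?, ?)",
            (store, product, amount),
        )

# OLAP-style work: one read-only query that scans and aggregates many rows.
totals_by_store = dict(
    conn.execute("SELECT store, SUM(amount) FROM sales GROUP BY store")
)
print(totals_by_store)
```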

Implementing a Warehouse
Monitoring: Sending data from sources
Integrating: Data loading, cleaning, ...
Processing: Query processing, indexing, ...
Managing: Metadata, design, ...

Monitoring Issues
Frequency
– periodic: daily, weekly, …
– triggered: on “big” change, lots of changes, ...
Data transformation/normalization
– convert data to uniform format
– remove & add fields (e.g., add date to get history)
Standards
Gateways (intranet/internet, firewalls, VPN, etc.)
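The data transformation step can be sketched as a small function that converts records from differently formatted sources into one layout, drops source-only fields, and adds a load date so the warehouse accumulates history. The field names (product, sold_on, amount, clerk, loaded_on) and the two date formats are assumptions chosen for illustration, not part of the notes.

```python
from datetime import date, datetime

def normalize(record, source_format, load_date):
    """Convert one source record to a uniform warehouse format (illustrative)."""
    sold_on = datetime.strptime(record["sold_on"], source_format).date()
    # Fields not copied below (e.g. a source-internal "clerk" field) are dropped.
    return {
        "product": record["product"].strip().lower(),
        "sold_on": sold_on.isoformat(),       # uniform date format
        "amount": round(float(record["amount"]), 2),
        "loaded_on": load_date.isoformat(),   # added field: gives the warehouse history
    }

# Two hypothetical sources that write dates differently.
us_row = {"product": " P5 ", "sold_on": "03/18/2010", "amount": "10.50", "clerk": "x"}
eu_row = {"product": "p8", "sold_on": "18.03.2010", "amount": "4.5"}
load = date(2010, 3, 19)

print(normalize(us_row, "%m/%d/%Y", load))
print(normalize(eu_row, "%d.%m.%Y", load))
```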

Integration
Data Cleaning
Data Loading
Derived Data

Processing
Index Structures
What to Materialize?
Algorithms
[Figure: warehouse architecture showing Client, Query & Analysis, Warehouse, Metadata, Integration, and Source]

Managing
Metadata
Warehouse Design
Tools
[Figure: warehouse architecture showing Client, Query & Analysis, Warehouse, Metadata, Integration, and Source]

Warehouse Design
What data is needed?
Where does it come from?
How to clean data?
How to represent in warehouse (schema)?
What to summarize?
What to materialize?
What to index?

Conclusions
Massive amounts of data and complexity of queries will push limits of current warehouses
Need better systems:
– easier to use
– provide quality information
– scalability

Data Mining (Overview)
What is data mining? A process of examining data and finding simple rules or models that summarize the data.
Mining techniques:
– Decision Trees
– Clustering
– Association Rules

Decision Trees
Example:
Conducted a survey to see which customers were interested in a new model car
Want to select customers for an advertising campaign
[Figure: table of survey results, used as the training set]

One Possibility
[Figure: decision tree with yes/no (Y/N) branches on the tests car=taurus, city=sf, and age<45, ending in the leaves “likely” and “unlikely”]

Another Possibility
[Figure: decision tree with yes/no (Y/N) branches on the tests age<30, city=sf, and car=van, ending in the leaves “likely” and “unlikely”]

Issues
Decision tree should not be “too deep”
– would not have statistically significant amounts of data for lower decisions
Need to select tree that most reliably predicts outcomes
– automatic decision tree construction from training data (“supervised learning”, since the training set is labeled)
– exploit training data statistics to detect most “discriminative” attribute/value conditions at each level
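One common way to measure how “discriminative” a condition is uses information gain: the drop in label entropy after splitting the training set on the condition. The notes do not name a specific measure, so the choice of information gain is an assumption here, and the six training rows below are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, test):
    """Entropy reduction obtained by splitting on a yes/no test."""
    yes = [lab for row, lab in zip(rows, labels) if test(row)]
    no = [lab for row, lab in zip(rows, labels) if not test(row)]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (yes, no) if part
    )
    return entropy(labels) - remainder

# Hypothetical training set: survey answers plus the observed outcome.
rows = [
    {"car": "taurus", "city": "sf", "age": 27},
    {"car": "van", "city": "la", "age": 35},
    {"car": "van", "city": "sf", "age": 40},
    {"car": "taurus", "city": "sf", "age": 52},
    {"car": "merc", "city": "la", "age": 60},
    {"car": "taurus", "city": "la", "age": 50},
]
labels = ["likely", "likely", "likely", "unlikely", "unlikely", "unlikely"]

tests = {
    "car=taurus": lambda r: r["car"] == "taurus",
    "city=sf": lambda r: r["city"] == "sf",
    "age<45": lambda r: r["age"] < 45,
}
gains = {name: information_gain(rows, labels, t) for name, t in tests.items()}
best = max(gains, key=gains.get)
print(best, gains)
```

A tree builder would place the highest-gain test at the root and repeat the same computation on each branch, stopping before the branches hold too few rows to be statistically meaningful.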

Clustering
[Figure: data points grouped into clusters in a space with axes age, income, and education]

Another Example: Text
Each document is a vector
Clusters contain “similar” documents
Useful for understanding, searching documents
[Figure: document clusters labeled international news, sports, and business]

Issues
Given desired number of clusters?
Finding “best” clusters
Are clusters semantically meaningful?
Using clusters for disk storage
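As one concrete instance of clustering with a given number of clusters, the sketch below implements plain k-means. The notes do not prescribe a particular algorithm, so k-means is an assumption, as are the (age, income) points and the starting centers. Passing the starting centers explicitly keeps the run deterministic.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on tuples of numbers; the number of clusters is len(centers)."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its cluster
        # (a center with an empty cluster stays where it is).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical (age, income in thousands) points.
points = [(25, 30), (27, 32), (30, 35), (55, 90), (60, 95), (58, 88)]
centers, clusters = kmeans(points, centers=[(25, 30), (60, 95)])
print(centers)
print(clusters)
```

The result depends on the starting centers and on the chosen number of clusters, which is exactly the “given desired number of clusters?” and “best clusters” issue listed above.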

Association Rule Mining
[Table: sales records (market-basket data) with columns transaction id, customer id, products bought]
Trend 1) Products p5, p8 often bought together
Trend 2) Customer 12 likes product p9

Association Rule
Rule: {p5, p8}, {cust 12, p9}, …
Support: number of “baskets” where these products appear
High-support set: support ≥ threshold s
Problem: find all high-support sets
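The definitions of support and high-support set can be made concrete with a direct, unoptimized count. The four baskets are made-up data, and the function names support and high_support_sets are hypothetical helpers for this sketch.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data: one set of products per transaction.
baskets = [
    {"p5", "p8", "p9"},
    {"p5", "p8"},
    {"p2", "p5", "p8"},
    {"p2", "p9"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item of itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def high_support_sets(baskets, size, s):
    """All itemsets of the given size whose support is at least s (brute force)."""
    counts = Counter()
    for basket in baskets:
        for combo in combinations(sorted(basket), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= s}

print(support({"p5", "p8"}, baskets))
print(high_support_sets(baskets, size=2, s=2))
```

Counting every combination in every basket is simple but grows quickly with basket size, which motivates a more efficient method.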

Association Rules
How do we perform rule mining efficiently?
Observation:
– If set X has support t, then each subset of X must have support at least t
For 2-sets:
– if we need support s for {i, j}
– then each of i and j must appear in at least s baskets
A-Priori Algorithm
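A minimal sketch of the A-Priori idea for 2-sets follows: a first pass keeps only items that appear in at least s baskets, and a second pass counts only pairs made of those items. The baskets and the function name apriori_pairs are assumptions for illustration; the full algorithm repeats the same pruning for larger sets.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Find all pairs with support >= s, pruning with single-item counts first."""
    # Pass 1: count single items and keep those appearing in at least s baskets.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, n in item_counts.items() if n >= s}

    # Pass 2: count only pairs whose two members both survived pass 1.
    pair_counts = Counter()
    for basket in baskets:
        candidates = sorted(frequent_items & set(basket))
        pair_counts.update(combinations(candidates, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= s}

# Hypothetical market-basket data.
baskets = [
    ["p5", "p8", "p9"],
    ["p5", "p8"],
    ["p2", "p5", "p8"],
    ["p2", "p9"],
    ["p1", "p5"],
]
result = apriori_pairs(baskets, s=2)
print(result)
```

Here p1 appears in only one basket, so no pair containing p1 is ever counted in the second pass.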

CSCE-608 Course Summary
Overview of DB and DBMS systems;
The memory architecture;
Indexing and hashing;
Query processing;
Crash recovery;
Concurrency control;
Transaction processing;
Data integrity and data mining;


Indexing and Hashing
B+ trees
– structure
– operations: search, insert, delete
Hashing
– hash table and hash function
– operations: search, insert, delete
– extensible hashing
– linear hashing

Query Processing
Query compiler, parse tree
Logical query plan, physical query plan
Disk-I/O-efficient algorithms
Cost estimation of query plans

Crash Recovery
Undo logging
Redo logging
Undo/redo logging
Recovery algorithms
Checkpoints

Concurrency Control
Serialization
Locking systems
Timestamp
Validation

Transaction Processing
Recoverability
Handling deadlocks