CIT 858: Data Mining and Data Warehousing Course Instructor: Bajuna Salehe Email: email@example.com@yahoo.com Web: www.ifm.ac.tz/staff/bajuna/courses/
Introduction to Data Mining and Data Warehousing
Data Mining and Data Warehousing Agenda: What is Data Mining? What is Data Warehousing? The source of invention of Data Mining and Data Warehousing. Drowning in Data, Starving for Knowledge. Evolution of Database Technology to the current state. (Home Work)
What Is Data Mining? Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data. Data mining: a misnomer? It should have been named “knowledge mining from data”, which is too long, or “knowledge mining”, which does not reflect the emphasis on mining from huge amounts of data.
What Is Data Mining? Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data/Databases (KDD). KDD as a process is depicted below:
The KDD Process [Figure: Databases -> Cleaning & Integration -> Data Warehouse -> Selection & Transformation -> Data Mining -> Evaluation & Presentation -> Knowledge]
KDD Process: 1) Data cleaning: to remove noise and inconsistent data. 2) Data integration: where multiple data sources may be combined. 3) Data selection: where data relevant to the analysis task are retrieved from the database.
KDD Process: 4) Data transformation: where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations. 5) Data mining: an essential process where intelligent methods are applied in order to extract data patterns.
KDD Process: 6) Pattern evaluation: to identify the truly interesting patterns representing knowledge. 7) Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the users. 8) Use of discovered knowledge.
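The KDD steps can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the record layout, the two sources (`store_a`, `store_b`), the "above-average spender" pattern, and the `min_total` threshold are all hypothetical and are not part of the course material.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "amount": 20.0}, {"customer": "c2", "amount": None}]
store_b = [{"customer": "c1", "amount": 35.0}, {"customer": "c3", "amount": 500.0}]

def clean(records):
    """Step 1, data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Step 2, data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, customers):
    """Step 3, data selection: keep only records relevant to the task."""
    return [r for r in records if r["customer"] in customers]

def transform(records):
    """Step 4, data transformation: aggregate total spend per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

def mine(totals):
    """Step 5, data mining: a toy pattern, customers spending above the mean."""
    average = mean(totals.values())
    return {c: t for c, t in totals.items() if t > average}

def evaluate(patterns, min_total):
    """Step 6, pattern evaluation: keep patterns that pass a threshold."""
    return {c: t for c, t in patterns.items() if t >= min_total}

def present(patterns):
    """Step 7, knowledge presentation: render patterns as readable text."""
    return [f"{c} spent {t:.2f}" for c, t in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
relevant = select(data, {"c1", "c2", "c3"})
totals = transform(relevant)
knowledge = present(evaluate(mine(totals), min_total=100.0))
print(knowledge)
```

Real mining methods replace the toy `mine` step; the point here is only the order in which the stages feed each other.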
Data Mining: On What Kinds Of Data? Relational database; data warehouse; transactional database; advanced database and information repository; spatial and temporal data; stream data; multimedia database; text databases & WWW.
Data Mining Functionalities: Association (correlation and causality), e.g., Cheese & Bread. Classification and prediction: construct models that describe and distinguish classes or concepts for future prediction; predict some unknown or missing numerical values.
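As a small illustration of association analysis, the strength of a rule such as "cheese => bread" is commonly measured by its support and confidence. The sketch below assumes a made-up list of market baskets; the function names are illustrative only.

```python
# Hypothetical market-basket transactions (each basket is a set of items).
transactions = [
    {"cheese", "bread", "milk"},
    {"cheese", "bread"},
    {"bread", "butter"},
    {"cheese", "milk"},
    {"cheese", "bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) = support(A and C) / support(A)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"cheese", "bread"}, transactions)
rule_confidence = confidence({"cheese"}, {"bread"}, transactions)
print(f"cheese => bread: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here three of the five baskets contain both items, and three of the four baskets with cheese also contain bread.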
Data Mining Functionalities (cont…): Cluster analysis: class label is unknown; group data to form new classes, e.g., cluster houses to find distribution patterns. Outlier analysis: an outlier is a data object that does not comply with the general behavior of the data. Noise or exception? No! Useful in fraud detection and rare event analysis.
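Cluster analysis can be illustrated with a minimal k-means sketch that groups houses by location without any class labels. The coordinates are invented, k-means is just one possible clustering method, and the deterministic start (the first k points) is a simplification chosen so the run is repeatable.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Very small k-means: assign points to the nearest centroid, then re-average."""
    centroids = list(points[:k])  # simplification: start from the first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical house locations as (x, y) coordinates.
houses = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.5, 8.0)]
centroids, clusters = k_means(houses, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

The two groups that emerge correspond to two neighbourhoods, which is the kind of distribution pattern the slide refers to.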
Necessity Is The Mother Of Invention: Data explosion problem: automated data collection tools and mature database technology lead to huge amounts of data being accumulated. We are drowning in data, but starving for knowledge! Solution: data warehousing and data mining. Data warehousing and on-line analytical processing; mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.
Evolution Of Database Technology: 1960s: data collection, database creation, IMS and network DBMS. 1970s: relational data model, relational DBMS implementation. 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), application-oriented DBMS (spatial, scientific, engineering, etc.).
Evolution Of Database Technology: 1990s: data mining, data warehousing, multimedia databases, and Web databases. 2000s: stream data management and mining, data mining with a variety of applications, Web technology and global information systems.
Potential Applications: Data analysis and decision support: market analysis and management; risk analysis and management; fraud detection and detection of unusual patterns. Other applications: text mining (email, documents) and Web mining; stream data mining; DNA and bio-data analysis.
Fraud Detection & Mining Unusual Patterns: Applications: health care, retail, credit card service, telecommunications. Auto insurance: ring of collisions. Money laundering: suspicious monetary transactions. Medical insurance: professional patients, ring of doctors, and ring of references; unnecessary or correlated screening tests. Telecommunications: phone-call fraud. Phone call model: destination of the call, duration, time of day or week; analyze patterns that deviate from an expected norm. Retail industry: analysts estimate that 38% of retail shrink is due to dishonest employees. Anti-terrorism. Approaches: clustering, model construction, outlier analysis, etc.
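The idea of flagging calls that deviate from an expected norm can be sketched with a simple z-score test on call duration. This is an assumed, simplified model: the durations are invented, only one attribute (duration) is used, and the threshold of 3 standard deviations is a common rule of thumb rather than anything prescribed by the course.

```python
from statistics import mean, stdev

# Hypothetical call durations (minutes) forming a customer's normal profile.
history_minutes = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5, 3.0]
# New calls to be checked against that profile.
new_calls = [4.0, 95.0, 2.5]

def flag_unusual(history, candidates, threshold=3.0):
    """Return candidates whose z-score against the history exceeds threshold."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation of the normal profile
    return [c for c in candidates if abs(c - mu) / sigma > threshold]

suspicious = flag_unusual(history_minutes, new_calls)
print(suspicious)
```

A fuller phone call model would also consider the destination and the time of day or week, as the slide notes.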
Other Applications: Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat. Internet Web Surf-Aid: IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior, helping to analyze the effectiveness of Web marketing, improve Web site organization, etc.
What is a Data Warehouse? Defined in many different ways, but not rigorously. A decision support database that is maintained separately from the organization’s operational database. Supports information processing by providing a solid platform of consolidated, historical data for analysis. “A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process” (Bill Inmon).
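To make "consolidated, historical data for analysis" concrete, the sketch below rolls up a few sales facts by year and product, the kind of summary a warehouse is built to answer. The fact layout and the figures are hypothetical, and a real warehouse would do this inside the database rather than in application code.

```python
from collections import defaultdict

# Hypothetical historical sales facts: (ISO date, product, amount).
sales_facts = [
    ("2022-11-03", "bread", 120.0),
    ("2022-12-15", "cheese", 200.0),
    ("2023-01-20", "bread", 150.0),
    ("2023-02-11", "bread", 90.0),
    ("2023-02-28", "cheese", 210.0),
]

def roll_up_by_year(facts):
    """Aggregate amounts from individual days up to (year, product) totals."""
    totals = defaultdict(float)
    for date, product, amount in facts:
        year = date[:4]  # the year is the first four characters of an ISO date
        totals[(year, product)] += amount
    return dict(totals)

yearly = roll_up_by_year(sales_facts)
for (year, product), total in sorted(yearly.items()):
    print(year, product, total)
```

Because the facts keep their dates (time-variant) and are never overwritten (non-volatile), the same data can be summarized again at another level, such as by month.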
The source of Invention of DW and Data Mining: Data explosion problem: automated data collection tools and mature database technology lead to huge amounts of data being accumulated. We are drowning in data, but starving for knowledge! Solution: data warehousing and data mining. Data warehousing and on-line analytical processing; mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.
Drowning In Data, Starving For Knowledge [Figure labels: DATA, KNOWLEDGE]
Importance of Data Mining: By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles. The discovered knowledge can be applied to the decision-making process.