Data Mining
Edward, Hong Zhang
CS Dept, SUNY Albany
CSI 668, March 20, 2001

Presentation Outline
- Motivation
- Background (KDD Process)
- What’s Data Mining?
- Why Data Mining?
- The Data Mining Process
- Data Mining Algorithms
- Data Mining Research Trend
- Existing Systems for Data Mining
- Conclusions

Motivation
“Necessity is the mother of invention”
- Data explosion problem: automated data collection tools, increasingly cheap storage devices, and mature database technology have led to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
- We are drowning in data, but starving for knowledge!
- Data is everywhere. Understanding and using it is an imminent task!
- Solution: Knowledge Discovery (data warehousing and data mining)

Evolution of Database Technology
- 1960s-1970s: Data collection, database creation, IMS and network DBMS.
- 1970s-1980s: Relational data model, relational DBMS implementation.
- 1980s-1990s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
- 1990s-present: Data mining and data warehousing, multimedia databases, and Web-based database technology.

Background
- Knowledge Discovery (KD): the process of finding general patterns/principles that summarize/explain a set of "observations".
- Knowledge Discovery in Databases (KDD): Very Large DataBases (VLDB) have become the industry standard, making it impossible for human beings to mine the data "by hand" for interesting patterns. Automated tools are therefore needed to help extract these patterns.

Background Cont.
Knowledge discovery in databases (KDD) consists of 3 steps:
1. Data Integration (Data Warehousing): collecting the target data observations from the different data sources, removing noise from the observations, and integrating them into an appropriate format.
2. Data Mining (covered in detail later): applying a concrete algorithm to find useful and novel patterns in the integrated data.

Background Cont.
3. Pattern Evaluation: interpreting mined patterns, evaluating them according to usefulness/interestingness criteria, and possibly using visualization tools to aid in understanding the patterns graphically.
See the KDD process graph below:

Data Mining: KDD process
Data mining: the core of the knowledge discovery process.
[Figure: KDD process pipeline. Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Knowledge]

What Is Data Mining?
Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information (knowledge) or patterns from data in large databases, data warehouses, or other information repositories.
What is not data mining?
- (Deductive) query processing
- Expert systems or machine learning/statistical programs
- Online Analytical Processing (OLAP)
- Software agents

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by contributing disciplines: Database and OLAP, Visualization, Machine Learning (AI), High Performance Computing, Pattern Recognition, Statistics, Modeling, Information Science]

Why Data Mining? – Potential Applications
- Database analysis and decision support systems (DSS)
  - Market analysis and management: target marketing, customer relationship management, market basket analysis, cross selling, market segmentation.
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis.
- Text mining (text databases, documents): keyword search and analysis.
- DNA sequence analysis and gene expression.

Data Mining and Business Intelligence
[Figure: pyramid whose potential to support business decisions increases toward the top. Layers, top to bottom, with the roles shown beside them:]
- Making Decisions (End User)
- Useful Pattern, Visualization Techniques (Business Analyst)
- Data Mining (Data Analyst)
- Data Exploration: Statistical Analysis, Querying and Reporting (Data Analyst)
- Data Warehouses / Data Marts: OLAP, MDA (DBA)
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP (DBA)

Why Data Mining? – Potential Applications (Cont.)
- Internet: Web Surf-Aid (Web mining)
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
- Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat.

The Data Mining Process
[Figure: the data mining process. Historical training data is fed to a data mining algorithm (training), which produces a model; the model is evaluated and then used to score new data (prediction), yielding results and patterns.]

Examples of “Discovered” Patterns
- Association rules: find rules between different attributes.
  Example: 98% of AOL users also have eBay accounts.
- Classification: classify data based on the values of a classifying attribute.
  Example: people younger than 40 with a salary over $40,000 trade on-line.
- Clustering: group data to form new classes.
  Example: users A and B access similar URLs, so they belong to the same group, whose members have similar user profiles.

Are All the “Discovered” Patterns Interesting?
- A data mining system/query may generate thousands of patterns, and not all of them are interesting.
- Suggested approach: query-based, focused mining.
- Interestingness measures: a pattern is interesting if it is
  - easily understood by humans
  - valid on new or test data with some degree of certainty
  - potentially useful
  - novel, or validates some hypothesis that a user seeks to confirm

How can we Find All and Only Interesting Patterns?
- Find all the interesting patterns: completeness. Can a data mining system find all the interesting patterns?
- Search only interesting patterns: optimization. Can a data mining system find only the interesting patterns?
- Approaches:
  - First generate all the patterns and then filter out the uninteresting ones.
  - Generate only the interesting patterns (mining query optimization).

Data Mining Algorithms
Four common DM algorithm types:
- The k-Nearest Neighbor Algorithm (KNN)
- Artificial Neural Network (ANN)
- Rule Induction
- Decision Trees

The k-Nearest Neighbor Algorithm (KNN)
- A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset.
- Uses the entire training database as the model.
- Find the nearest data point(s) and do the same thing as you did for those records.
[Figure: scatter of "+" and "-" training points around a query point xq]

The k-Nearest Neighbor Algorithm (KNN) (Cont.)
Distance-weighted nearest neighbor algorithm:
- Weight the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors.
- Calculate the mean values of the k nearest neighbors.
Advantages:
- Robust to noisy data, because it averages over the k nearest neighbors.
- Very easy to implement.
Disadvantages:
- Huge models (the entire training database).
- More difficult to use in production.
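A minimal sketch of the distance-weighted vote described above. The slide does not preserve its weighting formula, so the inverse squared distance used here is an assumption (a common choice), and the function name `weighted_knn` and the toy points are illustrative only.

```python
import math
from collections import defaultdict

def weighted_knn(train, query, k=3):
    """Classify `query` by a distance-weighted vote of its k nearest neighbours.

    `train` is a list of (point, label) pairs; points are equal-length tuples.
    """
    # The whole training set is the model: rank every record by distance to the query.
    by_distance = sorted((math.dist(point, query), label) for point, label in train)
    votes = defaultdict(float)
    for dist, label in by_distance[:k]:
        if dist == 0.0:
            return label  # exact match: reuse that record's class
        # Assumed weighting: inverse squared distance, so closer neighbours count more.
        votes[label] += 1.0 / dist ** 2
    return max(votes, key=votes.get)

# Toy data: three "+" points near the origin, three "-" points further out.
train = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.0), "+"),
         ((5.0, 5.0), "-"), ((6.0, 5.5), "-"), ((5.5, 6.5), "-")]
print(weighted_knn(train, (2.0, 2.0), k=3))
print(weighted_knn(train, (5.0, 6.0), k=3))
```

Because every prediction scans the full training list, the sketch also shows why the model is "huge": nothing is summarized at training time.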

Artificial Neural Network (ANN) Algorithm
- Non-linear predictive models that learn through training and loosely resemble biological neural networks in structure.
- Inputs are transformed through a network of simple processors.
- Each processor combines (weighted) inputs and produces an output value.

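Artificial neural networks (Cont.)
[Figure: a single neuron. The input vector x = (x0, x1, ..., xn) and weight vector w = (w0, w1, ..., wn) are combined in a weighted sum (Σ), offset by the term -μk (labelled "Learning Rate" in the figure), and passed through an activation function f to give the output y.]
The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping.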
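The mapping from x to y can be written out as a small sketch. It assumes the neuron computes y = f(w . x - mu), where `mu` stands for the offset term that the figure subtracts from the weighted sum. The choice of tanh as the activation function and all numeric values are illustrative assumptions, not taken from the slides.

```python
import math

def neuron(x, w, mu, f=math.tanh):
    """Map input vector x to output y = f(w . x - mu)."""
    # Scalar product of the weight vector and the input vector.
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    # Subtract the offset term, then apply the nonlinear activation function.
    return f(weighted_sum - mu)

x = [1.0, 0.5, -1.0]   # input vector (illustrative)
w = [0.4, -0.2, 0.1]   # weight vector (illustrative)
y = neuron(x, w, mu=0.2)
print(y)
```

A multilayer network repeats this computation: the outputs of one layer of neurons become the input vector of the next.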

Multilayer Perceptron Artificial Neural Network
[Figure: layered network. The input vector xi enters at the input nodes, passes through the hidden nodes, and reaches the output nodes, which produce the output vector.]

Artificial Neural Network evaluation
Advantages:
- Prediction accuracy is generally high.
- Robust: still works when training examples contain errors.
Disadvantages:
- Key problem: the neural network model is difficult to understand, and there is no intuitive understanding of results.
- Long training time: although prediction is very quick after training, the training process itself is time-consuming.
- Significant pre-processing of data is often required.

Rule Induction
Rule induction (rule-based prediction): we first generate a set of rules from a data warehouse, then use them to predict values for new data items. It works much better on larger (and real) data sets, not just on samples of data.
Two phases:
- Rule discovery: analyze a historical database and generate a set of rules by automatic discovery.
- Prediction: apply the rules to a new data set and match the rules to make predictions.

Rule Induction Example
Training Set

Rule Induction Example (Cont.)
4 attributes:
- Outlook: sunny, overcast, rainy (3 cases)
- Temperature: hot, mild, cool (3 cases)
- Humidity: high, normal (2 cases)
- Windy: true, false (2 cases)
1 outcome: class (N: no class, P: have class)
In total there are 3*3*2*2 = 36 possible combinations, of which 14 are present in the set of input examples.

Rule Induction Example (Cont.)
Some rules induced from the above dataset:
Classification rules:
- If outlook = sunny and humidity = high then class = N
- If outlook = rainy and windy = true then class = N
- If outlook = overcast then class = P
Association rules:
- If temperature = cool then humidity = normal
- If windy = false and class = N then outlook = sunny and humidity = high
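The prediction phase can be illustrated by matching the rules above against a training set. The training table is not reproduced in this transcript, so the sketch assumes it is the standard 14-row weather data from Quinlan's ID3 example, which matches the attribute values and the count of 14 examples given earlier. The rows in `RAW` and the names `evaluate` and `RULES` are assumptions made for illustration.

```python
FIELDS = ("outlook", "temperature", "humidity", "windy", "class")

# Assumed training set: the standard 14-row weather data (not shown in the slides).
RAW = """\
sunny hot high false N
sunny hot high true N
overcast hot high false P
rainy mild high false P
rainy cool normal false P
rainy cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rainy mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rainy mild high true N"""
DATA = [dict(zip(FIELDS, line.split())) for line in RAW.splitlines()]

def evaluate(rule_if, rule_then, data=DATA):
    """Return (coverage, confidence) of the rule IF rule_if THEN rule_then."""
    # Rows that satisfy every condition of the IF part.
    covered = [r for r in data if all(r[a] == v for a, v in rule_if.items())]
    # Of those, rows that also satisfy the THEN part.
    correct = [r for r in covered if all(r[a] == v for a, v in rule_then.items())]
    confidence = len(correct) / len(covered) if covered else 0.0
    return len(covered), confidence

RULES = [
    ({"outlook": "sunny", "humidity": "high"}, {"class": "N"}),
    ({"outlook": "rainy", "windy": "true"}, {"class": "N"}),
    ({"outlook": "overcast"}, {"class": "P"}),
    ({"temperature": "cool"}, {"humidity": "normal"}),
    ({"windy": "false", "class": "N"}, {"outlook": "sunny", "humidity": "high"}),
]
for rule_if, rule_then in RULES:
    print(rule_if, "=>", rule_then, evaluate(rule_if, rule_then))
```

Coverage is the number of examples a rule applies to, and confidence is the fraction of those examples for which its conclusion holds.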

What is a decision tree?
- A decision tree is a flow-chart-like tree structure.
  - An internal node denotes a test on an attribute.
  - A branch represents an outcome of the test; all tuples in a branch have the same value for the tested attribute.
  - A leaf node represents a class label or class label distribution.
- A series of nested if/then rules.
- Understandable!

A Sample Decision Tree
Built from the same training set as the rule induction example.
Outlook
  sunny -> humidity
    high -> N
    normal -> P
  overcast -> P
  rain -> windy
    true -> N
    false -> P

Another Example for DT
- If x=1 and y=0 then class = a
- If x=0 and y=1 then class = b
- If x=1 and y=1

Another Example for DT
Credit Analysis
salary < 20000
  yes -> education in graduate
    yes -> accept
    no -> reject
  no -> accept

Decision-Tree Classification Methods
The basic top-down decision tree generation approach usually consists of two phases:
- Tree construction
  - At the start, all the training examples are at the root.
  - Partition the examples recursively based on selected attributes.
- Tree pruning
  - Aims at removing tree branches that may lead to errors when classifying test data (training data may contain noise, statistical fluctuations, …).

How to construct a tree?
Algorithm: a greedy algorithm
- Make the locally optimal choice at each step: select the best attribute for each tree node.
- Proceed in a top-down, recursive, divide-and-conquer manner:
  - from root to leaf
  - split a node into several branches
  - for each branch, recursively run the algorithm
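A minimal sketch of the greedy, top-down construction. The slides do not say how the "best" attribute is chosen; information gain (as in ID3) is assumed here. The training rows are again assumed to be the standard 14-row weather data, and the names `entropy`, `information_gain`, and `build_tree` are illustrative.

```python
import math
from collections import Counter

FIELDS = ("outlook", "temperature", "humidity", "windy", "class")
# Assumed training set: the standard 14-row weather data.
RAW = """\
sunny hot high false N
sunny hot high true N
overcast hot high false P
rainy mild high false P
rainy cool normal false P
rainy cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rainy mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rainy mild high true N"""
DATA = [dict(zip(FIELDS, line.split())) for line in RAW.splitlines()]

def entropy(rows):
    """Shannon entropy (in bits) of the class labels in `rows`."""
    counts = Counter(r["class"] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def information_gain(rows, attr):
    """Entropy reduction obtained by splitting `rows` on `attr`."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def build_tree(rows, attrs):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    classes = Counter(r["class"] for r in rows)
    if len(classes) == 1 or not attrs:
        return classes.most_common(1)[0][0]  # leaf: the (majority) class
    # Greedy step: pick the attribute that looks best right now.
    best = max(attrs, key=lambda a: information_gain(rows, a))
    rest = [a for a in attrs if a != best]
    # Divide and conquer: one branch per observed value, built recursively.
    return {best: {v: build_tree([r for r in rows if r[best] == v], rest)
                   for v in sorted({r[best] for r in rows})}}

tree = build_tree(DATA, ["outlook", "temperature", "humidity", "windy"])
print(tree)
```

The choice at each node is never revisited, which is what makes the procedure greedy: it is fast, but the resulting tree is not guaranteed to be the smallest one consistent with the data.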

How to prune a tree
- A decision tree constructed using the training data may have too many branches/leaf nodes.
  - Caused by noise, overfitting.
  - May result in poor accuracy for unseen samples.
- Prune the tree: merge a subtree into a leaf node.
  - Use a set of data different from the training data.
  - At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node and label it with the majority class.
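The pruning rule above can be sketched as a bottom-up pass over the tree. The node layout (a dict with `attr`, `majority`, and `branches` keys, where `majority` is the majority class of the training examples at that node), the function names, and the toy tree and validation rows are all assumptions made for this illustration.

```python
def classify(tree, row):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        # Unseen attribute values fall back to the node's majority class.
        tree = tree["branches"].get(row[tree["attr"]], tree["majority"])
    return tree

def accuracy(tree, rows):
    return sum(classify(tree, r) == r["class"] for r in rows) / len(rows)

def prune(tree, rows):
    """Prune bottom-up using `rows`, data that was not used for training."""
    if not isinstance(tree, dict) or not rows:
        return tree  # a leaf, or no evidence reaching this node
    attr = tree["attr"]
    # Prune the children first, each on the validation rows that reach it.
    tree = dict(tree, branches={
        value: prune(sub, [r for r in rows if r[attr] == value])
        for value, sub in tree["branches"].items()})
    leaf = tree["majority"]
    # Replace the subtree if not splitting is more accurate than splitting.
    if accuracy(leaf, rows) > accuracy(tree, rows):
        return leaf
    return tree

# Toy tree whose "true" branch has overfitted to noise in the temperature attribute.
tree = {"attr": "windy", "majority": "P", "branches": {
    "false": "P",
    "true": {"attr": "temperature", "majority": "N",
             "branches": {"hot": "N", "mild": "P", "cool": "N"}}}}
validation = [
    {"windy": "true", "temperature": "mild", "class": "N"},
    {"windy": "true", "temperature": "hot", "class": "N"},
    {"windy": "true", "temperature": "cool", "class": "N"},
    {"windy": "false", "temperature": "mild", "class": "P"},
    {"windy": "false", "temperature": "hot", "class": "P"},
]
pruned = prune(tree, validation)
print(pruned)
```

In this toy case the temperature subtree is merged into a single leaf, while the root split on windy is kept because it classifies the validation rows better than a single leaf would.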

How to use a tree?
Directly:
- Test the attribute values of the unknown sample against the tree.
- A path is traced from the root to a leaf, which holds the label.
Indirectly:
- The decision tree is converted to classification rules.
- One rule is created for each path from the root to a leaf.
- IF-THEN rules are easier for humans to understand.
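Both uses can be sketched on the sample weather tree shown earlier. The nested-dict encoding `{attribute: {value: subtree}}` and the function names `classify` and `to_rules` are assumptions made for illustration.

```python
# The sample decision tree from the slides, encoded as nested dicts (assumed layout).
TREE = {"outlook": {
    "sunny": {"humidity": {"high": "N", "normal": "P"}},
    "overcast": "P",
    "rain": {"windy": {"true": "N", "false": "P"}},
}}

def classify(tree, sample):
    """Direct use: trace a path from the root to a leaf and return its label."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[sample[attr]]
    return tree

def to_rules(tree, conditions=()):
    """Indirect use: yield one IF-THEN rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        yield "IF " + (" AND ".join(conditions) or "TRUE") + f" THEN class = {tree}"
        return
    attr, branches = next(iter(tree.items()))
    for value, subtree in branches.items():
        yield from to_rules(subtree, conditions + (f"{attr} = {value}",))

sample = {"outlook": "sunny", "humidity": "normal", "windy": "true"}
print(classify(TREE, sample))
for rule in to_rules(TREE):
    print(rule)
```

The tree has five leaves, so the conversion produces five rules, each readable on its own without following the tree structure.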

Decision tree for a covering algorithm

Data Mining Algorithm Summary
- KNN: quick and easy; models tend to be very large.
- ANN: difficult to interpret; can require significant amounts of time to train.
- Rule Induction: understandable; need to limit calculations.
- Decision Trees: understandable; relatively fast.
Other DM technologies:
- Genetic algorithms
- Rough sets
- Bayesian networks
- Mixture models
- Many more...

Data Mining Research Trend
- Text mining: text databases and information retrieval
- Multimedia data mining
- OLAM (OLAP Mining)
- Web mining (Data Mining and WWW)
  - E-commerce
  - Information retrieval (search)
  - Network management

Why Mine the Web?
- Web: a huge, widely-distributed, highly heterogeneous, semi-structured, hypertext/hypermedia, interconnected, evolving information repository.
- The Web is a huge collection of documents plus
  - Hyper-link information
  - Access and usage information
- Enormous wealth of information on the Web
  - Financial information (e.g. stock quotes)
  - Book/CD/Video stores (e.g. Amazon)
  - Restaurant information (e.g. Zagats)
  - Car prices (e.g. Carpoint)
- Lots of data on user access patterns
  - Web logs contain sequences of URLs accessed by users

Why is Web Mining Different?
- Huge: the Web is a huge collection of documents plus hyper-link information and access and usage information.
- Dynamic: the Web is very dynamic; new pages are constantly being generated.
- Unstructured: the complexity of Web pages is far greater than that of a text document collection.
- Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to
  - exploit hyper-links and access patterns
  - be incremental

Types of Web Mining
[Figure: taxonomy of Web mining]
- Web Content Mining
  - Web Page Content Mining
  - Search Result
- Web Structure Mining
- Web Usage Mining
  - General Access Pattern Tracking
  - Customized Usage Tracking

Web Mining Applications
- E-commerce (Infrastructure)
  - Generate user profiles
  - Targeted advertising
  - Fraud detection
  - Similar image retrieval
- Information retrieval (Search) on the Web
  - Automated generation of topic hierarchies
  - Web knowledge bases
  - Extraction of schema for XML documents
- Network Management
  - Performance management
  - Fault management

Existing Systems for Data Mining
- IBM: Intelligent Miner
- SAS Institute: Enterprise Miner
- Silicon Graphics: MineSet
- Integral Solutions Ltd.: Clementine
- Information Discovery Inc.: Data Mining Suite
- DBMiner Technology Inc.: DBMiner
- Rutgers: DataMine
- GMD: Explora
- Univ. Munich: VisDB

Microsoft OLE DB for Data Mining
- Microsoft OLE, OLE DB, OLE DB for OLAP, and OLE DB for Data Mining
- OLE DB for DM: standardization from July 1999 to March 2000
- Microsoft SQL Server 2000: Analysis Manager
  - Analysis Manager consists of OLAP and Data Mining
  - Data mining: two modules (classification/prediction and clustering)
- OLE DB for DM: data mining providers (such as association modules and other classification or clustering modules)

Research Progress for Data Mining in the Last Decade
- Multi-dimensional data analysis: data warehouse and OLAP (on-line analytical processing)
- Association, correlation, and causality analysis
- Classification: scalability and new approaches
- Clustering and outlier analysis
- Sequential patterns and time-series analysis
- Text mining, Web mining and Weblog analysis
- Spatial, multimedia, scientific data analysis
- Data preprocessing and database compression
- Data visualization and visual data mining

Conclusions
- Knowledge Discovery in Databases (KDD)
- Data warehouse: an industry trend. A DW stores a huge amount of subject-oriented, cleansed, integrated, consolidated, time-related data.
- Data mining: a rich, promising, young field with broad applications and many challenging research issues.
- Good science: leading position in research community

Conclusions (Cont.)
- Data mining tasks: characterization, association, classification, clustering, prediction, sequence and pattern analysis, etc.
- Data mining algorithms:
  - The k-Nearest Neighbor Algorithm (KNN)
  - Artificial Neural Network (ANN)
  - Rule Induction
  - Decision Trees
- Research progress and trends in data mining

Future Work
- Theoretical foundations of data mining.
- Implementation and new data mining methodologies:
  - A set of well-tuned, standard mining operators.
  - Data and knowledge visualization tools.
  - Integration of multiple data mining strategies.
- Data mining in advanced information systems: spatial, multimedia, Web mining.
- Data mining applications: content browsing, query optimization, multi-resolution model, etc.
- Social issues: a threat to security and privacy.
