1 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a CS411a/CS538a v New rooms: Monday: UC 224; Friday: UC 30 v Office hours of Dr. Ling, :

1 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a CS411a/CS538a v New rooms: Monday: UC 224; Friday: UC 30 v Office hours of Dr. Ling, : M,F: 3-5 pm. MC 366 v Course web: www.csd.uwo.ca/faculty/ling/cs411a

2 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Data Mining: Concepts and Techniques Lecture Notes for CS411A/538A Acknowledgements Adapted from Lecture Notes of Professor Jiawei Han http://www.cs.sfu.ca/~han

3 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Course Outline v Introduction v Data warehousing and OLAP v Data preparation and preprocessing v Concept description: characterization and discrimination v Classification and prediction v Association analysis v Clustering analysis v Mining complex data and advanced mining techniques  Trends and research issues

4 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a I. Introduction: Outline v Motivation: Why data mining? v What is data mining? v Data Mining: On what kind of data? v Data mining functionality: v Are all the patterns interesting? v A classification of data mining systems v Requirements and challenges of data mining.

5 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Motivation: “Necessity is the Mother of Invention” v Data explosion problem: U Automated data collection tools, mature database technology, and affordable hardware lead to tremendous amounts of data stored in databases, data warehouses and other information repositories. v We are drowning in data, but starving for knowledge! U On-line analytical processing –Data warehouses U Data mining of interesting knowledge (rules, regularities, patterns, constraints) from large databases.

6 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a We are Data Rich but Information Poor Data Mining can help discover knowledge Terrorbytes of data

7 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Evolution of Database Technology v 1960s: U Data collection, database creation, IMS and network DBMS. v 1970s: U Relational data model, relational DBMS implementation. v 1980s: to present U RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented (spatial, scientific, engineering, etc.). v 1990s: to present U Data mining and data warehousing, multimedia databases, and Web technology. v 2000: New generation of information systems

8 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Why Data Mining? — Potential Applications v Database analysis and decision support U Market analysis and management –target marketing, customer relation management, market basket analysis, cross selling, market segmentation. U Risk analysis and management –Forecasting, customer retention, improved underwriting, quality control, competitive analysis. U Fraud detection and management v Other Applications: U Text mining (news group, email, documents) and Web analysis. U Intelligent query answering U E-Commerce

9 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Market Analysis and Management v Where are the data sources for analysis? U Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies. v Target (direct) marketing: U Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. v Determine customer purchasing patterns over time: U Time-series analysis v Cross-market analysis U Associations/co-relations between product sales U Prediction based on the association information.

10 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Market Analysis and Management (2) v Customer profiling U data mining can tell you what types of customers buy what products (clustering or classification). v Identifying customer requirements U identifying the best products for different customers U predic what factors will attract new customers v Provides summary information U various multidimensional summary reports; U statistical summary information (data central tendency and variation)

11 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Corporate Analysis and Risk Management v Finance planning and asset evaluation: U cash flow analysis and prediction U risk assessment and portfolio analysis U time series analysis (trend analysis, stock market) v Resource planning: U summarize and compare the resources and spending v Discretionary pricing: U group customers into classes and a class-based pricing procedure.

12 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Fraud Detection and Management v Applications: U widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. v Approach: U use data mining on historical data to build models of fraudulent behavior to help identify similar instances. v Examples: U auto insurance: detect a group of people who stage accidents to collect on insurance U money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network) U medical insurance: detect professional patients and ring of doctors and ring of references

13 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Fraud Detection and Management (2) v More examples: U Detecting inappropriate medical treatment: –Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr). U Detecting telephone fraud: –Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm. –British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud..

14 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Other Applications v Sports U IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat. v Astronomy U JPL and the Palomar Observatory discovered 22 quasars with the help of data mining v Internet Web Surf-Aid U IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

15 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Data Mining Should Not be Used Blindly! v Data mining find regularities from history, but history is not the same as the future. U The no-free-lunch theorem v Association does not dictate trend nor causality!? U Drink diet drinks lead to obesity! U David Heckerman’s counter-example (1997): –Barbecue sauce, hot dogs and hamburgers. v Some abnormal data could be caused by human! U Missing data: unknown, unimportant, irrelevant, … U Data can lie, data cannot tell

16 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a I. Introduction v Motivation: Why data mining? v What is data mining? v Data Mining: On what kind of data? v Data mining functionality v Are all the patterns interesting? v A classification of data mining systems v Requirements and challenges of data mining.

17 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a What Is Data Mining? v Data mining (knowledge discovery in databases): U Extraction of interesting ( non-trivial, implicit, previously unknown and potentially useful) information from data in large databases v Alternative names and their “inside stories”: U Data mining: a misnomer? U Knowledge discovery in databases (KDD: SIGKDD), knowledge extraction, data archeology, data dredging, information harvesting, business intelligence, etc. v What is not data mining? U (Deductive) query processing. U Expert systems or small ML/statistical programs

18 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Data Mining: A KDD Process U Data mining: the core of knowledge discovery process. Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation

19 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Steps of a KDD Process v Learning the application domain: U relevant prior knowledge and goals of application v Creating a target data set: data selection v Data cleaning and preprocessing: (may take 60% of effort!) v Data reduction and projection: U Find useful features, dimensionality/variable reduction, invariant representation. v Choosing functions of data mining U summarization, classification, regression, association, clustering. v Choosing the mining algorithm(s) v Data mining: search for patterns of interest v Interpretation: analysis of results. U visualization, transformation, removing redundant patterns, etc. v Use of discovered knowledge

20 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Data Mining: Confluence of Multiple Disciplines v Database systems, data warehouse and OLAP v Statistics v Machine learning v Visualization v Information science v High performance computing v Other disciplines: U Neural networks, mathematical modeling, information retrieval, pattern recognition, etc.

22 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Data Mining: On What Kind of Data? v Relational databases v Data warehouses v Transactional databases v Advanced DB systems and information repositories U Object-oriented and object-relational databases U Spatial databases U Time-series data and temporal data U Text databases and multimedia databases U Heterogeneous and legacy databases U WWW

24 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a

25 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Data Mining Functionality v Concept description: Characterization and discrimination: U Generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions. v Association: U association, correlation, or causality. U Measurement: support and confidence v Classification and Prediction: U Classify data based on the values in a classifying attribute, e.g., classify countries based on climate, or classify cars based on gas mileage. U Predict some unknown or missing attribute values based on other information.

26 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Data Mining Functionality (Cont.) v Cluster analysis: U Group data to form new classes, e.g., cluster houses to find distribution patterns. v Outlier and exception data analysis: v Time-series analysis (trend and deviation): U Similarity-based pattern-directed analysis: U Full vs. partial periodicity analysis:

28 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Are All the “Discovered” Patterns Interesting? v A data mining system/query may generate thousands of patterns, not all of them are interesting. U Suggested approach: Human-centered, query-based, focused mining v Interestingness measures: A pattern is interesting if it is U easily understood by humans; small in size; how to represent U valid on new or test data with some degree of certainty. U potentially useful U novel, or validates some hypothesis that a user seeks to confirm v Objective vs. subjective interestingness measures: U Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. U Subjective: based on user’s beliefs in the data, e.g., unexpectedness, novelty, etc.

29 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Can It Find All and Only Interesting Patterns? v Find all the interesting patterns: Completeness. U Can a data mining system find all the interesting patterns? v Search for only interesting patterns: Optimization. U Can a data mining system find only the interesting patterns? U Approaches –First general all the patterns and then filter out the uninteresting ones. –Generate only the interesting patterns --- mining query optimization

31 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Data Mining: Classification Schemes v General functionality: U Descriptive data mining U Predictive data mining v Different views, different classifications: U Kinds of knowledge to be discovered, U Kinds of databases to be mined, and U Kinds of techniques adopted.

32 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Three Schemes in Classification v Knowledge to be mined: U Summarization (characterization), comparison, association, classification, clustering, trend, deviation and pattern analysis, etc. U Mining knowledge at different abstraction levels: primitive level, high level, multiple-level, etc. v Databases to be mined: U Relational, transactional, object-oriented, object- relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, etc. v Techniques adopted: U Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.

34 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Issues and Challenges in Data Mining v Mining methodology issues U Mining different kinds of knowledge in databases. U Interactive mining of knowledge at multiple levels of abstraction. U Incorporation of background knowledge U Data mining query languages and ad-hoc data mining. U Expression and visualization of data mining results. U Handling noise and incomplete data U Pattern evaluation: the interestingness problem. v Performance issues: U Efficiency and scalability of data mining algorithms. U Parallel, distributed and incremental mining methods.

35 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Issues and Challenges in Data Mining (Cont.) v Issues relating to the variety of data types: U Handling relational and complex types of data U Mining information from heterogeneous databases and global information systems. v Issues related to applications and social impacts: U Application of discovered knowledge. –Domain-specific data mining tools –Intelligent query answering –Process control and decision making. U Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem. U Protection of data security and integrity.

36 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a A Brief History of Data Mining v 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro) U Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991) v 1991-1994 Workshops on Knowledge Discovery in Databases U Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996) v 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98) U Journal of Data Mining and Knowledge Discovery (1997) v 1998 ACM SIGKDD and SIGKDD’99 conference

37 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a Where to Find References? v Data mining and KDD (SIGKDD member CDROM): U Conference proceedings: KDD, and others, such as PKDD, PAKDD, etc. U Journal: Data Mining and Knowledge Discovery v KDNuggets, companies, positions, general resources U http://www.kdnuggets.com/ v Machine Learning People Home Pages U http://www.aic.nrl.navy.mil/~aha/ v Best AI textbook, AI Resources on the Web U http://www.cs.berkeley.edu/~russell/aima.html v Datasets U http://www.ics.uci.edu/AI/ML/Machine-Learning.html U KDD-98 cup: http://www.epsilon.com/new/1datamining.html v Postscript papers search engine U ML papers: http://gubbio.cs.berkeley.edu/mlpapers/ U CS papers: http://www.cora.justresearch.com/

1 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a CS411a/CS538a v New rooms: Monday: UC 224; Friday: UC 30 v Office hours of Dr. Ling, :

Similar presentations

Presentation on theme: "1 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a CS411a/CS538a v New rooms: Monday: UC 224; Friday: UC 30 v Office hours of Dr. Ling, :"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a CS411a/CS538a v New rooms: Monday: UC 224; Friday: UC 30 v Office hours of Dr. Ling, :

Similar presentations

Presentation on theme: "1 Copyright Jiawei Han. Modified by Charles Ling for CS411a/538a CS411a/CS538a v New rooms: Monday: UC 224; Friday: UC 30 v Office hours of Dr. Ling, :"— Presentation transcript:

Similar presentations

About project

Feedback