Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Mining: Concepts and Techniques www.ePowerPoint.com.

Similar presentations


Presentation on theme: "Data Mining: Concepts and Techniques www.ePowerPoint.com."— Presentation transcript:

1 Data Mining: Concepts and Techniques www.ePowerPoint.com

2 Course Objective Introduce the fundamental concepts of data mining Learning basic data mining principles and algorithms Develop the ability to solve problems with data mining techniques

3 Textbook J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3nd ed., 2012 武森,高学东, Bastian. 数据仓库与数据挖 掘. 北京:冶金工业出版社, 2003 武森,高学东, Bastian. 数据仓库与数据挖 掘. 北京:冶金工业出版社, 2003

4 Course Assessment Attendance + homework + project: 30% Final examination: 70%

5 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional View of Data Mining What Kind of Data Can Be Mined? What Kinds of Patterns Can Be Mined? What Technology Are Used? What Kind of Applications Are Targeted? Major Issues in Data Mining A Brief History of Data Mining and Data Mining Society Summary

6 Why Data Mining? The Explosive Growth of Data: from terabytes to petabytes Business: Web, e-commerce, transactions, stocks, … Science: Remote sensing, bioinformatics, scientific simulation, … Social media and everyone: search engine, news, blogs,social networks, YouTube, digital picturess&videos, from data age to information age: need powerful dada analysis tools database systems offering query and transaction processing as common practice. Data mining tools perform data analysis and extract the valuable knowledge embedded in the vast amount of data.

7 Why Data Mining?

8 Why Not Traditional Data Analysis? Tremendous amount of data  Algorithms must be highly scalable to handle such as tera- bytes of data High-dimensionality of data  may have tens of thousands of dimensions High complexity of data  Data streams and sensor data  Time-series data, temporal data, sequence data  Structure data, graphs, social networks and multi-linked data  Heterogeneous databases and legacy databases  Spatial, spatiotemporal, multimedia, text and Web data  Software programs, scientific simulations New and sophisticated applications

9 Why Now? Data is being produced Data is being warehoused The computing power is available The computing power is affordable The competitive pressures are strong Commercial products are available

10 Evolution of Database Technology 1960s:  database creation, IMS and network DBMS 1970s:  Relational data model, relational DBMS implementation 1980s:  RDBMS, advanced data models (extended-relational, OO, deductive, etc.)  Application-oriented DBMS (spatial, scientific, engineering, etc.) 1990s:  Data mining, data warehousing, multimedia databases, and Web databases 2000s  Stream data management and mining  Data mining and its applications  Web technology (XML, data integration) and global information systems

11 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional View of Data Mining What Kind of Data Can Be Mined? What Kinds of Patterns Can Be Mined? What Technology Are Used? What Kind of Applications Are Targeted? Major Issues in Data Mining A Brief History of Data Mining and Data Mining Society Summary

12 What can we get from data mining?

13 What Is Data Mining? Data mining (knowledge discovery from data)  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data

14 What Is Data Mining? Interesting patterns: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm

15 What Is Data Mining?

16 Data mining: a misnomer? Alternative names  Knowledge discovery in databases (KDD),  knowledge extraction,  data/pattern analysis,  data archeology,  data dredging,  information harvesting,  business intelligence, etc.

17 What Is not Data Mining? Watch out: Is everything “ data mining ” ?  Simple search and query processing  (Deductive) expert systems  OLAP based on data warehouse  statistical analysis system  Information system

18 Data mining context Levels of data analysis method hidden shallow surface simple database queries statistical analysis data mining

19 Database Processing vs. Data Mining Processing Query Well defined SQL Query Poorly defined No precise query language Data Data – Operational data Output Output – Precise – Subset of database Data Data – Not operational data Output Output – Fuzzy – Not a subset of database

20 Query Examples (查询实例对比) Database Data Mining – Find all customers who have purchased milk – Find all items which are frequently purchased with milk. (association rules) – Find all credit applicants with last name of Smith. – Identify customers who have purchased more than $10,000 in the last month. – Find all credit applicants who are poor credit risks. (classification) – Identify customers with similar buying habits. (Clustering)

21 Knowledge Discovery (KDD) Process  Data mining — core of knowledge discovery process Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation

22 Knowledge Discovery (KDD) Process step1 Data cleaning: to remove noise and inconsistent data step2 Data integration: where multiple data sources may be combined step3 Data selection: where data relevant to analysis task are retrieved from the database step4 Data mining: an essential process where intelligent methods are applied in order to extract data patterns,such as classification, clustering step5 Pattern evaluation: to identify the truly interesting patterns representing knowledge based on some interesting measures step6 Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user

23 Example: A Web Mining Framework Web mining usually involves  Data cleaning  Data integration from multiple sources  Warehousing the data  Data cube construction  Data selection for data mining  Data mining  Presentation of the mining results  Patterns and knowledge to be used or stored into knowledge-base

24 KDD Process: A Typical View from ML and Statistics Input Data Data Mining Data Pre- Processing Post- Processing This is a view from typical machine learning and statistics communities Data integration Normalization Feature selection Dimension reduction Pattern discovery Association & correlation Classification Clustering Outlier analysis … … Pattern evaluation Pattern selection Pattern interpretation Pattern visualization

25 Example: Medical Data Mining Health care & medical data mining – often adopted such a view in statistics and machine learning Preprocessing of the data (including feature extraction and dimension reduction) Classification or/and clustering processes Post-processing for presentation

26 Data Mining and Business Intelligence Increasing potential to support business decisions End User Business Analyst Data Analyst DBA Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems

27 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional View of Data Mining What Kind of Data Can Be Mined? What Kinds of Patterns Can Be Mined? What Technology Are Used? What Kind of Applications Are Targeted? Major Issues in Data Mining A Brief History of Data Mining and Data Mining Society Summary

28 Multi-Dimensional View of Data Mining Data to be mined  Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks Knowledge to be mined (or: Data mining functions)  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.  Descriptive vs. predictive data mining  Multiple/integrated functions and mining at multiple levels Techniques utilized  Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc. Applications adapted  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

29 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional View of Data Mining What Kind of Data Can Be Mined? What Kinds of Patterns Can Be Mined? What Technology Are Used? What Kind of Applications Are Targeted? Major Issues in Data Mining A Brief History of Data Mining and Data Mining Society Summary

30 Data Mining: On What Kinds of Data? In general,data mining should be applicable to any kind of data repository, as well as data streams Database-oriented data sets and applications  Relational database, 关系数据库  data warehouse, 数据仓库  transactional database 事务数据库 Advanced data sets and advanced applications  Data streams and sensor data  Time-series data, temporal data, sequence data (incl. bio- sequences)  Multimedia database  Text databases  The World-Wide Web

31 Data sets Data set concerning bridges in USA E13,A,33,CRAFTS,HIGHWAY,?,2,N,THROUGH,WOOD,?,S,WOOD E15,A,28,CRAFTS,RR,?,2,N,THROUGH,WOOD,?,S,WOOD E16,A,25,CRAFTS,HIGHWAY,MEDIUM,2,N,THROUGH,IRON,MEDIUM,S-F,SUSPEN E17,M,4,CRAFTS,RR,MEDIUM,2,N,THROUGH,IRON,MEDIUM,?,SIMPLE-T E18,A,28,CRAFTS,RR,MEDIUM,2,N,THROUGH,IRON,SHORT,S,SIMPLE-T E19,A,29,CRAFTS,HIGHWAY,MEDIUM,2,N,THROUGH,WOOD,MEDIUM,S,WOOD E20,A,32,EMERGING,HIGHWAY,MEDIUM,2,N,THROUGH,WOOD,MEDIUM,S,WOOD E21,M,16,EMERGING,RR,?,2,?,THROUGH,IRON,?,?,SIMPLE-T E23,M,1,EMERGING,HIGHWAY,MEDIUM,?,?,THROUGH,STEEL,LONG,F,SUSPEN E22,A,24,EMERGING,HIGHWAY,MEDIUM,4,G,THROUGH,WOOD,SHORT,S,WOOD E24,O,45,EMERGING,RR,?,2,G,?,STEEL,?,?,SIMPLE-T E25,M,10,EMERGING,RR,?,2,G,?,STEEL,?,?,SIMPLE-T E27,A,39,EMERGING,RR,?,2,G,THROUGH,STEEL,?,F,SIMPLE-T E26,M,12,EMERGING,RR,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,S,SIMPLE-T E30,A,31,EMERGING,RR,?,2,G,THROUGH,STEEL,MEDIUM,F,SIMPLE-T E29,A,26,EMERGING,HIGHWAY,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,?,SUSPEN E28,M,3,EMERGING,HIGHWAY,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,S,ARCH E32,A,30,EMERGING,HIGHWAY,?,2,G,THROUGH,IRON,MEDIUM,F,SIMPLE-T E31,M,8,EMERGING,RR,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,S,SIMPLE-T E34,O,41,EMERGING,RR,LONG,2,G,THROUGH,STEEL,LONG,F,SIMPLE-T E33,M,19,EMERGING,HIGHWAY,MEDIUM,?,G,THROUGH,IRON,MEDIUM,F,SIMPLE-T E36,O,45,MATURE,HIGHWAY,?,2,G,THROUGH,IRON,SHORT,F,SIMPLE-T E35,A,27,MATURE,HIGHWAY,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,F,SIMPLE-T E38,M,17,MATURE,HIGHWAY,?,2,G,THROUGH,IRON,MEDIUM,F,SIMPLE-T E37,M,18,MATURE,RR,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,S,SIMPLE-T E39,A,25,MATURE,HIGHWAY,?,2,G,THROUGH,STEEL,MEDIUM,F,SIMPLE-T E4,A,27,MATURE,AQUEDUCT,MEDIUM,1,N,THROUGH,WOOD,SHORT,S,WOOD E40,M,22,MATURE,HIGHWAY,?,2,G,THROUGH,STEEL,MEDIUM,F,SIMPLE-T E41,M,11,MATURE,HIGHWAY,?,2,G,THROUGH,IRON,MEDIUM,F,SIMPLE-T E42,M,9,MATURE,HIGHWAY,LONG,2,G,THROUGH,STEEL,LONG,F,SIMPLE-T format is simply comma separated values

32 Data sets Data set concerning geotechnical parameters format taken directly from a spreadsheet

33 Example Cappuccino coffee relation missing data attribute-value attribute as number or name note this attribute process of discretization

34 data warehouse Is a data repository A repository of multiple heterogeneous data sources organized under a unified schema at a single site in order to facilitate management decision making. data warehouse are constructed via a process of data cleaning, data integration, data transform, data loading,and periodic data refreshing. on-line analytical processing (OLAP) is the Major task of data warehouse system

35 DATA STREAMS Streaming data :data flow in and out of an observation platform dynamically Features:  Huge or possibly infinite volume,  Dynamically changing  Flow in and out in a fixed order  Allowing only one or a small number of scans  Demanding fast (often real time ) response time Examples  Network traffic  Stock exchange  web click streams  Weather monitoring

36 Data Mining: On What Kinds of Data? Advanced data sets and advanced applications  Time-series data, temporal data, sequence data (incl. bio- sequences)  Structure data, graphs, social networks and multi-linked data  Object-relational databases  Heterogeneous databases and legacy databases  Spatial data and spatiotemporal data  Multimedia database  Text databases  The World-Wide Web

37 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional View of Data Mining What Kind of Data Can Be Mined? What Kinds of Patterns Can Be Mined? What Technology Are Used? What Kind of Applications Are Targeted? Major Issues in Data Mining A Brief History of Data Mining and Data Mining Society Summary

38 Data Mining Functionalities General functionality  Descriptive data mining (描述性的数据挖掘 ) : characterize the general properties of the data in database Such as: Clustering / similarity matching, Association rules, Deviation detection  Predictive data mining (预测性的数据挖掘) : perform inference on the current data in order to make predictions Such as: Classification, Regression

39 Data Mining Models and Tasks

40 concept description: Characterization and discrimination  Data Characterization is a summarization of the general characteristics or features of a target class of data. Eg. description the characteristics of customers who spend over 10000$ last year,  Use SQL query to get the dada;  statistical measures; OLAP;  The output can be : pie charts, bar charts,curves, multidimensional data cubes, multidimensional tables  Data discrimination is a comparison of the general features of target class data objects with one or a set of contrasting classes

41 Frequent patterns, association, correlation vs. causality  Frequent patterns are patterns that occurred frequently in data.  many kind of Frequent patterns, such as frequent item sets, subsequences, substructures  Mining frequent patterns leads to the discovery of interesting associations and correlations within data. Association analysis  Diaper  Beer [20%, 75%] (Correlation or causality?) 20% support means that 20% of all the transactions under analysis showed that diaper and beer are purchased together. 75%confidence or certainty means that if a customer buys diaper, there is a 75% chance that she will buy beer

42 Data Mining Functionalities Classification and prediction  Construct models (functions) that describe and distinguish classes or concepts for future prediction A model may be presented in various forms: rules, decision tree, mathematical formulae, neural networks A model is derived based on the analysis of training data  Predict some unknown or missing numerical values

43 Classification Given old data about customers and payments, predict new applicant ’ s loan eligibility. Age Salary Profession Location Customer type Previous customers ClassifierDecision rules Salary > 5 L Prof. = Exec New applicant’s data Good/ bad

44 Classification methods  Regression: (linear or any other polynomial)  Nearest neighour  Decision tree classifier:  Bayesian learning  Neural networks:

45 Cluster analysis  Class label is unknown: Group data to form new classes  Maximizing intra-class similarity & minimizing interclass similarity, e.g objects within a cluster have high similarity, but are very dissimilarity to objects in other clusters. Key requirement: Need a good measure of similarity between instances. Application  Customer segmentation e.g. for targeted marketing  Collaborative filtering:  group based on common items purchased  Text clustering  Compression

46 Cluster Analysis

47 Outlier analysis  Outlier: Data object that does not comply with the general behavior of the data  Noise or exception? One person’s garbage could be another person’s treasure  Methods: statistical tests; clustering or regression analysis Application  fraud detection  rare events analysis

48 Trend and evolution analysis evolution analysis  Describes and models regularities or trends for objects whose behavior changes over time.  Time-series data analysis  Sequence analysis-sequential pattern mining e.g., first buy digital camera, then buy large SD memory cards  Periodicity analysis  Similarity-based analysis

49 Ex: Time Series Analysis Example: Stock Market Predict future values Determine similar patterns over time Classify behavior

50 Structure and Network Analysis Graph mining  Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments) Information network analysis  Social networks: actors (objects, nodes) and relationships (edges) e.g., author networks in CS, terrorist networks  Multiple heterogeneous networks A person could be multiple information networks: friends, family, classmates, …  Links carry a lot of semantic information: Link mining Web mining  Web is a big information network: from PageRank to Google  Analysis of Web information networks Web community discovery, opinion mining, usage mining, …

51 Data Mining Models and Tasks

52 Are all the patterns interesting? A data mining system may generate thousands of patterns, are all of them interesting? Can a data mining system generate all of the interesting pattern? Can a data mining system generate only the interesting pattern?

53 Are all the patterns interesting? Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm

54 Are all the patterns interesting? Objective vs. subjective interestingness measures:  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.  Subjective: based on user ’ s belief in the data, e.g., unexpectedness, novelty, etc.

55 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional View of Data Mining What Kind of Data Can Be Mined? What Kinds of Patterns Can Be Mined? What Technology Are Used? What Kind of Applications Are Targeted? Major Issues in Data Mining A Brief History of Data Mining and Data Mining Society Summary

56 Data Mining: Confluence of Multiple Disciplines Data Mining Machine Learning Statistics Applications Algorithm Pattern Recognition High-Performance Computing Visualization Database Technology

57 Why Confluence of Multiple Disciplines? Tremendous amount of data  Algorithms must be highly scalable to handle such as tera-bytes of data High-dimensionality of data  Micro-array may have tens of thousands of dimensions High complexity of data  Data streams and sensor data  Time-series data, temporal data, sequence data  Structure data, graphs, social networks and multi-linked data  Heterogeneous databases and legacy databases  Spatial, spatiotemporal, multimedia, text and Web data  Software programs, scientific simulations New and sophisticated applications

58 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional View of Data Mining What Kind of Data Can Be Mined? What Kinds of Patterns Can Be Mined? What Technology Are Used? What Kind of Applications Are Targeted? Major Issues in Data Mining A Brief History of Data Mining and Data Mining Society Summary

59 Applications of Data Mining Web page analysis: from web page classification, clustering to PageRank & HITS algorithms Collaborative analysis & recommender systems Basket data analysis to targeted marketing Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue) From major dedicated data mining systems/tools (e.g., SAS, MS SQL- Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining

60 Data Mining Application Areas IndustryApplication FinanceCredit Card Analysis InsuranceClaims, Fraud Analysis TelecommunicationCall record analysis TransportLogistics management Consumer goodspromotion analysis Data Service providersValue added data UtilitiesPower usage analysis

61 Data Mining Application in daily life Ubiquitous data mining  Shopping (Wal-mart,7-11,etc. market basket analysis )  On-line shopping -collaborative recommender system  Credit card –detect fraudulent usage  Email - spam filter  Internet- ads

62 Data Mining works with Warehouse Data Data Warehousing provides the Enterprise with a memory ÑData Mining provides the Enterprise with intelligence

63 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional View of Data Mining What Kind of Data Can Be Mined? What Kinds of Patterns Can Be Mined? What Technology Are Used? What Kind of Applications Are Targeted? Major Issues in Data Mining A Brief History of Data Mining and Data Mining Society Summary

64 2016年7月7日星期四 2016年7月7日星期四 2016年7月7日星期四 Data Mining: Concepts and Techniques 64 Major Issues in Data Mining (1) Mining Methodology  Mining various and new kinds of knowledge  Mining knowledge in multi-dimensional space  Data mining: An interdisciplinary effort  Boosting the power of discovery in a networked environment  Handling noise, uncertainty, and incompleteness of data  Pattern evaluation and pattern- or constraint-guided mining User Interaction  Interactive mining  Incorporation of background knowledge  Presentation and visualization of data mining results

65 2016年7月7日星期四 2016年7月7日星期四 2016年7月7日星期四 Data Mining: Concepts and Techniques 65 Major Issues in Data Mining (2) Efficiency and Scalability  Efficiency and scalability of data mining algorithms  Parallel, distributed, stream, and incremental mining methods Diversity of data types  Handling complex types of data  Mining dynamic, networked, and global data repositories Data mining and society  Social impacts of data mining  Privacy-preserving data mining  Invisible data mining

66 Chapter 1. Introduction Motivation: Why data mining? What is data mining? Data Mining: On what kind of data? Data mining functionality Classification of data mining systems Major issues in data mining A Brief History of Data Mining and Data Mining Society Summary

67 Data mining: Discovering interesting patterns from large amounts of data A natural evolution of database technology, in great demand, with wide applications A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation Mining can be performed in a variety of information repositories Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. Data mining systems and architectures Major issues in data mining


Download ppt "Data Mining: Concepts and Techniques www.ePowerPoint.com."

Similar presentations


Ads by Google