Data Mining Theory and Practice Dr. Azuraliza Abu Bakar


1 Data Mining Theory and Practice Dr. Azuraliza Abu Bakar http://www.ftsm.ukm.my/jabatan/ts/aab/index.htm

2 What is Pattern Recognition?
- Pattern recognition by humans
  – perceptual
  – specialized
  – decision making
- Pattern recognition by computers
  – benefit of automated pattern recognition
  – advantage in complex calculations
- Pattern recognition from data (data mining)

3 Pattern Recognition from Data
- Pattern recognition from data is the process of learning from, or observing, past data by studying the dependencies in it and extracting knowledge from it

4 What is Data?
No.   Studies    Education  Works  Income (D)
1     Poor       SPM        Poor   None
2     Poor       SPM        Good   Low
3     Moderate   SPM        Poor   Low
4     Moderate   Diploma    Poor   Low
5     Poor       SPM        Poor   None
6     Moderate   Diploma    Poor   Low
7     Good       MSC        Good   Medium
:
99    Poor       SPM        Good   Low
100   Moderate   Diploma    Poor   Low

5 What is Knowledge?
studies(Poor) AND work(Poor) => income(None)
studies(Poor) AND work(Good) => income(Low)
education(Diploma) => income(Low)
education(MSc) => income(Medium) OR income(High)
studies(Mod) => income(Low)
studies(Good) => income(Medium) OR income(High)
education(SPM) AND work(Good) => income(Low)
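One way to see how rules of this shape can fall out of the table is to group the records by a chosen set of condition attributes and collect the decisions observed for each combination. The sketch below is only an illustration under that assumption, not the method used in the presentation. It uses the nine rows visible on the "What is Data?" slide (rows 8 to 98 are elided there), and the function name `induce_rules` is hypothetical.

```python
from collections import defaultdict

# The nine rows visible on the slide: (studies, education, work, income).
# Rows 8-98 are elided in the source and are not guessed at here.
FIELDS = ("studies", "education", "work", "income")
VISIBLE_ROWS = [
    ("Poor", "SPM", "Poor", "None"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "SPM", "Poor", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Poor", "SPM", "Poor", "None"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Good", "MSC", "Good", "Medium"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
]


def induce_rules(rows, condition_attrs, decision="income"):
    """Map each observed combination of condition values to the set of
    decision values seen together with it (hypothetical helper)."""
    rules = defaultdict(set)
    for values in rows:
        record = dict(zip(FIELDS, values))
        key = tuple((attr, record[attr]) for attr in condition_attrs)
        rules[key].add(record[decision])
    return dict(rules)


rules = induce_rules(VISIBLE_ROWS, ["studies", "work"])
for conditions, decisions in rules.items():
    left = " AND ".join(f"{attr}({value})" for attr, value in conditions)
    right = " OR ".join(f"income({d})" for d in sorted(decisions))
    print(f"{left} => {right}")
```

A combination that maps to more than one decision value corresponds to a rule with an OR on its right-hand side, as in the rules listed on the slide.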

6 What is Data Mining?
- The extraction of knowledge from data
- The exploration and analysis of large quantities of data to discover meaningful patterns
- Discover knowledge

7 How Does Data Mining Look into Data?
[Figure label: Data]

8 Data Mining: Motivation
- Huge amounts of data
- An important need to turn data into useful information
- The fast-growing amount of data, collected and stored in large and numerous databases, has exceeded the human ability to comprehend it without powerful tools

9 Questions
- What goods should be promoted to this customer?
- What is the probability that a certain customer will respond to a planned promotion?
- Can one predict the most profitable securities to buy or sell during the next trading session?
- Will this customer default on a loan or pay it back on schedule?
- What medical diagnosis should be assigned to this patient?
- What kinds of cars should be sold this year?

10 Data Mining is simply...
- Finding relationships
- Making predictions

11 Data Mining: One Step of KDD
[Diagram labels: Task, KDD, Data mining, Techniques]

12 Data Mining as a Step of KDD
[Diagram, in process order: Databases and flat files -> Cleaning and Integration -> Data Warehouse -> Selection and Transformation -> Data Mining -> Patterns -> Evaluation & Presentation -> Knowledge]

13 Early Steps of Data Mining
- Data preprocessing
  – handling incomplete data, noisy data, uncertain data
- Data discretization/representation
  – transforms data into suitable values for the mining algorithm to find patterns
- Data selection
  – selects the suitable data for mining purposes
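As a rough illustration of the discretization step, the sketch below maps a numeric attribute onto a small set of ordered labels using equal-width bins. Equal-width binning is only one common choice and is assumed here; the slides do not say which discretization method was used. The function name `equal_width_bins` and the sample values are hypothetical.

```python
def equal_width_bins(values, labels):
    """Assign each numeric value to one of len(labels) equal-width
    intervals spanning [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    binned = []
    for v in values:
        # A constant column has zero width; put everything in the first bin.
        index = int((v - lo) / width) if width > 0 else 0
        # The maximum value would land one past the last bin, so clamp it.
        index = min(index, len(labels) - 1)
        binned.append(labels[index])
    return binned


# Hypothetical numeric scores turned into the kind of labels the slides use.
scores = [0, 5, 10, 2, 9]
print(equal_width_bins(scores, ["Poor", "Moderate", "Good"]))
```

After this step a mining algorithm that expects categorical values, such as the rule-based examples in this deck, can work with the attribute directly.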

14 Data Mining Techniques
- Decision Trees
- Neural Network
- Genetic Algorithms
- Fuzzy Set Theory
- Rough Set Theory
- Statistical Method (Regression Analysis)

15 Classification of Data Mining Systems
Kinds of DB
- Relational
- Data warehouse
- Transactional DB
- Advanced DB system
- Flat files
- WWW
Kinds of Knowledge
- Classification
- Association
- Clustering
- Prediction
- …

16 Techniques used
- DB-oriented techniques
- Statistics
- Machine learning
- Pattern recognition
- Neural network
- Rough set
- etc.
Application adapted
- Finance
- Marketing
- Medical
- Stock
- Telecommunication
- etc.

17 Data Mining: Confluence of Multiple Disciplines
[Diagram: DATA MINING at the center, surrounded by the following fields]
- Database technology
- Statistics
- Machine learning
- Information science
- Neural network
- Pattern recognition
- Visualization
- Information retrieval
- High-performance computing
- Spatial data analysis

18 Data Mining
- What are we looking at?
- What are we looking for?

19 Data Mining Tasks
  – Prediction
  – Classification
  – Clustering
  – Association Rules
  – Sequential Analysis
  – Deviation analysis
  – Similarity analysis
  – Trend analysis

38 Classification algorithm
Training data
No.   Studies    Education  Works  Income (D)
1     Poor       SPM        Poor   None
2     Poor       SPM        Good   Low
3     Moderate   SPM        Poor   Low
4     Moderate   Diploma    Poor   Low
5     Poor       SPM        Poor   None
6     Moderate   Diploma    Poor   Low
7     Good       MSC        Good   Medium
:
99    Poor       SPM        Good   Low
100   Moderate   Diploma    Poor   Low
Classification Rules
If studies = "Poor" and work = "Poor" then income = "None"

39 Classification
Test data
Studies    Education  Works  Income (D)
Moderate   Diploma    Poor   ?
Poor       SPM        Poor   ?
Moderate   Diploma    Poor   ?
Good       MSC        Good   ?
:
New data: studies = "Poor" and work = "Poor"
[Diagram: New data -> Classification rules -> classify -> income = "None"]
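The classification step can be sketched as a lookup against the rule list: each unlabelled record is compared with the rules in order, and the decision of the first rule whose conditions all hold is returned. First-match ordering is an assumption made for this sketch; the slides do not say how overlapping rules are resolved. The rules are transcribed from the "What is Knowledge?" slide, with `studies(Mod)` expanded to "Moderate", and the names `matches` and `classify` are hypothetical.

```python
# (conditions, decision) pairs transcribed from the rules slide.
RULES = [
    ({"studies": "Poor", "work": "Poor"}, "None"),
    ({"studies": "Poor", "work": "Good"}, "Low"),
    ({"education": "Diploma"}, "Low"),
    ({"education": "MSc"}, "Medium or High"),
    ({"studies": "Moderate"}, "Low"),
    ({"studies": "Good"}, "Medium or High"),
    ({"education": "SPM", "work": "Good"}, "Low"),
]


def matches(conditions, record):
    """True when every condition attribute has the required value.
    Comparison ignores case because the slides write both MSC and MSc."""
    return all(
        record.get(attr, "").lower() == value.lower()
        for attr, value in conditions.items()
    )


def classify(record, rules=RULES, default="unknown"):
    """Return the decision of the first matching rule (assumed policy)."""
    for conditions, decision in rules:
        if matches(conditions, record):
            return decision
    return default


# The four unlabelled records shown on the test-data slide.
TEST_DATA = [
    {"studies": "Moderate", "education": "Diploma", "work": "Poor"},
    {"studies": "Poor", "education": "SPM", "work": "Poor"},
    {"studies": "Moderate", "education": "Diploma", "work": "Poor"},
    {"studies": "Good", "education": "MSC", "work": "Good"},
]
predictions = [classify(record) for record in TEST_DATA]
print(predictions)
```

Records that match no rule fall through to the `default` label, which makes gaps in rule coverage visible instead of silently assigning a class.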

40 Types of Classifiers
- Statistical Classifier
  – Bayesian approach
  – Multiple Regression
  – K-nearest neighbour
  – Naïve Bayes
  – Causal Network
  – Discriminant Analysis
- Neural Classifier
  – Hopfield Network
  – Multilayer Perceptron
  – Radial Basis Function
  – Kohonen Networks
- Rough Classifier

41 DATASET
No.   Studies    Education  Works  Income (D)
1     Poor       SPM        Poor   None
2     Poor       SPM        Good   Low
3     Moderate   SPM        Poor   Low
4     Moderate   Diploma    Poor   Low
5     Poor       SPM        Poor   None
6     Moderate   Diploma    Poor   Low
7     Good       MSC        Good   Medium
:
99    Poor       SPM        Good   Low
100   Moderate   Diploma    Poor   Low

42 RULES
studies(Poor) AND work(Poor) => income(None)
studies(Poor) AND work(Good) => income(Low)
education(Diploma) => income(Low)
education(MSc) => income(Medium) OR income(High)
studies(Mod) => income(Low)
studies(Good) => income(Medium) OR income(High)
education(SPM) AND work(Good) => income(Low)

43 Comparing Classifiers
- Predictive Accuracy
- Speed
- Robustness
- Scalability
- Interpretability

44 Data Mining: Problems and Challenges
- Noisy data
- Difficult Training Set
- Dynamic Databases
- Large Databases
- Incomplete Data

45 Performance Issues
- Cost of the Learning Set
- Time and Memory Constraints
- Predictive Ability

46 Performance Issues
Cost of the Learning Set
  – number of examples necessary for training
  – cost of ensuring good accuracy

47 Performance Issues
Time and Memory Constraints
  – time complexity of the learning phase
  – time taken for evaluation
  – time it takes to reach a certain level of accuracy

48 Performance Issues
Predictive Ability
  – the ability to predict the correct decision for test or unseen data
  – involves the generation of rules
  – measuring the quality or accuracy of rules
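Predictive accuracy is commonly measured as the fraction of test records whose predicted decision equals the known one. The sketch below assumes that definition; the slides do not state a formula. The function name `accuracy` and the small predicted/actual lists are hypothetical values chosen only to show the calculation, not results from the presentation.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one test record is required")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical outcome for four test records (illustration only).
predicted = ["None", "Low", "Low", "Medium"]
actual = ["None", "Low", "Low", "Low"]
print(f"accuracy = {accuracy(predicted, actual):.2f}")
```

The same function can be applied to the training set and to unseen data separately; a large gap between the two numbers suggests the rules do not generalize well.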

49 Samples of the CLEV Dataset (before scaling)
DATA  AGE  SEX     CP              TRESTBPS  CHOL  FBS  RESTECG   THALACH  EXANG  OLDPEAK  SLOPE      CA  THAL          DISEASE
1     63   Male    Typical angina  145       233   T    LV hyper  150      No     2.3      Downslope  0   Fixed         No
2     67   Male    Asymp           160       286   F    LV hyper  108      Yes    1.5      Flat       3   Normal        Yes
3     67   Male    Asymp           120       229   F    LV hyper  129      Yes    2.6      Flat       2   Reversable    Yes
4     37   Male    Non-anginal     130       250   F    Normal    187      No     3.5      Downslope  0   Normal        No
5     41   Female  Atypical        130       204   F    LV hyper  172      No     1.4      Upsloping  0   Normal        No
6     56   Male    Atypical        120       236   F    Normal    178      No     0.8      Upsloping  0   Normal        No
7     62   Female  Asymp           140       268   F    LV hyper  160      No     3.6      Downslope  2   Normal        Yes
8     57   Female  Asymp           120       354   F    Normal    163      Yes    0.6      Upsloping  0   Normal        No
9     63   Male    Asymp           130       254   F    LV hyper  147      No     1.4      Flat       1   Reversable    Yes
10    53   Male    Asymp           140       203   T    LV hyper  155      Yes    3.1      Downslope  0   Reversable    Yes
11    57   Male    Asymp           140       192   F    Normal    148      No     0.4      Flat       0   Fixed defect  No
12    56   Female  Atypical        140       294   F    LV hyper  153      No     1.3      Flat       0   Normal        No
13    56   Male    Non-anginal     130       256   T    LV hyper  142      Yes    0.6      Flat       1   Fixed defect  Yes
14    44   Male    Atypical        120       263   F    Normal    173      No     0        Upsloping  0   Reversable    No
15    52   Male    Non-anginal     172       199   T    Normal    162      No     0.5      Upsloping  0   Reversable    No
16    57   Male    Non-anginal     150       168   F    Normal    174      No     1.6      Upsloping  0   Normal        No
17    48   Male    Atypical        110       229   F    Normal    168      No     1        Downslope  0   Reversable    Yes
18    54   Male    Asymp           140       239   F    Normal    160      No     1.2      Upsloping  0   Normal        No
19    48   Female  Non-anginal     130       275   F    Normal    139      No     0.2      Upsloping  0   Normal        No
20    49   Male    Atypical        130       266   F    Normal    171      No     0.6      Upsloping  0   Normal        No
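The caption notes that these samples are shown before scaling. The slides do not say which scaling was applied, so the sketch below assumes min-max scaling to the range [0, 1] as one common option, applied to the CHOL values of the first five rows of the table. The function name `min_max_scale` is hypothetical.

```python
def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# CHOL values of rows 1-5 in the table above.
chol = [233, 286, 229, 250, 204]
scaled = min_max_scale(chol)
for raw, value in zip(chol, scaled):
    print(f"{raw} -> {value:.3f}")
```

Scaling of this kind puts attributes with very different ranges, such as CHOL and OLDPEAK, on a comparable footing before mining.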

50 Rules generated from data mining process
oldpeak(0.7) => disease(No)
oldpeak(4.4) => disease(Yes)
chol(233) AND restecg(LV hypertrophy) => disease(No)
chol(204) AND restecg(LV hypertrophy) => disease(No)
chol(236) AND restecg(Normal) => disease(No)
chol(203) AND restecg(LV hypertrophy) => disease(Yes)
chol(294) AND restecg(LV hypertrophy) => disease(No)
chol(275) AND restecg(Normal) => disease(No)
chol(266) AND restecg(Normal) => disease(No)
chol(247) AND restecg(Normal) => disease(No)
chol(219) AND restecg(LV hypertrophy) => disease(No)
chol(266) AND restecg(LV hypertrophy) => disease(Yes)
chol(304) AND restecg(Normal) => disease(No)
chol(254) AND restecg(Normal) => disease(Yes)
chol(267) AND restecg(Normal) => disease(Yes)
chol(264) AND restecg(LV hypertrophy) => disease(No)
chol(234) AND restecg(LV hypertrophy) => disease(No)

