Dr. Abdul Aziz Associate Dean Faculty of Computer Sciences Riphah International University Islamabad, Pakistan Dr. Nazir A. Zafar.


1 Dr. Abdul Aziz Associate Dean Faculty of Computer Sciences Riphah International University Islamabad, Pakistan aziz_gill@hotmail.com Dr. Nazir A. Zafar Department of Computer & Information Sciences Pakistan Institute of Engineering & Applied Sciences Nilore, Islamabad, Pakistan nazafar@pieas.edu.pk

2 Reduction in Over-Fitting for Classification without Compromising on Accuracy and Effectiveness

3 Machine Learning
Machine learning covers the following main types of learning:
Classification learning: learn to put instances into pre-defined classes based on other attributes.
Association learning: learn relationships between the attributes.
Clustering: discover classes of instances that belong together.
Regression: learn to predict a numeric quantity instead of a class.


5 Roots of Classification
1. Classification draws on the concepts of three major paradigms: Database technology, Statistics, Machines.
2. Domain knowledge, i.e. the expertise of the end-user.


7 Knowledge Discovery in Databases
1. The KDD process typically generates a model from past records with known target classes (outputs); the model is then used to predict the outputs of future records (new cases).
2. Applications include fraud detection, marketing, investment analysis and insurance.

8 Marketing example
The goal is to predict whether a customer will buy a product given gender, country and age.
Freitas and Lavington (1998) Data Mining, CEC99.

9 Decision tree for the Marketing example
country?
  Germany: no
  England: yes
  France: age?
    <= 25: yes
    > 25: no
This is the decision tree induced by the Marketing example data. The first branch is called the root of the tree.

10 Tree induction
1. The tree is built by selecting one attribute at a time: the one that 'best' separates the classes.
2. The set of examples is then partitioned according to the values of the selected attribute.
3. This is repeated at each branch node until segmentation is complete.
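The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how the 'best' attribute is measured, so the sketch uses information gain (an ID3-style choice), and the ten records are hypothetical, invented to match the class counts shown on the tree slides (gender is left out and age is pre-binned at 25).

```python
import math
from collections import Counter

# Hypothetical records: invented to match the slide's class counts
# (Germany 0Y/3N, England 2Y/0N, France 2Y/3N); not the original data.
RECORDS = [
    {"country": "Germany", "age": "> 25", "buy": "no"},
    {"country": "Germany", "age": "<= 25", "buy": "no"},
    {"country": "Germany", "age": "> 25", "buy": "no"},
    {"country": "England", "age": "<= 25", "buy": "yes"},
    {"country": "England", "age": "> 25", "buy": "yes"},
    {"country": "France", "age": "<= 25", "buy": "yes"},
    {"country": "France", "age": "<= 25", "buy": "yes"},
    {"country": "France", "age": "> 25", "buy": "no"},
    {"country": "France", "age": "> 25", "buy": "no"},
    {"country": "France", "age": "> 25", "buy": "no"},
]


def partition(rows, attr):
    """Step 2: split the examples by the values of one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups


def entropy(rows):
    """Class entropy (in bits) of a set of examples."""
    counts = Counter(row["buy"] for row in rows)
    return -sum((c / len(rows)) * math.log2(c / len(rows)) for c in counts.values())


def information_gain(rows, attr):
    """Assumed 'best' criterion: entropy removed by splitting on attr."""
    groups = partition(rows, attr).values()
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups)
    return entropy(rows) - remainder


def induce(rows, attrs):
    """Steps 1-3: pick the best attribute, partition, recurse until pure."""
    classes = {row["buy"] for row in rows}
    if len(classes) == 1:
        return classes.pop()  # pure leaf
    if not attrs:
        return Counter(row["buy"] for row in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {best: {value: induce(group, rest)
                   for value, group in partition(rows, best).items()}}


TREE = induce(RECORDS, ["country", "age"])
print(TREE)
```

On these records the sketch selects country at the root and age under France, which matches the shape of the tree on the slides; other split criteria (for example Gini impurity) would slot into `information_gain` without changing the recursion.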

11 Decision tree with class counts
country? (4Y 6N)
  Germany: no (0Y 3N)
  England: yes (2Y 0N)
  France: age? (2Y 3N)
    <= 25: yes (2Y 0N)
    > 25: no (0Y 3N)
Notice that in this simple example the leaf nodes contain records of one class only. The number of yes and no examples is conserved as you move up and down the tree.

12 Rule derivation
country? (4Y 6N)
  Germany: no (0Y 3N)
  England: yes (2Y 0N)
  France: age? (2Y 3N)
    <= 25: yes (2Y 0N)
    > 25: no (0Y 3N)
If (country = Germany) then (Buy? = No)
What are the other rules?
Rules can be extracted directly from induction trees.
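As a minimal sketch of how rules fall out of the tree, the snippet below walks every root-to-leaf path and turns it into one rule. The nested-dictionary encoding and the function name `extract_rules` are assumptions made for illustration; the leaf labels are read off the class counts on the slide.

```python
# The slide's tree, encoded (as an assumption) with one branch test per key.
TREE = {
    "country = Germany": "No",
    "country = England": "Yes",
    "country = France": {
        "age <= 25": "Yes",
        "age > 25": "No",
    },
}


def extract_rules(node, path=()):
    """Return one 'If ... then ...' rule per root-to-leaf path."""
    if isinstance(node, str):  # leaf: emit the rule for the path so far
        antecedent = " and ".join(f"({test})" for test in path)
        return [f"If {antecedent} then (Buy? = {node})"]
    rules = []
    for test, child in node.items():
        rules.extend(extract_rules(child, path + (test,)))
    return rules


RULES = extract_rules(TREE)
for rule in RULES:
    print(rule)
# If (country = Germany) then (Buy? = No)
# If (country = England) then (Buy? = Yes)
# If (country = France) and (age <= 25) then (Buy? = Yes)
# If (country = France) and (age > 25) then (Buy? = No)
```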

13 Heart Disease Dataset

14 What is needed?
1. With databases of enormous size, the user needs help to analyse the data more effectively than by simply querying and reporting.
2. Semi-automatic methods that extract useful, previously unknown (higher-level) information in a concise format will help users make more sense of their data.

15 The KDD roadmap
1. KDD may be divided into the following stages:
2. Note the iterative nature of the process.

16 Expertise required
1. Any organisation that undertakes a project in KDD will require much expert input to ensure that the results produced are of high quality, valid, interesting/useful/novel/surprising, and comprehensible by the human user.
2. "If patient is pregnant then gender is female" is very accurate, but is neither useful nor surprising.


19 Relative error reduction
S.No.  Data Set                    SRS     SDS     ER %
1      Heart Disease (Cleveland)   77.37   82.12   20.99
2      Credit-A                    84.52   84.93    2.65
3      Diabetes (PIMA)             72.45   73.96    5.48
4      Liver disorder (BUPA)       64.86   65.90    2.96
5      Breast cancer Wisconsin     94.63   94.86    4.28
6      Hepatitis                   78.20   88.31   46.38
7      Ionosphere                  89.09   90.91   16.68
8      Boston housing              82.53   83.79    7.21
9      Credit (German)             71.64   72.20    1.97
10     Iris                        91.93   98.67   83.52
11     Sonar                       73.46   70.19  -12.32
       Overall average             80.06   82.35   16.35
ER %: Relative error reduction
SRS: Simple Random Sampling
SDS: Systematic Distribution Sampling
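The slide does not spell out how the ER % column is computed. The values are consistent with the usual definition of relative error reduction, (SDS - SRS) / (100 - SRS) x 100, that is, the share of the SRS error that SDS removes. The sketch below assumes that definition and checks it against the table; the function name is hypothetical.

```python
# Accuracies (%) copied from the table: data set -> (SRS, SDS).
ACCURACY = {
    "Heart Disease (Cleveland)": (77.37, 82.12),
    "Credit-A": (84.52, 84.93),
    "Diabetes (PIMA)": (72.45, 73.96),
    "Liver disorder (BUPA)": (64.86, 65.90),
    "Breast cancer Wisconsin": (94.63, 94.86),
    "Hepatitis": (78.20, 88.31),
    "Ionosphere": (89.09, 90.91),
    "Boston housing": (82.53, 83.79),
    "Credit (German)": (71.64, 72.20),
    "Iris": (91.93, 98.67),
    "Sonar": (73.46, 70.19),
}


def relative_error_reduction(baseline_acc, new_acc):
    """Assumed formula: percentage of the baseline error that is removed."""
    return 100.0 * (new_acc - baseline_acc) / (100.0 - baseline_acc)


ER = {name: relative_error_reduction(srs, sds)
      for name, (srs, sds) in ACCURACY.items()}
MEAN_ER = sum(ER.values()) / len(ER)

for name, value in ER.items():
    print(f"{name:28s} {value:7.2f}")
print(f"{'Overall average':28s} {MEAN_ER:7.2f}")
```

Under this assumption the overall 16.35 is the mean of the eleven per-data-set ER values, not the ER of the two average accuracies (which would be about 11.5). A negative value, as for Sonar, means SDS was less accurate than SRS on that data set.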

20 Comparison
             SRS                      SIS
Data Set     Accuracy  Over fitting   Accuracy  Over fitting
Heart-C      77.37      6.14          79.58     3.11
Credit-A     84.52      5.25          87.08     2.93
Diabetes     72.45      7.83          72.66     3.58
Liver        64.86     10.47          67.17     4.79
Cancer       94.63      2.58          94.62     1.26
Hepatitis    78.20      6.16          83.12     3.07
Ionosphere   89.09      4.94          89.41     2.19
Housing      82.53      3.68          83.00     2.53
Credit-G     71.64      8.23          72.56     3.97
Iris         91.93      2.91          96.00     1.18
Sonar        73.46      8.26          76.73     3.95
Average      80.06      6.04          81.99     2.96
SRS: Simple Random Sampling
SIS: Stratified Induction Sampling
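As a quick arithmetic check on the comparison, the sketch below recomputes the column averages from the rows and derives the relative reduction in over-fitting. The relative-reduction figure is not on the slide; it is computed here only to put a number on the difference between the two over-fitting columns.

```python
# Rows copied from the table:
# data set -> (SRS accuracy, SRS over-fitting, SIS accuracy, SIS over-fitting)
ROWS = {
    "Heart-C": (77.37, 6.14, 79.58, 3.11),
    "Credit-A": (84.52, 5.25, 87.08, 2.93),
    "Diabetes": (72.45, 7.83, 72.66, 3.58),
    "Liver": (64.86, 10.47, 67.17, 4.79),
    "Cancer": (94.63, 2.58, 94.62, 1.26),
    "Hepatitis": (78.20, 6.16, 83.12, 3.07),
    "Ionosphere": (89.09, 4.94, 89.41, 2.19),
    "Housing": (82.53, 3.68, 83.00, 2.53),
    "Credit-G": (71.64, 8.23, 72.56, 3.97),
    "Iris": (91.93, 2.91, 96.00, 1.18),
    "Sonar": (73.46, 8.26, 76.73, 3.95),
}


def column_mean(index):
    """Mean of one column across all data sets."""
    return sum(row[index] for row in ROWS.values()) / len(ROWS)


SRS_ACC, SRS_OVERFIT, SIS_ACC, SIS_OVERFIT = (column_mean(i) for i in range(4))

# Derived figure (not on the slide): share of SRS over-fitting removed by SIS.
OVERFIT_REDUCTION = 100.0 * (SRS_OVERFIT - SIS_OVERFIT) / SRS_OVERFIT

print(f"accuracy:     SRS {SRS_ACC:.2f}  SIS {SIS_ACC:.2f}")
print(f"over-fitting: SRS {SRS_OVERFIT:.2f}  SIS {SIS_OVERFIT:.2f}")
print(f"relative reduction in over-fitting: {OVERFIT_REDUCTION:.1f}%")
```

The recomputed averages agree with the Average row, and on these figures SIS roughly halves the average over-fitting while average accuracy rises by about two points.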

21 Conclusion
In this study, we have shown that partitioning the original data sets into training and test sets using the stratified induction approach reduces over-fitting significantly without compromising accuracy.
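The transcript does not describe how the stratified induction approach forms its partitions. For orientation only, the sketch below shows a generic class-stratified train/test split, in which each class contributes the same fraction of its records to the test set. It is not the authors' method; the function name, the 30% test fraction and the 40/60 labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Generic stratified split: sample the test set class by class.

    Returns (train_indices, test_indices) into `labels`.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical labels with a 40/60 class mix.
LABELS = ["yes"] * 40 + ["no"] * 60
TRAIN, TEST = stratified_split(LABELS)

test_yes = sum(1 for i in TEST if LABELS[i] == "yes")
print(len(TRAIN), len(TEST), test_yes)
```

A simple random split can, by chance, leave the training and test sets with different class mixes; stratifying removes that source of variation, which is one plausible reason a stratified partition would show a smaller gap between training and test performance.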

22 Supporting Texts
Data Warehousing, Data Mining and OLAP, Alex Berson & Stephen Smith, McGraw-Hill (1997), ISBN 0-07-006272-2
Predictive Data Mining, Sholom Weiss & Nitin Indurkhya, Morgan Kaufmann (1998), ISBN 1-55860-403-0
Data Mining, Ian Witten & Eibe Frank, Morgan Kaufmann (1999), ISBN 1-55860-552-5

23 Useful URLs
1. University of East Anglia School of Computing Sciences, UK: http://www.cmp.uea.ac.uk/research/groups/mag/kdd/
2. UCI ML repository, USA: http://www.ics.uci.edu/~mlearn/MLRepository.html
3. KD Nuggets, USA: http://www.kdnuggets.com/

24 Questions and Answers
Discussion

25 THANK YOU

