1 Dept. of Computer Science University of Liverpool
COMP527: Data Mining
M. Sulaiman Khan, Dept. of Computer Science, University of Liverpool, 2009
These are the full course notes, but not quite complete. You should come to the lectures anyway. Really.

2 COMP527: Data Mining (Course Outline)
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam

3 Today's Topics
Classification
Basic algorithms: KNN, Perceptron, Winnow

4 Classification: Main Idea
Learn the concept of what it means to be part of a named class of instances. This is called Supervised Learning, as it learns by example from data which is already classified correctly. The class attribute is often called the Class Label attribute, hence the model learns from labeled data.
Two main phases:
Training: learn the classification model from labeled data
Prediction: use the pre-built model to classify new instances

5 Classification Accuracy
We need to use previously unseen instances to test a classifier. Over-fitting is the main problem: classifiers will often learn too specific a model, and testing on data that was used in training would reinforce this problem. We therefore need to split the data set into training and testing sets.
Revised phases for accuracy estimation (a small sketch follows below):
Split the data set into distinct Training and Testing sets
Build the classifier with the Training set
Assess accuracy with the Testing set, normally expressed as a percentage
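A rough Python sketch of this procedure (my own illustration, not course code; the classifier object with fit and predict methods is a hypothetical stand-in for whatever learning algorithm is being evaluated):

import random

def split_and_evaluate(instances, labels, classifier, test_fraction=0.3, seed=42):
    # Shuffle the indices, then hold out the last test_fraction for testing.
    indices = list(range(len(instances)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - test_fraction))
    train_idx, test_idx = indices[:cut], indices[cut:]

    # Build the model on the training set only.
    classifier.fit([instances[i] for i in train_idx], [labels[i] for i in train_idx])

    # Assess accuracy on the unseen testing set, expressed as a percentage.
    predictions = classifier.predict([instances[i] for i in test_idx])
    correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return 100.0 * correct / len(test_idx)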

6 Comparing Methods
Accuracy: percentage of instances classified correctly.
Speed: computational cost of both learning the model and predicting classes.
Robustness: ability to cope with noisy or missing data.
Scalability: ability to cope with very large amounts of data.
Interpretability: is the model understandable to a human, or otherwise useful?

7 Classification vs Prediction
Classification predicts a class label from a given finite set.
The label is a nominal attribute, so unordered and enumerable (sometimes called a categorical attribute).
Some algorithms predict a probability for more than one label.
Prediction predicts a number instead of a label.
The set of possible outcomes is ordered and infinite.
Also often called Regression or Numeric Prediction, and often viewed as learning a function.

8 Eager vs Lazy Learners
Eager Learner: constructs the model when it receives the training data. Builds a model likely to be very different in structure to the data.
Lazy Learner: doesn't construct a model when training, only when classifying new instances. Does only enough work to ensure that the data can be compared later. Sometimes called instance-based learners.
Most classifiers are Eager, but there's an important Lazy classifier called 'KNN': K Nearest Neighbour.

9 But First...
Which group, left or right, for these two flowers? (Experiment reported in Cognitive Science, 2002.)

10 Resemblance
People classify things by finding other items that are similar which have already been classified. For example: is a new species a bird? Does it have the same attributes as lots of other birds? If so, then it's probably a bird too. This is a combination of rote memorisation and the notion of 'resembles'. Although kiwis can't fly like most other birds, they resemble birds more than they resemble other types of animals. So the problem is to find which instances most closely resemble the instance to be classified.

11 KNN: Distance Measures
Distance (or similarity) between instances is easy to measure if the data is numeric.
Typically we use Euclidean distance: d = √((x1i - x1j)² + (x2i - x2j)² + ...)
Also Manhattan / City Block distance: d = |x1i - x1j| + |x2i - x2j| + ...
However, we should normalise all of the values to the same scale first; otherwise income will overpower age, for example. (A small sketch follows below.)
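For example, a rough Python sketch (my own illustration, not from the slides) of both distance measures, with min-max normalisation applied first so every attribute lies in 0..1:

from math import sqrt

def euclidean(a, b):
    # d = sqrt((a1-b1)^2 + (a2-b2)^2 + ...)
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    # d = |a1-b1| + |a2-b2| + ...
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def min_max_normalise(dataset):
    # Rescale each column to 0..1 so that income does not overpower age.
    lows = [min(col) for col in zip(*dataset)]
    highs = [max(col) for col in zip(*dataset)]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in dataset]

data = min_max_normalise([[25, 30000], [40, 80000], [33, 52000]])
print(euclidean(data[0], data[1]), manhattan(data[0], data[1]))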

12 KNN: Non-Numeric Distance
For nominal attributes, we can only compare whether the value is the same or not. Equally, this can be done by dividing enumerations into many boolean attributes.
We might be able to convert some attributes into ones between which distance can be determined by some function, e.g. colour or temperature.
Text can be treated as one attribute per word, with the frequency as the value, normalised to 0..1, and preferably with very high frequency words ignored (e.g. the, a, as, is...). A small sketch of a mixed distance measure follows below.
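One common way to combine such attributes (my own sketch, not from the slides) is an overlap measure: a nominal attribute contributes 0 when the two values match and 1 when they differ, while numeric attributes contribute their normalised absolute difference:

def mixed_distance(a, b, nominal):
    # `nominal` flags which positions hold nominal (categorical) values.
    total = 0.0
    for ai, bi, is_nominal in zip(a, b, nominal):
        if is_nominal:
            total += 0.0 if ai == bi else 1.0   # same value or not
        else:
            total += abs(ai - bi)               # assumes numeric values already scaled to 0..1
    return total

# e.g. instances of (colour, normalised temperature)
print(mixed_distance(("red", 0.8), ("blue", 0.6), nominal=(True, False)))  # approximately 1.2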

13 KNN: Classification
The classification process is then straightforward (a sketch follows below):
Find the k closest instances to the test instance
Predict the most common class among those instances (or predict the mean, for numeric prediction)
What value to use for k? It depends on dataset size. Large databases need a higher k, whereas a high k for a small dataset might cross out of the class boundaries. Calculate accuracy on the test set for increasing values of k, and use a hill-climbing algorithm to find the best. Typically use an odd number to help avoid ties.
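A minimal runnable sketch of this procedure (illustrative only, not course code), using Euclidean distance and a simple majority vote:

from collections import Counter
from math import dist   # Euclidean distance between two points (Python 3.8+)

def knn_predict(train, labels, test_instance, k=5):
    # Sort training instances by their distance to the test instance,
    # then take the most common class label among the k closest.
    neighbours = sorted(zip(train, labels),
                        key=lambda pair: dist(pair[0], test_instance))
    top_k = [label for _, label in neighbours[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Toy usage
train = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
labels = ["blue", "blue", "blue", "red", "red"]
print(knn_predict(train, labels, (2, 2), k=3))   # -> blue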

14 KNN: Classification (example)
5-NN: find the 5 closest instances to the black point; 3 are blue and 2 are red, so predict blue.

15 KNN: Classification (speed)
Classification can be very slow because of the need to find the k nearest instances. In a trivial implementation, it could take |D| comparisons. Using indexing it can easily be improved.
It is also easy to parallelise, as one comparison is completely distinct from the other comparisons.
We can remove instances from the data set that do not help; for example, a tight cluster of 1000 instances of the same class is unnecessary for k < 50.
We can also use advanced data structures to improve the speed of classification, by storing the instance information appropriately.

16 KNN: kD-Trees
A kD-tree is a binary tree that divides the input space with a plane, then splits each such partition recursively. Each split is made parallel to an axis and through an instance. A typical strategy is to find the point closest to the mean in the current partition and split through it, along a different axis from the previous split (actually on the axis with the greatest variance).
Then to search, descend the tree to the leaf partition that contains the test instance. Search only that partition; then, if an edge is closer than any of the k closest instances found, search the parent partitions as well. (A small sketch follows below.)
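A minimal kD-tree sketch (my own illustration under simplifying assumptions: it splits on the median along cycling axes rather than the mean/greatest-variance strategy above, and returns only the single nearest neighbour):

from math import dist   # Euclidean distance (Python 3.8+)

class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build(points, depth=0):
    # Split through the median instance along the current axis,
    # recursing into the two halves with the next axis.
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    if node is None:
        return best
    if best is None or dist(target, node.point) < dist(target, best):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)      # descend towards the target's partition first
    if abs(diff) < dist(target, best):      # splitting plane closer than the best so far:
        best = nearest(far, target, best)   # the adjacent partition must be checked too
    return best

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (6, 5)))   # closest stored instance to (6, 5)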

17 KNN: kD-Trees (example)
First split at instance (7,4), then again at (6,7), which divides the search space into more easily searchable sections.

18 KNN: kD-Trees (searching)
Then to classify the star, descend into the section containing both the star and the black instance. But note that the instance in the other section is closer, so we must still check the adjacent area. Note that the shaded area is the black node's sibling and hence cannot contain closer points.

19 Perceptron and Winnow
Two very simple eager methods: Perceptron and Winnow. They both use the idea of a single neuron that fires when given the right stimuli. (We'll look at this idea again later under Neural Networks.)
The first thing to keep in mind is that the input to the perceptron must be a vector of numbers. Secondly, it can only answer a two-class problem: either the neuron fires (class 1) or it doesn't (class 2).

20 Perceptron and Winnow
The square boxes are inputs, the w lines are weights and the circle is the perceptron. The learning problem is to find the correct weights to apply to the attributes. The bias is a fixed input value (1) whose weight is learnt in the same way as the other attributes, so that the perceptron can simply check whether the weighted sum is > 0 to see if it should fire.

21 Perceptron
For each attribute, we have an input node. Then there is one output node to which all of them connect, with a weight on each connection. We can then multiply weight by value, and add them all up:
w0a0 + w1a1 + ... + wnan
Set this expression equal to 0 and it's the equation for a hyperplane. So essentially we are learning the hyperplane that separates the two classes. Then classification is just checking which side of the plane the instance falls on. But how do we learn the weights?

22 Perceptron
Remember that instances are a set of numeric attributes (a vector). We can also treat the weights on the connections as a vector. We only want to classify between two classes. So:

weightVector = [0, ..., 0]
while classificationFailed:
    for each training instance I:
        if classify(I) != I.class:
            if I.class == class1: weightVector += I
            else: weightVector -= I

No complicated higher math here! (A runnable sketch follows below.)
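A minimal runnable Python sketch of this update rule (my own example, not code from the course); each instance gets a fixed bias input of 1 and the perceptron 'fires' (class 1) when the weighted sum is greater than 0:

def train_perceptron(instances, labels, epochs=100):
    # labels are True for class 1, False for class 2
    n = len(instances[0]) + 1                      # +1 for the bias input
    w = [0.0] * n
    for _ in range(epochs):
        failed = False
        for x, y in zip(instances, labels):
            a = [1.0] + list(x)                    # prepend the fixed bias input
            fires = sum(wi * ai for wi, ai in zip(w, a)) > 0
            if fires != y:                         # misclassified: adjust the weight vector
                failed = True
                sign = 1 if y else -1              # add the instance for class 1, subtract otherwise
                w = [wi + sign * ai for wi, ai in zip(w, a)]
        if not failed:                             # every instance classified correctly
            break
    return w

# Toy usage: learn a hyperplane separating (1,1) from the rest.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [False, False, False, True]
weights = train_perceptron(X, y)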

23 Winnow
Winnow also only updates when it finds a misclassified instance, but uses multiplication to do the update rather than addition. It only works when the attribute values are also binary (1 or 0).

delta = (user defined)
while classificationFailed:
    for each instance I:
        if classify(I) != I.class:
            if I.class == class1:
                for each attribute ai in I: if ai == 1, wi *= delta
            else:
                for each attribute ai in I: if ai == 1, wi /= delta

(A runnable sketch follows below.)
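A minimal runnable sketch of the Winnow update (my own illustration, not course code); attribute values must be binary, weights start at 1, and the neuron fires when the weighted sum reaches a threshold, here set to the number of attributes:

def train_winnow(instances, labels, delta=2.0, epochs=100):
    n = len(instances[0])
    w = [1.0] * n
    threshold = float(n)                     # a common choice of firing threshold
    for _ in range(epochs):
        failed = False
        for x, y in zip(instances, labels):
            fires = sum(wi * ai for wi, ai in zip(w, x)) >= threshold
            if fires != y:                   # misclassified instance
                failed = True
                for i, ai in enumerate(x):
                    if ai == 1:              # only the attributes present in the instance change
                        w[i] = w[i] * delta if y else w[i] / delta
        if not failed:
            break
    return w

# Toy usage: class 1 iff the first attribute is set.
X = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (0, 0, 1)]
y = [True, True, False, False]
weights = train_winnow(X, y)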

24 Further Reading
Witten, Section 3.8 and pp 124-136
Han, Sections 6.1, 6.9
Dunham, Sections ...
Berry and Linoff, Chapter 8
Berry and Browne, Chapter 6
Devijver and Kittler, Pattern Recognition: A Statistical Approach, Chapter 3
For KNN and Perceptron: Wikipedia, as always :)

