CS 590M Fall 2001: Security Issues in Data Mining Lecture 3: Classification.

1 CS 590M Fall 2001: Security Issues in Data Mining Lecture 3: Classification

2 What is Classification?
Problem: assign items to pre-defined classes
–Sample Y = Y_1 … Y_n
–Set of classes X
–Given Y, choose the class X_i that contains Y
How do we know how to do this?
–Training data: set of items for which the proper X_i is known.

3 Issues
Classification accuracy
–False positives, False negatives
–No clear “best” metric
Computation cost
–Training
–Classification
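To make the accuracy issue concrete, here is a minimal sketch of how false positives and false negatives can be counted for one class treated as "positive". The function names and the "attack"/"normal" labels are illustrative assumptions, not something the slides define.

```python
def confusion_counts(actual, predicted, positive):
    """Count outcomes for one class treated as the 'positive' class.

    Returns (true_pos, false_pos, true_neg, false_neg).
    """
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # predicted positive, really positive
            else:
                fp += 1   # predicted positive, really negative
        else:
            if a == positive:
                fn += 1   # predicted negative, really positive
            else:
                tn += 1   # predicted negative, really negative
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (accuracy, false-positive rate, false-negative rate)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, fpr, fnr


# Hypothetical labels, invented for illustration only.
actual = ["attack", "normal", "normal", "attack", "normal"]
predicted = ["attack", "attack", "normal", "normal", "normal"]
counts = confusion_counts(actual, predicted, positive="attack")
print(counts, rates(*counts))
```

The three rates can rank two classifiers differently, which is one way to read the "no clear best metric" point: which error matters more depends on the application.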

4 Approaches:
Naïve Bayes
K-Nearest Neighbor
Decision rules/Decision trees
Neural Networks

5 Naïve Bayes: History
Bayes classifier: From Probability Theory
Idea: A-posteriori probability of class given all inputs is best possible classifier
Problem: doesn’t generalize.
Solution: Bayesian Belief network
[Figure: belief network over nodes Y_1, Y_2, Y_3, Y_4]
P(X_i | Y) = P(Y_4 | Y_2, Y_3) P(Y_2 | Y_1) P(Y_3 | Y_1) P(Y_1)

6 Problems with Bayesian Belief Network
What should the network structure be?
–Some work in how to learn the structure
–Getting it wrong results in over-specificity
What are the probabilities?
–Learning techniques exist here
Computational cost to learn network

7 Naïve Bayes
Two-layer Bayes network
–No need to learn structure
–Assumes inputs are independent given the class
Learn the probabilities that work best on training data
[Figure: two-layer network with class node X and input nodes Y_1, Y_2, Y_3]
P(X | Y_1 … Y_n) ∝ P(X) * Π_i P(Y_i | X)
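The product P(X) * Π_i P(Y_i | X) can be sketched in a few lines for categorical inputs. This is an illustrative sketch, not the lecture's implementation: the function names, the toy data, and the add-one smoothing (used here so that unseen values do not zero out the product) are all assumptions. Logs are summed instead of multiplying probabilities to avoid numeric underflow.

```python
from collections import Counter, defaultdict
import math


def train_naive_bayes(samples, labels):
    """Collect the counts needed to estimate P(X) and P(Y_i | X)."""
    class_counts = Counter(labels)
    # feature_counts[(i, x)][v]: class-x samples whose i-th input equals v
    feature_counts = defaultdict(Counter)
    values = defaultdict(set)  # distinct values seen at each input position
    for y, x in zip(samples, labels):
        for i, v in enumerate(y):
            feature_counts[(i, x)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values


def classify(model, y):
    """Return the class X maximizing log P(X) + sum_i log P(Y_i | X)."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for x, n_x in class_counts.items():
        score = math.log(n_x / total)  # log P(X)
        for i, v in enumerate(y):
            # Add-one smoothing: an assumption of this sketch, not from the slides.
            p = (feature_counts[(i, x)][v] + 1) / (n_x + len(values[i]))
            score += math.log(p)       # log P(Y_i | X)
        if score > best_score:
            best_class, best_score = x, score
    return best_class


# Hypothetical training data: (protocol, traffic volume) -> label.
samples = [("tcp", "high"), ("tcp", "low"), ("udp", "low"), ("udp", "low")]
labels = ["attack", "normal", "normal", "normal"]
model = train_naive_bayes(samples, labels)
print(classify(model, ("tcp", "high")))
print(classify(model, ("udp", "low")))
```

Training is a single counting pass over the data, which reflects the "no need to learn structure" point: only the probabilities are estimated.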

8 K-Nearest Neighbor
Idea: Choose “closest” training item
–Class of test item is same as class of closest training item
–Need to define distance
What if this is a bad match?
–Find K closest items
–Use most common class in those K
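The two steps above (find the K closest items, take the most common class) can be sketched as follows. Euclidean distance, the function name, and the toy points are assumptions made for illustration; the slides leave the distance definition open. The sketch scans the whole training set on every query.

```python
from collections import Counter
import math


def knn_classify(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training items.

    training: list of (point, label) pairs, where point is a tuple of numbers.
    Distance is Euclidean here, which is an assumption of this sketch.
    """
    # Sort all training items by distance to the query and keep the k closest.
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional training data.
training = [
    ((0, 0), "normal"), ((0, 1), "normal"), ((1, 0), "normal"),
    ((5, 5), "attack"), ((6, 5), "attack"),
]
print(knn_classify(training, (0.5, 0.5), k=3))
print(knn_classify(training, (5, 6), k=3))
```

With k=1 this reduces to the single closest training item; a larger k trades sensitivity to one bad match for a smoother vote.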

9 KNN: Advantages
As training set → ∞, K → ∞, result approaches optimal
–View as “best probability over all samples”: this is Bayes theorem
Training simple
–Just put training set into a data structure

10 KNN: Problems
With small K, only captures convex classes
High dimensionality: may be “nearest” in irrelevant attributes
Query time: Search all training data
–Algorithms to make this faster
But good enough to be “standard” for comparison

11 Classification and Security
Ideas on how to use classifiers to improve security
–Intrusion Detection
–?
Potential risks
–Identifying private information based on similarity with training data

