Overview
Apriori Algorithm

For example, for the rule Socks ⇒ Tie:
Support = 50% (2/4), Confidence = 66.67% (2/3)

TID  Items
TX1  Shoes, Socks, Tie
TX2  Shoes, Socks, Tie, Belt, Shirt
TX3  Shoes, Tie
TX4  Shoes, Socks, Belt
Example
Five transactions from a supermarket (fralda = diaper, from the original Portuguese)

TID  Items
1    Beer, Diaper, Baby Powder, Bread, Umbrella
2    Diaper, Baby Powder
3    Beer, Diaper, Milk
4    Diaper, Beer, Detergent
5    Beer, Milk, Coca-Cola
Step 1
min_sup = 40% (2/5)

C1 (candidate 1-itemsets):
Item         Support
Beer         4/5
Diaper       4/5
Baby Powder  2/5
Bread        1/5
Umbrella     1/5
Milk         2/5
Detergent    1/5
Coca-Cola    1/5

L1 (frequent 1-itemsets):
Item         Support
Beer         4/5
Diaper       4/5
Baby Powder  2/5
Milk         2/5
Step 2 and Step 3

C2 (candidate 2-itemsets):
Itemset              Support
Beer, Diaper         3/5
Beer, Baby Powder    1/5
Beer, Milk           2/5
Diaper, Baby Powder  2/5
Diaper, Milk         1/5
Baby Powder, Milk    0

L2 (frequent 2-itemsets):
Itemset              Support
Beer, Diaper         3/5
Beer, Milk           2/5
Diaper, Baby Powder  2/5
Step 4
min_sup = 40% (2/5); no candidate in C3 is frequent, so L3 is empty

C3 (candidate 3-itemsets):
Itemset                    Support
Beer, Diaper, Baby Powder  1/5
Beer, Diaper, Milk         1/5
Beer, Milk, Baby Powder    0
Diaper, Baby Powder, Milk  0
Step 5
min_sup = 40%, min_conf = 70%

Rule                  Support(A, B)  Support(A)  Confidence
Beer ⇒ Diaper         60%            80%         75%
Beer ⇒ Milk           40%            80%         50%
Diaper ⇒ Baby Powder  40%            80%         50%
Diaper ⇒ Beer         60%            80%         75%
Milk ⇒ Beer           40%            40%         100%
Baby Powder ⇒ Diaper  40%            40%         100%
Results
Rules satisfying min_sup = 40% and min_conf = 70%:
Beer ⇒ Diaper and Diaper ⇒ Beer (support 60%, confidence 75%)
Milk ⇒ Beer and Baby Powder ⇒ Diaper (support 40%, confidence 100%)
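The Apriori steps above can be sketched in Python. This is a minimal illustration over the five supermarket transactions, not an optimized implementation; the function and variable names are our own.

```python
from itertools import combinations

# The five transactions from the example (min_sup = 40%, i.e. 2 of 5).
transactions = [
    {"Beer", "Diaper", "Baby Powder", "Bread", "Umbrella"},
    {"Diaper", "Baby Powder"},
    {"Beer", "Diaper", "Milk"},
    {"Diaper", "Beer", "Detergent"},
    {"Beer", "Milk", "Coca-Cola"},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(min_sup=0.4):
    items = sorted({i for t in transactions for i in t})
    frequent, k_sets = {}, [frozenset([i]) for i in items]
    while k_sets:
        # pruning step: keep only candidates meeting min_sup
        survivors = [s for s in k_sets if support(s) >= min_sup]
        frequent.update({s: support(s) for s in survivors})
        # join step: generate (k+1)-candidates from surviving k-itemsets
        k_sets = list({a | b for a in survivors for b in survivors
                       if len(a | b) == len(a) + 1})
    return frequent

freq = apriori()
# Rule generation: A -> B with confidence = support(A ∪ B) / support(A)
for s in freq:
    for r in range(1, len(s)):
        for a in map(frozenset, combinations(s, r)):
            conf = freq[s] / support(a)
            if conf >= 0.7:
                print(set(a), "->", set(s - a), f"conf={conf:.0%}")
```

Running this recovers exactly the four rules listed in the results: Beer ⇒ Diaper and Diaper ⇒ Beer at 75% confidence, Milk ⇒ Beer and Baby Powder ⇒ Diaper at 100%.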
Construct FP-tree from a Transaction Database

min_support = 3

TID  Items bought               (ordered) frequent items
100  f, a, c, d, g, i, m, p     f, c, a, m, p
200  a, b, c, f, l, m, o        f, c, a, b, m
300  b, f, h, j, o, w           f, b
400  b, c, k, s, p              c, b, p
500  a, f, c, e, l, p, m, n     f, c, a, m, p

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: the f-list
3. Scan DB again, construct the FP-tree

F-list = f-c-a-b-m-p

Header table (item, frequency): f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree:
{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1
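Steps 1-3 above can be sketched as follows. One pass counts the items and builds the f-list, a second pass re-orders each transaction; ties in frequency can be broken arbitrarily, so the code pins them to the slide's F-list order (an assumption made explicit below).

```python
from collections import Counter

# The transaction database from the slide (min_support = 3).
db = {
    100: ["f", "a", "c", "d", "g", "i", "m", "p"],
    200: ["a", "b", "c", "f", "l", "m", "o"],
    300: ["b", "f", "h", "j", "o", "w"],
    400: ["b", "c", "k", "s", "p"],
    500: ["a", "f", "c", "e", "l", "p", "m", "n"],
}

# Step 1: scan once, count item frequencies
counts = Counter(i for items in db.values() for i in items)

# Step 2: frequency-descending f-list. Ties in count can be broken
# arbitrarily; we pin them to the slide's convention F-list = f-c-a-b-m-p.
tie_order = "fcabmp"
f_list = sorted((i for i, c in counts.items() if c >= 3),
                key=lambda i: (-counts[i], tie_order.index(i)))

# Step 3 (first half): re-order each transaction's frequent items by the f-list;
# inserting these ordered lists into a prefix tree yields the FP-tree.
ordered = {tid: [i for i in f_list if i in items] for tid, items in db.items()}
print(f_list)        # ['f', 'c', 'a', 'b', 'm', 'p']
print(ordered[100])  # ['f', 'c', 'a', 'm', 'p']
```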
Find Patterns Having p From p-conditional Database

Starting at the frequent-item header table of the FP-tree
Traverse the FP-tree by following the link of each frequent item p
Accumulate all transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases:
Item  Conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees

For each pattern base:
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: the single path {} → f:3 → c:3 → a:3 (b is dropped, since its count 1 < min_support)

All frequent patterns containing m:
m, fm, cm, am, fcm, fam, cam, fcam → associations
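Because the m-conditional FP-tree above is a single path f:3-c:3-a:3, every frequent pattern containing m is simply m together with a subset of {f, c, a}. A small sketch of that enumeration:

```python
from itertools import combinations

# Items on the single path of the m-conditional FP-tree, each with count 3.
path = ["f", "c", "a"]
patterns = []
for r in range(len(path) + 1):
    for subset in combinations(path, r):
        # each subset of the path, plus m itself, is a frequent pattern
        patterns.append(frozenset(subset) | {"m"})
print(sorted("".join(sorted(p)) for p in patterns))
# ['acfm', 'acm', 'afm', 'am', 'cfm', 'cm', 'fm', 'm']
```

These are exactly the eight patterns listed on the slide: m, fm, cm, am, fcm, fam, cam, fcam.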
The Data Warehouse Toolkit, Ralph Kimball, Margy Ross, 2nd ed, 2002
k-means Clustering
Cluster centers c1, c2, ..., ck with clusters C1, C2, ..., Ck
Error
The squared-error function is E = Σ j=1..k Σ x∈Cj ||x − cj||²
E has a local minimum when each cluster center cj equals the centroid (mean) of its cluster Cj
k-means Example (k = 2)
Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!
Algorithm
Random initialization of k cluster centers
do {
  - assign each xi in the dataset to the nearest cluster center (centroid) cj according to d2
  - compute all new cluster centers
} until ( |E_new − E_old| < ε  or  number of iterations ≥ max_iterations )
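The algorithm above can be sketched directly in Python; this is a minimal version for points in the plane (names and the toy data are our own), with d2 the squared Euclidean distance.

```python
import random

def d2(p, q):
    # squared Euclidean distance
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, eps=1e-6, max_iterations=100):
    centers = random.sample(points, k)   # random initialization of k centers
    e_old = float("inf")
    for _ in range(max_iterations):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: d2(x, centers[j]))
            clusters[j].append(x)
        # update step: recompute each center as the mean of its cluster
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
        e_new = sum(d2(x, centers[j]) for j, cl in enumerate(clusters) for x in cl)
        if abs(e_new - e_old) < eps:     # |E_new - E_old| < eps: converged
            break
        e_old = e_new
    return centers, clusters

random.seed(1)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, 2)
print(sorted(centers))
```

On these two well-separated blobs the centers converge to the blob means (1/3, 1/3) and (31/3, 31/3).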
k-Means vs Mixture of Gaussians
Both are iterative algorithms that assign points to clusters
k-Means: minimize the total squared distance of points to their cluster centers
Mixture of Gaussians: maximize the likelihood P(x|C=i)
The mixture of Gaussians is the more general formulation
It is equivalent to k-Means when every covariance is the identity, Σi = I
Tree Clustering
Tree clustering algorithms allow us to reveal the internal similarities of a given pattern set
and to structure these similarities hierarchically
Usually applied to a small set of typical patterns
For n patterns these algorithms generate a sequence of 1 to n clusters
Example
The similarity between two clusters is assessed by measuring the similarity of the furthest pair of patterns (one from each cluster)
This is the so-called complete-linkage rule
Impact of cluster distance measures
"Single-Link": inter-cluster distance = distance between the closest pair of points
"Complete-Link": inter-cluster distance = distance between the farthest pair of points
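The two rules can be contrasted on a tiny example (the point coordinates below are made up for illustration):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def single_link(A, B):
    # distance between the closest pair of points, one from each cluster
    return min(dist(a, b) for a in A for b in B)

def complete_link(A, B):
    # distance between the farthest pair of points, one from each cluster
    return max(dist(a, b) for a in A for b in B)

A = [(0, 0), (0, 3)]
B = [(4, 0), (4, 3)]
print(single_link(A, B))    # 4.0, e.g. the pair (0,0)-(4,0)
print(complete_link(A, B))  # 5.0, the pair (0,0)-(4,3)
```

Single-link tends to produce elongated, chained clusters; complete-link favors compact ones.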
There are two criteria proposed for clustering evaluation and selection of an optimal clustering scheme (Berry and Linoff, 1996):
Compactness: the members of each cluster should be as close to each other as possible. A common measure of compactness is the variance, which should be minimized.
Separation: the clusters themselves should be widely spaced.
Dunn index
The Davies-Bouldin (DB) index (1979)
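The slides' formulas for the two indices did not survive extraction; the standard definitions are as follows, where d(Ci, Cj) is an inter-cluster distance, diam(Cl) the diameter of cluster Cl, and σi the average distance of the points of cluster i to its centroid ci. A good partition has a large Dunn index and a small DB index.

```latex
% Dunn index (higher is better: compact, well-separated clusters)
D = \min_{1 \le i \le k} \;\; \min_{\substack{1 \le j \le k \\ j \neq i}}
    \left\{ \frac{d(C_i, C_j)}{\max_{1 \le l \le k} \operatorname{diam}(C_l)} \right\}

% Davies-Bouldin index (lower is better)
DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i}
     \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}
```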
Pattern Classification, 2nd ed., Richard O. Duda, Peter E. Hart, and David G. Stork, Wiley-Interscience, 2001
Pattern Recognition: Concepts, Methods and Applications, Joaquim P. Marques de Sá, Springer-Verlag, 2001
3-Nearest Neighbors
For query point q, the 3 nearest neighbors are 2 of class x and 1 of class o, so q is assigned to class x
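A minimal k-NN sketch matching the figure (the sample coordinates are made up so that the query's three nearest neighbours are 2 x's and 1 o):

```python
from collections import Counter
from math import dist

def knn(query, data, k=3):
    # sort labelled points by distance to the query, take the k nearest
    neighbours = sorted(data, key=lambda pl: dist(query, pl[0]))[:k]
    # majority vote over the neighbours' labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

data = [((1, 1), "x"), ((2, 1), "x"), ((2.5, 2), "o"),
        ((0, 3), "x"), ((5, 5), "o"), ((6, 5), "o")]
print(knn((1.5, 1.5), data))  # 'x': 2 of the 3 nearest are x, 1 is o
```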
Machine Learning, Tom M. Mitchell, McGraw Hill, 1997
Bayes Naive Bayes
Example
Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result (+) in only 98% of the cases in which the disease is actually present, and a correct negative result (−) in only 97% of the cases in which the disease is not present.
Furthermore, only a small fraction of the entire population has this cancer.
Suppose a positive result (+) is returned...
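The posterior follows from Bayes' rule. The prior is missing from the slide; the value 0.008 used below is an assumption taken from Mitchell's version of this example (the textbook cited at the end of this section):

```python
# Bayes' rule for the lab-test example. P(cancer) = 0.008 is an assumed prior
# (from Mitchell's version of this example); 98% / 97% are given above.
p_cancer = 0.008
p_pos_given_cancer = 0.98      # correct positive rate
p_pos_given_healthy = 0.03     # 1 - correct negative rate: 3% false positives

# normalisation term P(+)
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_healthy * (1 - p_cancer))
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(f"P(cancer | +) = {p_cancer_given_pos:.3f}")   # ~ 0.21
```

Despite the positive test, the posterior probability of cancer is only about 21%, because the disease is so rare.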
Normalization The result of Bayesian inference depends strongly on the prior probabilities, which must be available in order to apply the method
Belief Networks

Burglary:  P(B) = .001        Earthquake:  P(E) = .002

Alarm:
Burg.  Earth.  P(A)
t      t       .95
t      f       .94
f      t       .29
f      f       .001

JohnCalls:            MaryCalls:
A   P(J)              A   P(M)
t   .90               t   .70
f   .05               f   .01
Full Joint Distribution
P(x1, ..., xn) = Π i=1..n P(xi | parents(Xi))
For this network: P(j, m, a, b, e) = P(j | a) P(m | a) P(a | b, e) P(b) P(e)
P(Burglary | JohnCalls = true, MaryCalls = true)
The hidden variables of the query are Earthquake and Alarm
For Burglary = true in the Bayesian network
P(b) is constant and can be moved outside the summations; the P(e) term can be moved outside the summation over a
Given JohnCalls = true and MaryCalls = true, the probability that a burglary has occurred is about 28%
Computation for Burglary=true
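The full computation can be sketched as inference by enumeration: sum the joint distribution over the hidden variables Earthquake and Alarm, for both values of Burglary, then normalize. The CPT values are those from the network above.

```python
from itertools import product

# CPTs of the burglary network (evidence: JohnCalls = MaryCalls = true)
P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}   # P(JohnCalls = true | Alarm)
P_m = {True: 0.70, False: 0.01}   # P(MaryCalls = true | Alarm)

def joint(b, e, a):
    # P(b) P(e) P(a | b, e) P(j | a) P(m | a), with j = m = true
    pb = P_b if b else 1 - P_b
    pe = P_e if e else 1 - P_e
    pa = P_a[(b, e)] if a else 1 - P_a[(b, e)]
    return pb * pe * pa * P_j[a] * P_m[a]

# sum out the hidden variables Earthquake and Alarm
unnorm = {b: sum(joint(b, e, a) for e, a in product([True, False], repeat=2))
          for b in (True, False)}
posterior = unnorm[True] / (unnorm[True] + unnorm[False])
print(f"P(Burglary | j, m) = {posterior:.3f}")   # 0.284
```

This reproduces the roughly 28% figure quoted above.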
Artificial Intelligence: A Modern Approach, 2nd ed., S. Russell and P. Norvig, Prentice Hall, 2003
ID3 - Tree learning
The credit history loan table has the following class distribution:
p(risk is high) = 6/14
p(risk is moderate) = 3/14
p(risk is low) = 5/14
In the credit history loan table we make income the property tested at the root
This makes the division into C1 = {1, 4, 7, 11}, C2 = {2, 3, 12, 14}, C3 = {5, 6, 8, 9, 10, 13}
gain(income) = I(credit_table) − E(income)
gain(income) = 1.531 − 0.564 = 0.967 bits
gain(credit history) = 0.266
gain(debt) = 0.581
gain(collateral) = 0.756
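The gain computation can be sketched as follows. The overall class distribution (6 high, 3 moderate, 5 low) is given above; the per-partition class counts are an assumption taken from Luger's credit table, which the slides do not reproduce.

```python
from math import log2

def info(counts):
    # entropy I of a class distribution, in bits
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

# whole table: 6 high, 3 moderate, 5 low (from the slide)
i_table = info([6, 3, 5])                  # ~ 1.531 bits

# class counts inside the income partitions C1, C2, C3; these per-partition
# counts are assumed from Luger's table: C1 all high, C2 2 high + 2 moderate,
# C3 1 moderate + 5 low
partitions = [[4], [2, 2], [1, 5]]
n = 14
e_income = sum(sum(p) / n * info(p) for p in partitions)
gain_income = i_table - e_income
print(f"I = {i_table:.3f}, E(income) = {e_income:.3f}, gain = {gain_income:.2f}")
```

This reproduces gain(income) ≈ 0.967 bits, the largest gain, which is why income is tested at the root.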
Overfitting
Consider the error of hypothesis h over
  Training data: error_train(h)
  Entire distribution D of data: error_D(h)
Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that
  error_train(h) < error_train(h')  and  error_D(h) > error_D(h')
An ID3 tree consistent with the data

Hair Color?
├── Blond → Lotion Used?
│   ├── No  → Sunburned (Sarah, Annie)
│   └── Yes → Not Sunburned (Dana, Katie)
├── Red   → Sunburned (Emily)
└── Brown → Not Sunburned (Alex, Pete, John)
Corresponding rules by C4.5
If the person's hair is blond and the person uses lotion, then nothing happens
If the person's hair is blond and the person uses no lotion, then the person turns red
If the person's hair is red, then the person turns red
If the person's hair is brown, then nothing happens
Default rule
If the person uses lotion, then nothing happens
If the person's hair is brown, then nothing happens
If no other rule applies, then the person turns red
Artificial Intelligence, Patrick Henry Winston, Addison-Wesley, 1992
Artificial Intelligence: Structures and Strategies for Complex Problem Solving, 2nd ed., G. F. Luger and W. A. Stubblefield, Benjamin/Cummings, 1993
Machine Learning, Tom M. Mitchell, McGraw Hill, 1997
Perceptron Limitations Gradient descent
XOR problem and Perceptron
Shown by Minsky and Papert in the mid-1960s
Gradient Descent
To understand, consider the simpler linear unit, where o = w0 + w1 x1 + ... + wn xn
Let's learn the wi that minimize the squared error E[w] = ½ Σ d∈D (t_d − o_d)²
over the training set D = {(x1, t1), (x2, t2), ..., (xd, td), ..., (xm, tm)}  (t for target)
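Batch gradient descent for this linear unit can be sketched as follows; the weight update is w_i ← w_i + η Σ_d (t_d − o_d) x_id, the negative gradient of E. The toy data and learning rate are our own.

```python
# Gradient descent for a linear unit o = w . x with squared error
# E[w] = 1/2 * sum_d (t_d - o_d)^2.
def train(data, n_weights, eta=0.05, epochs=500):
    w = [0.0] * n_weights
    for _ in range(epochs):
        grads = [0.0] * n_weights
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))   # linear unit output
            for i, xi in enumerate(x):
                grads[i] += (t - o) * xi               # accumulate -dE/dw_i
        w = [wi + eta * g for wi, g in zip(w, grads)]  # batch update
    return w

# toy targets generated by t = 1 + 2*x, with x0 = 1 as the bias input
data = [((1.0, x), 1 + 2 * x) for x in (0.0, 1.0, 2.0, 3.0)]
w = train(data, 2)
print([round(wi, 3) for wi in w])   # converges to [1.0, 2.0]
```

Since the error surface of a linear unit is quadratic with a single global minimum, gradient descent with a small enough η converges to the exact weights.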
Feed-forward networks Back-Propagation Activation Functions
[Figure: feed-forward network with inputs x1, ..., x5 feeding unit xk]
In our example E becomes E[w] = ½ Σ d∈D Σ k∈outputs (t_kd − o_kd)²
E[w] is differentiable given that f is differentiable
Gradient descent can be applied
RBF-network
RBF-networks Support Vector Machines
Extension to Non-linear Decision Boundary
Map the input space to a higher-dimensional feature space via φ: input space → feature space
Possible problems of the transformation: high computational burden, and it is hard to get a good estimate
SVM solves these two issues simultaneously:
Kernel tricks for efficient computation
Minimizing ||w||² can lead to a "good" classifier
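The kernel trick can be illustrated with the quadratic kernel K(x, z) = (x · z)² in two dimensions, whose explicit feature map is φ(x) = (x1², √2·x1·x2, x2²): the kernel evaluates the feature-space inner product without ever computing φ. The sample vectors below are arbitrary.

```python
from math import sqrt, isclose

def kernel(x, z):
    # quadratic kernel: square of the input-space inner product
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    # explicit feature map corresponding to the quadratic kernel in 2-D
    return (x[0] ** 2, sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = kernel(x, z)                                  # kernel in input space
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))    # inner product in feature space
print(isclose(lhs, rhs))                            # True: the two agree
```

In an SVM this is what makes training in a high-dimensional (even infinite-dimensional) feature space tractable: only kernel values are ever needed.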
Machine Learning, Tom M. Mitchell, McGraw Hill, 1997
Neural Networks, Simon Haykin, 2nd ed., Prentice Hall, 1999