Published by Briana Osborne. Modified over 6 years ago.
1
Machine Learning in Python: Scikit-learn 2
Prof. Muhammad Saeed
2
Machine Learning (Dimensionality Reduction)
Principal Component Analysis (PCA): Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. In scikit-learn, PCA is implemented as linear dimensionality reduction that uses the Singular Value Decomposition (SVD) of the data to project it onto a lower-dimensional space. Because PCA ignores class labels, it is not always the best dimensionality-reduction technique for classification.
4/3/2018 Machine Learning FUUAST
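The SVD route described above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the helper name pca_svd and the toy data X_demo are hypothetical, and the only assumption is that PCA is applied to mean-centered data. The final print shows that the projected variables are uncorrelated (off-diagonal covariance close to zero).

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X onto its first n_components principal components via SVD."""
    X_centered = X - X.mean(axis=0)  # PCA works on zero-mean columns
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T  # coordinates in the reduced space
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance

# Hypothetical toy data: two strongly correlated variables plus one noise variable.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X_demo = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200), rng.normal(size=200)])

scores, components, explained_variance = pca_svd(X_demo, n_components=2)
cov = np.cov(scores, rowvar=False)  # covariance of the projected variables
print(np.round(cov, 6))
```

The diagonal of cov holds the variance captured by each component, which is why the components come out ordered from most to least variance.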
3
Machine Learning (Dimensionality Reduction)
PCA Example:
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import decomposition
    from sklearn import datasets

    iris = datasets.load_iris()
    X1 = iris.data
    y = iris.target

    # Project the four iris features onto three principal components
    pca = decomposition.PCA(n_components=3)
    pca.fit(X1)
    X = pca.transform(X1)

    # 3D axes filling the whole figure
    fig = plt.figure(1, figsize=(8, 5))
    ax = fig.add_subplot(111, projection='3d', elev=48, azim=130)
    ax.set_position([0, 0, 1, 1])
4
Machine Learning (Dimensionality Reduction)
PCA:
    # Label each species at the mean position of its points
    for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
        ax.text3D(X[y == label, 0].mean(),
                  X[y == label, 1].mean() + 1.5,
                  X[y == label, 2].mean(),
                  name,
                  horizontalalignment='center',
                  bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))

    # Reorder the labels to have colors matching the cluster results
    y = np.choose(y, [1, 2, 0]).astype(float)
    ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.jet, edgecolor='k')

    # Hide the tick labels on all three axes
    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    plt.show()
5
Machine Learning (Dimensionality Reduction)
Linear Discriminant Analysis (LDA): Linear Discriminant Analysis (LDA) finds a linear combination of features that characterizes or separates classes. The resulting combination is used for dimensionality reduction before classification. PCA (unsupervised) looks for the orthogonal component axes of maximum variance in a dataset, whereas LDA (supervised) looks for the feature subspace that optimizes class separability. LDA does not always perform better than PCA, particularly when the training set is small.
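The contrast between the two goals can be made concrete by measuring class separability along the first PCA axis and the first LDA axis of the iris data. The sketch below is illustrative only: fisher_ratio is a hypothetical helper (between-class scatter divided by within-class scatter of a one-dimensional projection), chosen here as one reasonable way to quantify "class separability".

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fisher_ratio(z, y):
    """Between-class scatter divided by within-class scatter for a 1-D projection z."""
    overall_mean = z.mean()
    classes = np.unique(y)
    between = sum(np.sum(y == k) * (z[y == k].mean() - overall_mean) ** 2 for k in classes)
    within = sum(np.sum((z[y == k] - z[y == k].mean()) ** 2) for k in classes)
    return between / within

X, y = load_iris(return_X_y=True)

# PCA never sees y; LDA uses y to choose its projection.
z_pca = PCA(n_components=1).fit_transform(X)[:, 0]
z_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)[:, 0]

ratio_pca = fisher_ratio(z_pca, y)
ratio_lda = fisher_ratio(z_lda, y)
print('PCA axis separability:', round(ratio_pca, 2))
print('LDA axis separability:', round(ratio_lda, 2))
```

Note that LDA can return at most (number of classes - 1) components, so two is the maximum for the three iris species.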
6
Machine Learning (Dimensionality Reduction)
Linear Discriminant Analysis: [Figure]
7
Machine Learning (Dimensionality Reduction)
LDA:
    import matplotlib.pyplot as plt
    from sklearn import datasets
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    target_names = iris.target_names

    # Unsupervised projection onto two principal components
    pca = PCA(n_components=2)
    X_r = pca.fit(X).transform(X)

    # Supervised projection onto two discriminant axes (uses y)
    lda = LinearDiscriminantAnalysis(n_components=2)
    X_r2 = lda.fit(X, y).transform(X)

    # Percentage of variance explained by each component
    print('explained variance ratio (first two components): %s'
          % str(pca.explained_variance_ratio_))
8
Machine Learning (Dimensionality Reduction)
LDA:
    # First figure: PCA projection, one color per species
    plt.figure()
    colors = ['red', 'darkgreen', 'blue']
    for clr, i, target_name in zip(colors, [0, 1, 2], target_names):
        plt.scatter(X_r[y == i, 0], X_r[y == i, 1], color=clr, alpha=.8, lw=2,
                    label=target_name)
    plt.legend(loc='best', shadow=False, scatterpoints=1)
    plt.title('PCA of IRIS dataset')

    # Second figure: LDA projection, same colors
    plt.figure()
    for color, i, target_name in zip(colors, [0, 1, 2], target_names):
        plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], alpha=.8, color=color,
                    label=target_name)
    plt.legend(loc='best', shadow=False, scatterpoints=1)
    plt.title('LDA of IRIS dataset')
    plt.show()
9
Unsupervised Machine Learning
Clustering: "Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics." (Wikipedia, "Cluster analysis")
10
Unsupervised Machine Learning
Clustering Using KMeans Algorithm: The k-means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like:
    - The "cluster center" is the arithmetic mean of all the points belonging to the cluster.
    - Each point is closer to its own cluster center than to other cluster centers.
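These two conditions suggest a simple alternating procedure, usually called Lloyd's algorithm: assign each point to its nearest center, then move each center to the mean of its points, and repeat until nothing changes. The following is a minimal NumPy sketch under that assumption, not scikit-learn's KMeans; the function name kmeans_lloyd and the toy data X_toy are hypothetical.

```python
import numpy as np

def kmeans_lloyd(X, init_centers, n_iter=100):
    """Alternate assignment and mean-update steps until the centers stop moving."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy data: two well-separated groups of three points each.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = kmeans_lloyd(X_toy, init_centers=X_toy[[0, 3]])
print(labels)
print(centers)
```

The result depends on the initial centers, which is why library implementations typically run several initializations and keep the best one.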
11
Unsupervised Machine Learning
Clustering Using KMeans Algorithm: where ‘||xi - vj||’ is the Euclidean distance between xi and vj, ‘ci’ is the number of data points in the ith cluster, and ‘c’ is the number of cluster centers.
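The symbols defined above match the usual k-means objective, the within-cluster sum of squared Euclidean distances. Assuming that is the formula intended here, it can be written as follows, with the cluster index renamed to j so that the inner sum runs over the c_j points x_i assigned to the cluster whose center is v_j:

```latex
J(V) = \sum_{j=1}^{c} \sum_{i=1}^{c_j} \lVert x_i - v_j \rVert^{2}
```

k-means looks for the set of centers V = {v_1, ..., v_c} that makes J(V) as small as possible.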
12
Unsupervised Machine Learning
KMeans Algorithm Step-By-Step:

    Subject    A      B
    1          1.0    1.0
    2          1.5    2.0
    3          3.0    4.0
    4          5.0    7.0
    5          3.5    5.0
    6          4.5    5.0
    7          3.5    4.5

               Individual    Mean Vector (centroid)
    Group 1    1             (1.0, 1.0)
    Group 2    4             (5.0, 7.0)
13
Unsupervised Machine Learning
KMeans Algorithm Step-By-Step:

            Cluster 1                               Cluster 2
    Step    Individual    Mean Vector (centroid)    Individual    Mean Vector (centroid)
    1       1             (1.0, 1.0)                4             (5.0, 7.0)
    2       1, 2          (1.2, 1.5)                4             (5.0, 7.0)
    3       1, 2, 3       (1.8, 2.3)                4             (5.0, 7.0)
    4       1, 2, 3       (1.8, 2.3)                4, 5          (4.2, 6.0)
    5       1, 2, 3       (1.8, 2.3)                4, 5, 6       (4.3, 5.7)
    6       1, 2, 3       (1.8, 2.3)                4, 5, 6, 7    (4.1, 5.4)

                 Individual    Mean Vector (centroid)
    Cluster 1    1, 2, 3       (1.8, 2.3)
    Cluster 2    4, 5, 6, 7    (4.1, 5.4)
14
Unsupervised Machine Learning
KMeans Algorithm Step-By-Step:

    Individual    Distance to mean (centroid)    Distance to mean (centroid)
                  of Cluster 1                   of Cluster 2
    1             1.5                            5.4
    2             0.4                            4.3
    3             2.1                            1.8
    4             5.7                            1.8
    5             3.2                            0.7
    6             3.8                            0.6
    7             2.8                            1.1

                 Individual       Mean Vector (centroid)
    Cluster 1    1, 2             (1.3, 1.5)
    Cluster 2    3, 4, 5, 6, 7    (3.9, 5.1)
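The distance table and the resulting reassignment can be checked with a few lines of NumPy. This sketch assumes the seven (A, B) scores are (1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0) and (3.5, 4.5), values that are consistent with every centroid quoted in the walkthrough, and it uses the rounded centroids (1.8, 2.3) and (4.1, 5.4) exactly as given.

```python
import numpy as np

# Assumed (A, B) scores of individuals 1..7 (see the note above).
points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
                   [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
# Rounded centroids of Cluster 1 = {1, 2, 3} and Cluster 2 = {4, 5, 6, 7}.
centroids = np.array([[1.8, 2.3], [4.1, 5.4]])

# Euclidean distance from every individual to both centroids (7 x 2 array).
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
nearest = distances.argmin(axis=1)  # 0 means Cluster 1, 1 means Cluster 2

# Recompute each centroid from the individuals now assigned to it.
new_centroids = np.array([points[nearest == k].mean(axis=0) for k in (0, 1)])

for i, (d1, d2) in enumerate(distances, start=1):
    print(f"individual {i}: {d1:.1f} {d2:.1f} -> Cluster {nearest[i - 1] + 1}")
print(np.round(new_centroids, 2))
```

Only individual 3 changes cluster, because it is nearer to the centroid of Cluster 2 than to the centroid of its own cluster.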
15
Unsupervised Machine Learning
Clustering Using KMeans Algorithm:
    from sklearn.cluster import KMeans
    from sklearn import metrics
    import numpy as np
    import matplotlib.pyplot as plt

    x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
    x2 = np.array([5, 4, 6, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])

    # Plot the raw, unlabeled points
    plt.figure()
    plt.xlim([0, 10])
    plt.ylim([0, 10])
    plt.title('Dataset')
    plt.scatter(x1, x2)
    plt.show()

    # Stack the coordinates into an (n_samples, 2) array for scikit-learn
    X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)
    colors = ['b', 'g', 'r']
    markers = ['o', 'v', 's']
16
Unsupervised Machine Learning
Clustering Using KMeans Algorithm:
    K = 3
    kmeans_model = KMeans(n_clusters=K).fit(X)

    # Draw each point with the color and marker of its assigned cluster
    plt.figure()
    for i, l in enumerate(kmeans_model.labels_):
        plt.plot(x1[i], x2[i], color=colors[l], marker=markers[l], ls='None')
    plt.xlim([0, 10])
    plt.ylim([0, 10])
    plt.show()
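The example above fixes K = 3 in advance. One common way to compare several candidate values of K, offered here as a suggestion rather than as part of the original example, is the silhouette score from sklearn.metrics, which lies between -1 and 1 and is higher for compact, well-separated clusters. The range of K values tried and the random_state below are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn import metrics

x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 6, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
X = np.column_stack([x1, x2])

scores = {}
for k in range(2, 6):
    # Fixed random_state so repeated runs give the same clustering.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = metrics.silhouette_score(X, model.labels_)
    print(k, round(scores[k], 3))

best_k = max(scores, key=scores.get)
print('highest silhouette score at K =', best_k)
```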
17
Unsupervised Machine Learning
Clustering Using KMeans Algorithm:
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn import datasets

    np.random.seed(5)

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Three configurations: too many clusters, the right number,
    # and the right number with a single random initialization
    estimators = [('k_means_iris_8', KMeans(n_clusters=8)),
                  ('k_means_iris_3', KMeans(n_clusters=3)),
                  ('k_means_iris_bad_init', KMeans(n_clusters=3, n_init=1,
                                                   init='random'))]
18
Unsupervised Machine Learning
Clustering Using KMeans Algorithm:
    fignum = 1
    titles = ['8 clusters', '3 clusters', '3 clusters, bad initialization']
    for name, est in estimators:
        # One 3D figure per estimator
        fig = plt.figure(fignum, figsize=(4, 3))
        ax = fig.add_subplot(111, projection='3d', elev=48, azim=134)
        ax.set_position([0, 0, .95, 1])
        est.fit(X)
        labels = est.labels_

        # Color each point by its assigned cluster
        ax.scatter(X[:, 3], X[:, 0], X[:, 2],
                   c=labels.astype(float), edgecolor='k')

        ax.xaxis.set_ticklabels([])
        ax.yaxis.set_ticklabels([])
        ax.zaxis.set_ticklabels([])
        ax.set_xlabel('Petal width')
        ax.set_ylabel('Sepal length')
        ax.set_zlabel('Petal length')
        ax.set_title(titles[fignum - 1])
        ax.dist = 12
        fignum = fignum + 1
19
Unsupervised Machine Learning
Clustering Using KMeans Algorithm:
    # Final figure: the true species labels, for comparison
    fig = plt.figure(fignum, figsize=(4, 3))
    ax = fig.add_subplot(111, projection='3d', elev=48, azim=134)
    ax.set_position([0, 0, .95, 1])
    for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
        ax.text3D(X[y == label, 3].mean(),
                  X[y == label, 0].mean(),
                  X[y == label, 2].mean() + 2,
                  name,
                  horizontalalignment='center',
                  bbox=dict(alpha=.2, edgecolor='w', facecolor='w'))

    # Reorder the labels to have colors matching the cluster results
    y = np.choose(y, [1, 2, 0]).astype(float)
    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor='k')
    plt.show()
20
Unsupervised Machine Learning
Imputation of Missing Data: Imputation is the replacement of missing data with an appropriate value, such as the mean, median, or most frequent value of the column.
    import numpy as np
    from numpy import nan
    # SimpleImputer replaces the older sklearn.preprocessing.Imputer,
    # which was removed in scikit-learn 0.22
    from sklearn.impute import SimpleImputer

    X = np.array([[nan, 0, 3],
                  [3, 7, 9],
                  [3, 5, 2],
                  [4, nan, 6],
                  [8, 8, 1]])
    y = np.array([14, 16, -1, 8, -5])

    # Replace each NaN with the mean of its column
    imp = SimpleImputer(strategy='mean')
    X2 = imp.fit_transform(X)
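What the mean strategy computes can be verified by hand. The sketch below repeats the same array and fills each missing entry with the mean of the observed values in its column, using only NumPy; the names col_means and X_filled are illustrative.

```python
import numpy as np

X = np.array([[np.nan, 0, 3],
              [3, 7, 9],
              [3, 5, 2],
              [4, np.nan, 6],
              [8, 8, 1]], dtype=float)

# Column means computed over the observed (non-NaN) entries only.
col_means = np.nanmean(X, axis=0)

# Keep observed values; substitute the column mean where a value is missing.
X_filled = np.where(np.isnan(X), col_means, X)
print(col_means)
print(X_filled)
```

Swapping np.nanmean for np.nanmedian gives the median strategy in the same way.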
21
End