Machine Learning in Python Vandana Bachani Spring 2012.

1 Machine Learning in Python Vandana Bachani Spring 2012

2
• What is scikit-learn?
• How can it be useful to the lab?
• There are other packages too!
• Features
• Usage
• Conclusion

3 scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib).
• A comprehensive package for all machine learning needs.
• Faster
• Accuracy? If you have the right data, it is pretty reliable.
Ref:

5
• Our daily jobs:
  ◦ Regression/Prediction
  ◦ Text Classification
  ◦ Text Feature Extraction
  ◦ Text Feature Selection
    • Using Chi-Square and other metrics
  ◦ Cross-Validation
    • K-Fold
  ◦ Clustering (K-Means, etc.)
• Maybe in future:
  ◦ Image Classification
All in one package!

6
NLTK | Orange | scikit-learn
Machine Learning + Text Processing + … | Machine Learning + visualizations | Machine Learning + Machine Learning
Mature (book exists!) | Naïve and sophisticated | New, still developing
Documentation: not so great | Documentation: good, sufficient code examples | Documentation: very good, but incomplete
Lacks functionality (w.r.t. ML), old school | Lacks a lot of functionality (unsupervised learning) | Almost complete w.r.t. machine learning + additional utilities; good metrics support
Complicated to use | Easy to use | Easy and intuitive to use
Rest API | No API support

7 Linear Models
• Regression (Predicting Continuous Values)
  Example: Prices of houses (Boston house dataset)
  ◦ Linear, Ridge, Lasso (for sparse coefficients, useful in the field of compressed sensing), LARS (very high-dimensional data), Bayesian
• Classification
  ◦ Logistic Regression, Stochastic Gradient Descent
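To make the difference between plain linear regression and Ridge concrete, here is a minimal pure-Python sketch for a single feature. It is not scikit-learn's implementation: the data points and the `alpha` value are made up for illustration, and leaving the intercept unpenalized is an assumption of this sketch.

```python
def fit_line(xs, ys, alpha=0.0):
    """Fit y = slope * x + intercept for one feature.

    alpha = 0 gives ordinary least squares. alpha > 0 adds an L2 (ridge)
    penalty on the slope only, which shrinks the slope toward zero.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Centered sums: covariance-like and variance-like terms.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

ols = fit_line(xs, ys)               # no penalty
ridge = fit_line(xs, ys, alpha=5.0)  # penalized slope
print(ols, ridge)
```

With the penalty switched on, the slope is pulled below the true value of 2, which is the shrinkage effect that Ridge trades for stability.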

8 Support Vector Machines
• Classification
  ◦ SVC (one-vs-one), LinearSVC (one-vs-rest)
• Regression
  ◦ SVR
• Density Estimation & Outlier Detection (unsupervised learning)

9 Unsupervised Learning
• Clustering
  ◦ K-Means, Mean Shift, Spectral Clustering
  ◦ Ward (hierarchical, constructs tree)
• Manifold Learning
  ◦ Dimensionality Reduction (for visualization, etc.)
• Novelty and Outlier Detection
  ◦ Uses SVM
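As an illustration of what K-Means clustering does, the following is a small pure-Python sketch of the assign-then-update loop. It is not scikit-learn's implementation: the naive "first k points" initialization, the fixed iteration count, and the sample points are all assumptions chosen to keep the example deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=10):
    """Cluster points into k groups; returns (labels, centroids)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids = kmeans(points, k=2)
print(labels)
```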

10 Miscellaneous
• Nearest neighbors
  ◦ Unsupervised, Classification
• Decision Trees
  ◦ Classification, Regression
• Gaussian Processes
  ◦ Regression
• Metrics
  • metrics.roc_curve(y_true, y_score)
  • metrics.precision_recall_fscore_support(...)
• joblib and pickle
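For readers unfamiliar with the quantities behind the metrics helpers, here is a minimal pure-Python sketch of precision, recall, and F1 for a binary problem. The function name and the label lists are hypothetical, and this is an illustration of the definitions rather than the library's code.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```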

11
• Cross-Validation
  • cross_validation.KFold(n, k[, indices])
• Datasets
• Feature Extraction
  ◦ Text
    • feature_extraction.text.WordNGramAnalyzer([...])
    • feature_extraction.text.CharNGramAnalyzer([...])
  ◦ Image
    • feature_extraction.image.extract_patches_2d(...)
• Feature Selection
  • feature_selection.chi2(X, y)
  • feature_selection.SelectKBest(score_func[, k])
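The idea behind K-Fold cross-validation can be sketched in a few lines of pure Python: split the sample indices into k consecutive folds and use each fold once as the test set. The helper name `kfold_indices` is hypothetical, and this sketch does no shuffling, which is an assumption made to keep the output predictable.

```python
def kfold_indices(n, k):
    """Yield (train_indices, test_indices) for k consecutive folds over range(n)."""
    # The first n % k folds get one extra sample so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


folds = list(kfold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample lands in exactly one test fold, so each one is used for evaluation once and for training k - 1 times.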

12
• Linear Regression
>>> from sklearn import linear_model
>>> clf = linear_model.LinearRegression()
>>> clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
• Classification
>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = SGDClassifier(loss="hinge", penalty="l2")
>>> clf.fit(X, y)
SGDClassifier(alpha=0.0001, class_weight=None, eta0=0.0,
        fit_intercept=True, learning_rate='optimal', loss='hinge', n_iter=5,
        n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0,
        shuffle=False, verbose=0)
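To show roughly what `loss="hinge"` with `penalty="l2"` means, here is a pure-Python sketch of stochastic gradient descent on the hinge loss, run on the same two toy points. It is a simplified illustration, not scikit-learn's algorithm: the constant learning rate `lr`, the epoch count, and the unpenalized bias are assumptions of this sketch.

```python
def sgd_hinge(X, y, alpha=0.0001, lr=0.1, n_epochs=100):
    """Train a linear classifier with hinge loss and an L2 penalty."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            t = 1.0 if yi == 1 else -1.0  # map labels {0, 1} to {-1, +1}
            margin = t * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Gradient of the L2 penalty: shrink the weights slightly.
            w = [wj - lr * alpha * wj for wj in w]
            # Hinge loss only contributes when the margin is below 1.
            if margin < 1.0:
                w = [wj + lr * t * xj for wj, xj in zip(w, xi)]
                b += lr * t
    return w, b


def predict(w, b, x):
    """Return class 1 if the decision value is positive, else class 0."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score > 0 else 0


X = [[0.0, 0.0], [1.0, 1.0]]
y = [0, 1]
w, b = sgd_hinge(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```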

13
• SVC & Cross-Validation
>>> from sklearn import datasets
>>> from sklearn import svm
>>> from sklearn import cross_validation
>>> iris = datasets.load_iris()
>>> clf = svm.SVC(kernel='linear')
>>> scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores
array([ 1...., , , , 1....])
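The `cross_validation` module used on this slide was later replaced by `sklearn.model_selection`. Assuming a recent scikit-learn release is installed, the same experiment can be sketched as follows.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

# Load the iris dataset and evaluate a linear SVM with 5-fold cross-validation.
iris = datasets.load_iris()
clf = svm.SVC(kernel="linear")
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

# One accuracy value per fold, plus their average.
print(scores)
print(scores.mean())
```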

14
from time import time
from sklearn import metrics
from sklearn.feature_extraction.text import Vectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

# trainData, testData and categories are assumed to be loaded earlier

def classify(clf, X_train, y_train, X_test, y_test):
    # fit the classifier, predict on the test set, and report timings
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print "train time: %0.3fs" % train_time
    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print "test time: %0.3fs" % test_time
    print "classification report:"
    print metrics.classification_report(y_test, pred, target_names=categories)

data_train, data_test = trainData.data, testData.data
y_train, y_test = trainData.target, testData.target

print "Extracting features from the training dataset"
t0 = time()
#can use a specific analyzer to be passed to vectorizer
#by default WordNGramAnalyzer is used
vectorizer = Vectorizer()
X_train = vectorizer.fit_transform(data_train)
print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_train.shape

print "Extracting features from the test dataset"
t0 = time()
X_test = vectorizer.transform(data_test)
print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_test.shape

penalty = "l2"
#LinearSVC can be tried with L1, L2 penalties
print "LinearSVC"
linearSVC = LinearSVC(loss='l2', penalty=penalty, C=1000, dual=False, tol=1e-3)
classify(linearSVC, X_train, y_train, X_test, y_test)

#SGDClassifier
print "SGDClassifier"
sgdClf = SGDClassifier(alpha=.0001, n_iter=50, penalty=penalty)
classify(sgdClf, X_train, y_train, X_test, y_test)

print "NaiveBayes - Bernoulli"
bernoulliNBClf = BernoulliNB(alpha=.01)
classify(bernoulliNBClf, X_train, y_train, X_test, y_test)

15
SGDClassifier
train time: 1.505s
test time: 0.023s
classification report:
             precision    recall  f1-score   support
TECHNOLOGY
IDIOMS
POLITICAL
MUSIC
GAMES
SPORTS
MOVIES
CELEBRITY
avg / total
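The text-classification script calls `fit_transform` on the training documents but only `transform` on the test documents. The pure-Python sketch below illustrates why: the vocabulary is learned from the training set alone, and words seen only at test time are ignored. The function names, whitespace tokenization, and sample sentences are assumptions made for illustration. Raw counts stand in for whatever weighting the real vectorizer applies.

```python
from collections import Counter


def fit_vocabulary(documents):
    """Map each distinct lowercase token in the training set to a column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Turn documents into count vectors over a fixed vocabulary."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:  # tokens unseen during fitting are dropped
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows


# Hypothetical documents.
train_docs = ["the game was great", "great movie great cast"]
test_docs = ["the new movie"]

vocabulary = fit_vocabulary(train_docs)       # the "fit" part
X_train = transform(train_docs, vocabulary)   # the "transform" part
X_test = transform(test_docs, vocabulary)     # reuses the training vocabulary
print(vocabulary)
print(X_train)
print(X_test)
```

Because both matrices share the same columns, a classifier trained on `X_train` can be applied directly to `X_test`.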

16
• If you are a Python person:
  ◦ Seems like a good library
  ◦ NLTK + scikit-learn should make an excellent pair for our lab
• Good documentation wins!

17 Thanks

