
1 Data analysis tools Subrata Mitra and Jason Rahman

2 Scikit-learn
http://scikit-learn.org/stable/
– Python based
– The tutorial is quite good
– Easy to use, as the snippet below shows
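A minimal sketch of that ease of use (the toy data is made up): a model is built, fitted, and queried in four lines.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit([[0], [1], [2]], [0, 1, 2])   # three 1-D samples and their targets
print(model.predict([[3]]))             # -> [ 3.]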

3 Python for machine learning
Pros:
– Python is easy to learn and use, which simplifies data preparation code
– Python is a general-purpose language, so integrating ML into any Python application (servers, etc.) is simple
– A wide variety of complementary libraries is available: Pandas, NumPy, SciPy, Matplotlib, Seaborn
Cons:
– The most recent or sophisticated algorithms may not be available

4 Fit and predict
All models (classification and regression) implement at least two functions:
– fit(x, y): fit the model to the given dataset
– predict(x): predict the y values associated with the x values
Transformers (scaling, etc.) implement at least two functions:
– fit(x): fit the transform to an initial dataset
– transform(x): transform the given data based on the initial fitted data
A sketch of both interfaces follows.
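A sketch of both interfaces, assuming StandardScaler as the transformer (any scikit-learn transformer behaves the same way):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
y = np.array([1.0, 2.0, 3.0])

scaler = StandardScaler()
scaler.fit(X)                    # learn per-feature mean and std
X_scaled = scaler.transform(X)   # apply the learned scaling

model = LinearRegression()
model.fit(X_scaled, y)           # fit(x, y) on the transformed features
print(model.predict(X_scaled))   # predict(x)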

5 Cross validation
– K-fold
– Stratified k-fold: each set contains approximately the same percentage of samples of each target class as the complete set
– Leave-One-Out: each learning set is created by taking all the samples except one, the test set being the single sample left out
– Leave-P-Out: as above, but leaving out p samples at a time
– Shuffle & Split: samples are first shuffled and then split into a pair of train and test sets
– And many more
A sketch using these splitters is shown below.
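A sketch of these strategies; note that recent scikit-learn exposes them in sklearn.model_selection (older releases used sklearn.cross_validation):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier()

# 5-fold CV; swap in StratifiedKFold, LeaveOneOut(), LeavePOut(p=2), or ShuffleSplit()
cv = KFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(clf, X, y, cv=cv).mean())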

6 Grid search: parameter tuning
GridSearchCV exhaustively evaluates every combination in a grid of candidate parameter values, scoring each by cross-validation; a sketch follows.
Ref: http://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/
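A minimal GridSearchCV sketch in the spirit of the linked tutorial; the dataset and the alpha grid here are illustrative choices, not the tutorial's exact values:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)
param_grid = {'alpha': np.logspace(-4, 0, 5)}     # candidate penalty strengths
search = GridSearchCV(Ridge(), param_grid, cv=5)  # tries every value with 5-fold CV
search.fit(X, y)
print(search.best_params_, search.best_score_)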

7 Pipeline
Combines multiple stages of a machine learning pipeline into a single entity:
model = Pipeline([('poly', PolynomialFeatures(degree=2)), ('linear', LinearRegression(fit_intercept=True))])
model = model.fit(a, c)   # fit the whole pipeline (a: features, c: targets, from the referenced script)
print(model.named_steps['linear'].coef_)   # individual stages remain accessible via named_steps
Ref: https://github.com/subrata4096/regression/blob/master/regressFit.py

8 Preprocessing utilities
– Center around zero: scale
– Vectors with unit norm: normalize
– Scale features to lie between a given minimum and maximum value: MinMaxScaler
– Imputation of missing values: Imputer
Examples at: http://scikit-learn.org/stable/modules/preprocessing.html
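A short sketch of the four utilities (the input matrix is made up; the Imputer class is called SimpleImputer in recent scikit-learn):

import numpy as np
from sklearn.preprocessing import scale, normalize, MinMaxScaler
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])

X_filled = SimpleImputer(strategy='mean').fit_transform(X)   # fill the missing value
X_centered = scale(X_filled)                                 # zero mean, unit variance per column
X_unit = normalize(X_filled)                                 # unit-norm rows
X_01 = MinMaxScaler().fit_transform(X_filled)                # rescale each column to [0, 1]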

9 Commonly used techniques
– Decision trees (DecisionTreeClassifier, DecisionTreeRegressor)
– SVM (svm.SVC)
– Regression (LinearRegression, Ridge, Lasso, ElasticNet)
– Naive Bayes (GaussianNB, MultinomialNB, BernoulliNB)
All can be used with the same "fit"/"predict" style calls, as the sketch below shows.
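Because the interface is shared, swapping techniques is a one-line change; a sketch on the bundled iris data:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
for model in (DecisionTreeClassifier(), SVC(), GaussianNB()):
    model.fit(X, y)                                    # same call for every technique
    print(type(model).__name__, model.predict(X[:3]))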

10 A complete example:
from sklearn import preprocessing
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases

# inArr, OrigTargetArr, testArr, deg, and k are defined earlier in the original script
targetArr = preprocessing.scale(OrigTargetArr)
polyReg = Pipeline([('poly', PolynomialFeatures(degree=deg)), ('linear', Lasso(max_iter=2000))])
polyReg.fit(inArr, targetArr)
scores = cross_val_score(polyReg, inArr, targetArr, cv=k)
polyReg.predict(testArr)

11 Example: decision trees
http://scikit-learn.org/stable/modules/tree.html#tree
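The slide defers to the documentation; a minimal sketch along the lines of what it shows there:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3)   # cap the depth to keep the tree small
clf.fit(X, y)
print(clf.predict(X[:1]))         # predicted class for the first sample
print(clf.predict_proba(X[:1]))   # per-class probabilities at the reached leaf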

12 PCA: dimensionality reduction
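This slide is title-only in the transcript; a minimal PCA sketch for context, projecting the 4-D iris data onto its two strongest components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)     # keep the top two principal components
X_2d = pca.fit_transform(X)   # PCA follows the fit/transform interface
print(X_2d.shape, pca.explained_variance_ratio_)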

13 Old stuff

14 Linear regression

15 2-degree polynomial regression:
y(w, x) = w0 + w1*x1 + w2*x1^2 + w3*x2 + w4*x2^2 + w5*x1*x2
Basically, you solve it exactly the same way as linear regression, but with more internally generated features.
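A sketch showing how PolynomialFeatures generates exactly the terms in the formula above from two input features:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])       # one sample with features x1=2, x2=3
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))     # columns: 1, x1, x2, x1^2, x1*x2, x2^2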

16 Ordinary Least Squares, Ridge, and Lasso
– Ordinary Least Squares: fits coefficients by minimizing the residual sum of squares, with no penalty.
– Ridge regression: imposes a penalty on the size of the coefficients; the ridge coefficients minimize a penalized residual sum of squares. Alpha is a complexity parameter that controls the amount of shrinkage.
– LASSO (Least Absolute Shrinkage and Selection Operator): useful in some contexts due to its tendency to prefer solutions with fewer nonzero coefficients, effectively reducing the number of variables upon which the given solution is dependent.
A comparison sketch follows.
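A comparison sketch on the bundled diabetes data (the alpha values are illustrative); note how Lasso drives some coefficients to exactly zero:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = load_diabetes(return_X_y=True)
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_.round(1))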

17 Ref: http://puriney.github.io/numb/2013/07/06/normal-equations-gradient-descent-and-linear-regression/

