
1 MIT 802 Introduction to Data Platforms and Sources Lecture 1
Willem S. van Heerden

2 Tools for Data Science
Based on a career-website search, the most sought-after tools are: R, SQL, Python, Hadoop, SAS, Java, Hive, MATLAB, Pig, and C++.

3 Tools for Data Science R
- Programming language for statistical computing and graphics
- Initial version released in 1995
- Freely available under the GPL
- Interpreted, but implemented primarily in C and Fortran
- Many libraries, primarily focused on:
  - Statistical tests
  - Linear and nonlinear modelling
  - Classical statistical techniques
  - Time-series analysis
  - Clustering
  - And so on...
- Very well documented, but has a steep learning curve

4 Tools for Data Science Python
- Interpreted, high-level scripting language
- Initial version released in 1991
- Free interpreters are available for all major platforms
- Emphasis on code readability
- Large number of third-party libraries:
  - Database access
  - Web scraping
  - Statistical tests
  - Machine learning
  - And so on...
- Many examples online, but can be a bit slow

5 Tools for Data Science Python
- Often used when combining:
  - Web apps and data analysis
  - Production databases and data analysis
- Despite being a scripting language:
  - Works well for larger development projects
  - Often used for production code
- Several free IDEs are available

6 Tools for Data Science SAS
- Proprietary numerical-analysis software suite from the SAS Institute
- Development began in 1966; last stable release in 2013
- Runs on Windows, IBM mainframes, Unix/Linux, and OpenVMS
- Can access data from a wide variety of sources
- Used for:
  - Advanced analytics
  - Multivariate analysis
  - Business intelligence
  - Data management
  - Predictive analytics

7 Tools for Data Science SAS
- The current release provides:
  - A GUI with a point-and-click interface for non-technical users
  - The SAS language, with more advanced options for technical users
- Typical process:
  - Reads data from spreadsheets and databases
  - Performs statistical analyses
  - Outputs results as tables and graphs in RTF, HTML, or PDF documents

8 Tools for Data Science Scala
- A general-purpose programming language
- Design started in 2001, with the first public release in 2004
- Combines object-oriented and functional programming
- Interoperates with Java (runs on the JVM)
- Released under the BSD licence
- Offers support for streaming
  - Useful for real-time data analysis
- Apache Spark itself is implemented in Scala

9 Tools for Data Science gnuplot
- A command-line plotting tool
- Early development dates back to 1986
- Specifically designed for generating 2D and 3D plots of data
- Can also perform data fitting
- Plots can be scripted
- Produces some of the most professional-looking graphing results
- Very steep learning curve

10 Tools for Data Science Apache Hadoop
- Open-source software framework
- Allows distributed processing of large data sets across clusters of computers
- Uses simple programming models
- Designed with scalability in mind: from single servers to thousands of computers, each providing local computation and storage

11 Tools for Data Science Apache Hadoop
- Storage component: the Hadoop Distributed File System (HDFS)
- Processing component: the MapReduce programming model
  - Splits files into large blocks
  - Distributes the blocks to cluster nodes
  - Transfers packaged code to the nodes
  - Data is processed in parallel at the nodes
- Hardware failures
  - Likely to be common, and should be handled automatically
  - Hadoop detects and handles failures at the application level
- Many extensions are available
- A minimal word-count sketch in the MapReduce style is given below
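To make the MapReduce model concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets plain Python scripts act as the mapper and reducer. The file names mapper.py and reducer.py, and the word-count task itself, are illustrative choices and not part of the lecture material.

# mapper.py -- a minimal Hadoop Streaming mapper (word-count sketch)
# Reads lines of text from standard input and emits one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + '\t' + '1')

Hadoop sorts the mapper output by key before it reaches the reducer, so all counts for the same word arrive together:

# reducer.py -- a minimal Hadoop Streaming reducer (word-count sketch)
# Accumulates counts for each word from the sorted mapper output.
import sys

currentWord = None
currentCount = 0

for line in sys.stdin:
    word, count = line.strip().split('\t')
    if word == currentWord:
        currentCount += int(count)
    else:
        if currentWord is not None:
            print(currentWord + '\t' + str(currentCount))
        currentWord = word
        currentCount = int(count)

if currentWord is not None:
    print(currentWord + '\t' + str(currentCount))

Such a job would typically be submitted through the Hadoop Streaming JAR, passing the two scripts as the -mapper and -reducer arguments; Hadoop itself handles splitting the input into blocks, shuffling the intermediate key-value pairs, and rerunning tasks on failed nodes.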

12 Tools for Data Science Apache Spark
- An open-source cluster-computing framework
- Initial development began in 2009 at UC Berkeley's AMPLab; became a top-level Apache project in 2014
- Provides an interface for programming entire clusters
- Includes implicit data parallelism and fault tolerance
- Runs alongside or on top of Hadoop
- Developed in response to limitations of MapReduce, to facilitate:
  - Algorithms that iterate over a data set
  - Repeated, database-style querying of data

13 Tools for Data Science Apache Spark
- Requires a cluster manager, for example:
  - The native standalone Spark cluster manager
  - Hadoop YARN
  - Apache Mesos
- Requires a distributed storage system, for example:
  - The Hadoop Distributed File System (HDFS)
  - Amazon S3 (Simple Storage Service)
- A minimal PySpark session that ties these together is sketched below
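As a minimal sketch of how these pieces fit together in a Python program, the snippet below starts a PySpark session against a YARN cluster manager and reads a file from HDFS. The master setting, application name, and file path are assumptions made for illustration; a standalone master URL or an S3 path would work equally well.

# A minimal PySpark session sketch (master URL, app name, and path are illustrative)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('yarn')               # cluster manager: YARN here; could be a standalone or Mesos URL
         .appName('MIT802Example')     # hypothetical application name
         .getOrCreate())

# Read a CSV file from distributed storage (an HDFS path here; an s3a:// path also works)
df = spark.read.csv('hdfs:///data/pageSpeeds.csv', inferSchema=True)
df.show(5)

spark.stop()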

14 Tools for Data Science Apache Spark
- Spark MLlib: a distributed machine learning framework
- Includes many common learning and statistical algorithms:
  - Feature extraction and transformation functions
  - Classification and regression
    - Support Vector Machines (SVMs)
    - Logistic and linear regression
    - Decision trees
  - Cluster analysis
    - k-means and others
  - Optimization algorithms
- A small k-means example using the MLlib DataFrame API is sketched below
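To illustrate the MLlib API from Python, the sketch below clusters a two-column data set with k-means using the DataFrame-based pyspark.ml interface. The file name s1.csv and the choice of 15 clusters mirror the scikit-learn k-means example later in these slides; the column names are assumed purely for illustration.

# k-means with Spark MLlib (DataFrame-based API); file and column names are illustrative
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName('MLlibKMeansSketch').getOrCreate()

# Read a headerless two-column CSV of points and name the columns
points = spark.read.csv('s1.csv', inferSchema=True).toDF('x', 'y')

# Assemble the two columns into a single feature vector and standardise it
assembler = VectorAssembler(inputCols=['x', 'y'], outputCol='rawFeatures')
scaler = StandardScaler(inputCol='rawFeatures', outputCol='features',
                        withMean=True, withStd=True)
assembled = assembler.transform(points)
scaled = scaler.fit(assembled).transform(assembled)

# Cluster the points and show the cluster assigned to each one
kmeans = KMeans(k=15, featuresCol='features', predictionCol='cluster')
model = kmeans.fit(scaled)
model.transform(scaled).select('x', 'y', 'cluster').show(5)

spark.stop()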

15 Python Useful Python libraries
- NumPy
  - Large, multi-dimensional arrays and matrices
  - Mathematical functions that operate on these arrays and matrices
- Matplotlib
  - Plotting library for Python and NumPy
  - Pyplot provides a MATLAB-like interface

16 Python: A Simple Scatter Plot
import matplotlib.pyplot as plt
import numpy as np

# Load a two-column CSV file and split it into page speeds and purchase amounts
fileContent = np.loadtxt('pageSpeeds.csv', dtype=float, delimiter=',')
pageSpeeds, purchaseAmounts = np.split(fileContent, 2, 1)
x = pageSpeeds.flatten()
y = purchaseAmounts.flatten()

# Plot purchase amount against page speed
plt.scatter(x, y)
plt.show()

17 Python: A Simple Scatter Plot

18 Python Useful Python libraries
- Scikit-learn
  - Regression algorithms
  - Classification algorithms
  - Clustering algorithms

19 Python: Least Squares Regression
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import r2_score

# Load the data and split it into the two variables
fileContent = np.loadtxt('pageSpeeds.csv', dtype=float, delimiter=',')
pageSpeeds, purchaseAmounts = np.split(fileContent, 2, 1)
x = pageSpeeds.flatten()
y = purchaseAmounts.flatten()

# Fit a degree-3 polynomial by least squares
poly = np.poly1d(np.polyfit(x, y, 3))

# Plot the data and the fitted curve
plt.scatter(x, y)
xp = np.linspace(0, 7, 100)
plt.plot(xp, poly(xp), c='r')
plt.show()

# Report the goodness of fit
r2 = r2_score(y, poly(x))
print('R-squared score: ' + str(r2))

20 Python: Least Squares Regression

21 Python: Training and Test Sets
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import r2_score

fileContent = np.loadtxt('pageSpeeds.csv', dtype=float, delimiter=',')
pageSpeeds, purchaseAmounts = np.split(fileContent, 2, 1)
x = pageSpeeds.flatten()
y = purchaseAmounts.flatten()

# Hold out the last 10% of the data as a test set
numTrain = int(0.9 * x.size)
trainX = x[:numTrain]
testX = x[numTrain:]
trainY = y[:numTrain]
testY = y[numTrain:]

# Fit a degree-7 polynomial to the training data only
poly = np.poly1d(np.polyfit(trainX, trainY, 7))

22 Python: Training and Test Sets
# Plot the training points, test points, and fitted curve
plt.scatter(trainX, trainY, c='g', s=50, marker='s', alpha=0.5)
plt.scatter(testX, testY, c='b', s=50, marker='D', alpha=0.5)
xp = np.linspace(0, 7, 100)
plt.plot(xp, poly(xp), c='r', linewidth=3)
plt.show()

# Compare the fit on the training and test sets
r2Train = r2_score(trainY, poly(trainX))
print('R-squared score for training set: ' + str(r2Train))
r2Test = r2_score(testY, poly(testX))
print('R-squared score for test set: ' + str(r2Test))

23 Python: Training and Test Sets

24 Python Useful Python libraries
- Pandas
  - Data manipulation and analysis
  - For numerical data tables (using DataFrames)
  - For time series
- Statsmodels
  - Estimate statistical models
  - Perform statistical tests
  - Plotting functions

25 Python: Multivariate Regression
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

pd.options.mode.chained_assignment = None

# Read the car data from a spreadsheet, forcing the relevant columns to floats
df = pd.read_excel('cars.xls', sheet_name='Sheet1', header=0,
                   converters={'Price': float, 'Mileage': float,
                               'Cylinder': float, 'Doors': float})

independentVariables = df[['Mileage', 'Cylinder', 'Doors']]
dependentVariable = df['Price']

# Standardise the independent variables before fitting
independentMat = independentVariables[['Mileage', 'Cylinder', 'Doors']].values
scale = StandardScaler()
independentVariables[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(independentMat)

# Fit an ordinary least squares model and print its summary
est = sm.OLS(dependentVariable, independentVariables).fit()
print(est.summary())

26 Python: Multivariate Regression
[OLS regression summary output: dependent variable Price, method Least Squares, 21 observations (18 residual degrees of freedom, 3 model degrees of freedom), reporting coefficients, standard errors, t-statistics, and 95% confidence intervals for Mileage, Cylinder, and Doors, along with R-squared and the usual diagnostic statistics]

27 Python: k-Means Clustering
from numpy import loadtxt
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

# Load the two-dimensional points and cluster the scaled data into 15 clusters
X = loadtxt('s1.csv', dtype=float, delimiter=',')
model = KMeans(n_clusters=15).fit(scale(X))
print(model.labels_)

# Colour each point by its assigned cluster
plt.figure(figsize=(15, 10))
plt.scatter(X[:, 0], X[:, 1], c=model.labels_.astype(float))
plt.show()

28 Python: k-Means Clustering

29 Python: k-Means Clustering
Always remember to scale your data!
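As a reminder of what scaling means in scikit-learn terms, the small sketch below standardises each feature with StandardScaler, so that the transformation fitted on one data set can be reused on new points; the numeric values are made up purely for illustration.

# A minimal scaling sketch (the data values are made up for illustration)
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 2500.0]])

# Fit the scaler on training data, then reuse the same transformation on new points
scaler = StandardScaler().fit(X)
print(scaler.transform(X))                 # each column now has zero mean and unit variance
print(scaler.transform([[2.5, 2750.0]]))   # new points are scaled with the training statistics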

30 Python Useful Python libraries
- IPython
  - Command shell for interactive computing
  - Tools for parallel computing
  - Support for images
  - Related to Project Jupyter
- Pydot
  - Interface to Graphviz and the DOT graph description language

31 Python: Decision Trees
import numpy as np
import pandas as pd
from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydot

# Load the iris data and map the class names to integer labels
df = pd.read_csv('iris.csv', header=0)
d = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
df['class'] = df['class'].map(d)

# The first four columns are the features; the class column is the target
features = list(df.columns[:4])
X = df[features]
y = df['class']

32 Python: Decision Trees
# Train the decision tree and export it as a Graphviz DOT graph
classifier = tree.DecisionTreeClassifier().fit(X, y)
dot_data = StringIO()
tree.export_graphviz(classifier, out_file=dot_data, feature_names=features)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_png('tree.png')

# Classify a new flower and translate the numeric label back to a class name
classificationResult = classifier.predict([[6.3, 2.9, 5.6, 1.8]])
print(list(d.keys())[list(d.values()).index(classificationResult[0])])

33 Python: Decision Trees

34 Python: Random Forests
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load the iris data and map the class names to integer labels
df = pd.read_csv('iris.csv', header=0)
d = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
df['class'] = df['class'].map(d)

features = list(df.columns[:4])
X = df[features]
y = df['class']

# Train a forest of 10 trees and classify a new flower
classifier = RandomForestClassifier(n_estimators=10).fit(X, y)
classificationResult = classifier.predict([[6.3, 2.9, 5.6, 1.8]])
print(list(d.keys())[list(d.values()).index(classificationResult[0])])

35 Python Useful Python libraries
- PyLab
  - Provides MATLAB-like functionality
  - Not widely used anymore
  - Used here only for its loadtxt function

36 Python: Support Vector Machines
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import *
from sklearn import svm, datasets

def plotPredictions(clf):
    # Build a grid over the (income, age) plane
    # (the income upper bound was missing on the slide; 250000 is assumed here)
    xx, yy = np.meshgrid(np.arange(0, 250000, 10), np.arange(10, 70, 0.5))
    plt.figure(figsize=(8, 6))
    # Classify every grid point and draw the resulting decision regions
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y.astype(float))
    plt.show()

37 Python: Support Vector Machines
# Load the income and age columns as features, and the group labels as targets
X = loadtxt('income_age.csv', dtype=float, delimiter=',', usecols=(0, 1))
y = loadtxt('income_age.csv', dtype=float, delimiter=',', usecols=(2,))

# Train a linear SVM and plot its decision boundaries
svc = svm.SVC(kernel='linear').fit(X, y)
plotPredictions(svc)

38 Python: Support Vector Machines

