
1 MIT 802 Introduction to Data Platforms and Sources Lecture 1
Willem S. van Heerden

2 Tools for Data Science
Based on a career-website search, the most sought-after tools are: R, SQL, Python, Hadoop, SAS, Java, Hive, MATLAB, Pig, and C++.

3 Tools for Data Science R
- Programming language for statistical computing and graphics
- Initial version released in 1995
- Freely available under the GPL
- Interpreted, but implemented primarily in C and Fortran
- Many libraries, primarily focused on:
  - Statistical tests
  - Linear and nonlinear modelling
  - Classical statistical techniques
  - Time-series analysis
  - Clustering
  - And so on...
- Very well documented, but has a steep learning curve

4 Tools for Data Science Python
- Interpreted, high-level scripting language
- Initial version released in 1991
- Free interpreters are available for all major platforms
- Emphasis on code readability
- Large number of third-party libraries:
  - Database access
  - Web scraping
  - Statistical tests
  - Machine learning
  - And so on...
- Many examples online, but can be a bit slow

5 Tools for Data Science Python
- Often used when combining:
  - Web apps and data analysis
  - Production databases and data analysis
- Despite being a scripting language:
  - Works well for larger development projects
  - Often used for production code
- Several free IDEs are available

6 Tools for Data Science SAS
- Proprietary numerical-analysis software suite from the SAS Institute
- Development began in 1966; last stable release in 2013
- Runs on Windows, IBM mainframes, Unix/Linux, and OpenVMS
- Can access data from a wide variety of sources
- Used for:
  - Advanced analytics
  - Multivariate analysis
  - Business intelligence
  - Data management
  - Predictive analytics

7 Tools for Data Science SAS
- The current release provides:
  - A GUI with a point-and-click interface for non-technical users
  - The SAS language, with more advanced options for technical users
- Typical process:
  - Reads data from spreadsheets and databases
  - Performs statistical analyses
  - Outputs results as tables and graphs in RTF, HTML, or PDF documents

8 Tools for Data Science Scala
- A general-purpose programming language
- Design started in 2001, with the first public release in 2004
- Combines object-oriented and functional programming
- Interoperates with Java (runs on the JVM)
- Released under the BSD licence
- Offers support for streaming
  - Useful for real-time data analysis
- Apache Spark itself is implemented in Scala

9 Tools for Data Science gnuplot
- A command-line plotting tool
- Early development dates back to 1986
- Specifically designed for generating 2D and 3D plots of data
- Can also perform data fitting
- Plots can be scripted
- Produces some of the most professional-looking graphing results
- Very steep learning curve

10 Tools for Data Science Apache Hadoop
- Open-source software framework
- Allows distributed processing of large data sets across clusters of computers
- Uses simple programming models
- Designed with scalability in mind: from single servers to thousands of computers, each providing local computation and storage

11 Tools for Data Science Apache Hadoop
- Storage component: the Hadoop Distributed File System (HDFS)
- Processing component: the MapReduce programming model
  - Splits files into large blocks
  - Distributes the blocks to cluster nodes
  - Transfers packaged code to the nodes
  - Data is processed in parallel at the nodes
- Hardware failures
  - Likely to be common, and should be handled automatically
  - Hadoop detects and handles failures at the application level
- Many extensions are available
- A minimal word-count sketch in the MapReduce style is given below
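To make the MapReduce model concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets plain Python scripts act as the mapper and reducer. The file names mapper.py and reducer.py, and the word-count task itself, are illustrative choices and not part of the lecture material.

# mapper.py -- a minimal Hadoop Streaming mapper (word-count sketch)
# Reads lines of text from standard input and emits one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + '\t' + '1')

Hadoop sorts the mapper output by key before it reaches the reducer, so all counts for the same word arrive together:

# reducer.py -- a minimal Hadoop Streaming reducer (word-count sketch)
# Accumulates counts for each word from the sorted mapper output.
import sys

currentWord = None
currentCount = 0

for line in sys.stdin:
    word, count = line.strip().split('\t')
    if word == currentWord:
        currentCount += int(count)
    else:
        if currentWord is not None:
            print(currentWord + '\t' + str(currentCount))
        currentWord = word
        currentCount = int(count)

if currentWord is not None:
    print(currentWord + '\t' + str(currentCount))

Such a job would typically be submitted through the Hadoop Streaming JAR, passing the two scripts as the -mapper and -reducer arguments; Hadoop itself handles splitting the input into blocks, shuffling the intermediate key-value pairs, and rerunning tasks on failed nodes.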

12 Tools for Data Science Apache Spark
- An open-source cluster-computing framework
- Initial development began in 2009 at UC Berkeley's AMPLab; became a top-level Apache project in 2014
- Provides an interface for programming entire clusters
- Includes implicit data parallelism and fault tolerance
- Runs alongside or on top of Hadoop
- Developed in response to limitations of MapReduce, to facilitate:
  - Algorithms that iterate over a data set
  - Repeated, database-style querying of data

13 Tools for Data Science Apache Spark
- Requires a cluster manager, for example:
  - The native standalone Spark cluster manager
  - Hadoop YARN
  - Apache Mesos
- Requires a distributed storage system, for example:
  - The Hadoop Distributed File System (HDFS)
  - Amazon S3 (Simple Storage Service)
- A minimal PySpark session that ties these together is sketched below
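As a minimal sketch of how these pieces fit together in a Python program, the snippet below starts a PySpark session against a YARN cluster manager and reads a file from HDFS. The master setting, application name, and file path are assumptions made for illustration; a standalone master URL or an S3 path would work equally well.

# A minimal PySpark session sketch (master URL, app name, and path are illustrative)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('yarn')               # cluster manager: YARN here; could be a standalone or Mesos URL
         .appName('MIT802Example')     # hypothetical application name
         .getOrCreate())

# Read a CSV file from distributed storage (an HDFS path here; an s3a:// path also works)
df = spark.read.csv('hdfs:///data/pageSpeeds.csv', inferSchema=True)
df.show(5)

spark.stop()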

14 Tools for Data Science Apache Spark
- Spark MLlib: a distributed machine learning framework
- Includes many common learning and statistical algorithms:
  - Feature extraction and transformation functions
  - Classification and regression
    - Support Vector Machines (SVMs)
    - Logistic and linear regression
    - Decision trees
  - Cluster analysis
    - k-means and others
  - Optimization algorithms
- A small k-means example using the MLlib DataFrame API is sketched below
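To illustrate the MLlib API from Python, the sketch below clusters a two-column data set with k-means using the DataFrame-based pyspark.ml interface. The file name s1.csv and the choice of 15 clusters mirror the scikit-learn k-means example later in these slides; the column names are assumed purely for illustration.

# k-means with Spark MLlib (DataFrame-based API); file and column names are illustrative
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName('MLlibKMeansSketch').getOrCreate()

# Read a headerless two-column CSV of points and name the columns
points = spark.read.csv('s1.csv', inferSchema=True).toDF('x', 'y')

# Assemble the two columns into a single feature vector and standardise it
assembler = VectorAssembler(inputCols=['x', 'y'], outputCol='rawFeatures')
scaler = StandardScaler(inputCol='rawFeatures', outputCol='features',
                        withMean=True, withStd=True)
assembled = assembler.transform(points)
scaled = scaler.fit(assembled).transform(assembled)

# Cluster the points and show the cluster assigned to each one
kmeans = KMeans(k=15, featuresCol='features', predictionCol='cluster')
model = kmeans.fit(scaled)
model.transform(scaled).select('x', 'y', 'cluster').show(5)

spark.stop()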

15 Python Useful Python libraries
- NumPy
  - Large, multi-dimensional arrays and matrices
  - Mathematical functions that operate on these arrays and matrices
- Matplotlib
  - Plotting library for Python and NumPy
  - Pyplot provides a MATLAB-like interface

16 Python: A Simple Scatter Plot
import matplotlib.pyplot as plt
import numpy as np

# Load a two-column CSV file and split it into page speeds and purchase amounts
fileContent = np.loadtxt('pageSpeeds.csv', dtype=float, delimiter=',')
pageSpeeds, purchaseAmounts = np.split(fileContent, 2, 1)
x = pageSpeeds.flatten()
y = purchaseAmounts.flatten()

# Plot purchase amount against page speed
plt.scatter(x, y)
plt.show()

17 Python: A Simple Scatter Plot

18 Python Useful Python libraries
- Scikit-learn
  - Regression algorithms
  - Classification algorithms
  - Clustering algorithms

19 Python: Least Squares Regression
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import r2_score

# Load the data and split it into the two variables
fileContent = np.loadtxt('pageSpeeds.csv', dtype=float, delimiter=',')
pageSpeeds, purchaseAmounts = np.split(fileContent, 2, 1)
x = pageSpeeds.flatten()
y = purchaseAmounts.flatten()

# Fit a degree-3 polynomial by least squares
poly = np.poly1d(np.polyfit(x, y, 3))

# Plot the data and the fitted curve
plt.scatter(x, y)
xp = np.linspace(0, 7, 100)
plt.plot(xp, poly(xp), c='r')
plt.show()

# Report the goodness of fit
r2 = r2_score(y, poly(x))
print('R-squared score: ' + str(r2))

20 Python: Least Squares Regression

21 Python: Training and Test Sets
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import r2_score

fileContent = np.loadtxt('pageSpeeds.csv', dtype=float, delimiter=',')
pageSpeeds, purchaseAmounts = np.split(fileContent, 2, 1)
x = pageSpeeds.flatten()
y = purchaseAmounts.flatten()

# Hold out the last 10% of the data as a test set
numTrain = int(0.9 * x.size)
trainX = x[:numTrain]
testX = x[numTrain:]
trainY = y[:numTrain]
testY = y[numTrain:]

# Fit a degree-7 polynomial to the training data only
poly = np.poly1d(np.polyfit(trainX, trainY, 7))

22 Python: Training and Test Sets
# Plot the training points, test points, and fitted curve
plt.scatter(trainX, trainY, c='g', s=50, marker='s', alpha=0.5)
plt.scatter(testX, testY, c='b', s=50, marker='D', alpha=0.5)
xp = np.linspace(0, 7, 100)
plt.plot(xp, poly(xp), c='r', linewidth=3)
plt.show()

# Compare the fit on the training and test sets
r2Train = r2_score(trainY, poly(trainX))
print('R-squared score for training set: ' + str(r2Train))
r2Test = r2_score(testY, poly(testX))
print('R-squared score for test set: ' + str(r2Test))

23 Python: Training and Test Sets

24 Python Useful Python libraries
- Pandas
  - Data manipulation and analysis
  - For numerical data tables (using DataFrames)
  - For time series
- Statsmodels
  - Estimate statistical models
  - Perform statistical tests
  - Plotting functions

25 Python: Multivariate Regression
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

pd.options.mode.chained_assignment = None

# Read the car data from a spreadsheet, forcing the relevant columns to floats
df = pd.read_excel('cars.xls', sheet_name='Sheet1', header=0,
                   converters={'Price': float, 'Mileage': float,
                               'Cylinder': float, 'Doors': float})

independentVariables = df[['Mileage', 'Cylinder', 'Doors']]
dependentVariable = df['Price']

# Standardise the independent variables before fitting
independentMat = independentVariables[['Mileage', 'Cylinder', 'Doors']].values
scale = StandardScaler()
independentVariables[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(independentMat)

# Fit an ordinary least squares model and print its summary
est = sm.OLS(dependentVariable, independentVariables).fit()
print(est.summary())

26 Python: Multivariate Regression
[OLS regression summary output: dependent variable Price, method Least Squares, 21 observations (18 residual degrees of freedom, 3 model degrees of freedom), reporting coefficients, standard errors, t-statistics, and 95% confidence intervals for Mileage, Cylinder, and Doors, along with R-squared and the usual diagnostic statistics]

27 Python: k-Means Clustering
from numpy import loadtxt
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

# Load the two-dimensional points and cluster the scaled data into 15 clusters
X = loadtxt('s1.csv', dtype=float, delimiter=',')
model = KMeans(n_clusters=15).fit(scale(X))
print(model.labels_)

# Colour each point by its assigned cluster
plt.figure(figsize=(15, 10))
plt.scatter(X[:, 0], X[:, 1], c=model.labels_.astype(float))
plt.show()

28 Python: k-Means Clustering

29 Python: k-Means Clustering
Always remember to scale your data!
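As a reminder of what scaling means in scikit-learn terms, the small sketch below standardises each feature with StandardScaler, so that the transformation fitted on one data set can be reused on new points; the numeric values are made up purely for illustration.

# A minimal scaling sketch (the data values are made up for illustration)
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 2500.0]])

# Fit the scaler on training data, then reuse the same transformation on new points
scaler = StandardScaler().fit(X)
print(scaler.transform(X))                 # each column now has zero mean and unit variance
print(scaler.transform([[2.5, 2750.0]]))   # new points are scaled with the training statistics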

30 Python Useful Python libraries
- IPython
  - Command shell for interactive computing
  - Tools for parallel computing
  - Support for images
  - Related to Project Jupyter
- Pydot
  - Interface to Graphviz and the DOT graph description language

31 Python: Decision Trees
import numpy as np
import pandas as pd
from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydot

# Load the iris data and map the class names to integer labels
df = pd.read_csv('iris.csv', header=0)
d = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
df['class'] = df['class'].map(d)

# The first four columns are the features; the class column is the target
features = list(df.columns[:4])
X = df[features]
y = df['class']

32 Python: Decision Trees
# Train the decision tree and export it as a Graphviz DOT graph
classifier = tree.DecisionTreeClassifier().fit(X, y)
dot_data = StringIO()
tree.export_graphviz(classifier, out_file=dot_data, feature_names=features)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_png('tree.png')

# Classify a new flower and translate the numeric label back to a class name
classificationResult = classifier.predict([[6.3, 2.9, 5.6, 1.8]])
print(list(d.keys())[list(d.values()).index(classificationResult[0])])

33 Python: Decision Trees

34 Python: Random Forests
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load the iris data and map the class names to integer labels
df = pd.read_csv('iris.csv', header=0)
d = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
df['class'] = df['class'].map(d)

features = list(df.columns[:4])
X = df[features]
y = df['class']

# Train a forest of 10 trees and classify a new flower
classifier = RandomForestClassifier(n_estimators=10).fit(X, y)
classificationResult = classifier.predict([[6.3, 2.9, 5.6, 1.8]])
print(list(d.keys())[list(d.values()).index(classificationResult[0])])

35 Python Useful Python libraries
- PyLab
  - Provides MATLAB-like functionality
  - Not widely used anymore
  - Used here only for its loadtxt function

36 Python: Support Vector Machines
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import *
from sklearn import svm, datasets

def plotPredictions(clf):
    # Build a grid over the (income, age) plane
    # (the income upper bound was missing on the slide; 250000 is assumed here)
    xx, yy = np.meshgrid(np.arange(0, 250000, 10), np.arange(10, 70, 0.5))
    plt.figure(figsize=(8, 6))
    # Classify every grid point and draw the resulting decision regions
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y.astype(float))
    plt.show()

37 Python: Support Vector Machines
# Load the income and age columns as features, and the group labels as targets
X = loadtxt('income_age.csv', dtype=float, delimiter=',', usecols=(0, 1))
y = loadtxt('income_age.csv', dtype=float, delimiter=',', usecols=(2,))

# Train a linear SVM and plot its decision boundaries
svc = svm.SVC(kernel='linear').fit(X, y)
plotPredictions(svc)

38 Python: Support Vector Machines

