Published by Sybil Blankenship. Modified over 6 years ago.
1
MIT 802 Introduction to Data Platforms and Sources Lecture 1
Willem S. van Heerden
2
Tools for Data Science
Based on a career website search, the most sought-after tools are:
R, SQL, Python, Hadoop, SAS, Java, Hive, Matlab, Pig, C++
3
Tools for Data Science: R
- Programming language for statistical computing and graphics
- First appeared in 1993; free software under the GPL since 1995
- Interpreted, but implemented primarily in C and Fortran
- Many libraries, primarily focused on statistical tests
  - Linear/nonlinear modelling
  - Classical statistical techniques
  - Time-series analysis
  - Clustering
  - And so on...
- Very well documented, but steep learning curve
4
Tools for Data Science: Python
- Interpreted high-level scripting language
- Initial version in 1991
- Free interpreters for all major platforms
- Emphasis on readability
- Large number of third-party libraries
  - Database access
  - Web scraping
  - Statistical tests
  - Machine learning
  - And so on...
- Many examples online, but execution can be a bit slow
5
Tools for Data Science: Python
- Often used when combining
  - Web apps and data analysis
  - Production databases and data analysis
- Despite being a scripting language
  - Works well for larger development projects
  - Often used for production code
- Several free IDEs are available
6
Tools for Data Science: SAS
- Proprietary numerical analysis software suite by SAS Institute
- Originally developed in 1966, last stable release in 2013
- Windows, IBM mainframe, Unix/Linux, OpenVMS
- Can access data from a wide variety of sources
- Used for
  - Advanced analytics
  - Multivariate analysis
  - Business intelligence
  - Data management
  - Predictive analytics
7
Tools for Data Science: SAS
- Current release provides
  - GUI with point-and-click interface for non-technical users
  - SAS language with more advanced options for technical users
- Process
  - Reads data from spreadsheets and databases
  - Performs statistical analyses
  - Outputs results in tables and graphs in RTF, HTML, or PDF documents
8
Tools for Data Science: Scala
- A general purpose programming language
- Design started in 2001, with first public release in 2004
- Combines object-oriented and functional programming
- Interoperates with Java (compiles to bytecode that runs on the JVM)
- Released under the BSD licence
- Offers support for streaming
  - Useful for real-time data analysis
- Apache Spark itself is implemented in Scala
9
Tools for Data Science: gnuplot
- A command-line tool
- Early development in 1986
- Specifically for generating 2D and 3D plots of data
- Can also perform data fits
- Possible to write scripts
- Produces some of the most professional-looking graphs
- Very steep learning curve
10
Tools for Data Science: Apache Hadoop
- Open source software framework
- Allows for distributed processing of large data sets across clusters of computers
- Uses simple programming models
- Designed with scalability in mind, from
  - Single servers, to
  - Thousands of computers, each with local computation and storage
11
Tools for Data Science: Apache Hadoop
- Storage component
  - Hadoop's Distributed File System (HDFS)
- Processing component
  - MapReduce programming model
- Splits files into large blocks
- Distributes blocks to cluster nodes
- Transfers packaged code to nodes
- Data is processed in parallel at nodes
- Hardware failures
  - Likely to be common and should be automatically handled
  - Hadoop detects and handles failures at application level
- Many extensions are available
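The MapReduce model named on this slide can be illustrated without a cluster. The following is a minimal single-process sketch, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text blocks are assumptions made for illustration. It counts words by mapping each block to (word, 1) pairs, grouping the pairs by key, and reducing each group to a total.

```python
from collections import defaultdict


def map_phase(block):
    """Emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(blocks):
    # On a cluster each block would be mapped on a different node;
    # this sketch simply loops over the blocks in one process.
    emitted = []
    for block in blocks:
        emitted.extend(map_phase(block))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


# Hypothetical input: three "blocks" of one file.
blocks = ["big data big clusters", "data moves to code", "code moves to data"]
counts = word_count(blocks)
print(counts)
```

Because map_phase sees only its own block and reduce_phase sees only one key, both steps could run in parallel on separate nodes, which is the property the slide relies on.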
12
Tools for Data Science: Apache Spark
- An open-source cluster-computing framework
- Started in 2009 at UC Berkeley's AMPLab; donated to the Apache Software Foundation in 2013
- Interface for programming clusters
  - Includes implicit parallelism and fault tolerance
- Runs alongside or on top of Hadoop
- Developed due to limitations of MapReduce, to facilitate
  - Implementation of algorithms that iterate over a data set
  - Repeated database-style querying of data
13
Tools for Data Science: Apache Spark
- Requires
  - A cluster manager, for example
    - Native standalone Spark cluster
    - Hadoop YARN
    - Apache Mesos
  - A distributed storage system, for example
    - Hadoop's Distributed File System (HDFS)
    - Amazon S3 (Simple Storage Service)
14
Tools for Data Science: Apache Spark
- Spark MLlib
  - A distributed machine learning framework
  - Includes many common learning and statistical algorithms
    - Feature extraction and transformation functions
    - Classification and regression
      - Support Vector Machines (SVMs)
      - Logistic and linear regression
      - Decision trees
    - Cluster analysis
      - k-means and others
    - Optimization algorithms
15
Python: Useful Python libraries
- NumPy
  - Large, multi-dimensional arrays and matrices
  - Mathematical functions to operate on arrays and matrices
- Matplotlib
  - Plotting library for Python and NumPy
  - Pyplot provides a MATLAB-like interface
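As a small illustration of the NumPy bullets above, the sketch below builds a two-dimensional array and applies a few array-wide operations. The numbers are made up for the example and are not taken from the lecture's data files.

```python
import numpy as np

# A 2-D array: three rows of (page speed, purchase amount); values are invented.
data = np.array([[1.0, 40.0],
                 [2.0, 30.0],
                 [4.0, 10.0]])

print(data.shape)                  # number of rows and columns

column_means = data.mean(axis=0)   # mean taken down each column
speeds = data[:, 0]                # first column as a 1-D array
doubled = speeds * 2               # arithmetic applies element-wise
product = data.T @ data            # matrix product (2x3 times 3x2)

print(column_means)
print(doubled)
print(product)
```

Operations such as `speeds * 2` act on the whole array at once, so no explicit Python loop is needed.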
16
Python: A Simple Scatter Plot
import matplotlib.pyplot as plt
import numpy as np

# Load two comma-separated columns: page speed and purchase amount
fileContent = np.loadtxt('pageSpeeds.csv', dtype=float, delimiter=',')

# Split the array column-wise into two single-column arrays
pageSpeeds, purchaseAmounts = np.split(fileContent, 2, 1)
x = pageSpeeds.flatten()
y = purchaseAmounts.flatten()

plt.scatter(x, y)
plt.show()
17
Python: A Simple Scatter Plot
18
Python: Useful Python libraries
- Scikit-learn
  - Regression algorithms
  - Classification algorithms
  - Clustering algorithms
19
Python: Least Squares Regression
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import r2_score

fileContent = np.loadtxt('pageSpeeds.csv', dtype=float, delimiter=',')
pageSpeeds, purchaseAmounts = np.split(fileContent, 2, 1)
x = pageSpeeds.flatten()
y = purchaseAmounts.flatten()

# Least-squares fit of a degree-3 polynomial, wrapped as a callable
poly = np.poly1d(np.polyfit(x, y, 3))

# Plot the data and the fitted curve
plt.scatter(x, y)
xp = np.linspace(0, 7, 100)
plt.plot(xp, poly(xp), c='r')
plt.show()

# Goodness of fit on the same data the model was fitted to
r2 = r2_score(y, poly(x))
print('R-squared score: ' + str(r2))
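The R-squared score reported by r2_score is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes that quantity in plain Python; the helper name r_squared and the sample values are assumptions for illustration, not part of the lecture code.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after the model's predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up values
actual = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(actual, actual)       # predictions match exactly
baseline = r_squared(actual, [2.5] * 4)   # always predict the mean
print(perfect, baseline)
```

A model that always predicts the mean scores 0, a perfect fit scores 1, and a model worse than the mean gives a negative value.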
20
Python: Least Squares Regression
21
Python: Training and Test Sets
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import r2_score

fileContent = np.loadtxt('pageSpeeds.csv', dtype=float, delimiter=',')
pageSpeeds, purchaseAmounts = np.split(fileContent, 2, 1)
x = pageSpeeds.flatten()
y = purchaseAmounts.flatten()

# Use the first 90% of the rows for training and the rest for testing
numTrain = int(0.9 * x.size)
trainX = x[:numTrain]
testX = x[numTrain:]
trainY = y[:numTrain]
testY = y[numTrain:]

# Fit a degree-7 polynomial to the training set only
poly = np.poly1d(np.polyfit(trainX, trainY, 7))
22
Python: Training and Test Sets
# Training points as green squares, test points as blue diamonds
plt.scatter(trainX, trainY, c='g', s=50, marker='s', alpha=0.5)
plt.scatter(testX, testY, c='b', s=50, marker='D', alpha=0.5)
xp = np.linspace(0, 7, 100)
plt.plot(xp, poly(xp), c='r', linewidth=3)
plt.show()

# Compare the fit on seen (training) and unseen (test) data
r2Train = r2_score(trainY, poly(trainX))
print('R-squared score for training set: ' + str(r2Train))
r2Test = r2_score(testY, poly(testX))
print('R-squared score for test set: ' + str(r2Test))
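The split on the previous slide takes the first 90% of the rows as they appear in the file. If the file happens to be sorted, that can give an unrepresentative test set. A possible variation, sketched here with the standard library only, shuffles the row indices before splitting. The helper name shuffled_split, the seed, and the sample data are assumptions for illustration.

```python
import random


def shuffled_split(xs, ys, train_fraction=0.9, seed=0):
    """Shuffle row indices, then split paired lists into train and test parts."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)   # fixed seed keeps the split repeatable
    num_train = int(train_fraction * len(xs))
    train_idx, test_idx = indices[:num_train], indices[num_train:]
    train = ([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    test = ([xs[i] for i in test_idx], [ys[i] for i in test_idx])
    return train, test


# Made-up paired data: y is always twice x
xs = list(range(20))
ys = [2 * x for x in xs]
(train_x, train_y), (test_x, test_y) = shuffled_split(xs, ys)
print(len(train_x), len(test_x))
```

Each x stays paired with its y because both lists are indexed with the same shuffled indices.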
23
Python: Training and Test Sets
24
Python: Useful Python libraries
- Pandas
  - Data manipulation and analysis
  - For numerical data tables (using DataFrames)
  - For time series
- Statsmodels
  - Estimate statistical models
  - Perform statistical tests
  - Plotting functions
25
Python: Multivariate Regression
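import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

# Read the car data (current pandas spells the keyword sheet_name, not sheetname)
df = pd.read_excel('cars.xls', sheet_name='Sheet1', header=0,
                   converters={'Price': float, 'Mileage': float,
                               'Cylinder': float, 'Doors': float})

# .copy() avoids chained-assignment warnings when the columns are overwritten below
independentVariables = df[['Mileage', 'Cylinder', 'Doors']].copy()
dependentVariable = df['Price']

# Standardise each feature to zero mean and unit variance
# (to_numpy() replaces the removed as_matrix() method)
independentMat = independentVariables[['Mileage', 'Cylinder', 'Doors']].to_numpy()
scale = StandardScaler()
independentVariables[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(independentMat)

# Ordinary least squares; sm.OLS does not add an intercept term by itself
est = sm.OLS(dependentVariable, independentVariables).fit()
print(est.summary())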
26
Python: Multivariate Regression
OLS Regression Results (statsmodels summary output; most numeric values are missing from this transcript)
Dep. Variable: Price
Model: OLS
Method: Least Squares
Date: Fri, 04 May 2018
Time: 05:10:31
No. Observations: 21
Df Residuals: 18
Df Model: 3
Covariance Type: nonrobust
Coefficient table columns: coef, std err, t, P>|t|, [95.0% Conf. Int.]
Coefficient table rows: Mileage, Cylinder, Doors
Cond. No.: 1.53
27
Python: k-Means Clustering
from numpy import loadtxt
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
import matplotlib.pyplot as plt

X = loadtxt('s1.csv', dtype=float, delimiter=',')

# Fit 15 clusters to the standardised data
model = KMeans(n_clusters=15).fit(scale(X))
print(model.labels_)

# Plot the original points, coloured by assigned cluster
plt.figure(figsize=(15, 10))
plt.scatter(X[:, 0], X[:, 1], c=model.labels_.astype(float))
plt.show()
28
Python: k-Means Clustering
29
Python: k-Means Clustering
Always remember to scale your data!
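To show why scaling matters for a distance-based method such as k-means, the sketch below standardises two made-up features that differ in magnitude by a factor of 1000. It uses plain Python and the population standard deviation, which is also what scikit-learn's scale function uses. The helper name standardize and the sample values are assumptions for illustration.

```python
import math


def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]


# Made-up features on very different scales
incomes = [20000.0, 40000.0, 60000.0]
ages = [20.0, 40.0, 60.0]

z_incomes = standardize(incomes)
z_ages = standardize(ages)
print(z_incomes)
print(z_ages)
```

Before scaling, Euclidean distances between rows would be dominated by income. After scaling, both features span the same range and contribute equally.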
30
Python: Useful Python libraries
- IPython
  - Command shell for interactive computing
  - Tools for parallel computing
  - Support for images
  - Related to Project Jupyter
- Pydot
  - Interface for GraphViz
  - DOT graph description language
31
Python: Decision Trees
import pandas as pd
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from sklearn import tree
from IPython.display import Image
import pydot

df = pd.read_csv('iris.csv', header=0)

# Map the class names to integer labels
d = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
df['class'] = df['class'].map(d)

# The first four columns are the features
features = list(df.columns[:4])
X = df[features]
y = df['class']
32
Python: Decision Trees
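classifier = tree.DecisionTreeClassifier().fit(X, y)

# Export the fitted tree in DOT format and render it as a PNG
dot_data = StringIO()
tree.export_graphviz(classifier, out_file=dot_data, feature_names=features)
# Recent pydot versions return a list of graphs, so take the first one
graph = pydot.graph_from_dot_data(dot_data.getvalue())[0]
graph.write_png('tree.png')

# Classify one new flower and map the integer label back to its class name
classificationResult = classifier.predict([[6.3, 2.9, 5.6, 1.8]])
classNames = {label: name for name, label in d.items()}
print(classNames[classificationResult[0]])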
33
Python: Decision Trees
34
Python: Random Forests
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('iris.csv', header=0)

# Map the class names to integer labels
d = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
df['class'] = df['class'].map(d)
features = list(df.columns[:4])
X = df[features]
y = df['class']

# Train a forest of 10 decision trees
classifier = RandomForestClassifier(n_estimators=10).fit(X, y)

# Classify one new flower and map the integer label back to its class name
classificationResult = classifier.predict([[6.3, 2.9, 5.6, 1.8]])
classNames = {label: name for name, label in d.items()}
print(classNames[classificationResult[0]])
35
Python: Useful Python libraries
- PyLab
  - Provides MATLAB-like functionality
  - Not widely used anymore
  - Here used for the loadtxt function
36
Python: Support Vector Machines
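import numpy as np
import matplotlib.pyplot as plt  # the pyplot submodule, not the top-level matplotlib package
from pylab import loadtxt
from sklearn import svm

def plotPredictions(clf):
    # Grid covering the plotted range of the two features
    # (the upper bound of the first range is missing in the source)
    xx, yy = np.meshgrid(np.arange(0, , 10), np.arange(10, 70, 0.5))
    plt.figure(figsize=(8, 6))
    # Predict a class for every grid point and shade the regions
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    # Overlay the data points; X and y are the globals loaded on the next slide
    plt.scatter(X[:, 0], X[:, 1], c=y.astype(float))
    plt.show()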
37
Python: Support Vector Machines
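# Columns 0 and 1 are the features, column 2 is the class label
X = loadtxt('income_age.csv', dtype=float, delimiter=',', usecols=(0, 1))
y = loadtxt('income_age.csv', dtype=float, delimiter=',', usecols=(2,))

# Fit a linear-kernel support vector classifier and plot its decision regions
svc = svm.SVC(kernel='linear').fit(X, y)
plotPredictions(svc)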
38
Python: Support Vector Machines