Our Data Science Roadmap


1 Our Data Science Roadmap
Raw data collected (Lab1, Lab2)
Data is processed
Data is cleaned (Lab1)
Exploratory data analysis (EDA): R/RStudio+
Machine learning algorithms; statistical models: Spark ML
Big data methods: MapReduce (Lab2, Lab3)
Build data products (Lab2)
Communication, visualization, report findings (Lab2, Lab3)
Make decisions
CSE4/587 B. Ramamurthy 5/16/2019

2 Topics for Final Exam
Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer: Ch. 2, 3 up to p. 57; Ch. 5
Text processing, MR, and graph processing, including shortest path and PageRank
Lab 2: MR usage details
Naïve Bayes and Bayesian classification (class notes)
Study Field Cady’s text, Chapters 6, 7, and 8: focus on Bayes, logistic regression, and evaluation
Apache Spark: RDD paper by Zaharia et al.; motivation for Spark; Spark APIs

3 Topics for Final Exam
Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer: Ch. 2, 3 up to p. 57; Ch. 5
Text processing, MR, and graph processing, including shortest path and PageRank
Lab 2: MR usage details
Naïve Bayes and Bayesian classification (class notes)
Apache Spark: RDD paper by Zaharia et al.; motivation for Spark; Spark APIs
Lab3 details: the data pipeline you designed for Lab3

4 Confusion Matrix
Evaluating and comparing the performance of prediction classifiers.
Confusion matrix: only the binary confusion matrix is covered.
The next slide shows an easy way to remember the various metrics; the slide after that shows a sample computation. Let's explore.

5 Confusion matrix metrics

                   Classified Positive   Classified Negative
Actual Positive    TP                    FN                    Sensitivity = TP/(TP+FN)
Actual Negative    FP                    TN                    Specificity = TN/(FP+TN)

Precision = TP/(TP+FP)
Accuracy = (TP+TN)/Total
Misclassification Rate = (FN+FP)/Total

6 Sample computation (Total = 200)

                   Classified Positive   Classified Negative
Actual Positive    60                    10                    Sensitivity = TP/(TP+FN) = 60/70
Actual Negative    5                     125                   Specificity = TN/(FP+TN) = 125/130

Precision = TP/(TP+FP) = 60/65
Accuracy = (TP+TN)/Total = 185/200
Misclassification Rate = (FN+FP)/Total = 15/200
Prevalence = 70/200 = 35%
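The sample computation above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `binary_confusion_metrics` and its dictionary keys are hypothetical, and prevalence is taken to be actual positives over the total, which matches the 70/200 figure on the slide.

```python
def binary_confusion_metrics(tp, fn, fp, tn):
    """Compute the slide's metrics from the four cells of a binary confusion matrix.

    Hypothetical helper for illustration; the names are not from the course material.
    """
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),            # TP / (TP + FN)
        "specificity": tn / (fp + tn),            # TN / (FP + TN)
        "precision": tp / (tp + fp),              # TP / (TP + FP)
        "accuracy": (tp + tn) / total,            # (TP + TN) / Total
        "misclassification_rate": (fn + fp) / total,  # (FN + FP) / Total
        "prevalence": (tp + fn) / total,          # actual positives / Total (assumed definition)
    }


# Counts taken from the sample computation slide (Total = 200).
metrics = binary_confusion_metrics(tp=60, fn=10, fp=5, tn=125)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and misclassification rate always sum to 1, which is a quick sanity check when working such a problem by hand.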

7 Final exam format
5 questions (15-20 points each)
Closed book and closed notes
Classification 1: Naïve Bayes
Classification 2: Logistic regression
Spark: interpret given code; Spark concepts (RDD, lazy evaluation, etc.): short answer
MapReduce synthesis: Lab2 details
Graph algorithms problem solving: write pseudocode
MapReduce analysis: PageRank, shortest path: simulate
Evaluate performance of classification: (binary) confusion matrix

