Our Data Science Roadmap

Presentation transcript:

Our Data Science Roadmap
- Raw data collected
- Exploratory data analysis (EDA): R/RStudio+
- Machine learning algorithms; statistical models: Spark ML
- Build data products
- Communication, visualization, report findings
- Make decisions
- Data is processed
- Data is cleaned
- Big data methods: MapReduce
CSE4/587 B. Ramamurthy 11/10/2018

Topics for Final Exam
- Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer: Ch. 2, 3 up to p. 57; Ch. 5
- Text processing, MR, and graph processing, including shortest path and PageRank
- Lab 2: MR usage details
- Naïve Bayes and Bayesian classification (class notes)
- Field Cady's text, Chapters 6, 7 and 8: focus on Bayes, logistic regression and evaluation
- Apache Spark: RDD paper by Zaharia et al.; motivation for Spark; Spark APIs
- Lab 3 details

Confusion Matrix
Evaluating and comparing the performance of prediction classifiers.
Confusion matrix: only the binary confusion matrix is covered.
The next slide shows an easy way to remember the various metrics.
The slide after that shows a sample computation. Let's explore.

                   Classified Positive   Classified Negative
Actual Positive    TP                    FN                    Sensitivity = TP/(TP+FN)
Actual Negative    FP                    TN                    Specificity = TN/(FP+TN)

Misclassification Rate = (FN+FP)/Total
Precision = TP/(TP+FP)
Accuracy = (TP+TN)/Total

Total = 200

                   Classified Positive   Classified Negative
Actual Positive    60                    10                    Sensitivity = TP/(TP+FN) = 60/70
Actual Negative    5                     125                   Specificity = TN/(FP+TN) = 125/130

Misclassification Rate = (FN+FP)/Total = 15/200
Precision = TP/(TP+FP) = 60/65
Accuracy = (TP+TN)/Total = 185/200
Prevalence = 70/200 = 35%
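
The sample computation above can be checked with a few lines of Python. This is only a minimal sketch and not part of the original slides: the function name confusion_metrics and the dictionary keys are hypothetical, and it assumes every denominator is non-zero (no guard against an empty row or column).

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute the binary confusion-matrix metrics listed on the slide.

    Hypothetical helper for illustration; assumes non-zero denominators.
    """
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),                  # TP/(TP+FN)
        "specificity": tn / (fp + tn),                  # TN/(FP+TN)
        "misclassification_rate": (fn + fp) / total,    # (FN+FP)/Total
        "precision": tp / (tp + fp),                    # TP/(TP+FP)
        "accuracy": (tp + tn) / total,                  # (TP+TN)/Total
        "prevalence": (tp + fn) / total,                # actual positives / Total
    }


# Counts taken from the sample slide: TP=60, FN=10, FP=5, TN=125 (Total = 200).
metrics = confusion_metrics(tp=60, fn=10, fp=5, tn=125)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the misclassification rate and the accuracy always sum to 1, which gives a quick sanity check on hand computations.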

Final exam format
6 questions (15-20 points each)
Closed book and closed notes
- Classification 1: Naïve Bayes
- Classification 2: Logistic regression
- Spark: given code, interpret
- MapReduce synthesis: graph algorithms problem solving, write pseudocode
- MapReduce analysis: PageRank, simulate
- Evaluate performance of classification: (binary) confusion matrix