Our Data Science Roadmap


Our Data Science Roadmap
- Raw data collected (Lab1, Lab2)
- Data is processed; data is cleaned (Lab1)
- Exploratory data analysis (EDA): R/RStudio
- Machine learning algorithms and statistical models: Spark ML
- Big data methods: MapReduce (Lab2, Lab3)
- Build data products (Lab2)
- Communication: visualization, report findings (Lab2, Lab3)
- Make decisions
CSE4/587 B. Ramamurthy 5/16/2019

Topics for Final Exam
- Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer: Ch. 2, Ch. 3 up to p. 57, and Ch. 5 (text processing, MapReduce, and graph processing, including shortest path and PageRank)
- Lab 2: MapReduce usage details
- Naïve Bayes and Bayesian classification (class notes)
- Field Cady's text, Chapters 6, 7, and 8: focus on Bayes, logistic regression, and evaluation
- Apache Spark: the RDD paper by Zaharia et al. (motivation for Spark, Spark APIs)

Topics for Final Exam
- Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer: Ch. 2, Ch. 3 up to p. 57, and Ch. 5 (text processing, MapReduce, and graph processing, including shortest path and PageRank)
- Lab 2: MapReduce usage details
- Naïve Bayes and Bayesian classification (class notes)
- Apache Spark: the RDD paper by Zaharia et al. (motivation for Spark, Spark APIs)
- Lab3 details: the data pipeline you designed for Lab3

Confusion Matrix
- Used for evaluating and comparing the performance of prediction classifiers.
- We consider only the binary confusion matrix here.
- The next slide shows an easy way to remember the various metrics, and the slide after that shows a sample computation.
- Let's explore.

                  Classified Positive   Classified Negative
Actual Positive   TP                    FN                    Sensitivity = TP/(TP+FN)
Actual Negative   FP                    TN                    Specificity = TN/(FP+TN)

Misclassification Rate = (FN+FP)/Total
Precision = TP/(TP+FP)
Accuracy = (TP+TN)/Total

Sample computation, Total = 200:

                  Classified Positive   Classified Negative
Actual Positive   60                    10                    Sensitivity = TP/(TP+FN) = 60/70
Actual Negative   5                     125                   Specificity = TN/(FP+TN) = 125/130

Misclassification Rate = (FN+FP)/Total = 15/200
Precision = TP/(TP+FP) = 60/65
Accuracy = (TP+TN)/Total = 185/200
Prevalence = 70/200 = 35%
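The sample computation above can be checked with a short Python helper. This is a minimal sketch, not from the course materials; the function name confusion_metrics is my own.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),           # true positive rate (recall)
        "specificity": tn / (fp + tn),           # true negative rate
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
        "misclassification": (fn + fp) / total,
        "prevalence": (tp + fn) / total,         # fraction of actual positives
    }

# The sample computation from the slide: TP=60, FN=10, FP=5, TN=125
m = confusion_metrics(tp=60, fn=10, fp=5, tn=125)
```

Here m["accuracy"] is 185/200 = 0.925 and m["prevalence"] is 70/200 = 0.35, matching the slide.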

Final Exam Format
Five questions (15-20 points each). Closed book and closed notes.
- Classification 1: Naïve Bayes
- Classification 2: logistic regression
- Spark: interpret given code; concepts such as RDDs and lazy evaluation; short answer
- MapReduce synthesis: Lab2 details
- Graph algorithms: problem solving, write pseudocode
- MapReduce analysis: PageRank, shortest path (simulate)
- Evaluating classifier performance: (binary) confusion matrix
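For the "simulate PageRank with MapReduce" style of question, one PageRank iteration can be sketched in plain Python in the map/shuffle/reduce pattern. The three-node example graph and the damping factor 0.85 are my own illustrative assumptions, not from the course materials.

```python
from collections import defaultdict

# Hypothetical example graph as an adjacency list: node -> list of out-links
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank_iteration(graph, ranks, d=0.85):
    """One PageRank iteration, written in the map/shuffle/reduce pattern."""
    n = len(graph)
    # Map phase: each node emits (neighbor, share of its current rank)
    emitted = []
    for node, neighbors in graph.items():
        share = ranks[node] / len(neighbors)
        for nb in neighbors:
            emitted.append((nb, share))
    # Shuffle phase: group contributions by destination node
    grouped = defaultdict(list)
    for nb, share in emitted:
        grouped[nb].append(share)
    # Reduce phase: new rank = teleport term + damped sum of contributions
    return {node: (1 - d) / n + d * sum(grouped[node]) for node in graph}

ranks = {node: 1 / len(graph) for node in graph}  # uniform initial ranks
ranks = pagerank_iteration(graph, ranks)
```

After one iteration, node C (linked to by both A and B) holds the largest rank, and the ranks still sum to 1; iterating the function to convergence gives the full PageRank computation.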