Our Data Science Roadmap


Our Data Science Roadmap
- Raw data collected (Lab1, Lab2)
- Data is processed; data is cleaned (Lab1)
- Exploratory data analysis (EDA): R/RStudio
- Machine learning algorithms and statistical models: Spark ML
- Big data methods: MapReduce (Lab2, Lab3)
- Build data products (Lab2)
- Communication: visualization, report findings (Lab2, Lab3)
- Make decisions
CSE4/587 B. Ramamurthy 5/16/2019

Topics for Final Exam
- Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer: Ch. 2, Ch. 3 up to p. 57, and Ch. 5 (text processing, MapReduce, and graph processing, including shortest path and PageRank)
- Lab 2: MapReduce usage details
- Naïve Bayes and Bayesian classification (class notes)
- Field Cady's text, Chapters 6, 7, and 8: focus on Bayes, logistic regression, and evaluation
- Apache Spark: the RDD paper by Zaharia et al. (motivation for Spark, Spark APIs)

Topics for Final Exam
- Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer: Ch. 2, Ch. 3 up to p. 57, and Ch. 5 (text processing, MapReduce, and graph processing, including shortest path and PageRank)
- Lab 2: MapReduce usage details
- Naïve Bayes and Bayesian classification (class notes)
- Apache Spark: the RDD paper by Zaharia et al. (motivation for Spark, Spark APIs)
- Lab3 details: the data pipeline you designed for Lab3

Confusion Matrix
- Used for evaluating and comparing the performance of prediction classifiers.
- We consider only the binary confusion matrix here.
- The next slide shows an easy way to remember the various metrics, and the slide after that shows a sample computation.
- Let's explore.

                  Classified Positive   Classified Negative
Actual Positive   TP                    FN                    Sensitivity = TP/(TP+FN)
Actual Negative   FP                    TN                    Specificity = TN/(FP+TN)

Misclassification Rate = (FN+FP)/Total
Precision = TP/(TP+FP)
Accuracy = (TP+TN)/Total

Sample computation, Total = 200:

                  Classified Positive   Classified Negative
Actual Positive   60                    10                    Sensitivity = TP/(TP+FN) = 60/70
Actual Negative   5                     125                   Specificity = TN/(FP+TN) = 125/130

Misclassification Rate = (FN+FP)/Total = 15/200
Precision = TP/(TP+FP) = 60/65
Accuracy = (TP+TN)/Total = 185/200
Prevalence = 70/200 = 35%
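The sample computation above can be checked with a short Python helper. This is a minimal sketch, not from the course materials; the function name confusion_metrics is my own.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),           # true positive rate (recall)
        "specificity": tn / (fp + tn),           # true negative rate
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
        "misclassification": (fn + fp) / total,
        "prevalence": (tp + fn) / total,         # fraction of actual positives
    }

# The sample computation from the slide: TP=60, FN=10, FP=5, TN=125
m = confusion_metrics(tp=60, fn=10, fp=5, tn=125)
```

Here m["accuracy"] is 185/200 = 0.925 and m["prevalence"] is 70/200 = 0.35, matching the slide.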

Final Exam Format
Five questions (15-20 points each). Closed book and closed notes.
- Classification 1: Naïve Bayes
- Classification 2: logistic regression
- Spark: interpret given code; concepts such as RDDs and lazy evaluation; short answer
- MapReduce synthesis: Lab2 details
- Graph algorithms: problem solving, write pseudocode
- MapReduce analysis: PageRank, shortest path (simulate)
- Evaluating classifier performance: (binary) confusion matrix
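For the "simulate PageRank with MapReduce" style of question, one PageRank iteration can be sketched in plain Python in the map/shuffle/reduce pattern. The three-node example graph and the damping factor 0.85 are my own illustrative assumptions, not from the course materials.

```python
from collections import defaultdict

# Hypothetical example graph as an adjacency list: node -> list of out-links
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank_iteration(graph, ranks, d=0.85):
    """One PageRank iteration, written in the map/shuffle/reduce pattern."""
    n = len(graph)
    # Map phase: each node emits (neighbor, share of its current rank)
    emitted = []
    for node, neighbors in graph.items():
        share = ranks[node] / len(neighbors)
        for nb in neighbors:
            emitted.append((nb, share))
    # Shuffle phase: group contributions by destination node
    grouped = defaultdict(list)
    for nb, share in emitted:
        grouped[nb].append(share)
    # Reduce phase: new rank = teleport term + damped sum of contributions
    return {node: (1 - d) / n + d * sum(grouped[node]) for node in graph}

ranks = {node: 1 / len(graph) for node in graph}  # uniform initial ranks
ranks = pagerank_iteration(graph, ranks)
```

After one iteration, node C (linked to by both A and B) holds the largest rank, and the ranks still sum to 1; iterating the function to convergence gives the full PageRank computation.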