

Algorithms for Data Analytics Chapter 3

Plans
Introduction to data-intensive computing (Lecture 1)
Statistical inference: foundations of statistics (Chapter 2) (Lecture 2)
This week we will look at algorithms for data analytics (Chapter 3)
A data scientist: Stats (Ch. 2) + Algorithms (Ch. 3) + Big Data (Lin & Dyer's text)
Uniqueness of this course
Using the right tools and pre-existing libraries "creatively" (see Project 1)
Statistical inference comes from statisticians (nothing new)
Algorithms come from computer scientists (nothing new)
Both areas have taken on a new meaning in the context of big data

Data Analytics (Data Science)
[Diagram labels: Data; EDA; intuition/understanding; big-data analytics; stats/algs; discoveries/intelligence; statistical inference; decisions/answers/results]

Three Types of Data Science Algorithms
Pipelines (data flow) to prepare data
Three types:
1. Data preparation algorithms such as sorting, MapReduce, and Pregel
2. Optimization algorithms such as stochastic gradient descent, least squares…
3. Machine learning algorithms…
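As a rough illustration of the second type, the sketch below fits a least-squares line with stochastic gradient descent. It is not taken from the slides: the function name `sgd_least_squares`, the learning rate, the epoch count, and the toy data are all assumptions chosen only to show the idea.

```python
import random

def sgd_least_squares(xs, ys, lr=0.1, epochs=500, seed=0):
    """Fit y ~ w*x + b by minimizing squared error, one sample at a time."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)             # visit the samples in random order
        for i in order:
            err = (w * xs[i] + b) - ys[i]
            # Gradient of err**2 with respect to w and b
            w -= lr * 2 * err * xs[i]
            b -= lr * 2 * err
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [2 * x + 1 for x in xs]
w, b = sgd_least_squares(xs, ys)
print(round(w, 3), round(b, 3))
```

On real data the step size usually needs tuning or a decay schedule; the constant value here works only because the toy inputs are small and noise-free.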

Machine Learning Algorithms
Come from artificial intelligence
No underlying generative process
Built to predict or classify something….
Read the very nice comparison on p. 53
Three algorithms are discussed: linear regression, k-NN, k-means
We will start with k-means… and move backwards
Exclusive algorithms: what one can accomplish, the other(s) cannot

K-means
K-means is unsupervised: there is no prior knowledge of the "right answer"
The goal of the algorithm is to determine the definition of the right answer by finding clusters in the data
Kinds of data: g+ data, survey data, medical data, SAT scores
Assume data with attributes {age, gender, income, state, household size}; your goal is to segment the users.
Let's understand k-means using an example.
Also read about the "birth of statistics" in John Snow's classic study of the cholera epidemic in London in 1854: "cluster" around the Broad Street pump.
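To make the clustering idea concrete, here is a minimal sketch of the standard k-means iteration (Lloyd's algorithm): assign every point to its nearest centroid, then move each centroid to the mean of its points, and repeat until nothing changes. The slides do not give an implementation, so the function name `kmeans`, the Euclidean distance, the random initialization, and the (age, income in thousands) toy users are all assumptions.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) after running Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # assumption: start from k random points
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:            # converged: centroids stopped moving
            break
        centroids = updated
    return centroids, labels

# Hypothetical users described by (age, income in thousands)
users = [(22, 25), (25, 30), (27, 28), (58, 90), (61, 95), (64, 88)]
centroids, labels = kmeans(users, k=2)
print(labels)
```

The cluster numbers themselves carry no meaning; only which users share a number matters. In practice the attributes would be scaled first so that no single one dominates the distance.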

K-NN
K-nearest neighbors
Supervised ML
You know the "right answers", or at least have data that is "labeled": the training set
One set of objects has been classified or labeled (training set)
Another set of objects is yet to be labeled or classified (test set)
Your goal is to automate the process of labeling the test set.
The intuition behind k-NN is to consider the most similar items (similarity defined by their attributes), look at their existing labels, and assign the object a label.
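The intuition above can be sketched in a few lines: find the k labeled items closest to the new object and take a majority vote over their labels. The name `knn_predict`, the use of Euclidean distance as the similarity measure, and the labeled (age, income in thousands) examples are assumptions for illustration, not something the slides specify.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (features, label) pairs; returns the majority label."""
    # Sort the training items by distance to the query and keep the k closest
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set: (age, income in thousands) -> credit label
train = [
    ((22, 25), "low"), ((25, 30), "low"), ((27, 28), "low"),
    ((58, 90), "high"), ((61, 95), "high"), ((64, 88), "high"),
]
print(knn_predict(train, (26, 29), k=3))
print(knn_predict(train, (60, 92), k=3))
```

With two classes, an odd k avoids tied votes.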

K-NN Issues
How many nearest neighbors? In other words, what is the value of k?
Implications of small k and large k
How do we define similarity or closeness?
Error rate or misclassification rate (k can be chosen to lower this)
Curse of dimensionality
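One way to choose k by error rate is to estimate the misclassification rate for several candidate values and keep the one with the lowest estimate. The sketch below uses leave-one-out evaluation on a tiny labeled set; the helper names, the Euclidean distance, and the data are assumptions, and a held-out test set or cross-validation would serve the same purpose.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training items nearest to the query."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_error(data, k):
    """Leave-one-out misclassification rate of k-NN on labeled data."""
    mistakes = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]   # hold out item i, train on the rest
        if knn_predict(rest, features, k) != label:
            mistakes += 1
    return mistakes / len(data)

# Hypothetical labeled data: (age, income in thousands) -> credit label
data = [
    ((22, 25), "low"), ((25, 30), "low"), ((27, 28), "low"),
    ((58, 90), "high"), ((61, 95), "high"), ((64, 88), "high"),
]
for k in (1, 3, 5):
    print(k, loo_error(data, k))
```

On this toy set, k = 5 misclassifies every held-out item, because the five remaining points always contain three of the opposite class. This is the large-k failure mode: the vote is dominated by points that are not actually near the query.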