Data Science Algorithms: The Basic Methods

Data Science Algorithms: The Basic Methods
Instance-Based Learning
WFH: Data Mining, Chapter 4.7
Rodney Nielsen
Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

Algorithms: The Basic Methods
- Inferring rudimentary rules
- Naïve Bayes, probabilistic model
- Constructing decision trees
- Constructing rules
- Association rule learning
- Linear models
- Instance-based learning
- Clustering

Instance-Based Learning
- The distance function defines what's learned
- Most instance-based schemes use Euclidean distance between two instances x^(i) and x^(l) with d attributes:
    d(x^(i), x^(l)) = \sqrt{\sum_{j=1}^{d} (x_j^{(i)} - x_j^{(l)})^2}
- Taking the square root is not required when comparing relative distances (see the sketch below)
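A minimal sketch of this distance computation in Python (the helper names are mine, not from WFH); the last lines show the square-root shortcut for relative comparisons:

```python
import math

def squared_distance(xi, xl):
    # Sum of squared per-attribute differences between two instances,
    # each given as a sequence of d numeric attribute values.
    return sum((a - b) ** 2 for a, b in zip(xi, xl))

def euclidean_distance(xi, xl):
    return math.sqrt(squared_distance(xi, xl))

# When only relative distances matter (e.g. "which instance is nearest?"),
# the monotone sqrt can be skipped:
train = [(0.0, 1.0), (0.5, 0.5), (0.9, 0.2)]
nearest = min(train, key=lambda x: squared_distance(x, (0.4, 0.6)))
```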

Normalization and Other Issues
- Different attributes are measured on different scales, so they need to be normalized to [0, 1]:
    a = (v - min) / (max - min), where v is the actual attribute value and min/max are taken over the training data
- Nominal attributes: distance is either 0 or 1
- Common policy for missing values: assumed to be maximally distant (given normalized attributes)
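A sketch of these policies, assuming numeric attributes are min-max normalized first; treating any missing value as distance 1 is a simplification of the "maximally distant" policy:

```python
def min_max_normalize(column):
    # Rescale a numeric attribute to [0, 1]: a = (v - min) / (max - min).
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]

def attribute_distance(a, b, nominal=False):
    # Per-attribute distance on normalized data.
    if a is None or b is None:
        return 1.0                     # missing: assumed maximally distant
    if nominal:
        return 0.0 if a == b else 1.0  # nominal: 0 if equal, 1 otherwise
    return abs(a - b)                  # numeric, already scaled to [0, 1]
```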

Finding Nearest Neighbors Efficiently
- Simplest way of finding a nearest neighbor: a linear scan of the data (sketched below)
  - Classification then takes time proportional to the product of the number of instances in the training and test sets
- Nearest-neighbor search can be done more efficiently using appropriate data structures
- We will discuss two methods that represent the training data in a tree structure: kD-trees and ball trees
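For concreteness, the O(n)-per-query linear-scan baseline, reusing squared_distance from the earlier sketch:

```python
def nearest_neighbor(train, query):
    # train: list of (attribute_vector, class_label) pairs.
    best, best_d2 = None, float("inf")
    for x, label in train:
        d2 = squared_distance(x, query)  # squared distance suffices here
        if d2 < best_d2:
            best, best_d2 = (x, label), d2
    return best
```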

kD-tree Example

Using kD-trees: Example

kD-trees
- Complexity depends on the depth of the tree, which for a balanced tree is the logarithm of the number of nodes

kD-trees
- The amount of backtracking required depends on the quality of the tree ("square" vs. "skinny" nodes)

kD-trees
- How to build a good tree? Need to find a good split point and split direction
  - Split direction: the direction with the greatest variance
  - Split point: the median value along that direction

kD-trees
- Using the value closest to the mean (rather than the median) can be better if the data is skewed
- Can apply this recursively (see the construction sketch below)
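A construction sketch along these lines, splitting on the axis of greatest variance at the median; KDNode is a hypothetical minimal node type, and statistics.pvariance is standard-library Python:

```python
import statistics

class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kd_tree(points):
    # Recursive kD-tree construction: split direction = axis of greatest
    # variance; split point = median value along that axis.
    if not points:
        return None
    d = len(points[0])
    axis = max(range(d), key=lambda j: statistics.pvariance([p[j] for p in points]))
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid], axis,
                  build_kd_tree(points[:mid]),
                  build_kd_tree(points[mid + 1:]))

# Example: root = build_kd_tree([(0.2, 0.7), (0.8, 0.1), (0.5, 0.5)])
```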

Building Trees Incrementally
- Big advantage of instance-based learning: the classifier can be updated incrementally
  - Just add a new training instance!
- Can we do the same with kD-trees? Heuristic strategy (sketched below):
  - Find the leaf node containing the new instance
  - Place the instance into the leaf if the leaf is empty
  - Otherwise, split the leaf according to the longest dimension (to preserve squareness)
- The tree should be re-built occasionally (e.g., if its depth grows to twice the optimum depth)
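One possible rendering of this heuristic (a sketch under my own assumptions, not WFH's exact procedure), reusing the KDNode type above; region bounds are threaded down so that an empty leaf can record its region's longest dimension as the future split axis:

```python
def kd_insert(node, point, lo, hi):
    # lo/hi: bounds of the hyperrectangle that `node` covers.
    if node is None:
        # Empty leaf: place the instance here; choosing the region's longest
        # dimension as its split axis keeps future cells roughly square.
        axis = max(range(len(point)), key=lambda j: hi[j] - lo[j])
        return KDNode(point, axis)
    a = node.axis
    if point[a] <= node.point[a]:
        child_hi = list(hi); child_hi[a] = node.point[a]
        node.left = kd_insert(node.left, point, lo, child_hi)
    else:
        child_lo = list(lo); child_lo[a] = node.point[a]
        node.right = kd_insert(node.right, point, child_lo, hi)
    return node

# Example (normalized data): root = kd_insert(root, p, [0.0] * d, [1.0] * d)
```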

Ball Trees
- Problem in kD-trees: corners
- Observation: there is no need to make sure that regions don't overlap
- So we can use balls (hyperspheres) instead of hyperrectangles
  - A ball tree organizes the data into a tree of d-dimensional hyperspheres
  - This normally allows a better fit to the data, and thus more efficient search

Ball Tree Example

Using Ball Trees
- Nearest-neighbor search is done using the same backtracking strategy as in kD-trees
- A ball can be ruled out from consideration if the distance from the target to the ball's center exceeds the ball's radius plus the current upper bound (the distance to the best neighbor found so far)
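The pruning test as a one-liner, reusing euclidean_distance from the earlier sketch; best_dist stands for the current upper bound:

```python
def can_prune(target, center, radius, best_dist):
    # The ball cannot contain anything closer than best_dist if the
    # target lies more than radius + best_dist away from its center.
    return euclidean_distance(target, center) > radius + best_dist
```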

Building Ball Trees
- Ball trees are built top down (like kD-trees)
- Don't have to continue until leaf balls contain just two points: can enforce a minimum occupancy (as in kD-trees)
- Basic problem: splitting a ball into two
- Simple (linear-time) split selection strategy (sketched below):
  - Choose the point farthest from the ball's center
  - Choose a second point farthest from the first one
  - Assign each point to the nearer of these two points
  - Compute cluster centers and radii based on the two subsets to get two balls
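A sketch of this linear-time split, assuming at least two distinct points; helper names are illustrative:

```python
def split_ball(points):
    # Centroid of the current ball's points.
    center = [sum(col) / len(points) for col in zip(*points)]
    p1 = max(points, key=lambda p: squared_distance(p, center))  # farthest from center
    p2 = max(points, key=lambda p: squared_distance(p, p1))      # farthest from p1
    left, right = [], []
    for p in points:  # assign each point to the nearer pivot
        (left if squared_distance(p, p1) <= squared_distance(p, p2) else right).append(p)
    def ball(subset):
        # New center and radius from a subset of points.
        c = [sum(col) / len(subset) for col in zip(*subset)]
        r = max(euclidean_distance(p, c) for p in subset)
        return c, r
    return (ball(left), left), (ball(right), right)
```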

Discussion of Nearest-Neighbor Learning
- Often very accurate
- Assumes all attributes are equally important
  - Remedy: attribute selection or attribute weights
- Possible remedies against noisy instances:
  - Take a majority vote over the k nearest neighbors (sketched below)
  - Remove noisy instances from the dataset (difficult!)
- If n → ∞ and k/n → 0 (k: number of neighbors, n: number of training instances), the error approaches the minimum
- kD-trees become inefficient when the number of attributes is too large (approximately > 10)
- Ball trees (which are instances of metric trees) work well in somewhat higher-dimensional spaces
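A majority-vote k-NN classifier along the lines of the first remedy, reusing squared_distance; k is a free parameter:

```python
from collections import Counter

def knn_classify(train, query, k=5):
    # Sort training pairs by (squared) distance to the query,
    # then take a majority vote over the k nearest labels.
    neighbors = sorted(train, key=lambda xy: squared_distance(xy[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```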