Data Mining CSCI 307, Spring 2019 Lecture 11

1 Data Mining CSCI 307, Spring 2019 Lecture 11
Output: Rules

2 Instance-based Representation
Simplest form of learning: rote learning
- Training instances are searched for the instance that most closely resembles the new instance
- The instances themselves represent the knowledge
- Also called instance-based learning
Difference from rules (trees, etc.): just store the instances and defer the work ("lazy" learning). There is no need to create and store rules; work from the existing instances to find the one closest to the new instance.

3 Instance-based Representation
The similarity function defines what is "learned." Methods:
- nearest-neighbor: use a distance function to find the instance that is most like the one needing to be classified
- k-nearest-neighbor: use more than one neighbor and classify using the majority vote among the k neighbors (see the sketch below)
Criticism of the method: no structure is "learned"; the instances do not describe the patterns. Proponents reply that the instances combined with the distance function are the structure.
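A minimal k-nearest-neighbor sketch in Python; the toy training set, the Euclidean distance, and the majority vote are illustrative assumptions, not material from the lecture.

```python
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two numeric attribute vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, new_instance, k=3):
    # training: list of (attribute_vector, class_label) pairs.
    # Sort by distance to the new instance, then vote among the k closest.
    neighbors = sorted(training, key=lambda t: euclidean(t[0], new_instance))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((1.0, 2.0), "solid"), ((1.2, 1.9), "solid"),
            ((5.0, 5.5), "open"), ((5.2, 5.1), "open")]
print(knn_classify(training, (1.1, 2.1)))  # two "solid" neighbors outvote one "open"
```

With k=1 the same function implements plain nearest-neighbor classification.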

4 The Distance Function
Simplest case: only one numeric attribute. The distance is the difference between the two attribute values involved (or a function thereof).
- Several numeric attributes: normally, Euclidean distance is used and the attributes are normalized
- Nominal attributes: the distance is set to 1 if the values are different and 0 if they are equal
Are all attributes equally important? Weighting the attributes might be necessary. A sketch of such a mixed distance function follows.
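A minimal sketch of a mixed numeric/nominal distance, assuming per-attribute range information is available for normalization; the helper names and the attr_info layout are hypothetical.

```python
import math

def attribute_distance(x, y, numeric, lo=0.0, hi=1.0):
    if numeric:
        # Normalize both numeric values to [0, 1] using the attribute's range.
        return abs((x - lo) / (hi - lo) - (y - lo) / (hi - lo))
    return 0.0 if x == y else 1.0  # nominal: 1 if different, 0 if equal

def distance(a, b, attr_info):
    # Euclidean distance over mixed numeric/nominal attributes.
    # attr_info: one (numeric?, lo, hi) triple per attribute.
    return math.sqrt(sum(attribute_distance(x, y, *info) ** 2
                         for x, y, info in zip(a, b, attr_info)))

attr_info = [(True, 0.0, 100.0),   # numeric attribute with range 0..100
             (False, None, None)]  # nominal attribute (range unused)
print(distance((30.0, "sunny"), (70.0, "rainy"), attr_info))  # sqrt(0.4**2 + 1**2)
```

Attribute weighting could be added by multiplying each squared term by a per-attribute weight.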

5 Learning Prototypes
Two classes: solid circles and open circles. There are different ways of partitioning the instance space; for example, save only critical examples of each class.
- Nearest-neighbor split: discard the grey instances
- Only those instances involved in a decision need to be stored (we do not want to store all the instances)
- Noisy instances should be filtered out
Idea: only use prototypical examples; one simple selection scheme is sketched below.
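One simple prototype-selection scheme is Hart's condensed nearest-neighbor idea: keep an instance only if the prototypes collected so far misclassify it. A minimal sketch, reusing knn_classify from the earlier example; whether the lecture intends this particular scheme is an assumption.

```python
def condense(training):
    # Start with one stored instance; repeatedly add any instance that
    # the current prototype set misclassifies (k=1 nearest neighbor).
    prototypes = [training[0]]
    changed = True
    while changed:
        changed = False
        for inst, label in training:
            if knn_classify(prototypes, inst, k=1) != label:
                prototypes.append((inst, label))
                changed = True
    return prototypes  # only instances involved in a decision are kept
```

Note that this keeps boundary instances; filtering out noisy instances would be a separate step.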

6 Go Farther: Generalize with Rectangles
- Rectangles enclose instances of the same class; nesting allows an inner region to have a different class
- If two instances fall in the same rectangle, they get the same class, which gives a different decision boundary than the nearest-neighbor split on the previous slide
- The nearest-neighbor rule is used outside the rectangles
- Rectangles are rules! (But they can be more conservative than "normal" rules.)
- Nested rectangles are rules with exceptions (see the sketch below)
Note: nominal regions are hard to visualize; they need multiple dimensions.
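A minimal sketch of rectangles-as-rules, assuming each rectangle is a (bounds, class) pair and inner exception rectangles are listed before the outer ones they nest inside; training and knn_classify come from the earlier sketches.

```python
def in_rect(instance, bounds):
    # bounds: one (lo, hi) interval per attribute.
    return all(lo <= v <= hi for v, (lo, hi) in zip(instance, bounds))

def classify(instance, rectangles, training):
    # First matching rectangle wins, so an inner rectangle listed first
    # overrides the outer one: a rule with an exception.
    for bounds, label in rectangles:
        if in_rect(instance, bounds):
            return label
    # Outside all rectangles, fall back to the nearest-neighbor rule.
    return knn_classify(training, instance, k=1)

rectangles = [(((2.0, 3.0), (2.0, 3.0)), "open"),   # inner exception
              (((0.0, 4.0), (0.0, 4.0)), "solid")]  # enclosing rectangle
print(classify((2.5, 2.5), rectangles, training))   # inner rectangle wins: "open"
```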

7 Representing Clusters I
When clusters are learned, the output takes the form of a diagram. The simplest case: assign a cluster number to each instance, lay out the instances, and partition them.
(Figure: simple 2-D representation)

8 Representing Clusters II
Some clustering algorithms allow clusters to overlap.
(Figure: Venn diagram of overlapping clusters)

9 Representing Clusters III
Probabilistic assignment: some algorithms use probabilities, so for each instance there is a degree of membership in each of the clusters 1, 2, and 3. A sketch of one such soft assignment follows.
(Figure: table of membership degrees for instances a through h)
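A minimal sketch of turning distances into membership degrees; the inverse-distance weighting and the cluster centers are illustrative assumptions, since the lecture does not specify how the probabilities are computed. It reuses euclidean from the earlier sketch.

```python
def memberships(instance, centers):
    # Inverse-distance weights, normalized so the degrees sum to 1;
    # the small epsilon avoids division by zero at a center.
    weights = [1.0 / (euclidean(instance, c) + 1e-9) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]

centers = [(1.0, 2.0), (5.0, 5.0), (9.0, 1.0)]  # clusters 1, 2, 3
for name, inst in [("a", (1.2, 2.1)), ("b", (4.8, 4.0))]:
    print(name, [round(m, 2) for m in memberships(inst, centers)])
```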

10 Representing Clusters IV
Dendrogram: here, the clusters at the "leaves" of the diagram are more tightly clustered than those at the higher levels. (NB: dendron is the Greek word for tree.)
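A minimal sketch of producing a dendrogram with SciPy's hierarchical clustering; the eight two-dimensional points labeled a through h are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9],
                   [9.0, 1.0], [9.2, 1.1], [5.0, 1.0], [5.1, 0.9]])
Z = linkage(points, method="single")  # merge the closest clusters first
dendrogram(Z, labels=list("abcdefgh"))
plt.show()  # pairs merged low in the tree are the most tightly clustered
```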

