Data Mining Brandon Leonardo CS157B (Spring 2006).


1 Data Mining Brandon Leonardo CS157B (Spring 2006)

2 What Is Data Mining? A way to discover knowledge: “semiautomatically analyzing large databases to find useful patterns.” Notable characteristics: large amounts of data; data stored on disk.

3 What Are We Looking For? Rules: use sets of rules to predict or classify objects. Ex. “Students with an annual income of less than $20,000 are most likely to get a student loan.” Patterns: different kinds of patterns; multiple patterns in one data set.

4 What Can Data Mining Do? Applications: Prediction. Predict which class the data will belong to, or what its value will be, based on its attributes. What kind of animal will this be, considering that it has stripes, 4 legs, and talks? Which customers are likely to switch to a competitor?

5 What Can Data Mining Do? Applications: Association: data that goes together in a class. Ex. Amazon: books that are bought together. Causality: whether riding a motorcycle increases your chances of dying in an accident. Descriptive patterns: clusters.

6 Classification: Taking a new item and, given past instances (training instances), figuring out which class the new item belongs in. How? Rules; decision trees; Bayesian classifiers.

7 Rule Classifiers: Determine which class some data belongs in based on rules. Ex. If a new customer signs up for a credit card and makes less than $30,000 a year, then place them in a high-risk category.
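
The credit card rule on this slide could be written as a tiny rule classifier. This is only a sketch: the function name, the parameter names, and the "unclassified" fallback are assumptions, since the slide gives a single rule and no default class.

```python
def classify_credit_risk(annual_income, is_new_customer):
    """Apply the slide's one rule; anything it does not cover is left unclassified."""
    if is_new_customer and annual_income < 30_000:
        return "high risk"
    # Assumption: the slide states no rule for other customers.
    return "unclassified"


print(classify_credit_risk(25_000, True))
print(classify_credit_risk(45_000, True))
```

A real rule classifier would hold many such rules, usually learned from past data rather than written by hand.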

8 Decision Tree Classifiers: Traverse the tree based on attributes, making a decision at each node until a leaf is reached. Ex. Being hired at Google. [Figure: decision tree with nodes Degree (branches: PhD, Bachelors) and School (branches: Stanford, Not Stanford), and leaves Hired / Not Hired.]
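
Tree traversal can be sketched with nested tuples and dictionaries. The attribute names and branch labels come from the slide's figure, but which leaf hangs off which branch is an assumption made purely for illustration, as are the names `TREE` and `classify`.

```python
# Internal node: (attribute_name, {attribute_value: subtree}).
# Leaf: a class label string.
# Assumption: the leaf placement below is illustrative, not read off the figure.
TREE = ("degree", {
    "PhD": ("school", {"Stanford": "Hired", "Not Stanford": "Not Hired"}),
    "Bachelors": ("school", {"Stanford": "Hired", "Not Stanford": "Not Hired"}),
})


def classify(node, item):
    """Walk from the root, following the branch that matches the item, until a leaf."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[item[attribute]]
    return node


print(classify(TREE, {"degree": "PhD", "school": "Stanford"}))
```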

9 Bayesian Classifiers: Predict, for every class, the probability of an item being in that class. The class with the largest probability “wins.” P(cj|d) = P(d|cj) P(cj) / P(d), where P(d|cj) is the probability of generating instance d given class cj, P(cj) is the probability of getting class cj, and P(d) is the probability of d occurring. If a variable isn’t present, it isn’t included in the probability.
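
As a worked example of the formula, the sketch below computes P(cj|d) for two classes and picks the winner. The class names and all probabilities are made-up numbers, and P(d) is obtained by summing P(d|cj) P(cj) over the classes, which assumes the classes are exhaustive and mutually exclusive.

```python
def posterior(likelihoods, priors):
    """Return P(c|d) for every class c using Bayes' theorem.

    likelihoods[c] is P(d|c) and priors[c] is P(c).
    """
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())  # P(d), summed over all classes
    return {c: joint[c] / evidence for c in joint}


# Hypothetical numbers, for illustration only.
priors = {"high risk": 0.25, "low risk": 0.75}
likelihoods = {"high risk": 0.5, "low risk": 0.25}

post = posterior(likelihoods, priors)
winner = max(post, key=post.get)  # the class with the largest probability "wins"
print(post, winner)
```

Here the low-risk class wins even though the instance is more likely under the high-risk class, because the low-risk prior is three times larger.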

10 Regression: Linear regression / curve fitting. Y = a0 + a1*X1 + a2*X2 + … + an*Xn. You find the coefficients a0, a1, a2, …, an that give the best fit. The fit is not always exact: there may be noise in the data, or the relationship may not be polynomial.
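
For the single-variable case Y = a0 + a1*X1, the best-fit coefficients have a closed form under the least-squares criterion. The slide does not say which criterion is meant, so least squares is an assumption here, as are the function name `fit_line` and the sample data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a0 + a1 * x; returns (a0, a1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx               # slope
    a0 = mean_y - a1 * mean_x    # intercept
    return a0, a1


# Hypothetical noisy data lying roughly along y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
a0, a1 = fit_line(xs, ys)
print(a0, a1)
```

Because of the noise, the fitted line passes near the points rather than through them, which is the "not always exact" caveat on the slide.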

11 Association Rules: Rules are denoted by ‘=>’. Support: what fraction of the population has both the antecedent and the consequent of the rule. Confidence: how often the consequent is true when the antecedent is true. Ex. Owning a car => buying gas. Support: 99.9%; confidence: 99.9%; probably true.
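
Support and confidence can be computed directly from their definitions. In this sketch each member of the population is a set of items; the function names and the four-person data set are assumptions for illustration.

```python
def support(population, antecedent, consequent):
    """Fraction of the population containing both antecedent and consequent."""
    both = antecedent | consequent
    return sum(1 for items in population if both <= items) / len(population)


def confidence(population, antecedent, consequent):
    """Fraction of members with the antecedent that also have the consequent."""
    with_antecedent = [items for items in population if antecedent <= items]
    if not with_antecedent:
        return 0.0  # assumption: report 0 when the antecedent never occurs
    hits = sum(1 for items in with_antecedent if consequent <= items)
    return hits / len(with_antecedent)


# Hypothetical population of four people.
population = [{"car", "gas"}, {"car", "gas"}, {"car"}, {"bike"}]
print(support(population, {"car"}, {"gas"}))
print(confidence(population, {"car"}, {"gas"}))
```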

12 Association Rules: Shortcomings. Sometimes items occur together without one causing the other. Ex. Haircuts and grocery shopping: 99% of the population gets haircuts; 100% of the population goes grocery shopping. Everybody who gets a haircut goes grocery shopping, but does that mean that one correlates with the other? Deviation from existing patterns. Correlation (positive and negative).
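
The haircut example can be checked with a little arithmetic: the rule's confidence is perfect, yet it is no higher than the share of people who shop for groceries anyway. The variable names are assumptions; the percentages are the slide's.

```python
p_haircut = 0.99   # fraction of the population that gets haircuts
p_grocery = 1.00   # fraction of the population that goes grocery shopping
p_both = p_haircut  # everyone shops, so every haircut-getter shops too

rule_confidence = p_both / p_haircut        # haircut => grocery shopping
improvement = rule_confidence - p_grocery   # gain over ignoring the antecedent

print(rule_confidence, improvement)
```

An improvement of zero means knowing about the haircut tells us nothing new about grocery shopping, despite the high support and confidence.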

13 Clustering: Clusters of points in a data set; break the set down into subsets. Types: Hierarchical clustering: based on different levels, break things down as you go deeper. Agglomerative clustering: start small, then create higher levels. Divisive clustering: start big, then create lower levels.
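
Agglomerative clustering ("start small, then create higher levels") can be sketched on one-dimensional points. The slide does not say how the distance between clusters is measured; this sketch assumes single linkage (the distance between the two closest members), and the function name and sample points are made up.

```python
def agglomerative(points, target_clusters):
    """Merge the two closest clusters repeatedly until target_clusters remain."""
    clusters = [[p] for p in points]  # start small: one cluster per point
    while len(clusters) > target_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # create the higher level
        del clusters[j]
    return clusters


print(agglomerative([1.0, 1.5, 5.0, 5.2, 9.0], 3))
```

Divisive clustering runs the other way: it starts with one cluster holding every point and splits it repeatedly.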

14 Other Types of Mining: Text mining: mining text documents. Data visualization: maps, charts, and other graphics. Visualization doesn’t analyze the data; it just presents it for users (humans are good at seeing patterns).

15 References Database System Concepts

