Data Mining Jim King.

1 Data Mining Jim King

2 What is Data Mining? A.k.a. knowledge discovery.
The search for previously unknown relationships in large data sets. Why? Improved technology allows vast quantities of data to be gathered. Those relationships can perhaps be used to make future decisions and strategies.

3 How do we Data Mine? Three considerations to be made:
Classification, Association, Sequential.

4 Classification Generate grouping rules.
Future data can then be classified quickly. Example: disease classification based on symptoms may lead to better treatments.

5 Association Two conditions occur together.
Cond1 => Cond2. Presumptive. Objective. Holds with some probability (confidence).
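A minimal sketch of how the confidence of a rule Cond1 => Cond2 could be estimated from data. The function names and the sample baskets are hypothetical, invented only for illustration; confidence is taken here to be the fraction of records satisfying Cond1 that also satisfy Cond2.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, cond1, cond2):
    """Estimated probability of cond2 among transactions where cond1 holds."""
    base = support(transactions, cond1)
    if base == 0:
        return 0.0  # cond1 never occurs, so the rule has no evidence
    return support(transactions, set(cond1) | set(cond2)) / base


# Hypothetical baskets, invented purely for illustration.
baskets = [
    {"cereal", "milk"},
    {"cereal", "milk", "eggs"},
    {"cereal"},
    {"bread"},
]

cereal_to_milk = confidence(baskets, {"cereal"}, {"milk"})
```

With these made-up baskets, three of four contain cereal and two of those three also contain milk, so the rule cereal => milk holds for two thirds of the cereal buyers.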

6 Sequential Event B follows Event A.
Example: in e-commerce, which links do people follow? After following links to a product, how often do they buy?

7 Classification Algorithms
Hard clustering vs. soft clustering. Given a collection of classes { C1, C2, ..., Cn } and an arbitrary object O. Soft clustering: classes may overlap, so an object may belong to multiple classes. Hard clustering: every object may belong to only one class; no overlap.
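One way to picture the difference, as a sketch under assumptions the slides do not state: each class is represented by a center point, hard clustering returns the single nearest class, and soft clustering returns a membership weight per class. The inverse-distance weighting and both function names are hypothetical choices made for this illustration.

```python
import math


def hard_assign(point, centers):
    """Hard clustering: return the index of the single nearest center."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def soft_assign(point, centers):
    """Soft clustering: return one membership weight per class, summing to 1.

    Weights are proportional to inverse distance (an assumed scheme).
    """
    dists = [math.dist(point, c) for c in centers]
    if 0.0 in dists:
        # The point sits exactly on one or more centers: share weight among them.
        raw = [1.0 if d == 0.0 else 0.0 for d in dists]
    else:
        raw = [1.0 / d for d in dists]
    total = sum(raw)
    return [w / total for w in raw]


centers = [(0.0, 0.0), (10.0, 0.0)]  # hypothetical class centers
obj = (2.0, 0.0)                      # arbitrary object O

hard = hard_assign(obj, centers)
soft = soft_assign(obj, centers)
```

Here the hard assignment places O in the first class only, while the soft assignment gives it partial membership in both.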

8 Classification One way: Agglomerative. Every object starts as its own cluster.
Find the two clusters with the least distance and combine them into one cluster. Repeat, and stop when only one cluster remains. Returns a hierarchy of the clustering. Need to decide on some distance function.
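The agglomerative steps can be sketched as follows. This is an illustrative sketch, not the presenter's implementation: it assumes points are coordinate tuples, uses Euclidean distance, and measures cluster-to-cluster distance by the closest pair of members (single linkage), which is only one of several possible choices. The function name `agglomerate` is hypothetical.

```python
import math


def agglomerate(points, dist=math.dist):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]  # every object starts as its own cluster
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


# Hypothetical 2-D points, invented for illustration.
merges = agglomerate([(0, 0), (0, 1), (5, 5)])
```

The returned list of merges, read in order, is the hierarchy: early entries join nearby objects, later entries join whole groups. Passing a different `dist` function changes which merges happen, which is why the choice of distance function matters.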

9 Classification Another way: Division method.
Everything is initially in one cluster. Split it into two clusters, then split each new cluster into two more clusters. Stop when the clusters can’t be divided any more. Requires more computational power, yet usually gives worse results.
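A minimal sketch of the division method. The slides do not say how a cluster should be split, so the rule used here is an assumption: take the two most distant members as seeds and assign every other member to the nearer seed. The function names `split` and `divide` are hypothetical, and points are assumed to be coordinate tuples.

```python
import math


def split(cluster):
    """Split a cluster in two around its two most distant members (assumed rule)."""
    a, b = max(
        ((p, q) for i, p in enumerate(cluster) for q in cluster[i + 1:]),
        key=lambda pair: math.dist(*pair),
    )
    left = [p for p in cluster if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in cluster if math.dist(p, a) > math.dist(p, b)]
    return left, right


def divide(cluster):
    """Return a nested hierarchy; leaves are clusters that cannot be divided."""
    if len(set(cluster)) < 2:  # one distinct point left: cannot divide further
        return cluster
    left, right = split(cluster)
    return (divide(left), divide(right))


# Hypothetical 2-D points, invented for illustration.
tree = divide([(0, 0), (0, 1), (5, 5), (5, 6)])
```

The result is a nested tuple whose outermost level is the first split and whose leaves are single-object clusters.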

10 Association Algorithms
Given constraints, minimize the criteria needed for a condition. Bought cereal & eggs -> Bought milk: 80% confidence. Bought cereal -> Bought milk: 90% confidence.

11 Association Prune conditions which fall below a minimum improvement; this yields simplifications. Other constraints: minimum confidence (e.g. 30% of those with A include B) and minimum support (e.g. 2% have both A and B).
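The pruning idea can be sketched as below. The slides do not define "improvement", so this sketch assumes it means the confidence a rule gains over the best rule built from a proper subset of its conditions. The function names, the default `min_improvement` value, and the baskets are hypothetical; the baskets were constructed to reproduce the 80% and 90% confidences from the cereal example, and the 2% and 30% defaults echo the thresholds mentioned above.

```python
from itertools import combinations


def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Fraction of transactions with all of `lhs` that also contain `rhs`."""
    base = support(transactions, lhs)
    return support(transactions, set(lhs) | {rhs}) / base if base else 0.0


def improvement(transactions, lhs, rhs):
    """Confidence gained over the best rule using a proper subset of lhs."""
    lhs = sorted(lhs)
    simpler = [
        confidence(transactions, sub, rhs)
        for r in range(len(lhs))           # subset sizes 0 .. len(lhs) - 1
        for sub in combinations(lhs, r)
    ]
    return confidence(transactions, lhs, rhs) - max(simpler, default=0.0)


def keep_rule(transactions, lhs, rhs,
              min_support=0.02, min_confidence=0.30, min_improvement=0.01):
    """True if the rule lhs -> rhs passes all three constraints."""
    return (
        support(transactions, set(lhs) | {rhs}) >= min_support
        and confidence(transactions, lhs, rhs) >= min_confidence
        and improvement(transactions, lhs, rhs) >= min_improvement
    )


# Hypothetical baskets built to match the 80% / 90% example.
baskets = (
    [{"cereal", "eggs", "milk"}] * 4
    + [{"cereal", "eggs"}]
    + [{"cereal", "milk"}] * 5
    + [{"bread"}] * 10
)
```

Under these assumptions, `keep_rule(baskets, {"cereal"}, "milk")` passes, while `keep_rule(baskets, {"cereal", "eggs"}, "milk")` is pruned: adding eggs lowers confidence from 90% to 80%, so the longer rule has negative improvement and the simpler rule is kept instead.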

12 Sequential Algorithms
People buy basic camping equipment, then later buy other related items. Starting with basic item sets, try to concatenate them and look for the resulting set in customer behavior.

13 Sequential If the resulting item set is not supported (at all, or above a threshold), drop it. Sequences do not have to be contiguous: e.g. if a customer buys A, then B, then C, the sequence A then C is valid.
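A sketch of the concatenate-and-check loop, assuming each customer's purchases are stored as an ordered tuple. The function names and the sample histories are hypothetical. The containment test allows gaps, so A then C is found inside A, B, C.

```python
def contains_sequence(history, pattern):
    """True if `pattern` occurs in `history` in order, gaps allowed."""
    it = iter(history)
    # `item in it` advances the iterator, so later items must appear later.
    return all(item in it for item in pattern)


def sequence_support(histories, pattern):
    """Fraction of customer histories that contain the pattern."""
    return sum(contains_sequence(h, pattern) for h in histories) / len(histories)


def extend_patterns(histories, patterns, items, min_support):
    """Concatenate each pattern with each item; drop unsupported results."""
    candidates = [p + (i,) for p in patterns for i in items]
    return [c for c in candidates if sequence_support(histories, c) >= min_support]


# Hypothetical purchase histories, invented for illustration.
histories = [("A", "B", "C"), ("A", "C"), ("C", "A")]

kept = extend_patterns(histories, [("A",)], ["A", "B", "C"], min_support=0.5)
```

With these made-up histories, only the extension A then C survives a 50% threshold; the history C, A does not count toward it because the order is reversed.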

14 Case Study - SchulWeb A search site for schools in Germany.
How to improve performance and user satisfaction? Use the log to track user navigation patterns (i.e. which URLs were requested, and in what order?). Extract information from these.
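As a rough sketch of what extracting navigation patterns from a log might involve: group requests by user, order them by time, and count which URL follows which. The record layout `(user, timestamp, url)`, the function names, and the URLs below are all hypothetical; the actual SchulWeb log format is not described in the slides.

```python
from collections import Counter, defaultdict


def sessions_from_log(entries):
    """Group (user, timestamp, url) records into per-user URL lists ordered by time."""
    by_user = defaultdict(list)
    for user, _timestamp, url in sorted(entries, key=lambda e: (e[0], e[1])):
        by_user[user].append(url)
    return dict(by_user)


def transition_counts(sessions):
    """Count how often one URL is requested directly after another."""
    counts = Counter()
    for urls in sessions.values():
        counts.update(zip(urls, urls[1:]))
    return counts


# Hypothetical log records, invented for illustration.
log = [
    ("u1", 2, "/search"),
    ("u1", 1, "/"),
    ("u2", 1, "/"),
    ("u2", 2, "/search"),
    ("u1", 3, "/result"),
]

sessions = sessions_from_log(log)
counts = transition_counts(sessions)
```

The most frequent transitions in `counts` point to the paths users actually take through the site, which is the kind of information the case study then interprets.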

15 Interpretations of Mining
Users don’t like to type text; they prefer to select from available choices. What were they looking for? Schools close to some region. They used the option to specify a state (for location) and the option to specify a school type (to limit search size).

16 Changes Made Made “Near Town” the default.
Made the option obvious, and people started to use it. Limited the region size further, so shorter lists were produced. Shorter lists are less intimidating, and more people found what they needed.

17 Conclusions Data mining is a useful tool with multiple algorithms that can be tuned for specific tasks. It can benefit business, medicine, and science. More efficient algorithms are needed to speed up the data mining process.

18 Conclusions Making data mining easier to use:
Data with rich descriptions (more fields). More data/records. Controlled/reliable data collection (automated vs. manual). A way to evaluate results. Integrate information gained back into the system.

19 Final Questions?

