Download presentation

Presentation is loading. Please wait.

Published byHudson Willey Modified over 2 years ago

1
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Advanced Database Systems F24DS2 / F29AT2 About Data Mining

2
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Data Mining has Many meanings There are lots of things you can do with a database: 1.Access data via straightforward queries, to answer straightforward questions about instances. E.g.: ``What is Ellen McArthur’s home phone number?” “What is the ISBN number of Eats, shoots and leaves, by Lynne Truss?”, “What grade did Larry Page get for the Internet module??” “Give me a list of all pages on the www that contain the phrase “fried egg sandwich”.

3
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Data Mining has Many meanings (cont…) 2.Generate simple reports about data via straightforward queries, to answer questions about sets of instances. E.g.: ``How many of our customers are called “Trevor?”” “Which of our books has been borrowed more times in the last month than Eats, shoots and leaves, by Lynne Truss?”, “Which student has the highest average marks?” “ What percentage of house owners also own a car?”.

4
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Data Mining has Many meanings (cont…) 3.Generate complex/and/or comprehensive statistical reports about the database as a whole, to summarise and understand the data – this is what tends to be done in the Analysis stage of Data Cleaning.. E.g.: For each field, generate a histogram of the values Run one or more clustering algorithms to find the clusters in the data.

5
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Data Mining has Many meanings (cont…) 4.Build Predictive models that may then be useful for business or research. For example: Based on stock market data, we can construct a model that attempts to predict tomorrow’s Dow Index closing price, given the previous few days’ prices. Based on blood test data from past patients, we can construct a model that attempts to predict whether or not a patient is developing hepatitis. Based on historic data on vibrations, we can build a model that tries to predict beforehand if an aircraft wing is likely to fail.

6
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Data Mining has Many meanings (cont…) 5.Discover INTERESTING and USEFUL rules that are `hidden’ in the data. For example: An analysis of supermarket basket data will show a surprising amount of baskets that contain both beer and nappies.. Analysis of crime records data may find that the violent crimes rate in newcastle seems to reduce significantly whenever the violent crimes rate in Sunderland increases significantly..

7
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me All Together 1.Accessing 2.Reporting 3.Clustering/Histograms 4.Predictive models 5.Discovery of interesting/surprising things

8
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Data Mining 1.Accessing 2.Reporting 3.Clustering/Histograms 4.Predictive models 5.Discovery of interesting/surprising things When you hear the term `data mining’, it can mean any of 2, 3, 4 and 5. In business/industry, `2’ and `3’ are called data mining. In academia we usually take data mining to mean mainly `4’ and `5’

9
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Notes on 3/4 1.Accessing 2.Reporting 3.Clustering/Histograms 4.Predictive models 5.Discovery of interesting/surprising things These are the things that you would look at more closely in a machine learning course. The predictive models are things like neural networks, decision trees and rulesets

10
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me What DM means for us 1.Accessing 2.Reporting 3.Clustering/Histograms 4.Predictive models 5.Discovery of interesting/surprising things So, 3 and 4 are dealt with in another course. 5 could be an entire MSc course on its own, but that is what DM means for us.. In particular, we take a small bite of it that is relevant to practical discovery of interesting things in very large DBs. We look at a fast algorithm that can discover interesting rules in transaction databases, and that is a component in several advanced commercial systems..

11
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me First, some important motivational/explanatory notes Why do we need something like `type 5’ data mining at all? Couldn’t the `beer and nappies’ thing have been found by types 2 or 3 DM? The next slide shows a tiny `supermarket basket’ database. E.g. Record 11 is a customer who bought eggs and glue only; record 12 Records a transaction where the basket contained only apples.

12
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me ID apples, beer, cheese, dates, eggs, fish, glue, honey, ice-cream

13
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Numbers Our example DB has 20 records of supermarket transactions, from a supermarket that only sells 9 things One month in a large supermarket with five stores spread around a reasonably sized city might easily yield a DB of 20,000,000 baskets, each containing a set of products from a pool of around 1,000

14
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Rules A `rule’ is something like this: If a basket contains apples and cheese, then it also contains beer Any such rule has two associated measures: 1.confidence – when the `if’ part is true, how often is the `then’ bit true? 2.coverage or support – how much of the database contains the `if’ part?

15
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Example: What is the confidence and coverage of: If the basket contains beer and cheese, then it also contains honey 2/20 of the records contain both beer and cheese, so coverage is 10% Of these 2, 1 contains honey, so confidence is 50% Is that interesting ? Is that useful ? What makes a rule interesting or useful?

16
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Interesting/Useful rules Statistically, anything that is interesting is something that happens significantly more than you would expect by chance. E.g. basic statistical analysis of basket data may show that 10% of baskets contain bread, and 4% of baskets contain washing-up powder. I.e: –There is a probability 0.1 that a basket contains bread. –There is a probability 0.04 that a basket contains washing-up powder.

17
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Bread and washing up powder What is the probability of a basket containing both bread and washing-up powder? The laws of probability say: If these two things are independent, chance is 0.1 * 0.04 = That is, we would expect 0.4% of baskets to contain both bread and washing up powder

18
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Interesting means surprising We therefore have a prior expectation that just 4 in 1,000 baskets should contain both bread and washing up powder. If we investigate, and discover that really it is 20 in 1,000 baskets, then we will be very surprised. It tells us that: –Something is going on in shoppers’ minds: bread and washing-up powder are connected in some way. –There may be ways to exploit this discovery … put the powder and bread at opposite ends of the supermarket?

19
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Finding surprising rules Suppose we ask `what is the most surprising rule in this database? This would be, presumably, a rule whose accuracy is more different from its expected accuracy than any others. But it also has to have a suitable level of coverage, or else it may be just a statistical blip, and/or unexploitable. Looking only at rules of the form: if basket contains X and Y, then it also contains Z … our realistic numbers tell us that there may be around 500,000,000 distinct possible rules. For each of these we need to work out its accuracy and coverage, by trawling through a database of around 20,000,000 basket records. … c operations … Yes, it’s easy to use `type 2’ DM, say, to work out the confidence and coverage of a given rule. But type 5 DM is all about searching through, somehow, 500,000,000 (or usually immensely more) rules to sniff out what may be the interesting ones.

20
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Here are some interesting ones in our mini basket DB: If a basket contains glue, then it also contains either beer or eggs confidence: 100% ; coverage 25% If a basket contains apples and dates, then it also contains honey confidence 100% ; coverage 20%

21
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me What this lecture was about The many different meanings of data mining Warming up for the next lecture, via gentle discussion on transaction databases, rules, confidence, coverage, and what it takes for a rule to be interesting.

22
These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Next A classic fast algorithm for finding useful rules in large databases,

Similar presentations

© 2016 SlidePlayer.com Inc.

All rights reserved.

Ads by Google