These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to.

Presentation transcript:

These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Advanced Database Systems F24DS2 / F29AT2 About Data Mining

Data Mining has Many Meanings
There are lots of things you can do with a database:
1. Access data via straightforward queries, to answer straightforward questions about instances. E.g.:
“What is Ellen McArthur’s home phone number?”
“What is the ISBN of Eats, shoots and leaves, by Lynne Truss?”
“What grade did Larry Page get for the Internet module?”
“Give me a list of all pages on the WWW that contain the phrase ‘fried egg sandwich’.”

Data Mining has Many Meanings (cont…)
2. Generate simple reports about data via straightforward queries, to answer questions about sets of instances. E.g.:
“How many of our customers are called ‘Trevor’?”
“Which of our books have been borrowed more times in the last month than Eats, shoots and leaves, by Lynne Truss?”
“Which student has the highest average marks?”
“What percentage of house owners also own a car?”

Data Mining has Many Meanings (cont…)
3. Generate complex and/or comprehensive statistical reports about the database as a whole, to summarise and understand the data. This is what tends to be done in the Analysis stage of Data Cleaning. E.g.:
For each field, generate a histogram of the values.
Run one or more clustering algorithms to find the clusters in the data.

Data Mining has Many Meanings (cont…)
4. Build predictive models that may then be useful for business or research. For example:
Based on stock market data, we can construct a model that attempts to predict tomorrow’s Dow Index closing price, given the previous few days’ prices.
Based on blood test data from past patients, we can construct a model that attempts to predict whether or not a patient is developing hepatitis.
Based on historic data on vibrations, we can build a model that tries to predict beforehand if an aircraft wing is likely to fail.

Data Mining has Many Meanings (cont…)
5. Discover INTERESTING and USEFUL rules that are ‘hidden’ in the data. For example:
An analysis of supermarket basket data will show a surprising number of baskets that contain both beer and nappies.
Analysis of crime records data may find that the violent crime rate in Newcastle seems to fall significantly whenever the violent crime rate in Sunderland rises significantly.

All Together
1. Accessing
2. Reporting
3. Clustering/Histograms
4. Predictive models
5. Discovery of interesting/surprising things

Data Mining
1. Accessing
2. Reporting
3. Clustering/Histograms
4. Predictive models
5. Discovery of interesting/surprising things
When you hear the term ‘data mining’, it can mean any of 2, 3, 4 and 5. In business/industry, 2 and 3 are called data mining. In academia we usually take data mining to mean mainly 4 and 5.

Notes on 3/4
1. Accessing
2. Reporting
3. Clustering/Histograms
4. Predictive models
5. Discovery of interesting/surprising things
These are the things that you would look at more closely in a machine learning course. The predictive models are things like neural networks, decision trees and rulesets.

What DM means for us
1. Accessing
2. Reporting
3. Clustering/Histograms
4. Predictive models
5. Discovery of interesting/surprising things
So, 3 and 4 are dealt with in another course. 5 could be an entire MSc course on its own, but that is what DM means for us. In particular, we take a small bite of it that is relevant to the practical discovery of interesting things in very large DBs. We look at a fast algorithm that can discover interesting rules in transaction databases, and that is a component of several advanced commercial systems.

First, some important motivational/explanatory notes
Why do we need something like ‘type 5’ data mining at all? Couldn’t the ‘beer and nappies’ thing have been found by types 2 or 3 DM?
The next slide shows a tiny ‘supermarket basket’ database. E.g. record 11 is a customer who bought eggs and glue only; record 12 records a transaction where the basket contained only apples.

Table (rows not preserved in this transcript), with columns: ID, apples, beer, cheese, dates, eggs, fish, glue, honey, ice-cream

Numbers
Our example DB has 20 records of supermarket transactions, from a supermarket that only sells 9 things.
One month in a large supermarket with five stores spread around a reasonably sized city might easily yield a DB of 20,000,000 baskets, each containing a set of products from a pool of around 1,000.

Rules
A ‘rule’ is something like this:
If a basket contains apples and cheese, then it also contains beer.
Any such rule has two associated measures:
1. confidence: when the ‘if’ part is true, how often is the ‘then’ bit true?
2. coverage or support: how much of the database contains the ‘if’ part?
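
The two measures above can be sketched in a few lines of Python. This is only an illustration: the function names `coverage` and `confidence` and the four-basket list are made up for the example, and the baskets are NOT the 20-record database from the slides.

```python
def coverage(baskets, if_part):
    """Fraction of all baskets that contain every item in the 'if' part."""
    if_part = set(if_part)
    matching = [b for b in baskets if if_part <= b]
    return len(matching) / len(baskets)


def confidence(baskets, if_part, then_part):
    """Of the baskets matching the 'if' part, the fraction that also
    contain every item in the 'then' part. Returns None if no basket
    matches the 'if' part, since confidence is undefined in that case."""
    if_part, then_part = set(if_part), set(then_part)
    matching = [b for b in baskets if if_part <= b]
    if not matching:
        return None
    hits = [b for b in matching if then_part <= b]
    return len(hits) / len(matching)


# Hypothetical toy data, invented for this sketch only.
toy_baskets = [
    {"apples", "cheese", "beer"},
    {"apples", "cheese"},
    {"beer"},
    {"eggs", "glue"},
]

# Rule: if a basket contains apples and cheese, then it also contains beer.
rule_coverage = coverage(toy_baskets, {"apples", "cheese"})
rule_confidence = confidence(toy_baskets, {"apples", "cheese"}, {"beer"})
print(rule_coverage, rule_confidence)
```

Each call scans the whole basket list once, which is fine for checking one given rule but says nothing yet about how to search over many candidate rules.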

Example: What are the confidence and coverage of:
If the basket contains beer and cheese, then it also contains honey
2/20 of the records contain both beer and cheese, so coverage is 10%.
Of these 2, 1 contains honey, so confidence is 50%.
Is that interesting? Is that useful? What makes a rule interesting or useful?

Interesting/Useful rules
Statistically, anything that is interesting is something that happens significantly more than you would expect by chance.
E.g. basic statistical analysis of basket data may show that 10% of baskets contain bread, and 4% of baskets contain washing-up powder. I.e.:
- There is a probability 0.1 that a basket contains bread.
- There is a probability 0.04 that a basket contains washing-up powder.

Bread and washing-up powder
What is the probability of a basket containing both bread and washing-up powder? The laws of probability say:
If these two things are independent, the chance is 0.1 * 0.04 = 0.004.
That is, we would expect 0.4% of baskets to contain both bread and washing-up powder.

Interesting means surprising
We therefore have a prior expectation that just 4 in 1,000 baskets should contain both bread and washing-up powder. If we investigate, and discover that really it is 20 in 1,000 baskets, then we will be very surprised. It tells us that:
- Something is going on in shoppers’ minds: bread and washing-up powder are connected in some way.
- There may be ways to exploit this discovery … put the powder and bread at opposite ends of the supermarket?
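
The bread and washing-up powder arithmetic can be checked with a short sketch. The function names are made up for the example, and exact fractions are used so that no rounding creeps in. The ratio of observed to expected frequency is often called "lift" in the association-rule literature, although these slides do not use that term.

```python
from fractions import Fraction


def expected_joint(p_a, p_b):
    """Probability of seeing both items together IF they are independent."""
    return p_a * p_b


def surprise_ratio(observed_joint, p_a, p_b):
    """How many times more often the pair occurs than independence predicts."""
    return observed_joint / expected_joint(p_a, p_b)


p_bread = Fraction(10, 100)          # 10% of baskets contain bread
p_powder = Fraction(4, 100)          # 4% contain washing-up powder
observed_both = Fraction(20, 1000)   # 20 in 1,000 baskets contain both

expected = expected_joint(p_bread, p_powder)
ratio = surprise_ratio(observed_both, p_bread, p_powder)

print(f"expected under independence: {float(expected):.3f}")
print(f"observed: {float(observed_both):.3f}")
print(f"observed / expected: {ratio}")
```

A ratio well above 1, as here, is what the slide means by a surprising result; a ratio close to 1 would suggest the two items are bought independently.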

Finding surprising rules
Suppose we ask ‘what is the most surprising rule in this database?’ This would be, presumably, a rule whose accuracy (confidence) differs from its expected accuracy more than that of any other rule. But it also has to have a suitable level of coverage, or else it may be just a statistical blip, and/or unexploitable.
Looking only at rules of the form:
if basket contains X and Y, then it also contains Z
… our realistic numbers tell us that there may be around 500,000,000 distinct possible rules. For each of these we need to work out its accuracy and coverage, by trawling through a database of around 20,000,000 basket records. … c operations …
Yes, it’s easy to use ‘type 2’ DM, say, to work out the confidence and coverage of a given rule. But type 5 DM is all about searching through, somehow, 500,000,000 (or usually immensely more) rules to sniff out what may be the interesting ones.
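
The figure of around 500,000,000 rules can be reproduced with a rough count. This sketch assumes that X and Y form an unordered pair of distinct products and that Z is any third product; the slides do not spell out the counting, so treat this as one plausible reading. The "naive checks" figure further assumes one check per rule per basket record, which is only a crude way to gauge the scale of a brute-force search.

```python
import math

N_PRODUCTS = 1_000       # pool of products, from the Numbers slide
N_BASKETS = 20_000_000   # basket records, from the Numbers slide


def count_xy_then_z_rules(n_products):
    """Number of rules 'if X and Y then Z': pick an unordered pair
    {X, Y}, then pick Z from the remaining products."""
    return math.comb(n_products, 2) * (n_products - 2)


n_rules = count_xy_then_z_rules(N_PRODUCTS)

# Assumption: brute force looks at every basket once for every rule.
naive_checks = n_rules * N_BASKETS

print(f"candidate rules: {n_rules:,}")
print(f"naive rule-vs-basket checks: {naive_checks:,}")
```

Under these assumptions the count is 498,501,000 rules, which matches the slide's "around 500,000,000", and the brute-force product is on the order of 10^16 checks.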

Here are some interesting ones in our mini basket DB:
If a basket contains glue, then it also contains either beer or eggs
confidence: 100%; coverage: 25%
If a basket contains apples and dates, then it also contains honey
confidence: 100%; coverage: 20%

What this lecture was about
The many different meanings of data mining.
Warming up for the next lecture, via gentle discussion of transaction databases, rules, confidence, coverage, and what it takes for a rule to be interesting.

Next
A classic fast algorithm for finding useful rules in large databases.