COMP207: Data Mining
Introduction to Data Mining
Dr. M. Sulaiman Khan (mskhan@liv.ac.uk)
Dept. of Computer Science, University of Liverpool, 2010

Today's Topics
- What is Data Mining?
  - Definitions
  - KDD: Knowledge Discovery in Databases
  - KDD Process
  - Differences with Statistics
  - Views on the Process
  - Basic Functions
- Why would you do this?
  - Motivations
  - Applications
- Summary

What is Data Mining?
Some definitions:
- "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" (Piatetsky-Shapiro)
- "...the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web,... or data streams." (Han, pg xxi)
- "...the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful..." (Witten, pg 5)
- "...finding hidden information in a database." (Dunham, pg 3)
- "...the process of employing one or more computer learning techniques to automatically analyse and extract knowledge from data contained within a database." (Roiger, pg 4)

KDD: Knowledge Discovery in Databases
Many texts treat KDD and Data Mining as the same process, but it is also possible to think of Data Mining as the discovery part of KDD.
Dunham:
- KDD is the process of finding useful information and patterns in data.
- Data Mining is the use of algorithms to extract information and patterns derived by the KDD process.
For this course, we will discuss the entire process (KDD) but focus mostly on the algorithms used for discovery.

KDD: Knowledge Discovery in Databases
- KDD (Knowledge Discovery in Databases) is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro & Smyth, CACM 1996).
- Or, KDD is the nontrivial extraction of implicit, previously unknown and potentially useful information.
- Data mining is just one part of the KDD process.
- Data mining applies algorithms to large data sets to produce models or patterns that are interesting to the user.

The Data Mining (KDD) Process

KDD Process Components
- Operational data: the day-to-day data used to run the business.
- Clean, collect and summarise: most data is not suitable for data mining as it stands (errors or noise, missing data, invalid formats).
- Data warehouse: a very large store of clean data kept for analysis.
- Data preparation: validating the data for mining (e.g. removing noise, fixing formats, running validation routines).
- Training data: the sample cases given to the mining algorithms to learn from.
- Data mining: applying mining algorithms to the data to produce interesting patterns.

Differences with Statistics
Data Mining:
- Algorithms scale to large data sets.
- The data is used secondarily: mining is not the purpose it was collected for.
- DM tools build in background knowledge for the end user.
- Strategy: exploratory, cyclic.
Statistics:
- Many algorithms have quadratic run time.
- The data is primary: it is collected for the statistical analysis.
- A statistical background is often required.
- Strategy: confirmatory, verifying, few loops.

Piatetsky-Shapiro View (as tweaked by Dunham)
Initial Data -> [Selection] -> Target Data -> [Preprocessing] -> Preprocessed Data -> [Transformation] -> Transformed Data -> [Data Mining] -> Data Model -> [Interpretation] -> Knowledge
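
The stages in this view can be sketched as a chain of small functions, each turning one data state into the next. This is a toy illustration only: the records, the field names, and the trivial "mining" step (an average per group) are assumptions made for the example and do not come from the lecture.

```python
# Toy walk-through of the five stages; all data and names are hypothetical.
from collections import defaultdict

initial_data = [
    {"customer": "a", "region": "north", "spend": "120.5"},
    {"customer": "b", "region": "north", "spend": ""},  # missing value
    {"customer": "c", "region": "south", "spend": "80"},
    {"customer": "d", "region": "south", "spend": "100"},
    {"customer": "e", "region": "north", "spend": "79.5"},
]

def selection(records):
    """Initial data -> target data: keep only the fields of interest."""
    return [{"region": r["region"], "spend": r["spend"]} for r in records]

def preprocessing(records):
    """Target data -> preprocessed data: drop records with missing values."""
    return [r for r in records if r["spend"] != ""]

def transformation(records):
    """Preprocessed data -> transformed data: convert text fields to numbers."""
    return [(r["region"], float(r["spend"])) for r in records]

def data_mining(pairs):
    """Transformed data -> model: here, simply the mean spend per region."""
    groups = defaultdict(list)
    for region, spend in pairs:
        groups[region].append(spend)
    return {region: sum(v) / len(v) for region, v in groups.items()}

def interpretation(model):
    """Model -> knowledge: turn the model into a statement a person can act on."""
    best = max(model, key=model.get)
    return f"Highest average spend is in the {best} region ({model[best]:.1f})"

def run_pipeline(records):
    """Apply the five stages in order."""
    return interpretation(
        data_mining(transformation(preprocessing(selection(records))))
    )

print(run_pipeline(initial_data))
```

In practice the process is cyclic rather than a single pass: a poor model usually sends the analyst back to an earlier stage.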

CRISP-DM View

Data Mining Functions
All data mining functions can be thought of as attempting to find a model to fit the data.
- Each function needs criteria for preferring one model over another.
- Each function needs a technique for comparing the data.
Two types of model:
- Predictive models predict unknown values based on known data.
- Descriptive models identify patterns in data.
Each type has several sub-categories, each of which has many algorithms. We won't have time to look at all of them in detail.

Data Mining Functions
Predictive (supervised learning):
- Classification: maps data into predefined classes
- Regression: maps data into a function
- Prediction: predicts future data states
- Time series analysis: analyses data over time
Descriptive (unsupervised learning):
- Clustering: finds groups of similar items
- Association rules: finds relationships between items
- Characterisation: derives representative information
- Sequence discovery: finds sequential patterns

Classification
The aim of classification is to create a model that can predict the 'type' or category of a data instance that doesn't have one.
Two phases:
1. Training: given labelled data instances, learn a model of how to predict their class labels.
2. Prediction: given an unlabelled, unseen instance, use the model to predict its class label.
Some algorithms predict only a binary split (yes/no), some can predict 1 of N classes, and some give probabilities for each of N classes.
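
The two phases can be sketched with one of the simplest possible classifiers. This is a minimal illustration, assuming a nearest-centroid rule (the lecture does not name a specific algorithm here); the function names `train` and `predict` and the loan-style data are hypothetical.

```python
import math
from collections import defaultdict

def train(instances, labels):
    """Phase 1 (training): learn one centroid per class from labelled data."""
    sums = {}
    counts = defaultdict(int)
    for x, label in zip(instances, labels):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    # The model is just the mean point of each class.
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def predict(model, x):
    """Phase 2 (prediction): assign the class whose centroid is closest."""
    return min(model, key=lambda label: math.dist(model[label], x))

# Hypothetical labelled data: (income, existing debt), both in thousands.
X = [(50, 5), (60, 10), (20, 15), (25, 20)]
y = ["good", "good", "bad", "bad"]

model = train(X, y)
print(predict(model, (55, 8)))   # an unseen applicant
print(predict(model, (22, 18)))  # another unseen applicant
```

This classifier predicts 1 of N classes; it gives no probabilities, which is one of the ways real algorithms differ from each other.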

Clustering
The aim of clustering is similar to classification, but without predefined classes.
Clustering attempts to find clusters of data instances which are more similar to each other than to instances outside of the cluster.
Unsupervised learning: learning by observation, rather than by example.
Some algorithms must be told how many clusters to find; others try to find an 'appropriate' number of clusters.
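
As an illustration of an algorithm that must be told how many clusters to find, here is a minimal sketch of k-means (Lloyd's algorithm). The choice of k-means, the use of the first k points as starting centres, and the sample points are all assumptions made for the example; real implementations choose starting centres more carefully.

```python
import math

def k_means(points, k, iterations=10):
    """Group points into k clusters by alternating assignment and update steps."""
    # Simplification: start from the first k points instead of a random choice.
    centres = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centre if a cluster ends up empty
                centres[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centres

# Hypothetical 2-D instances forming two visible groups.
points = [(1, 1), (1.5, 2), (8, 8), (9, 9.5), (1, 0.5), (8.5, 9)]
assignment, centres = k_means(points, k=2)
print(assignment)
print(centres)
```

No labels are supplied anywhere: the groups emerge only from the distances between instances, which is what "learning by observation" means here.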

Association Rule Mining
The aim of association rule mining is to find patterns that occur in the data set frequently enough to be interesting.
It therefore looks for associations or correlations between data attributes within instances, rather than between instances.
These correlations are then expressed as rules: if X and Y appear in an instance, then Z also appears.
Most algorithms are extensions of a single base algorithm known as 'Apriori', although a few others also exist.
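
A rule of the form "if X and Y appear, then Z also appears" is usually scored with two standard measures, support and confidence, which the slide does not name. The sketch below computes them on made-up shopping baskets and enumerates frequent itemsets by brute force; it is not Apriori itself, which prunes this search rather than testing every combination.

```python
from itertools import combinations

# Hypothetical transactions (one set of items per shopping basket).
baskets = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"bread", "milk", "butter", "beer"},
    {"milk", "beer"},
    {"bread", "milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, the fraction that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force search for itemsets that are 'frequent enough to be interesting'."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result

# Rule: if bread and milk appear in a basket, then butter also appears.
print(support({"bread", "milk", "butter"}, baskets))
print(confidence({"bread", "milk"}, {"butter"}, baskets))
print(frequent_itemsets(baskets, min_support=0.6))
```

The minimum support threshold is what makes "frequently enough to be interesting" concrete; choosing it is a judgment call for the analyst.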

Why?
That all sounds... complicated. Why should I learn about data mining?
- What's wrong with just a relational database? Why would I want to go through these extra [complicated] steps?
- Isn't it expensive? It sounds like it takes a lot of skill, programming, computational time and storage space. Where's the benefit?
Data mining isn't just a cute academic exercise; it has very profitable real-world uses. Practically all large companies and many governments perform data mining as part of their planning and analysis.

Why Data Mining? Some general reasons
- We are data rich but knowledge poor.
- Computing is affordable: storage, CPU, networking.
- Data is too large to analyse (Very Large Databases, VLDB):
  - Dimensionality (size)
  - Distributed (location spread)
  - Heterogeneous (different types of data)
- Traditional techniques are infeasible: statistics, databases.
- Competitive pressure in business enterprises:
  - Customer profiling (need to know who is a good customer)
  - Business to business (B2B: being "old" is not profitable)

Data is Everywhere!
- Relational databases: a commodity of every enterprise
- Huge data warehouses are under construction
- POS (point of sale): transactional DBs in terabytes
- Object, relational, distributed, heterogeneous and legacy databases
- Spatial databases (GIS), remote sensing databases (EOS), and scientific/engineering databases (genetic data etc.)
- Time-series data (e.g., stock trading) and temporal data
- Text (documents, emails) and multimedia databases
- WWW: a huge, hyper-linked, dynamic, global information system (XML, Web content and Web usage data)
- Crime data, terrorist data: more recent applications

The Data Explosion
- The rate of data creation is accelerating each year.
- In 2003, UC Berkeley estimated that the previous year generated 5 exabytes of data, of which 92% was stored on electronically accessible media.
- Mega < Giga < Tera < Peta < Exa...
- All the data in all the books in the US Library of Congress is ~136 terabytes, so 2002 produced the equivalent of about 37,000 new Libraries of Congress.
- VLBI telescopes produce 16 gigabytes of data every second.
- Each engine of each plane of each company produces ~1 gigabyte of data on every transatlantic-length journey.
- Google searches 18 billion+ accessible web pages.
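
The Library of Congress comparison is simple arithmetic on the prefixes, and can be checked in a few lines. The sketch assumes decimal (SI) prefixes, so 1 terabyte is 10^12 bytes; the per-day VLBI figure is derived here from the quoted 16 gigabytes per second and is not stated in the lecture.

```python
# Decimal (SI) prefixes, following the ordering Mega < Giga < Tera < Peta < Exa.
PREFIX = {
    "mega": 10**6,
    "giga": 10**9,
    "tera": 10**12,
    "peta": 10**15,
    "exa": 10**18,
}

data_2002_bytes = 5 * PREFIX["exa"]                # 5 exabytes generated in 2002
library_of_congress_bytes = 136 * PREFIX["tera"]   # ~136 terabytes of book data

# How many Libraries of Congress fit into one year's new data?
libraries = data_2002_bytes / library_of_congress_bytes
print(round(libraries))

# Derived figure: one VLBI telescope's output per day, in petabytes.
seconds_per_day = 24 * 60 * 60
vlbi_per_day_pb = 16 * PREFIX["giga"] * seconds_per_day / PREFIX["peta"]
print(vlbi_per_day_pb)
```

The ratio comes out just under 37,000, which matches the rounded figure on the slide.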

Data Explosion Implications
- As the amount of data increases, the proportion of information decreases.
- As more and more data is generated automatically, we need to find automatic solutions to turn those stored raw results into information.
- Companies need to turn stored data into profit... otherwise why are they storing it?
Let's look at some real-world examples.

Classification
- The data generated by airplane engines can be used to determine when an engine needs to be serviced. By discovering the patterns that are indicative of problems, companies can service working engines less often (increasing profit) and discover faults before they materialise (increasing safety).
- Loan companies can "give you results in minutes" by classifying you as a good or a bad credit risk, based on your personal information and a large supply of previous, similar customers.
- Cell phone companies can classify customers into those likely to leave, who need enticement, and those likely to stay regardless.

Clustering
- Discover previously unknown groups of customers or items.
- By finding clusters of customers, companies can determine how best to handle each particular cluster. For example, this could be used for targeted advertising, special offers, transferring information gathered by association rule mining to other members of the cluster, and so forth.
- The concept of 'similarity' is often used to determine other items that you might be interested in, e.g. 'More Like This' links.

Association Rule Mining
By finding association rules from shopping baskets, supermarkets can use this information for many things, including:
- Product placement in the store
- What to put on sale
- What to create as 'joint special offers'
- What to offer the customer in terms of coupons
- What to advertise together
It shouldn't be surprising that your Tesco coupons are for things that you sometimes buy, rather than things you always or never buy.
Wal-Mart in the US records every transaction at every store: petabytes of information to sift through. (Teradata)

Data/Information/Knowledge/Wisdom
Note well that data mining applications have no wisdom. They cannot apply the knowledge that they discover appropriately.
For example, a data mining application may tell you that there is a correlation between buying music magazines and beer, but it doesn't tell you how to use that knowledge. Should you put the two close together to reinforce the tendency, or should you put them far apart, as people will buy them anyway and thus stay in the store longer?
Data mining can help managers plan strategies for a company; it does not give them the strategies.

Summary
- What is data mining? KDD (knowledge discovery in databases): the nontrivial extraction of implicit, previously unknown and potentially useful information.
- Why do we need data mining?
  - Very large data (the data explosion)
  - Dimensionality of data
  - Heterogeneity of data
  - Technology rich
  - Traditional techniques infeasible
