Section 5 Data Mining.

Section Content
5.1 Introduction
5.2 Knowledge Discovery
5.3 Association Rules
5.4 Sequential Patterns
5.5 Classification and Regression
5.6 Other Forms of Data Mining
5.7 Applications of Data Mining

CA306 Data Mining

5.1 Data Mining Introduction
- Data mining is the discovery of new information, in terms of patterns or rules, from huge amounts of data.
- Data mining tools should identify these patterns, rules and trends with minimal user input.
- Data mining is related to:
  - statistics: exploratory data analysis
  - artificial intelligence: knowledge discovery and machine learning
- Techniques from machine learning, statistics, neural networks and genetic algorithms are used.
- Because of the vast amount of data, the efficiency/scalability of data mining algorithms is a key issue.

Data Mining and Data Warehousing
- The goal of data warehousing is to support decision making with data.
- In conjunction with a data warehouse, data mining can help with certain types of decisions.
- Data mining helps to extract new patterns/rules that cannot be found by merely querying or processing data.
- Aggregated or summarised collections of data in warehouses improve the efficiency of data mining in these cases.
- The potential use of data mining needs to be considered early in the design of a data warehouse.


5.2 Knowledge Discovery

Data mining is part of the knowledge discovery process:
- data selection
- data cleansing
- enrichment
- data transformation / encoding
- data mining
- reporting and display

Example:
- Database: transaction database for a goods retailer
- Client data: name, zip code, phone, date of purchase, item code, price, quantity, total amount

Knowledge Discovery - Example
New knowledge can be discovered from the client data:
- data selection:
  - data about specific items or categories of items
  - items from stores in specific regions
- data cleansing:
  - correct incorrect zip codes
  - eliminate records with incorrect phone numbers
- enrichment: add additional information
  - age, income, credit rating of client
- data transformation: reduce the amount of data
  - group items into product categories
  - group zip codes into regions

Data Mining - Knowledge Discovery
Data mining might discover:
- co-occurrences - items that are typically bought together
- association rules - when a customer buys video equipment, he/she also buys another electronic gadget
- sequential patterns - when a customer buys a camera, then within 3 months he/she buys photographic supplies
- classification trees - customers can be classified by frequency of visits, types of finance used, etc., combined with statistics about the classes

This information can then be used, for example, to:
- optimise store locations
- run promotions
- plan seasonal marketing strategies

Talk about real applications such as supermarkets, credit cards, etc.

Goals of Data Mining: Prediction, Identification

Prediction
- show how certain attributes within the data will behave in the future
- example: predict what customers will buy under certain discounts
- example: predict sales volume for some period

Identification
- data patterns can be used to identify the existence of an item, an event, or an activity
- example: detecting intruders by the commands they execute

Typical Q: What are the four goals of data mining?

Goals of Data Mining: Classification, Optimisation

Classification
- partition data such that different classes or categories can be identified
- example: customers can be categorised into regular and infrequent shoppers, into discount-seeking customers, etc.
- categorisation - e.g. into food categories - can reduce the complexity of data mining

Optimisation
- optimise the use of limited resources (time, space, money, etc.)
- example: what are the best products to spend our money on over the next three months?

Types of Knowledge Discovered
Co-occurrences
- collection of items/actions/events that occur together
- example: items that are bought together by a consumer in a shop

Association rules
- correlate the presence of a set of items with a range of values for another set of variables
- example: when someone buys bread, he/she is likely to buy cheese

Classification hierarchies
- create a hierarchy of classes from an existing set of events or transactions
- example: customers might be divided into a creditworthiness hierarchy based on their previous credit transactions

Exam Q: What are the types of knowledge discovered? Provide examples.

Types of Knowledge Discovered
Sequential patterns
- search for a sequence of events or actions
- example: a patient who underwent cardiac surgery and later developed high blood urea is likely to suffer from kidney problems

Patterns within time series
- detection of similarities within positions of the time series
- example: a pattern in a time series of stock market prices may be used to predict employment rates

Categorisation and segmentation
- partition a set of events or items into segments/categories/classes
- example: treatment data on a disease can be partitioned into groups based on the side effects that are caused

Counting Co-occurrences
The problem is to count co-occurring itemsets - motivated by market basket analysis.

A database of consumer transactions forms the basis:
- transaction: a single visit to a store, an order at a virtual store (Web site), or a single order through a mail-order catalog
- a transaction consists of a transaction ID, customer ID, date, item and quantity

The goal is to identify items that are typically purchased together. This can be used to improve the layout of shops or catalogs.

Frequent Itemsets (1)

Consider the following transaction table:

Transaction  Customer  Date  Items bought
                       /09   milk, bread, juice
                       /09   milk, juice
                       /09   milk, eggs
                       /09   bread, coffee, biscuits

Items bought in one visit are already grouped together into itemsets.

Support of an itemset: the fraction of transactions that contain all items in the itemset.

Examples:
- {milk, juice} has a support of 50%
- {bread, coffee} has a support of 25%
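The support figures for the transaction table above can be checked with a few lines of Python. This is only a sketch: representing each transaction as a Python set, and the names `transactions` and `support`, are assumptions made for illustration.

```python
# The four item lists from the transaction table, one set per transaction.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


print(support({"milk", "juice"}, transactions))    # 0.5
print(support({"bread", "coffee"}, transactions))  # 0.25
```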

Frequent Itemsets (2)

Large itemsets are itemsets that have a certain minimum support, i.e. itemsets that occur frequently.

Example: for a minimum support of 40%, the large itemsets are {milk, juice}, {milk}, {juice}, {bread}.

Proposition: every subset of a large itemset is also a large itemset.

Algorithm: large itemsets can be computed incrementally
- start with itemsets of cardinality 1 that have the required support

5.3 Association Rules

A database can be regarded as a collection of transactions. Each transaction involves a set of items.

Example: the items in a basket that a shopper uses in a supermarket

Transaction  Time  Items bought
101          6:35  milk, bread, juice
792          7:38  milk, juice
             :05   milk, eggs
             :40   bread, coffee, biscuits

Association Rules

An association rule is of the form X => Y, where X and Y are two disjoint sets of items.

Example: for sets of goods as itemsets X and Y, the expression X => Y means that if a customer buys X, he/she is also likely to buy Y.
- if the customer buys milk, he/she is also likely to buy juice

The support for a rule X => Y is the percentage of transactions that hold all of the items in the union X ∪ Y.

Examples:
- Milk => Juice has 50% support
- Bread => Juice has 25% support

Association Rules

The confidence of a rule X => Y is the percentage (fraction) of all transactions including X that also include Y.

Example: the rule Milk => Juice has confidence 66.7%
- that means that 2/3 of all transactions with milk also include juice

Note that support and confidence might be different.

The goal is to discover rules with a certain minimum support and confidence.

These rules can be used for prediction: for a rule Pen => Ink, offer discounts on pens and you might increase ink sales.

Association Rules: How to compute these rules?

1. Generate large itemsets (itemsets with a certain minimum support).
2. For each large itemset X, generate all rules with a certain minimum confidence (mconf):
   - for X and Y ⊂ X, let Z = X - Y (divide X into Y and Z)
   - if support(X) / support(Y) > mconf then Y => Z is a valid rule
   - the confidence of rule Y => Z is defined as support(X) / support(Y)

Example:
- for X = {milk, juice} and Y = {milk} ⊂ {milk, juice}, let Z = {juice}
- X, Y, Z have support 50%, 75% and 50%, respectively
- for mconf = 40%, {milk} => {juice} is a valid rule with confidence 66.7% (50/75)

Note that the X, Y, Z calculations are based on the support for itemsets (see slide 5.14).
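The rule-generation step can be sketched as follows, assuming the four example transactions are held as Python sets. The function names `support` and `rules_from_itemset` are illustrative, and the sketch assumes the itemset X occurs in at least one transaction.

```python
from itertools import combinations

# Example transactions; the set-based representation is an assumption.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)


def rules_from_itemset(x, transactions, mconf):
    """Split large itemset X into Y and Z = X - Y; keep Y => Z if confidence > mconf."""
    x = frozenset(x)
    rules = []
    for size in range(1, len(x)):                  # Y is a non-empty proper subset of X
        for y in combinations(sorted(x), size):
            y = frozenset(y)
            z = x - y
            supp_y = support(y, transactions)
            if supp_y == 0:                        # cannot happen when X occurs at all
                continue
            confidence = support(x, transactions) / supp_y
            if confidence > mconf:
                rules.append((y, z, confidence))
    return rules


for y, z, confidence in rules_from_itemset({"milk", "juice"}, transactions, 0.4):
    print(sorted(y), "=>", sorted(z), round(confidence, 3))
```

For X = {milk, juice} this yields {milk} => {juice} with confidence 50/75, as in the example, and also the reverse rule {juice} => {milk}, since every transaction with juice also contains milk.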

Generating Association Rules
- In principle, generating rules based on large itemsets and their support is straightforward.
- Computing all large itemsets and their support creates an efficiency problem if the number of items is very high.
- If m is the number of items, then 2^m is the number of different itemsets.
- Example: a typical supermarket might have several thousand items. Computing the support of all itemsets might take a long time.
- Reducing the combinatorial search space is therefore important - the following properties can be used:
  - subsets of large itemsets are large
  - extensions of small itemsets are small

If m = 3, then there are 2^3 = 8 itemsets: as a simple binary example, {0,0,0} to {1,1,1}.
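The 2^m count can be made concrete by enumerating every itemset with a binary mask, where bit i says whether item i is included. This is a small sketch; the three item names and the function `all_itemsets` are illustrative assumptions.

```python
items = ["milk", "bread", "juice"]  # m = 3, chosen for illustration


def all_itemsets(items):
    """Enumerate every subset of `items` by counting from 0 to 2^m - 1 in binary."""
    m = len(items)
    result = []
    for mask in range(2 ** m):
        # Bit i of the mask decides whether items[i] belongs to this itemset.
        result.append({items[i] for i in range(m) if (mask >> i) & 1})
    return result


itemsets = all_itemsets(items)
print(len(itemsets))  # 8
```

The count doubles with every additional item, which is why enumerating all itemsets is impractical for thousands of items.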

Association Rules - Algorithms
Outline of an algorithm that finds large itemsets:
- Step 1: test the support for itemsets of length 1 - called 1-itemsets - by scanning the database; discard those that do not meet the minimum requirement.
- Step 2: extend large 1-itemsets into 2-itemsets by appending one item each time (this generates all itemsets of length two); test the support and eliminate all 2-itemsets that do not meet the minimum support.
- Step 3: repeat the above steps: extend (k-1)-itemsets into k-itemsets.
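The three steps can be sketched as a level-wise loop. This is a minimal illustration of the outline, not an optimised implementation: it rescans the whole transaction list for every candidate, and the name `large_itemsets` and the set-based data layout are assumptions.

```python
# Example transactions; the set-based representation is an assumption.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]


def large_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting the minimum support."""
    n = len(transactions)

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: keep only the 1-itemsets that meet the minimum support.
    all_items = sorted({item for t in transactions for item in t})
    level = {frozenset([i]): supp(frozenset([i])) for i in all_items}
    level = {s: v for s, v in level.items() if v >= min_support}
    large_items = [next(iter(s)) for s in level]

    result = {}
    while level:
        result.update(level)
        # Steps 2 and 3: extend each large (k-1)-itemset by one large item,
        # then discard the candidates below the minimum support.
        candidates = {s | {i} for s in level for i in large_items if i not in s}
        level = {c: supp(c) for c in candidates}
        level = {s: v for s, v in level.items() if v >= min_support}
    return result


for itemset, value in large_itemsets(transactions, 0.4).items():
    print(sorted(itemset), value)
```

With a minimum support of 40% this returns {milk}, {juice}, {bread} and {milk, juice}. Only large itemsets are extended, which relies on the property that extensions of small itemsets are small.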

Association Rules among Hierarchies
- Items might be divided among disjoint hierarchies based on some classification, e.g. Beverage can be divided into Juice and Milk.
- Associations might occur among the hierarchies of items.
  - Example: healthy frozen yoghurt => bottled water
- Particularly interesting are associations across hierarchies.
  - This kind of information can be used to arrange different kinds of items in a supermarket.

Negative Associations
- Negative associations are more difficult to detect than positive associations.
  - Example: 60% of customers who buy crisps do not buy bottled water.
- There are usually more negative associations than positive ones.
  - The majority of itemset combinations do not occur in databases.
  - Finding interesting negative associations can be difficult.

Association Rules - Additional Considerations
Sampling:
- For very large databases, sampling improves efficiency.
- Truly representative samples can help to find most of the rules.
- The danger is that false positives might be discovered (large itemsets that are not truly large) and that true positives might be missed.

Other problems:
- Cardinality of itemsets and volume of transactions can be very high.
- Variability of transactions (geographical, seasonal) makes sampling difficult.
- Multiple classifications along different dimensions.

5.4 Sequential Patterns

Sequential patterns are based on sequences of itemsets. Assume transactions to be ordered by time.

Example: transactions in a supermarket
- {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits}
- may be based on three visits of a customer

A subsequence of a sequence is obtained by deleting one or more itemsets.
- let {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits} be the original sequence
- {milk, bread, juice} ; {bread, eggs} is a subsequence
- {milk, bread, juice} ; {milk, coffee, biscuits} is a subsequence

Support for Sequences

A sequence {a1, ..., am} is contained in another sequence S if S has a subsequence {b1, ..., bm} such that ai ⊆ bi for 1 <= i <= m.

Example: {milk, bread} ; {coffee, biscuits} is contained in {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits}

The support of a sequence S is the percentage of a set of given sequences that contain S as a subsequence.
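Containment and support for sequences can be sketched as below. Sequences are modelled as lists of sets, which is an assumption, and the second customer sequence is invented purely to make the support figure non-trivial. The check matches each pattern itemset against the earliest possible itemset of the sequence, which is sufficient for this kind of in-order containment.

```python
def is_contained(pattern, sequence):
    """True if each itemset of `pattern` is a subset of a distinct itemset
    of `sequence`, with the order of itemsets preserved."""
    pos = 0
    for itemset in pattern:
        # Advance to the earliest remaining itemset that covers this one.
        while pos < len(sequence) and not set(itemset) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def sequence_support(pattern, sequences):
    """Fraction of the given sequences that contain `pattern`."""
    return sum(1 for s in sequences if is_contained(pattern, s)) / len(sequences)


customer_1 = [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"milk", "coffee", "biscuits"}]
customer_2 = [{"milk"}, {"eggs"}]  # hypothetical second customer

pattern = [{"milk", "bread"}, {"coffee", "biscuits"}]
print(is_contained(pattern, customer_1))                    # True
print(sequence_support(pattern, [customer_1, customer_2]))  # 0.5
```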

Discovery of Patterns in Time Series
- Time series are sequences of events. An event might be a fixed type of transaction.
  - Example: closing price of a stock or fund each day.
- Analysis of time series:
  - find a period of time in which the stock did not fluctuate by more than 1%
  - find the period (week/month/quarter) with the greatest loss
  - identify stocks with similar behaviour
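One of these analyses, finding the period with the greatest loss, can be sketched with a sliding window over daily closing prices. The prices are invented, and measuring loss as the difference between the last and first price of a window is an assumption; other definitions are possible.

```python
def greatest_loss_window(prices, length):
    """Return (start_index, change) for the window of `length` consecutive
    days whose price change (last minus first) is the most negative."""
    best_start, best_change = None, 0.0
    for start in range(len(prices) - length + 1):
        change = prices[start + length - 1] - prices[start]
        if best_start is None or change < best_change:
            best_start, best_change = start, change
    return best_start, best_change


closing_prices = [100, 102, 101, 97, 98, 95, 96, 99]  # hypothetical data
print(greatest_loss_window(closing_prices, 3))  # (1, -5)
```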

5.5 Classification and Regression
- Classification Rules
- Regression
- Tree-structured Rules

Discovery of Classification Rules
Classification means defining/identifying a function that maps an object into one of many possible classes.

Example: a bank wants to classify loan applicants into “loanworthy” and “not loanworthy”
- a classification rule could define the classification:
  - not loanworthy: current monthly debt obligation exceeds 25% of monthly net income
  - loanworthy: otherwise
- loanworthiness is a dependent, categorical attribute

In general there is one rule (set) per class:
- (var1 in range1) and ... and (varn in rangen) => object O in class C1
- var1, ..., varn are the predictor attributes

Create a class C; define one rule for that class; determine what the extent of that class is, i.e. what items belong to that class. Categorical means we are classifying or categorising.

Support and Confidence
Again we can define support and confidence for these rules.
- The support for a classification condition C is the percentage of tuples that satisfy C.
- The support for a rule C1 => C2 is the support for the condition C1 ∧ C2. (C1 ∧ C2 means logical AND: the set of objects that satisfy both C1 and C2.)
- Consider those tuples that satisfy condition C1. The confidence for a rule C1 => C2 is the percentage of such tuples that also satisfy condition C2.
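Support and confidence for a rule C1 => C2 can be sketched over a list of tuples, with both conditions expressed as predicates. The applicant records and the field names `debt_ratio` and `loanworthy` are invented for illustration, loosely following the loan example.

```python
# Hypothetical applicant tuples; values and field names are assumptions.
applicants = [
    {"debt_ratio": 0.30, "loanworthy": False},
    {"debt_ratio": 0.10, "loanworthy": True},
    {"debt_ratio": 0.40, "loanworthy": False},
    {"debt_ratio": 0.20, "loanworthy": False},
    {"debt_ratio": 0.15, "loanworthy": True},
]


def c1(t):
    """Condition C1: monthly debt obligation exceeds 25% of net income."""
    return t["debt_ratio"] > 0.25


def c2(t):
    """Condition C2: the applicant is classed as not loanworthy."""
    return not t["loanworthy"]


def rule_support(c1, c2, tuples):
    """Fraction of all tuples that satisfy both C1 and C2."""
    return sum(1 for t in tuples if c1(t) and c2(t)) / len(tuples)


def rule_confidence(c1, c2, tuples):
    """Among the tuples satisfying C1, the fraction that also satisfy C2."""
    matching = [t for t in tuples if c1(t)]
    if not matching:
        return 0.0
    return sum(1 for t in matching if c2(t)) / len(matching)


print(rule_support(c1, c2, applicants))     # 0.4
print(rule_confidence(c1, c2, applicants))  # 1.0
```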

Regression

Regression is similar to classification, except that the dependent variable is numerical (and not categorical).

Rules (such as classification rules) can be regarded as functions. A regression rule is a function that maps variables into a target class variable.

Example: LabTest(patientID, test1, ..., testn)
- the values in that relation result from a series of lab tests
- the target variable P is the probability of survival - a numerical variable
- the regression rule: (test1 in range1) and ... and (testn in rangen) => P = x
- the regression function is P = f(test1, ..., testn)

test1 to testn must together provide an overall value P.

Regression (2)

If P appears as a function y = f(x1, ..., xn) and f is linear in the domain variables, then the process of deriving f from a given set of tuples <x1, ..., xn, y> is called linear regression.

Linear regression is a common statistical technique.
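For a single predictor (n = 1), deriving a linear f from tuples <x, y> can be sketched with the ordinary least-squares closed form. The data points are invented and the restriction to one variable is a simplifying assumption; the general case with n predictors needs a matrix formulation.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical tuples <x, y> that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```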

Tree-Structured Rules
Specific classification and regression rules shall now be examined.
- These are rules that can be represented as trees - called classification trees or decision trees.
- These trees are typically the output of the data mining activity.
- Each path from the root to a leaf node represents one classification rule.

Example: insurance risk determination for motor insurance

Age
  <= 25: Car Type
           sports: YES
           family: NO
  > 25:  NO

If the driver is older than 25, there is no risk. If the driver is 25 or younger and drives a sports car, there is a risk.

Decision Trees

A decision tree is a graphical representation of a collection of classification rules.
- Each node in the tree is labelled with a predictor or splitting attribute.
- Each outgoing edge of an internal node is labelled with a predicate that involves the splitting attribute.
- Each leaf node is labelled with a value of the dependent attribute.

A classification rule can be associated with each leaf node - constructed as the conjunction of the predicates on the path to it:
- Age <= 25 and Car Type = sports for the YES-leaf

Decision trees are constructed in two phases:
- growth phase: create the tree based on specialised rules from an input database (relation)
- pruning phase: reduce the tree size by generalising rules
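The correspondence between leaf nodes and classification rules can be sketched by walking a small tree and joining the edge predicates along each path. The nested-tuple encoding and the function name `rules` are assumptions; the tree is the motor insurance example.

```python
# An internal node is (splitting_attribute, {edge_predicate: subtree});
# a leaf is just the value of the dependent attribute.
tree = ("Age", {
    "<= 25": ("Car Type", {"= sports": "YES", "= family": "NO"}),
    "> 25": "NO",
})


def rules(tree, conditions=()):
    """Return one (condition, class) pair per leaf; the condition is the
    conjunction of the predicates on the path from the root."""
    if isinstance(tree, str):  # leaf node
        return [(" and ".join(conditions), tree)]
    attribute, branches = tree
    found = []
    for predicate, subtree in branches.items():
        found.extend(rules(subtree, conditions + (f"{attribute} {predicate}",)))
    return found


for condition, label in rules(tree):
    print(f"{condition} => {label}")
```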

5.6 Other Types of Data Mining
- Neural Networks
- Genetic Algorithms
- Clustering and Segmentation

Neural Networks

- Techniques from artificial intelligence can be used to generalise regression.
- Neural networks provide an iterative method to carry out this generalised regression.
- Neural networks use a curve-fitting approach to infer a function from a set of samples.
- This process is based on learning: a test sample is the initial input; the system then incrementally infers functions based on more samples.
- Neural networks can be applied to classification problems.
- Modelling time series with neural networks is difficult.

Genetic Algorithms (1)

Genetic algorithms (GA) are a class of randomised search procedures for adaptive and robust search over a wide range of search topologies.
- Principle: Genetic algorithms extend the idea of characterising human DNA by a four-letter alphabet (A, C, T, G).
- Construction: Devise an alphabet that allows the encoding of a solution to the decision problem in terms of strings of that alphabet.
- Usage: Study the cutting and combination of strings (compare natural reproduction and evolution). New generations of individuals (solutions) are generated and assessed - survival of the fittest.

Genetic Algorithms (2)

Generation of solutions - comparison with other techniques:
- GA search uses a set of solutions during each generation rather than a single solution.
- The search in the string-space represents a much larger parallel search in the space of encoded solutions.
- The memory of the search completed is represented solely by the set of solutions available for generation.
- A GA is a randomised algorithm since its search mechanisms use probabilistic operators.
- While progressing from one generation to the next, a GA finds a near-optimal balance between knowledge acquisition and exploitation by manipulating encoded solutions.
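The generation loop can be illustrated on a toy problem. Everything specific here is an assumption chosen for brevity: the two-letter alphabet {0, 1}, the fitness function (number of 1s in the string), truncation selection of the fitter half, one-point crossover, and the parameter values.

```python
import random


def fitness(bits):
    """Toy fitness: the number of 1s in the encoded solution."""
    return sum(bits)


def evolve(length=20, population_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Evolve bit strings and return the fittest one found."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Survival of the fittest: the better half becomes the parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)  # cutting point
            child = a[:cut] + b[cut:]         # combination of two strings
            # Probabilistic operator: flip each position with a small chance.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


best = evolve()
print(fitness(best), best)
```

Because the parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next. The whole population, rather than a single solution, is the only memory of the search.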

Clustering and Segmentation
- Clustering is about identification and classification.
- Clustering tries to identify categories (or clusters) to which a data object can be mapped.
- The categories can be disjoint or might overlap; they might be organised into trees.
- A related problem: multivariate probability density functions.

5.7 Applications of Data Mining
Decision-making contexts:

Marketing:
- analysis of customer behaviour based on buying patterns
- determination of marketing strategies (store locations, advertising campaigns, etc.)
- segmentation of customers, stores, products

Finance:
- analysis of creditworthiness of clients
- performance analysis of finance investments
- evaluation of financing options
- fraud detection

Applications

Manufacturing:
- optimisation of resources (machines, manpower, material)
- optimal design of manufacturing process, shop-floor layout, etc.

Health care:
- analysis of effectiveness of certain treatments
- optimisation of processes in a hospital
- analysing side effects of drugs
- relating patient wellness and doctor qualifications
