
1 Conventional Data Mining Techniques II. A B M Shawkat Ali. PowerPoint permissions: Cengage Learning Australia hereby permits the usage and posting of our copyright controlled PowerPoint slide content for all courses wherein the associated text has been adopted. PowerPoint slides may be placed on course management systems that operate under a controlled environment (access restricted to enrolled students, instructors and content administrators). Cengage Learning Australia does not require a copyright clearance form for the usage of PowerPoint slides as outlined above. Copyright © 2007 Cengage Learning Australia Pty Limited. "Computers are useless. They can only give you answers." – Pablo Picasso

2 My Request A good listener is not only popular everywhere, but after a while he gets to know something - Wilson Mizner

3 Association Rule Mining

4 Objectives On completion of this lecture you should know: Features of association rule mining. Apriori: Most popular association rule mining algorithm. Association rules evaluation. Association rule mining using WEKA. Strengths and weaknesses of association rule mining. Applications of association rule mining.

5 Affinity Analysis Market Basket Analysis: Which products go together in a basket? –Uses: determine marketing strategy, plan promotions, shelf layout. Looks like production rules, but more than one attribute may appear in the consequent. –IF customers purchase milk THEN they purchase bread AND sugar Association rules

6 Transaction data. Table 7.1 Transactions data
Transaction ID | Itemset or basket
01 | {webcam, laptop, printer}
02 | {laptop, printer, scanner}
03 | {desktop, printer, scanner}
04 | {desktop, printer, webcam}

7 Concepts of association rules. Support: the percentage of instances in the database that contain all items listed in a given association rule.

8 Example 5,000 transactions contain milk and bread in a set of 50,000 Support => 5,000 / 50,000 = 10%

9 Concepts of association rules. Confidence: given a rule of the form IF A THEN B, confidence is the conditional probability that B is true when A is known to be true.

10 Example IF customers purchase milk THEN they also purchase bread: –In a set of 50,000, there are 10,000 transactions that contain milk, and 5,000 of these contain also bread. –Confidence => 5,000 / 10,000= 50%
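The two measures above can be computed directly from transaction counts. The following is a minimal Python sketch, not part of the original slides: it uses the milk-and-bread counts from the examples and the baskets of Table 7.1, and the function names `support` and `confidence` are illustrative choices.

```python
# Baskets of Table 7.1, one set of items per transaction.
transactions = [
    {"webcam", "laptop", "printer"},
    {"laptop", "printer", "scanner"},
    {"desktop", "printer", "scanner"},
    {"desktop", "printer", "webcam"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Conditional probability of `consequent` given `antecedent`.

    Assumes the antecedent occurs in at least one basket.
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# Milk-and-bread example, straight from the counts on the slides.
milk_bread_support = 5_000 / 50_000      # 10%
milk_bread_confidence = 5_000 / 10_000   # 50%

print(support({"webcam", "printer"}, transactions))       # 0.5
print(confidence({"webcam"}, {"printer"}, transactions))  # 1.0
```

Every basket that holds a webcam also holds a printer, which is why the confidence of IF webcam THEN printer is 1.0 even though the support of the pair is only 50%.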

11 Parameters of ARM 1. To find all items that appear frequently in transactions. The level of frequency of appearance is determined by a pre-specified minimum support count. Any item or set of items that occurs less frequently than this minimum support level is not included in the analysis.

12 2. To find strong associations among the frequent items. The strength of the association is quantified by the confidence. Any association below a pre-specified level of confidence is not used to generate rules.

13 Relevance of ARM On Thursdays, grocery store consumers often purchase diapers and beer together. Customers who buy a new car are very likely to purchase an extended vehicle warranty. When a new hardware store opens, one of the most commonly sold items is toilet fittings.

14 Functions of ARM Finding the set of items that has significant impact on the business. Collating information from numerous transactions on these items from many disparate sources. Generating rules on significant items from counts in transactions.

15 Single-dimensional association rules. Table 7.2 Boolean form of a transaction data. Columns: Transaction ID | webcam | laptop | printer | scanner | desktop (cont.)
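The cell values of Table 7.2 did not survive in this transcript. As a hedged illustration only, the sketch below derives a Boolean (0/1) form from the baskets of Table 7.1, on the assumption that Table 7.2 encodes the same four transactions; the variable names are illustrative.

```python
# Column order follows the header of Table 7.2.
items = ["webcam", "laptop", "printer", "scanner", "desktop"]

# Baskets of Table 7.1 keyed by transaction ID.
transactions = {
    "01": {"webcam", "laptop", "printer"},
    "02": {"laptop", "printer", "scanner"},
    "03": {"desktop", "printer", "scanner"},
    "04": {"desktop", "printer", "webcam"},
}

# 1 if the item is in the basket, 0 otherwise.
boolean_rows = {
    tid: [int(item in basket) for item in items]
    for tid, basket in transactions.items()
}

for tid, row in boolean_rows.items():
    print(tid, row)
```

Each row has one column per item, so every rule involves a single dimension (the purchase itself), which is what makes these rules single-dimensional.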

16

17 Multidimensional association rules

18 General considerations We are interested in association rules that show a lift in product sales where the lift is the result of the product's association with one or more other products. We are also interested in association rules that show a lower than expected confidence for a particular association.
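One common way to quantify the lift described above is the ratio of a rule's confidence to the support of its consequent. The slides do not give a formula at this point, so the definition used in this Python sketch is the standard one and should be read as an assumption; the baskets are those of Table 7.1.

```python
# Baskets of Table 7.1.
transactions = [
    {"webcam", "laptop", "printer"},
    {"laptop", "printer", "scanner"},
    {"desktop", "printer", "scanner"},
    {"desktop", "printer", "webcam"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def lift(antecedent, consequent):
    """support(A and B) / (support(A) * support(B)).

    Equal to confidence(A -> B) / support(B). A value above 1 means the items
    occur together more often than expected if they were independent; a value
    below 1 means less often than expected.
    """
    joint = support(antecedent | consequent)
    return joint / (support(antecedent) * support(consequent))

print(lift({"laptop"}, {"scanner"}))   # 1.0: no lift either way
print(lift({"webcam"}, {"scanner"}))   # 0.0: never bought together
```

In this tiny data set every pair has a lift of exactly 1 or 0, so it shows the mechanics rather than an interesting rule.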

19 Itemset | Support in %
{webcam} | 50%
{laptop} | 50%
{printer} | 100%
{scanner} | 50%
{desktop} | 50%
{webcam, laptop} | 25%
{webcam, printer} | 50%
{webcam, scanner} | 0%
{webcam, desktop} | 25%
{laptop, printer} | 50%
{laptop, scanner} | 25%
{laptop, desktop} | 0%
{printer, scanner} | 50%
{printer, desktop} | 50%
{scanner, desktop} | 25%
{webcam, laptop, printer} | 25%
{webcam, laptop, scanner} | 0%
{webcam, laptop, desktop} | 0%
{laptop, printer, scanner} | 25%
{laptop, printer, desktop} | 0%
{printer, scanner, desktop} | 25%
{webcam, laptop, printer, scanner, desktop} | 0%

20 Enumeration tree Figure 7.1 Enumeration tree of transaction items of Table 7.1. In the left nodes, branches reduce by 1 for each downward progression – starting with 5 branches and ending with 1 branch, which is typical

21 Association models $^{n}C_{k} = \dfrac{n!}{k!\,(n-k)!}$ = the number of combinations of n things taken k at a time.
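To see how quickly the number of candidate itemsets grows, the count of combinations can be evaluated for the five items of Table 7.1. This is a small Python sketch using the standard library's `math.comb`; the variable names are illustrative.

```python
from math import comb

n_items = 5  # webcam, laptop, printer, scanner, desktop

# Number of distinct itemsets of each size k = 1..5.
itemsets_by_size = {k: comb(n_items, k) for k in range(1, n_items + 1)}
total_itemsets = sum(itemsets_by_size.values())

print(itemsets_by_size)  # {1: 5, 2: 10, 3: 10, 4: 5, 5: 1}
print(total_itemsets)    # 31, i.e. 2**5 - 1 non-empty itemsets
```

With n items there are 2^n - 1 non-empty itemsets, so the count doubles with every item added to the catalogue.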

22 Two other parameters Improvement (IMP) = Share (SH) = where LMV = local measure value and TMV is total measure value.

23 IMP and SH measure Transaction ID Yogurt (A) Cheese (B) Rice (C) Corn (D) T T20305 T T T Table 7.4. Market transaction data (cont.)

24 Yogurt(A)Cheese(B)Rice(C)Corn(D)Itemset LMVSHLMVSHLMVSHLMVSHLMVSH A B C D AB AC AD BC BD CD ABC ABD ACD BCD00.00 ABCD00.00 Table 7.5 Share measurement

25 Taxonomies Low-support products are lumped into bigger categories and high-support products are broken up into subgroups. Examples are: Different kinds of potato chips can be lumped with other munchies into snacks, and ice cream can be broken down into different flavours.

26 Large Datasets The number of combinations that can be generated from transactions in an ordinary supermarket can run into the billions or trillions. The amount of computation required for association rule mining can therefore stretch any computer.

27 APRIORI algorithm 1. All singleton itemsets are candidates in the first pass. Any item that has a support value of less than a specified minimum is eliminated. 2. Selected singleton itemsets are combined to form two-member candidate itemsets. Again, only the candidates above the pre-specified support value are retained. (cont.)

28 3. The next pass creates three-member candidate itemsets and the process is repeated. The process stops only when all large itemsets are accounted for. 4. Association rules for the largest itemsets are created first and then rules for the subsets are created recursively.
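The candidate-generation passes (steps 1 to 3) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used by Weka: the function name `apriori` and the 50% threshold are choices made here, the data are the baskets of Table 7.1, and rule generation (step 4) is left out.

```python
from itertools import combinations

# Baskets of Table 7.1.
transactions = [
    {"webcam", "laptop", "printer"},
    {"laptop", "printer", "scanner"},
    {"desktop", "printer", "scanner"},
    {"desktop", "printer", "webcam"},
]

def apriori(baskets, min_support):
    """Return {itemset: support} for every itemset meeting `min_support`."""
    n = len(baskets)

    def supp(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    # Pass 1: every single item is a candidate.
    items = sorted({item for basket in baskets for item in basket})
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Keep only the candidates that reach the minimum support.
        kept = {c: supp(c) for c in candidates}
        kept = {c: s for c, s in kept.items() if s >= min_support}
        frequent.update(kept)
        # Join surviving k-itemsets into (k+1)-itemsets and prune any
        # candidate that has an infrequent k-item subset.
        k += 1
        candidates = set()
        for a, b in combinations(kept, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in kept for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent

frequent = apriori(transactions, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

At a 50% threshold the five single items and the four pairs containing the printer survive, and no three-item candidate is generated because each one contains an infrequent pair.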

29 Figure 7.2 Graphical demonstration of the working of the Apriori algorithm

30 APRIORI in Weka Figure 7.3 Weka environment with market-basket.arff data file

31 Step 2 Figure 7.4 Spend98 attribute information visualisation.

32 Step 3 Figure 7.5 Target attributes selection through Weka

33 Step 4 Figure 7.6 Discretisation filter selection

34 Step 5 Figure 7.7 Parameter selections for discretisation.

35 Step 6 Figure 7.8 Discretisation activation

36 Discretised data visualisation Figure 7.9 Discretised data visualisation

37 Step 7 Figure 7.10 Apriori algorithm selection from Weka for ARM

38 Step 8 Figure 7.11 Apriori output

39 Associator output 1.Dairy='(-inf ] Deli='(-inf ]' 847 ==> Bakery='(-inf ] 833 conf:(0.98)

40 Strengths and weaknesses Easy Interpretation Easy Start Flexible Data Formats Simplicity Exponential Growth in Computations Lumping Rule Selection

41 Applications of ARM Store Layout Changes Cross/Up selling Disaster Weather forecasting Remote Sensing Gene Expression Profiling

42 Recap What is association rule mining? Apriori: Most popular association rule mining algorithm. Applications of association rule mining.

43 The Clustering Task

44 Objectives On completion of this lecture you should know: Unsupervised clustering technique Measures for clustering performance Clustering algorithms Clustering task demonstration using WEKA Applications, strengths and weaknesses of the algorithms

45 Clustering: Unsupervised learning Clustering is a very common technique that appears in many different settings (not necessarily in a data mining context) –Grouping similar products together to improve the efficiency of a production line –Packing similar items into a basket –Grouping similar customers together –Grouping similar stocks together

46 Sl. No.Subjects Code Marks 1COIT COIS COIS COIT Table 8.1 A simple unsupervised problem A simple clustering example

47 Cluster representation. Figure 8.1 Basic clustering for the data of Table 8.1. The X-axis is the serial number and the Y-axis is the marks.

48 How many clusters can you form? A A A A K K K K Q Q Q Q J J J J Figure 8.2 Simple playing card data

49 Distance measure Similarity is usually captured by a distance measure. The originally proposed measure of distance is the Euclidean distance.

50 Figure 8.3 Euclidean distance D between two points A and B

51 Other distance measures: City-block (Manhattan) distance; Chebychev distance (maximum); Power distance; Minkowski distance (the power distance when p = r).

52 Distance measure for categorical data Percent disagreement
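The formulas for these measures were images and are not present in this transcript. The Python sketch below uses the standard textbook definitions, which should be treated as an assumption about what the slides showed; in particular the `(p, r)` parameterisation of the power distance is assumed, and all function names are illustrative.

```python
def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebychev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def power_distance(a, b, p, r):
    """(sum |x - y|^p)^(1/r); assumed parameterisation."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / r)

def minkowski(a, b, p):
    """Power distance with p = r."""
    return power_distance(a, b, p, p)

def percent_disagreement(a, b):
    """Share of categorical attributes on which two instances differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

a, b = (1, 2), (4, 6)
print(euclidean(a, b))   # 5.0
print(manhattan(a, b))   # 7
print(chebychev(a, b))   # 4
print(percent_disagreement(("red", "small", "round"), ("red", "large", "round")))
```

The Minkowski distance with p = 1 reduces to the city-block distance and with p = 2 to the Euclidean distance.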

53 Types of clustering Hierarchical Clustering –Agglomerative –Divisive

54 Agglomerative clustering 1.Place each instance into a separate partition. 2.Until all instances are part of a single cluster: a. Determine the two most similar clusters. b. Merge the clusters chosen into a single cluster. 3. Choose a clustering formed by one of the step 2 iterations as a final result.
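The three steps above can be sketched in Python as follows. The slide does not say how the similarity of two clusters is measured, so single linkage (the distance between their closest members) is an assumption here, and the five data points are hypothetical rather than those of Example 8.1.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(c1, c2):
    """Distance between the closest pair of members (assumed linkage)."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerate(points):
    """Merge the two closest clusters until one remains.

    Returns the clustering after every merge, so that any level can be
    chosen as the final result (step 3).
    """
    clusters = [[p] for p in points]              # step 1
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:                      # step 2
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Step 2a: the two most similar clusters.
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        # Step 2b: merge them into one cluster.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]  # hypothetical data
history = agglomerate(points)
two_clusters = history[-2]                          # step 3: pick the two-cluster level
print(two_clusters)
```

The list of merges in `history` is the information a dendrogram displays: reading it from first to last entry corresponds to moving up the tree.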

55 Dendrogram

56 Example 8.1 Figure 8.5 Hypothetical data points for agglomerative clustering

57 Example 8.1 cont. C = {{P1}, {P2}, {P3}, {P4}, {P5}, {P6}, {P7}, {P8}} Step 1 Step 2 Step 3

58 Example 8.1 cont. Step 4 Step 5 Step 6 Step 7

59 Agglomerative clustering: An example Figure 8.6 Hierarchical clustering of the data points of Example 8.1

60 Dendrogram of the example Figure 8.7 The dendrogram of the data points of Example 8.1

61 Types of clustering cont. Non-Hierarchical Clustering –Partitioning methods –Density-based methods –Probability-based methods

62 Partitioning methods The K-Means Algorithm: 1. Choose a value for K, the total number of clusters. 2. Randomly choose K points as cluster centers. 3. Assign the remaining instances to their closest cluster center. 4. Calculate a new cluster center for each cluster. 5. Repeat steps 3-4 until the cluster centers do not change.
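The K-means steps can be sketched in Python as below. This is an illustrative version under stated assumptions: the function and parameter names are invented here, the initial centres may be passed in explicitly so that the run is repeatable, every instance (including the initial centres) is assigned in step 3, and the six data points are hypothetical.

```python
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, centres=None, max_iter=100, seed=0):
    # Steps 1-2: K is given; pick K points as the initial centres.
    if centres is None:
        centres = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign each instance to its closest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: euclidean(p, centres[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each centre (an empty cluster keeps its old centre).
        new_centres = [mean_point(c) if c else centres[i] for i, c in enumerate(clusters)]
        # Step 5: stop once the centres no longer change.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # hypothetical data
centres, clusters = k_means(points, k=2, centres=[(1, 1), (8, 8)])
print(centres)
print(clusters)
```

Because the algorithm only finds a local optimum, different starting centres can give different clusterings on less well-separated data, so it is common to run it several times.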

63 General considerations of K-means algorithm Requires real-valued data. We must pre-select the number of clusters present in the data. Works best when the clusters in the data are of approximately equal size. Attribute significance cannot be determined. Lacks explanation capabilities.

64 Example 8.2 Let us consider the dataset of Example 8.1 to find two clusters using the k-means algorithm. Step 1. Arbitrarily, let us choose two cluster centers to be the data points P5 (5, 2) and P7 (1, 2). Their relative positions can be seen in Figure 8.6. We could have started with any two other points; in general, however, a different initial selection can lead k-means to a different final clustering.

65 Step 2. Let us find the Euclidean distances of all the data points from these two cluster centers.

66 Step 2. (Cont.)

67 Step 3. The new cluster centres are:

68 Step 4. The distances of all data points from these new cluster centres are:

69 Step 4. (cont.)

70 Step 5. By the closest centre criterion P5 should be moved from C2 to C1, and the new clusters are C1 = {P1, P5, P6, P7, P8} and C2 = {P2, P3, P4}. The new cluster centres are:

71 Step 6. We may repeat the computations of Step 4 and we will find that no data point will switch clusters. Therefore, the iteration stops and the final clusters are C1 = {P1, P5, P6, P7, P8} and C2 = {P2, P3, P4}.

72 Density-based methods Figure 8.8 (a) Three irregular shaped clusters (b) Influence curve of a point

73 Probability-based methods Expectation Maximization (EM) uses a Gaussian mixture model: 1. Guess initial values of all the parameters. 2. Until a termination criterion is achieved: a. Use the probability density function to compute the cluster probability for each instance. b. Use the probability score assigned to each instance in the above step to re-estimate the parameters.
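The EM loop can be illustrated for the simplest case, a mixture of two one-dimensional Gaussians. The Python sketch below is a minimal illustration under stated assumptions: a fixed number of iterations serves as the termination criterion, the initial variances and mixing weight are arbitrary guesses, and the function names and data are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, mean_a, mean_b, iterations=50):
    """Fit a two-component 1-D Gaussian mixture from initial mean guesses."""
    var_a = var_b = 1.0      # initial guesses for the variances
    weight_a = 0.5           # initial guess for the mixing weight of cluster A
    for _ in range(iterations):
        # Expectation: probability that each instance belongs to cluster A.
        resp_a = []
        for x in data:
            pa = weight_a * gaussian_pdf(x, mean_a, var_a)
            pb = (1 - weight_a) * gaussian_pdf(x, mean_b, var_b)
            resp_a.append(pa / (pa + pb))
        # Maximization: re-estimate the parameters from those probabilities.
        total_a = sum(resp_a)
        total_b = len(data) - total_a
        mean_a = sum(r * x for r, x in zip(resp_a, data)) / total_a
        mean_b = sum((1 - r) * x for r, x in zip(resp_a, data)) / total_b
        # A small floor keeps the variances away from zero.
        var_a = max(sum(r * (x - mean_a) ** 2 for r, x in zip(resp_a, data)) / total_a, 1e-6)
        var_b = max(sum((1 - r) * (x - mean_b) ** 2 for r, x in zip(resp_a, data)) / total_b, 1e-6)
        weight_a = total_a / len(data)
    return (mean_a, var_a), (mean_b, var_b), weight_a

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]  # hypothetical data
cluster_a, cluster_b, weight_a = em_two_gaussians(data, mean_a=0.0, mean_b=6.0)
print(cluster_a, cluster_b, weight_a)
```

Unlike K-means, each instance receives a probability of belonging to each cluster rather than a hard assignment.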

74 Clustering through Weka Step 1. Figure 8.9 Weka environment with credit-g.arff data

75 Step 2. Figure 8.10 SimpleKMeans algorithm and its parameter selection

76 Step 3. Figure 8.11 K-means clustering performance

77 Step 3. (cont.) Figure 8.12 Weka result window

78 Cluster visualisation Figure 8.13 Cluster visualisation

79 Individual cluster information Figure 8.14 Cluster0 instances information

80 Step 4. Figure 8.15 Cluster 1 instance information

81 Kohonen neural network Figure 8.16 A Kohonen network with two input nodes and nine output nodes

82 Kohonen self-organising maps: Contains only an input layer and an output layer but no hidden layer. The number of nodes in the output layer that finally capture all the instances determines the number of clusters in the data.

83 Example 8.3 Figure 8.17 Connections between input and output nodes of a neural network

84 Example 8.3 cont. The scoring for any output node k is done using the formula:

85 Example 8.3 cont.

86 Assuming that the learning rate is 0.3, we get:
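The scoring and weight-update formulas of Example 8.3 are not present in this transcript. The Python sketch below shows one common form of a Kohonen training step, offered as an assumption rather than the slides' exact formulas: each output node is scored by the Euclidean distance between the input and its weights, and only the winning node's weights move towards the input by the learning rate. The weights and input values are hypothetical, and neighbourhood updates are omitted.

```python
from math import sqrt

def node_score(inputs, weights):
    """Distance between an input vector and a node's weights (lower wins)."""
    return sqrt(sum((x - w) ** 2 for x, w in zip(inputs, weights)))

def train_step(inputs, nodes, rate):
    """Find the winning output node and move its weights towards the input."""
    winner = min(range(len(nodes)), key=lambda k: node_score(inputs, nodes[k]))
    nodes[winner] = [w + rate * (x - w) for x, w in zip(inputs, nodes[winner])]
    return winner

nodes = [[0.2, 0.1], [0.3, 0.6]]   # hypothetical weights of two output nodes
winner = train_step([0.4, 0.7], nodes, rate=0.3)
print(winner, nodes[winner])
```

Repeating this step over all instances gradually pulls each output node towards a group of similar inputs.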

87 Cluster validation: t-test, χ²-test, validity in test cases.

88 Strengths and weaknesses Unsupervised Learning Diverse Data Types Easy to Apply Similarity Measures Model Parameters Interpretation

89 Applications of clustering algorithms Biology Marketing research Library Science City Planning Disaster Studies Worldwide Web Social Network Analysis Image Segmentation

90 Recap What is clustering? K-means: Most popular clustering algorithm Applications of clustering techniques

91 The Estimation Task

92 Objectives On completion of this lecture you should know: Assess the numeric value of a variable from other related variables. Predict the behaviour of one variable from the behaviour of related variables. Discuss the reliability of different methods of estimation and perform a comparative study.

93 What is estimation? Finding the numeric value of an unknown attribute from observations made on other related attributes. The unknown attribute is called the dependent (or response or output) attribute (or variable) and the known related attributes are called the independent (or explanatory or input) attributes (or variables).

94 Scatter Plots and Correlation. Table 9.1 Weekly closing stock prices (in dollars) at the Australian Stock Exchange. Columns: Week ending | ASX | BHP | RIO

95 Figure 9.1a Computer screen-shots of Microsoft Excel spreadsheets to demonstrate plotting of scatter plot

96 Figure 9.1b

97 Figure 9.1c

98 Figure 9.1d

99 Figure 9.1e Computer screen-shots of Microsoft Excel spreadsheets to demonstrate plotting of scatter plot

100 Correlation coefficient
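The formula on this slide was an image and is missing from the transcript. The Python sketch below uses the standard Pearson definition, r = S_xy / sqrt(S_xx · S_yy), which is assumed to be what the slide showed; the function name and the sample values are illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))   # perfectly increasing line
print(pearson_r(xs, [10, 8, 6, 4, 2]))   # perfectly decreasing line
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 indicate little or none; this is the same quantity that Excel's CORREL function returns.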

101 Scatter plots of X and Y variables and their correlation coefficients Figure 9.2 Scatter plots of X and Y variables and their correlation coefficients

102 CORREL xls function Figure 9.3 Microsoft Excel command for the correlation coefficient

103 Example 9.2 Date | Rainfall (mm/day) | Streamflow (mm/day)

104 Example 9.2 cont. The computations can be done neatly in tabular form as given in the next slide: (a) For the mean values: $\bar{x}$ = 167.2/10 = 16.72, $\bar{y}$ = 36.01/10 = 3.601

105 Example 9.2 cont. Therefore, the correlation coefficient, r =

106 Example 9.2 cont. Therefore, the correlation coefficient, r =

107 Linear regression analysis. Linear means all exponents (powers) of x must be one, i.e., an exponent cannot be a fraction or a value greater than 1. There cannot be a product term of variables either.

108 Fitting a straight line y = mx + c. Suppose the line passes through two points A and B, where A is $(x_1, y_1)$ and B is $(x_2, y_2)$. Then $\dfrac{y - y_1}{x - x_1} = \dfrac{y_2 - y_1}{x_2 - x_1}$ (Eq. 9.3)

109 Example 9.3 Problem: The number of public servants claiming compensation for stress has been steadily rising in Australia. The number of successful claims in was 800 while in the figure was How many claims are expected in the year if the growth continues steadily? If each claim costs an average of $24,000, what should be the budget allocation of Comcare in year for stress-related compensation?

110 Example 9.3 cont. Therefore, using equation (9.3) and solving, we have Y = 220X – 437,000. If we now let X = 2007, we get the expected number of claims in the year 2007: 220(2007) – 437,000 = 4,540. At $24,000 per claim, Comcare's budget should be $108,960,000.
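The arithmetic of this example can be checked with a few lines of Python. The line Y = 220X – 437,000 and the $24,000 cost per claim come from the slide; the helper `line_through` and the two points fed to it are illustrative only (they are generated from the fitted line itself, not taken from the original claim data).

```python
def expected_claims(year):
    """Line from the slide: Y = 220X - 437,000."""
    return 220 * year - 437_000

def line_through(p1, p2):
    """Slope and intercept of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

claims_2007 = expected_claims(2007)
budget_2007 = claims_2007 * 24_000
print(claims_2007)   # 4540
print(budget_2007)   # 108960000

# Any two points on the line recover the same slope and intercept.
slope, intercept = line_through((2000, expected_claims(2000)), (2005, expected_claims(2005)))
print(slope, intercept)
```

Extrapolating this far beyond the observed years assumes the growth stays linear, which the example states as its premise.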

111 Simple linear regression Figure 9.6 Schematic representation of the simple linear regression model

112 Least squares criteria
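Under the least squares criterion the fitted line minimises the sum of squared vertical distances between the observations and the line. A minimal Python sketch of the closed-form solution, b1 = S_xy / S_xx and b0 = ȳ − b1·x̄, follows; the function name and the four data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = s_xy / s_xx             # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                # hypothetical data lying exactly on y = 1 + 2x
b0, b1 = fit_line(xs, ys)
print(b0, b1)                    # 1.0 2.0
```

With noisy data the same two formulas apply; the residuals are then no longer zero.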

113 StateNo. of Inst., X Membership, Y X2X2 Y2Y2 XY NSW x QLD x SA x TAS x VIC x WA x Others x Total x Table 9.2 Unisuper membership by States Example 9.5

114 Example 9.5 cont. b 1 = S xy /S xx = /915.7 = = (320.7)(14.57) = 1005 Therefore, the regression equation is Y = 320.7X

115 Multiple linear regression with Excel. Type "regression" under Help and then go to the LINEST function. Highlight the District office building data, copy it with Ctrl+C and paste it with Ctrl+V into your spreadsheet.

116 Multiple regression $Y = \beta_0 + \beta_1 X_1^{a} + \beta_2 X_2^{b} + \dots$ where Y is the dependent variable; $X_1, X_2, \dots$ are independent variables; $\beta_0, \beta_1, \dots$ are regression coefficients; and a, b, ... are exponents.

117 Example 9.6 Period No. of Private Houses Average weekly earnings ($) No. of persons in workforce (in millions) Variable Home loan rate (in %) = LINEST (A2:A10,B2:D10,TRUE,TRUE)

118 Example 9.6 cont. Figure 9.7 Demonstration of use of LINEST function Hence, from the printout, the regression equation is the following:H = E – W I The Ctrl and Shift keys must be kept depressed while striking the Enter key to get tabular output.

119 Coefficient of determination If the fit is perfect, the R 2 value will be one and if there is no relationship at all, the R 2 value will be zero.
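The two extremes described above can be reproduced with the usual definition R² = 1 − SS_res / SS_tot. The formula is not in this transcript, so this standard definition is an assumption; the Python function name and data are illustrative.

```python
def r_squared(ys, predictions):
    """Coefficient of determination of predictions against observations."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in ys)                  # total variation
    return 1 - ss_res / ss_tot

ys = [3, 5, 7, 9]                           # hypothetical observations
perfect_fit = r_squared(ys, [3, 5, 7, 9])   # predictions match exactly
no_relation = r_squared(ys, [6, 6, 6, 6])   # always predicting the mean
print(perfect_fit, no_relation)             # 1.0 0.0
```

A model that always predicts the mean of Y explains none of the variation, which is why it scores zero.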

120 Logistic regression. A linear regression equation cannot model discrete values. We get a better reflection of reality if we replace the actual values by their probabilities. The ratio of the probabilities of occurrence and non-occurrence (the odds) directs us close to the actual value.

121 Transforming the linear regression model Logistic regression is a nonlinear regression technique that associates a conditional probability with each data instance.

122 The logistic regression model. Here ax is the right-hand side of the regression equation in vector form.
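The model equation itself is missing from this transcript. A common form, assumed here, passes the linear combination of the inputs through the logistic function so that the output is a probability between 0 and 1, and the log of the odds is linear in the inputs. The Python names and the coefficient values below are hypothetical.

```python
from math import exp, log

def logistic(z):
    """Map any real number to a probability in (0, 1)."""
    return 1 / (1 + exp(-z))

def log_odds(p):
    """Natural log of the ratio of occurrence to non-occurrence."""
    return log(p / (1 - p))

def predict_probability(x, coefficients, intercept):
    """P(y = 1 | x) for a linear combination intercept + a . x."""
    z = intercept + sum(a * xi for a, xi in zip(coefficients, x))
    return logistic(z)

print(logistic(0))   # 0.5
print(predict_probability([2.0, 1.0], coefficients=[0.8, -0.5], intercept=-1.0))
```

Taking the log odds of the predicted probability recovers the linear combination, which is the sense in which logistic regression transforms the linear regression model.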

123 Logistic regression cont. Figure 9.8 Graphical representation of the logistic regression equation

124 Regression in Weka Figure 9.10 Selection of logistic function

125 Output from logistic regression Figure 9.12 Output from logistic regression

126 Visualisation option of the results Figure 9.13 Visualisation option of the results

127 Visual impression of data and clusters Figure 9.14 Visual impression of data and clusters

128 Particular instance information Figure 9.15 Information about a particular instance

129 Strengths and weaknesses Regression analysis is a powerful tool suitable for linear relationships, but most real-world problems are nonlinear, so the output is often approximate rather than exact, yet still useful. Regression techniques assume that the uncertainty is normally distributed and that the instances are independent of each other. This is not the case with many real problems.

130 Applications of regression algorithms Financial Markets Medical Science Retail Industry Environment Social Science

131 Recap What is estimation? How to solve the estimation problem? Applications of regression analysis.

