Market Basket Analysis - An Introduction (Ignore ItemCount)

Market Basket Analysis - An Introduction (Ignore ItemCount)
Jason C. H. Chen, Ph.D. Professor of MIS School of Business Administration Gonzaga University Spokane, WA 99258

Why do organizations need business intelligence?
BI systems are computer programs provide valuable information for decision making. Three primary BI systems: __________ tools read data, process them, and format the data into structured reports (e.g., sorting, grouping, summing, and averaging) that are delivered to users. They are used primarily for assessment. RFM (Recency, Frequency, and Monetary Value) is one of the tool for reporting. ____________ tools process data using statistical, regression, decision tree, and market basket techniques to discover hidden patterns and relationships, and make predictions based on the results _______________________ tools store employee knowledge, make it available to whomever needs it. These tools are distinguished from the others because the source of the data is human knowledge. Reporting Data-mining Reporting – routine decision Data mining – analysis Knowledge – expertise to be retained and reused right away RFM: Recency, Frequency, Monetary Value Knowledge management

Rapid Miner RapidMiner is unquestionable the world-leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data ... Two versions of products Studio Server

RapidMiner (Studio) Open and extensible.
Easy-to-use visual environment for predictive analytics. No programming required. RapidMiner is easily the most powerful and intuitive graphical user interface for the design of analysis processes. Forget sifting through code! You can also choose to run in batch mode. Whatever you prefer, RapidMiner has it all. Open and extensible. Hundreds of data loading, data transformation, data modeling, and data visualization methods with access to a comprehensive list of data sources like Excel, Access, Oracle, IBM DB2, Microsoft SQL, Sybase, Ingres, MySQL, Postgres, SPSS, Text files and more! Easily integrate your own specialized algorithms into RapidMiner by leveraging its powerful and open extension APIs. Advanced analytics at every scale – perfect for big data. In-memory and in-database analytics for every size data source. RapidMiner Studio breaks away from the limitations of traditional data analysis tools and allows you to work with large data sources. Runs on all major platforms and operating systems. Easy-to-use visual environment for predictive analytics. No programming required. RapidMiner is easily the most powerful and intuitive graphical user interface for the design of analysis processes. Forget sifting through code! You can also choose to run in batch mode. Whatever you prefer, RapidMiner has it all. Open and extensible. Hundreds of data loading, data transformation, data modeling, and data visualization methods with access to a comprehensive list of data sources like Excel, Access, Oracle, IBM DB2, Microsoft SQL, Sybase, Ingres, MySQL, Postgres, SPSS, dBase, Text files and more! Easily integrate your own specialized algorithms into RapidMiner by leveraging its powerful and open extension APIs. Advanced analytics at every scale – perfect for big data. In-memory, in-database and in-Hadoop analytics for every size data source. RapidMiner Studio breaks away from the limitations of traditional data analysis tools and allows you to work with large data sources. Runs on all major platforms and operating systems.

A Small Experiment 2 Think of a number between 1 and 10
Multiply this number by 9. Work out the checksum of the result, i.e. the sum of the numbers. Multiply the result by 4. Divide the result by 3. Deduct 10. What is the result? The result is ___ Do you believe in coincidence? Prediction … 2

Market Basket

Market Basket Analysis
Definition: Market Basket Analysis (Association Analysis) is a mathematical modeling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. The term market basket or commodity bundle refers to a fixed list of items used specifically to track the progress of inflation in an economy or specific market. Purpose: It is used to analyze the customer purchasing behavior and helps in increasing the sales and maintain inventory by focusing on the point of sale transaction data. Given a dataset, an algorithm (e.g., FPGrowth or Apriori Algorithm) trains and identifies product baskets and product association rules to find frequent items. More about Apriori Algorithm see Market Basket Analysis-NEW.pptx Apriori and FPGrowth are two different algorithms for finding Frequent Items.

Market Basket Analysis (cont.)
For example, if you are in an English pub and you buy a pint of beer and don't buy a bar meal, you are more likely to buy chips at the same time than somebody who didn't buy beer. The set of items a customer buys is referred to as an itemset, and market basket analysis seeks to find relationships between purchases. Typically the relationship will be in the form of a rule: IF {beer, no bar meal} THEN {chips}. The probability that a customer will buy beer without a bar meal (i.e. buy beer and no bar meal together - the antecedent is true) is referred to as the support for the rule. The conditional probability that a customer will purchase chips is referred to as the confidence.

Market Basket Analysis (cont.)
Other Application Areas Although Market Basket Analysis conjures up pictures of shopping carts and supermarket shoppers, it is important to realize that there are many other areas in which it can be applied. These include: Analysis of credit card purchases. Analysis of telephone calling patterns. Identification of fraudulent medical insurance claims. (Consider cases where common rules are broken). Analysis of telecom service purchases. BENEFITS: simple computations and different data forms can be analyzed can be undirected (don’t have to have hypotheses before analysis)

Definitions and Terminology
What items are bought together What’s in each shopping cart/basket? Basket data consist of collection of transaction data and items bought in a transaction _________ How does this data differ from a transaction database? Retail organizations interested in generating qualified decisions and strategy based on analysis of transaction data what to put on sale, how to place merchandise on shelves for maximizing profit, customer segmentation based on buying pattern Itemset Pivoting

Definitions and Terminology (cont.)
Transaction is a set of items (Itemset). Confidence : It is the measure of uncertainty or trust worthiness associated with each discovered pattern. Support : It is the measure of how often the collection of items in an association occur together as percentage of all transactions. If you only want to find strong correlations, you can set it ____. If you want a wide range of unknown correlations, set it ____. Therefore, the lower support value the easier to get an solution. Frequent itemset : If an itemset satisfies minimum support, then it is a frequent itemset. Strong Association rules: Rules that satisfy both a minimum support threshold and a minimum confidence threshold In Association rule mining, we first find all frequent itemsets (i.e., meet a user-specified minimum support threshold) and then generate strong association rules (based on a user-specified minimum confidence) from the frequent itemsets high low Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split up into two separate steps: First, minimum support is applied to find all frequent itemsets in a database. Second, these frequent itemsets and the minimum confidence constraint are used to form rules. While the second step is straightforward, the first step needs more attention. The minimum threshold is there for two reasons. One, to filter out data that is inconsequential, and two, to prevent the algorithm from running out of memory. If you only want to find strong correlations, you can set it high. If you want a wide range of unknown correlations, set it low.

Grocery Store Scenario
Here we have an Excel-based dataset containing information about customers who have shopped in a grocery store. We are trying to detect relationships or associations (i.e.,_______ ) between specific items in a large "catalog" of objects using market basket analysis. A simple example would be the occurrence of beers and diapers in the same sales transaction. However the real value from association rules analysis is finding connections between seemingly non-intuitive items: it would be indeed surprising (and valuable) to the retailer if it was found that there was a strong association between beer and diapers. patterns

Grocery Store Scenario
In this simple scenario that there are three customers (John, Mark and Alex) with six different items they purchased. Information about each customer’s purchasing history is included in the dataset as shown in Table 1. In order to produce the result from market basket analysis, we are using the RapidMiner software ( RapidMiner supports many different data mining techniques, but we will focus only on market basket analysis here. Please note that TID and ITEM should be in ‘upper’ case.

RapidMiner 1. Market Basket Analysis (Grocery store example) use data provided by your instructor a) follow the tutorial to produce the “support” from the data set b) ‘CreateAssociation” operator needs to be added in order to obtain both ‘support’ and ‘confidence’ information and perform its analysis 2. HW – ScubaGear Data file: Grocery.xlsx

Part II Develop market basket analysis using RapidMiner (Ignore ItemCounts) Create a folder of \bmis441 and a sub-folder of \datamining under \bmis441 and then Downoad data files from Bb (at CourseDocuments  Software …Data Mining) Grocery.xls Save the file in the folder of \bmis441\datamining\

Market Basket Analysis – (without ItemCount)
1. We will begin by starting a new project from the main menu by clicking the “New Process” Icon when starting RapidMiner as shown in Figure 1-a. 2. On the left hand side of the screen, we will click on Repositories, and will follow by expanding (clicking the plus (+) sign) Samples > processes > 02_Preprocessing. Double-click the 23_Transactional2Basket Process (Figure 1-b) the process model of “Transactional2Basket Process” is automatically created as in Fig 1-c.

Fig. 1-a Welcome Page

New Process Page Operators Tab Repositories Tab Parameters Area
Process Area Parameters Area Process Help Area Problems/Log Area

Step 2. On the left hand side of the screen, we will click on Repositories, and will follow by expanding (clicking the plus (+) sign) Samples > processes > 02_Preprocessing. Double-click the 23_Transactional2Basket Process (Figure 2-a). Main Process window will have loaded the skeleton format for the Market Basket Analysis as is shown in Fig 2-b. Fig 2-a Fig 2-b

Your Turn Follow the handout to complete the tutorial

Step 8. The Example2PivotingAttribute process transforms an example set by grouping multiple examples of single units to single examples. By clicking Data View as was done in Fig 8-a, we see that the data has been grouped together by Transaction Analysis. In the resulting table, question marks mean that said item has not been purchased in that transaction. In the AttributeSubsetPreprocessing we will substitute the question marks with the word “false.” (Fig 8-b) Fig 8-a Fig 8-a

Step 10. The results screen shows the support of each individual item, and the support of items occurring together. As seen in Fig 9, the support for item 1 (id_1.0) is 0.667, or 66.7%. Meaning that item 1 is found in 66.7% of the transactions. Whereas items 2 and 3 (and items 4,5,6) only occur once, so their support is only 33.3%. How about confidence? It means that if there are 100 transactions done today, item 1 is found (included) in almost 67 transactions. Whereas items 2 and 3 (and items 4,5,6) only occur once, so their support is only 33.3% How does this imply? How does this imply? Fig 9

How about confidence? Step 11. While the previous result shows the support of items occurring together, we need to add an additional “operator” to be able to create and obtain association rules between the items. As shown in Fig 10-a, we must add a “Create Association Rules” operator between the FPGrowth operator and the result. However, if we go ahead to RUN the process, an error was detected as shown in Fig 10-b

Step 11 (Cont.). … Therefore, we need re-connect the ports between operators. First, reconnect ‘exa’ (example set) from FPGrowoth directly to ‘res’ and then connect from ‘fre’ in FPGrowth (frequent sets) to ‘ite’ (item sets) in CreateAsociation. Finally, reconnect ‘rul’ (rules) from CreateAssociation to ‘res’ as shown in Fig 10-d and lights turn into yellow. Before After

Interpretation on the results: (Ignores Data Amounts)
Step 13. Row no. 1 can be read as: The purchase of item 1 implies the purchase of item 3. These items are bought together in 33% of the transactions, and 50% probability of the times that Item 1 is bought, Item 3 is bought as well. If we look at row no. 4, we see that the purchase of Item 3 implies the purchase of Item 1. Again, this combination happens in 33% of the transactions, but we see that every time Item 3 is bought, Item 1 is bought as well (100% confidence). What is the business “implication”? Fig 11-c

Finally, if we look at row no
Finally, if we look at row no. 15, we see that the purchase of Item 1 and Item 3 together implies the purchase of Item 2. This combination happens in 33% of the transactions, but, we see that every time Item 1 and Item 3 are bought, Item 2 is also bought (100% confidence).The text view (AssociatioRules) is illustrated in Fig 11-d. Fig 11-c

Fig 11-d

HW Hints: 1. Follow the handout 2. Create your data file
modify the data file from the previous handout Use “actual” customer name rather than customer id (numerical values in TID) Please note that TID and ITEM should be in ‘upper’ case. 3. Figure out how to “Read” your excel file into the model

Data files used between tutorial and HW is that HW is with four customers – TID; see detail on the HW)

What is the difference between these two files?
End note: If we want to change the value of TID and ITEM from numbers to customer’s name and item’s name, information stated in step 4 should be changed as follows (the worksheet of “Grocery Data (Name) should be selected in Step 2 of “Data import wizard) as shown in Fig 12-a thru Fig 12-d. Fig 12-a Fig 12-b What is the difference between these two files?

Fig 12-c Fig 12-d The results (Table View) with names of customers and items are shown in Fig 13-a and Fig 13-b. Fig 13-a

We now can re-read row no
We now can re-read row no. 1 as: The purchase of Beers implies the purchase of Diapers. These items are bought together in 33% of the transactions, and 50% probability of the times that Beers is bought, Diapers is bought as well. Further, if we look at row no. 4, we see that the purchase of Diapers implies the purchase of Beers. Again, this combination happens in 33% of the transactions, but we see that every time Diapers is bought, Beers is bought as well (100% confidence). Fig 13-b

Data files used between tutorial and HW is that HW is with four customers – TID; see detail on the HW)

Errors on Column Heading with ‘Mixed’ Case
If the a column heading is with ‘lower ’ or ‘mixed”case (i.e., Item), error messages will be displayed. How to fix it?

Errors on Column Heading with ‘Mixed’ Case (cont.)

Errors on Column Heading with ‘Mixed’ Case (cont.)
Then, the model is ready to be “RUN”.

Market Basket Analysis - An Introduction (Ignore ItemCount)

Similar presentations

Presentation on theme: "Market Basket Analysis - An Introduction (Ignore ItemCount)"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Market Basket Analysis - An Introduction (Ignore ItemCount)

Similar presentations

Presentation on theme: "Market Basket Analysis - An Introduction (Ignore ItemCount)"— Presentation transcript:

Similar presentations

About project

Feedback