Oct 6, 2021

# An Introduction to Market Basket Analysis

Market Basket Analysis is a set of techniques which probably derived its name from the “Shopping Basket”. It is an essential tool for any retailer to understand and identify hidden patterns in point-of-sale data. While Market Basket Analysis, a.k.a. MBA, originated in the retail industry, it finds widespread application in the Banking and Financial Services industry, Telecom, Insurance and so on.

Essentially, Market Basket Analysis tells a retailer which products are bought together most frequently, and this can become the basis for defining a strategy for:

A. Determining store layouts to maximize revenue

B. Bundling strategies for products

C. Discounting strategies

And the list can go on and on…

Market Basket Analysis looks at the items that are bought together as part of a single order, and scans all the orders to identify the combinations of items that occur frequently.

In other industries, the same idea applies. In Banking, a marketer can try to identify which products to recommend to a customer. In Telecom, the provider could identify bundling strategies for services. An insurance provider can look at unusual combinations of claims which could be a sign of fraud.

An interesting anecdote is that marketing researchers found that on weekends, Beer and Diapers had a good chance of being bought together. While this anecdote was debunked later, it gives a good understanding of how marketers can identify hidden patterns in data. Beer and Diapers have nothing in common, but it is possible that in families with newborn babies, the father gets the job of weekend shopping. Since diapers are used frequently, they invariably figure in the shopping list, and while the father is at the store, he also picks up a pack of beer.

Market Basket Analysis uses something called **Association Rules**, which analyze the frequently occurring items in a shopper's cart. The most popular algorithm is called the **Apriori algorithm.** It would come up with rules such as

*“If the customer has picked up bread, then there is a good likelihood that the customer would also pick up butter”*

What this means for a retailer is that they can keep bread and butter together in the store, so that the customer invariably picks up butter along with bread. Alternatively, the retailer can keep the two products at opposite corners of the store, so that the customer has to walk across the store to pick up both. What might happen then is that the customer who came for just bread and butter ends up picking up a lot more items that catch the eye!

To understand Association Rules better, it is important to note and understand the three concepts of *Support, Confidence and Lift.*

**Support:** Support is defined as the probability that a particular product or bundle of products will be picked up, that is, the fraction of all transactions that contain it. Mathematically it is defined as

Support = Frequency(Product A)/Total Number of Transactions

**Confidence:** Confidence is the number of transactions that contain both Product A and Product B divided by the number of transactions that contain Product A. Mathematically it is the conditional probability of B given A.

Confidence = Frequency(A, B)/Frequency(A)

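**Lift:** Lift is the support for Product A and Product B together divided by the product of the support for A and the support for B.

Lift = Support(A, B)/(Support(A) * Support(B))

Lift tells the retailer whether Product A and Product B are bought together more often than would be expected if they were bought independently. A lift above 1 means the two products appear together more often than chance alone would suggest, while a lift below 1 means they appear together less often.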

Let’s take an example to make these concepts clear.

Transaction 1: Customer bought Milk, Bread and Butter

Transaction 2: Customer bought Milk, Butter and Eggs

Transaction 3: Customer bought Bread, Butter and Eggs

Transaction 4: Customer bought Milk, Eggs and Cheese

Transaction 5: Customer bought Bread, Butter and Cheese

Given the transactions above, we want to calculate support, confidence and lift for the rule which says that if the customer bought milk, the customer is likely to buy eggs!

**Support for Milk and Eggs** = Number of transactions containing both Milk and Eggs/Total Transactions = 2/5 (Transactions 2 and 4)

**Confidence for Eggs given Milk** = Frequency(Milk, Eggs)/Frequency(Milk) = 2/3 (T2 and T4 have both Milk and Eggs; Milk occurs in T1, T2 and T4)

**Lift for Eggs and Milk** = Support for Eggs and Milk/(Support of Milk * Support of Eggs) = (2/5)/((3/5) * (3/5)) = 10/9
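The same three numbers can be checked with a few lines of Python. This is only a minimal sketch: the function names `support`, `confidence` and `lift` are ours, chosen for illustration, and exact fractions are used so the results match the hand calculation. It also assumes the antecedent occurs in at least one transaction, otherwise the division would fail.

```python
from fractions import Fraction

# The five transactions from the example above
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Butter", "Eggs"},
    {"Bread", "Butter", "Eggs"},
    {"Milk", "Eggs", "Cheese"},
    {"Bread", "Butter", "Cheese"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent, transactions):
    """Support(A, B) / Support(A), i.e. P(B given A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Support(A, B) / (Support(A) * Support(B))."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / (
        support(antecedent, transactions) * support(consequent, transactions)
    )

print(support({"Milk", "Eggs"}, transactions))       # 2/5
print(confidence({"Milk"}, {"Eggs"}, transactions))  # 2/3
print(lift({"Milk"}, {"Eggs"}, transactions))        # 10/9
```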

Now let’s try to understand how the rules are formed. For this we will again consider 5 transactions, each having at least 2 products.

We will try to create **frequent itemsets**, which are nothing but groupings of products/items, and we will compute support for each such combination. We will reject any itemset which has support below 40% (less than 2/5). This is also called the **Threshold Criteria for Support.**

So first we will create a table of individual products and compute support for each product individually.

We see that Product D has only 20% support and hence we remove Product D. Now only A, B, C and E remain. We will create groups of two products, taking all possible combinations from A, B, C and E.

We can eliminate {A, B}, given that its support is below 40%.

Repeating the process for three items, we get the following

After applying the same rules, we are left with two frequent itemsets, {A, C, E} and {B, C, E}. We could go up to four products, {A, B, C, E}, but the support for that is just 1/5. So our process can stop here.

Since we had just five transactions, it was easy to compute support. However, consider hundreds of millions of transactions! Computing support becomes a computationally challenging exercise, and we can use a rule called **“Pruning”.** Pruning helps in eliminating groups of items without counting them. For example, among the three-item groups above, {A, B, C} and {A, B, E} can be removed straight away because their subset {A, B} was already eliminated earlier. If any subset of an itemset does not meet the threshold for support, then the itemset itself will not meet it either. Logically, a subset always has support at least as high as any set that contains it, because every transaction containing the larger set also contains the subset. So once a subset falls below the threshold, every larger set built on it falls below the threshold too.
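To see the level-by-level process and the pruning step in action, here is a minimal sketch of the idea in Python. It reuses the five grocery transactions from the support, confidence and lift example rather than products A to E, and the function name `frequent_itemsets` is ours. It is meant to illustrate the join, prune and count steps, not to be an efficient implementation for millions of transactions.

```python
from fractions import Fraction
from itertools import combinations

transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Butter", "Eggs"},
    {"Bread", "Butter", "Eggs"},
    {"Milk", "Eggs", "Cheese"},
    {"Bread", "Butter", "Cheese"},
]

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return Fraction(sum(1 for t in transactions if itemset <= t), n)

    # Level 1: individual items that meet the support threshold
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join: combine frequent (k-1)-itemsets into candidates of size k
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune: drop a candidate if any (k-1)-subset was already eliminated
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count: only the surviving candidates need a pass over the data
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

result = frequent_itemsets(transactions, Fraction(2, 5))
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

On this data, {Milk, Bread, Butter} and {Bread, Butter, Eggs} are pruned without being counted, because {Milk, Bread} and {Bread, Eggs} were already below the 40% threshold. Only {Milk, Butter, Eggs} is actually counted at the three-item level, and it appears in just one transaction, so no three-item itemset survives.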

Now that we have two frequent itemsets, {A, C, E} and {B, C, E}, we can create Association Rules.

To do this, we need to create all possible subsets of {A, C, E} and {B, C, E}. Let’s consider the first set which is {A, C, E}

Subsets of {A, C, E} are {A, C, E}, {A, C}, {A, E}, {C, E}, {A}, {C}, {E}.

Let’s say that we want to measure how effective the rule {A, C -> E} is. In other words, if the customer has bought products A and C, how likely is the customer to buy product E?

To judge the rule, we define a **Threshold Criteria for Confidence**: we will reject a rule if its confidence is less than or equal to 50%.

Confidence (A, C->E) = Support(A, C, E)/Support(A,C) = (2/5)/(3/5) = 66.67%. Hence we will accept the rule.

Confidence of (E -> A,C) = Support (A,C,E)/Support(E) = (2/5)/(4/5) = 50%. We will reject this rule, since this does not meet the Confidence Threshold criteria.
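Rule generation from a frequent itemset can be sketched the same way. The snippet below is an illustration under our own assumptions: it reuses the five grocery transactions from the earlier example (not products A to E), the helper names `support` and `rules_from` are ours, and it keeps a rule only when its confidence is strictly greater than the threshold, matching the "less than or equal to 50% is rejected" criterion above.

```python
from fractions import Fraction
from itertools import combinations

transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Butter", "Eggs"},
    {"Bread", "Butter", "Eggs"},
    {"Milk", "Eggs", "Cheese"},
    {"Bread", "Butter", "Cheese"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))

def rules_from(itemset, min_confidence):
    """Split itemset into every antecedent -> consequent pair and keep
    the rules whose confidence is strictly above min_confidence."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            conf = support(itemset) / support(antecedent)
            if conf > min_confidence:
                rules.append((antecedent, consequent, conf))
    return rules

rules = rules_from({"Milk", "Butter"}, Fraction(1, 2))
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), conf)
```

For the frequent itemset {Milk, Butter}, the rule Milk -> Butter has confidence (2/5)/(3/5) = 2/3 and is kept, while Butter -> Milk has confidence (2/5)/(4/5) = 1/2 and is rejected, just like the two rules worked out by hand above.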

We can form such rules for all combinations and then figure out which rules give the most confidence and hence can be considered.

Market Basket Analysis essentially generates such rules and lets the marketer examine them and derive insights. By themselves, the rules mean nothing. A rule could be “**Actionable**”, meaning that some action can be taken based on it; “**Trivial**”, meaning that it states a known fact and there is not much to do; or “**Inexplicable**”, meaning that it makes no apparent sense. Hence it is up to the marketer to identify the right set of rules and incorporate them into marketing interventions such as customized campaigns.

**Challenges with Big Data**

A retail store is likely to sell tens of thousands of products or Stock Keeping Units (SKUs). With that many SKUs and billions of transactions, identifying frequent itemsets becomes computationally difficult. In such cases, association rules can be created at the category level instead. This can still give good insights into associations between different categories, and marketing interventions can be designed around such association rules.
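Rolling transactions up to category level is a small preprocessing step. Here is a minimal sketch, where the product-to-category mapping `category_of` is purely hypothetical and made up for illustration; a real retailer would take it from its own product hierarchy.

```python
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Butter", "Eggs"},
    {"Bread", "Butter", "Eggs"},
    {"Milk", "Eggs", "Cheese"},
    {"Bread", "Butter", "Cheese"},
]

# Hypothetical mapping from product to category (an assumption for this sketch)
category_of = {
    "Milk": "Dairy",
    "Butter": "Dairy",
    "Cheese": "Dairy",
    "Bread": "Bakery",
    "Eggs": "Eggs",
}

def to_categories(transactions, category_of):
    """Replace each product by its category; duplicates collapse in the set."""
    return [{category_of[item] for item in t} for t in transactions]

category_transactions = to_categories(transactions, category_of)
for t in category_transactions:
    print(sorted(t))
```

The five distinct products collapse into three categories here, so there are far fewer candidate itemsets to count. The same support, confidence and lift calculations then apply unchanged to the category-level transactions.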

**References:**

Data Mining Techniques for Marketing, Sales and Customer Relationship Management by Gordon S Linoff and Michael J.A. Berry