Market Basket Analysis in R

In this post, we will conduct a market basket analysis on the shopping habits of people at a grocery store. Remember that a market basket analysis provides insights through indicating relationships among items that are commonly purchased together.

The first thing we need to do is load the package that makes association rules, which is the “arules” package. Next, we need to load our dataset groceries. This dataset is commonly used as a demonstration for market basket analysis.

However, you don’t want to load this dataset as dataframe because it leads to several technical issues during the analysis. Rather you want to load it as a sparse matrix. The function for this is “read.transactions” and is available in the “arules” package

library(arules)
#make sparse matrix
groceries<-read.transactions("/home/darrin/Documents/R working directory/Machine-Learning-with-R-datasets-master/groceries.csv", sep = ",")

Please keep in mind that the location of the file on your computer will be different from my hard drive.

We will now explore the data set by using several different functions. First, we will use the “summary” function as indicated below.

summary(groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The output tells us the number of rows in our dataset (9835) columns (169) as well as the density, which is the percentage of columns that are not empty (2.6%). This may seem small but remember that the number of purchases varies from person to person so this affects how many empty columns there are.

Next, we have the most commonly purchased items. Milk and other vegetables were the two most common followed by other foods. After the most frequent items, we have the size of each transaction. For example, 2159 people purchased one item during a transaction. While one person purchased 32 items in a transaction.

Lastly, we summary statistics about transactions. On average, a person would purchase 4.4 items per transaction.

We will now look at the support of different items. Remember, that the support is the frequency of an item in the dataset. We will use the “itemFrequencyPlot” function to do this and we will add the argument “topN” to sort the items from most common to less for the 15 most frequent transactions. Below is the code

itemFrequencyPlot(groceries, topN=15)

download The plot that is produced gives you an idea of what people were purchasing. We will now attempt to develop association rules using the “apriori” function.

For now, we will use the default settings for support and confidence (confidence is the proportion of transactions that have the same item(s)). The default for support is 0.1 and for confidence it is 0.8. Below is the code.

apriori(groceries)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##         0.8    0.1    1 none FALSE            TRUE     0.1      1     10
##  target   ext
##   rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 983 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## set of 0 rules

As you can see from the printout, nothing meets the criteria of a support of 0.1 and confidence of 0.8. How to play with these numbers is a matter of experience as there are few strong rules for this matter. Below, I set the support to 0.006, confidence to 0.25, and the minimum number of rules items to 2. The support of 0.006 means that this item must have been purchased at least 60 times out of 9835 items and the confidence of 0.25 means that rule needs to be accurate 25% of the time. Lastly, I want at least two items in each rule that is produced as indicated by minlen = 2. Below is the code with the “summary” as well.

groceriesrules<-apriori(groceries, parameter = list(support=0.006, confidence = 0.25, minlen=2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##        0.25    0.1    1 none FALSE            TRUE   0.006      2     10
##  target   ext
##   rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(groceriesrules)
## set of 463 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 150 297  16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.711   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence          lift       
##  Min.   :0.006101   Min.   :0.2500   Min.   :0.9932  
##  1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229  
##  Median :0.008744   Median :0.3554   Median :1.9332  
##  Mean   :0.011539   Mean   :0.3786   Mean   :2.0351  
##  3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565  
##  Max.   :0.074835   Max.   :0.6600   Max.   :3.9565  
## 
## mining info:
##       data ntransactions support confidence
##  groceries          9835   0.006       0.25

Our current analysis has 463 rules. This is a major improvement from 0. We can also see how many rules contain 2 (150), 3 (297), and 4 items (16). We also have a summary of the average number of items per rule. The next is descriptive stats on the support and confidence of the rules generated.

Something that is new for us is the “lift” column. Lift is a measure how much more likely an item is to be purchased above its number rate. Anything above 1 means that the likelihood of purchase is higher than chance.

We are now going to look for useful rules from our dataset. We are looking for rules that we can use to make decisions. It takes industry experiences in the field of your data to really glean useful rules. For now, we can only determine this statistically by sorting the rules by their lift. Below is the code for this

inspect(sort(groceriesrules, by="lift")[1:7])
##   lhs             rhs                     support    confidence  lift
## 1 {herbs}         => {root vegetables}    0.007015760  0.4312500 3.956
## 2 {berries}       => {whipped/sour cream} 0.009049314  0.2721713 3.796
## 3 {other vegetables,                                                  ##    tropical fruit,                                                    
##    whole milk}    => {root vegetables}    0.007015760  0.4107143 3.768
## 4 {beef,                                                              
##    other vegetables} => {root vegetables} 0.007930859  0.4020619 3.688
## 5 {other vegetables,                                                  
##    tropical fruit} => {pip fruit}         0.009456024  0.2634561 3.482
## 6 {beef,                                                              
##    whole milk}    => {root vegetables}    0.008032537  0.3779904 3.467
## 7 {other vegetables,                                                  
##    pip fruit}     => {tropical fruit}     0.009456024  0.3618677 3.448

The first three rules are translated into simple English below as ordered by lift.

  1. If herbs are purchased then root vegetables are purchased
  2. If berries are purchased then whipped sour cream is purchased
  3. If other vegetables, tropical fruit, and whole milk are purchased then root vegetables are purchased

Conclusion

Since we are making no predictions, there is no way to really objectively improve the model. This is normal when the learning is unsupervised. If we had to make a recommendation based on the results we could say that the store should place all vegetables near each other.

The power of market basket analysis is allowing the researcher to identify relationships that may not have been noticed any other way. Naturally, insights gained from this approach must be used for practical actions in the setting in which they apply.

1 thought on “Market Basket Analysis in R

  1. Pingback: Market Basket Analysis in R | Education and Res...

Leave a Reply