In this post, we will conduct a market basket analysis on the shopping habits of people at a grocery store. Remember that a market basket analysis provides insight by identifying relationships among items that are commonly purchased together.
The first thing we need to do is load the "arules" package, which provides the tools for building association rules. Next, we need to load our dataset, groceries. This dataset is commonly used to demonstrate market basket analysis.
However, you don't want to load this dataset as a data frame because that leads to several technical issues during the analysis. Rather, you want to load it as a sparse matrix. The function for this is "read.transactions" and it is available in the "arules" package.
library(arules)
#make sparse matrix
groceries<-read.transactions("/home/darrin/Documents/R working directory/Machine-Learning-with-R-datasets-master/groceries.csv", sep = ",")
Please keep in mind that the location of the file on your computer will be different from my hard drive.
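Before summarizing, it can help to peek at a few raw transactions directly. A quick sketch, assuming the groceries object created above ("inspect" and "itemFrequency" are both part of the "arules" package):

```r
# view the first three transactions in the sparse matrix
inspect(groceries[1:3])

# proportion of transactions containing the first three items (alphabetical order)
itemFrequency(groceries[, 1:3])
```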
We will now explore the data set by using several different functions. First, we will use the “summary” function as indicated below.
summary(groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55
## 16 17 18 19 20 21 22 23 24 26 27 28 29 32
## 46 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The output tells us the number of rows (transactions) in our dataset (9835) and the number of columns (items, 169), as well as the density, which is the proportion of cells in the matrix that are non-zero (about 2.6%). This may seem small, but remember that the number of items purchased varies from person to person, so most of the cells in any given row are empty.
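The density figure can be checked by hand: it is the total number of items purchased across all transactions divided by the number of cells in the matrix. A quick sketch, assuming the groceries object from above:

```r
# total items purchased divided by the number of cells (transactions x items)
sum(size(groceries)) / (length(groceries) * length(itemLabels(groceries)))
# should match the reported density of roughly 0.026
```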
Next, we have the most commonly purchased items. Whole milk and other vegetables were the two most common, followed by rolls/buns, soda, and yogurt. After the most frequent items, we have the distribution of transaction sizes. For example, 2159 people purchased exactly one item in a transaction, while one person purchased 32 items in a single transaction.
Lastly, we have summary statistics about transaction sizes. On average, a person purchased about 4.4 items per transaction.
We will now look at the support of different items. Remember that the support is the frequency of an item in the dataset. We will use the "itemFrequencyPlot" function to do this, adding the "topN" argument to sort the items from most to least common and display the 15 most frequent items. Below is the code.
itemFrequencyPlot(groceries, topN=15)
The plot that is produced gives you an idea of what people were purchasing. We will now attempt to develop association rules using the “apriori” function.
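As an aside, if you would rather plot every item above a frequency cutoff instead of a fixed count, "itemFrequencyPlot" also accepts a "support" argument. A quick sketch:

```r
# plot only the items that appear in at least 10% of transactions
itemFrequencyPlot(groceries, support = 0.1)
```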
For now, we will use the default settings for support and confidence (confidence is the proportion of transactions containing the left-hand side of a rule that also contain the right-hand side). The default for support is 0.1 and for confidence it is 0.8. Below is the code.
apriori(groceries)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.8 0.1 1 none FALSE TRUE 0.1 1 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 983
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 0 rules
As you can see from the printout, nothing meets the criteria of a support of 0.1 and a confidence of 0.8. Choosing these thresholds is a matter of experience, as there are few hard rules for this. Below, I set the support to 0.006, the confidence to 0.25, and the minimum number of items per rule to 2. The support of 0.006 means that an itemset must appear in at least 59 of the 9835 transactions (0.006 × 9835 ≈ 59), and the confidence of 0.25 means that a rule needs to be correct at least 25% of the time. Lastly, I want at least two items in each rule that is produced, as indicated by minlen = 2. Below is the code, with the "summary" as well.
groceriesrules<-apriori(groceries, parameter = list(support=0.006, confidence = 0.25, minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.25 0.1 1 none FALSE TRUE 0.006 2 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 59
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(groceriesrules)
## set of 463 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 150 297 16
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.711 3.000 4.000
##
## summary of quality measures:
## support confidence lift
## Min. :0.006101 Min. :0.2500 Min. :0.9932
## 1st Qu.:0.007117 1st Qu.:0.2971 1st Qu.:1.6229
## Median :0.008744 Median :0.3554 Median :1.9332
## Mean :0.011539 Mean :0.3786 Mean :2.0351
## 3rd Qu.:0.012303 3rd Qu.:0.4495 3rd Qu.:2.3565
## Max. :0.074835 Max. :0.6600 Max. :3.9565
##
## mining info:
## data ntransactions support confidence
## groceries 9835 0.006 0.25
Our current analysis produced 463 rules, a major improvement over 0. We can also see how many rules contain 2 (150), 3 (297), and 4 (16) items, along with summary statistics on the number of items per rule. Next are descriptive statistics on the support, confidence, and lift of the rules generated.
Something that is new for us is the "lift" column. Lift measures how much more likely an item is to be purchased, given that the other items in the rule were purchased, relative to its baseline rate of purchase. Anything above 1 means that the likelihood of purchase is higher than chance.
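For example, the lift of the rule {herbs} => {root vegetables} below can be recomputed by hand as the rule's confidence divided by the support of the right-hand side. A sketch, assuming the objects created above:

```r
# support of root vegetables on their own
rv_support <- itemFrequency(groceries)["root vegetables"]

# lift = confidence / support(rhs); about 0.431 / 0.109, i.e. roughly 3.96
0.4312500 / rv_support
```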
We are now going to look for useful rules in our dataset, meaning rules that we can use to make decisions. It takes experience in the domain the data comes from to really glean useful rules. For now, we can only judge the rules statistically by sorting them by lift. Below is the code.
inspect(sort(groceriesrules, by="lift")[1:7])
##   lhs                   rhs                  support     confidence lift
## 1 {herbs}            => {root vegetables}    0.007015760 0.4312500  3.956
## 2 {berries}          => {whipped/sour cream} 0.009049314 0.2721713  3.796
## 3 {other vegetables,
##    tropical fruit,
##    whole milk}       => {root vegetables}    0.007015760 0.4107143  3.768
## 4 {beef,
##    other vegetables} => {root vegetables}    0.007930859 0.4020619  3.688
## 5 {other vegetables,
##    tropical fruit}   => {pip fruit}          0.009456024 0.2634561  3.482
## 6 {beef,
##    whole milk}       => {root vegetables}    0.008032537 0.3779904  3.467
## 7 {other vegetables,
##    pip fruit}        => {tropical fruit}     0.009456024 0.3618677  3.448
The first three rules, ordered by lift, are translated into simple English below.
- If herbs are purchased then root vegetables are purchased
- If berries are purchased then whipped/sour cream is purchased
- If other vegetables, tropical fruit, and whole milk are purchased then root vegetables are purchased
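If you have a particular item in mind, the "subset" function in "arules" lets you filter the rules rather than scanning the full sorted list. A short sketch:

```r
# keep only the rules whose items include berries, then view them
berryrules <- subset(groceriesrules, items %in% "berries")
inspect(berryrules)
```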
Conclusion
Since we are making no predictions, there is no way to objectively improve the model. This is normal when the learning is unsupervised. If we had to make a recommendation based on the results, we could say that the store should place all vegetables near each other.
The power of market basket analysis is that it allows the researcher to identify relationships that may not have been noticed any other way. Naturally, insights gained from this approach must be turned into practical actions in the setting in which they apply.
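Finally, if you want to hand the rules to colleagues or analyze them with other tools, "arules" can write them to a CSV file or convert them to a data frame. A sketch (the file name is just an example):

```r
# save the rules to a CSV file
write(groceriesrules, file = "groceryrules.csv", sep = ",", quote = TRUE, row.names = FALSE)

# or convert them to a data frame for further work in R
groceriesrules_df <- as(groceriesrules, "data.frame")
str(groceriesrules_df)
```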