The markdown version of this file is available at RPubs.

Association mining is a technique for finding associations among many variables. It is commonly used to make product recommendations by identifying products that are frequently bought together, and it is popular with grocery stores, e-commerce websites, and anyone else with a large transaction database. A familiar example from daily life: when you order something on Amazon, the site suggests what else you might want to buy.

Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets.

Association mining in R

Michael Hahsler has authored and maintains two very useful R packages relating to association rule mining: the arules package and the arulesViz package.

library(tidyverse)
library(arulesViz)
library(arules)
library(kableExtra)

Data and EDA

The data is a public set of purchase records from a grocery store, with one basket of items per transaction. The first few transactions look like this.

data <- read.transactions('groceries.csv', format = 'basket', sep=',')
inspect(head(data))
##     items                     
## [1] {citrus fruit,            
##      margarine,               
##      ready soups,             
##      semi-finished bread}     
## [2] {coffee,                  
##      tropical fruit,          
##      yogurt}                  
## [3] {whole milk}              
## [4] {cream cheese,            
##      meat spreads,            
##      pip fruit,               
##      yogurt}                  
## [5] {condensed milk,          
##      long life bakery product,
##      other vegetables,        
##      whole milk}              
## [6] {abrasive cleaner,        
##      butter,                  
##      rice,                    
##      whole milk,              
##      yogurt}

I want to find the most frequently bought items.

itemFrequencyPlot(data, topN = 20, type = "absolute")
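
The counts behind such a plot can be illustrated without arules. The following is a minimal base R sketch on a small made-up set of baskets (the `baskets` list is hypothetical and is not the grocery data). It tallies how many baskets contain each item and sorts the result, which is what the bar heights of an absolute item frequency plot represent.

```r
# Hypothetical toy baskets, for illustration only (not the grocery data)
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "butter", "yogurt"),
  c("coffee", "yogurt"),
  c("whole milk")
)

# Count each item at most once per basket, then sort from most to least frequent
item_counts <- sort(table(unlist(lapply(baskets, unique))), decreasing = TRUE)
print(item_counts)

# Relative frequency: the share of baskets that contain each item
print(item_counts / length(baskets))
```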

I want to find the items that are frequently bought together.

freq.items <- eclat(data, parameter = list(supp = 0.01, maxlen = 15))
inspect(head(freq.items))
##     items                          support    count
## [1] {hard cheese,whole milk}       0.01006609  99  
## [2] {butter milk,whole milk}       0.01159126 114  
## [3] {butter milk,other vegetables} 0.01037112 102  
## [4] {ham,whole milk}               0.01148958 113  
## [5] {sliced cheese,whole milk}     0.01077783 106  
## [6] {oil,whole milk}               0.01128622 111
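
To show what a frequent itemset search counts, here is a rough base R sketch restricted to item pairs. It is not how `eclat` is implemented, and it handles only pairs rather than itemsets of any length. The `baskets` list and the `min_support` threshold are hypothetical values chosen for illustration.

```r
# Hypothetical toy baskets, for illustration only (not the grocery data)
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "butter", "yogurt"),
  c("coffee", "yogurt"),
  c("whole milk", "butter")
)
items <- sort(unique(unlist(baskets)))

# Incidence matrix: one row per basket, one column per item (1 = item present)
incidence <- t(vapply(baskets,
                      function(b) as.numeric(items %in% b),
                      numeric(length(items))))
colnames(incidence) <- items

# t(incidence) %*% incidence: entry [i, j] counts baskets containing both i and j
co_counts <- crossprod(incidence)

# Keep each unordered pair once and convert counts to support
pairs <- which(upper.tri(co_counts), arr.ind = TRUE)
pair_support <- data.frame(
  item_a  = items[pairs[, "row"]],
  item_b  = items[pairs[, "col"]],
  count   = co_counts[pairs],
  support = co_counts[pairs] / length(baskets)
)

# Hypothetical minimum support threshold, playing the role of supp in eclat
min_support <- 0.5
frequent_pairs <- pair_support[pair_support$support >= min_support, ]
print(frequent_pairs)
```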

Product recommendation rules

Three measures are used to evaluate rules and to control how many are generated: support, confidence, and lift.

Support is an indication of how frequently the item set appears in the data set.
\[Support = \frac{Number\ of\ transactions\ with\ both\ A\ and\ B}{Total\ number\ of\ transactions} = P\left(A \cap B\right)\]

Confidence is an indication of how often the rule has been found to be true.
\[Confidence = \frac{Number\ of\ transactions\ with\ both\ A\ and\ B}{Total\ number\ of\ transactions\ with\ A} = \frac{P\left(A \cap B\right)}{P\left(A\right)}\]

Lift is the factor by which the co-occurrence of A and B exceeds what would be expected if A and B were independent. A lift above 1 means A and B occur together more often than independence would predict, a lift of 1 means no association, and a lift below 1 means they occur together less often. Here the expected confidence is simply P(B).
\[Lift = \frac{Confidence}{Expected\ Confidence} = \frac{P\left(A \cap B\right)}{P\left(A\right) \cdot P\left(B\right)}\]
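
The three definitions can be checked by hand on a tiny example. The sketch below uses base R only and a made-up list of five baskets (hypothetical data, not the grocery file) to compute support, confidence, and lift for the rule {bread} => {milk}.

```r
# Hypothetical toy baskets, for illustration only
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)
n <- length(baskets)

# Number of baskets that contain every item in `items`
count_with <- function(items) {
  sum(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

count_a  <- count_with("bread")             # baskets with A
count_b  <- count_with("milk")              # baskets with B
count_ab <- count_with(c("bread", "milk"))  # baskets with both A and B

support    <- count_ab / n                  # P(A and B)
confidence <- count_ab / count_a            # P(A and B) / P(A)
lift       <- confidence / (count_b / n)    # confidence / P(B)

print(c(support = support, confidence = confidence, lift = lift))
```

In this toy data the lift comes out slightly below 1, so buying bread makes milk marginally less likely than its overall frequency would suggest.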

rules <- apriori(data, parameter = list(support = 0.0015, confidence = 0.9))
rules_optim <- rules[]
inspect(rules_optim)
##     lhs                        rhs                    support confidence      lift count
## [1] {liquor,                                                                            
##      red/blush wine}        => {bottled beer}     0.001931876  0.9047619 11.235269    19
## [2] {flour,                                                                             
##      root vegetables,                                                                   
##      whipped/sour cream}    => {whole milk}       0.001728521  1.0000000  3.913649    17
## [3] {cream cheese,                                                                      
##      other vegetables,                                                                  
##      sugar}                 => {whole milk}       0.001525165  0.9375000  3.669046    15
## [4] {butter,                                                                            
##      pip fruit,                                                                         
##      whipped/sour cream}    => {whole milk}       0.001830198  0.9000000  3.522284    18
## [5] {domestic eggs,                                                                     
##      tropical fruit,                                                                    
##      whipped/sour cream}    => {whole milk}       0.001830198  0.9000000  3.522284    18
## [6] {fruit/vegetable juice,                                                             
##      tropical fruit,                                                                    
##      whipped/sour cream}    => {other vegetables} 0.001931876  0.9047619  4.675950    19
## [7] {root vegetables,                                                                   
##      sausage,                                                                           
##      tropical fruit,                                                                    
##      yogurt}                => {whole milk}       0.001525165  0.9375000  3.669046    15

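Here lhs is the “left-hand side” and rhs is the “right-hand side” of a rule. The first row answers the question: if a customer buys liquor and red/blush wine (the lhs column), will the customer also buy bottled beer (the rhs column)? The confidence of about 0.90 means that roughly 90% of the baskets containing both liquor and red/blush wine also contained bottled beer.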
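
The numbers in the first row can be cross-checked against the definitions of support, confidence, and lift. The short sketch below copies the printed values for that rule and works backwards to the quantities they imply. The variable names are my own, and the results are rounded reconstructions rather than values read from the data.

```r
# Values copied from the first row of the printed rules
support    <- 0.001931876
confidence <- 0.9047619
lift       <- 11.235269
count      <- 19

# support = count / total transactions
n_transactions <- round(count / support)

# confidence = count / (baskets containing the lhs items)
lhs_count <- round(count / confidence)

# lift = confidence / P(rhs), so P(rhs) = confidence / lift
rhs_support <- confidence / lift

cat("Implied total transactions:", n_transactions, "\n")
cat("Implied baskets with liquor and red/blush wine:", lhs_count, "\n")
cat("Implied share of baskets with bottled beer:", round(rhs_support, 4), "\n")
```

Read this way, the rule rests on 21 baskets that contain both liquor and red/blush wine, 19 of which also contain bottled beer, out of roughly 9,835 transactions. Bottled beer appears in only about 8% of all baskets, which is why the lift is so high.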

The rules can be sorted in descending order of confidence, support, or lift.

inspect(sort(rules_optim, by="confidence", decreasing = TRUE))
##     lhs                        rhs                    support confidence      lift count
## [1] {flour,                                                                             
##      root vegetables,                                                                   
##      whipped/sour cream}    => {whole milk}       0.001728521  1.0000000  3.913649    17
## [2] {cream cheese,                                                                      
##      other vegetables,                                                                  
##      sugar}                 => {whole milk}       0.001525165  0.9375000  3.669046    15
## [3] {root vegetables,                                                                   
##      sausage,                                                                           
##      tropical fruit,                                                                    
##      yogurt}                => {whole milk}       0.001525165  0.9375000  3.669046    15
## [4] {liquor,                                                                            
##      red/blush wine}        => {bottled beer}     0.001931876  0.9047619 11.235269    19
## [5] {fruit/vegetable juice,                                                             
##      tropical fruit,                                                                    
##      whipped/sour cream}    => {other vegetables} 0.001931876  0.9047619  4.675950    19
## [6] {butter,                                                                            
##      pip fruit,                                                                         
##      whipped/sour cream}    => {whole milk}       0.001830198  0.9000000  3.522284    18
## [7] {domestic eggs,                                                                     
##      tropical fruit,                                                                    
##      whipped/sour cream}    => {whole milk}       0.001830198  0.9000000  3.522284    18
inspect(sort(rules_optim, by="support", decreasing = TRUE))
##     lhs                        rhs                    support confidence      lift count
## [1] {liquor,                                                                            
##      red/blush wine}        => {bottled beer}     0.001931876  0.9047619 11.235269    19
## [2] {fruit/vegetable juice,                                                             
##      tropical fruit,                                                                    
##      whipped/sour cream}    => {other vegetables} 0.001931876  0.9047619  4.675950    19
## [3] {butter,                                                                            
##      pip fruit,                                                                         
##      whipped/sour cream}    => {whole milk}       0.001830198  0.9000000  3.522284    18
## [4] {domestic eggs,                                                                     
##      tropical fruit,                                                                    
##      whipped/sour cream}    => {whole milk}       0.001830198  0.9000000  3.522284    18
## [5] {flour,                                                                             
##      root vegetables,                                                                   
##      whipped/sour cream}    => {whole milk}       0.001728521  1.0000000  3.913649    17
## [6] {cream cheese,                                                                      
##      other vegetables,                                                                  
##      sugar}                 => {whole milk}       0.001525165  0.9375000  3.669046    15
## [7] {root vegetables,                                                                   
##      sausage,                                                                           
##      tropical fruit,                                                                    
##      yogurt}                => {whole milk}       0.001525165  0.9375000  3.669046    15
inspect(sort(rules_optim, by="lift", decreasing = TRUE))
##     lhs                        rhs                    support confidence      lift count
## [1] {liquor,                                                                            
##      red/blush wine}        => {bottled beer}     0.001931876  0.9047619 11.235269    19
## [2] {fruit/vegetable juice,                                                             
##      tropical fruit,                                                                    
##      whipped/sour cream}    => {other vegetables} 0.001931876  0.9047619  4.675950    19
## [3] {flour,                                                                             
##      root vegetables,                                                                   
##      whipped/sour cream}    => {whole milk}       0.001728521  1.0000000  3.913649    17
## [4] {cream cheese,                                                                      
##      other vegetables,                                                                  
##      sugar}                 => {whole milk}       0.001525165  0.9375000  3.669046    15
## [5] {root vegetables,                                                                   
##      sausage,                                                                           
##      tropical fruit,                                                                    
##      yogurt}                => {whole milk}       0.001525165  0.9375000  3.669046    15
## [6] {butter,                                                                            
##      pip fruit,                                                                         
##      whipped/sour cream}    => {whole milk}       0.001830198  0.9000000  3.522284    18
## [7] {domestic eggs,                                                                     
##      tropical fruit,                                                                    
##      whipped/sour cream}    => {whole milk}       0.001830198  0.9000000  3.522284    18
plot(rules_optim[1:7], method = "graph")

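Changing the support and confidence cutoffs changes which rules are generated, so tuning them can yield better recommendations: lower cutoffs return more rules, including weaker ones, while higher cutoffs keep only the most frequent and most reliable rules.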
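
One way to explore the cutoffs is to count how many rules each combination produces. The sketch below assumes the `Groceries` data set bundled with arules is an acceptable stand-in for `groceries.csv`. The helper `count_rules` and the grid of cutoff values are hypothetical choices for illustration, not recommended settings.

```r
library(arules)

# Example data shipped with arules; assumed here as a stand-in for groceries.csv
data("Groceries")

# Hypothetical helper: number of rules found for one support/confidence pair
count_rules <- function(supp, conf) {
  rules <- apriori(
    Groceries,
    parameter = list(support = supp, confidence = conf),
    control   = list(verbose = FALSE)  # suppress the progress output
  )
  length(rules)
}

# Try every combination of a few illustrative cutoffs
grid <- expand.grid(
  support    = c(0.001, 0.0015, 0.005),
  confidence = c(0.5, 0.7, 0.9)
)
grid$n_rules <- mapply(count_rules, grid$support, grid$confidence)

print(grid)
```

A table like this makes the trade-off visible: a combination that returns thousands of rules is hard to act on, while one that returns none is too strict.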
