Association Rules

We use both transactional and non-transactional data to explore association rules, starting with transactional data.

The data set, Groceries from the arules package, comes from a grocery store and comprises market baskets of items purchased together, where each record (transaction) lists the items that one shopper bought in a single trip.

In [10]:
library(arules)
library(arulesViz)
In [2]:
data("Groceries")
In [3]:
summary(Groceries)
transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146 

most frequent items:
      whole milk other vegetables       rolls/buns             soda 
            2513             1903             1809             1715 
          yogurt          (Other) 
            1372            34055 

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
  17   18   19   20   21   22   23   24   26   27   28   29   32 
  29   14   14    9   11    4    6    1    1    1    1    3    1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   3.000   4.409   6.000  32.000 

includes extended item information - examples:
       labels  level2           level1
1 frankfurter sausage meat and sausage
2     sausage sausage meat and sausage
3  liver loaf sausage meat and sausage

The summary() shows that the data comprise $9,835$ transactions with $169$ unique items. Of the $9,835 \times 169 = 1,662,115$ possible item-by-transaction cells, only $2.6\%$ are filled (the density), because most items are not purchased in most transactions. Whole milk appears most frequently, occurring in $2,513$ baskets, or more than a quarter of all transactions. A plot of the $20$ most frequent items appears in the figure below.

Using inspect(head(Groceries)) we see a few example baskets. For instance, the second transaction includes tropical fruit, yogurt, and coffee, while the third transaction is just a container of whole milk. In that output, notice that the item sets are printed inside curly braces, a visual clue that they belong to the special "transactions" class that we examine in more detail below.
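For reference, the call is simply the following (its printed output is summarised above):

# show the first few market baskets as item sets
inspect(head(Groceries))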

Let's explore the data before we mine any rules:

In [7]:
itemFrequencyPlot(Groceries,topN=20,type="absolute")

Finding and Visualising Association Rules

We now use apriori(data, parameter=...) to find association rules with the "apriori" algorithm. At a conceptual level, the algorithm searches through the item sets that occur frequently in a list of transactions.

To control the extent of the search, we use the parameter=list() argument to instruct the algorithm to find only rules with minimum support of 0.01 and minimum confidence of 0.3:

In [5]:
groc.rules <- apriori(Groceries, parameter=list(supp=0.01, conf=0.3))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.3    0.1    1 none FALSE            TRUE       5    0.01      1
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 98 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [88 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [125 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].

Note that the values for the support and confidence parameters are found largely by experience (in other words, by trial and error) and should be expected to vary from industry to industry and from data set to data set. We arrived at support=0.01 and confidence=0.3 after finding that they produced a modest number of rules suitable for an example. In real applications, you would adapt those values to your data and business case.

To interpret the results of apriori() above, there are two key things to examine.

First, check the number of items going into the rules, shown on the output line "sorting and recoding items ..."; in this case the rules use $88$ of the $169$ items. If this number is too small (only a tiny fraction of your items) or too large (almost all of them), you might wish to adjust the support and confidence levels.

Next, check the number of rules found, as indicated on the “writing ...” line. In this case, the algorithm found 125 rules. Once again, if this number is too low, it suggests the need to lower the support or confidence levels; if it is too high (such as many more rules than items), you might increase the support or confidence levels.
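If you prefer to confirm these counts programmatically rather than reading the log, a small sketch:

# number of rules mined (should match the "writing ..." line above)
length(groc.rules)
# summary() reports rule-length and quality-measure distributions
summary(groc.rules)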

To get a sense of the rule distribution, we load the arulesViz package and then plot() the rule set, which charts the rules according to confidence (Y axis) by support (X axis) and scales the darkness of points to indicate lift.

In [12]:
plot(groc.rules)
To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

We see that most rules involve item combinations that occur infrequently (that is, they have low support) while confidence is relatively smoothly distributed.

Once we have a rule set from apriori(), we use inspect(rules) to examine the association rules. The complete list of $125$ rules from above is too long to examine here, so we select subsets of interest: first rules with confidence > 0.58, and then rules with lift > 3.

In [24]:
inspect(subset(groc.rules, confidence > 0.58))
    lhs                                 rhs                support   
[1] {curd,yogurt}                    => {whole milk}       0.01006609
[2] {citrus fruit,root vegetables}   => {other vegetables} 0.01037112
[3] {tropical fruit,root vegetables} => {other vegetables} 0.01230300
    confidence lift     count
[1] 0.5823529  2.279125  99  
[2] 0.5862069  3.029608 102  
[3] 0.5845411  3.020999 121  

The last rule tells us that the combination {tropical fruit, root vegetables, other vegetables} occurs in about $1\%$ of baskets (support = 0.0123); when a basket contains {tropical fruit, root vegetables}, it also contains {other vegetables} about $58\%$ of the time (confidence = 0.58). The combination occurs roughly $3$ times more often than we would expect from the individual incidence rates of {tropical fruit, root vegetables} and {other vegetables} considered separately (lift = 3.02).
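For reference, these quality measures follow directly from basket counts. For a rule $X \Rightarrow Y$ mined from $N$ transactions,

$$\mathrm{support}(X \Rightarrow Y) = \frac{\mathrm{count}(X \cup Y)}{N}, \quad \mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}, \quad \mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)}.$$

As a check on the last rule above: its count is $121$, so support $= 121/9835 \approx 0.0123$; and since {other vegetables} appears in $1,903$ of the $9,835$ baskets (support $\approx 0.1935$), lift $\approx 0.5845/0.1935 \approx 3.02$.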

Such information could be used in various ways. If we paired the transactions with customer information, we could use these rules for targeted mailings or email suggestions. For items often sold together, we could adjust prices and margins jointly, for instance putting one item on sale while increasing the price of the other.

In [6]:
inspect(subset(groc.rules, lift > 3))
    lhs                                  rhs                support   
[1] {beef}                            => {root vegetables}  0.01738688
[2] {citrus fruit,root vegetables}    => {other vegetables} 0.01037112
[3] {citrus fruit,other vegetables}   => {root vegetables}  0.01037112
[4] {tropical fruit,root vegetables}  => {other vegetables} 0.01230300
[5] {tropical fruit,other vegetables} => {root vegetables}  0.01230300
    confidence lift     count
[1] 0.3313953  3.040367 171  
[2] 0.5862069  3.029608 102  
[3] 0.3591549  3.295045 102  
[4] 0.5845411  3.020999 121  
[5] 0.3427762  3.144780 121  

We find that five of the rules in our set have lift greater than $3.0$. The first rule tells us that if a transaction contains {beef}, then it is relatively likely ($33\%$, to be specific) to also contain {root vegetables}, a category that we assume includes items such as potatoes and onions. That combination appears in $1.7\%$ of baskets (the support), and the lift tells us the combination is about $3\times$ more likely to occur than one would expect from the individual incidence rates alone.

Business Inference

A store might form several ideas on the basis of such information. For instance, it might create a display for potatoes and onions near the beef counter to encourage shoppers who are examining beef to purchase those vegetables or consider recipes with them. It might also put coupons for beef in the root vegetable area, or feature recipe cards somewhere in the store.

A common goal in market basket analysis is to find rules with high lift. We can find such rules easily by sorting the larger set of rules by lift. For example, we can extract the 50 rules with the highest lift by using sort() to order the rules by lift and then taking 50 from the head():
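A minimal sketch of that step (the object name groc.hi is our own choice; this cell is not shown below):

# sort all 125 rules by lift and keep the 50 strongest
groc.hi <- head(sort(groc.rules, by="lift"), 50)
# inspect(groc.hi) would list them (output omitted here)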

Targeting Items

Suppose you want to answer the question: what are customers likely to buy if they purchase whole milk?

We set the left-hand side to "whole milk" and find its consequents (the items on the right-hand side). Note the following:

1. We lower the confidence threshold to 0.15, since we get no rules at 0.8.
2. We set a minimum rule length of 2 to avoid rules with an empty left-hand side.
In [131]:
rules<-apriori(data=Groceries, parameter=list(supp=0.001,conf = 0.15,minlen=2), 
               appearance = list(default="rhs",lhs="whole milk"),
               control = list(verbose=FALSE))
rules<-sort(rules, decreasing=TRUE,by="confidence")
inspect(rules[1:5])
    lhs             rhs                support    confidence lift     count
[1] {whole milk} => {other vegetables} 0.07483477 0.2928770  1.513634 736  
[2] {whole milk} => {rolls/buns}       0.05663447 0.2216474  1.205032 557  
[3] {whole milk} => {yogurt}           0.05602440 0.2192598  1.571735 551  
[4] {whole milk} => {root vegetables}  0.04890696 0.1914047  1.756031 481  
[5] {whole milk} => {tropical fruit}   0.04229792 0.1655392  1.577595 416  

Visualising the Rules as a Graph

In [53]:
rules<- head(sort(groc.rules, by="lift"), 10)
In [54]:
inspect(rules)
     lhs                                  rhs                support   
[1]  {citrus fruit,other vegetables}   => {root vegetables}  0.01037112
[2]  {tropical fruit,other vegetables} => {root vegetables}  0.01230300
[3]  {beef}                            => {root vegetables}  0.01738688
[4]  {citrus fruit,root vegetables}    => {other vegetables} 0.01037112
[5]  {tropical fruit,root vegetables}  => {other vegetables} 0.01230300
[6]  {other vegetables,whole milk}     => {root vegetables}  0.02318251
[7]  {whole milk,curd}                 => {yogurt}           0.01006609
[8]  {root vegetables,rolls/buns}      => {other vegetables} 0.01220132
[9]  {root vegetables,yogurt}          => {other vegetables} 0.01291307
[10] {tropical fruit,whole milk}       => {yogurt}           0.01514997
     confidence lift     count
[1]  0.3591549  3.295045 102  
[2]  0.3427762  3.144780 121  
[3]  0.3313953  3.040367 171  
[4]  0.5862069  3.029608 102  
[5]  0.5845411  3.020999 121  
[6]  0.3097826  2.842082 228  
[7]  0.3852140  2.761356  99  
[8]  0.5020921  2.594890 120  
[9]  0.5000000  2.584078 127  
[10] 0.3581731  2.567516 149  
In [55]:
saveAsGraph(rules, file = "rules.graphml")
In [56]:
# load igraph before reading the exported GraphML file
require(igraph)
g <- read_graph("rules.graphml", format = "graphml")
In [58]:
# plot the rule graph; the edge.* arguments control edge width and arrow size
plot(g, edge.width = 3, edge.arrow.size = 0.5)

A graph display of rules can be useful for spotting higher-level themes and patterns.
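As an alternative to the GraphML round trip through igraph, arulesViz can also draw a rule graph directly; a minimal sketch using the top-10 rule set from above:

# graph visualisation of the ten highest-lift rules
plot(rules, method = "graph")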

Rules in Non-Transactional Data: Exploring Segments

In [71]:
seg.df <- read.csv("http://goo.gl/qw303p")
In [72]:
summary(seg.df)
      age           gender        income            kids        ownHome   
 Min.   :19.26   Female:157   Min.   : -5183   Min.   :0.00   ownNo :159  
 1st Qu.:33.01   Male  :143   1st Qu.: 39656   1st Qu.:0.00   ownYes:141  
 Median :39.49                Median : 52014   Median :1.00               
 Mean   :41.20                Mean   : 50937   Mean   :1.27               
 3rd Qu.:47.90                3rd Qu.: 61403   3rd Qu.:2.00               
 Max.   :80.49                Max.   :114278   Max.   :7.00               
  subscribe         Segment   
 subNo :260   Moving up : 70  
 subYes: 40   Suburb mix:100  
              Travelers : 80  
              Urban hip : 50  
                              
                              

Association rules work with discrete data, yet seg.df includes three continuous (or quasi-continuous) variables: age, income, and kids. We must convert those to discrete factors before using them with association rules in the arules package.

In [73]:
seg.fac <- seg.df
In [74]:
seg.fac$age <- cut(seg.fac$age,breaks=c(0,25,35,55,65,100),
                   labels=c("19-24", "25-34", "35-54", "55-64", "65+"),
                    right=FALSE, ordered_result=TRUE)
In [75]:
summary(seg.fac$age)
19-24 25-34 35-54 55-64   65+ 
   38    58   152    38    14 
In [76]:
seg.fac$income <- cut(seg.fac$income,
     breaks=c(-100000, 40000, 70000, 1000000),
     labels=c("Low", "Medium", "High"),
     right=FALSE, ordered_result=TRUE)
seg.fac$kids <- cut(seg.fac$kids,
     breaks=c(0, 1, 2, 3, 100),
     labels=c("No kids", "1 kid", "2 kids", "3+ kids"),
     right=FALSE, ordered_result=TRUE)
summary(seg.fac)
    age         gender       income         kids       ownHome     subscribe  
 19-24: 38   Female:157   Low   : 77   No kids:121   ownNo :159   subNo :260  
 25-34: 58   Male  :143   Medium:183   1 kid  : 70   ownYes:141   subYes: 40  
 35-54:152                High  : 40   2 kids : 51                            
 55-64: 38                             3+ kids: 58                            
 65+  : 14                                                                    
       Segment   
 Moving up : 70  
 Suburb mix:100  
 Travelers : 80  
 Urban hip : 50  
                 

A data frame in a suitable discrete (factor) format can be converted for use in arules with as(..., "transactions"), which codes it as transaction data:

In [78]:
seg.trans <- as(seg.fac, "transactions")
summary(seg.trans)
transactions as itemMatrix in sparse format with
 300 rows (elements/itemsets/transactions) and
 22 columns (items) and a density of 0.3181818 

most frequent items:
subscribe=subNo   income=Medium   ownHome=ownNo   gender=Female       age=35-54 
            260             183             159             157             152 
        (Other) 
           1189 

element (itemset/transaction) length distribution:
sizes
  7 
300 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      7       7       7       7       7       7 

includes extended item information - examples:
     labels variables levels
1 age=19-24       age  19-24
2 age=25-34       age  25-34
3 age=35-54       age  35-54

includes extended transaction information - examples:
  transactionID
1             1
2             2
3             3
In [80]:
seg.rules <- apriori(seg.trans, parameter=list(support=0.1, conf=0.4, target="rules"))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.4    0.1    1 none FALSE            TRUE       5     0.1      1
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 30 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[22 item(s), 300 transaction(s)] done [0.00s].
sorting and recoding items ... [21 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 done [0.00s].
writing ... [579 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
In [81]:
summary(seg.rules)
set of 579 rules

rule length distribution (lhs + rhs):sizes
  1   2   3   4   5 
  8 109 263 174  25 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   3.000   3.171   4.000   5.000 

summary of quality measures:
    support         confidence          lift            count       
 Min.   :0.1000   Min.   :0.4026   Min.   :0.7941   Min.   : 30.00  
 1st Qu.:0.1100   1st Qu.:0.5200   1st Qu.:1.0000   1st Qu.: 33.00  
 Median :0.1300   Median :0.6522   Median :1.1002   Median : 39.00  
 Mean   :0.1632   Mean   :0.6847   Mean   :1.4715   Mean   : 48.95  
 3rd Qu.:0.1867   3rd Qu.:0.8421   3rd Qu.:1.4896   3rd Qu.: 56.00  
 Max.   :0.8667   Max.   :1.0000   Max.   :6.0000   Max.   :260.00  

mining info:
      data ntransactions support confidence
 seg.trans           300     0.1        0.4
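Before targeting a particular outcome, it can help to scan the strongest associations overall. A minimal sketch (output omitted) that lists the ten segment rules with the highest lift:

# order the 579 segment rules by lift and show the top ten
inspect(head(sort(seg.rules, by="lift"), 10))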

Targeting Customers
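Suppose we now want to know which customer characteristics are associated with medium income. We restrict the right-hand side of the rules to income=Medium and sort the results by confidence: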

In [103]:
rules<-apriori(data=seg.trans, parameter=list(supp=0.1,conf = 0.4), 
               appearance = list(default="lhs",rhs="income=Medium"),
               control = list(verbose=FALSE))
In [110]:
rules<-sort(rules, decreasing=TRUE,by="confidence")
inspect(rules[1:10])
     lhs                     rhs               support confidence     lift count
[1]  {age=35-54,                                                                
      kids=3+ kids}       => {income=Medium} 0.1166667  0.9459459 1.550731    35
[2]  {age=35-54,                                                                
      kids=3+ kids,                                                             
      subscribe=subNo}    => {income=Medium} 0.1066667  0.9411765 1.542912    32
[3]  {age=35-54,                                                                
      subscribe=subNo,                                                          
      Segment=Moving up}  => {income=Medium} 0.1000000  0.9375000 1.536885    30
[4]  {age=35-54,                                                                
      Segment=Moving up}  => {income=Medium} 0.1333333  0.9302326 1.524971    40
[5]  {Segment=Moving up}  => {income=Medium} 0.2033333  0.8714286 1.428571    61
[6]  {gender=Female,                                                            
      Segment=Moving up}  => {income=Medium} 0.1400000  0.8571429 1.405152    42
[7]  {subscribe=subNo,                                                          
      Segment=Moving up}  => {income=Medium} 0.1600000  0.8571429 1.405152    48
[8]  {age=35-54,                                                                
      ownHome=ownNo}      => {income=Medium} 0.2166667  0.8552632 1.402071    65
[9]  {age=35-54,                                                                
      gender=Male,                                                              
      Segment=Suburb mix} => {income=Medium} 0.1166667  0.8536585 1.399440    35
[10] {age=35-54,                                                                
      ownHome=ownNo,                                                            
      subscribe=subNo}    => {income=Medium} 0.1833333  0.8461538 1.387137    55
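The highest-confidence rules suggest that respondents aged 35-54 with 3 or more kids, and those in the Moving up segment, are especially likely to fall in the medium income group (confidence of roughly 0.85-0.95 with lift around 1.4-1.55), which could inform how income-related offers are targeted.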