We use both transactional and non-transactional data to explore association rules, starting with transactional data.
The Groceries data set, which ships with the arules package, contains one month of real point-of-sale transaction data from a typical grocery outlet. Each record is a market basket of items purchased together; to keep the data manageable (and to mask individual products), items are aggregated into broad categories such as "whole milk".
library(arules)
library(arulesViz)
data("Groceries")
summary(Groceries)
The summary() shows us that the data comprise $9,835$ transactions with $169$ unique items. Of the $1,662,115$ possible item/transaction intersections, only $2.6\%$ are nonzero (the density), because most items are not purchased in most transactions. Whole milk appears most frequently, occurring in $2,513$ baskets, or more than a quarter of all transactions. The $20$ most frequent items are plotted in the figure below.
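The headline figures from summary() are easy to check by hand; a quick sketch using only the counts reported above:

```r
# Verify summary() figures from the counts above:
# 9,835 transactions x 169 items gives the number of possible cells,
# and whole milk's 2,513 baskets give its share of transactions
9835 * 169    # 1662115 possible item/transaction intersections
2513 / 9835   # whole milk appears in about 25.6% of baskets
```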
Using inspect(head(Groceries)) we see a few examples from the baskets. For example, the second transaction includes fruit, yogurt, and coffee, while the third transaction is just a container of milk. In this output, notice that the item sets are structured with brackets, a visual clue that they reflect a new “transactions” data type that we examine in more detail below.
Let's explore the data before we make any rules:
itemFrequencyPlot(Groceries,topN=20,type="absolute")
We now use apriori(data, parameter=...) to find association rules with the apriori algorithm. At a conceptual level, the apriori algorithm searches through the item sets that occur frequently in a list of transactions.
To control the extent that apriori() searches, we use the parameter=list() argument to restrict the algorithm to rules with a minimum support of 0.01 and a minimum confidence of 0.3:
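Before running apriori() on Groceries, it may help to see what support and confidence actually measure. A minimal base-R sketch on a toy basket list (not the Groceries data) for a candidate rule {milk} => {bread}:

```r
# Toy transactions: each basket is a character vector of items
baskets <- list(c("milk", "bread"),
                c("milk", "butter"),
                c("bread", "butter"),
                c("milk", "bread", "butter"),
                c("bread"))
n <- length(baskets)

# Fraction of baskets containing all the given items
has <- function(items) sapply(baskets, function(b) all(items %in% b))

supp.AB <- sum(has(c("milk", "bread"))) / n   # support of {milk, bread}
supp.A  <- sum(has("milk")) / n               # support of {milk}
supp.B  <- sum(has("bread")) / n              # support of {bread}

conf <- supp.AB / supp.A   # confidence of {milk} => {bread}
lift <- conf / supp.B      # lift of the rule
c(support = supp.AB, confidence = conf, lift = lift)
# support 0.40, confidence 0.67, lift 0.83
```

apriori() automates this counting efficiently over every frequent item set and discards rules that fall below the chosen thresholds.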
groc.rules <- apriori(Groceries, parameter=list(supp=0.01, conf=0.3))
Note that the values for the support and confidence parameters are found largely by experience (in other words, by trial and error) and should be expected to vary from industry to industry and data set to data set. We arrived at support=0.01 and confidence=0.3 after finding that they produced a modest number of rules suitable for an example. In real cases, you would adapt those values to your data and business case.
To interpret the results of apriori() above, there are two key things to examine.
First, check the number of items going into the rules, shown on the output line "sorting and recoding items ..."; in this case the rules found use $88$ of the $169$ items. If this number is too small (only a tiny set of your items) or too large (almost all of them), then you might wish to adjust the support and confidence levels.
Next, check the number of rules found, as indicated on the “writing ...” line. In this case, the algorithm found 125 rules. Once again, if this number is too low, it suggests the need to lower the support or confidence levels; if it is too high (such as many more rules than items), you might increase the support or confidence levels.
To get a sense of the rule distribution, we load the arulesViz package and then plot() the rule set, which charts the rules according to confidence (Y axis) by support (X axis) and scales the darkness of points to indicate lift.
plot(groc.rules)
We see that most rules involve item combinations that occur infrequently (that is, they have low support) while confidence is relatively smoothly distributed.
Once we have a rule set from apriori(), we use inspect(rules) to examine the association rules. The complete list of $125$ rules from above is too long to examine here, so we select subsets with high lift, first for confidence > 0.58 and then for lift > 3.
inspect(subset(groc.rules, confidence > 0.58))
This rule tells us that the combination {tropical fruit, root vegetables} frequently includes {other vegetables} when it occurs (confidence = 0.58), and that the full set {tropical fruit, root vegetables, other vegetables} appears in about $1.2\%$ of baskets (support = 0.0123). The combination occurs about 3 times more often than we would expect from the individual incidence rates of {tropical fruit, root vegetables} and {other vegetables} considered separately (lift = 3.03).
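To see where the lift figure comes from, recall that lift is the rule's confidence divided by the support of the right hand side. Assuming {other vegetables} appears in $1,903$ of the $9,835$ transactions (its count in the summary() output), we can roughly reproduce the reported value:

```r
# Lift by hand: confidence of the rule divided by support of the RHS,
# assuming 1,903 baskets contain {other vegetables}
conf     <- 0.585          # confidence of the rule
supp.rhs <- 1903 / 9835    # support of {other vegetables} alone
conf / supp.rhs            # lift, approximately 3.0
```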
Such information could be used in various ways. If we pair the transactions with customer information, we could use this for targeted mailings or email suggestions. Or for items often sold together, we could adjust the price and margins together; for instance, to put one item on sale while increasing the price on the other.
inspect(subset(groc.rules, lift > 3))
We find that five of the rules in our set have lift greater than $3.0$. The first rule tells us that if a transaction contains {beef}, it is relatively more likely ($33\%$ likely, to be specific) to also contain {root vegetables}, a category that we assume includes items such as potatoes and onions. That combination appears in $1.7\%$ of baskets (the support), and the lift tells us that the combination occurs $3\times$ more often than one would expect from the individual incidence rates alone.
A store might form several ideas on the basis of such information. For instance, the store might create a display for potatoes and onions near the beef counter to encourage shoppers who are examining beef to purchase those vegetables or consider recipes with them. It might also suggest putting coupons for beef in the root vegetable area, or featuring recipe cards somewhere in the store.
A common goal in market basket analysis is to find rules with high lift. We can find such rules easily by sorting the larger rule set by lift, using sort() to order the rules and taking the top 10 from the head().
First, though, suppose you want to answer the question: what are customers likely to buy if they purchase whole milk?
We set the left hand side to be "whole milk" and find its consequents. Note the following:
1. We lower the confidence to 0.15, since stricter thresholds such as 0.8 return no rules.
2. We set a minimum length of 2 to avoid rules with an empty left hand side.
rules <- apriori(data=Groceries,
                 parameter=list(supp=0.001, conf=0.15, minlen=2),
                 appearance=list(default="rhs", lhs="whole milk"),
                 control=list(verbose=FALSE))
rules <- sort(rules, decreasing=TRUE, by="confidence")
inspect(rules[1:5])
Returning to the full grocery rule set, we take the 10 rules with the highest lift:
rules <- head(sort(groc.rules, by="lift"), 10)
inspect(rules)
require(igraph)
saveAsGraph(rules, file="rules.graphml")
g <- read_graph("rules.graphml", format="graphml")
plot(g, edge.width=3, edge.arrow.size=0.5)
A graph display of the rules can be useful for spotting higher-level themes and patterns.
We now turn to non-transactional data, loading a consumer segmentation data set:
seg.df <- read.csv("http://goo.gl/qw303p")
summary(seg.df)
Association rules work with discrete data, yet seg.df includes three continuous (or quasi-continuous) variables: age, income, and kids. To use them with association rules in the arules package, we must convert them to discrete factors.
seg.fac <- seg.df
seg.fac$age <- cut(seg.fac$age,breaks=c(0,25,35,55,65,100),
labels=c("19-24", "25-34", "35-54", "55-64", "65+"),
right=FALSE, ordered_result=TRUE)
summary(seg.fac$age)
seg.fac$income <- cut(seg.fac$income,
breaks=c(-100000, 40000, 70000, 1000000),
labels=c("Low", "Medium", "High"),
right=FALSE, ordered_result=TRUE)
seg.fac$kids <- cut(seg.fac$kids,
breaks=c(0, 1, 2, 3, 100),
labels=c("No kids", "1 kid", "2 kids", "3+ kids"),
right=FALSE, ordered_result=TRUE)
summary(seg.fac)
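One detail worth checking in the binning above is the effect of right=FALSE: each break starts a new left-closed interval [a, b), so a respondent aged exactly 25 lands in the 25-34 bin rather than 19-24. A small check with hypothetical ages:

```r
# With right=FALSE the intervals are [0,25), [25,35), [35,55), ...
# so a boundary value such as 25 falls into the bin it opens
cut(c(24, 25, 35), breaks=c(0, 25, 35, 55, 65, 100),
    labels=c("19-24", "25-34", "35-54", "55-64", "65+"),
    right=FALSE)
# [1] 19-24 25-34 35-54
```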
A data frame in suitable discrete (factor) format can be converted for use in arules with as(..., "transactions"), which codes it as transaction data:
seg.trans <- as(seg.fac, "transactions")
summary(seg.trans)
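Conceptually, the conversion turns each factor level into a binary item named "variable=level", and each row of seg.fac becomes the set of items it matches. A base-R sketch of that recoding on a single hypothetical row (not taken from seg.df):

```r
# Each row becomes the set of "variable=level" items it matches,
# mirroring the item labels that summary(seg.trans) reports
df <- data.frame(age = "25-34", income = "Medium", kids = "No kids",
                 stringsAsFactors = FALSE)
items <- paste(names(df), unlist(df), sep = "=")
items
# [1] "age=25-34"     "income=Medium" "kids=No kids"
```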
seg.rules <- apriori(seg.trans, parameter=list(support=0.1, conf=0.4, target="rules"))
summary(seg.rules)
Finally, we look for rules that predict medium income, restricting the right hand side to income=Medium:
rules <- apriori(data=seg.trans, parameter=list(supp=0.1, conf=0.4),
                 appearance=list(default="lhs", rhs="income=Medium"),
                 control=list(verbose=FALSE))
rules <- sort(rules, decreasing=TRUE, by="confidence")
inspect(rules[1:10])