The data set has 3 columns and 38765 rows of purchase orders from grocery stores. I choose this data since Market Basket Analysis is one of the techniques to see associations between items by looking for combinations of items being purchased together frequently. In this project, I want to learn about Association Rules to analyze transaction data.
Association rules are statements that help to find relationships between data.
An association rule has two parts, an antecedent and a consequent. An antecedent is an item found in the data. A consequent is an item that is found in combination with the antecedent. Association rules are created by analyzing data for frequent relationships and using support and confidence to identify the most important relationships. Support is an indication of how frequently the items being purchased. Confidence states the number of times the antecedent/consequent statements found. More detail on arule.
library("plyr")
library("Matrix")
library("arules", warn.conflicts = FALSE)
library("grid")
library("arulesViz")
library("colorspace")
groceries = read.csv("C:/Users/kanni/Downloads/Groceries_dataset.csv")
class(groceries)
## [1] "data.frame"
dim(groceries)
## [1] 38765 3
str(groceries)
## 'data.frame': 38765 obs. of 3 variables:
## $ Member_number : int 1808 2552 2300 1187 3037 4941 4501 3803 2762 4119 ...
## $ Date : chr "21-07-2015" "05-01-2015" "19-09-2015" "12-12-2015" ...
## $ itemDescription: chr "tropical fruit" "whole milk" "pip fruit" "other vegetables" ...
head(groceries)
## Member_number Date itemDescription
## 1 1808 21-07-2015 tropical fruit
## 2 2552 05-01-2015 whole milk
## 3 2300 19-09-2015 pip fruit
## 4 1187 12-12-2015 other vegetables
## 5 3037 01-02-2015 whole milk
## 6 4941 14-02-2015 rolls/buns
# check for NA
sum(is.na(groceries))
## [1] 0
# converting `Member_number`
sorting = groceries[order(groceries$Member_number),]
sorting$Member_number = as.numeric(sorting$Member_number)
str(sorting)
## 'data.frame': 38765 obs. of 3 variables:
## $ Member_number : num 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
## $ Date : chr "27-05-2015" "24-07-2015" "15-03-2015" "25-11-2015" ...
## $ itemDescription: chr "soda" "canned beer" "sausage" "sausage" ...
# grouping same `Member_number` and `Date`
itemsList = ddply(sorting, c("Member_number", "Date"), function(df1) paste(df1$itemDescription, collapse = ","))
head(itemsList, 10)
## Member_number Date V1
## 1 1000 15-03-2015 sausage,whole milk,semi-finished bread,yogurt
## 2 1000 24-06-2014 whole milk,pastry,salty snack
## 3 1000 24-07-2015 canned beer,misc. beverages
## 4 1000 25-11-2015 sausage,hygiene articles
## 5 1000 27-05-2015 soda,pickled vegetables
## 6 1001 02-05-2015 frankfurter,curd
## 7 1001 07-02-2014 sausage,whole milk,rolls/buns
## 8 1001 12-12-2014 whole milk,soda
## 9 1001 14-04-2015 beef,white bread
## 10 1001 20-01-2015 frankfurter,soda,whipped/sour cream
# setting `Member_number` and `Date` to NULL
itemsList$Member_number <- NULL
itemsList$Date <- NULL
colnames(itemsList) <- c("itemsList")
# writing out itemsList
write.csv(itemsList,"ItemsList.csv", quote = FALSE, row.names = TRUE)
head(itemsList)
## itemsList
## 1 sausage,whole milk,semi-finished bread,yogurt
## 2 whole milk,pastry,salty snack
## 3 canned beer,misc. beverages
## 4 sausage,hygiene articles
## 5 soda,pickled vegetables
## 6 frankfurter,curd
# converting itemsList to basket type
basket = read.transactions(file="ItemsList.csv", rm.duplicates= TRUE, format="basket",sep=",",cols=1);
## distribution of transactions with duplicates:
## items
## 1 2 3 4
## 662 39 5 1
print(basket)
## transactions in sparse format with
## 14964 transactions (rows) and
## 168 items (columns)
# removing quotes
basket@itemInfo$labels = gsub("\"","",basket@itemInfo$labels)
# displaying frequent items purchased
itemFrequencyPlot(basket,topN=20,type="absolute",main="Item Frequency")
Support = Probability (A and B) Support = (# of transactions involving A and B) / (total # of transactions).
Confidence = Probability (A and B) = P(A) Confidence = (# of transactions involving A and B) / (total # of transactions that have A).
# using apriori algorithm to create rules
basketRules <- apriori(basket, parameter = list(minlen=2, sup = 0.001, conf = 0.06, target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.06 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14964 transaction(s)] done [0.01s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [361 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# print basketRules generated
print(length(basketRules))
## [1] 361
summary(basketRules)
## set of 361 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 334 27
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.075 2.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001002 Min. :0.06014 Min. :0.005346 Min. :0.5195
## 1st Qu.:0.001337 1st Qu.:0.07473 1st Qu.:0.014836 1st Qu.:0.7575
## Median :0.001938 Median :0.08837 Median :0.021852 Median :0.8278
## Mean :0.002856 Mean :0.09562 Mean :0.031359 Mean :0.8883
## 3rd Qu.:0.003408 3rd Qu.:0.11392 3rd Qu.:0.037757 3rd Qu.:0.9547
## Max. :0.014836 Max. :0.25581 Max. :0.157912 Max. :2.1831
## count
## Min. : 15.00
## 1st Qu.: 20.00
## Median : 29.00
## Mean : 42.74
## 3rd Qu.: 51.00
## Max. :222.00
##
## mining info:
## data ntransactions support confidence
## basket 14964 0.001 0.06
inspect(basketRules[1:15])
## lhs rhs support confidence
## [1] {frozen fish} => {whole milk} 0.001069233 0.1568627
## [2] {seasonal products} => {rolls/buns} 0.001002406 0.1415094
## [3] {pot plants} => {other vegetables} 0.001002406 0.1282051
## [4] {pot plants} => {whole milk} 0.001002406 0.1282051
## [5] {pasta} => {whole milk} 0.001069233 0.1322314
## [6] {pickled vegetables} => {whole milk} 0.001002406 0.1119403
## [7] {packaged fruit/vegetables} => {rolls/buns} 0.001202887 0.1417323
## [8] {detergent} => {yogurt} 0.001069233 0.1240310
## [9] {detergent} => {rolls/buns} 0.001002406 0.1162791
## [10] {detergent} => {whole milk} 0.001403368 0.1627907
## [11] {semi-finished bread} => {other vegetables} 0.001002406 0.1056338
## [12] {semi-finished bread} => {whole milk} 0.001670676 0.1760563
## [13] {red/blush wine} => {rolls/buns} 0.001336541 0.1273885
## [14] {red/blush wine} => {other vegetables} 0.001136060 0.1082803
## [15] {flour} => {tropical fruit} 0.001069233 0.1095890
## coverage lift count
## [1] 0.006816359 0.9933534 16
## [2] 0.007083667 1.2864807 15
## [3] 0.007818765 1.0500611 15
## [4] 0.007818765 0.8118754 15
## [5] 0.008086073 0.8373723 16
## [6] 0.008954825 0.7088763 15
## [7] 0.008487036 1.2885066 18
## [8] 0.008620690 1.4443580 16
## [9] 0.008620690 1.0571081 15
## [10] 0.008620690 1.0308929 21
## [11] 0.009489441 0.8651911 15
## [12] 0.009489441 1.1148993 25
## [13] 0.010491847 1.1581057 20
## [14] 0.010491847 0.8868668 17
## [15] 0.009756750 1.6172489 16
# Rules visualization
plot(basketRules, jitter = 0)
# plotting first 40 rules
plot(basketRules[1:40], method="graph")
summary(basketRules)
shows there is total of 361 rules with mean support is equal to 0.28% and mean confidense is equal to 9.56% which states rule happens in 0.28% transactions and implication is 9.56%.
If lift = 1, then A and B are independent . If lift > 1, then A and B are positively correlated . If lift < 1, then A and B are negatively correlated.
Firstly let us try the eclat algorithm - to see most frequent itemsets. Below we will see the list of the most common items together with their individual support.
# use eclat to show most frequent items with support
frequentSets <- eclat(basket, parameter=list(supp=0.07, maxlen=10))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.07 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 1047
##
## create itemset ...
## set transactions ...[168 item(s), 14964 transaction(s)] done [0.01s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating sparse bit matrix ... [5 row(s), 14964 column(s)] done [0.00s].
## writing ... [5 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequentSets)
## items support transIdenticalToItemsets count
## [1] {whole milk} 0.15791232 2363 2363
## [2] {other vegetables} 0.12209302 1827 1827
## [3] {rolls/buns} 0.10999733 1646 1646
## [4] {soda} 0.09709971 1453 1453
## [5] {yogurt} 0.08587276 1285 1285
sodaRules = subset(basketRules, items %in% "soda")
inspect(head(sodaRules))
## lhs rhs support confidence coverage lift
## [1] {chewing gum} => {soda} 0.001002406 0.08333333 0.01202887 0.8582244
## [2] {specialty bar} => {soda} 0.001269714 0.09090909 0.01396685 0.9362448
## [3] {hygiene articles} => {soda} 0.001069233 0.07804878 0.01369955 0.8038004
## [4] {candy} => {soda} 0.001269714 0.08837209 0.01436782 0.9101170
## [5] {sliced cheese} => {soda} 0.001136060 0.08095238 0.01403368 0.8337037
## [6] {ice cream} => {soda} 0.001202887 0.07929515 0.01516974 0.8166364
## count
## [1] 15
## [2] 19
## [3] 16
## [4] 19
## [5] 17
## [6] 18
plot(sodaRules,method="graph",interactive=FALSE,shading="lift")
## Warning in plot.rules(sodaRules, method = "graph", interactive = FALSE, : The
## parameter interactive is deprecated. Use engine='interactive' instead.
sausageRules = subset(basketRules, items %in% "sausage")
inspect(head(sort(sausageRules, by ="lift")))
## lhs rhs support confidence coverage
## [1] {whole milk,yogurt} => {sausage} 0.001470195 0.13173653 0.011160118
## [2] {sausage,whole milk} => {yogurt} 0.001470195 0.16417910 0.008954825
## [3] {sausage,yogurt} => {whole milk} 0.001470195 0.25581395 0.005747126
## [4] {beverages} => {sausage} 0.001537022 0.09274194 0.016573109
## [5] {soda,whole milk} => {sausage} 0.001069233 0.09195402 0.011627907
## [6] {curd} => {sausage} 0.002940390 0.08730159 0.033680834
## lift count
## [1] 2.183062 22
## [2] 1.911888 22
## [3] 1.619975 22
## [4] 1.536866 23
## [5] 1.523810 16
## [6] 1.446712 44
plot(sausageRules,method="graph",interactive=FALSE,shading="lift")
## Warning in plot.rules(sausageRules, method = "graph", interactive = FALSE, : The
## parameter interactive is deprecated. Use engine='interactive' instead.
sausageRules
can be interpreted as such:
Each of the matrix cells can have different red shade depending on the lift value. Numbers on the axes are corresponding to the items listed before the matrix. For instance, the most red cell corresponds to the rule {whole milk,yogurt} => {sausage}, so sausage is most likely to be bought with whole milk and yogurt.
plot(sausageRules, method="matrix", measure=c("support","confidence"))
## Itemsets in Antecedent (LHS)
## [1] "{whole milk,yogurt}" "{sausage,yogurt}"
## [3] "{beverages}" "{soda,whole milk}"
## [5] "{curd}" "{sausage,whole milk}"
## [7] "{rolls/buns,whole milk}" "{rolls/buns,sausage}"
## [9] "{sliced cheese}" "{frozen meals}"
## [11] "{grapes}" "{frozen vegetables}"
## [13] "{bottled beer}" "{sausage,soda}"
## [15] "{misc. beverages}" "{yogurt}"
## [17] "{dessert}" "{pastry}"
## [19] "{soda}" "{salty snack}"
## [21] "{sausage}"
## Itemsets in Consequent (RHS)
## [1] "{other vegetables}" "{rolls/buns}" "{soda}"
## [4] "{whole milk}" "{sausage}" "{yogurt}"
plot(sausageRules, method="grouped", measure="support", control=list(col=sequential_hcl(100)))
Grouped matrix plot has the same data on the axes. Yet, the size reflects the support of the rule and the color reflects the lift of the rule.
inspect(head(sort(basketRules, by = "lift")))
## lhs rhs support confidence
## [1] {whole milk,yogurt} => {sausage} 0.001470195 0.13173653
## [2] {sausage,whole milk} => {yogurt} 0.001470195 0.16417910
## [3] {specialty chocolate} => {citrus fruit} 0.001403368 0.08786611
## [4] {sausage,yogurt} => {whole milk} 0.001470195 0.25581395
## [5] {flour} => {tropical fruit} 0.001069233 0.10958904
## [6] {beverages} => {sausage} 0.001537022 0.09274194
## coverage lift count
## [1] 0.011160118 2.183062 22
## [2] 0.008954825 1.911888 22
## [3] 0.015971665 1.653872 21
## [4] 0.005747126 1.619975 22
## [5] 0.009756750 1.617249 16
## [6] 0.016573109 1.536866 23
inspect(tail(sort(basketRules, by = "lift")))
## lhs rhs support confidence
## [1] {butter milk} => {other vegetables} 0.001269714 0.07224335
## [2] {sugar} => {other vegetables} 0.001269714 0.07169811
## [3] {pastry} => {other vegetables} 0.003675488 0.07105943
## [4] {misc. beverages} => {whole milk} 0.001403368 0.08898305
## [5] {specialty chocolate} => {whole milk} 0.001336541 0.08368201
## [6] {long life bakery product} => {other vegetables} 0.001136060 0.06343284
## coverage lift count
## [1] 0.01757551 0.5917074 19
## [2] 0.01770917 0.5872417 19
## [3] 0.05172414 0.5820106 55
## [4] 0.01577118 0.5634966 21
## [5] 0.01597167 0.5299270 20
## [6] 0.01790965 0.5195451 17
According to this basketRule
, the highest lift value is 2.18 which means that customers are most likely to buy sausage than any other product, while already having whole milk and yogurt in basket. The lowest lift value is 0.519 which means that customers are least likely to buy other vegetables than any other product, while already having long life bakery product in basket.