Introduction
Durning last internship, I used apriori to improve the recommendation system of my company, KKday. KKday is a leading e-commerce travel platform in South East Asia. In this article, I am going to use the sales data from KKday to illustrate the performance and difference of apriori and distance-based recommendation system.
Data Preprocessing
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require('DT')
## Loading required package: DT
The Original Data
head(data,10)
## X prod_oid user_id
## 1 1 8332 132591
## 2 2 3598 132591
## 3 3 8332 58804
## 4 4 12808 58804
## 5 5 18073 55631
## 6 6 9987 55631
## 7 7 17772 55631
## 8 8 12049 142657
## 9 9 17756 142657
## 10 10 7505 198893
The data contains products in past 14 days which have been ordered by users. ‘product_oid’ is the code of each product, while user_id is an user who bought the product. For example, we can say user ‘132591’ has bought the product ‘8332’ and ‘3598’ together.
Combine Orders by Same User
df <- data %>%
group_by(user_id) %>%
summarise(prod_oid_paste = paste(prod_oid, collapse=" "),
n = n()) %>% filter(n >1) #remove order that only contain one product
head(df)
## # A tibble: 6 x 3
## user_id prod_oid_paste n
## <int> <chr> <int>
## 1 1 2808 11971 2
## 2 2 10999 2173 2
## 3 3 2689 2686 2
## 4 4 17696 13367 2
## 5 5 18350 2716 2
## 6 6 17975 18576 2
retail.list <- df
#Seperate by ""
retail.list <- sapply(retail.list$prod_oid_paste,strsplit, " ")
head(retail.list)
## $`2808 11971`
## [1] "2808" "11971"
##
## $`10999 2173`
## [1] "10999" "2173"
##
## $`2689 2686`
## [1] "2689" "2686"
##
## $`17696 13367`
## [1] "17696" "13367"
##
## $`18350 2716`
## [1] "18350" "2716"
##
## $`17975 18576`
## [1] "17975" "18576"
The data has to be transformed into ‘transaction’ type in order to fit in the packages, arules, which we will explore later.
Therefore, we have to group data by user_id and paste the orders together. Now, the original dataframe has been transformed into a list, and each row means a market basket ordered by a certain customer.
Transfrom Order Data into Transaction Data
require(arules)
## Loading required package: arules
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
retail.trans <- as(retail.list, "transactions")
summary(retail.trans)
## transactions as itemMatrix in sparse format with
## 222845 rows (elements/itemsets/transactions) and
## 5067 columns (items) and a density of 0.000498097
##
## most frequent items:
## 2674 2685 7423 8332 2173 (Other)
## 12115 11950 9853 9149 7985 511377
##
## element (itemset/transaction) length distribution:
## sizes
## 2 3 4 5 6 7 8 9 10 11
## 148855 47268 16913 6111 2241 899 331 114 54 22
## 12 13 14 15 16 17 23 26
## 14 10 5 2 2 2 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.524 3.000 26.000
##
## includes extended item information - examples:
## labels
## 1 10000
## 2 10005
## 3 10007
##
## includes extended transaction information - examples:
## transactionID
## 1 2808 11971
## 2 10999 2173
## 3 2689 2686
By transforming into transactions data and using the summary function, we can see product ‘2674’ is the most frequent product which appeared in 12115 customers’ orders. And the median product in customers’ orders is 2 -> at least 50 % of people have only two product in each order.
Implementing Association Rules
Depends on researcher’s experience and the purpose, we have to set three parameters in arules: confidence, support, and lift, to extract meaningful patterns.
Here we are going to set support, confidence as thresholds, which is common in most research.
Setting Parameters to extract frequent patterns
sup = 0.0001
conf = 0.1
retail.rules <- apriori(retail.trans, parameter=list(supp=sup, conf=conf))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 1e-04 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[5067 item(s), 222845 transaction(s)] done [0.07s].
## sorting and recoding items ... [1488 item(s)] done [0.01s].
## creating transaction tree ... done [0.10s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [4400 rule(s)] done [0.00s].
## creating S4 object ... done [0.04s].
Knowing that there are thousands of products on KKday, we set a conservative threshould to secure that we could have enough patterns to make recommendation. We get 4400 association rules eventually.
Visualize the support/confidence distribution with arulesViz
# install.packages("arulesViz")
library(arulesViz)
arulesViz::plotly_arules(retail.rules)
Which products are most frequently bought together
retail.conf <- head(sort(retail.rules, by="confidence"), 20)
inspect(retail.conf)
## lhs rhs support confidence lift count
## [1] {12225} => {11359} 0.0001256479 1.0000000 1714.19231 28
## [2] {1446,2358,2768} => {2914} 0.0001032108 0.9583333 60.44715 23
## [3] {1859} => {1853} 0.0001435976 0.9411765 134.53269 32
## [4] {2791} => {2143} 0.0001256479 0.9032258 426.43931 28
## [5] {11912} => {9735} 0.0002468083 0.9016393 252.73688 55
## [6] {1446,2612,7423} => {2914} 0.0001076982 0.8888889 56.06692 24
## [7] {1862} => {1853} 0.0009872333 0.8627451 123.32164 220
## [8] {12784,2674} => {2685} 0.0002782203 0.8611111 16.05810 62
## [9] {1446,2768,7423} => {2914} 0.0002064215 0.8518519 53.73080 46
## [10] {1446,2878,7423} => {2914} 0.0001211604 0.8437500 53.21978 27
## [11] {1446,2930} => {2914} 0.0001615473 0.8372093 52.80722 36
## [12] {4239,4627,5260} => {5925} 0.0001121856 0.8333333 61.22788 25
## [13] {1446,2768,2878} => {2914} 0.0001076982 0.8275862 52.20024 24
## [14] {8416} => {8427} 0.0002019341 0.8181818 302.36771 45
## [15] {5260} => {5925} 0.0069958940 0.8140992 59.81469 1559
## [16] {1446,1922} => {2914} 0.0002243712 0.8064516 50.86717 50
## [17] {12822,2674} => {2685} 0.0001256479 0.8000000 14.91849 28
## [18] {1446,2612} => {2914} 0.0003006574 0.7976190 50.31005 67
## [19] {2479,2843} => {2312} 0.0001750095 0.7959184 83.62396 39
## [20] {2583,2674} => {2685} 0.0008840225 0.7943548 14.81322 197
By sorting the rules from highest confidence, we can see that the product ‘12225’ has 100% chance being bought together with 11359, yet this combination only has been bought for 28 times, which only count for 0.01% of total orders. On the other hand, Product ‘5260’ has 81% chance being bought together with ‘5925’, and ‘1559’ people have bought the same bundle. This means that we could to recommend ‘1559’ to any those customer who has bought ‘5925’.
What are the patterns that contain most products
rules_length <- lapply(LIST(retail.rules@lhs), function(x) unlist(strsplit(x, " ")))
retail_long <- head(retail.rules[order(lengths(rules_length),retail.rules@quality$confidence,decreasing = TRUE)],20)
inspect(retail_long)
## lhs rhs support confidence lift count
## [1] {1446,2358,2768} => {2914} 0.0001032108 0.9583333 60.44715 23
## [2] {1446,2612,7423} => {2914} 0.0001076982 0.8888889 56.06692 24
## [3] {1446,2768,7423} => {2914} 0.0002064215 0.8518519 53.73080 46
## [4] {1446,2878,7423} => {2914} 0.0001211604 0.8437500 53.21978 27
## [5] {4239,4627,5260} => {5925} 0.0001121856 0.8333333 61.22788 25
## [6] {1446,2768,2878} => {2914} 0.0001076982 0.8275862 52.20024 24
## [7] {13903,13952,17927} => {13900} 0.0001615473 0.7826087 71.06782 36
## [8] {2459,2843,4016} => {2312} 0.0001211604 0.7714286 81.05092 27
## [9] {13900,13903,17927} => {13952} 0.0001615473 0.7659574 73.10055 36
## [10] {2467,2843,4016} => {2312} 0.0001525724 0.7391304 77.65748 34
## [11] {13903,13952,7452} => {13900} 0.0001525724 0.7391304 67.11961 34
## [12] {11731,13903,13952} => {13900} 0.0002557832 0.7307692 66.36034 57
## [13] {4227,4627,5260} => {5925} 0.0001032108 0.7187500 52.80905 23
## [14] {2459,2467,2843} => {2312} 0.0001211604 0.6923077 72.73800 27
## [15] {2459,2467,4016} => {2312} 0.0002198838 0.6901408 72.51034 49
## [16] {2287,2685,8332} => {2674} 0.0001929592 0.6825397 12.55473 43
## [17] {13900,13903,17688} => {13952} 0.0001346227 0.6521739 62.24141 30
## [18] {17756,2674,8332} => {2685} 0.0001211604 0.6428571 11.98808 27
## [19] {11731,13900,17688} => {13952} 0.0001032108 0.6388889 60.97353 23
## [20] {11847,18608,2322} => {18073} 0.0001480850 0.6226415 20.75580 33
We can see that the first 4 patterns of rhs are product ‘2914’, meaning that these products are often being bought together.
Network Graph
plot(retail.rules, method="graph", control=list(type="items"))
## Warning: Unknown control parameters: type
## Available control parameters (with default values):
## main = Graph for 100 rules
## nodeColors = c("#66CC6680", "#9999CC80")
## nodeCol = c("#EE0000FF", "#EE0303FF", "#EE0606FF", "#EE0909FF", "#EE0C0CFF", "#EE0F0FFF", "#EE1212FF", "#EE1515FF", "#EE1818FF", "#EE1B1BFF", "#EE1E1EFF", "#EE2222FF", "#EE2525FF", "#EE2828FF", "#EE2B2BFF", "#EE2E2EFF", "#EE3131FF", "#EE3434FF", "#EE3737FF", "#EE3A3AFF", "#EE3D3DFF", "#EE4040FF", "#EE4444FF", "#EE4747FF", "#EE4A4AFF", "#EE4D4DFF", "#EE5050FF", "#EE5353FF", "#EE5656FF", "#EE5959FF", "#EE5C5CFF", "#EE5F5FFF", "#EE6262FF", "#EE6666FF", "#EE6969FF", "#EE6C6CFF", "#EE6F6FFF", "#EE7272FF", "#EE7575FF", "#EE7878FF", "#EE7B7BFF", "#EE7E7EFF", "#EE8181FF", "#EE8484FF", "#EE8888FF", "#EE8B8BFF", "#EE8E8EFF", "#EE9191FF", "#EE9494FF", "#EE9797FF", "#EE9999FF", "#EE9B9BFF", "#EE9D9DFF", "#EE9F9FFF", "#EEA0A0FF", "#EEA2A2FF", "#EEA4A4FF", "#EEA5A5FF", "#EEA7A7FF", "#EEA9A9FF", "#EEABABFF", "#EEACACFF", "#EEAEAEFF", "#EEB0B0FF", "#EEB1B1FF", "#EEB3B3FF", "#EEB5B5FF", "#EEB7B7FF", "#EEB8B8FF", "#EEBABAFF", "#EEBCBCFF", "#EEBDBDFF", "#EEBFBFFF", "#EEC1C1FF", "#EEC3C3FF", "#EEC4C4FF", "#EEC6C6FF", "#EEC8C8FF", "#EEC9C9FF", "#EECBCBFF", "#EECDCDFF", "#EECFCFFF", "#EED0D0FF", "#EED2D2FF", "#EED4D4FF", "#EED5D5FF", "#EED7D7FF", "#EED9D9FF", "#EEDBDBFF", "#EEDCDCFF", "#EEDEDEFF", "#EEE0E0FF", "#EEE1E1FF", "#EEE3E3FF", "#EEE5E5FF", "#EEE7E7FF", "#EEE8E8FF", "#EEEAEAFF", "#EEECECFF", "#EEEEEEFF")
## edgeCol = c("#474747FF", "#494949FF", "#4B4B4BFF", "#4D4D4DFF", "#4F4F4FFF", "#515151FF", "#535353FF", "#555555FF", "#575757FF", "#595959FF", "#5B5B5BFF", "#5E5E5EFF", "#606060FF", "#626262FF", "#646464FF", "#666666FF", "#686868FF", "#6A6A6AFF", "#6C6C6CFF", "#6E6E6EFF", "#707070FF", "#727272FF", "#747474FF", "#767676FF", "#787878FF", "#7A7A7AFF", "#7C7C7CFF", "#7E7E7EFF", "#808080FF", "#828282FF", "#848484FF", "#868686FF", "#888888FF", "#8A8A8AFF", "#8C8C8CFF", "#8D8D8DFF", "#8F8F8FFF", "#919191FF", "#939393FF", "#959595FF", "#979797FF", "#999999FF", "#9A9A9AFF", "#9C9C9CFF", "#9E9E9EFF", "#A0A0A0FF", "#A2A2A2FF", "#A3A3A3FF", "#A5A5A5FF", "#A7A7A7FF", "#A9A9A9FF", "#AAAAAAFF", "#ACACACFF", "#AEAEAEFF", "#AFAFAFFF", "#B1B1B1FF", "#B3B3B3FF", "#B4B4B4FF", "#B6B6B6FF", "#B7B7B7FF", "#B9B9B9FF", "#BBBBBBFF", "#BCBCBCFF", "#BEBEBEFF", "#BFBFBFFF", "#C1C1C1FF", "#C2C2C2FF", "#C3C3C4FF", "#C5C5C5FF", "#C6C6C6FF", "#C8C8C8FF", "#C9C9C9FF", "#CACACAFF", "#CCCCCCFF", "#CDCDCDFF", "#CECECEFF", "#CFCFCFFF", "#D1D1D1FF", "#D2D2D2FF", "#D3D3D3FF", "#D4D4D4FF", "#D5D5D5FF", "#D6D6D6FF", "#D7D7D7FF", "#D8D8D8FF", "#D9D9D9FF", "#DADADAFF", "#DBDBDBFF", "#DCDCDCFF", "#DDDDDDFF", "#DEDEDEFF", "#DEDEDEFF", "#DFDFDFFF", "#E0E0E0FF", "#E0E0E0FF", "#E1E1E1FF", "#E1E1E1FF", "#E2E2E2FF", "#E2E2E2FF", "#E2E2E2FF")
## alpha = 0.5
## cex = 1
## itemLabels = TRUE
## labelCol = #000000B3
## measureLabels = FALSE
## precision = 3
## layout = NULL
## layoutParams = list()
## arrowSize = 0.5
## engine = igraph
## plot = TRUE
## plot_options = list()
## max = 100
## verbose = FALSE
## Warning: plot: Too many rules supplied. Only plotting the best 100 rules
## using 'support' (change control parameter max if needed)
The network graph shows associations between selected products. Larger circles imply higher support, while red circles imply higher lift.
The most popular order was of ‘2674’ and ‘2685’, another popular orders were ‘2689’ and ‘2685’
- If someone buys ‘17899’, he is likely to have bought 4051 as well
Relatively many people buy ‘5260’ along with ‘sliced cheese’5925’ (1559 times)
Try Pratical Work with Recommendation
Form the Recommendation Function
next_buy = function(new_basket){
it_new_basket = as(list(new_basket), "itemMatrix")
# find all rules, where the lhs is a subset of the current new_basket
rulesMatchLHS <- is.subset(retail.rules@lhs,it_new_basket)
# and the rhs is NOT a subset of the current new_basket (so that some items are left as potential recommendation)
suitableRules <- rulesMatchLHS & !(is.subset(retail.rules@rhs,it_new_basket))
possible_recomed = retail.rules[as.logical(suitableRules)]
if(length(possible_recomed)==0){
print('No association rules pass the threshold, consider other possible combination ')
} # Report no applicable association rules for the threshold
else{
lst1 <- lapply(LIST(possible_recomed@lhs), function(x) unlist(strsplit(x, " ")))
recommendations <- strsplit(LIST(possible_recomed@rhs)[[order(possible_recomed@quality$confidence, decreasing = TRUE)[1]]],split=" ")# Report the possible recommendation by the order of confidence
print("Potential recommendations are...")
inspect(possible_recomed[order(possible_recomed@quality$confidence, decreasing = TRUE),])
recommendations <- lapply(recommendations,function(x){paste(x,collapse=" ")})
recommendations <- as.character(recommendations)
print(paste("Best recommendation would be ",recommendations))
return(as.character(recommendations))
}
}
Enter any ProductID you are interested in
#Test any basket you like
new_basket = c('19252')
next_buy(new_basket)
## [1] "Potential recommendations are..."
## lhs rhs support confidence lift count
## [1] {19252} => {10759} 0.0006102897 0.2063733 11.51747 136
## [2] {19252} => {18565} 0.0004038682 0.1365706 17.23333 90
## [3] {19252} => {18566} 0.0003545065 0.1198786 28.35918 79
## [1] "Best recommendation would be 10759"
## [1] "10759"
We can see the best recommendation after buying product 19252 is 10759
Reverse Recommendation: What Products Induce the purchasement of the Objective??
target_one = 11731
rules<-apriori(data=retail.trans, parameter=list(supp=sup,conf = conf,minlen=2),
appearance = list(default="lhs",rhs=target_one),
control = list(verbose=F))
rules<-head(sort(rules, decreasing=TRUE,by="confidence"),20)
inspect(rules)
## lhs rhs support confidence lift count
## [1] {13903,17688,3014} => {11731} 0.0001166730 0.5306122 22.94669 26
## [2] {17688,3014} => {11731} 0.0002288586 0.4146341 17.93114 51
## [3] {12159,17688} => {11731} 0.0001615473 0.3711340 16.04994 36
## [4] {13903,17688,17927} => {11731} 0.0002512957 0.3636364 15.72570 56
## [5] {13903,17927,7452} => {11731} 0.0001256479 0.3636364 15.72570 28
## [6] {17927,1922} => {11731} 0.0001256479 0.3456790 14.94912 28
## [7] {13952,17688} => {11731} 0.0001794970 0.3418803 14.78485 40
## [8] {13903,13931,17688} => {11731} 0.0002602706 0.3411765 14.75441 58
## [9] {13900,13952,17688} => {11731} 0.0001032108 0.3333333 14.41523 23
## [10] {13903,13931,3014} => {11731} 0.0001166730 0.3291139 14.23276 26
## [11] {13903,17688} => {11731} 0.0010949315 0.3124200 13.51081 244
## [12] {17688,1922} => {11731} 0.0001076982 0.3076923 13.30636 24
## [13] {13900,17688} => {11731} 0.0001615473 0.3076923 13.30636 36
## [14] {13903,3014} => {11731} 0.0004352801 0.2912913 12.59709 97
## [15] {17688,18127} => {11731} 0.0001211604 0.2903226 12.55520 27
## [16] {17688,17927} => {11731} 0.0004577172 0.2857143 12.35591 102
## [17] {17688,7452} => {11731} 0.0002468083 0.2849741 12.32390 55
## [18] {13931,17688} => {11731} 0.0005070789 0.2783251 12.03636 113
## [19] {13903,1922} => {11731} 0.0002737329 0.2735426 11.82954 61
## [20] {13903,17927,18127} => {11731} 0.0001121856 0.2717391 11.75154 25
Model Comparisons: Cosine Distance vs. Apriori
Import Cosine Distance Recommendation unction
## [1] "500/5067"
## [1] "1000/5067"
## [1] "1500/5067"
## [1] "2000/5067"
## [1] "2500/5067"
## [1] "3000/5067"
## [1] "3500/5067"
## [1] "4000/5067"
## [1] "4500/5067"
## [1] "5000/5067"
## [1] "prod_user matrix finish"
## [1] "start calculating cosine_score_matrix"
## [1] "500/5067"
## [1] "1000/5067"
## [1] "1500/5067"
## [1] "2000/5067"
## [1] "2500/5067"
## [1] "3000/5067"
## [1] "3500/5067"
## [1] "4000/5067"
## [1] "4500/5067"
## [1] "5000/5067"
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2675367 142.9 4703850 251.3 4703850 251.3
## Vcells 33405769 254.9 106181062 810.1 110426324 842.5
## [1] "End of calculation cosine_score_matrix"
Experiments on Different Combination
#Apriori result
#Recommendation after buying
target_one = c(1446)
next_buy(target_one)
## [1] "Potential recommendations are..."
## lhs rhs support confidence lift count
## [1] {1446} => {2914} 0.0060131482 0.7600681 47.941514 1340
## [2] {1446} => {7423} 0.0011846799 0.1497448 3.386773 264
## [3] {1446} => {2768} 0.0010231327 0.1293250 7.054941 228
## [4] {1446} => {12245} 0.0008750477 0.1106069 19.750160 195
## [1] "Best recommendation would be 2914"
## [1] "2914"
#Cosine Result
head(DT_similar_prod[DT_similar_prod$prod_oid==target_one,],10)
## similar_prod_oid score prod_oid
## 361 2914 0.53691659 1446
## 362 12245 0.13146228 1446
## 363 11293 0.09779082 1446
## 364 2768 0.08495964 1446
## 365 7423 0.06334226 1446
## 366 2878 0.06225813 1446
## 367 2881 0.05335821 1446
## 368 11866 0.04767070 1446
## 369 2603 0.04125782 1446
## 370 2948 0.04000302 1446
We can see for product 1446, both models recommends the customer to buy 2914 first.In the top 4 product for each recommendation, only one product is different. Looks like apriori doesn’t give a surprise?
Experiments on Different Combination
#Apriori result
#Recommendation after buying
target_one = c(1446,2914)
head(next_buy(target_one),10)
## [1] "Potential recommendations are..."
## lhs rhs support confidence lift count
## [1] {2914} => {2768} 0.0028181023 0.1777526 9.696764 628
## [2] {2914} => {7423} 0.0026879670 0.1695443 3.834578 599
## [3] {1446,2914} => {7423} 0.0009199219 0.1529851 3.460059 205
## [4] {1446} => {7423} 0.0011846799 0.1497448 3.386773 264
## [5] {1446} => {2768} 0.0010231327 0.1293250 7.054941 228
## [6] {2914} => {2716} 0.0020417779 0.1287857 8.968518 455
## [7] {1446,2914} => {2768} 0.0007673495 0.1276119 6.961489 171
## [8] {1446} => {12245} 0.0008750477 0.1106069 19.750160 195
## [9] {2914} => {2358} 0.0016513720 0.1041608 4.734185 368
## [1] "Best recommendation would be 2768"
## [1] "2768"
#Cosine Result
head(DT_similar_prod[DT_similar_prod$prod_oid==target_one,],10)
## similar_prod_oid score prod_oid
## 361 2914 0.53691659 1446
## 363 11293 0.09779082 1446
## 365 7423 0.06334226 1446
## 367 2881 0.05335821 1446
## 369 2603 0.04125782 1446
## 371 2612 0.03873839 1446
## 373 2716 0.03536536 1446
## 375 1922 0.02978948 1446
## 377 2764 0.02885791 1446
## 379 3738 0.02644679 1446
Apriori is particularly powerful when recommending complements. When there is 1 item in basket, we could hardly find difference of it and other. But when we have two products in our baskets, such as 2914 and 1446, then it can recommend new product 7423 base on this set.