Introduction

Durning last internship, I used apriori to improve the recommendation system of my company, KKday. KKday is a leading e-commerce travel platform in South East Asia. In this article, I am going to use the sales data from KKday to illustrate the performance and difference of apriori and distance-based recommendation system.


Data Preprocessing

require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require('DT')
## Loading required package: DT

The Original Data

head(data,10)
##     X prod_oid user_id
## 1   1     8332  132591
## 2   2     3598  132591
## 3   3     8332   58804
## 4   4    12808   58804
## 5   5    18073   55631
## 6   6     9987   55631
## 7   7    17772   55631
## 8   8    12049  142657
## 9   9    17756  142657
## 10 10     7505  198893

The data contains products in past 14 days which have been ordered by users. ‘product_oid’ is the code of each product, while user_id is an user who bought the product. For example, we can say user ‘132591’ has bought the product ‘8332’ and ‘3598’ together.

Combine Orders by Same User

df <- data %>%
  group_by(user_id) %>%
  summarise(prod_oid_paste = paste(prod_oid, collapse=" "),
            n = n()) %>% filter(n >1) #remove order that only contain one product
head(df)
## # A tibble: 6 x 3
##   user_id prod_oid_paste     n
##     <int> <chr>          <int>
## 1       1 2808 11971         2
## 2       2 10999 2173         2
## 3       3 2689 2686          2
## 4       4 17696 13367        2
## 5       5 18350 2716         2
## 6       6 17975 18576        2
retail.list <-  df
#Seperate by ""
retail.list <- sapply(retail.list$prod_oid_paste,strsplit, " ")
head(retail.list)
## $`2808 11971`
## [1] "2808"  "11971"
## 
## $`10999 2173`
## [1] "10999" "2173" 
## 
## $`2689 2686`
## [1] "2689" "2686"
## 
## $`17696 13367`
## [1] "17696" "13367"
## 
## $`18350 2716`
## [1] "18350" "2716" 
## 
## $`17975 18576`
## [1] "17975" "18576"

The data has to be transformed into ‘transaction’ type in order to fit in the packages, arules, which we will explore later.

Therefore, we have to group data by user_id and paste the orders together. Now, the original dataframe has been transformed into a list, and each row means a market basket ordered by a certain customer.

Transfrom Order Data into Transaction Data

require(arules)
## Loading required package: arules
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
retail.trans <- as(retail.list, "transactions")
summary(retail.trans)
## transactions as itemMatrix in sparse format with
##  222845 rows (elements/itemsets/transactions) and
##  5067 columns (items) and a density of 0.000498097 
## 
## most frequent items:
##    2674    2685    7423    8332    2173 (Other) 
##   12115   11950    9853    9149    7985  511377 
## 
## element (itemset/transaction) length distribution:
## sizes
##      2      3      4      5      6      7      8      9     10     11 
## 148855  47268  16913   6111   2241    899    331    114     54     22 
##     12     13     14     15     16     17     23     26 
##     14     10      5      2      2      2      1      1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.524   3.000  26.000 
## 
## includes extended item information - examples:
##   labels
## 1  10000
## 2  10005
## 3  10007
## 
## includes extended transaction information - examples:
##   transactionID
## 1    2808 11971
## 2    10999 2173
## 3     2689 2686

By transforming into transactions data and using the summary function, we can see product ‘2674’ is the most frequent product which appeared in 12115 customers’ orders. And the median product in customers’ orders is 2 -> at least 50 % of people have only two product in each order.

Implementing Association Rules

Depends on researcher’s experience and the purpose, we have to set three parameters in arules: confidence, support, and lift, to extract meaningful patterns.

Here we are going to set support, confidence as thresholds, which is common in most research.

Setting Parameters to extract frequent patterns

sup = 0.0001
conf = 0.1
retail.rules <- apriori(retail.trans, parameter=list(supp=sup, conf=conf))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5   1e-04      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 22 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[5067 item(s), 222845 transaction(s)] done [0.07s].
## sorting and recoding items ... [1488 item(s)] done [0.01s].
## creating transaction tree ... done [0.10s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [4400 rule(s)] done [0.00s].
## creating S4 object  ... done [0.04s].

Knowing that there are thousands of products on KKday, we set a conservative threshould to secure that we could have enough patterns to make recommendation. We get 4400 association rules eventually.

Visualize the support/confidence distribution with arulesViz

# install.packages("arulesViz")
library(arulesViz)
arulesViz::plotly_arules(retail.rules)

This interactive visualization tools can help us determine the parameters. By observing the distribution and the number of rules, we can see whether to increase the threshold or not.

Which products are most frequently bought together

retail.conf <- head(sort(retail.rules, by="confidence"), 20)
inspect(retail.conf)
##      lhs                 rhs     support      confidence lift       count
## [1]  {12225}          => {11359} 0.0001256479 1.0000000  1714.19231   28 
## [2]  {1446,2358,2768} => {2914}  0.0001032108 0.9583333    60.44715   23 
## [3]  {1859}           => {1853}  0.0001435976 0.9411765   134.53269   32 
## [4]  {2791}           => {2143}  0.0001256479 0.9032258   426.43931   28 
## [5]  {11912}          => {9735}  0.0002468083 0.9016393   252.73688   55 
## [6]  {1446,2612,7423} => {2914}  0.0001076982 0.8888889    56.06692   24 
## [7]  {1862}           => {1853}  0.0009872333 0.8627451   123.32164  220 
## [8]  {12784,2674}     => {2685}  0.0002782203 0.8611111    16.05810   62 
## [9]  {1446,2768,7423} => {2914}  0.0002064215 0.8518519    53.73080   46 
## [10] {1446,2878,7423} => {2914}  0.0001211604 0.8437500    53.21978   27 
## [11] {1446,2930}      => {2914}  0.0001615473 0.8372093    52.80722   36 
## [12] {4239,4627,5260} => {5925}  0.0001121856 0.8333333    61.22788   25 
## [13] {1446,2768,2878} => {2914}  0.0001076982 0.8275862    52.20024   24 
## [14] {8416}           => {8427}  0.0002019341 0.8181818   302.36771   45 
## [15] {5260}           => {5925}  0.0069958940 0.8140992    59.81469 1559 
## [16] {1446,1922}      => {2914}  0.0002243712 0.8064516    50.86717   50 
## [17] {12822,2674}     => {2685}  0.0001256479 0.8000000    14.91849   28 
## [18] {1446,2612}      => {2914}  0.0003006574 0.7976190    50.31005   67 
## [19] {2479,2843}      => {2312}  0.0001750095 0.7959184    83.62396   39 
## [20] {2583,2674}      => {2685}  0.0008840225 0.7943548    14.81322  197

By sorting the rules from highest confidence, we can see that the product ‘12225’ has 100% chance being bought together with 11359, yet this combination only has been bought for 28 times, which only count for 0.01% of total orders. On the other hand, Product ‘5260’ has 81% chance being bought together with ‘5925’, and ‘1559’ people have bought the same bundle. This means that we could to recommend ‘1559’ to any those customer who has bought ‘5925’.

What are the patterns that contain most products

rules_length <- lapply(LIST(retail.rules@lhs), function(x) unlist(strsplit(x, " ")))
retail_long <- head(retail.rules[order(lengths(rules_length),retail.rules@quality$confidence,decreasing = TRUE)],20)
inspect(retail_long)
##      lhs                    rhs     support      confidence lift     count
## [1]  {1446,2358,2768}    => {2914}  0.0001032108 0.9583333  60.44715 23   
## [2]  {1446,2612,7423}    => {2914}  0.0001076982 0.8888889  56.06692 24   
## [3]  {1446,2768,7423}    => {2914}  0.0002064215 0.8518519  53.73080 46   
## [4]  {1446,2878,7423}    => {2914}  0.0001211604 0.8437500  53.21978 27   
## [5]  {4239,4627,5260}    => {5925}  0.0001121856 0.8333333  61.22788 25   
## [6]  {1446,2768,2878}    => {2914}  0.0001076982 0.8275862  52.20024 24   
## [7]  {13903,13952,17927} => {13900} 0.0001615473 0.7826087  71.06782 36   
## [8]  {2459,2843,4016}    => {2312}  0.0001211604 0.7714286  81.05092 27   
## [9]  {13900,13903,17927} => {13952} 0.0001615473 0.7659574  73.10055 36   
## [10] {2467,2843,4016}    => {2312}  0.0001525724 0.7391304  77.65748 34   
## [11] {13903,13952,7452}  => {13900} 0.0001525724 0.7391304  67.11961 34   
## [12] {11731,13903,13952} => {13900} 0.0002557832 0.7307692  66.36034 57   
## [13] {4227,4627,5260}    => {5925}  0.0001032108 0.7187500  52.80905 23   
## [14] {2459,2467,2843}    => {2312}  0.0001211604 0.6923077  72.73800 27   
## [15] {2459,2467,4016}    => {2312}  0.0002198838 0.6901408  72.51034 49   
## [16] {2287,2685,8332}    => {2674}  0.0001929592 0.6825397  12.55473 43   
## [17] {13900,13903,17688} => {13952} 0.0001346227 0.6521739  62.24141 30   
## [18] {17756,2674,8332}   => {2685}  0.0001211604 0.6428571  11.98808 27   
## [19] {11731,13900,17688} => {13952} 0.0001032108 0.6388889  60.97353 23   
## [20] {11847,18608,2322}  => {18073} 0.0001480850 0.6226415  20.75580 33

We can see that the first 4 patterns of rhs are product ‘2914’, meaning that these products are often being bought together.

Network Graph

plot(retail.rules, method="graph", control=list(type="items"))
## Warning: Unknown control parameters: type
## Available control parameters (with default values):
## main  =  Graph for 100 rules
## nodeColors    =  c("#66CC6680", "#9999CC80")
## nodeCol   =  c("#EE0000FF", "#EE0303FF", "#EE0606FF", "#EE0909FF", "#EE0C0CFF", "#EE0F0FFF", "#EE1212FF", "#EE1515FF", "#EE1818FF", "#EE1B1BFF", "#EE1E1EFF", "#EE2222FF", "#EE2525FF", "#EE2828FF", "#EE2B2BFF", "#EE2E2EFF", "#EE3131FF", "#EE3434FF", "#EE3737FF", "#EE3A3AFF", "#EE3D3DFF", "#EE4040FF", "#EE4444FF", "#EE4747FF", "#EE4A4AFF", "#EE4D4DFF", "#EE5050FF", "#EE5353FF", "#EE5656FF", "#EE5959FF", "#EE5C5CFF", "#EE5F5FFF", "#EE6262FF", "#EE6666FF", "#EE6969FF", "#EE6C6CFF", "#EE6F6FFF", "#EE7272FF", "#EE7575FF",  "#EE7878FF", "#EE7B7BFF", "#EE7E7EFF", "#EE8181FF", "#EE8484FF", "#EE8888FF", "#EE8B8BFF", "#EE8E8EFF", "#EE9191FF", "#EE9494FF", "#EE9797FF", "#EE9999FF", "#EE9B9BFF", "#EE9D9DFF", "#EE9F9FFF", "#EEA0A0FF", "#EEA2A2FF", "#EEA4A4FF", "#EEA5A5FF", "#EEA7A7FF", "#EEA9A9FF", "#EEABABFF", "#EEACACFF", "#EEAEAEFF", "#EEB0B0FF", "#EEB1B1FF", "#EEB3B3FF", "#EEB5B5FF", "#EEB7B7FF", "#EEB8B8FF", "#EEBABAFF", "#EEBCBCFF", "#EEBDBDFF", "#EEBFBFFF", "#EEC1C1FF", "#EEC3C3FF", "#EEC4C4FF", "#EEC6C6FF", "#EEC8C8FF",  "#EEC9C9FF", "#EECBCBFF", "#EECDCDFF", "#EECFCFFF", "#EED0D0FF", "#EED2D2FF", "#EED4D4FF", "#EED5D5FF", "#EED7D7FF", "#EED9D9FF", "#EEDBDBFF", "#EEDCDCFF", "#EEDEDEFF", "#EEE0E0FF", "#EEE1E1FF", "#EEE3E3FF", "#EEE5E5FF", "#EEE7E7FF", "#EEE8E8FF", "#EEEAEAFF", "#EEECECFF", "#EEEEEEFF")
## edgeCol   =  c("#474747FF", "#494949FF", "#4B4B4BFF", "#4D4D4DFF", "#4F4F4FFF", "#515151FF", "#535353FF", "#555555FF", "#575757FF", "#595959FF", "#5B5B5BFF", "#5E5E5EFF", "#606060FF", "#626262FF", "#646464FF", "#666666FF", "#686868FF", "#6A6A6AFF", "#6C6C6CFF", "#6E6E6EFF", "#707070FF", "#727272FF", "#747474FF", "#767676FF", "#787878FF", "#7A7A7AFF", "#7C7C7CFF", "#7E7E7EFF", "#808080FF", "#828282FF", "#848484FF", "#868686FF", "#888888FF", "#8A8A8AFF", "#8C8C8CFF", "#8D8D8DFF", "#8F8F8FFF", "#919191FF", "#939393FF",  "#959595FF", "#979797FF", "#999999FF", "#9A9A9AFF", "#9C9C9CFF", "#9E9E9EFF", "#A0A0A0FF", "#A2A2A2FF", "#A3A3A3FF", "#A5A5A5FF", "#A7A7A7FF", "#A9A9A9FF", "#AAAAAAFF", "#ACACACFF", "#AEAEAEFF", "#AFAFAFFF", "#B1B1B1FF", "#B3B3B3FF", "#B4B4B4FF", "#B6B6B6FF", "#B7B7B7FF", "#B9B9B9FF", "#BBBBBBFF", "#BCBCBCFF", "#BEBEBEFF", "#BFBFBFFF", "#C1C1C1FF", "#C2C2C2FF", "#C3C3C4FF", "#C5C5C5FF", "#C6C6C6FF", "#C8C8C8FF", "#C9C9C9FF", "#CACACAFF", "#CCCCCCFF", "#CDCDCDFF", "#CECECEFF", "#CFCFCFFF", "#D1D1D1FF",  "#D2D2D2FF", "#D3D3D3FF", "#D4D4D4FF", "#D5D5D5FF", "#D6D6D6FF", "#D7D7D7FF", "#D8D8D8FF", "#D9D9D9FF", "#DADADAFF", "#DBDBDBFF", "#DCDCDCFF", "#DDDDDDFF", "#DEDEDEFF", "#DEDEDEFF", "#DFDFDFFF", "#E0E0E0FF", "#E0E0E0FF", "#E1E1E1FF", "#E1E1E1FF", "#E2E2E2FF", "#E2E2E2FF", "#E2E2E2FF")
## alpha     =  0.5
## cex   =  1
## itemLabels    =  TRUE
## labelCol  =  #000000B3
## measureLabels     =  FALSE
## precision     =  3
## layout    =  NULL
## layoutParams  =  list()
## arrowSize     =  0.5
## engine    =  igraph
## plot  =  TRUE
## plot_options  =  list()
## max   =  100
## verbose   =  FALSE
## Warning: plot: Too many rules supplied. Only plotting the best 100 rules
## using 'support' (change control parameter max if needed)

The network graph shows associations between selected products. Larger circles imply higher support, while red circles imply higher lift.

Try Pratical Work with Recommendation

Form the Recommendation Function

next_buy = function(new_basket){
  it_new_basket = as(list(new_basket), "itemMatrix")
# find all rules, where the lhs is a subset of the current new_basket
  rulesMatchLHS <- is.subset(retail.rules@lhs,it_new_basket)
# and the rhs is NOT a subset of the current new_basket (so that some items are left as potential recommendation)
  suitableRules <-  rulesMatchLHS & !(is.subset(retail.rules@rhs,it_new_basket))
  possible_recomed = retail.rules[as.logical(suitableRules)]
  if(length(possible_recomed)==0){
    print('No association rules pass the threshold, consider other possible combination ')
  } # Report no applicable association rules for the threshold
  else{
    lst1 <- lapply(LIST(possible_recomed@lhs), function(x) unlist(strsplit(x, " ")))
    recommendations <- strsplit(LIST(possible_recomed@rhs)[[order(possible_recomed@quality$confidence, decreasing = TRUE)[1]]],split=" ")# Report the possible recommendation by the order of confidence
    print("Potential recommendations are...")
    inspect(possible_recomed[order(possible_recomed@quality$confidence, decreasing = TRUE),])
     recommendations <- lapply(recommendations,function(x){paste(x,collapse=" ")})
    recommendations <- as.character(recommendations)
    print(paste("Best recommendation would be ",recommendations))
    return(as.character(recommendations))
  }
}

Enter any ProductID you are interested in

#Test any basket you like
new_basket = c('19252')
next_buy(new_basket)
## [1] "Potential recommendations are..."
##     lhs        rhs     support      confidence lift     count
## [1] {19252} => {10759} 0.0006102897 0.2063733  11.51747 136  
## [2] {19252} => {18565} 0.0004038682 0.1365706  17.23333  90  
## [3] {19252} => {18566} 0.0003545065 0.1198786  28.35918  79  
## [1] "Best recommendation would be  10759"
## [1] "10759"

We can see the best recommendation after buying product 19252 is 10759

Reverse Recommendation: What Products Induce the purchasement of the Objective??

target_one = 11731
rules<-apriori(data=retail.trans, parameter=list(supp=sup,conf = conf,minlen=2), 
               appearance = list(default="lhs",rhs=target_one),
               control = list(verbose=F))
rules<-head(sort(rules, decreasing=TRUE,by="confidence"),20)
inspect(rules)
##      lhs                    rhs     support      confidence lift     count
## [1]  {13903,17688,3014}  => {11731} 0.0001166730 0.5306122  22.94669  26  
## [2]  {17688,3014}        => {11731} 0.0002288586 0.4146341  17.93114  51  
## [3]  {12159,17688}       => {11731} 0.0001615473 0.3711340  16.04994  36  
## [4]  {13903,17688,17927} => {11731} 0.0002512957 0.3636364  15.72570  56  
## [5]  {13903,17927,7452}  => {11731} 0.0001256479 0.3636364  15.72570  28  
## [6]  {17927,1922}        => {11731} 0.0001256479 0.3456790  14.94912  28  
## [7]  {13952,17688}       => {11731} 0.0001794970 0.3418803  14.78485  40  
## [8]  {13903,13931,17688} => {11731} 0.0002602706 0.3411765  14.75441  58  
## [9]  {13900,13952,17688} => {11731} 0.0001032108 0.3333333  14.41523  23  
## [10] {13903,13931,3014}  => {11731} 0.0001166730 0.3291139  14.23276  26  
## [11] {13903,17688}       => {11731} 0.0010949315 0.3124200  13.51081 244  
## [12] {17688,1922}        => {11731} 0.0001076982 0.3076923  13.30636  24  
## [13] {13900,17688}       => {11731} 0.0001615473 0.3076923  13.30636  36  
## [14] {13903,3014}        => {11731} 0.0004352801 0.2912913  12.59709  97  
## [15] {17688,18127}       => {11731} 0.0001211604 0.2903226  12.55520  27  
## [16] {17688,17927}       => {11731} 0.0004577172 0.2857143  12.35591 102  
## [17] {17688,7452}        => {11731} 0.0002468083 0.2849741  12.32390  55  
## [18] {13931,17688}       => {11731} 0.0005070789 0.2783251  12.03636 113  
## [19] {13903,1922}        => {11731} 0.0002737329 0.2735426  11.82954  61  
## [20] {13903,17927,18127} => {11731} 0.0001121856 0.2717391  11.75154  25

Model Comparisons: Cosine Distance vs. Apriori

Import Cosine Distance Recommendation unction

## [1] "500/5067"
## [1] "1000/5067"
## [1] "1500/5067"
## [1] "2000/5067"
## [1] "2500/5067"
## [1] "3000/5067"
## [1] "3500/5067"
## [1] "4000/5067"
## [1] "4500/5067"
## [1] "5000/5067"
## [1] "prod_user matrix finish"
## [1] "start calculating cosine_score_matrix"
## [1] "500/5067"
## [1] "1000/5067"
## [1] "1500/5067"
## [1] "2000/5067"
## [1] "2500/5067"
## [1] "3000/5067"
## [1] "3500/5067"
## [1] "4000/5067"
## [1] "4500/5067"
## [1] "5000/5067"
##            used  (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells  2675367 142.9    4703850 251.3   4703850 251.3
## Vcells 33405769 254.9  106181062 810.1 110426324 842.5
## [1] "End of calculation cosine_score_matrix"

Experiments on Different Combination

#Apriori result
#Recommendation after buying

target_one = c(1446)
next_buy(target_one)
## [1] "Potential recommendations are..."
##     lhs       rhs     support      confidence lift      count
## [1] {1446} => {2914}  0.0060131482 0.7600681  47.941514 1340 
## [2] {1446} => {7423}  0.0011846799 0.1497448   3.386773  264 
## [3] {1446} => {2768}  0.0010231327 0.1293250   7.054941  228 
## [4] {1446} => {12245} 0.0008750477 0.1106069  19.750160  195 
## [1] "Best recommendation would be  2914"
## [1] "2914"
#Cosine Result
head(DT_similar_prod[DT_similar_prod$prod_oid==target_one,],10)
##     similar_prod_oid      score prod_oid
## 361             2914 0.53691659     1446
## 362            12245 0.13146228     1446
## 363            11293 0.09779082     1446
## 364             2768 0.08495964     1446
## 365             7423 0.06334226     1446
## 366             2878 0.06225813     1446
## 367             2881 0.05335821     1446
## 368            11866 0.04767070     1446
## 369             2603 0.04125782     1446
## 370             2948 0.04000302     1446

We can see for product 1446, both models recommends the customer to buy 2914 first.In the top 4 product for each recommendation, only one product is different. Looks like apriori doesn’t give a surprise?

Experiments on Different Combination

#Apriori result
#Recommendation after buying

target_one = c(1446,2914)
head(next_buy(target_one),10)
## [1] "Potential recommendations are..."
##     lhs            rhs     support      confidence lift      count
## [1] {2914}      => {2768}  0.0028181023 0.1777526   9.696764 628  
## [2] {2914}      => {7423}  0.0026879670 0.1695443   3.834578 599  
## [3] {1446,2914} => {7423}  0.0009199219 0.1529851   3.460059 205  
## [4] {1446}      => {7423}  0.0011846799 0.1497448   3.386773 264  
## [5] {1446}      => {2768}  0.0010231327 0.1293250   7.054941 228  
## [6] {2914}      => {2716}  0.0020417779 0.1287857   8.968518 455  
## [7] {1446,2914} => {2768}  0.0007673495 0.1276119   6.961489 171  
## [8] {1446}      => {12245} 0.0008750477 0.1106069  19.750160 195  
## [9] {2914}      => {2358}  0.0016513720 0.1041608   4.734185 368  
## [1] "Best recommendation would be  2768"
## [1] "2768"
#Cosine Result
head(DT_similar_prod[DT_similar_prod$prod_oid==target_one,],10)
##     similar_prod_oid      score prod_oid
## 361             2914 0.53691659     1446
## 363            11293 0.09779082     1446
## 365             7423 0.06334226     1446
## 367             2881 0.05335821     1446
## 369             2603 0.04125782     1446
## 371             2612 0.03873839     1446
## 373             2716 0.03536536     1446
## 375             1922 0.02978948     1446
## 377             2764 0.02885791     1446
## 379             3738 0.02644679     1446

Apriori is particularly powerful when recommending complements. When there is 1 item in basket, we could hardly find difference of it and other. But when we have two products in our baskets, such as 2914 and 1446, then it can recommend new product 7423 base on this set.