Categories
blog

Look-alike modelling

Using stratified sampling technique

Complete code available here: https://github.com/bassalat/lookalikes

Lookalike modelling is searching for a set of observations (i.e. customers in the marketing domain) that have a similar profile to a target group. In essence, we are trying the mimic the features of the target control to form a control group. This is useful for a variety of applications that include A/B tests, post-campaign analyses and finding a specific set of customers.

To illustrate this method, we’ll use a Credit card customer profile data-set available on Kaggle and use R to explore it:

https://www.kaggle.com/arjunbhasin2013/ccdata

CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE balance_bin purchases_bin cashadvance_bin payments_bin purchasestrx_bin combined_bin
C10001 40.90075 0.818182 95.40 0.00 95.40 0.000 0.166667 0.000000 0.083333 0.000000 0 2 1000 201.8021 139.5098 0.000000 12 (0,100] (0,100] (-1,0] (100,1e+03] (0,5] (0,100] (0,100] (-1,0]
C10002 3202.46742 0.909091 0.00 0.00 0.00 6442.945 0.000000 0.000000 0.000000 0.250000 4 0 7000 4103.0326 1072.3402 0.222222 12 (3e+03,5e+03] (-1,0] (5e+03,5e+04] (1e+03,5e+03] (-1,0] (3e+03,5e+03] (-1,0] (5e+03,5e+04]
C10003 2495.14886 1.000000 773.17 773.17 0.00 0.000 1.000000 1.000000 0.000000 0.000000 0 12 7500 622.0667 627.2848 0.000000 12 (2e+03,3e+03] (500,1e+03] (-1,0] (100,1e+03] (5,15] (2e+03,3e+03] (500,1e+03] (-1,0]
C10004 1666.67054 0.636364 1499.00 1499.00 0.00 205.788 0.083333 0.083333 0.000000 0.083333 1 1 7500 0.0000 NA 0.000000 12 (1e+03,2e+03] (1e+03,2e+03] (0,500] (-1,0] (0,5] (1e+03,2e+03] (1e+03,2e+03] (0,500]
C10005 817.71434 1.000000 16.00 16.00 0.00 0.000 0.083333 0.083333 0.000000 0.000000 0 1 1200 678.3348 244.7912 0.000000 12 (500,1e+03] (0,100] (-1,0] (100,1e+03] (0,5] (500,1e+03] (0,100] (-1,0]
C10006 1809.82875 1.000000 1333.28 0.00 1333.28 0.000 0.666667 0.000000 0.583333 0.000000 0 8 1800 1400.0578 2407.2460 0.000000 12 (1e+03,2e+03] (1e+03,2e+03] (-1,0] (1e+03,5e+03] (5,15] (1e+03,2e+03] (1e+03,2e+03] (-1,0]
C10007 627.26081 1.000000 7091.01 6402.63 688.38 0.000 1.000000 1.000000 1.000000 0.000000 0 64 13500 6354.3143 198.0659 1.000000 12 (500,1e+03] (5e+03,1e+04] (-1,0] (5e+03,1e+04] (15,100] (500,1e+03] (5e+03,1e+04] (-1,0]
C10008 1823.65274 1.000000 436.20 0.00 436.20 0.000 1.000000 0.000000 1.000000 0.000000 0 12 2300 679.0651 532.0340 0.000000 12 (1e+03,2e+03] (100,500] (-1,0] (100,1e+03] (5,15] (1e+03,2e+03] (100,500] (-1,0]
C10009 1014.92647 1.000000 861.49 661.49 200.00 0.000 0.333333 0.083333 0.250000 0.000000 0 5 7000 688.2786 311.9634 0.000000 12 (1e+03,2e+03] (500,1e+03] (-1,0] (100,1e+03] (0,5] (1e+03,2e+03] (500,1e+03] (-1,0]
C10010 152.22598 0.545455 1281.60 1281.60 0.00 0.000 0.166667 0.166667 0.000000 0.000000 0 3 11000 1164.7706 100.3023 0.000000 12 (100,500] (1e+03,2e+03] (-1,0] (1e+03,5e+03] (0,5] (100,500] (1e+03,2e+03] (-1,0]

Let's say a marketing campaign was run on the first 500 customers (target group) in the data-set. Now we need to find look-alike customers from the remaining part of the data-set. Choosing which variables will form the basis for lookalikes is a business domain question. Let's assume for this exercise, that we want to match the distribution of BALANCE, PURCHASES AND CASH_ADVANCE.

As the first step, we are going to get the joint distribution of these variables for the target group. Continuous variables will be binned to convert them to categorical variables so that all variables are of categorical type. Thereafter, a joint categorical variable will be formed and frequency distribution against it will be used.

The following steps will be used to make joint distribution bins:

# Make bins for chosen variables
df <- df %>% mutate(
balance_bin = cut(BALANCE,c(-1,0,100,500,1000,2000,3000,            5000,7000,10000,20000)),
purchases_bin = cut(PURCHASES,c(-1,0,100,500,1000,2000,5000,10000,50000)),
cashadvance_bin = cut(CASH_ADVANCE,c(-1,0,500,1000,2000,5000,50000)))

# Make a combined bin
df <- df %>% mutate(combined_bin = paste(balance_bin,
                                         purchases_bin,
                                         cashadvance_bin))

# Combined distribution of target group:
target_profile <- df[1:500,] %>% group_by(combined_bin) %>%
  summarise(customers=n()) %>%
  mutate(bin_wt = customers/nrow(df[1:500,]))
combined_bin customers bin_wt
(-1,0] (-1,0] (0,500] 1 0.002
(-1,0] (0,100] (-1,0] 1 0.002
(0,100] (-1,0] (0,500] 2 0.004
(0,100] (-1,0] (2e+03,5e+03] 2 0.004
(0,100] (0,100] (-1,0] 14 0.028
(0,100] (0,100] (1e+03,2e+03] 1 0.002
(0,100] (100,500] (-1,0] 30 0.060
(0,100] (100,500] (0,500] 1 0.002
(0,100] (1e+03,2e+03] (-1,0] 5 0.010
(0,100] (500,1e+03] (-1,0] 14 0.028
(0,100] (500,1e+03] (0,500] 1 0.002
(100,500] (-1,0] (1e+03,2e+03] 1 0.002
(100,500] (-1,0] (2e+03,5e+03] 1 0.002
(100,500] (-1,0] (500,1e+03] 3 0.006
(100,500] (-1,0] (5e+03,5e+04] 1 0.002
(100,500] (0,100] (-1,0] 1 0.002
(100,500] (0,100] (0,500] 1 0.002
(100,500] (100,500] (-1,0] 7 0.014
(100,500] (100,500] (1e+03,2e+03] 1 0.002
(100,500] (1e+03,2e+03] (-1,0] 23 0.046
(100,500] (1e+03,2e+03] (0,500] 2 0.004
(100,500] (1e+03,2e+03] (1e+03,2e+03] 1 0.002
(100,500] (2e+03,5e+03] (-1,0] 18 0.036
(100,500] (500,1e+03] (-1,0] 7 0.014
(100,500] (500,1e+03] (0,500] 1 0.002
(100,500] (5e+03,1e+04] (-1,0] 3 0.006
(1e+03,2e+03] (-1,0] (0,500] 16 0.032
(1e+03,2e+03] (-1,0] (1e+03,2e+03] 3 0.006
(1e+03,2e+03] (-1,0] (2e+03,5e+03] 3 0.006
(1e+03,2e+03] (-1,0] (500,1e+03] 2 0.004
(1e+03,2e+03] (0,100] (-1,0] 3 0.006
(1e+03,2e+03] (0,100] (0,500] 2 0.004
(1e+03,2e+03] (100,500] (-1,0] 7 0.014
(1e+03,2e+03] (100,500] (0,500] 6 0.012
(1e+03,2e+03] (100,500] (1e+03,2e+03] 2 0.004
(1e+03,2e+03] (100,500] (2e+03,5e+03] 1 0.002
(1e+03,2e+03] (1e+03,2e+03] (-1,0] 8 0.016
(1e+03,2e+03] (1e+03,2e+03] (0,500] 2 0.004
(1e+03,2e+03] (1e+03,2e+03] (1e+03,2e+03] 3 0.006
(1e+03,2e+03] (1e+03,2e+03] (2e+03,5e+03] 2 0.004
(1e+03,2e+03] (1e+03,2e+03] (500,1e+03] 1 0.002
(1e+03,2e+03] (2e+03,5e+03] (-1,0] 6 0.012
(1e+03,2e+03] (2e+03,5e+03] (0,500] 2 0.004
(1e+03,2e+03] (2e+03,5e+03] (1e+03,2e+03] 2 0.004
(1e+03,2e+03] (2e+03,5e+03] (500,1e+03] 1 0.002
(1e+03,2e+03] (2e+03,5e+03] (5e+03,5e+04] 1 0.002
(1e+03,2e+03] (500,1e+03] (-1,0] 13 0.026
(1e+03,2e+03] (500,1e+03] (0,500] 1 0.002
(1e+03,2e+03] (500,1e+03] (1e+03,2e+03] 2 0.004
(1e+03,2e+03] (500,1e+03] (2e+03,5e+03] 1 0.002
(1e+03,2e+03] (500,1e+03] (5e+03,5e+04] 1 0.002
(1e+03,2e+03] (5e+03,1e+04] (-1,0] 2 0.004
(1e+03,2e+03] (5e+03,1e+04] (5e+03,5e+04] 1 0.002
(1e+04,2e+04] (-1,0] (2e+03,5e+03] 2 0.004
(1e+04,2e+04] (100,500] (2e+03,5e+03] 1 0.002
(1e+04,2e+04] (100,500] (500,1e+03] 1 0.002
(1e+04,2e+04] (100,500] (5e+03,5e+04] 1 0.002
(1e+04,2e+04] (1e+04,5e+04] (-1,0] 1 0.002
(1e+04,2e+04] (1e+04,5e+04] (2e+03,5e+03] 1 0.002
(1e+04,2e+04] (2e+03,5e+03] (-1,0] 1 0.002
(1e+04,2e+04] (2e+03,5e+03] (2e+03,5e+03] 1 0.002
(1e+04,2e+04] (5e+03,1e+04] (-1,0] 1 0.002
(1e+04,2e+04] (5e+03,1e+04] (1e+03,2e+03] 1 0.002
(1e+04,2e+04] (5e+03,1e+04] (2e+03,5e+03] 1 0.002
(2e+03,3e+03] (-1,0] (0,500] 4 0.008
(2e+03,3e+03] (-1,0] (1e+03,2e+03] 1 0.002
(2e+03,3e+03] (-1,0] (2e+03,5e+03] 4 0.008
(2e+03,3e+03] (-1,0] (500,1e+03] 4 0.008
(2e+03,3e+03] (0,100] (-1,0] 1 0.002
(2e+03,3e+03] (0,100] (2e+03,5e+03] 1 0.002
(2e+03,3e+03] (100,500] (-1,0] 3 0.006
(2e+03,3e+03] (100,500] (0,500] 1 0.002
(2e+03,3e+03] (100,500] (1e+03,2e+03] 2 0.004
(2e+03,3e+03] (100,500] (5e+03,5e+04] 1 0.002
(2e+03,3e+03] (1e+03,2e+03] (-1,0] 3 0.006
(2e+03,3e+03] (1e+03,2e+03] (2e+03,5e+03] 1 0.002
(2e+03,3e+03] (1e+04,5e+04] (-1,0] 2 0.004
(2e+03,3e+03] (2e+03,5e+03] (-1,0] 4 0.008
(2e+03,3e+03] (2e+03,5e+03] (0,500] 3 0.006
(2e+03,3e+03] (2e+03,5e+03] (5e+03,5e+04] 1 0.002
(2e+03,3e+03] (500,1e+03] (-1,0] 5 0.010
(2e+03,3e+03] (500,1e+03] (1e+03,2e+03] 1 0.002
(2e+03,3e+03] (500,1e+03] (500,1e+03] 3 0.006
(2e+03,3e+03] (5e+03,1e+04] (-1,0] 5 0.010
(2e+03,3e+03] (5e+03,1e+04] (0,500] 1 0.002
(2e+03,3e+03] (5e+03,1e+04] (1e+03,2e+03] 1 0.002
(3e+03,5e+03] (-1,0] (0,500] 3 0.006
(3e+03,5e+03] (-1,0] (1e+03,2e+03] 8 0.016
(3e+03,5e+03] (-1,0] (2e+03,5e+03] 3 0.006
(3e+03,5e+03] (-1,0] (500,1e+03] 1 0.002
(3e+03,5e+03] (-1,0] (5e+03,5e+04] 4 0.008
(3e+03,5e+03] (0,100] (0,500] 1 0.002
(3e+03,5e+03] (0,100] (2e+03,5e+03] 1 0.002
(3e+03,5e+03] (100,500] (-1,0] 6 0.012
(3e+03,5e+03] (100,500] (0,500] 2 0.004
(3e+03,5e+03] (100,500] (1e+03,2e+03] 1 0.002
(3e+03,5e+03] (100,500] (2e+03,5e+03] 1 0.002
(3e+03,5e+03] (100,500] (500,1e+03] 1 0.002
(3e+03,5e+03] (100,500] (5e+03,5e+04] 1 0.002
(3e+03,5e+03] (1e+03,2e+03] (-1,0] 8 0.016
(3e+03,5e+03] (1e+03,2e+03] (0,500] 1 0.002
(3e+03,5e+03] (1e+03,2e+03] (1e+03,2e+03] 1 0.002
(3e+03,5e+03] (1e+03,2e+03] (5e+03,5e+04] 1 0.002
(3e+03,5e+03] (1e+04,5e+04] (-1,0] 1 0.002
(3e+03,5e+03] (2e+03,5e+03] (-1,0] 5 0.010
(3e+03,5e+03] (2e+03,5e+03] (0,500] 1 0.002
(3e+03,5e+03] (2e+03,5e+03] (1e+03,2e+03] 1 0.002
(3e+03,5e+03] (2e+03,5e+03] (2e+03,5e+03] 1 0.002
(3e+03,5e+03] (2e+03,5e+03] (5e+03,5e+04] 2 0.004
(3e+03,5e+03] (500,1e+03] (-1,0] 3 0.006
(3e+03,5e+03] (500,1e+03] (1e+03,2e+03] 1 0.002
(3e+03,5e+03] (500,1e+03] (2e+03,5e+03] 2 0.004
(3e+03,5e+03] (500,1e+03] (500,1e+03] 1 0.002
(3e+03,5e+03] (500,1e+03] (5e+03,5e+04] 2 0.004
(500,1e+03] (-1,0] (0,500] 4 0.008
(500,1e+03] (-1,0] (1e+03,2e+03] 3 0.006
(500,1e+03] (-1,0] (2e+03,5e+03] 4 0.008
(500,1e+03] (-1,0] (500,1e+03] 2 0.004
(500,1e+03] (0,100] (-1,0] 3 0.006
(500,1e+03] (0,100] (1e+03,2e+03] 1 0.002
(500,1e+03] (0,100] (2e+03,5e+03] 1 0.002
(500,1e+03] (100,500] (-1,0] 8 0.016
(500,1e+03] (100,500] (1e+03,2e+03] 2 0.004
(500,1e+03] (100,500] (500,1e+03] 2 0.004
(500,1e+03] (1e+03,2e+03] (-1,0] 4 0.008
(500,1e+03] (1e+03,2e+03] (0,500] 1 0.002
(500,1e+03] (1e+03,2e+03] (1e+03,2e+03] 2 0.004
(500,1e+03] (1e+03,2e+03] (2e+03,5e+03] 1 0.002
(500,1e+03] (1e+04,5e+04] (1e+03,2e+03] 1 0.002
(500,1e+03] (2e+03,5e+03] (-1,0] 5 0.010
(500,1e+03] (2e+03,5e+03] (0,500] 1 0.002
(500,1e+03] (500,1e+03] (-1,0] 7 0.014
(500,1e+03] (500,1e+03] (0,500] 2 0.004
(500,1e+03] (500,1e+03] (500,1e+03] 2 0.004
(500,1e+03] (5e+03,1e+04] (-1,0] 5 0.010
(500,1e+03] (5e+03,1e+04] (1e+03,2e+03] 1 0.002
(5e+03,7e+03] (-1,0] (0,500] 2 0.004
(5e+03,7e+03] (-1,0] (1e+03,2e+03] 1 0.002
(5e+03,7e+03] (-1,0] (2e+03,5e+03] 8 0.016
(5e+03,7e+03] (-1,0] (500,1e+03] 2 0.004
(5e+03,7e+03] (0,100] (2e+03,5e+03] 1 0.002
(5e+03,7e+03] (100,500] (1e+03,2e+03] 1 0.002
(5e+03,7e+03] (100,500] (2e+03,5e+03] 5 0.010
(5e+03,7e+03] (100,500] (500,1e+03] 2 0.004
(5e+03,7e+03] (100,500] (5e+03,5e+04] 4 0.008
(5e+03,7e+03] (1e+03,2e+03] (-1,0] 5 0.010
(5e+03,7e+03] (1e+03,2e+03] (2e+03,5e+03] 2 0.004
(5e+03,7e+03] (1e+04,5e+04] (-1,0] 1 0.002
(5e+03,7e+03] (2e+03,5e+03] (-1,0] 2 0.004
(5e+03,7e+03] (2e+03,5e+03] (1e+03,2e+03] 1 0.002
(5e+03,7e+03] (2e+03,5e+03] (2e+03,5e+03] 3 0.006
(5e+03,7e+03] (2e+03,5e+03] (5e+03,5e+04] 1 0.002
(5e+03,7e+03] (500,1e+03] (-1,0] 1 0.002
(5e+03,7e+03] (5e+03,1e+04] (-1,0] 2 0.004
(5e+03,7e+03] (5e+03,1e+04] (0,500] 1 0.002
(7e+03,1e+04] (-1,0] (2e+03,5e+03] 2 0.004
(7e+03,1e+04] (-1,0] (5e+03,5e+04] 4 0.008
(7e+03,1e+04] (100,500] (-1,0] 1 0.002
(7e+03,1e+04] (100,500] (2e+03,5e+03] 3 0.006
(7e+03,1e+04] (100,500] (5e+03,5e+04] 2 0.004
(7e+03,1e+04] (1e+03,2e+03] (2e+03,5e+03] 1 0.002
(7e+03,1e+04] (1e+03,2e+03] (500,1e+03] 1 0.002
(7e+03,1e+04] (1e+03,2e+03] (5e+03,5e+04] 1 0.002
(7e+03,1e+04] (1e+04,5e+04] (-1,0] 1 0.002
(7e+03,1e+04] (2e+03,5e+03] (-1,0] 3 0.006
(7e+03,1e+04] (2e+03,5e+03] (500,1e+03] 1 0.002
(7e+03,1e+04] (2e+03,5e+03] (5e+03,5e+04] 2 0.004
(7e+03,1e+04] (500,1e+03] (2e+03,5e+03] 1 0.002
(7e+03,1e+04] (500,1e+03] (5e+03,5e+04] 2 0.004
(7e+03,1e+04] (5e+03,1e+04] (0,500] 1 0.002
(7e+03,1e+04] (5e+03,1e+04] (1e+03,2e+03] 1 0.002
(7e+03,1e+04] (5e+03,1e+04] (5e+03,5e+04] 1 0.002

The table above shows the final distribution across the three variables that we will replicate for the control group. The following piece of code samples out customers from the control group in the same frequency distribution as above:

#### Generate lookalikes

control_n = 500 # desired number of customers in the control group
j = 0 # counter
df_control = data.frame() # dataframe where control group will be formed
for (i in target_profile$combined_bin){
  j=j+1 
  print(j)
  df_control_bin <- df[(control_n+1):nrow(df),] %>% filter(combined_bin==i)
  df_control_bin_sample <- sample(1:nrow(df_control_bin),
                            size= as.integer(target_profile %>%
                                               filter(combined_bin==i) %>%
                                               select(bin_wt)*control_n), replace = TRUE)
  df_control_bin_sample <- df_control_bin[df_control_bin_sample,] # pick actual rows from sample index values
  df_control <- rbind(df_control,
                      df_control_bin_sample) # keep adding the main control group
                                              #data frame with each iteration
}

A point to note here is that we are sampling with replacement as there are some bins where the number of customers required for the control group are not sufficient to match the bin weight of the target group. Therefore, for some bins, customers are counted more than once (there are 478 distinct customers forming the 500 rows control group). For larger datasets, this problem would not arise and sampling without replacement should be used.

Finally, we can match the distribution across the three variables for both groups and verify that they match: