Using stratified sampling technique
Complete code available here: https://github.com/bassalat/lookalikes
Lookalike modelling is searching for a set of observations (i.e. customers in the marketing domain) that have a similar profile to a target group. In essence, we are trying the mimic the features of the target control to form a control group. This is useful for a variety of applications that include A/B tests, post-campaign analyses and finding a specific set of customers.
To illustrate this method, we’ll use a Credit card customer profile data-set available on Kaggle and use R to explore it:
https://www.kaggle.com/arjunbhasin2013/ccdata
CUST_ID | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | balance_bin | purchases_bin | cashadvance_bin | payments_bin | purchasestrx_bin | combined_bin |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
C10001 | 40.90075 | 0.818182 | 95.40 | 0.00 | 95.40 | 0.000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000 | 201.8021 | 139.5098 | 0.000000 | 12 | (0,100] | (0,100] | (-1,0] | (100,1e+03] | (0,5] | (0,100] (0,100] (-1,0] |
C10002 | 3202.46742 | 0.909091 | 0.00 | 0.00 | 0.00 | 6442.945 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000 | 4103.0326 | 1072.3402 | 0.222222 | 12 | (3e+03,5e+03] | (-1,0] | (5e+03,5e+04] | (1e+03,5e+03] | (-1,0] | (3e+03,5e+03] (-1,0] (5e+03,5e+04] |
C10003 | 2495.14886 | 1.000000 | 773.17 | 773.17 | 0.00 | 0.000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500 | 622.0667 | 627.2848 | 0.000000 | 12 | (2e+03,3e+03] | (500,1e+03] | (-1,0] | (100,1e+03] | (5,15] | (2e+03,3e+03] (500,1e+03] (-1,0] |
C10004 | 1666.67054 | 0.636364 | 1499.00 | 1499.00 | 0.00 | 205.788 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500 | 0.0000 | NA | 0.000000 | 12 | (1e+03,2e+03] | (1e+03,2e+03] | (0,500] | (-1,0] | (0,5] | (1e+03,2e+03] (1e+03,2e+03] (0,500] |
C10005 | 817.71434 | 1.000000 | 16.00 | 16.00 | 0.00 | 0.000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200 | 678.3348 | 244.7912 | 0.000000 | 12 | (500,1e+03] | (0,100] | (-1,0] | (100,1e+03] | (0,5] | (500,1e+03] (0,100] (-1,0] |
C10006 | 1809.82875 | 1.000000 | 1333.28 | 0.00 | 1333.28 | 0.000 | 0.666667 | 0.000000 | 0.583333 | 0.000000 | 0 | 8 | 1800 | 1400.0578 | 2407.2460 | 0.000000 | 12 | (1e+03,2e+03] | (1e+03,2e+03] | (-1,0] | (1e+03,5e+03] | (5,15] | (1e+03,2e+03] (1e+03,2e+03] (-1,0] |
C10007 | 627.26081 | 1.000000 | 7091.01 | 6402.63 | 688.38 | 0.000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0 | 64 | 13500 | 6354.3143 | 198.0659 | 1.000000 | 12 | (500,1e+03] | (5e+03,1e+04] | (-1,0] | (5e+03,1e+04] | (15,100] | (500,1e+03] (5e+03,1e+04] (-1,0] |
C10008 | 1823.65274 | 1.000000 | 436.20 | 0.00 | 436.20 | 0.000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0 | 12 | 2300 | 679.0651 | 532.0340 | 0.000000 | 12 | (1e+03,2e+03] | (100,500] | (-1,0] | (100,1e+03] | (5,15] | (1e+03,2e+03] (100,500] (-1,0] |
C10009 | 1014.92647 | 1.000000 | 861.49 | 661.49 | 200.00 | 0.000 | 0.333333 | 0.083333 | 0.250000 | 0.000000 | 0 | 5 | 7000 | 688.2786 | 311.9634 | 0.000000 | 12 | (1e+03,2e+03] | (500,1e+03] | (-1,0] | (100,1e+03] | (0,5] | (1e+03,2e+03] (500,1e+03] (-1,0] |
C10010 | 152.22598 | 0.545455 | 1281.60 | 1281.60 | 0.00 | 0.000 | 0.166667 | 0.166667 | 0.000000 | 0.000000 | 0 | 3 | 11000 | 1164.7706 | 100.3023 | 0.000000 | 12 | (100,500] | (1e+03,2e+03] | (-1,0] | (1e+03,5e+03] | (0,5] | (100,500] (1e+03,2e+03] (-1,0] |
Let's say a marketing campaign was run on the first 500 customers (target group) in the data-set. Now we need to find look-alike customers from the remaining part of the data-set. Choosing which variables will form the basis for lookalikes is a business domain question. Let's assume for this exercise, that we want to match the distribution of BALANCE, PURCHASES AND CASH_ADVANCE
.
As the first step, we are going to get the joint distribution of these variables for the target group. Continuous variables will be binned to convert them to categorical variables so that all variables are of categorical type. Thereafter, a joint categorical variable will be formed and frequency distribution against it will be used.
The following steps will be used to make joint distribution bins:
# Make bins for chosen variables
df <- df %>% mutate(
balance_bin = cut(BALANCE,c(-1,0,100,500,1000,2000,3000, 5000,7000,10000,20000)),
purchases_bin = cut(PURCHASES,c(-1,0,100,500,1000,2000,5000,10000,50000)),
cashadvance_bin = cut(CASH_ADVANCE,c(-1,0,500,1000,2000,5000,50000)))
# Make a combined bin
df <- df %>% mutate(combined_bin = paste(balance_bin,
purchases_bin,
cashadvance_bin))
# Combined distribution of target group:
target_profile <- df[1:500,] %>% group_by(combined_bin) %>%
summarise(customers=n()) %>%
mutate(bin_wt = customers/nrow(df[1:500,]))
combined_bin | customers | bin_wt |
---|---|---|
(-1,0] (-1,0] (0,500] | 1 | 0.002 |
(-1,0] (0,100] (-1,0] | 1 | 0.002 |
(0,100] (-1,0] (0,500] | 2 | 0.004 |
(0,100] (-1,0] (2e+03,5e+03] | 2 | 0.004 |
(0,100] (0,100] (-1,0] | 14 | 0.028 |
(0,100] (0,100] (1e+03,2e+03] | 1 | 0.002 |
(0,100] (100,500] (-1,0] | 30 | 0.060 |
(0,100] (100,500] (0,500] | 1 | 0.002 |
(0,100] (1e+03,2e+03] (-1,0] | 5 | 0.010 |
(0,100] (500,1e+03] (-1,0] | 14 | 0.028 |
(0,100] (500,1e+03] (0,500] | 1 | 0.002 |
(100,500] (-1,0] (1e+03,2e+03] | 1 | 0.002 |
(100,500] (-1,0] (2e+03,5e+03] | 1 | 0.002 |
(100,500] (-1,0] (500,1e+03] | 3 | 0.006 |
(100,500] (-1,0] (5e+03,5e+04] | 1 | 0.002 |
(100,500] (0,100] (-1,0] | 1 | 0.002 |
(100,500] (0,100] (0,500] | 1 | 0.002 |
(100,500] (100,500] (-1,0] | 7 | 0.014 |
(100,500] (100,500] (1e+03,2e+03] | 1 | 0.002 |
(100,500] (1e+03,2e+03] (-1,0] | 23 | 0.046 |
(100,500] (1e+03,2e+03] (0,500] | 2 | 0.004 |
(100,500] (1e+03,2e+03] (1e+03,2e+03] | 1 | 0.002 |
(100,500] (2e+03,5e+03] (-1,0] | 18 | 0.036 |
(100,500] (500,1e+03] (-1,0] | 7 | 0.014 |
(100,500] (500,1e+03] (0,500] | 1 | 0.002 |
(100,500] (5e+03,1e+04] (-1,0] | 3 | 0.006 |
(1e+03,2e+03] (-1,0] (0,500] | 16 | 0.032 |
(1e+03,2e+03] (-1,0] (1e+03,2e+03] | 3 | 0.006 |
(1e+03,2e+03] (-1,0] (2e+03,5e+03] | 3 | 0.006 |
(1e+03,2e+03] (-1,0] (500,1e+03] | 2 | 0.004 |
(1e+03,2e+03] (0,100] (-1,0] | 3 | 0.006 |
(1e+03,2e+03] (0,100] (0,500] | 2 | 0.004 |
(1e+03,2e+03] (100,500] (-1,0] | 7 | 0.014 |
(1e+03,2e+03] (100,500] (0,500] | 6 | 0.012 |
(1e+03,2e+03] (100,500] (1e+03,2e+03] | 2 | 0.004 |
(1e+03,2e+03] (100,500] (2e+03,5e+03] | 1 | 0.002 |
(1e+03,2e+03] (1e+03,2e+03] (-1,0] | 8 | 0.016 |
(1e+03,2e+03] (1e+03,2e+03] (0,500] | 2 | 0.004 |
(1e+03,2e+03] (1e+03,2e+03] (1e+03,2e+03] | 3 | 0.006 |
(1e+03,2e+03] (1e+03,2e+03] (2e+03,5e+03] | 2 | 0.004 |
(1e+03,2e+03] (1e+03,2e+03] (500,1e+03] | 1 | 0.002 |
(1e+03,2e+03] (2e+03,5e+03] (-1,0] | 6 | 0.012 |
(1e+03,2e+03] (2e+03,5e+03] (0,500] | 2 | 0.004 |
(1e+03,2e+03] (2e+03,5e+03] (1e+03,2e+03] | 2 | 0.004 |
(1e+03,2e+03] (2e+03,5e+03] (500,1e+03] | 1 | 0.002 |
(1e+03,2e+03] (2e+03,5e+03] (5e+03,5e+04] | 1 | 0.002 |
(1e+03,2e+03] (500,1e+03] (-1,0] | 13 | 0.026 |
(1e+03,2e+03] (500,1e+03] (0,500] | 1 | 0.002 |
(1e+03,2e+03] (500,1e+03] (1e+03,2e+03] | 2 | 0.004 |
(1e+03,2e+03] (500,1e+03] (2e+03,5e+03] | 1 | 0.002 |
(1e+03,2e+03] (500,1e+03] (5e+03,5e+04] | 1 | 0.002 |
(1e+03,2e+03] (5e+03,1e+04] (-1,0] | 2 | 0.004 |
(1e+03,2e+03] (5e+03,1e+04] (5e+03,5e+04] | 1 | 0.002 |
(1e+04,2e+04] (-1,0] (2e+03,5e+03] | 2 | 0.004 |
(1e+04,2e+04] (100,500] (2e+03,5e+03] | 1 | 0.002 |
(1e+04,2e+04] (100,500] (500,1e+03] | 1 | 0.002 |
(1e+04,2e+04] (100,500] (5e+03,5e+04] | 1 | 0.002 |
(1e+04,2e+04] (1e+04,5e+04] (-1,0] | 1 | 0.002 |
(1e+04,2e+04] (1e+04,5e+04] (2e+03,5e+03] | 1 | 0.002 |
(1e+04,2e+04] (2e+03,5e+03] (-1,0] | 1 | 0.002 |
(1e+04,2e+04] (2e+03,5e+03] (2e+03,5e+03] | 1 | 0.002 |
(1e+04,2e+04] (5e+03,1e+04] (-1,0] | 1 | 0.002 |
(1e+04,2e+04] (5e+03,1e+04] (1e+03,2e+03] | 1 | 0.002 |
(1e+04,2e+04] (5e+03,1e+04] (2e+03,5e+03] | 1 | 0.002 |
(2e+03,3e+03] (-1,0] (0,500] | 4 | 0.008 |
(2e+03,3e+03] (-1,0] (1e+03,2e+03] | 1 | 0.002 |
(2e+03,3e+03] (-1,0] (2e+03,5e+03] | 4 | 0.008 |
(2e+03,3e+03] (-1,0] (500,1e+03] | 4 | 0.008 |
(2e+03,3e+03] (0,100] (-1,0] | 1 | 0.002 |
(2e+03,3e+03] (0,100] (2e+03,5e+03] | 1 | 0.002 |
(2e+03,3e+03] (100,500] (-1,0] | 3 | 0.006 |
(2e+03,3e+03] (100,500] (0,500] | 1 | 0.002 |
(2e+03,3e+03] (100,500] (1e+03,2e+03] | 2 | 0.004 |
(2e+03,3e+03] (100,500] (5e+03,5e+04] | 1 | 0.002 |
(2e+03,3e+03] (1e+03,2e+03] (-1,0] | 3 | 0.006 |
(2e+03,3e+03] (1e+03,2e+03] (2e+03,5e+03] | 1 | 0.002 |
(2e+03,3e+03] (1e+04,5e+04] (-1,0] | 2 | 0.004 |
(2e+03,3e+03] (2e+03,5e+03] (-1,0] | 4 | 0.008 |
(2e+03,3e+03] (2e+03,5e+03] (0,500] | 3 | 0.006 |
(2e+03,3e+03] (2e+03,5e+03] (5e+03,5e+04] | 1 | 0.002 |
(2e+03,3e+03] (500,1e+03] (-1,0] | 5 | 0.010 |
(2e+03,3e+03] (500,1e+03] (1e+03,2e+03] | 1 | 0.002 |
(2e+03,3e+03] (500,1e+03] (500,1e+03] | 3 | 0.006 |
(2e+03,3e+03] (5e+03,1e+04] (-1,0] | 5 | 0.010 |
(2e+03,3e+03] (5e+03,1e+04] (0,500] | 1 | 0.002 |
(2e+03,3e+03] (5e+03,1e+04] (1e+03,2e+03] | 1 | 0.002 |
(3e+03,5e+03] (-1,0] (0,500] | 3 | 0.006 |
(3e+03,5e+03] (-1,0] (1e+03,2e+03] | 8 | 0.016 |
(3e+03,5e+03] (-1,0] (2e+03,5e+03] | 3 | 0.006 |
(3e+03,5e+03] (-1,0] (500,1e+03] | 1 | 0.002 |
(3e+03,5e+03] (-1,0] (5e+03,5e+04] | 4 | 0.008 |
(3e+03,5e+03] (0,100] (0,500] | 1 | 0.002 |
(3e+03,5e+03] (0,100] (2e+03,5e+03] | 1 | 0.002 |
(3e+03,5e+03] (100,500] (-1,0] | 6 | 0.012 |
(3e+03,5e+03] (100,500] (0,500] | 2 | 0.004 |
(3e+03,5e+03] (100,500] (1e+03,2e+03] | 1 | 0.002 |
(3e+03,5e+03] (100,500] (2e+03,5e+03] | 1 | 0.002 |
(3e+03,5e+03] (100,500] (500,1e+03] | 1 | 0.002 |
(3e+03,5e+03] (100,500] (5e+03,5e+04] | 1 | 0.002 |
(3e+03,5e+03] (1e+03,2e+03] (-1,0] | 8 | 0.016 |
(3e+03,5e+03] (1e+03,2e+03] (0,500] | 1 | 0.002 |
(3e+03,5e+03] (1e+03,2e+03] (1e+03,2e+03] | 1 | 0.002 |
(3e+03,5e+03] (1e+03,2e+03] (5e+03,5e+04] | 1 | 0.002 |
(3e+03,5e+03] (1e+04,5e+04] (-1,0] | 1 | 0.002 |
(3e+03,5e+03] (2e+03,5e+03] (-1,0] | 5 | 0.010 |
(3e+03,5e+03] (2e+03,5e+03] (0,500] | 1 | 0.002 |
(3e+03,5e+03] (2e+03,5e+03] (1e+03,2e+03] | 1 | 0.002 |
(3e+03,5e+03] (2e+03,5e+03] (2e+03,5e+03] | 1 | 0.002 |
(3e+03,5e+03] (2e+03,5e+03] (5e+03,5e+04] | 2 | 0.004 |
(3e+03,5e+03] (500,1e+03] (-1,0] | 3 | 0.006 |
(3e+03,5e+03] (500,1e+03] (1e+03,2e+03] | 1 | 0.002 |
(3e+03,5e+03] (500,1e+03] (2e+03,5e+03] | 2 | 0.004 |
(3e+03,5e+03] (500,1e+03] (500,1e+03] | 1 | 0.002 |
(3e+03,5e+03] (500,1e+03] (5e+03,5e+04] | 2 | 0.004 |
(500,1e+03] (-1,0] (0,500] | 4 | 0.008 |
(500,1e+03] (-1,0] (1e+03,2e+03] | 3 | 0.006 |
(500,1e+03] (-1,0] (2e+03,5e+03] | 4 | 0.008 |
(500,1e+03] (-1,0] (500,1e+03] | 2 | 0.004 |
(500,1e+03] (0,100] (-1,0] | 3 | 0.006 |
(500,1e+03] (0,100] (1e+03,2e+03] | 1 | 0.002 |
(500,1e+03] (0,100] (2e+03,5e+03] | 1 | 0.002 |
(500,1e+03] (100,500] (-1,0] | 8 | 0.016 |
(500,1e+03] (100,500] (1e+03,2e+03] | 2 | 0.004 |
(500,1e+03] (100,500] (500,1e+03] | 2 | 0.004 |
(500,1e+03] (1e+03,2e+03] (-1,0] | 4 | 0.008 |
(500,1e+03] (1e+03,2e+03] (0,500] | 1 | 0.002 |
(500,1e+03] (1e+03,2e+03] (1e+03,2e+03] | 2 | 0.004 |
(500,1e+03] (1e+03,2e+03] (2e+03,5e+03] | 1 | 0.002 |
(500,1e+03] (1e+04,5e+04] (1e+03,2e+03] | 1 | 0.002 |
(500,1e+03] (2e+03,5e+03] (-1,0] | 5 | 0.010 |
(500,1e+03] (2e+03,5e+03] (0,500] | 1 | 0.002 |
(500,1e+03] (500,1e+03] (-1,0] | 7 | 0.014 |
(500,1e+03] (500,1e+03] (0,500] | 2 | 0.004 |
(500,1e+03] (500,1e+03] (500,1e+03] | 2 | 0.004 |
(500,1e+03] (5e+03,1e+04] (-1,0] | 5 | 0.010 |
(500,1e+03] (5e+03,1e+04] (1e+03,2e+03] | 1 | 0.002 |
(5e+03,7e+03] (-1,0] (0,500] | 2 | 0.004 |
(5e+03,7e+03] (-1,0] (1e+03,2e+03] | 1 | 0.002 |
(5e+03,7e+03] (-1,0] (2e+03,5e+03] | 8 | 0.016 |
(5e+03,7e+03] (-1,0] (500,1e+03] | 2 | 0.004 |
(5e+03,7e+03] (0,100] (2e+03,5e+03] | 1 | 0.002 |
(5e+03,7e+03] (100,500] (1e+03,2e+03] | 1 | 0.002 |
(5e+03,7e+03] (100,500] (2e+03,5e+03] | 5 | 0.010 |
(5e+03,7e+03] (100,500] (500,1e+03] | 2 | 0.004 |
(5e+03,7e+03] (100,500] (5e+03,5e+04] | 4 | 0.008 |
(5e+03,7e+03] (1e+03,2e+03] (-1,0] | 5 | 0.010 |
(5e+03,7e+03] (1e+03,2e+03] (2e+03,5e+03] | 2 | 0.004 |
(5e+03,7e+03] (1e+04,5e+04] (-1,0] | 1 | 0.002 |
(5e+03,7e+03] (2e+03,5e+03] (-1,0] | 2 | 0.004 |
(5e+03,7e+03] (2e+03,5e+03] (1e+03,2e+03] | 1 | 0.002 |
(5e+03,7e+03] (2e+03,5e+03] (2e+03,5e+03] | 3 | 0.006 |
(5e+03,7e+03] (2e+03,5e+03] (5e+03,5e+04] | 1 | 0.002 |
(5e+03,7e+03] (500,1e+03] (-1,0] | 1 | 0.002 |
(5e+03,7e+03] (5e+03,1e+04] (-1,0] | 2 | 0.004 |
(5e+03,7e+03] (5e+03,1e+04] (0,500] | 1 | 0.002 |
(7e+03,1e+04] (-1,0] (2e+03,5e+03] | 2 | 0.004 |
(7e+03,1e+04] (-1,0] (5e+03,5e+04] | 4 | 0.008 |
(7e+03,1e+04] (100,500] (-1,0] | 1 | 0.002 |
(7e+03,1e+04] (100,500] (2e+03,5e+03] | 3 | 0.006 |
(7e+03,1e+04] (100,500] (5e+03,5e+04] | 2 | 0.004 |
(7e+03,1e+04] (1e+03,2e+03] (2e+03,5e+03] | 1 | 0.002 |
(7e+03,1e+04] (1e+03,2e+03] (500,1e+03] | 1 | 0.002 |
(7e+03,1e+04] (1e+03,2e+03] (5e+03,5e+04] | 1 | 0.002 |
(7e+03,1e+04] (1e+04,5e+04] (-1,0] | 1 | 0.002 |
(7e+03,1e+04] (2e+03,5e+03] (-1,0] | 3 | 0.006 |
(7e+03,1e+04] (2e+03,5e+03] (500,1e+03] | 1 | 0.002 |
(7e+03,1e+04] (2e+03,5e+03] (5e+03,5e+04] | 2 | 0.004 |
(7e+03,1e+04] (500,1e+03] (2e+03,5e+03] | 1 | 0.002 |
(7e+03,1e+04] (500,1e+03] (5e+03,5e+04] | 2 | 0.004 |
(7e+03,1e+04] (5e+03,1e+04] (0,500] | 1 | 0.002 |
(7e+03,1e+04] (5e+03,1e+04] (1e+03,2e+03] | 1 | 0.002 |
(7e+03,1e+04] (5e+03,1e+04] (5e+03,5e+04] | 1 | 0.002 |
The table above shows the final distribution across the three variables that we will replicate for the control group. The following piece of code samples out customers from the control group in the same frequency distribution as above:
#### Generate lookalikes
control_n = 500 # desired number of customers in the control group
j = 0 # counter
df_control = data.frame() # dataframe where control group will be formed
for (i in target_profile$combined_bin){
j=j+1
print(j)
df_control_bin <- df[(control_n+1):nrow(df),] %>% filter(combined_bin==i)
df_control_bin_sample <- sample(1:nrow(df_control_bin),
size= as.integer(target_profile %>%
filter(combined_bin==i) %>%
select(bin_wt)*control_n), replace = TRUE)
df_control_bin_sample <- df_control_bin[df_control_bin_sample,] # pick actual rows from sample index values
df_control <- rbind(df_control,
df_control_bin_sample) # keep adding the main control group
#data frame with each iteration
}
A point to note here is that we are sampling with replacement as there are some bins where the number of customers required for the control group are not sufficient to match the bin weight of the target group. Therefore, for some bins, customers are counted more than once (there are 478
distinct customers forming the 500 rows control group). For larger datasets, this problem would not arise and sampling without replacement should be used.
Finally, we can match the distribution across the three variables for both groups and verify that they match: