patchwork – Pinch of Data

Using stratified sampling technique

Complete code available here: https://github.com/bassalat/lookalikes

Lookalike modelling is searching for a set of observations (i.e. customers in the marketing domain) that have a similar profile to a target group. In essence, we are trying the mimic the features of the target control to form a control group. This is useful for a variety of applications that include A/B tests, post-campaign analyses and finding a specific set of customers.

To illustrate this method, we’ll use a Credit card customer profile data-set available on Kaggle and use R to explore it:

https://www.kaggle.com/arjunbhasin2013/ccdata

CUST_ID	BALANCE	BALANCE_FREQUENCY	PURCHASES	ONEOFF_PURCHASES	INSTALLMENTS_PURCHASES	CASH_ADVANCE	PURCHASES_FREQUENCY	ONEOFF_PURCHASES_FREQUENCY	PURCHASES_INSTALLMENTS_FREQUENCY	CASH_ADVANCE_FREQUENCY	CASH_ADVANCE_TRX	PURCHASES_TRX	CREDIT_LIMIT	PAYMENTS	MINIMUM_PAYMENTS	PRC_FULL_PAYMENT	TENURE	balance_bin	purchases_bin	cashadvance_bin	payments_bin	purchasestrx_bin	combined_bin
C10001	40.90075	0.818182	95.40	0.00	95.40	0.000	0.166667	0.000000	0.083333	0.000000	0	2	1000	201.8021	139.5098	0.000000	12	(0,100]	(0,100]	(-1,0]	(100,1e+03]	(0,5]	(0,100] (0,100] (-1,0]
C10002	3202.46742	0.909091	0.00	0.00	0.00	6442.945	0.000000	0.000000	0.000000	0.250000	4	0	7000	4103.0326	1072.3402	0.222222	12	(3e+03,5e+03]	(-1,0]	(5e+03,5e+04]	(1e+03,5e+03]	(-1,0]	(3e+03,5e+03] (-1,0] (5e+03,5e+04]
C10003	2495.14886	1.000000	773.17	773.17	0.00	0.000	1.000000	1.000000	0.000000	0.000000	0	12	7500	622.0667	627.2848	0.000000	12	(2e+03,3e+03]	(500,1e+03]	(-1,0]	(100,1e+03]	(5,15]	(2e+03,3e+03] (500,1e+03] (-1,0]
C10004	1666.67054	0.636364	1499.00	1499.00	0.00	205.788	0.083333	0.083333	0.000000	0.083333	1	1	7500	0.0000	NA	0.000000	12	(1e+03,2e+03]	(1e+03,2e+03]	(0,500]	(-1,0]	(0,5]	(1e+03,2e+03] (1e+03,2e+03] (0,500]
C10005	817.71434	1.000000	16.00	16.00	0.00	0.000	0.083333	0.083333	0.000000	0.000000	0	1	1200	678.3348	244.7912	0.000000	12	(500,1e+03]	(0,100]	(-1,0]	(100,1e+03]	(0,5]	(500,1e+03] (0,100] (-1,0]
C10006	1809.82875	1.000000	1333.28	0.00	1333.28	0.000	0.666667	0.000000	0.583333	0.000000	0	8	1800	1400.0578	2407.2460	0.000000	12	(1e+03,2e+03]	(1e+03,2e+03]	(-1,0]	(1e+03,5e+03]	(5,15]	(1e+03,2e+03] (1e+03,2e+03] (-1,0]
C10007	627.26081	1.000000	7091.01	6402.63	688.38	0.000	1.000000	1.000000	1.000000	0.000000	0	64	13500	6354.3143	198.0659	1.000000	12	(500,1e+03]	(5e+03,1e+04]	(-1,0]	(5e+03,1e+04]	(15,100]	(500,1e+03] (5e+03,1e+04] (-1,0]
C10008	1823.65274	1.000000	436.20	0.00	436.20	0.000	1.000000	0.000000	1.000000	0.000000	0	12	2300	679.0651	532.0340	0.000000	12	(1e+03,2e+03]	(100,500]	(-1,0]	(100,1e+03]	(5,15]	(1e+03,2e+03] (100,500] (-1,0]
C10009	1014.92647	1.000000	861.49	661.49	200.00	0.000	0.333333	0.083333	0.250000	0.000000	0	5	7000	688.2786	311.9634	0.000000	12	(1e+03,2e+03]	(500,1e+03]	(-1,0]	(100,1e+03]	(0,5]	(1e+03,2e+03] (500,1e+03] (-1,0]
C10010	152.22598	0.545455	1281.60	1281.60	0.00	0.000	0.166667	0.166667	0.000000	0.000000	0	3	11000	1164.7706	100.3023	0.000000	12	(100,500]	(1e+03,2e+03]	(-1,0]	(1e+03,5e+03]	(0,5]	(100,500] (1e+03,2e+03] (-1,0]

Let's say a marketing campaign was run on the first 500 customers (target group) in the data-set. Now we need to find look-alike customers from the remaining part of the data-set. Choosing which variables will form the basis for lookalikes is a business domain question. Let's assume for this exercise, that we want to match the distribution of BALANCE, PURCHASES AND CASH_ADVANCE.

As the first step, we are going to get the joint distribution of these variables for the target group. Continuous variables will be binned to convert them to categorical variables so that all variables are of categorical type. Thereafter, a joint categorical variable will be formed and frequency distribution against it will be used.

The following steps will be used to make joint distribution bins:

# Make bins for chosen variables
df <- df %>% mutate(
balance_bin = cut(BALANCE,c(-1,0,100,500,1000,2000,3000,            5000,7000,10000,20000)),
purchases_bin = cut(PURCHASES,c(-1,0,100,500,1000,2000,5000,10000,50000)),
cashadvance_bin = cut(CASH_ADVANCE,c(-1,0,500,1000,2000,5000,50000)))

# Make a combined bin
df <- df %>% mutate(combined_bin = paste(balance_bin,
                                         purchases_bin,
                                         cashadvance_bin))

# Combined distribution of target group:
target_profile <- df[1:500,] %>% group_by(combined_bin) %>%
  summarise(customers=n()) %>%
  mutate(bin_wt = customers/nrow(df[1:500,]))

combined_bin	customers	bin_wt
(-1,0] (-1,0] (0,500]	1	0.002
(-1,0] (0,100] (-1,0]	1	0.002
(0,100] (-1,0] (0,500]	2	0.004
(0,100] (-1,0] (2e+03,5e+03]	2	0.004
(0,100] (0,100] (-1,0]	14	0.028
(0,100] (0,100] (1e+03,2e+03]	1	0.002
(0,100] (100,500] (-1,0]	30	0.060
(0,100] (100,500] (0,500]	1	0.002
(0,100] (1e+03,2e+03] (-1,0]	5	0.010
(0,100] (500,1e+03] (-1,0]	14	0.028
(0,100] (500,1e+03] (0,500]	1	0.002
(100,500] (-1,0] (1e+03,2e+03]	1	0.002
(100,500] (-1,0] (2e+03,5e+03]	1	0.002
(100,500] (-1,0] (500,1e+03]	3	0.006
(100,500] (-1,0] (5e+03,5e+04]	1	0.002
(100,500] (0,100] (-1,0]	1	0.002
(100,500] (0,100] (0,500]	1	0.002
(100,500] (100,500] (-1,0]	7	0.014
(100,500] (100,500] (1e+03,2e+03]	1	0.002
(100,500] (1e+03,2e+03] (-1,0]	23	0.046
(100,500] (1e+03,2e+03] (0,500]	2	0.004
(100,500] (1e+03,2e+03] (1e+03,2e+03]	1	0.002
(100,500] (2e+03,5e+03] (-1,0]	18	0.036
(100,500] (500,1e+03] (-1,0]	7	0.014
(100,500] (500,1e+03] (0,500]	1	0.002
(100,500] (5e+03,1e+04] (-1,0]	3	0.006
(1e+03,2e+03] (-1,0] (0,500]	16	0.032
(1e+03,2e+03] (-1,0] (1e+03,2e+03]	3	0.006
(1e+03,2e+03] (-1,0] (2e+03,5e+03]	3	0.006
(1e+03,2e+03] (-1,0] (500,1e+03]	2	0.004
(1e+03,2e+03] (0,100] (-1,0]	3	0.006
(1e+03,2e+03] (0,100] (0,500]	2	0.004
(1e+03,2e+03] (100,500] (-1,0]	7	0.014
(1e+03,2e+03] (100,500] (0,500]	6	0.012
(1e+03,2e+03] (100,500] (1e+03,2e+03]	2	0.004
(1e+03,2e+03] (100,500] (2e+03,5e+03]	1	0.002
(1e+03,2e+03] (1e+03,2e+03] (-1,0]	8	0.016
(1e+03,2e+03] (1e+03,2e+03] (0,500]	2	0.004
(1e+03,2e+03] (1e+03,2e+03] (1e+03,2e+03]	3	0.006
(1e+03,2e+03] (1e+03,2e+03] (2e+03,5e+03]	2	0.004
(1e+03,2e+03] (1e+03,2e+03] (500,1e+03]	1	0.002
(1e+03,2e+03] (2e+03,5e+03] (-1,0]	6	0.012
(1e+03,2e+03] (2e+03,5e+03] (0,500]	2	0.004
(1e+03,2e+03] (2e+03,5e+03] (1e+03,2e+03]	2	0.004
(1e+03,2e+03] (2e+03,5e+03] (500,1e+03]	1	0.002
(1e+03,2e+03] (2e+03,5e+03] (5e+03,5e+04]	1	0.002
(1e+03,2e+03] (500,1e+03] (-1,0]	13	0.026
(1e+03,2e+03] (500,1e+03] (0,500]	1	0.002
(1e+03,2e+03] (500,1e+03] (1e+03,2e+03]	2	0.004
(1e+03,2e+03] (500,1e+03] (2e+03,5e+03]	1	0.002
(1e+03,2e+03] (500,1e+03] (5e+03,5e+04]	1	0.002
(1e+03,2e+03] (5e+03,1e+04] (-1,0]	2	0.004
(1e+03,2e+03] (5e+03,1e+04] (5e+03,5e+04]	1	0.002
(1e+04,2e+04] (-1,0] (2e+03,5e+03]	2	0.004
(1e+04,2e+04] (100,500] (2e+03,5e+03]	1	0.002
(1e+04,2e+04] (100,500] (500,1e+03]	1	0.002
(1e+04,2e+04] (100,500] (5e+03,5e+04]	1	0.002
(1e+04,2e+04] (1e+04,5e+04] (-1,0]	1	0.002
(1e+04,2e+04] (1e+04,5e+04] (2e+03,5e+03]	1	0.002
(1e+04,2e+04] (2e+03,5e+03] (-1,0]	1	0.002
(1e+04,2e+04] (2e+03,5e+03] (2e+03,5e+03]	1	0.002
(1e+04,2e+04] (5e+03,1e+04] (-1,0]	1	0.002
(1e+04,2e+04] (5e+03,1e+04] (1e+03,2e+03]	1	0.002
(1e+04,2e+04] (5e+03,1e+04] (2e+03,5e+03]	1	0.002
(2e+03,3e+03] (-1,0] (0,500]	4	0.008
(2e+03,3e+03] (-1,0] (1e+03,2e+03]	1	0.002
(2e+03,3e+03] (-1,0] (2e+03,5e+03]	4	0.008
(2e+03,3e+03] (-1,0] (500,1e+03]	4	0.008
(2e+03,3e+03] (0,100] (-1,0]	1	0.002
(2e+03,3e+03] (0,100] (2e+03,5e+03]	1	0.002
(2e+03,3e+03] (100,500] (-1,0]	3	0.006
(2e+03,3e+03] (100,500] (0,500]	1	0.002
(2e+03,3e+03] (100,500] (1e+03,2e+03]	2	0.004
(2e+03,3e+03] (100,500] (5e+03,5e+04]	1	0.002
(2e+03,3e+03] (1e+03,2e+03] (-1,0]	3	0.006
(2e+03,3e+03] (1e+03,2e+03] (2e+03,5e+03]	1	0.002
(2e+03,3e+03] (1e+04,5e+04] (-1,0]	2	0.004
(2e+03,3e+03] (2e+03,5e+03] (-1,0]	4	0.008
(2e+03,3e+03] (2e+03,5e+03] (0,500]	3	0.006
(2e+03,3e+03] (2e+03,5e+03] (5e+03,5e+04]	1	0.002
(2e+03,3e+03] (500,1e+03] (-1,0]	5	0.010
(2e+03,3e+03] (500,1e+03] (1e+03,2e+03]	1	0.002
(2e+03,3e+03] (500,1e+03] (500,1e+03]	3	0.006
(2e+03,3e+03] (5e+03,1e+04] (-1,0]	5	0.010
(2e+03,3e+03] (5e+03,1e+04] (0,500]	1	0.002
(2e+03,3e+03] (5e+03,1e+04] (1e+03,2e+03]	1	0.002
(3e+03,5e+03] (-1,0] (0,500]	3	0.006
(3e+03,5e+03] (-1,0] (1e+03,2e+03]	8	0.016
(3e+03,5e+03] (-1,0] (2e+03,5e+03]	3	0.006
(3e+03,5e+03] (-1,0] (500,1e+03]	1	0.002
(3e+03,5e+03] (-1,0] (5e+03,5e+04]	4	0.008
(3e+03,5e+03] (0,100] (0,500]	1	0.002
(3e+03,5e+03] (0,100] (2e+03,5e+03]	1	0.002
(3e+03,5e+03] (100,500] (-1,0]	6	0.012
(3e+03,5e+03] (100,500] (0,500]	2	0.004
(3e+03,5e+03] (100,500] (1e+03,2e+03]	1	0.002
(3e+03,5e+03] (100,500] (2e+03,5e+03]	1	0.002
(3e+03,5e+03] (100,500] (500,1e+03]	1	0.002
(3e+03,5e+03] (100,500] (5e+03,5e+04]	1	0.002
(3e+03,5e+03] (1e+03,2e+03] (-1,0]	8	0.016
(3e+03,5e+03] (1e+03,2e+03] (0,500]	1	0.002
(3e+03,5e+03] (1e+03,2e+03] (1e+03,2e+03]	1	0.002
(3e+03,5e+03] (1e+03,2e+03] (5e+03,5e+04]	1	0.002
(3e+03,5e+03] (1e+04,5e+04] (-1,0]	1	0.002
(3e+03,5e+03] (2e+03,5e+03] (-1,0]	5	0.010
(3e+03,5e+03] (2e+03,5e+03] (0,500]	1	0.002
(3e+03,5e+03] (2e+03,5e+03] (1e+03,2e+03]	1	0.002
(3e+03,5e+03] (2e+03,5e+03] (2e+03,5e+03]	1	0.002
(3e+03,5e+03] (2e+03,5e+03] (5e+03,5e+04]	2	0.004
(3e+03,5e+03] (500,1e+03] (-1,0]	3	0.006
(3e+03,5e+03] (500,1e+03] (1e+03,2e+03]	1	0.002
(3e+03,5e+03] (500,1e+03] (2e+03,5e+03]	2	0.004
(3e+03,5e+03] (500,1e+03] (500,1e+03]	1	0.002
(3e+03,5e+03] (500,1e+03] (5e+03,5e+04]	2	0.004
(500,1e+03] (-1,0] (0,500]	4	0.008
(500,1e+03] (-1,0] (1e+03,2e+03]	3	0.006
(500,1e+03] (-1,0] (2e+03,5e+03]	4	0.008
(500,1e+03] (-1,0] (500,1e+03]	2	0.004
(500,1e+03] (0,100] (-1,0]	3	0.006
(500,1e+03] (0,100] (1e+03,2e+03]	1	0.002
(500,1e+03] (0,100] (2e+03,5e+03]	1	0.002
(500,1e+03] (100,500] (-1,0]	8	0.016
(500,1e+03] (100,500] (1e+03,2e+03]	2	0.004
(500,1e+03] (100,500] (500,1e+03]	2	0.004
(500,1e+03] (1e+03,2e+03] (-1,0]	4	0.008
(500,1e+03] (1e+03,2e+03] (0,500]	1	0.002
(500,1e+03] (1e+03,2e+03] (1e+03,2e+03]	2	0.004
(500,1e+03] (1e+03,2e+03] (2e+03,5e+03]	1	0.002
(500,1e+03] (1e+04,5e+04] (1e+03,2e+03]	1	0.002
(500,1e+03] (2e+03,5e+03] (-1,0]	5	0.010
(500,1e+03] (2e+03,5e+03] (0,500]	1	0.002
(500,1e+03] (500,1e+03] (-1,0]	7	0.014
(500,1e+03] (500,1e+03] (0,500]	2	0.004
(500,1e+03] (500,1e+03] (500,1e+03]	2	0.004
(500,1e+03] (5e+03,1e+04] (-1,0]	5	0.010
(500,1e+03] (5e+03,1e+04] (1e+03,2e+03]	1	0.002
(5e+03,7e+03] (-1,0] (0,500]	2	0.004
(5e+03,7e+03] (-1,0] (1e+03,2e+03]	1	0.002
(5e+03,7e+03] (-1,0] (2e+03,5e+03]	8	0.016
(5e+03,7e+03] (-1,0] (500,1e+03]	2	0.004
(5e+03,7e+03] (0,100] (2e+03,5e+03]	1	0.002
(5e+03,7e+03] (100,500] (1e+03,2e+03]	1	0.002
(5e+03,7e+03] (100,500] (2e+03,5e+03]	5	0.010
(5e+03,7e+03] (100,500] (500,1e+03]	2	0.004
(5e+03,7e+03] (100,500] (5e+03,5e+04]	4	0.008
(5e+03,7e+03] (1e+03,2e+03] (-1,0]	5	0.010
(5e+03,7e+03] (1e+03,2e+03] (2e+03,5e+03]	2	0.004
(5e+03,7e+03] (1e+04,5e+04] (-1,0]	1	0.002
(5e+03,7e+03] (2e+03,5e+03] (-1,0]	2	0.004
(5e+03,7e+03] (2e+03,5e+03] (1e+03,2e+03]	1	0.002
(5e+03,7e+03] (2e+03,5e+03] (2e+03,5e+03]	3	0.006
(5e+03,7e+03] (2e+03,5e+03] (5e+03,5e+04]	1	0.002
(5e+03,7e+03] (500,1e+03] (-1,0]	1	0.002
(5e+03,7e+03] (5e+03,1e+04] (-1,0]	2	0.004
(5e+03,7e+03] (5e+03,1e+04] (0,500]	1	0.002
(7e+03,1e+04] (-1,0] (2e+03,5e+03]	2	0.004
(7e+03,1e+04] (-1,0] (5e+03,5e+04]	4	0.008
(7e+03,1e+04] (100,500] (-1,0]	1	0.002
(7e+03,1e+04] (100,500] (2e+03,5e+03]	3	0.006
(7e+03,1e+04] (100,500] (5e+03,5e+04]	2	0.004
(7e+03,1e+04] (1e+03,2e+03] (2e+03,5e+03]	1	0.002
(7e+03,1e+04] (1e+03,2e+03] (500,1e+03]	1	0.002
(7e+03,1e+04] (1e+03,2e+03] (5e+03,5e+04]	1	0.002
(7e+03,1e+04] (1e+04,5e+04] (-1,0]	1	0.002
(7e+03,1e+04] (2e+03,5e+03] (-1,0]	3	0.006
(7e+03,1e+04] (2e+03,5e+03] (500,1e+03]	1	0.002
(7e+03,1e+04] (2e+03,5e+03] (5e+03,5e+04]	2	0.004
(7e+03,1e+04] (500,1e+03] (2e+03,5e+03]	1	0.002
(7e+03,1e+04] (500,1e+03] (5e+03,5e+04]	2	0.004
(7e+03,1e+04] (5e+03,1e+04] (0,500]	1	0.002
(7e+03,1e+04] (5e+03,1e+04] (1e+03,2e+03]	1	0.002
(7e+03,1e+04] (5e+03,1e+04] (5e+03,5e+04]	1	0.002

The table above shows the final distribution across the three variables that we will replicate for the control group. The following piece of code samples out customers from the control group in the same frequency distribution as above:

#### Generate lookalikes

control_n = 500 # desired number of customers in the control group
j = 0 # counter
df_control = data.frame() # dataframe where control group will be formed
for (i in target_profile$combined_bin){
  j=j+1 
  print(j)
  df_control_bin <- df[(control_n+1):nrow(df),] %>% filter(combined_bin==i)
  df_control_bin_sample <- sample(1:nrow(df_control_bin),
                            size= as.integer(target_profile %>%
                                               filter(combined_bin==i) %>%
                                               select(bin_wt)*control_n), replace = TRUE)
  df_control_bin_sample <- df_control_bin[df_control_bin_sample,] # pick actual rows from sample index values
  df_control <- rbind(df_control,
                      df_control_bin_sample) # keep adding the main control group
                                              #data frame with each iteration
}

A point to note here is that we are sampling with replacement as there are some bins where the number of customers required for the control group are not sufficient to match the bin weight of the target group. Therefore, for some bins, customers are counted more than once (there are 478 distinct customers forming the 500 rows control group). For larger datasets, this problem would not arise and sampling without replacement should be used.

Finally, we can match the distribution across the three variables for both groups and verify that they match: