Lookalike Model

Introduction

In this project, we explore building a minimal “lookalike” model. The aim of such a model is to find new positive samples among a population of samples of unknown status. A concrete example would be to find new potential customers by targeting a select population among a vast number of possible targets.

One of the characteristics of this type of task is that the available data does not split straightforwardly into positive and negative samples as it would for a simple binary classification task. Instead, the dataset consists of positive samples, in the example above the profiles of the current customers, and samples with unknown labels from which we wish to find other positive samples, such as profiles that would become customers if targeted, say, by our marketing campaign.

One way to approach this problem is to look for similarities between elements of the test sample, i.e. the samples with unknown labels, and the known positive samples. The hope is that these similarities capture enough information to allow for a targeted intervention rather than an indiscriminate one.

Evaluating such a model at training time is difficult as the real performance of the model is available only after intervention.

We will demonstrate how we leveraged nearest-neighbour models to compute these similarities.

The data

The cleaned data consisted of a mixture of categorical columns and numerical columns. We used one-hot encoders to convert the categorical columns into numerical values that can be used in a k-NN model.
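As an illustration, here is a minimal sketch of such an encoding with scikit-learn; the column names, values and the scaling of numerical features are assumptions made for the example, not the project's actual schema.

```python
# Minimal sketch: one-hot encode categorical columns so the data can feed a k-NN model.
# Column names and values are illustrative placeholders, not the project's real schema.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "region": ["north", "south", "south", "east"],
    "segment": ["a", "b", "a", "c"],
    "age": [34, 51, 29, 42],
    "income": [30_000, 52_000, 41_000, 38_000],
})

encoder = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region", "segment"]),
    ("num", StandardScaler(), ["age", "income"]),
])

X = encoder.fit_transform(df)   # numerical matrix usable by a k-NN model
```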

We also used t-SNE visualisations to check the distribution of the data and of the models' predictions.

Model evaluation

In order to evaluate the trained models, we ordered the samples by decreasing predicted probabilities of being a positive sample. When the frequency of positive samples is small, a good model should rank known positive samples among the top.

Such a model can thus be evaluated using a discrete average precision score based on the relative rankings of the positive samples: $$ \frac{1}{N} \sum_{n=1}^{N} \frac{\# \left\{ \mbox{positive samples ranked} \leq n \right\}}{n}. $$

We also introduce a minimal baseline model, which consists of ranking the samples randomly. Any useful model should score better than such a random ranking.
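As a sketch, the score and the random baseline could be computed as follows, reading $ N $ as the total number of ranked samples (an assumption made for the example):

```python
# Sketch of the ranking score above, reading N as the total number of ranked samples,
# together with a random-ranking baseline for comparison.
import numpy as np

def discrete_average_precision(scores, is_positive):
    """scores: predicted probability of being positive; is_positive: 0/1 array."""
    order = np.argsort(-np.asarray(scores))        # decreasing predicted probability
    ranked = np.asarray(is_positive)[order]
    cum_positives = np.cumsum(ranked)              # #{positive samples ranked <= n}
    n = np.arange(1, len(ranked) + 1)
    return float(np.mean(cum_positives / n))

def random_baseline(is_positive, n_draws=100, seed=0):
    """Average score of purely random rankings of the same samples."""
    rng = np.random.default_rng(seed)
    return float(np.mean([
        discrete_average_precision(rng.random(len(is_positive)), is_positive)
        for _ in range(n_draws)
    ]))
```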

Models

The models that we trained are based on scikit-learn's k-NN classifier. They were fine-tuned using Optuna, optimising over the number of neighbours, the distance metric and the weighting of the neighbours.
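A minimal sketch of such a search is shown below; the toy dataset, the search ranges and the use of cross-validated average precision as the objective are assumptions made for the example, not the project's actual settings.

```python
# Sketch of an Optuna search over the k-NN hyper-parameters mentioned above.
# The toy dataset, search ranges and scoring choice are illustrative assumptions.
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=0)

def objective(trial):
    model = KNeighborsClassifier(
        n_neighbors=trial.suggest_int("n_neighbors", 3, 100),
        metric=trial.suggest_categorical("metric", ["euclidean", "manhattan", "cosine"]),
        weights=trial.suggest_categorical("weights", ["uniform", "distance"]),
    )
    return cross_val_score(model, X_train, y_train, cv=5, scoring="average_precision").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```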

The first model we produced was a k-NN model trained on the encoded features. To improve its performance, we tried different strategies, each of which involved binning the numerical features:

  • Using weight of evidence features instead of the original features;
  • Combining the original features with the weight of evidence values.
We also trained a model using only the binned features to assess the extent of the improvement obtained from using weight of evidence.
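The binning itself can be done, for instance, with scikit-learn's KBinsDiscretizer; the number of bins and the quantile strategy below are illustrative choices, not necessarily those used in the project.

```python
# Sketch: bin a numerical feature before computing weight of evidence.
# Ten quantile bins is an illustrative choice, not the project's actual setting.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.5, size=(1000, 1))     # toy numerical feature

binner = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
income_binned = binner.fit_transform(income)                   # each value becomes a bin index
```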

Weight of evidence and information value

The weight of evidence (WoE) is a real number computed from the difference between the distributions of a categorical variable over the positive and negative sample sets. It can also be computed for numerical variables after applying some binning. The weight of evidence of each value $ c $ of a categorical variable is given by the formula: $$ \operatorname{WoE}(c) = \log \frac{p_c}{q_c}, $$ where $ p_c $ and $ q_c $ are the respective frequencies of category/bin $ c $ among the positive and negative samples.

The information value of a variable is computed from the weight of evidence $$ \operatorname{IV} = \sum_{c} (p_c - q_c) \operatorname{WoE}(c), $$ where the sum runs over all possible values of the variable. This is a non-negative number that measures how different the distributions of positive and negative samples are for that variable. If it is zero, the variable is uninformative and can be discarded. Computing the information value of each feature is thus a simple way of selecting important features.
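As a sketch, the weight of evidence and information value of a single categorical or binned feature can be computed as follows; the small smoothing constant is an assumption added to avoid division by zero for empty categories.

```python
# Sketch: weight of evidence and information value of one categorical/binned feature.
# The eps smoothing is an assumption to guard against empty categories.
import numpy as np
import pandas as pd

def woe_iv(feature, label, eps=1e-6):
    """feature: categorical or binned values; label: 1 for positive, 0 otherwise."""
    counts = pd.DataFrame({"x": feature, "y": label}).groupby("x")["y"].agg(
        pos="sum", total="count"
    )
    counts["neg"] = counts["total"] - counts["pos"]
    p = counts["pos"] / counts["pos"].sum()      # p_c: frequency among positives
    q = counts["neg"] / counts["neg"].sum()      # q_c: frequency among negatives
    woe = np.log((p + eps) / (q + eps))          # WoE(c)
    iv = float(((p - q) * woe).sum())            # information value
    return woe, iv
```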

One can use the WoE value as follows: instead of using the original features of the dataset, replace them entirely by the corresponding WoE values, so that the dataset consists entirely of numerical features. Replacing the features by their respective WoE values informs the model of how much each feature should influence the predicted outcome. In doing so, the similarity between samples puts more weight on features that carry more information relative to the classification task at hand. An example of this approach can be found in this project, which we used as inspiration.
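Using the `woe_iv` sketch above, the replacement could look as follows (again a sketch; the helper and the column handling are assumptions made for the example):

```python
# Sketch: replace each categorical/binned column by its WoE value,
# reusing the woe_iv helper sketched above.
def to_woe_features(df, label, columns):
    out = pd.DataFrame(index=df.index)
    for col in columns:
        woe, _ = woe_iv(df[col], label)
        out[col] = df[col].map(woe)      # every category becomes its WoE value
    return out
```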

In our context, we do not have negative samples and instead use the unknown/test samples as a replacement for the negative class.

Replacing the features by their respective weight of evidence values did improve the performance of the models. But because this relied on binning the numerical features, we wanted to check how much the binning impacted the score on its own. As visible in the score plot below, binning alone degraded the score of the model compared to the original encoding, whilst the use of weight of evidence features improved it.

A model combining the original encoded features with the weight of evidence values did worse than the weight-of-evidence-only model, but still outperformed the baseline k-NN model.

Predictions

We plotted the distributions of different sample sets using a two-dimensional t-SNE projection. Our aim was to assess the validity of the model predictions by comparing the distributions of potentially negative and positive samples.

In this figure, we can see the distributions of the samples in the test set, the positive set, and the samples that each model classified as positive but that were not in the original positive sample set. The models under consideration here are the k-NN model with no feature engineering and the WoE model.

These plots show that the distributions of the predictions of both models differ from the distribution of the test samples. While the model with no feature engineering already shows a distribution similar to that of the positive samples, the resemblance is stronger for the WoE model, whose predictions cover a wider part of the support of the positive sample distribution and seem to capture its peaks.
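A plot of this kind can be produced along the following lines; the random feature matrices below are stand-ins for the encoded test, positive and predicted-positive sets.

```python
# Sketch: project the different sample sets to 2D with t-SNE and overlay them.
# The random matrices below are stand-ins for the real encoded feature sets.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_test = rng.normal(size=(300, 8))
X_positive = rng.normal(loc=1.0, size=(100, 8))
X_predicted = rng.normal(loc=0.8, size=(50, 8))

embedding = TSNE(n_components=2, random_state=0).fit_transform(
    np.vstack([X_test, X_positive, X_predicted])
)

splits = np.split(embedding, [len(X_test), len(X_test) + len(X_positive)])
for points, label in zip(splits, ["test", "positive", "predicted positive"]):
    plt.scatter(points[:, 0], points[:, 1], s=5, alpha=0.5, label=label)
plt.legend()
plt.show()
```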

Conclusion

We have used different strategies to create a basic lookalike model, using tools such as the weight of evidence and visualisations. This is only the beginning. The set of selected features here was relatively small and could be extended and refined using the information value. Explainability could also be added to the models, which could help build a deeper understanding of the set of positive samples.