How to convince AI to help us boost our profit
Recently, we were challenged to build an ML model that would boost a company's revenue by increasing click-through and conversion rates. The company runs a website where users can book properties. The approaches they had already tried, even though very reasonable, weren't able to produce the expected results. They needed a fresh look at the problem, and that is how they found us.
Picking the best image
We started with the click-through rate problem: for each property, we needed to pick which of two images to present as its thumbnail. Let's look at an example pair:
It is hard to tell, just by looking at the images, which one would be more attractive to potential clients. This is precisely the challenge AI should help us with!
Apart from images, we had additional information about the properties, such as price, region, type of property, max pets, Wi-Fi and parking availability, etc. In total, we had 49 columns.
To summarize the problem: we have two images plus additional data about a property, and we need to decide which image will have the higher click-through rate. We are given historical CTRs for the images from A/B testing. The graphic below presents the highest-level view of the process:
Data preparation
We started by analyzing the additional property data to pick the best features. We selected 20 features (based on their correlation with the target metric) and applied the following preprocessing (a minimal sketch follows the list):
- standardization for numerical features
- one hot encoding for categorical features
- mean imputation for numerical features
- most frequent imputation for categorical data
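Below is a minimal sketch of this preprocessing step, assuming scikit-learn; the column names are hypothetical placeholders, not the actual dataset schema.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_columns = ["price", "max_pets"]          # placeholder names
categorical_columns = ["region", "property_type"]  # placeholder names

numerical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),           # mean imputation
    ("scale", StandardScaler()),                          # standardization
])
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # most frequent imputation
    ("encode", OneHotEncoder(handle_unknown="ignore")),   # one hot encoding
])

preprocessor = ColumnTransformer([
    ("num", numerical_pipeline, numerical_columns),
    ("cat", categorical_pipeline, categorical_columns),
])
# features = preprocessor.fit_transform(properties_df)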
Regression model
In the first approach, we built a model that takes an image and the additional data and predicts the click-through rate (CTR). The data came from SnapTrip, which had conducted A/B tests, showing different webpage versions to different user groups to measure which one performs better. Their previous models were likely ineffective, which led to inaccurate results.
For the images, we used a pre-trained EfficientNetB4 network and fine-tuned it. The categorical data was processed through a five-layer fully connected network. The outputs were concatenated at the end. We used mean squared error as the loss function and performed hyper-parameter tuning to optimize the model architecture. All images were processed through an augmentation layer with standard operations. Here is a code snippet:
import tensorflow as tf

def augment(image):
    # Standard augmentation: random flip plus random colour jitter.
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_hue(image, 0.08)
    image = tf.image.random_saturation(image, 0.6, 1.6)
    image = tf.image.random_brightness(image, 0.05)
    image = tf.image.random_contrast(image, 0.7, 1.3)
    return image
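For completeness, here is a rough sketch of how such a two-branch regression model could be wired together. The layer sizes, input resolution, and feature count are illustrative assumptions rather than the production configuration.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

IMAGE_SIZE = 380      # assumed input resolution for EfficientNetB4
NUM_TABULAR = 20      # number of selected tabular features

# Image branch: pre-trained EfficientNetB4 backbone.
image_input = keras.Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 3), name="image")
backbone = keras.applications.EfficientNetB4(include_top=False, weights="imagenet")
x = backbone(image_input)
x = layers.GlobalAveragePooling2D()(x)

# Tabular branch: a small fully connected network (five dense layers).
tabular_input = keras.Input(shape=(NUM_TABULAR,), name="tabular")
t = tabular_input
for units in [128, 64, 32, 16, 8]:   # sizes are illustrative
    t = layers.Dense(units, activation="relu")(t)

# Concatenate both branches and regress the CTR.
combined = layers.Concatenate()([x, t])
ctr_output = layers.Dense(1, name="ctr")(combined)

model = keras.Model(inputs=[image_input, tabular_input], outputs=ctr_output)
model.compile(optimizer="adam", loss="mse")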
Using this approach, we got the following result. The accuracy is calculated as the proportion of correctly selected images, where the correct image is the one with the higher CTR: we ran both images through the model and picked the one with the higher predicted CTR (a small sketch of this calculation follows the table below).
Model name | Accuracy |
Regression model | 0.9208 |
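For reference, the pairwise accuracy described above can be computed like this (a sketch; the variable names are assumptions):

import numpy as np

def pairwise_accuracy(pred_ctr_1, pred_ctr_2, true_ctr_1, true_ctr_2):
    # The model "wins" a pair when it predicts a higher CTR
    # for the image that actually had the higher CTR.
    predicted_first = pred_ctr_1 > pred_ctr_2
    actually_first = true_ctr_1 > true_ctr_2
    return np.mean(predicted_first == actually_first)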
The visualization of this approach looks like this:
Classification model
In the second approach, we used both images as input (together with the additional data), as we expected that seeing both images could help the model make better decisions. The loss function was no longer mean squared error but standard cross entropy, since we treated this as a classification problem in which the model had to decide whether the first or the second image should be selected. The output consisted of two neurons: when the first neuron was activated, the first image should be picked, and vice versa. We performed the same image augmentation as in the previous approach and, additionally, swapped the image positions in every case, so each pair was presented to the model twice, in both orders.
We created a dedicated Keras layer for the images that looks like this:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


class EfficientNetV2B1_TF(layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        image_size = 260
        # Pre-trained backbone, shared by both images and frozen by default.
        self.base_model = keras.applications.EfficientNetV2B2(
            include_top=False,
            weights='imagenet',
            input_shape=(image_size, image_size, 3),
        )
        self.base_model.trainable = False
        self.global_pooling = tf.keras.layers.GlobalAveragePooling2D()
        # Classification head over the concatenated embeddings (2 x 1408 features).
        self.embedding_model = keras.Sequential(
            [
                layers.Input(shape=(2816,)),
                tf.keras.layers.Dropout(0.25),
                layers.Dense(2),
            ]
        )

    def set_fine_tuning(self, fine_tune_at=3):
        # Unfreeze only the last `fine_tune_at` layers of the backbone.
        self.base_model.trainable = True
        for layer in self.base_model.layers[:-fine_tune_at]:
            layer.trainable = False

    def call(self, image_1, image_2):
        image_1_base = self.base_model(image_1)
        image_2_base = self.base_model(image_2)
        embb_1 = self.global_pooling(image_1_base)
        embb_2 = self.global_pooling(image_2_base)
        images = tf.concat([embb_1, embb_2], axis=1)
        output = self.embedding_model(images)
        return output
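The position-swapping augmentation mentioned above can be implemented by duplicating every training pair with the images and the label reversed. A minimal sketch, assuming batched image tensors and one-hot labels:

import tensorflow as tf

def add_swapped_pairs(images_1, images_2, labels):
    # labels are one-hot: [1, 0] means "pick the first image", [0, 1] the second.
    swapped_labels = tf.reverse(labels, axis=[1])            # flipping the order flips the label
    all_images_1 = tf.concat([images_1, images_2], axis=0)
    all_images_2 = tf.concat([images_2, images_1], axis=0)
    all_labels = tf.concat([labels, swapped_labels], axis=0)
    return all_images_1, all_images_2, all_labels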
Using this approach, we got the following result, which is better than the one from the regression model.
Model name | Accuracy |
Classification model | 0.9411 |
The visualization of this approach looks like this:
Having developed those models, we decided to conduct an A/B test for the classification model. After deploying it to production and waiting a week for results, we observed that the CTR did not increase as much as we expected. Detailed investigation revealed that many images with higher CTRs in our training data now had lower CTRs. The issue was that we received an A/B test dataset from a specific time period, where certain images had higher CTRs and served as our reference. We trained the model to discover why some images were better than others, and it performed well on that dataset.
After conducting the A/B test again, we found that users often preferred different images than before, rendering our model ineffective. We are not entirely sure why user preferences changed suddenly, but it is likely due to seasonal variations that affected the A/B test data outcomes. Therefore, we recommend clients consider the time sensitivity and changing tastes of their potential customers.
To address this, one of the models we provided did not choose between two images but instead returned a specific CTR for each image. In theory, this model could process all images for a given property and select the one with the highest CTR. We decided to collect data over longer periods and try again in the future.
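With that regression model, serving could look roughly like the sketch below; regression_model, the image list, and the tabular features are assumptions for illustration.

import numpy as np

def pick_best_image(regression_model, property_images, tabular_features):
    # Score every candidate image for a property and keep the one
    # with the highest predicted CTR.
    image_batch = np.stack(property_images)
    tabular_batch = np.repeat(tabular_features[np.newaxis, :], len(property_images), axis=0)
    predicted_ctrs = regression_model.predict([image_batch, tabular_batch])
    return int(np.argmax(predicted_ctrs))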
Beyond A/B tests, customer behavior data would likely help as well: other sites check what users viewed or booked previously and base their recommendations on that.
Note: Access to comprehensive data is crucial for effective modeling and accurate predictions.
Sorting search results
The second problem we wanted to solve using AI was sorting search results. After the user specified which properties they were looking for, for what dates, and for how many people, we presented our offer. The question is which properties to show first: users usually open only the first few results, so we want those to give the best chance of a successful booking.
Data preparation
Here, the data was similar to the previous example. The images and property information were the same, with the addition of textual data in the form of property descriptions. We investigated the tabular property data and selected the 15 features that correlated best with the target value, using simple statistical correlation measures. The preprocessing for this data was the same as in the previous problem.
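The selection itself can be as simple as ranking features by the absolute value of their correlation with the target. A sketch, assuming a pandas DataFrame properties_df with a conversion_rate column (both names are hypothetical):

import pandas as pd

def select_top_features(properties_df, target_column="conversion_rate", k=15):
    # Rank numeric features by absolute Pearson correlation with the target.
    numeric = properties_df.select_dtypes(include="number")
    correlations = numeric.corrwith(properties_df[target_column]).abs()
    correlations = correlations.drop(target_column, errors="ignore")
    return correlations.sort_values(ascending=False).head(k).index.tolist()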
We pre-processed the description data using NLTK, carrying out the following operations (a sketch of this pipeline follows the list):
- lowercasing
- HTML tag removal
- URL removal
- word tokenization
- punctuation removal
- stop words removal
- lemmatization
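A minimal sketch of this text pipeline with NLTK (the helper name and the regular expressions are our own, not the production code):

import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Required NLTK resources (downloaded once):
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_description(text):
    text = text.lower()                                           # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)                          # HTML tag removal
    text = re.sub(r"http\S+|www\.\S+", " ", text)                 # URL removal
    tokens = word_tokenize(text)                                  # word tokenization
    tokens = [t for t in tokens if t not in string.punctuation]   # punctuation removal
    tokens = [t for t in tokens if t not in STOP_WORDS]           # stop word removal
    return [LEMMATIZER.lemmatize(t) for t in tokens]              # lemmatization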
The target value was the historical conversion rate for each property.
Creating models
XGBoost
The first model we created was based on XGBoost. It received only the tabular data from SnapTrip's historical A/B tests as input. We performed hyper-parameter tuning to optimize the following parameters (the numbers represent the best parameters found; a sketch of the resulting model follows the list):
- colsample_bytree=0.7
- learning_rate=0.01
- max_depth=3
- n_estimators=1000
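A minimal sketch of the resulting model, assuming the standard xgboost Python package; the objective and the commented-out training calls are illustrative:

import xgboost as xgb

# The tuned hyper-parameters reported above; everything else is left at its default.
model = xgb.XGBRegressor(
    colsample_bytree=0.7,
    learning_rate=0.01,
    max_depth=3,
    n_estimators=1000,
    objective="reg:squarederror",
)
# model.fit(X_train, y_train)
# predictions = model.predict(X_valid)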
Here is the result we got with this approach:
Model name | Mean Squared Error |
XGBoost conversion rate | 41 |
Neural network
We started creating the second model, a neural network, by combining the tabular and textual data; we decided not to complicate it by including the images from the beginning. As we didn't have that much data (around 4,000 rows), we started with shallow networks. We used hyperparameter tuning to find the best network architecture. Here is the code snippet with all configurable parameters:
import tensorflow as tf
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Activation, BatchNormalization, Dense, Dropout
from tensorflow.keras.models import Sequential


def build_keras_hp_model(hp):
    model = Sequential()

    # Optional kernel regularization.
    kernel_regularizer = None
    kernel_regularizer_value = hp.Choice('kernel_regularizer', values=[0.01, 0.02, 0.04, 0.06, 0.08])
    kernel_regularizer_type = hp.Choice('kernel_regularizer_type', values=["L1", "L2", "L1L2", "None"])
    if kernel_regularizer_type == "L1":
        kernel_regularizer = regularizers.L1(kernel_regularizer_value)
    elif kernel_regularizer_type == "L2":
        kernel_regularizer = regularizers.L2(kernel_regularizer_value)
    elif kernel_regularizer_type == "L1L2":
        kernel_regularizer = regularizers.L1L2(kernel_regularizer_value)

    # Optional activity regularization.
    activity_regularizer = None
    activity_regularizer_value = hp.Choice('activity_regularizer', values=[0.01, 0.02, 0.04, 0.06, 0.08])
    activity_regularizer_type = hp.Choice('activity_regularizer_type', values=["L1", "L2", "L1L2", "None"])
    if activity_regularizer_type == "L1":
        activity_regularizer = regularizers.L1(activity_regularizer_value)
    elif activity_regularizer_type == "L2":
        activity_regularizer = regularizers.L2(activity_regularizer_value)
    elif activity_regularizer_type == "L1L2":
        activity_regularizer = regularizers.L1L2(activity_regularizer_value)

    # Stack of tunable blocks: Dense -> (BatchNorm) -> Activation -> Dropout -> (BatchNorm).
    for i in range(hp.Int('layers', 4, 12)):
        model.add(Dense(
            hp.Choice('units', [50, 60, 70, 80, 90, 100]),
            kernel_initializer=hp.Choice(
                'kernel_initializer',
                ['glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform',
                 'lecun_normal', 'lecun_uniform', 'random_normal']),
            kernel_regularizer=kernel_regularizer,
            activity_regularizer=activity_regularizer,
        ))
        if not hp.Boolean("do_batch_norm_after_dropout") and hp.Boolean("do_batch_norm_before_dropout"):
            model.add(BatchNormalization())
        model.add(Activation(hp.Choice('activation', ['elu', 'relu', 'selu', 'gelu', 'leaky_relu', 'swish'])))
        model.add(Dropout(hp.Choice('dropout', values=[0.0, 0.1, 0.15, 0.2, 0.25, 0.3])))
        if not hp.Boolean("do_batch_norm_before_dropout") and hp.Boolean("do_batch_norm_after_dropout"):
            model.add(BatchNormalization())

    # Single-neuron regression head.
    model.add(Dense(1))
    model.compile(
        optimizer=tf.keras.optimizers.Adam(hp.Choice('learning_rate', values=[1e-2, 1e-4, 1e-5, 1e-6])),
        loss='mean_squared_error',
        metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model
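This builder function is meant to be plugged into Keras Tuner. A sketch of how the search could be launched, with hypothetical tuner settings and data variables:

import keras_tuner as kt

# Hypothetical search configuration; x_train / y_train are the preprocessed
# features and conversion-rate targets.
tuner = kt.RandomSearch(
    build_keras_hp_model,
    objective="val_loss",
    max_trials=50,
    directory="tuning",
    project_name="conversion_rate",
)
tuner.search(x_train, y_train, epochs=100, validation_split=0.2)
best_model = tuner.get_best_models(num_models=1)[0]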
The best-performing architecture had 8 layers with 70 neurons each, a small dropout (0.1), and L1 regularization with a value of 0.01. We used mean squared error as the loss function.
Here are the results for this model:
Model name | Mean Squared Error |
Neural network conversion rate | 38 |
Ensemble model
Finally, we combined both models in a simple ensemble: we ran our data through both models and averaged their predictions (see the sketch after the table below). The results of this approach are slightly better than either individual model.
Model name | Mean Squared Error |
Ensemble conversion rate | 36 |
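The ensemble itself is just an average of the two predictions; a sketch, with the model and data names assumed for illustration:

import numpy as np

def ensemble_predict(xgboost_model, keras_model, tabular_features, keras_inputs):
    # Average the conversion-rate predictions of both models.
    xgb_pred = xgboost_model.predict(tabular_features)
    nn_pred = keras_model.predict(keras_inputs).ravel()
    return (xgb_pred + nn_pred) / 2.0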
Results
Our decision to push this model into production was not without its challenges. After a few weeks of A/B testing, we were thrilled to see a conversion rate increase of over 10% on several pages. However, we also encountered a roadblock on two pages, where the target customers had different tastes and needs. To address this, we had to find an alternative approach, which involved developing dedicated models for these pages.
Note: We initially received a single dataset without information on its collection process. For optimal results, it’s important to have separate datasets for different client pages, as preferences can vary significantly. Dedicated models for each page type can help achieve better conversion rates. Therefore, we delivered customized models to address these variations and improve overall performance.