Predicting new subscriptions with Machine Learning & Magicsheets

In this article we learn how to build a simple time series model in Python to predict the number of new newsletter subscribers in Mailchimp. We then demonstrate how this process can be greatly simplified with Magicsheets.

  • NOTE 1: You can directly implement the code in this notebook yourself: just go to the link below
  • NOTE 2: This notebook can be used directly with your own Mailchimp dataset. Just save it down on your laptop together with your own dataset called data.csv, and run the code again.

Motivation

Regardless of whether you

  • are a subscription-based business,
  • manage a Slack channel,
  • manage your start-ups newsletter,
  • etc….

one of your key performance metrics to think about is new subscriber numbers.

For any community manager, knowing how big the new user traffic in the coming days is likely to be is key for effective management of resources and efficient use of your time.

Getting your data into nice charts for analysis is one thing, but how do we predict the new subscriber number?

One way to do this is to use a time series model.

Our model will learn to spot a pattern in the number of daily signups and extend this pattern by the extra prediction period of 7 days.

Let’s walk through an example to understand what this means.

The Mailchimp subscriber forecasting problem

We will try to answer the question:

How many new newsletter subscribers can I expect to get in the next week?

We will do so by applying Machine Learning to find a pattern in our Mailchimp subscriber dataset and extend this pattern to formulate a prediction. (You can read more about Mailchimp and what it can do on their website.)

First, let’s get the past subscriber dataset for the past 3 months
from our Mailchimp. (You can read about exporting your contacts here.)

Export your Mailchimp dataset and save it in the same folder as this Jupyter notebook. Alternatively, you can save it down to the cloud using e.g. Github or Google Drive. In this tutorial, we are using a “fake” (randomly generated) Magicsheets.io’s Mailchimp newsletter subscriber dataset.

Once we have the dataset saved down in the same folder, we load it using the Pandas package.

Mailchimp provides us with tonnes of useful information, such as he IP address of the subscriber, the location of the IP address, the subscriber's time zone, etc.

For us, however, the only useful piece of information for now will be the time when the person actually became a subscriber.

The key column for us is thus “CONFIRM_TIME” column which tells us when the subscriber became a subscriber.

Once we isolated the desired “times” dataset, we want to group the subscription numbers by date.

That is, we want to know how many new subscribers there were on each day.

This shows you the total of new subscribers on each day.

CONFIRM_TIME  count
0 2021-07-26 46
1 2021-07-27 59
2 2021-07-28 50
3 2021-07-29 57
4 2021-07-30 62
.. ... ...
87 2021-10-21 55
88 2021-10-22 56
89 2021-10-23 44
90 2021-10-24 56
91 2021-10-25 65

Visualizing the dataset

Let’s plot the new subscriber counts as a function of time. We will use the famous matplotlib library for it.

This looks… messy, but some simple periodic patterns can be spotted even with “naked eye”.

Hopefully Machine Learning can help us make sense of these patterns and actually predict the number of subscribers we should expect over coming days.

Regardless of what model we train in the end, we need to be able to assess how good it is.

A reliable way to do this is to train your model only on *some* available data, and then evaluate it on the rest.

This way,

  • the model first trains on *most* of the information, for example 70% of the dataset.
  • you then ask the model: what do you predict for the remaining 30% of the dataset?
  • this prediction you can now compare with the real data that you kept hidden away from the model at training.

For example, if you have 100 days’ worth of data, you can train the model on the first 70 days. You then look at what this model predicts for the remaining 30 days, and compare this prediction with the actual, true subscriber data for this period.

Let’s split the dataset. We can use a simple, predefined function, imported from the sktime library like so:

The predictive model: first attempt

We now build the time series model. This can also be loaded directly from sktime library.

The model we are going to use is called Exponential Smoothing. You can read more about the theory behind it here.

Training the model

We train our model using the time series dataset, by directly applying the *fit* function, and feeding the *count* column of the times_day_count variable to the model.

We will now generate predictions for the period that the model was not trained on.

This would be the remaining days corresponding to the data points in data_set.

We first need to tell Python that the remaining data points belong to the Forecasting Horizon which is a fancy name for the prediction period.

We now need some way of telling “how far” the values the model predicts for the prediction period are from the real number of new subscribers on those dates.

Let’s first compare these two visually, by plotting them on the same graph.

The model as is not extremely helpful: there is no variation in the predictions. However, it does predict the average future value of the future predictions, as we can validate:

mean of the actual data:  54.47826086956522
prediction value: 69 54.304351
70 54.304351
71 54.304351
72 54.304351
73 54.304351
74 54.304351
75 54.304351
76 54.304351
77 54.304351
78 54.304351
79 54.304351
80 54.304351
81 54.304351
82 54.304351
83 54.304351
84 54.304351
85 54.304351
86 54.304351
87 54.304351
88 54.304351
89 54.304351
90 54.304351
91 54.304351
dtype: float64

Let’s quantify how “bad” this model is so we can compare with other models soon.

A useful metric to quantify how good our predictions is called MSE (mean squared error). MSE can be high or low for good and bad models, but as a rule of thumb the lower it is, the better.

You can read more about MSE here.

51.323250355315075

Improving the predictive model

Before you throw your model (Exponential Smoothing model, in our case) out of the window, one way to make it work better and product more reliable predictions is to adjust the model’s hyperparameters.

You can think of the model as a *general framework*, and the hyperparameters as defining the actual shape of the model. For example, a “parabola model” is any function in the form y=ax²+bx+c. Different values of hyperparameters a, b, c will give you different shapes of the parabolic family.

In practice, we might have to spend considerable amount of time picking these and experimenting with different setups, and this process can be indeed daunting (that’s why data scientists are so well paid!).

Certain things happen periodically. We might for example know that new subscribers are more likely to sign up for our newsletter on Mondays, Tuesdays and Wednesdays, and much less likely to join e.g. on weekends.

We can communicate this to our model to incorporate this periodicity by specifying the appropriate hyperparamter, like so:

There are many more hyperparameters that you can specify and use to boost your model for better predictions. You can view the full list here.

Let’s see how this new, improved model does when we train it:

This already seems to capture more of the variability in the dataset.

We can also compare the MSE (error value) for model_2 and the previous model that did not look at periodicity at all:

the new model's error is =  47.13314112736905
the new model's error is a 8.164154060659747 % improvement over the old model

Adding periodicity already improves the predictive power of the model!

In practice, we would have to spend much more time to understand and experiment with various hyperparameter setup combinations until we get a great model that produces even more satisfying results.

(In comparison, using Magicsheets this process would be 100% automated.)

Predicting new Mailchimp subscriber numbers

We can now finally move on to predicting the new Mailchimp subscriber numbers that we originally set out to do.

Here we are focusing on a 1 week prediction, but you can use any number of days you wish.

Note: The farther ahead you are looking, the less accurate the predictions will be. You should also be careful when picking the size of the prediction period relative to your dataset size. For example, if you have a dataset encompassing 3 months, a week’s worth of predictions might make sense, but making a 5-year forecast based on the same data is probably not going to be very accurate!

day   prediction
69 51.888901
70 53.800013
71 57.400023
72 53.500012
73 55.100009
74 56.100020
75 52.100006

Because we are dealing with integer (non-fraction) things (number of new subscribers) these numbers should be rounded to nearest integer, so that 1.96≈2.0, 1.01≈1.0, etc.

print(np.floor(predictions))
day prediction
69 51.0
70 53.0
71 57.0
72 53.0
73 55.0
74 56.0
75 52.0

Finally, we can plot these predictions as a forecasting extension of the dataset.

Automating the subscriber predictions with Magicsheets

We have gone through a lot here, so let’s sum it all up. In order to get Mailchimp new subscriber volume predictions, we had to go through the following steps:

  1. Log into your Mailchimp, find your Audience signup data and download the dataset to your computer.
  2. Load the dataset into Python with Pandas.
  3. Identify and select the relevant data columns.
  4. Build the time series model (in our case, this is Exponential Smoothing model).
  5. Train the model on the training set.
  6. Test the model on the testing set.
  7. Calculate the MSE, plot the predicitons, and adjust hyperparameters.
  8. Repeat 4→7 until you are happy with your model’s MSE.
  9. Make the predictions
  10. The next day (or week, or month): rinse, wash, repeat!

This looks — and indeed is — daunting.

That’s why we built Magicsheets, which simplifies this process to just 2 steps:

  1. Log into your Mailchimp and select audience for predictions
  2. Re-train the model and generate new predictions any time you want in a dedicated Slack channel with typing a simple command ‘/run-magicpipe’

Apart from being much faster than building the model yourself, Magicsheets also automatically selects the most suitable hyperparameter setup for your dataset, so you can have confidence in your predictions.

You can watch a short demo of how Magicsheets works here.

We are building a low-code version of the product that you can deploy in ~5min if you know basic Python right now. If you would like it, join the waitlist.

Adding magic to your spreadsheets, one click at a time.