A Simple Sales Model – Kaggle Rossmann Competition
I like Kaggle.com, not for the competitive element, but for the datasets you find there to play with and learn from. I took the data from last autumn’s Rossmann competition and I wanted to show you how far you can get with some old-fashioned modeling. Machine learning is all the rage right now, and for good reason. Most of the top submissions in competitions like the ones hosted by Kaggle likely employ machine-learning methods. But how far can we get with a straightforward model, based on a simple exploration of the data?
In the Rossmann competition we are dealing with sales data. Here’s a small sample of what it looks like.
Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday,WeekNumber
1,5,2015-07-31,5263,555,1,1,0,1,31
2,5,2015-07-31,6064,625,1,1,0,1,31
I have put it through some preliminary preprocessing that made the data purely numerical, and I added the WeekNumber column because I want to use this information in my model. The WeekNumber is simply the number of the week of the year that the record's date falls in.
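As a sketch of that preprocessing, here is how the WeekNumber column can be derived with pandas. I'm using the two sample rows shown above; the real train.csv from the competition is of course much larger.

```python
import io
import pandas as pd

# Two sample rows in the already-numerical format shown above.
csv = """Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
1,5,2015-07-31,5263,555,1,1,0,1
2,5,2015-07-31,6064,625,1,1,0,1"""

df = pd.read_csv(io.StringIO(csv), parse_dates=["Date"])

# Add the ISO week number of each record's date.
df["WeekNumber"] = df["Date"].dt.isocalendar().week
print(df[["Store", "Date", "WeekNumber"]])
```

For 2015-07-31 (a Friday, hence DayOfWeek 5) this yields week 31, matching the sample above.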
Let’s just plot the data and see if we can distinguish any interesting features by eye.
What we have here is all the sales data as a function of time for the first store in the dataset. I’ve aggregated the data by week to clean up the plot. The first thing that jumps out is that this store has a yearly spike in sales in the holiday season. More generally, the data is roughly periodical with the same trend happening every year.
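The weekly aggregation behind that plot can be sketched as follows. A year of synthetic daily data stands in for the real store here; in the actual analysis you would filter the dataset down to one store and resample its Sales column in the same way.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one store's daily sales over a year.
rng = np.random.default_rng(0)
dates = pd.date_range("2013-01-01", "2013-12-31", freq="D")
store1 = pd.DataFrame({"Date": dates,
                       "Sales": rng.integers(3000, 8000, len(dates))})

# Aggregate the daily sales into weekly totals: one data point per week.
weekly = (store1.set_index("Date")["Sales"]
          .resample("W").sum())
# weekly.plot() would then draw the much less noisy weekly series.
print(weekly.head())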
This gives us a first idea for a model. If we want to predict the number of sales on a future date, we can look at that same date in previous years and base a prediction on that. However, since we only have about two and a half years' worth of data, that would not make for very good statistics. Instead, we can look at the average number of sales in the same week of the year. That at least gives us a few more datapoints to average over.
However, such a model would predict the same number of sales no matter what day of the week we are looking at. That is probably not a very good approximation. For example, you might expect many more sales on a Saturday than on a Tuesday.
To deal with that, we can also look at the average number of sales a store makes on any given day of the week.
The plot above shows the number of sales for this store against the day of the week (this particular store is closed on Sundays). The line shows the averages. We see that this average does depend on the day of the week, although for this particular store not too strongly. Still, let’s take it into account. Our model is as follows:
    S(w, d) = S_w + (S_d - S_avg)

What I mean by that is that the sales S(w, d) depend on which week (w) of the year and which day (d) of the week it is. We take the average sales S_w in that week as a basis, and then we modulate it with the average number of sales S_d on that day of the week. For the latter, we need to subtract the overall average number of sales S_avg to get a value that correctly modulates the week's average up or down.
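Assuming the data for one store lives in a pandas DataFrame with the column names used above, fitting and applying the model can be sketched like this (the function names are mine):

```python
import pandas as pd

def fit_simple_model(store_df):
    """Collect the averages the model needs for one store.

    Expects columns Sales, WeekNumber, and DayOfWeek; only days the
    store was open should be included.
    """
    week_avg = store_df.groupby("WeekNumber")["Sales"].mean()
    day_avg = store_df.groupby("DayOfWeek")["Sales"].mean()
    overall = store_df["Sales"].mean()
    return week_avg, day_avg, overall

def predict(week, day, week_avg, day_avg, overall):
    # S(w, d): the week's average, modulated up or down by the day.
    return week_avg[week] + (day_avg[day] - overall)

# Tiny made-up example: two weeks, two days per week.
toy = pd.DataFrame({"WeekNumber": [1, 1, 2, 2],
                    "DayOfWeek":  [1, 2, 1, 2],
                    "Sales":      [10, 20, 30, 40]})
w, d, o = fit_simple_model(toy)
print(predict(1, 1, w, d, o))  # 15 + (20 - 25) = 10.0
```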
This model is extremely simple, utilizing only some very simple statistics and a bit of common sense. This simplicity is the beauty of it and it makes for a very understandable model. But of course, we have to see how well it actually performs.
To do so, I gathered all the averages needed and made predictions for all the training data. For each store, I calculated the relative root mean square deviation (RMSD) between my predictions and the actual sales values. This is a measure of the typical error in the predictions, relative to the actual value. Averaging over all the stores, I find a deviation of 0.19: the error in the predictions made by this model is typically about 19%. Is that good? Well, let's look at the current leaderboard for this Kaggle competition. (At the time of writing, the competition was still ongoing.) The scores there are calculated on the test data, not on the training data, so we can expect to do a little worse there. We will look at the test data later; for now a rough ballpark comparison will suffice.
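The exact definition of the score is not spelled out above, so treat this as one plausible reading: a root mean square of the per-day relative errors, skipping closed days (zero sales) to avoid dividing by zero.

```python
import numpy as np

def relative_rmsd(actual, predicted):
    """Relative root mean square deviation between predictions and actuals.

    Days with zero sales (store closed) are excluded.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = actual > 0
    rel_err = (predicted[mask] - actual[mask]) / actual[mask]
    return np.sqrt(np.mean(rel_err ** 2))

# Two days off by +10% and -10%, one closed day ignored -> 0.1
print(relative_rmsd([100, 200, 0], [110, 180, 0]))
```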
The current top 10 entries manage to reach an error of 9-10%. That's significantly better than our 19%: those models perform about twice as well as ours does.
However, this factor of 2 is actually relatively small if we consider once more how simple our model is. It took minutes to come up with; a few hours of work to clean up the data, extract our parameters, and code up the model; and a negligible amount of CPU time to execute. I’d say that 19% is not a bad result at all.
There should be plenty of room for improvement in such a simple model. Let's look for ways to improve it by examining where it performs poorly. If we look at the RMSD values for individual stores, we find a wide range of values. For some stores, the model works well, reaching an error of only about 8%. At the other end, the worst performance is an RMSD of 2.9, an average error of 290%! Something is going wrong for that store. Let's take a look.
We have again plotted the average sales per day of the week. The spread around the average is huge, and the data for Monday in particular seems to have two peaks, with the average falling right in the gap between them. The average thus poorly describes essentially all of the data, so it's no surprise the model doesn't perform well here. But what's going on?
Here we have the sales data for a number of consecutive Mondays. What's interesting is that the data is really spiky. One Monday, this store has 5000 EUR in sales; the next, it's suddenly twice as high! Luckily there's a reason for this, and it's clear from the dataset. Marked in green are the days when this store has a specific promotion running; on the red Mondays, the promotion is not running. We immediately see where the separation comes from: for this store, the promotion is apparently very successful, leading to a clear split in the sales.
We can accommodate this additional information by taking the average sales not just per day, but separately for the days when this promotion is running and when it is not, giving us 14 parameters instead of 7.
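Splitting the day-of-week averages by the Promo flag is a one-line groupby in pandas. The numbers below are made up to mimic the bimodal Mondays:

```python
import pandas as pd

# Made-up sample: Mondays (DayOfWeek 1) split cleanly by the Promo flag.
sample = pd.DataFrame({
    "DayOfWeek": [1, 1, 1, 1, 2, 2],
    "Promo":     [1, 1, 0, 0, 1, 0],
    "Sales":     [10000, 9800, 5100, 4900, 9000, 5000],
})

# 14 parameters instead of 7: one average per (day, promo) combination.
day_promo_avg = sample.groupby(["DayOfWeek", "Promo"])["Sales"].mean()
print(day_promo_avg[(1, 1)])  # Monday with the promotion running: 9900.0
print(day_promo_avg[(1, 0)])  # Monday without it: 5000.0
```

The prediction step then looks up the average for the day and the promo state of the date being predicted, instead of the day alone.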
There we go, much better. We now have averages that correctly approximate their respective classes. This refinement gets rid of all the very poorly performing stores and pulls the overall RMSD on the training data down to 15%. A significant improvement, again achieved with some simple common sense. But of course, the real test is the test data. A prediction with this extremely simple model was submitted to the Kaggle competition and finished with a score of 0.16872. That's nowhere close to winning the competition – the final score of the winner was 0.10021 – but it goes to show how far you can get in a short amount of time with a very simple model.