With COVID-19 keeping us all at home, I found myself with almost 2 weeks of vacation time to use up in December 2020. Which I didn’t mind – I like longer staycations, when I can really get into personal projects.
So, when I came across a Reddit post about a competition to predict motor insurance claims, I jumped right in. I’d always been curious about how insurance pricing is done. I suppose I see it as “a path I could have taken”, given I studied statistics.
The training data consisted of 228k annual claims records covering 57k policies over 4 years, with just 23 (anonymized) features to use for prediction.
The main objective was to achieve maximum profit in a market consisting of the rest of the competitors. A profit leaderboard was updated weekly (on different test samples), while immediate feedback was available on a claims RMSE leaderboard. Importantly, competing on the RMSE leaderboard was a simpler matter of predicting claim amount, while the profit leaderboard (the ultimate goal) involved a huge element of pricing strategy.
We can see from the figures above that the distribution is very right-skewed. Only 10.2% of all the claims are non-zero, and more than half of those are sub-$1000. A minuscule 0.18% of non-zero claims are over $10,000.
This makes it a bit tricky to predict claims (at least, on an RMSE basis), as non-zero claims don’t represent a large part of the data set AND there’s a handful of massive claims that can easily blow up a squared error.
Partway through the competition, they announced that claims in the test set would be capped at $50,000 (that is, we have reinsurance), so that decreased the skew.
I didn’t really keep track, but most of the time I spent on this competition was front-loaded with minor to moderate tweaking afterwards. I estimate:
Github repo – coming soon!
Like many other competitors, I made the mistake of spending too much time on the RMSE leaderboard, even after it became clear within weeks that there was little headway to be made in trying to more accurately predict claim amount. While the RMSE leaderboard is no longer available, the situation was such that a model that simply predicted the average claim amount seen in training performed around 504 RMSE (IIRC), while the best sophisticated model anyone could come up with performed around 497 RMSE (or was that 499.7?).
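For reference, the mean-only baseline is literally a one-liner. Here's a tiny sketch of it, using synthetic zero-inflated claims purely as a stand-in for the real training data (so the printed RMSE won't match the leaderboard numbers above):

```python
import numpy as np

# Toy illustration of the mean-only baseline: predict the training-set average
# claim for every policy. The data below is synthetic and only for demonstration.
rng = np.random.default_rng(0)
y = rng.choice([0.0, 800.0, 20_000.0], size=10_000, p=[0.898, 0.1, 0.002])

baseline = y.mean()                              # the same prediction for everyone
rmse = np.sqrt(np.mean((y - baseline) ** 2))
print(f"mean-prediction baseline RMSE: {rmse:,.1f}")
```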
I actually tried a couple things here that mirrored what some of the top competitors did:
As neither of these approaches improved my RMSE, I discarded them fairly early on and didn’t think to revisit them once I started working on pricing…
My final claims-prediction model consisted of a GBM (CatBoost) trained with hyperparameters that I tuned with cross-validation outside of Google Colab (the latter being the only submission method I could get to work). I cleaned up the features a bit and coded a few dummy variables for missing values, but otherwise left the training data as it was. Even though reinsurance capped claims at $50k, I opted to cap claims during training to $10k (only 41 claims were larger than this). My best score on the RMSE leaderboard was 500.79. :(
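For concreteness, here's a minimal sketch of the kind of pipeline I mean. The column names, file name, and hyperparameter values below are placeholders (the real features were anonymized, and my actual hyperparameters came out of cross-validation), so treat this as illustrative rather than my exact submission:

```python
import pandas as pd
from catboost import CatBoostRegressor

# Hypothetical file/column names; the competition's actual features were anonymized.
df = pd.read_csv("training.csv")

# Flag missing values with dummy columns, then fill the numeric gaps.
for col in ["vh_age", "vh_value"]:               # hypothetical numeric features
    df[f"{col}_missing"] = df[col].isna().astype(int)
    df[col] = df[col].fillna(df[col].median())

# Cap training claims at $10k, even though the test set was capped at $50k.
y = df["claim_amount"].clip(upper=10_000)
X = df.drop(columns=["claim_amount"])

cat_features = [c for c in X.columns if X[c].dtype == "object"]

model = CatBoostRegressor(
    loss_function="RMSE",
    depth=6,                  # illustrative values; mine were tuned via cross-validation
    learning_rate=0.05,
    iterations=2000,
    verbose=False,
)
model.fit(X, y, cat_features=cat_features)
```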
Many people commented on how challenging profit-optimization was, given that the leaderboard was only updated weekly AND you were competing against others who constantly updated their pricing strategy. It was a highly reflexive, rapidly-changing environment. Unlike real life, you had zero visibility into the premia others were charging. Furthermore, the private test data changed every week, and you were only allowed a single submission each time, making it difficult to compare strategies. There were wild gyrations in weekly ranking even when people did not change their models at all – you could go from -$50k average profit to rank 1 with $20k profit, just because of how the test data changed!
Honestly, it was a pretty brutal ask. The folks at AICrowd did create a personalized report each week for our profit leaderboard submission, so we would have some idea of what our algos were doing and where they might be failing.
I realized, along with probably everyone else, that the biggest danger and profit-killer was in accidentally offering the lowest premium (in a market of 10 randomly selected competitors) for a really large claim. My strategy was essentially to price policies based on the decile of predicted claims that they fell into, with higher multipliers for higher deciles. I did a bit of fine-tuning on the weekly basis to increase my premia for demographics that were associated with my lowest profits, and increase my minimum premium offered to offset inevitable claims, but did not change my underlying claims-prediction model.
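Roughly, the decile idea looks like the sketch below. The multipliers, the minimum premium, and the price_policies helper are all illustrative placeholders, not my actual submitted values:

```python
import numpy as np
import pandas as pd

# Sketch of decile-based pricing: riskier deciles of predicted claims get a
# larger markup, and a minimum premium offsets the inevitable small claims.
def price_policies(predicted_claims: pd.Series,
                   multipliers=(1.1, 1.1, 1.15, 1.2, 1.25, 1.3, 1.4, 1.5, 1.7, 2.0),
                   min_premium=50.0) -> pd.Series:
    # Assign each policy to a decile of predicted claim amount (0 = lowest risk).
    deciles = pd.qcut(predicted_claims.rank(method="first"), 10, labels=False)
    # Apply the decile's multiplier, then floor the result at the minimum premium.
    premiums = predicted_claims * np.take(multipliers, deciles)
    return premiums.clip(lower=min_premium)
```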
So I did pretty poorly on the final profit leaderboard, and I think it boils down to a couple of things:
Below, we see my final evaluation printout. Basically, I “won” too many large claims, tanking my profit.
AICrowd wrapped up the competition with a Town Hall featuring several presenters from among the top participants, and the presentations were absolutely fantastic. Some things I learned:
1st solution: https://github.com/davidlkl/Insurance-Pricing-Game
2nd solution: https://github.com/glep/pricing_game_submit
3rd solution: https://discourse.aicrowd.com/t/3rd-place-solution/5201
Feature engineering is an incredibly important step to extracting the maximum value from your data. In hindsight, I suspect it’s something I under-emphasize because the type of data I work with at my day job is high-dimensional and not straightforward in interpretation. There’s no way to manually pick out, say, pairs of genes from a pool of 50,000 genes, and multiply or divide or square their expression values in a meaningful way. The high dimensionality means trying every combination is not feasible, at least not with typical sample sizes. Feature engineering ends up taking more the form of dimensionality reduction, be that through PCA, latent factor analysis, gene set enrichment, etc.
Seeing the top contestants manually create obviously meaningful features that a tree-based algorithm would take multiple splits to represent was embarrassingly eye-opening.
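As a toy illustration (with made-up column names, not the competition's actual features), a single engineered ratio can capture in one column what a tree on the raw features would need several axis-aligned splits to approximate:

```python
import pandas as pd

# Toy example of a manual feature: a ratio of two raw columns.
df = pd.DataFrame({
    "vh_value": [12000, 30000, 8000, 45000],   # hypothetical vehicle value
    "vh_age":   [3, 10, 1, 7],                 # hypothetical vehicle age in years
})

# "Expensive car for its age" becomes a single split on one engineered column,
# instead of several splits on the raw vh_value and vh_age columns.
df["value_per_year_of_age"] = df["vh_value"] / (df["vh_age"] + 1)
```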
I’d heard of GAMs before, but never really looked into them. They seem like something that could be useful even for my work, as they don’t introduce too much complexity but are more flexible than GLMs. Importantly, they allow for non-linearity, and there’s really no particular reason to expect gene expression to vary linearly/log-linearly with a given trait.
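Here's a minimal sketch of fitting one, assuming the pygam package and synthetic data, just to show the "one smooth term per feature" idea rather than anything competition-specific:

```python
import numpy as np
from pygam import LinearGAM, s

# Synthetic data with one clearly non-linear effect and one linear effect.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 2))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(scale=0.2, size=500)

# One spline term per feature; gridsearch picks the smoothing penalty (lam).
gam = LinearGAM(s(0) + s(1)).gridsearch(X, y)
gam.summary()
```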
I think the figure here sums up the trade-offs well:
Again, this is one of those things I haven’t done a lot of, because the nature of most biomedical research begets small sample sizes. Small sample sizes mean large confidence intervals, which means it can be difficult to show that your complex model outperforms your simple model at a statistically significant level.
During the Town Hall, one presenter mentioned that “black box” models can be more acceptable to regulators if they are but one component in a stacked model (consisting of more widely accepted models like GLMs), and their relative weighting can be tweaked. I find that pretty awesome, I don’t know why.
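A rough sketch of how I understand that idea: blend a transparent GLM with a "black box" GBM through an explicit, tweakable weight. The data, the Tweedie power, and the weight below are all illustrative assumptions, not anything from the competition:

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor
from catboost import CatBoostRegressor

# Synthetic non-negative "claims" for demonstration only.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = np.maximum(0, X @ rng.normal(size=5) + rng.normal(size=1000))

glm = TweedieRegressor(power=1.5, alpha=0.1).fit(X, y)   # classic actuarial-style GLM
gbm = CatBoostRegressor(iterations=300, verbose=False).fit(X, y)

w = 0.7  # weight on the interpretable component; this is the knob a regulator can see
blended_pred = w * glm.predict(X) + (1 - w) * gbm.predict(X)
```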
ICE plots are similar to partial dependence plots (which show how a model’s predictions change as the value of a particular feature changes), but for every sample individually. Kind of like SHAP values, which I’ve used for GBMs, but for GLMs?
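For what it's worth, scikit-learn can draw ICE curves directly; the sketch below uses synthetic data and a generic regressor just to show the API (kind="individual" gives one curve per sample, kind="both" overlays the PDP average):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

# Synthetic data with a non-linear effect of feature 0.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=500)

model = HistGradientBoostingRegressor().fit(X, y)

# One ICE curve per sample for feature 0, plus the averaged PDP on top.
PartialDependenceDisplay.from_estimator(model, X, features=[0], kind="both")
plt.show()
```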
I think explainable AI/models in general are really important, because no model is perfect and we need to understand what they are doing, why they are doing it, and how to fix them if need be.