Building a predictive model — or any complex actuarial model — is a big job. And it leaves you with a big question: Can the model do the work it was created to do?
Answering that question was the focus of a session titled, prosaically, “How to Pick a Better Model,” at the CAS Ratemaking and Product Management Seminar in San Diego in March.
Actuaries started building predictive models about 20 years ago to develop rating plans. Now models are spreading to help underwriters, adjusters and other insurance personnel work more efficiently, and actuaries are at the forefront of building those models.
A model needs to be tested to ensure that it doesn’t have fatal flaws, said Hernan Medina, senior principal data scientist at ISO. Among other problems, a model might:
- Be underfit, poorly explaining the data it is modeling.
- Be overfit, explaining the data it is modeling well, but failing to predict accurately when given new data.
- Not perform as well as models already in use.
Medina and Dan Tevet, who is an actuary at Liberty Mutual Insurance, discussed the three key issues that tell whether a model is doing its job well:
- Lift — can the model distinguish between good risks and bad risks?
- Goodness of fit — does the model do a good job of explaining the data collected?
- Stability — is the model likely to stand the test of time?
Medina focused on model lift and Tevet discussed goodness of fit and stability. Hoi Leung, director of predictive analytics at AIG, served as moderator.
Models are built based on a set of historical data. The analysts noted that to properly build and test a model, the data should be randomly split, usually into two parts.
One part, called the “training data,” is the data the actuary uses to create the model. The other part contains the “validation data,” against which the new model gets its first test run. A good model will predict about as well on the validation data as it did on the training data. Often the data are split into three parts: the training and validation samples are used to fit and compare a few candidate models, leaving a “holdout” sample for a final, unbiased test of the selected model.
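For readers who want to try this, here is a minimal sketch of such a three-way split in Python; the DataFrame name `policies` and the 60/20/20 proportions are illustrative assumptions, not anything the speakers prescribed.

```python
import pandas as pd

def split_three_ways(policies: pd.DataFrame, seed: int = 42):
    """Randomly split historical data into training, validation and holdout sets."""
    shuffled = policies.sample(frac=1.0, random_state=seed)  # shuffle all rows
    n = len(shuffled)
    n_train = int(0.6 * n)   # 60% used to fit candidate models
    n_valid = int(0.2 * n)   # 20% used to compare them
    train = shuffled.iloc[:n_train]
    valid = shuffled.iloc[n_train:n_train + n_valid]
    holdout = shuffled.iloc[n_train + n_valid:]  # remaining 20% reserved for the final test
    return train, valid, holdout
```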
Model Lift
“Lift” is sometimes called the economic value of a model, Medina said, because it concerns the model’s ability to distinguish between risks. The classic actuarial example might be the way that many carriers assess the value of an insurance credit score. As the score goes up, the expected loss of a risk falls.
The relationship just described is quite similar to one of the tests Medina cited: the loss ratio test. Loss ratios from actual data are ranked by the loss cost the model predicted. If the model is working, the loss ratio should rise as the predicted loss cost rises.
This isn’t the most statistically rigorous test, Medina said, because the premiums underlying the loss ratios are the result of the current rating plan, not the proposed plan, and the current plan’s inadequacies could distort the analysis. Still, it is the most straightforward way to display the results, and one that most insurance professionals can understand.
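A rough sketch of the loss ratio test in Python follows, assuming a validation DataFrame with hypothetical columns `predicted_loss_cost`, `incurred_loss` and `earned_premium`:

```python
import pandas as pd

def loss_ratio_by_decile(valid: pd.DataFrame, n_bins: int = 10) -> pd.Series:
    """Rank risks by modeled loss cost and compute the actual loss ratio in each bin."""
    df = valid.copy()
    # Equal-count bins based on the model's predicted loss cost, lowest to highest.
    df["bin"] = pd.qcut(df["predicted_loss_cost"].rank(method="first"),
                        n_bins, labels=False)
    totals = df.groupby("bin").agg(loss=("incurred_loss", "sum"),
                                   premium=("earned_premium", "sum"))
    # If the model adds lift, this series should generally increase across bins.
    return totals["loss"] / totals["premium"]
```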
Actuaries can analyze results better by running the model against the validation data and plotting actual results against what the model predicted. As the prediction rises, the actual results should rise. This type of chart is called a quantile plot.
One could also use a double quantile plot. The term “double” here means that instead of plotting just one set of predictions against actual results, the analyst plots two sets of predictions. Usually one set comes from the current rating plan and the other from the newly created model. Charting the two models together makes it easier to see at a glance which one performs better.
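A double quantile plot might be sketched as follows; the column names `actual`, `pred_current` and `pred_new` are placeholders for the validation data’s actual losses and the two sets of predictions:

```python
import pandas as pd
import matplotlib.pyplot as plt

def double_quantile_plot(valid: pd.DataFrame, n_bins: int = 10):
    """Plot average actual results by quantile of each model's predictions."""
    fig, ax = plt.subplots()
    for pred_col, label in [("pred_current", "Current rating plan"),
                            ("pred_new", "Proposed model")]:
        bins = pd.qcut(valid[pred_col].rank(method="first"), n_bins, labels=False)
        by_bin = valid.groupby(bins)["actual"].mean()
        ax.plot(by_bin.index + 1, by_bin.values, marker="o", label=label)
    ax.set_xlabel("Prediction quantile (low to high)")
    ax.set_ylabel("Average actual loss cost")
    ax.legend()
    return ax
```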
Goodness of Fit
Any statistician will tell you a model needs to fit the data used to design it. The methods of measuring how close the model fits the data are called goodness-of-fit tests, and understanding how they work can help an analyst improve the model.
Statisticians look at how much the model’s prediction differs from the actual data point. This is called the “error term” or the residual.
The most common methods of measuring residuals (squared error, absolute error) aren’t appropriate for insurance models, Tevet said. Both work best on normally distributed data — data that accumulates into a bell curve.
Insurance data rarely fits a normal distribution, Tevet said. Using standard goodness-of-fit tests can lead to adopting the wrong model.
Tevet said that it is better to look at a measure known as the deviance of a model. This is similar to using squared error; in fact, for a normal distribution the deviance is the weighted sum of squared errors. Other distributions measure the statistic differently. For the insurance-friendly Tweedie distribution (with shape parameter between 1 and 2, the range typically used for insurance losses), the deviance is

D = 2 \sum_i w_i \left[ \frac{y_i^{2-p}}{(1-p)(2-p)} - \frac{y_i \mu_i^{1-p}}{1-p} + \frac{\mu_i^{2-p}}{2-p} \right],

where p is the shape parameter of the distribution, y_i is the actual value, \mu_i is the model’s prediction and w_i is the weight given to the observation.
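As a rough sketch, the same calculation in Python (the default power of 1.5 and the function name are illustrative choices):

```python
import numpy as np

def tweedie_deviance(y, mu, p=1.5, weights=None):
    """Total Tweedie deviance for observed values y and predictions mu, with 1 < p < 2."""
    y, mu = np.asarray(y, dtype=float), np.asarray(mu, dtype=float)
    w = np.ones_like(y) if weights is None else np.asarray(weights, dtype=float)
    # Per-observation (unit) deviance before weighting.
    unit = (np.power(y, 2 - p) / ((1 - p) * (2 - p))
            - y * np.power(mu, 1 - p) / (1 - p)
            + np.power(mu, 2 - p) / (2 - p))
    return 2.0 * np.sum(w * unit)
```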
The deviance of model results can be tweaked into a measure known as the deviance residual. This measure shows the amount by which a model missed its target, but has been adjusted so that all of the deviance residuals, taken together, should form a normal distribution. So each deviance residual misses its target by a random amount.
So a visualization of deviance residuals, a scatterplot of the residuals against the predicted values, should look like a random cloud, with no discernible pattern.
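A minimal sketch of that check, building on the Tweedie deviance above: the sign of (actual minus predicted) is attached to the square root of each observation’s contribution to the deviance, and the results are scattered against the predictions.

```python
import numpy as np
import matplotlib.pyplot as plt

def deviance_residuals(y, mu, p=1.5):
    """Signed square roots of each observation's unit Tweedie deviance."""
    y, mu = np.asarray(y, dtype=float), np.asarray(mu, dtype=float)
    unit = 2.0 * (np.power(y, 2 - p) / ((1 - p) * (2 - p))
                  - y * np.power(mu, 1 - p) / (1 - p)
                  + np.power(mu, 2 - p) / (2 - p))
    return np.sign(y - mu) * np.sqrt(np.maximum(unit, 0.0))

def residual_cloud(y, mu, p=1.5):
    """Scatter deviance residuals against predictions; look for a patternless cloud."""
    res = deviance_residuals(y, mu, p)
    plt.scatter(mu, res, s=8, alpha=0.4)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("Predicted value")
    plt.ylabel("Deviance residual")
    plt.show()
```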
Stability
The final steps Tevet discussed involved determining how robust the model is — making sure it is stable (the parameters that drive the model don’t change too quickly) and is not overfit (the model does well on the data it was trained on but not so well on anything else).
“You might sacrifice a little bit of lift for a model that is more stable over time,” Tevet said.
Models should also be tested “out of time,” meaning testing the model on data gathered from a later time period. That is important in insurance, he said, since the training and testing data, both random subsets of a larger data set, might both contain losses from the same catastrophe or harsh winter.
To protect against overfitting, Tevet suggested a technique known as cross-validation. This is an alternative to creating training and validation datasets.
For example, a modeler could split a data set into five equally sized pieces, known as “folds,” then fit the model five times, each time holding out a different fold for validation and training on the remaining four. Tevet suggested using cross-validation when the data set being modeled is thinly populated.
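A sketch of five-fold cross-validation using scikit-learn’s KFold is below; `fit_model` and `score_model` are placeholders for whatever fitting and scoring routines the modeler actually uses, and X and y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(fit_model, score_model, X, y, n_folds=5, seed=0):
    """Fit the model n_folds times, each time validating on the held-out fold."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(X):
        model = fit_model(X[train_idx], y[train_idx])                 # train on four folds
        scores.append(score_model(model, X[test_idx], y[test_idx]))   # score on the fifth
    return np.mean(scores), np.std(scores)
```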
Another way to improve the value of the dataset is by creating a new data set from it using a process known as bootstrapping. The new data set is created by randomly selecting data with replacement from the old data set.
“Each random sample can be thought of as an alternative reality,” Tevet said.
For example, if you had a bag with 100 marbles, some blue and some red, you could create a virtual marble bag by picking a marble from the actual bag, noting its color, then putting it back in the bag it came from. After doing this 100 times, you have a virtual marble bag.
The main advantage of bootstrapping is that the modeler can use the results to create statistical confidence intervals, Tevet said. Then the modeler can better tell if the difference between the model’s predictions and reality is from a weakness in the model or just due to chance.
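A sketch of a percentile bootstrap interval for any summary statistic of the validation results; the 1,000 resamples and 95 percent level are illustrative choices, not anything the speakers specified.

```python
import numpy as np

def bootstrap_interval(values, statistic=np.mean, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    stats = np.empty(n_resamples)
    for i in range(n_resamples):
        # Each resample draws with replacement -- one "alternative reality."
        sample = rng.choice(values, size=len(values), replace=True)
        stats[i] = statistic(sample)
    return (np.percentile(stats, 100 * alpha / 2),
            np.percentile(stats, 100 * (1 - alpha / 2)))
```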
In the end, the speakers agreed, the statistical procedures of picking a model have a sound mathematical basis, but the business knowledge that actuaries add to the process is also crucial.