Actuarial Expertise
Fresh Look

Bailey and Simon Minimum Bias Reexamined, Part 1

Actuarial Review introduces a new column, Fresh Look, that aims to reassess core areas in actuarial science with more current tools and practices. Part 2 will appear in Actuarial Review January-February 2021.

Bias

Society is beset with problems of bias, inequity and unfairness. Insurance, in particular, relies on the perception and reality of fairness. Insureds will only pool their risk with one another when they believe everyone pays a fair and unbiased rate. So insurers must not only treat all insureds equitably, but they must also be able to demonstrate that they do so. However, their complex and granular rating plans make that more challenging. As a result, media and regulators have begun to question and investigate rating models. The National Association of Insurance Commissioners’ (NAIC) Casualty Actuarial and Statistical Task Force drafted a white paper on the Regulatory Review of Predictive Models and completed a Price Optimization White Paper in 2015. Following the lead of the industry, which regularly steps back to review and revise existing literature and white papers in the face of new evidence and studies, the CAS continues to evolve its communication and education on emerging topics and trends. Against this backdrop, now seems a good time to reexamine two very famous papers published in the Proceedings of the Casualty Actuarial Society, the precursor to Variance, that propose minimizing bias in insurance rates.

Insurance rates should be based on data and not prejudice. Establishing fairness is challenging and encompasses many issues, such as the use of proxy variables and differential impact. The modeler must use a rigorous and transparent framework that avoids arbitrary, unnecessary or hidden assumptions. This column explains that a generalized linear model (GLM), the natural outgrowth of minimum bias methods, satisfies these requirements, providing an ideal model-building platform. While it is possible to build a flawed GLM, it is reassuring to know it provides a neutral starting point.

It is important to remember that residual error is a modeling fact of life. An oft-quoted aphorism states, “All models are wrong; some are useful.” Models simplify to be useful, but by simplifying, they omit details and cannot perfectly replicate the real world. A statistical model balances fidelity to sample data with out-of-sample predictive power to maximize its usefulness.

An actuarial statistical model creates a rating plan to predict expected loss costs and distributions for each insured. Various standards are used to judge if a rating plan is acceptable. U.S. actuaries are familiar with the CAS Ratemaking Principle that rates should be “reasonable, not excessive, not inadequate and not unfairly discriminatory.”

Another set of criteria, almost as well known and pre-dating the CAS principles by nearly 30 years, was written down by Robert Bailey and LeRoy Simon in their 1960 Proceedings paper “Two Studies in Automobile Insurance Ratemaking.” A 1963 follow-up by Bailey, “Insurance Rates with Minimum Bias,” developed them further. It is instructive to reexamine Bailey and Simon’s criteria in light of what we have learned since then and the issues we currently face as the front-line guardians of fair insurance rates.

Criteria

Bailey and Simon’s concern was personal automobile ratemaking in Canada. At the time, pricing used a two-way classification plan combining a (very coarse) class plan and a merit (experience) rating plan. Their four criteria are as follows, with italics in the original. A set of pricing relativities is acceptable if:

  1. It reproduces the experience for each class and merit rating class and also the overall experience, i.e., is balanced for each class and in total.
  2. It reflects the relative credibility of the various groups involved.
  3. It provides a minimal amount of departure from the raw data for the maximum number of people.
  4. It produces a rate for each subgroup of risks that is close enough to the experience so that the differences can reasonably be caused by chance.

Assumptions

The Bailey-Simon criteria rely on several assumptions.

Balanced by class means that, for each class, the model rate averaged (with exposure weights) over the remaining classes equals that class’s experience rate. This formulation gives particular prominence to the average, or mean, and uses the difference from the average to measure balance (residual error). It also implies that each class is large enough to be fully credible.
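As a sketch of what balance requires, in notation introduced here for illustration rather than taken from the paper, write w_ij for the exposure weight and r_ij for the experience rate in the cell for class i of the first variable and class j of the second, and write the additive model rate for that cell as x_i + y_j, the form discussed further below. Balance for class i of the first variable then requires

\[ \sum_j w_{ij}\,(x_i + y_j) \;=\; \sum_j w_{ij}\, r_{ij}, \]

with analogous conditions for each class of the second variable and for the total.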

The discussion of relative credibility appeals to the general statistical principle of weighting an estimate in inverse proportion to its standard error. Bailey and Simon give each cell’s experience a weight proportional to the square root of its expected number of losses because they assume the variance of experience loss grows with its expected value.
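A back-of-the-envelope reconstruction of that reasoning, under their stated assumption rather than a quotation from the paper: if the variance of a cell’s losses is proportional to its expected losses, then the standard error of its loss ratio is proportional to

\[ \frac{\sqrt{E[\text{losses}]}}{E[\text{losses}]} \;=\; \frac{1}{\sqrt{E[\text{losses}]}}, \]

so a weight inversely proportional to the standard error is proportional to the square root of expected losses, or equivalently to the square root of the expected number of losses when severity is fixed.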

Bailey and Simon frame the third criterion in terms of “inequity” or deviation from experience. It is worth quoting their discussion because of its topical relevance.

“Anyone who has dealt directly with insureds at the time of a rate increase, knows that you can be much more positive when the rate for his class is very close to the indications of experience. The more persons involved in a given sized inequity, the more important it is.”

The ability to explain rates was as necessary in 1960 as it is today! Bailey and Simon quantified the departure criterion using the average absolute deviation.
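In the notation above, and again as a reconstruction rather than a transcription of their formula, this is the exposure-weighted average absolute deviation

\[ \frac{\sum_{ij} w_{ij}\,\lvert r_{ij} - m_{ij}\rvert}{\sum_{ij} w_{ij}}, \]

where m_ij denotes the model rate for cell (i, j).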

Bailey and Simon addressed chance, the fourth criterion, using a weighted Χ2 statistic. Based on Canadian experience, they determined that the difference between the actual and expected relative loss ratio, scaled by the former’s standard deviation, is approximately standard normal, justifying the Χ2 approach. They then derived a minimum bias iterative scheme to solve for the minimum Χ2 relativities and showed that the result is balanced.
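The statistic has the familiar weighted form, sketched here in the same illustrative notation rather than copied from the paper:

\[ \chi^2 \;=\; \sum_{ij} \frac{w_{ij}\,(r_{ij} - m_{ij})^2}{m_{ij}}, \]

a sum over cells of squared deviations scaled by the fitted value.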

Bailey’s 1963 paper generalized the minimum bias iterative scheme and discussed additive (cents) and multiplicative (percents) models, as well as the need for a measure of model fit distinct from average model bias (which is zero, by design). He proposed minimum square error and minimum absolute error measures for this purpose.

Bailey and Simon’s principal innovation was to calculate all class relativities at once, reflecting different mixes across each variable. Until their work, rating factors were computed one at a time, in a series of univariate analyses. (This is different from considering interactions between rating factors. Their two-factor rating plan was too simple to allow for interactions.) The minimum bias method was, and remains, very appealing: It is easy to explain and intuitively reasonable (who doesn’t want their rating plan to be balanced by class?) and is simple to program. It is no wonder it proved so popular.
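To illustrate just how simple it is to program, here is a minimal sketch of the balance-based iterative scheme for a multiplicative two-way model. The exposure weights and experience relativities are made up for the example, and the code alternately solves the row and column balance equations; it illustrates the general idea rather than transcribing Bailey’s original algorithm.

```python
import numpy as np

# Illustrative two-way experience: rows index the first rating variable,
# columns the second. All numbers are invented for this sketch.
w = np.array([[400.0, 300.0],
              [200.0, 100.0]])   # exposure weight in each cell
r = np.array([[1.10, 0.95],
              [0.80, 0.70]])     # experience relativity in each cell

x = np.ones(w.shape[0])          # relativities for the first variable
y = np.ones(w.shape[1])          # relativities for the second variable

for _ in range(50):
    # Row balance: sum_j w_ij * x_i * y_j = sum_j w_ij * r_ij for each i
    x = (w * r).sum(axis=1) / (w @ y)
    # Column balance: sum_i w_ij * x_i * y_j = sum_i w_ij * r_ij for each j
    y = (w * r).sum(axis=0) / (x @ w)

fitted = np.outer(x, y)          # model relativity x_i * y_j for each cell

# Check balance: weighted model and experience totals agree by row and column.
print(np.round(fitted, 4))
print(np.allclose((w * fitted).sum(axis=1), (w * r).sum(axis=1)))
print(np.allclose((w * fitted).sum(axis=0), (w * r).sum(axis=0)))
```

Note that x and y are only determined up to a common scale (multiply one by a constant and divide the other by the same constant), so in practice one class is chosen as a base. The additive model uses an analogous update with sums and differences in place of products and ratios.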

Critique

Certain aspects of Bailey and Simon’s work may be tricky for today’s statistically trained actuary to follow. The use of the word bias is nonstandard. In statistics, an estimator is unbiased if its expected value over samples gives the correct value. Bailey and Simon use bias to mean residual error, the difference between a fitted value and an observation, and as a measure of overall model fit. Balance is also used to describe the residual error in a fitted value.

The focus on the sample mean as a sufficient statistic for the population mean needs no explanation.

The concept of balance by class relies on the form of the linear model underlying the classification. Bailey and Simon use a two-way classification model: in the additive version, the rate for risks in class (i, j) is x_i + y_j. The underlying design matrix contains only 0s and 1s. In a more general setting, including continuous covariates, the design matrix would be more complex. Some analog of balance would still apply, but it would be more complicated to explain.

Bailey and Simon place great emphasis on the concept of fully credible rating classes, meaning ones where the model rate should exactly equal the experience rate. A statistical approach quantifies the outcome distribution explicitly and produces tighter and tighter confidence intervals for the model rate, rather than insisting on equality. Some sampling error or posterior uncertainty remains even for the largest cells, although it may be very small.

The claim that the variance of experience grows linearly with expected losses in each class is the most interesting for the modeler. It reflects a traditional actuarial compound Poisson claims generating process: each risk cell is characterized by a severity distribution and an annual claim frequency. Aggregate losses then have a Poisson claim count, with mean proportional to expected losses, and a fixed severity distribution, so their variance is proportional to their mean. These assumptions can fail in at least two ways.

First, there can be common risk drivers between insureds, such as macroeconomic conditions or weather. These result in a correlation between insureds. A negative binomial frequency captures the effect, replacing the Poisson. The resulting aggregate distribution has a variance of the form μ(a+bμ) for constants a and b, where μ is the mean. The variance of a large portfolio grows almost quadratically with its mean.

Second, a quadratic mean-variance relationship can arise for catastrophe risks, where portfolio growth corresponds to paying a greater proportion of losses over a fixed set of events. The actuary’s understanding of the loss generating process informs the possible relationship between the mean and the variance of losses in a cell. It should fall somewhere between linear and quadratic.
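A short derivation, using textbook collective-risk formulas rather than anything specific to Bailey and Simon, shows where the linear and quadratic forms come from. For aggregate losses S with claim count N and independent, identically distributed severities X,

\[ \operatorname{Var}(S) \;=\; E[N]\operatorname{Var}(X) + \operatorname{Var}(N)\,E[X]^2 . \]

For a Poisson count with mean \(\lambda\), \(\operatorname{Var}(N)=\lambda\) and \(\operatorname{Var}(S)=\lambda E[X^2]\), which is proportional to the mean \(\mu = \lambda E[X]\) as the cell grows with a fixed severity distribution. For a negative binomial count with \(\operatorname{Var}(N)=\lambda + c\lambda^2\), the same formula gives \(\operatorname{Var}(S) = \lambda E[X^2] + c\lambda^2 E[X]^2\), which has the form \(\mu(a+b\mu)\) and grows almost quadratically for large \(\mu\).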

Bailey and Simon test the fourth criterion, that each subgroup’s experience should be close enough to its rate that differences could reasonably be caused by chance, using an aggregate Χ2 statistic. There is a clear opportunity to enhance model assessment using a granular, cell-by-cell evaluation of chance deviation, based on the modeled distribution of losses.

Finally, the discussion of both the third and fourth criteria introduces modeler discretion: Which measure of overall model bias should be employed? Least squares, minimum absolute deviation and minimum Χ2 are all mooted. The modeler should avoid unnecessary choices. Is there a better way to select a measure of model fit?

Homework

In the next issue, we will see how modern statistics has developed the ideas presented so far. As a former college professor, I would be remiss if I didn’t give you some homework to prepare. Although data science deals with massive data sets and builds very complex models, you can understand its fundamental problems by considering straightforward examples. Here are two that capture our essential conundrum. It would help if you considered how to solve them before reading the sequel.

The first is a two-way classification, with each factor taking two values. As an example, imagine auto liability experience, with factors youthful operator yes/no and prior accidents yes/no. The table shows the pure premium in each cell. You want to fit an additive linear model.

The second is a simple linear regression problem. You want to fit a line through the following data, which could represent severity over time. The covariates are a constant (not shown) and date. Dates are equally spaced and have been replaced by 0, 1 and 2.

In both examples, assume the same volume of data underlies each observation, so there is no need for weights. In the first, make the Bailey and Simon assumption that the total experience across each level of each dimension is credible, i.e., the row and column totals are credible.

For partial credit, start by laying out the first question so it looks more like the second one.
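As a nudge in that direction, and as one possible layout introduced here for illustration (the parameter names are not from the article), the 2×2 additive model can be written with an intercept and an indicator for each factor, so the four cells become four rows of a regression design matrix containing only 0s and 1s:

\[ \begin{pmatrix} r_{\text{no,no}} \\ r_{\text{no,yes}} \\ r_{\text{yes,no}} \\ r_{\text{yes,yes}} \end{pmatrix} \approx \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_{\text{youthful}} \\ \beta_{\text{prior}} \end{pmatrix} , \]

where the subscripts on r give the youthful-operator and prior-accidents values for the cell. Four observations and three parameters: the same shape of problem as fitting a two-parameter line through three points in the second example.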

The difficulty is clear: There are fewer parameters than data points, so the requested model will not fit exactly. How should you apportion the model miss? Obviously, with a clever selection of response function you can create many models that do fit exactly — or over-fit exactly. Please resist the urge to expound upon these and focus on the stated question.


Stephen J. Mildenhall, Ph.D., FCAS, CERA, ASA, is a consultant with Convex Risk LLC.