It’s called text mining for a reason.
Like the search for precious metal, text mining consists of a tremendous amount of preparatory work: taking random assortments of data, digging through them, sifting and sifting through them, and then refining them until what’s left is extremely valuable.
In the physical world, you start with a promising patch of land, and the result is gold or silver, whose value is obvious. In the cyber world, you start with written text that can vary from a claims file to a series of tweets. The result is actionable information, which can be just as valuable.
Two actuaries described the mining process at the virtual CAS Ratemaking, Product and Modeling seminar in late July.
Louise Francis, FCAS, CSPA, MAAA, consulting principal for Francis Analytics & Actuarial Data Mining, presented a step-by-step look at the digging and sifting.
Roosevelt Mosley, FCAS, CSPA, MAAA, a principal and consulting actuary at Pinnacle Actuarial Resources, examined two real-life insurance examples: in one, searching claims files for insight on homeowner claims and in the other, measuring consumer sentiment from a pile of individual Twitter messages.
Together, they presented the business case for a skill that seems a logical extension of the traditional actuarial toolkit.
Anyone familiar with data analysis knows the 80/20 rule — creating the data set is a lot of work and takes 80% of the time. Textual analysis is even more lopsided.
“Free-form text is far more challenging than structured data,” Francis said. “There has been no effort to standardize the text before the analyst digs in.”
Francis described two approaches. Her first example uses the “bag of words” approach, where words are accumulated without retaining the context they held when originally written. Her example uses R statistical language, but Python can also be used to mine text, she said, as well as less common languages such as Perl.
“In ‘bag of words,’ things like semantics and sentence structure don’t matter,” she said.
Her alternative approach is natural language processing, which tries to capture the subtleties of human communication.
The general process is to cull extraneous bits of text like punctuation and to standardize what remains.
Using an example from a dataset of workers’ compensation claims and after some preliminaries, Francis focused on a field where an adjuster describes what happened. She described the steps she had taken to normalize the text and the R code necessary to do so.
She had to fix misspellings. She removed punctuation, though she noted sometimes an analyst would want to retain it. She stripped the extra spaces surrounding words. She replaced synonyms. She stemmed words, trimming endings like “-ing” and “-ed.” She took out stop words like “the,” “is” and “on,” because they carry little meaning.
The process was iterative. At one point she displayed a list of terms that included “accid,” “accident” and “acciden.” Those were brought together into one term, “accident.”
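For readers curious what those steps look like in practice, here is a minimal sketch using R’s tm package. The data frame, field name and term mappings are illustrative assumptions, not Francis’s actual code.

    # Illustrative sketch only -- not Francis's actual code. Assumes a data
    # frame `claims` with a free-form adjuster description in claims$description.
    library(tm)

    corpus <- VCorpus(VectorSource(claims$description))

    # Standardize case, strip punctuation and extra whitespace
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)

    # Drop stop words that carry little meaning
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # Stem words so endings like "-ing" and "-ed" collapse to a root
    corpus <- tm_map(corpus, stemDocument)

    # Iterative cleanup: merge variants spotted in the term list
    # (hypothetical mapping, in the spirit of "accid"/"acciden" -> "accident")
    merge_variants <- content_transformer(function(x)
      gsub("\\baccid\\b|\\bacciden\\b", "accident", x, perl = TRUE))
    corpus <- tm_map(corpus, merge_variants)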
The data that resulted from the cleaning and preprocessing was converted to a document-term matrix. That’s where exploration and analysis can begin and where Mosley’s presentation picked up.
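Building that matrix is a short step in the same hypothetical tm workflow:

    # One row per claim description, one column per term, cells holding counts
    dtm <- DocumentTermMatrix(corpus)

    # Optionally drop very rare terms to keep the matrix manageable
    dtm <- removeSparseTerms(dtm, sparse = 0.99)
    inspect(dtm[1:5, 1:10])   # peek at the first few claims and terms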
Mosley described two different analyses. In the first, he looked for severity trends in 14,000 homeowners’ claims comprising 85,000 transactions. As in Francis’ example, he focused on the claim description field. He began by picking out words that appeared frequently, e.g., water, insured, damage, tree, basement.
“You can see how some of the issues may come into play with homeowner claims,” he said.
He ended up with 94 separate terms to analyze.
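Pulling frequent terms out of a document-term matrix is straightforward in R; a sketch, continuing the hypothetical example above (the frequency cutoff is arbitrary):

    # Terms appearing at least 500 times across all claim descriptions
    frequent <- findFreqTerms(dtm, lowfreq = 500)

    # Or rank every term by total count
    counts <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
    head(counts, 10)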
He began the analysis by looking at claim severity by indicator, though he noted that the result “is not too surprising.” The word “fire” was associated with the largest losses. Claims where the word “stolen” appeared were smaller.
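That comparison amounts to averaging losses over claims where a term does or does not appear. A sketch, assuming a hypothetical loss field claims$incurred aligned with the matrix rows (exact term spellings depend on how the text was stemmed):

    m <- as.matrix(dtm)
    has_fire   <- m[, "fire"]   > 0     # indicator: does the term appear?
    has_stolen <- m[, "stolen"] > 0

    tapply(claims$incurred, has_fire, mean)    # average severity with/without "fire"
    tapply(claims$incurred, has_stolen, mean)  # and with/without "stolen"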
Next, he looked for correlations between pairs of words. “Wind” appeared with “blew” a lot, and “tree” appeared with “fell.” Claim size spiked when the word “flooded” appeared with the word “basement,” more so than when the word “basement” was not accompanied by the term “flooded.”
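Those pairwise relationships come out of the same matrix; a sketch continuing the example above (the correlation cutoff and term spellings are illustrative):

    # Terms correlated with "wind" and "basement" above an arbitrary cutoff
    findAssocs(dtm, terms = c("wind", "basement"), corlimit = 0.2)

    # Severity when "flood" and "basement" appear together vs. "basement" alone
    together      <- m[, "flood"] > 0 & m[, "basement"] > 0
    basement_only <- m[, "basement"] > 0 & m[, "flood"] == 0
    c(together = mean(claims$incurred[together]),
      basement_only = mean(claims$incurred[basement_only]))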
As the analysis proceeds, “You don’t see the full picture,” Mosley said, “because you are looking only at pairs of words. But the bigger picture is starting to come into focus.”
He showed the results of traditional data mining techniques like clustering — finding groups of words that tend to appear together and analyzing the claims where they appear — and association analysis — a way of finding words that appear together and determining what other words are likely to appear with them.
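Both techniques can be run directly on the document-term matrix. A sketch using base R’s hierarchical clustering and the arules package for association rules; the cluster count and rule thresholds here are arbitrary choices, not Mosley’s:

    # Cluster terms that tend to appear together across claims
    term_matrix <- t(as.matrix(removeSparseTerms(dtm, sparse = 0.95)))
    hc <- hclust(dist(term_matrix), method = "ward.D2")
    cutree(hc, k = 8)          # assign each term to one of eight clusters

    # Association rules: which terms tend to imply which other terms
    library(arules)
    transactions <- as(as.matrix(dtm) > 0, "transactions")
    rules <- apriori(transactions, parameter = list(supp = 0.01, conf = 0.5))
    inspect(head(sort(rules, by = "lift"), 10))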
From there, he said, “You can begin to develop rules that are present in each claim description: Is it water damage? Is it water in the basement? Is it water related to the ceiling? And you begin to decipher key elements that come from some of those particular associations. That lets you refine and understand your claim severities better.”
Mosley’s second example analyzed 6 million insurance tweets, focusing on consumer sentiment toward GEICO. The study looked at engagement with the company’s ads and the effectiveness of its marketing. The ads with the camel proclaiming Wednesday as “hump day” were popular, for example, but, like many trendy things, they quickly faded.
The study looked at people who switched insurers, exploring the subgroup who saved money and how much. People switching to GEICO saved $695 on average. People switching away saved $755.
“You end up having the kind of information you would get from a focus group,” Mosley said, “without having the focus group.”
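Scoring tweet sentiment follows the same bag-of-words logic. A minimal sketch using the syuzhet package and a hypothetical vector of tweet text, not the actual study’s method:

    library(syuzhet)

    # Score each tweet with a simple lexicon-based method
    scores <- get_sentiment(tweets$text, method = "bing")

    # Summarize: share of negative, neutral and positive tweets
    table(sign(scores)) / length(scores)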
The analyses showcased the value of actuaries — professionals who perform detailed quantitative analyses to deliver important business insights.
James P. Lynch, FCAS, is chief actuary and vice president of research and education for the Insurance Information Institute. He serves on the CAS Board of Directors.