Professional Insight
On the Shelf

Big Data Meets Literature

Nabokov’s Favorite Word Is Mauve: What the Numbers Reveal About the Classics, Bestsellers, and Our Own Writing, by Ben Blatt. Simon & Schuster, 2017, 288 pp, $25.

The Federalist Papers are a collection of 85 essays written by Alexander Hamilton, John Jay and James Madison to promote the ratification of the U.S. Constitution. The three men all used the pen name “Publius” to preserve the anonymity of each essay. Years later, both Hamilton and Madison published lists that named the author of each essay. But for 12 essays, both Hamilton and Madison claimed authorship. Historians studied the documents, using the political ideology espoused within to try to determine who wrote what. Their work was inconclusive, and the mystery remained unsolved for over 150 years. In the 1960s, two statisticians and professors, Frederick Mosteller and David Wallace, used statistically inferred probabilities and Bayesian analysis to address the problem. They compared the frequency of 30 common words — for example, the number of times the word “upon” is used per 10,000 words of text — in the disputed essays to the frequency in papers known to be written by either Hamilton or Madison. They concluded that Madison was the author of all 12 essays. Later statistical studies performed by other authors agreed with this conclusion.
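The rate that Mosteller and Wallace compared, occurrences of a marker word per 10,000 words of text, is simple to compute once a text is digitized. The following is a minimal sketch of that single calculation only; it does not attempt their Bayesian analysis. The function name, the tokenizing rule and the two sample sentences are hypothetical illustrations, not taken from the essays or from their study.

```python
import re


def rate_per_10k(text: str, word: str) -> float:
    """Return occurrences of `word` per 10,000 words of `text` (case-insensitive)."""
    # Assumed tokenizer: runs of letters and apostrophes count as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 10_000 * tokens.count(word.lower()) / len(tokens)


# Invented toy sentences, not actual Federalist text.
sample_a = "Upon reflection, the plan depends upon the consent of the states."
sample_b = "On reflection, the plan depends on the consent of the states."

if __name__ == "__main__":
    for label, sample in (("A", sample_a), ("B", sample_b)):
        print(label, round(rate_per_10k(sample, "upon"), 1))
```

With real essays in place of the toy sentences, a marked gap in such rates between two candidate authors is the kind of signal the statisticians then weighed formally.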

By necessity, Mosteller and Wallace went about their word-counting exercise in a rudimentary way: They cut paper copies of the essays into individual words, then alphabetized and counted. (In the book describing their research, they write, “During this operation a deep breath created a storm of confetti and a permanent enemy.”) In Nabokov’s Favorite Word Is Mauve, statistician and journalist Ben Blatt builds on the work of Mosteller and Wallace, using digitized texts and modern computing power instead of paper and scissors. Like Mosteller and Wallace, Blatt studies the unique fingerprint that defines an author’s style, but he asks a broad range of other questions as well, from whether writers follow their own writing advice to how word choice varies between male and female authors. Most of Blatt’s work concentrates on novels: classics, modern literary fiction and popular bestsellers.

One area of exploration is the use of adverbs. English students are warned to avoid them. Acclaimed authors shun them. (“I believe the road to hell is paved with adverbs,” says horror writer Stephen King.) Hemingway, the master of tight prose, is praised for not using them. But do the best writers really use fewer adverbs? Blatt used a suite of programs and libraries called the Natural Language Toolkit to count the number of adverbs used by Hemingway, King and several other authors in their complete bodies of work. The results are surprising. Hemingway and King both use adverbs at a higher rate than E. L. James, author of the erotic Fifty Shades trilogy. And Hemingway’s usage is also slightly higher than that of Stephenie Meyer, author of the Twilight series.

But wait — are all adverbs the same? Isn’t it the ones that end in -ly (“swiftly,” “slowly,” “softly”) that stand out? Adverbs like “not,” “also” and “often” generally slip by unnoticed. And it’s plausible that top writers can create vivid scenes without needing to rely on these -ly adverbs. Toni Morrison, Nobel Prize-winning author of Song of Solomon and Beloved, said in an interview, “I never say ‘She says softly.’ If it’s not already soft, you know, I have to leave a lot of space around it so a reader can hear that it’s soft.” When Morrison builds a scene with enough details to let readers hear the softness, there’s no need to tell them it’s so.

Considering just the adverbs that end in -ly, Blatt compiles results that align more closely with expectation. Hemingway’s usage is almost half that of E. L. James and about 40 percent lower than that of Stephenie Meyer. In this test, the master of concision lives up to his reputation. And it’s not just that better writers use fewer -ly adverbs. Blatt even found that within a great author’s canon, the most popular books tend to have the lowest -ly adverb frequency. For instance, Hemingway’s lowest -ly adverb rates are found in his most popular works, including The Sun Also Rises and A Farewell to Arms. His less well-known novels, like Across the River and Into the Trees and True at First Light, have the highest rates.
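To make the idea of an -ly adverb rate concrete, here is a rough sketch that counts words ending in -ly per 10,000 words. It is a crude suffix heuristic and an assumption for illustration, not Blatt’s method: it uses no part-of-speech tagging, and the exclusion list of -ly words that are not adverbs is hypothetical and deliberately incomplete. The sample sentences are invented.

```python
import re

# Hypothetical, incomplete list of -ly words that are not adverbs.
NOT_ADVERBS = {
    "family", "reply", "supply", "apply", "rely", "italy", "july",
    "lovely", "ugly", "holy", "belly", "jelly", "bully",
}


def ly_adverb_rate(text: str, per: int = 10_000) -> float:
    """Return the number of likely -ly adverbs per `per` words of `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    # Skip very short words ("fly", "sly") and the known non-adverbs above.
    hits = [
        t for t in tokens
        if t.endswith("ly") and len(t) > 3 and t not in NOT_ADVERBS
    ]
    return per * len(hits) / len(tokens)


# Invented sentences for illustration only.
plain = "She closed the door and left the family home."
wordy = "She quietly closed the door and slowly left the family home."

if __name__ == "__main__":
    print(round(ly_adverb_rate(plain), 1))
    print(round(ly_adverb_rate(wordy), 1))
```

A suffix rule like this will miscount some words in either direction, which is one reason a study of whole novels would lean on a proper part-of-speech tagger.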

Some might protest the distillation of art into numbers and graphs. But the data has a lot to say about what makes an author great, and why certain of that author’s works are more popular than others. It’s one thing to analyze a section of tight prose from a Hemingway classic and quite another to review data showing that his prose is significantly tighter, in a statistical sense, than that of other authors.

The author’s enthusiasm for literature comes through as strongly as his zeal for statistics. Blatt doesn’t just condense data into graphs, though he does this very well. He also offers explanations for why the data shows what it shows. (Why, for example, might the most highly acclaimed books use the fewest -ly adverbs?)

Readers seeking details on the methods used to produce Blatt’s results might be left disappointed. But this type of reader isn’t Blatt’s target audience; he says in his introduction, “You probably don’t care about the Poisson distribution or the parsing program used to decipher parts of speech.” And besides, the basic idea isn’t overly complex. Blatt is essentially counting the relative frequency of certain types of words and phrases. The innovation is in the use of text analysis to answer questions about writing and uncover patterns in great literature.


Julie Lederer, FCAS, MAAA, works for the Missouri Department of Insurance, Financial Institutions, & Professional Registration.