Sunday, December 13, 2015

What Happens When Computers Learn to Read Books?

In Kurt Vonnegut's classic novel Cat's Cradle, the character Claire Minton has the most fantastic ability: simply by reading the index of a book, she can deduce almost every biographical detail about its author. From scanning a sample of text in one index, she is able to figure out with near certainty that a main character in the book is gay (and therefore unlikely to marry his girlfriend). Claire Minton knows this because she is a professional indexer of books.

And that's what computers are today -- professional indexers of books.

Give a computer a piece of text from the 1950s, and based on the frequency of just fifteen words, the machine can tell you whether the author was white or black. That's the claim from two researchers at the University of Chicago, Hoyt Long and Richard So, who deploy complicated algorithms to examine huge bodies of text. They feed the machine thousands of scanned novels' worth of data, which it analyzes for patterns in the language -- the frequency, presence, absence and combinations of words -- and then they test big questions about literary style.
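
The article doesn't say which method Long and So actually used, but a minimal sketch of the general approach might look like the following: restrict a model to the frequencies of a small, fixed set of words and measure how well it separates two groups of authors on held-out texts. The toy texts and labels below are placeholders, and scikit-learn's logistic regression stands in for whatever classifier the researchers deployed.

```python
# A minimal sketch, NOT Long and So's actual pipeline: separate two
# author groups using only the frequencies of a few common words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder texts and labels; a real study would use thousands of
# scanned novels and real author metadata.
texts = [
    "the river ran past the old house at dawn",
    "dawn broke over the river and the quiet old house",
    "the city lights burned all night above the crowd",
    "a crowd pushed through the night city streets",
] * 10
labels = [0, 0, 1, 1] * 10  # hypothetical author-group labels

# Keep only the 15 most frequent words, echoing the article's
# "frequency of just fifteen words" claim.
vectorizer = CountVectorizer(max_features=15)
X = vectorizer.fit_transform(texts)

# Cross-validated accuracy on held-out texts, analogous to the
# accuracy figure quoted below.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, labels, cv=5)
print("words used:", vectorizer.get_feature_names_out())
print("mean held-out accuracy:", scores.mean())
```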

"The machine can always -- with greater than a 95 percent accuracy -- separate white and black writers," So says. "That's how different their language is."

This is just one example. The group is digging deeper into other questions of race in literature but isn't ready to share the findings yet. Minority writers represent a tiny fraction of American literature's canonical texts. The researchers hope that by shining a spotlight on unreviewed, unpublished or forgotten authors -- now easier to identify with digital tools -- or by simply approaching popular texts with different examination techniques, they can shake up conventional views on American literature. Though the tools are far from perfect, scholars across the digital humanities are increasingly training big computers on big collections of text to pose and answer new questions about the past.

"We really need to consider rewriting American literary history when we look at things at scale," So says.

Who Made Who

A culture's corpus of celebrated literature functions like its Facebook profile. Mob rule curates what to teach future generations and does so with certain biases. It's not an entirely nefarious scheme. According to So, a person can only process about 200 books, and we can only compare a few at a time, so all analysis is reductive. The novel changed our relationship with complicated concepts like superiority and our place in the environment. Yet we needed to describe -- and communicate -- those huge shifts with mere words.

In machine learning, algorithms process reams of data on a particular topic or question. This eventually allows a computer to recognize certain patterns, whether that means spotting tumors, cycles in the weather or quirks of the stock market. Over the last decade this has given rise to the digital humanities, where professors with large corpora of text -- or any data, really -- use computers to develop hard metrics for areas previously seen as more abstract. (...)

Mark Algee-Hewitt's group in Stanford's English department used machines to examine paragraph structure in 19th-century literature. We all know that in most literature, when the writer moves to a new paragraph, the topic of the paragraph will change. That's English 101.

But Algee-Hewitt says they also found something that surprised them: whether a paragraph had a single topic or multiple topics was not governed by the paragraph's length. One might think that a long paragraph would cover lots of ground. That wasn't the case. Topic variance within a paragraph has more to do with a story's genre and setting than with its length.
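
The article doesn't describe how the Stanford group measured this, but one common way to test such a claim is to fit a topic model over paragraphs, score how mixed each paragraph's topics are, and check whether that mixture tracks paragraph length. A rough sketch under those assumptions, using scikit-learn's LDA and placeholder paragraphs:

```python
# A rough sketch, not Algee-Hewitt's actual method: correlate each
# paragraph's topic mixture with its length.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder paragraphs of varying length; a real study would use
# thousands of digitized 19th-century novels.
paragraphs = [
    "the ship rolled in the storm",
    "she poured the tea while the parlor talk turned quietly to the delicate question of marriage",
    "the storm passed and the parlor fell silent over the cooling tea",
    "the sailors spoke of marriage and of the sea in the same weary breath",
] * 25

counts = CountVectorizer(stop_words="english").fit_transform(paragraphs)
lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic mixture per paragraph

# Entropy of the mixture: low means one dominant topic, high means
# several topics share the paragraph.
entropy = -(doc_topics * np.log(doc_topics + 1e-12)).sum(axis=1)
lengths = np.array([len(p.split()) for p in paragraphs])

# If length governed topic mixture, this correlation would be strong;
# the Stanford finding suggests it is weak.
print("length vs. topic-mixture correlation:",
      np.corrcoef(lengths, entropy)[0, 1])
```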

Now they are looking for a pattern by narrative type.

"The truth is that we really don't know that much about the American novel because there's so much of it, so much was produced," says So. "We're finding that with these tools, we can do more scientific verification of these hypotheses. And frankly we often find that they're incorrect."

The Blind Men and The Elephant

But a computer can't read -- not in a human sense. Words create sentences, paragraphs, settings, characters, feelings, dreams, empathy and all the intangible bits in between. A computer simply detects, counts and follows the instructions provided by humans. No machine on earth understands Toni Morrison's Beloved.

At the same time, no human can examine 10,000 books at a time. We're in this funny place where people assess the fundamental unit of literature (the story) while a computer assesses all the units in totality. The disparity -- that gap -- between what a human can understand and what a machine can understand is one of the root disagreements in academia over the methodology of deploying computers to ask big questions about history.

Does a computer end up analyzing literature, itself, or those who coded the question?

by Caleb Garling, Priceonomics | Read more:
Image: uncredited