UNCHARTED
Big Data as a Lens on Human Culture
Erez Aiden and Jean-Baptiste Michel
Riverhead Books; 280 pages; $27.95
Why do English speakers say "drove" rather than "drived"? As graduate students at the Harvard Programme for Evolutionary Dynamics about eight years ago, Erez Aiden and Jean-Baptiste Michel pondered the matter and decided that something like natural selection might be at work. In English, the "-ed" past-tense ending of Proto-Germanic, like a superior life form, drove out the Proto-Indo-European system of indicating tenses by vowel changes. Only the small class of verbs we know as irregular managed to resist.
To test this evolutionary premise, Mr Aiden and Mr Michel wound up inventing something they call culturomics, the use of huge amounts of digital information to track changes in language, culture and history. Their quest is the subject of Uncharted: Big Data as a Lens on Human Culture, an entertaining tour of the authors' big data adventure, whose implications they wildly oversell.
To tackle the drived/drove question, Mr Aiden and Mr Michel assigned two undergraduates to read every textbook on historical English grammar, compile a list of irregular verbs and follow their fortunes through the centuries. The students turned up 177 irregular verbs in Old English, a number that declined to 145 in Middle English (the language of Chaucer) and to 98 in modern English. Of the original Old English irregulars, the 12 most frequently used verbs stayed irregular, while 11 out of the 12 least frequently used verbs made the changeover. Only "slink" held the line.
"The data had spoken," the authors write. "Something akin to natural selection was influencing human culture, leaving its fingerprints among the verbs. Usage frequency was having an extraordinarily strong effect on verb survival, making the difference between the verbs that were mourn/mourned and the verbs that were fit/fit to survive."
Invigorated by the great verb chase, the authors went hunting for bigger game. Given a large enough storehouse of words and a fine filter, would it be possible to see cultural change at the micro level, to follow minute fluctuations in human thought processes and activities?
By chance, Google Books, the megaproject to digitise every page of every book ever printed - all 130 million of them - was starting to roll just as the authors were looking for their next target of enquiry.
Meetings were held, deals were struck and the authors got to it. In 2010, working with Google, they perfected the Ngram Viewer, which takes its name from the computer-science term for a word or phrase. This "robot historian," as they call it, can search the 30 million volumes already digitised by Google Books and instantly generate a usage-frequency timeline for any word, phrase, date or name, a sort of stock-market graph illustrating the ups and downs of cultural shares over time.
Mr Aiden, now director of the Centre for Genome Architecture at Rice University, and Mr Michel, who went on to start the data-science company Quantified Labs, play the Ngram Viewer (books.google.com/ngrams) like a Wurlitzer.
They graph, to take one example, the astounding career path of "chortle," coined by Lewis Carroll in "Jabberwocky," which has left its siblings "galumphing" and "frumious" in the dust. They tease out the most-mentioned names of people born in each year from 1800 to 1949. Among those born in 1871, the name that appeared most often was Cordell Hull, secretary of state under Franklin D Roosevelt, not Orville Wright. They also come up with a sort of fame speedometer. The ngram data show that people are becoming famous at a younger age, and faster, than they did two generations ago. Fame is much bigger, too. At one point, the authors write, Bill Clinton's ngram "was almost exactly as frequent as the word lettuce, twice as frequent as the word cucumber, and about half as frequent as the word tomato. He completely outclassed second-tier vegetables like turnip and cauliflower".
The momentous term culturomics suggests the authors' ambitious view of what can seem like an intellectual parlour game. The magazine Mother Jones, they cheerfully admit, called the Ngram Viewer "possibly the greatest time-waster in the history of the Internet". But the authors argue that just as Galileo's telescope opened new, previously unimagined worlds, the powerful lens of culturomics "is going to change the humanities, transform the social sciences and renegotiate the relationship between the world of commerce and the ivory tower".
Judging by the evidence on offer in Uncharted, the claim seems a tad boastful. Yes, it is fascinating to know that "donut" gained traction as a variant spelling soon after Dunkin' Donuts was founded in 1950. The authors serve up many a tasty morsel like this: lots of fun but less than earthshaking without elaboration.
Uncharted began life as an article in Science magazine in December 2010, and the authors have huffed and puffed to inflate it to book length. They digress at every turn and, to add weight at the back end, they have appended nearly 50 ngram searches.
They also overexplain. Most readers do not need a background lesson on Nazi policies towards the arts to understand why, in German books published between 1933 and 1945, the graph for Marc Chagall dips like a downward-speeding roller coaster.
This may be potato chips for intellectuals, but it is irresistible. You cannot eat just one ngram.
©2013 The New York Times News Service