- Posted by Stephen Whiteley
- On 21/02/2011
- books, google ngrams, Linguistics
Anyone who’s anyone in the world of language, from David Crystal to the London Review of Books, is blogging furiously about Google’s latest project, the Books Ngram Viewer, a.k.a. culturomics.
You can see why. This is a mind-bogglingly grandiose effort, involving the scanning and digitisation of two trillion words’ worth of books in a range of languages, a total which represents 11% of the books published between 1800 and 2008. So far, Google Labs have made a trifling 500 billion words of this, roughly 4% of all books ever printed, available for online analysis.
The idea is that you can search for a word in the database, which will then analyse the frequency of its use over the past 200 years. If you haven’t already, take a look at the program here. You will quickly get an idea of exactly how inane and pointless the whole thing is.
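For the curious, the kind of calculation behind such a chart can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `relative_frequency` and all the numbers are invented, and it is not Google's code or data. The point is simply that a word's raw count has to be divided by the total number of words printed that year, otherwise every word appears to "rise" merely because more books are published.

```python
def relative_frequency(word_counts, total_counts):
    """Return {year: share of all words printed that year taken up by the word}.

    Both arguments are dicts keyed by year. Years with no recorded words
    are skipped to avoid dividing by zero.
    """
    return {
        year: word_counts.get(year, 0) / total_counts[year]
        for year in sorted(total_counts)
        if total_counts[year] > 0
    }


# Invented figures for one hypothetical word (not real Ngram data).
word_counts = {1800: 12, 1900: 150, 2000: 200}
total_counts = {1800: 1_000_000, 1900: 5_000_000, 2000: 20_000_000}

freq = relative_frequency(word_counts, total_counts)
for year, share in freq.items():
    print(f"{year}: {share:.6%}")
```

In this made-up case the word is printed more often in 2000 than in 1900 (200 times against 150), yet its share of all printed words falls, which is what the chart would show.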
I’m sure I am not alone in that the first thing I did, like a child with a dictionary, was look up an assortment of swearwords. Then my name. Then some nonsense words and neologisms. Then QuickSilver (quite interesting, actually, given that the word was originally used by the alchemists to mean mercury). Then I bookmarked the page and did something else.
Perhaps the problem was that I didn’t have a specific research need that the Ngram Viewer responded to. The journal Science, in which this project was heralded, argues that there is a wide range of interesting data to be mined using this method. It gives the example of Marc Chagall, mentions of whose name increased fivefold in English between 1936 and 1944, whilst in German books published under the Nazis it appeared just once. Not necessarily very surprising, but it is probably useful to have some precise statistics to back up an intuitive assumption.
But is it really that precise? The collection comprises books taken from ‘over 40 university libraries from around the world’, which any sociologist will tell you already constitutes a fairly skewed selection. There are no newspapers, no pamphlets, no magazines, no blogs (although the program’s creators say they plan to incorporate more ephemeral types of publication). As David Crystal wisely counsels, we must not exaggerate the importance of this project, nor its effectiveness as a research tool. According to the article in Science, the data show that Freud has become ‘embedded in our collective subconscious’; but, as Crystal points out, the program cannot distinguish which Freud is mentioned: ‘They assume Sigmund. But what about Lucian, Clement, Anna…?’
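Crystal’s objection is easy to reproduce on a toy scale. The sketch below is my own illustration in Python, not the project’s method: the helper `ngram_counts` and the sample sentences are invented. It counts single words (1-grams) and word pairs (2-grams) in a scrap of text, and shows that a search for the bare surname lumps every Freud together.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive words in the text.

    Tokenisation is deliberately crude (letters and apostrophes only),
    and runs are allowed to cross sentence boundaries.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Invented sample text, purely for illustration.
sample = (
    "Sigmund Freud wrote on dreams. Lucian Freud painted portraits. "
    "Anna Freud studied children. Freud remains widely read."
)

unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)

print("Freud:", unigrams["Freud"])
for name in ("Sigmund Freud", "Lucian Freud", "Anna Freud"):
    print(f"{name}:", bigrams[name])
```

Searching for the two-word names separates Sigmund from Lucian and Anna, but the fourth, bare ‘Freud’ in the sample still cannot be assigned to anyone without reading the surrounding sentence.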
The project’s founders, who include Steven Pinker and Martin Nowak, say their approach ‘can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology.’ (A PDF of their paper is available here.) There is undoubtedly significant potential here, but I suspect it will be some time before ‘Culturomics’ (noisome neologism as it is) is capable of generating anything more than anecdotes.