Sunday Morning Coffee: One Million Words!; Culturomics; Seamus Heaney

1.1 million words...1.1 million words!

That is the size of the English-language lexicon as estimated last month by The Cultural Observatory at Harvard, directed by Erez Lieberman Aiden and Jean-Baptiste Michel. (For more, see Patricia Cohen at the NY Times and the team at io9.)

The Cultural Observatory's mission is (per their website) "to enable the quantitative study of human culture across societies and across centuries...[by]...:

* Creating massive datasets relevant to human culture
* Using these datasets to power wholly new types of analysis
* Developing tools that enable researchers and the general public to query the data."

They call this approach "culturomics," describing it in a December 16th paper in Science (Michel et al., "Quantitative Analysis of Culture Using Millions of Digitized Books"). Here's the article abstract:

"We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. "Culturomics" extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities."

Michel, Lieberman Aiden, and their colleagues co-authored the paper with a team from Google, and together they have launched Ngram, Google's freely available, searchable database built on the corpus described in the abstract above: 5.2 million scanned books comprising c. 500 billion words and phrases.

Which leads us back to their estimate that the English language contains c. 1.1 million words, with about 8,500 new words entering every year. The Oxford English Dictionary includes perhaps half that total; one of culturomics' first claims is that dictionaries miss 50-60% of the words actually in the lexicon, because low-frequency words do not make the cut. (A truly exhaustive dictionary would be a Borgesian venture, it seems to me, truly exhausting the capacity of humans to document; culturomic datasets such as Ngram complement and augment but do not replace dictionaries.) Ngram is a tool, like the specialized telescopes that search for quasars in the infinite, for exploring what The Cultural Observatory calls linguistic/lexicographical "dark matter."

Let's plunge into this dark matter, this hitherto unrecognized aquifer, a river-ocean flowing beneath the sunlit waves we think we know. Let's dig deep through the strata of words, hunt for truffles in the roots, find the still-living marrow in ancient bones.

As Seamus Heaney puts it in "Bone Dreams":

A skeleton
in the tongue's
old dungeons.

I push back
through dictions,
Elizabethan canopies,
Norman devices,
the erotic mayflowers
of Provence
and the ivied Latins
of churchmen

to the scop's
twang, the iron
flash of consonants
cleaving the line.

