Quantitative analysis of culture using millions of digitized books
Science. 2011 Jan 14;331(6014):176-82.
doi: 10.1126/science.1199644. Epub 2010 Dec 16.
Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K Gray; Google Books Team; Joseph P Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A Nowak, Erez Lieberman Aiden
- PMID: 21163965
- PMCID: PMC3279742
- DOI: 10.1126/science.1199644
Quantitative analysis of culture using millions of digitized books
Jean-Baptiste Michel et al. Science. 2011.
Abstract
We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of 'culturomics,' focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.
Figures

‘Culturomic’ analyses study millions of books at once. (A) Top row: authors have been writing for millennia; ~129 million book editions have been published since the advent of the printing press (upper left). Second row: Libraries and publishing houses provide books to Google for scanning (middle left). Over 15 million books have been digitized. Third row: each book is associated with metadata. Five million books are chosen for computational analysis (bottom left). Bottom row: a culturomic ‘timeline’ shows the frequency of ‘apple’ in English books over time (1800-2000). (B) Usage frequency of ‘slavery’. The Civil War (1861-1865) and the civil rights movement (1955-1968) are highlighted in red. The number in the upper left (1e-4) is the unit of frequency. (C) Usage frequency over time for ‘the Great War’ (blue), ‘World War I’ (green), and ‘World War II’ (red).
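The 'timeline' in panel A plots a usage frequency per year. A minimal sketch of that quantity follows, assuming usage frequency means the number of occurrences of a word in a given year divided by the total number of words printed that year. The function name `usage_frequency`, the whitespace tokenizer, and the toy corpus are illustrative assumptions, not the authors' pipeline.

```python
from collections import Counter


def usage_frequency(docs_by_year, term):
    """Return {year: occurrences of `term` / total tokens in that year}.

    `docs_by_year` maps a year to a list of text strings (hypothetical input
    format). Tokenization is a naive lowercase whitespace split.
    """
    freq = {}
    for year, texts in sorted(docs_by_year.items()):
        tokens = [tok for text in texts for tok in text.lower().split()]
        if tokens:  # skip years with no text to avoid dividing by zero
            freq[year] = Counter(tokens)[term.lower()] / len(tokens)
    return freq


# Made-up toy corpus, for illustration only.
toy_corpus = {
    1800: ["the apple fell", "an apple a day"],
    1900: ["the war ended", "apple pie and tea"],
}
timeline = usage_frequency(toy_corpus, "apple")
```

In this toy corpus each year has seven tokens, so the frequency of 'apple' is 2/7 in 1800 and 1/7 in 1900. Real corpora need proper tokenization and per-year totals taken over the whole collection.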

Culturomics has profound consequences for the study of language, lexicography, and grammar. (A) The size of the English lexicon over time. Tick marks show the number of single words in three dictionaries (see text). (B) Fraction of words in the lexicon that appear in two different dictionaries as a function of usage frequency. (C) Five words added by the AHD in its 2000 update. Inset: Median frequency of new words added to AHD4 in 2000. The frequency of half of these words exceeded 10^-9 as far back as 1890 (white dot). (D) Obsolete words added to AHD4 in 2000. Inset: Mean frequency of the 2220 AHD headwords whose current usage frequency is less than 10^-9. (E) Usage frequency of irregular verbs (red) and their regular counterparts (blue). Some verbs (chide/chided) have regularized during the last two centuries. The trajectories for ‘speeded’ and ‘speed up’ (green) are similar, reflecting the role of semantic factors in this instance of regularization. The verb ‘burn’ first regularized in the US (US flag) and later in the UK (UK flag). The irregular ‘snuck’ is rapidly gaining on ‘sneaked’. (F) Scatter plot of the irregular verbs; each verb's position depends on its regularity (see text) in the early 19th century (x-coordinate) and in the late 20th century (y-coordinate). For 16% of the verbs, the change in regularity was greater than 10% (large font). Dashed lines separate irregular verbs (regularity<50%) from regular verbs (regularity>50%). Six verbs became regular (upper left quadrant, blue), while two became irregular (lower right quadrant, red). Inset: the regularity of ‘chide’ over time. (G) Median regularity of verbs whose past tense is often signified with a –t suffix instead of –ed (burn, smell, spell, spill, dwell, learn, and spoil) in US (black) and UK (grey) books.
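The caption refers to a verb's 'regularity' without defining it here. A minimal sketch follows, under the assumption that regularity is the share of past-tense occurrences that use the regular –ed form. That definition, the function names, and the counts are illustrative assumptions; only the 50% threshold is taken from panel F.

```python
def regularity(regular_count, irregular_count):
    """Share of past-tense uses in the regular (-ed) form (assumed definition)."""
    total = regular_count + irregular_count
    if total == 0:
        raise ValueError("no past-tense occurrences to compare")
    return regular_count / total


def classify(reg):
    """Apply the 50% threshold from panel F; exactly 50% is left undetermined."""
    if reg > 0.5:
        return "regular"
    if reg < 0.5:
        return "irregular"
    return "undetermined"


# Made-up counts for 'chided' (regular) versus 'chid' (irregular).
chide_regularity = regularity(regular_count=30, irregular_count=10)
chide_class = classify(chide_regularity)
```

With these invented counts the regularity is 0.75, which falls on the regular side of the dashed line.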

Cultural turnover is accelerating. (A) We forget: frequency of 1883 (blue), 1910 (green) and 1950 (red). Inset: We forget faster. The half-life of the curves (grey dots) is getting shorter (grey line: moving average). (B) Cultural adoption occurs faster. Median trajectory for three cohorts of inventions from three different time periods (1800-1840: blue, 1840-1880: green, 1880-1920: red). Inset: The telephone (green, date of invention: green arrow) and radio (blue, date of invention: blue arrow). (C) Fame of various personalities born between 1920 and 1930. (D) Frequency of the 50 most famous people born in 1871 (grey lines; median: dark gray). Five examples are highlighted. (E) The median trajectory of the 1865 cohort is characterized by four parameters: (i) initial ‘age of celebrity’ (34 years old, tick mark); (ii) doubling time of the subsequent rise to fame (4 years, blue line); (iii) ‘age of peak celebrity’ (70 years after birth, tick mark), and (iv) half-life of the post-peak ‘forgetting’ phase (73 years, red line). Inset: The doubling time and half-life over time. (F) The median trajectory of the 25 most famous personalities born between 1800 and 1920 in various careers.
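Panel E summarizes a fame trajectory by a doubling time and a half-life. A minimal sketch of how such timescales relate to two points on a curve follows, assuming the rise and the decline are each purely exponential between the two sampled points. The function names and sample values are hypothetical, and the fitting procedure actually used for the figure is not described here.

```python
import math


def doubling_time(t1, f1, t2, f2):
    """Doubling time of an exponential rise through (t1, f1) and (t2, f2)."""
    if not (t2 > t1 and f2 > f1 > 0):
        raise ValueError("need t2 > t1 and a rising, positive frequency")
    return (t2 - t1) * math.log(2) / math.log(f2 / f1)


def half_life(t1, f1, t2, f2):
    """Half-life of an exponential decay through (t1, f1) and (t2, f2)."""
    if not (t2 > t1 and f1 > f2 > 0):
        raise ValueError("need t2 > t1 and a falling, positive frequency")
    return (t2 - t1) * math.log(2) / math.log(f1 / f2)


# Made-up frequencies: an eightfold rise over 12 years is three doublings.
rise = doubling_time(0, 1e-9, 12, 8e-9)
# Made-up frequencies: a drop by half over 73 years.
decay = half_life(0, 4e-8, 73, 2e-8)
```

An eightfold rise over 12 years corresponds to a doubling time of about 4 years, and a halving over 73 years to a half-life of about 73 years. These inputs were chosen to match the magnitudes quoted in the caption and are not data from the paper.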

Culturomics can be used to detect censorship. (A) Usage frequency of ‘Marc Chagall’ in German (red) as compared to English (blue). (B) Suppression of Leon Trotsky (blue), Grigory Zinoviev (green), and Lev Kamenev (red) in Russian texts, with noteworthy events indicated: Trotsky's assassination (blue arrow), Zinoviev and Kamenev executed (red arrow), the ‘Great Purge’ (red highlight), perestroika (grey arrow). (C) The 1976 and 1989 Tiananmen Square incidents both lead to elevated discussion in English texts. Response to the 1989 incident is largely absent in Chinese texts (blue), suggesting government censorship. (D) After the ‘Hollywood Ten’ were blacklisted (red highlight) from American movie studios, their fame declined (median: wide grey). None of them were credited in a film until 1960's (aptly named) ‘Exodus’. (E) Writers in various disciplines were suppressed by the Nazi regime (red highlight). In contrast, the Nazis themselves (thick red) exhibited a strong fame peak during the war years. (F) Distribution of suppression indices for both English (blue) and German (red) for the period from 1933-1945. Three victims of Nazi suppression are highlighted at left (red arrows). Inset: Calculation of the suppression index for ‘Henri Matisse’.

Culturomics provides quantitative evidence for scholars in many fields. (A) Historical Epidemiology: ‘influenza’ is shown in blue; the Russian, Spanish, and Asian flu epidemics are highlighted. (B) History of the Civil War. (C) Comparative History. (D) Gender studies. (E,F) History of Science. (G) Historical Gastronomy. (H) History of Religion: ‘God’.
Comment in
- Digital data. Google books, Wikipedia, and the future of culturomics. Bohannon J. Science. 2011 Jan 14;331(6014):135. doi: 10.1126/science.331.6014.135. PMID: 21233356. No abstract available.
- Culturomics: periodicals gauge culture's pulse. Schwartz T. Science. 2011 Apr 1;332(6025):35-6; author reply 36-7. doi: 10.1126/science.332.6025.35-c. PMID: 21454770. No abstract available.
- Culturomics: statistical traps muddy the data. Morse-Gagné EE. Science. 2011 Apr 1;332(6025):35; author reply 36-7. doi: 10.1126/science.332.6025.35-b. PMID: 21454771. No abstract available.