cambridge.org

The generalizability crisis | Behavioral and Brain Sciences | Cambridge Core

  • Fri Feb 21 2025

References

Acosta, A., Adams, R. B. Jr., Albohn, D. N., Allard, E. S., Beek, T., Benning, S. D., … Zwaan, R. A. (2016). Registered replication report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11(6), 917–928.

Alogna, V. K., Attaya, M. K., Aucoin, P., Bahník, Š., Birch, S., Birt, A. R., … Zwaan, R. A. (2014). Registered replication report: Schooler and Engstler-Schooler (1990). Perspectives on Psychological Science, 9(5), 556–578.

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412.

Balota, D. A., Yap, M. J., Hutchison, K. A., & Cortese, M. J. (2012). Megastudies: What do millions (or so) of trials tell us about lexical processing? In Adelman, J. S. (Ed.), Visual word recognition volume 1: Models and methods, orthography and phonology (pp. 90–115). Psychology Press.

Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., van Ravenzwaaij, D., … Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences of the United States of America, 115(11), 2607–2612.

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278.

Bates, D., Maechler, M., Bolker, B., Walker, S., Christensen, R. H. B., Singmann, H., … Krivitsky, P. N. (2014). lme4: Linear mixed-effects models using Eigen and S4. R Package Version, 1(7), 1–23.

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6.

Bergelson, E., Bergmann, C., Byers-Heinlein, K., Cristia, A., Cusack, R., Dyck, K., … (2017). Quantifying sources of variability in infancy research using the infant-directed speech preference.

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110(2), 203–219.

Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3), 199–215.

Brennan, R. L. (1992). Generalizability theory. Educational Measurement: Issues and Practice, 11(4), 27–34.

Brunswik, E. (1947). Systematic and representative design of psychological experiments. In Proceedings of the Berkeley symposium on mathematical statistics and probability (pp. 143–202).

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., … Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1–32.

Chabris, C. F., Hebert, B. M., Benjamin, D. J., Beauchamp, J., Cesarini, D., van der Loos, M., … Laibson, D. (2012). Most reported genetic associations with general intelligence are probably false positives. Psychological Science, 23(11), 1314–1323.

Cheung, I., Campbell, L., LeBel, E. P., Ackerman, R. A., Aykutoğlu, B., Bahník, Š., … Yong, J. C. (2016). Registered replication report: Study 1 from Finkel, Rusbult, Kumashiro, & Hannon (2002). Perspectives on Psychological Science, 11(5), 750–764.

Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12(4), 335–359.

Cohen, J. (2016). The earth is round (p < 0.05). In Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.), What if there were no significance tests? (pp. 69–82). Routledge.

Coleman, E. B. (1964). Generalizing to a language population. Psychological Reports, 14(1), 219–226.

Colhoun, H. M., McKeigue, P. M., & Davey Smith, G. (2003). Problems of reporting genetic associations with complex outcomes. Lancet (London, England), 361(9360), 865–872.

Cornfield, J., & Tukey, J. W. (1956). Average values of mean squares in factorials. The Annals of Mathematical Statistics, 27(4), 907–949.

Crabbe, J. C., Wahlsten, D., & Dudek, B. C. (1999). Genetics of mouse behavior: Interactions with laboratory environment. Science, 284(5420), 1670–1672.

Crits-Christoph, P., & Mintz, J. (1991). Implications of therapist effects for the design and analysis of comparative studies of psychotherapies. Journal of Consulting and Clinical Psychology, 59(1), 20–26.

Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30(2), 116.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.

Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. The British Journal of Mathematical and Statistical Psychology, 16(2), 137–163.

Draper, D. (1995). Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 45–70.

Ebstein, R. P., Novick, O., Umansky, R., Priel, B., Osher, Y., Blaine, D., … Belmaker, R. H. (1996). Dopamine D4 receptor (D4DR) exon III polymorphism associated with the human personality trait of novelty seeking. Nature Genetics, 12(1), 78–80.

Eerland, A. S., Magliano, A. M., Zwaan, J. P., Arnal, R. A., Aucoin, J. D., & Crocker, P. (2016). Registered replication report: Hart & Albarracín (2011). Perspectives on Psychological Science, 11(1), 158–171.

Feynman, R. P. (1974). Cargo cult science. Engineering Sciences, 37(7), 10–13.

Francis, G. (2012). Publication bias and the failure of replication in experimental psychology. Psychonomic Bulletin & Review, 19(6), 975–991.

Gelman, A. (2015). The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management, 41(2), 632–643.

Gelman, A. (2016). The problems with p-values are not just with p-values. The American Statistician, 70(supplemental material to the ASA statement on p-values and statistical significance), 10.

Gelman, A. (2018). The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. Personality and Social Psychology Bulletin, 44(1), 16–23.

Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis. Downloaded January, 1–17.

Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1), 8–38.

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606.

Gigerenzer, G., & Marewski, J. N. (2015). Surrogate science: The idol of a universal method for scientific inference. Journal of Management, 41(2), 421–440.

Guion, R. M. (1980). On Trinitarian doctrines of validity. Professional Psychology, 11(3), 385–398.

Hamilton, L. S., & Huth, A. G. (2018). The revolution will not be controlled: Natural stimuli in speech neuroscience. Language, Cognition and Neuroscience, 35(5), 573–582.

Hofman, J. M., Sharma, A., & Watts, D. J. (2017). Prediction and explanation in social systems. Science, 355(6324), 486–488.

Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E., & Gallant, J. L. (2016). Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600), 453–458.

Huth, A. G., Nishimoto, S., Vu, A. T., & Gallant, J. L. (2012). A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron, 76(6), 1210–1224.

Ioannidis, J. (2008). Why most discovered true associations are inflated. Epidemiology (Cambridge, Mass.), 19(5), 640–648.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532.

Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255–260.

Judd, C. M., Westfall, J., & Kenny, D. A. (2012). Treating stimuli as a random factor in social psychology: A new and comprehensive solution to a pervasive but largely ignored problem. Journal of Personality and Social Psychology, 103(1), 54–69.

Keuleers, E., & Balota, D. A. (2015). Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments. The Quarterly Journal of Experimental Psychology, 68(8), 1457–1468.

Kruschke, J. K., & Liddell, T. M. (2017). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25, 178–206.

Kühberger, A., Fritz, A., & Scherndl, T. (2014). Publication bias in psychology: A diagnosis based on the correlation between effect size and sample size. PLoS ONE, 9(9), e105825.

Lakens, D. (2017). Equivalence tests: A practical primer for t tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8(4), 355–362.

Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2(3), 168–171.

Lesch, K. P., Bengel, D., Heils, A., Sabol, S. Z., Greenberg, B. D., Petri, S., … Murphy, D. L. (1996). Association of anxiety-related traits with a polymorphism in the serotonin transporter gene regulatory region. Science, 274(5292), 1527–1531.

Lilienfeld, S. O. (2004). Taking theoretical risks in a world of directional predictions. Applied and Preventive Psychology, 11(1), 47–51.

Lilienfeld, S. O. (2017). Psychology's replication crisis and the grant culture: Righting the ship. Perspectives on Psychological Science, 12(4), 660–664.

Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3), 151–159.

MacLeod, C. M. (1991). Half a century of research on the Stroop effect: An integrative review. Psychological Bulletin, 109(2), 163–203.

Marewski, J. N., & Olsson, H. (2009). Beyond the null ritual: Formal modeling of psychological processes. Zeitschrift für Psychologie/Journal of Psychology, 217(1), 49–60.

Matuschek, H., Kliegl, R., Vasishth, S., Baayen, H., & Bates, D. (2017). Balancing type I error and power in linear mixed models. Journal of Memory and Language, 94, 305–315.

Mayo, D. G. (1991). Novel evidence and severe tests. Philosophy of Science, 58(4), 523–552.

Mayo, D. G. (2018). Statistical inference as severe testing. Cambridge University Press.

McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon statistical significance. The American Statistician, 73(Suppl. 1), 235–245.

Meehl, P. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.), What if there were no significance tests? (pp. 393–425). Erlbaum.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103–115.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806.

Meehl, P. E. (1986). What social scientists don't understand. In Fiske, D. W., & Shweder, R. A. (Eds.), Metatheory in social science: Pluralisms and subjectivities (pp. 315–338). University of Chicago Press.

Meehl, P. E. (1990a). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141.

Meehl, P. E. (1990b). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195–244.

Meissner, C. A., & Brigham, J. C. (2001). A meta-analysis of the verbal overshadowing effect in face identification. Applied Cognitive Psychology, 15(6), 603–616.

Meissner, C. A., & Memon, A. (2002). Verbal overshadowing: A special issue exploring theoretical and applied issues. Applied Cognitive Psychology, 16(8), 869–872.

Moshontz, H., Campbell, L., Ebersole, C. R., IJzerman, H., Urry, H. L., Forscher, P. S., … Chartier, C. R. (2018). Psychological science accelerator: Advancing psychology through a distributed collaborative network. Advances in Methods and Practices in Psychological Science, 1(4), 501–515.

Nagel, M., Jansen, P. R., Stringer, S., Watanabe, K., de Leeuw, C. A., Bryois, J., … Posthuma, D. (2018). Meta-analysis of genome-wide association studies for neuroticism in 449,484 individuals identifies novel genetic loci and pathways. Nature Genetics, 50(7), 920–927.

O'Leary-Kelly, S. W., & Vokurka, R. J. (1998). The empirical assessment of construct validity. Journal of Operations Management, 16(4), 387–405.

Pashler, H., & Wagenmakers, E.-J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7(6), 528–530.

Popper, K. (2014). Conjectures and refutations: The growth of scientific knowledge. Routledge.

Reuss, H., Kiesel, A., & Kunde, W. (2015). Adjustments of response speed and accuracy to unconscious cues. Cognition, 134, 57–62.

Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107(2), 358.

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237.

Rozin, P. (2001). Social psychology and science: Some lessons from Solomon Asch. Personality and Social Psychology Review, 5(1), 2–14.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.

Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2, e55.

Savage, J. E., Jansen, P. R., Stringer, S., Watanabe, K., Bryois, J., de Leeuw, C. A., … Posthuma, D. (2018). Genome-wide association meta-analysis in 269,867 individuals identifies new genetic and functional links to intelligence. Nature Genetics, 50(7), 912–919.

Schooler, J. W., & Engstler-Schooler, T. Y. (1990). Verbal overshadowing of visual memories: Some things are better left unsaid. Cognitive Psychology, 22(1), 36–71.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. SAGE.

Shrout, P. E., & Rodgers, J. L. (2018). Psychology, science, and knowledge construction: Broadening perspectives from the replication crisis. Annual Review of Psychology, 69, 487–510.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

Simons, D. J., Holcombe, A. O., & Spellman, B. A. (2014). An introduction to Registered Replication Reports at Perspectives on Psychological Science. Perspectives on Psychological Science, 9(5), 552–555.

Simons, D. J., Shoda, Y., & Lindsay, D. S. (2017). Constraints on generality (COG): A proposed addition to all empirical papers. Perspectives on Psychological Science, 12(6), 1123–1128.

Smaldino, P. E. (2017). Models are stupid, and we need more of them. In Vallacher, R. R., Read, S. J., & Nowak, A. (Eds.), Computational social psychology (pp. 311–331). Routledge.

Smaldino, P. E., & McElreath, R. (2016). The natural selection of bad science. Royal Society Open Science, 3(9), 160384.

Smedslund, J. (1991). The pseudoempirical in psychology and the case for psychologic. Psychological Inquiry, 2(4), 325–338.

Spiers, H. J., & Maguire, E. A. (2007). Decoding human brain activity during real-world experiences. Trends in Cognitive Sciences, 11(8), 356–365.

Steckler, A., McLeroy, K. R., Goodman, R. M., Bird, S. T., & McCormick, L. (1992). Toward integrating qualitative and quantitative methods: An introduction. Health Education Quarterly, 19(1), 1–8.

Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18(6), 643.

Tong, C. (2019). Statistical inference enables bad science; statistical thinking enables good science. The American Statistician, 73(Suppl. 1), 246–261.

Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37(1), 1–2.

Van Bavel, J. J., Mende-Siedlecki, P., Brady, W. J., & Reinero, D. A. (2016). Contextual sensitivity in scientific reproducibility. Proceedings of the National Academy of Sciences of the United States of America, 113(23), 6454–6459.

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804.

Wahlsten, D., Metten, P., Phillips, T. J., Boehm, S. L., Burkhart-Kasch, S., Dorow, J., … (2003). Different data from different labs: Lessons from studies of gene–environment interaction. Journal of Neurobiology, 54(1), 283–311.

Walker, H. A., & Cohen, B. P. (1985). Scope statements: Imperatives for evaluating theory. American Sociological Review, 50, 288–301.

Westfall, J., Nichols, T. E., & Yarkoni, T. (2016). Fixing the stimulus-as-fixed-effect fallacy in task fMRI. Wellcome Open Research, 1, 23.

Wolsiefer, K., Westfall, J., & Judd, C. M. (2017). Modeling stimulus variation in three common implicit attitude tasks. Behavior Research Methods, 49(4), 1193–1209.

Wray, N. R., Ripke, S., Mattheisen, M., Trzaskowski, M., Byrne, E. M., Abdellaoui, A., … Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium. (2018). Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nature Genetics, 50(5), 668–681.

Yarkoni, T. (2009). Big correlations in little studies: Inflated fMRI correlations reflect low statistical power-commentary on Vul et al. (2009). Perspectives on Psychological Science, 4(3), 294–298.

Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100–1122.