
Title:
Domain-general object recognition predicts human ability to tell real from AI-generated faces.
Authors:
Chow JK; Department of Psychology, Vanderbilt University., McGugin RW; Department of Psychology, Vanderbilt University., Gauthier I; Department of Psychology, Vanderbilt University.
Source:
Journal of experimental psychology. General [J Exp Psychol Gen] 2026 Mar; Vol. 155 (3), pp. 629-648. Date of Electronic Publication: 2025 Dec 15.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: American Psychological Assn Country of Publication: United States NLM ID: 7502587 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1939-2222 (Electronic) Linking ISSN: 00221015 NLM ISO Abbreviation: J Exp Psychol Gen Subsets: MEDLINE
Imprint Name(s):
Original Publication: Washington, American Psychological Assn.
Grant Information:
Vanderbilt University; National Science Foundation
Entry Date(s):
Date Created: 20251215 Date Completed: 20260205 Latest Revision: 20260205
Update Code:
20260205
DOI:
10.1037/xge0001881
PMID:
41396656
Database:
MEDLINE


Faces created by artificial intelligence (AI) are now considered indistinguishable from real faces. Still, humans vary in their ability to detect these faces, a skill so novel it would have been useless a few years ago. We show that some individuals are consistently better at discriminating real from AI-generated faces. We used latent variable modeling to test whether this ability can be predicted by a domain-general ability, called o, which is measured as the shared variance between perceptual and memory judgments of both novel and familiar objects. We show that o predicts detection of AI-generated faces better than face recognition, intelligence, or experience with AI. An analysis of the relation between performance and cues in the image reveals that people are more likely to be misled by cues from AI faces than from real faces. It also suggests that those with a high o are less cue dependent than those with a low o. The o advantage on our task likely reflects robust visual processing under challenging conditions rather than superior artifact detection. Our results add to a growing literature suggesting that o can predict a wide range of perceptual decisions, including one that lacks evolutionary precedent, providing insights into the cognitive architecture underlying complex perceptual judgments. An understanding of individual differences in AI detection may facilitate interactions between humans and AI, for instance, to optimize training data for generative models. (PsycInfo Database Record (c) 2026 APA, all rights reserved).

Domain-General Object Recognition Predicts Human Ability to Tell Real From AI-Generated Faces

<bold>By: Jason K. Chow</bold>, Department of Psychology, Vanderbilt University
<bold>Rankin W. McGugin</bold>, Department of Psychology, Vanderbilt University
<bold>Isabel Gauthier</bold>, Department of Psychology, Vanderbilt University


<bold>Acknowledgement: </bold>Timothy Vickery served as action editor. Jason K. Chow and Rankin W. McGugin share first authorship. The authors declare no conflicts of interest. This work was supported by the David K. Wilson Chair Research Fund from Vanderbilt University and the National Science Foundation, Behavioral and Cognitive Sciences Award 2316474. Jason K. Chow played a lead role in formal analysis and methodology, a supporting role in writing–original draft and writing–review and editing, and an equal role in data curation and investigation. Rankin W. McGugin played a lead role in investigation and methodology, a supporting role in writing–original draft and writing–review and editing, and an equal role in data curation. Isabel Gauthier played a lead role in conceptualization, supervision, and writing–review and editing and an equal role in writing–original draft.

Despite our expertise at processing faces, we now confront a perceptual challenge we were not prepared for: determining whether a face is real or generated by artificial intelligence (AI). AI-generated faces (henceforth referred to as AI faces) are becoming increasingly difficult to distinguish from real photographs. While early AI faces contained obvious artifacts—asymmetric eye colors, misaligned teeth, or blurred hair textures—modern algorithms produce images that can fool most viewers (Duffy et al., 2024; Nightingale & Farid, 2022). Because interacting with realistic AI-generated images is such a new situation, evolutionary adaptations—and perhaps also learned expertise—may not be helpful. Some individuals appear to be better at detecting AI faces than others (Duffy et al., 2024), but whether cognitive and perceptual abilities are relevant has not been systematically investigated. In addition, we do not know whether individual differences in AI face detection ability are stable and reliable across different faces and viewing conditions (Duffy et al., 2024; Nightingale & Farid, 2022). Understanding what makes certain people consistently better at this task is crucial, especially given how rapidly AI technology is advancing. This knowledge could have implications for theoretical models of visual recognition and for practical efforts to combat misinformation and fraud (Cross, 2022; Kietzmann et al., 2020).

Modern AI models, such as generative adversarial networks (GANs), learn from large data sets of real images to produce realistic human faces almost indistinguishable from actual photos (Karras et al., 2020; Wang et al., 2018). A GAN consists of one network creating images that another network attempts to categorize. As the generator iteratively learns to fool the detector, it creates images that can also fool human perceivers. AI faces now appear online across different platforms and contexts, not only in portrait form but also as part of more complex scenes and videos. In a large study (N = 1,276) looking at people’s detection of synthetic stimuli in ecological contexts similar to newsfeeds and social media, detecting AI faces was more difficult than detecting AI animals, food, or landscapes (Cooke et al., 2024). Despite the large sample size, individual differences were examined for only two dimensions: (a) Familiarity with synthetic media did not predict detection ability, and (b) younger participants outperformed older participants in detecting audio AI content, with a much smaller effect of age for AI images. A study focusing on AI faces only (Duffy et al., 2024) found small individual differences driven by gender (women tended to rate all faces, AI or not, as less realistic than men did), age (adults below 40 were slightly better than those older than 50), and a small age congruence effect (with participants performing a little better with faces closer to their age group).
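For readers unfamiliar with the adversarial setup, the toy sketch below illustrates the generator/discriminator loop in miniature. It is purely illustrative (a minimal PyTorch example with made-up layer sizes and flattened images), not the StyleGAN architecture used to create the stimuli in this work.

```python
# Illustrative toy GAN (not StyleGAN): a generator learns to produce images
# that a discriminator cannot tell apart from real ones. Sizes are made up.
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 784  # hypothetical sizes for flattened toy images

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, image_dim), nn.Tanh()
)
discriminator = nn.Sequential(
    nn.Linear(image_dim, 256), nn.ReLU(), nn.Linear(256, 1)  # "realness" logit
)

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_images: torch.Tensor) -> None:
    batch = real_images.size(0)
    z = torch.randn(batch, latent_dim)
    fake_images = generator(z)

    # Discriminator: call real images real (1) and generated images fake (0).
    d_loss = bce(discriminator(real_images), torch.ones(batch, 1)) + \
             bce(discriminator(fake_images.detach()), torch.zeros(batch, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: adjust weights so the discriminator calls its images real.
    g_loss = bce(discriminator(generator(z)), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```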

Most of the attention devoted to studying how humans perceive AI faces has focused on average performance, with a growing consensus that humans are often no better than chance at this task. A different approach uses signal detection analysis and suggests that humans may in fact be able to separate real from AI faces with above-chance accuracy, but they may judge AI faces—especially White AI faces—as more likely to be real than real faces (Miller et al., 2023). While the average sensitivity (d′) was −0.49, performance ranged a great deal across participants, from a d′ of ∼ −2 to ∼ 2. The stability of such individual differences over time was not assessed, but errors were moderately correlated with higher confidence, suggesting the variability may reflect true individual differences. Our work demonstrates the stability of such individual differences and tests whether they are related to perceptual and cognitive abilities. Specifically, we ask whether those with high domain-general object recognition ability (Richler et al., 2019) are better able to distinguish real from AI faces.
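For reference, d′ in a yes/no real-versus-AI judgment is simply the difference between the z-transformed hit and false-alarm rates. The sketch below uses invented rates (not values from Miller et al., 2023) that happen to yield a negative d′ of roughly the size reported.

```python
# Sketch: sensitivity (d') from hit and false-alarm rates in a yes/no
# real-vs.-AI judgment. The rates below are invented for illustration.
from scipy.stats import norm

def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# e.g., calling real faces "real" on 40% of trials but AI faces "real" on 58%
# of trials yields a negative d' (AI faces judged more real than real faces).
print(d_prime(0.40, 0.58))  # about -0.46
```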

General object recognition ability, o, is the ability to make fine-grained visual discriminations within a category (e.g., distinguishing similar birds) rather than broad categorical distinctions (e.g., bird vs. fish; Smithson & Gauthier, 2026). Like fluid intelligence, o is stable across contexts and expected to be relatively resistant to training. Nonetheless, o predicts category-specific learning given experience in a specific domain: Individuals with higher o are faster at learning to categorize blood cells as cancerous (Smithson et al., 2022). This ability was originally defined as a higher order latent variable in a model in which it accounted for 89% of the variance across perceptual and memory tasks with several novel (artificial) object categories (Richler et al., 2019). Later work found that o applies to both familiar and novel objects (Sunday et al., 2021). Measures of o predict success in tasks such as identifying lung nodules in chest X-rays (Sunday, Donnelly, & Gauthier, 2018), categorizing white blood cells as cancerous (Smithson et al., 2022), judging participant sex from retinal images (Delavari et al., 2023), or recognizing musical notation (Chang & Gauthier, 2020). Consistent with the idea that o is a new construct among psychometric abilities (Smithson & Gauthier, 2026), it appears to have predictive power for various skills even when controlling for other abilities, such as intelligence, working memory, perceptual speed, early vision, and spatial reasoning (Richler et al., 2019; Smithson, Chow, & Gauthier, 2024; Smithson & Gauthier, 2025). Given it can predict performance across a variety of complex visual tasks, o is a plausible predictor of the ability to detect real from AI faces.

However, it is possible that the most consistent individual differences in AI face detection could be specifically limited to face recognition. Unfamiliar face recognition ability (which we will call f) is typically measured by tests that assess performance in matching or recognizing unfamiliar faces across various poses (Burton et al., 2010; Duchaine & Nakayama, 2006). Performance on different face recognition tests is highly correlated (Bowles et al., 2009; Verhallen et al., 2017), and this shared variance is largely independent of memory, processing speed, and general cognitive abilities (Wilhelm et al., 2010). If the detection of AI-generated faces uses processes similar to those engaged by natural face recognition, f should predict this ability. While a considerable part of f appears to be heritable (Shakeshaft & Plomin, 2015; Wilmer et al., 2010), face recognition is sensitive to experience (Arcaro et al., 2017; Chua & Gauthier, 2020; McGugin et al., 2018). Learning the dimensions that govern statistical variations in real faces (Dotsch et al., 2016) could plausibly facilitate AI face detection (for instance, if artificial faces break some of these rules). If detecting AI-generated faces relies more heavily on specialized face processing skills than on general visual abilities, then f will predict performance better than o—we test this in our first study.

As an ability, f is similar to o as it is measured with a range of tasks, and differs from o in that it is a domain-specific ability, only measured with faces. Traditionally, face and object recognition abilities were often operationalized with a single task each, defined as distinct but correlated abilities (e.g., Dennett et al., 2012; Gray et al., 2019; Shakeshaft & Plomin, 2015; see Figure 1A). An alternative is to conceive of these abilities as part of a more general hierarchical framework of abilities (Schneider & McGrew, 2018; see Figure 1B). In such a model, o can be considered a broad ability, not unlike visual processing (or Gv). It may represent a broad ability that has not been characterized before (Smithson & Gauthier, 2026), as Gv is typically measured with spatial tasks and does not correlate with o once general intelligence is controlled (Smithson, Chow, & Gauthier, 2024). A domain-specific ability like f lies at a lower level than o and under its umbrella (alongside other putative specific abilities for other categories, like bodies, animals, and tools). By this account, some part of the variance common across different face recognition tasks should depend on o, because o is defined as the more general object recognition ability and also because both should include some variance stemming from a general intelligence factor (g).
[Figure 1]

Just as some of the variance in face recognition tasks can be attributed to o, some of the variance in a set of o tasks may depend on g (Carroll, 1993; Figure 1B). For this reason, in our second study, we consider g as a potential predictor of AI-generated face detection. In the prediction of academic achievement, many have argued that specific factors have little incremental predictive value over g (Canivez et al., 2014; Glutting et al., 2006). But specific factors can add incremental predictive value (Kell et al., 2013; LePine et al., 1997; McGill, 2015), especially if they are stable over time (Breit et al., 2024), which appears to be the case for o (Smithson, Chow, & Gauthier, 2024). Testing for incremental validity can be done with observed scores, by partialling out a more general factor, and with latent scores, by simultaneously modeling general and specific factors. We use both approaches in this work and find that they converge.

<h31 id="xge-155-3-629-d563e345">Research Overview</h31>

We developed the AIFT (AI face test), the first test created specifically to measure individual differences in the ability to judge whether a face is real versus generated by AI. We chose to use a relatively early GAN model (Karras et al., 2020) that does not fool humans to such an extent as to make it impossible to measure systematic variability. The design of the AIFT was governed by psychometric considerations (e.g., achieving reliability and avoiding floor), rather than representativeness of the full difficulty range of StyleGAN faces.<anchor name="b-fn1"></anchor><sups>1</sups> We do not include feedback on the main trials in our tasks to avoid measuring learning effects rather than stable individual differences. Whether the ability measured by this task is relevant to the ability to evaluate faces created by newer AI technology is a very difficult question to answer (because performance would be near the floor) and is out of our current scope.

We were most interested in comparing people rather than images and, thus, used a forced-choice task, which avoids concerns that complicate interpretation when using other tasks (response bias, distributional assumptions; Brady et al., 2023). In a forced-choice format, participants must decide which is the real face in pairs of images, instead of judging each face individually. We used structural equation modeling (SEM; Bollen, 2014) to compare o and f (Study 1) and o and g (Study 2) as predictors of performance on the AIFT. Existing constructs (f, o, g) were operationalized as the shared variance for a set of tasks, selected from a theoretically infinite possible set of tasks, but in practice based on prior research. These tasks are assumed to share their reliance on the construct of interest, but vary in ways that are orthogonal to their definition. The ability to judge whether a face is AI-generated or not was operationalized as a specific ability, that is, an ability closely related to a specific task (Schneider & McGrew, 2018)—in this case, the newly developed AIFT. To preview our results, we find evidence that o may be at just the right level in a hierarchical model of abilities to predict detection of AI faces, when compared with both more and less general abilities.

<h31 id="xge-155-3-629-d563e383">Transparency and Openness</h31>

The preregistration for Study 1 can be found on the Open Science Framework (OSF; McGugin et al., 2024; <a href="https://osf.io/ueqjc" target="_blank">https://osf.io/ueqjc</a>). Primary data, analysis scripts, and the AIFT 1.0 are available on the OSF (McGugin & Gauthier, 2025a; <a href="https://osf.io/82krc" target="_blank">https://osf.io/82krc</a>). The preregistration for Study 2 can be found on the OSF (McGugin & Gauthier, 2024; <a href="https://osf.io/q2z4f" target="_blank">https://osf.io/q2z4f</a>). Primary data, analysis scripts, and the AIFT 2.0 are available on the OSF (McGugin & Gauthier, 2025b; <a href="https://osf.io/csgf5" target="_blank">https://osf.io/csgf5</a>). There were minor deviations from the preregistration, which are discussed next.

<h31 id="xge-155-3-629-d563e412">Deviations From Preregistration</h31>

In the preregistration for Study 1, we aimed for N = 250, based on work showing that correlations tend to stabilize regardless of their magnitude at this sample size (Schönbrodt & Perugini, 2013). We included several attention checks and, in light of recent work (Stosic et al., 2024), had planned to report analyses at varying levels of excluding people who committed attention check errors. We later realized that it would be difficult to assign the blame for a difference in results if two analyses differed both in the sample size and the inclusion of participants who made careless responses. For this reason, we chose to continue data collection until we had 250 participants with no attention check error and to compare them to the first 250 participants we collected, regardless of attention check errors. This comparison is included in the Supplemental Materials. We had included basic information about one model to test the correlated factors model (which we call M1) and had mentioned we would compare several models without specifying them. In hindsight, there are not a lot of plausible models to test, with only one theoretically interesting alternative (which we call M2, with o as a higher level ability relative to f). We had not specified how we were going to deal with the weights and correlated errors for parcels in our models. In the results, we present how we assign these parameters based on fit (although they are tested only when they are theoretically reasonable). Finally, we originally presented the AIFT as an observed variable but decided that using parcels for this factor, as with the other factors, would produce regression estimates that were disattenuated for measurement error, which is desirable. We believe none of these are theoretically important changes and that they help provide better estimates of our effects.

Study 1


<h31 id="xge-155-3-629-d563e434">Method</h31>

<bold>Participants</bold>

We tested 291 participants between 18 and 45, fluent in English, living in the United States, and who had 95% task approval on the platform <a href="https://www.prolific.com/" target="_blank">https://www.prolific.com/</a>. We collected data until 250 participants performed without committing any attentional check errors. In the Supplemental Materials, we also report models for the first 250 participants tested, regardless of attentional check errors. The final sample included 145 participants reporting as women, 101 as men, and four with missing information (Mage = 32.66, SD = 7.14). The ethnicity composition was as follows: White (54.8%), Black (14.8%), mixed (12.8%), Asian (9.2%), other (6.4%), and unknown (2%). We asked the 250 participants who performed the task in Study 1 to redo the test 1 week later and collected data from the first 100 respondents (50 women, 50 men; Mage = 33.17, SD = 6.66) to calculate test–retest reliability.

<bold>Procedure</bold>

Participants completed five tests in a fixed order, lasting approximately 28 min in total. The tests were ordered as follows: AIFT 1.0, Novel Object Memory Test with Ziggerins (NOMT-Z), Cambridge Face Memory Test (CFMT), 3AFC-Matching Test with Asymmetric Greebles (3AFC-Match-AG), Vanderbilt Face Matching Test–1 Target (VFMT-1T), and AI Questionnaire (details below). Tests were implemented in jsPsych (Version 7.2; de Leeuw, 2015). A total of 12 attention check trials were included throughout the tests to ensure participant engagement. These trials assessed the same type of judgment as the surrounding trials but were intentionally simplified. For example, in face perception tasks, cartoon faces replaced human faces, while object tasks featured distractors with entirely different shapes and colors.

AIFT 1.0



We created the AIFT 1.0 (Figure 2A) to measure the ability to judge whether a face is real or AI-generated. We selected pairs of AI and real faces through pilot work and chose trials for which performance was correlated with the rest of the test. We also aimed to choose trials that spanned a range of difficulty. It is well known that people can spot an AI face by using watermark artifacts, distorted elements in the background, unrealistic color, or artificial smiles (Duffy et al., 2024). Some studies manually exclude images with artifacts (Duffy et al., 2024) or use newer generation algorithms where they are much less frequent (Miller et al., 2023). We did not manually screen for such artifacts and selected trials based on psychometric results for pairs of AI/real faces, as opposed to data for individual images as done in other work. Nonetheless, this process eliminated most AI images that contained obvious artifacts because either they were too easily detected or performance on these trials did not correlate with the rest of the test.
[Figure 2]

This test was designed with images from <a href="https://www.whichfaceisreal.com" target="_blank">https://www.whichfaceisreal.com</a>, which presents pairs of images: a real one from the Flickr-Faces-HQ Data Set and an artificially generated one from StyleGAN (Karras et al., 2020). The AIFT 1.0 had 75 test trials, with 39 (52%) showing the real face in the right position. In the test, participants viewed two faces side by side for a fixed time window (2 s for the first three trials, 1.5 s for all subsequent trials). After the faces disappeared, participants clicked to indicate whether the face on the left or the face on the right was the real face. Participants received three practice trials with feedback, followed by 75 test trials without feedback. The trials were roughly ordered from easy to hard. Percent correct was used to index performance.

Demographic attributes such as race, gender, and age were not available as ground-truth labels for the images we used—in fact, one could argue this has no meaning for StyleGAN faces. As a result, these attributes could only be inferred in coarse categories (e.g., White vs. non-White; man, woman, or unsure; child, adult, older) based on visual inspection by the corresponding author. Per this subjective analysis, the two faces paired on specific trials appeared to be the same in all three attributes (age, gender, race) on 13% of the trials, different in one attribute on 33% of the trials, different in two attributes on 45% of the trials, and different in all three attributes on 8% of the trials. Most faces appeared White, and the distribution was similar for real faces (77%) and AI-generated faces (76%). Most faces appeared to be women, more so for AI faces (AI, women: 53%, men: 37%, unsure: 9%) than for real faces (real, women: 65%, men: 32%, unsure: 3%). Overall, most faces on the test appeared to be adult, with real faces (adult: 56%, child: 27%, older adult: 17%) on average appearing older than AI-generated faces (adult: 81%, child: 13%, older adult: 5%). Importantly, the AIFT is not a test designed to ask what kind of information people use to spot AI-generated faces, nor is it meant to make claims regarding whether people can make accurate decisions in the presence or absence of certain cues. Given the way we constructed and selected trials, such inferences would be invalid. The AIFT is designed to compare people’s ability to judge which of two faces is AI-generated. For instance, age (including the skin texture differences in older faces) may be a diagnostic cue, but because all participants were tested on the same set of trials and in the same order, this cue, like any other cue, was available to all. If a participant hypothesized that age is a diagnostic feature and decided how to use it, the absence of feedback on the test means that the only support for this hypothesis would come from one’s sense of confidence about their decisions.

As noted in the Participants section, 100 of the 250 participants repeated the test 1 week later. The AIFT 1.0 shows high test–retest stability (r = .70, Figure 2D; r = .88 when corrected for the internal consistency of the two measures) and high internal consistency (λ2 = .80).
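The correction for attenuation follows the standard formula, dividing the observed test–retest correlation by the geometric mean of the two sessions' reliabilities. A minimal sketch is below; for illustration it assumes both sessions have the internal consistency reported for session 1 (the second session's reliability is not stated here).

```python
# Sketch: disattenuating the test-retest correlation for measurement error,
# r_corrected = r / sqrt(rel_1 * rel_2). Both reliabilities are assumed to be
# the lambda2 = .80 reported for session 1.
def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    return r_xy / (rel_x * rel_y) ** 0.5

print(disattenuate(0.70, 0.80, 0.80))  # 0.875, close to the reported .88
```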

NOMT-Z



In the NOMT with novel objects called “Ziggerins” (Richler et al., 2019; Smithson, Chow, Chang, & Gauthier, 2024), participants first studied six unique Ziggerins (Figure 2B) for 20 s. They then completed two 24-trial blocks (48 trials total) in which they made an unspeeded choice regarding which of three objects presented together was one of the six studied exemplars. In the first block, the objects were shown in the same viewpoint as during study, while they were in a new viewpoint for the second block. Percent correct was used to index performance.

CFMT Extended Version



In the extended version of CFMT (CFMT+; Russell et al., 2009), participants first studied frontal views of unfamiliar White male faces (Figure 2C), followed by introductory learning trials. Participants were then given forced-choice test displays containing one target face and two distractor faces, where they were instructed to select the face that matched one of the original target faces. In four sections of the test, matching faces varied from their original presentation in their lighting, pose, and/or degree of visual noise. The CFMT+ consisted of 102 total trials. Percent correct was used to index performance.

3AFC-Match-AG



The 3AFC-Match-AG (Smithson, Chow, & Gauthier, 2024) uses a category of novel objects called “Asymmetric Greebles” (Figure 2B). Participants were shown a target object briefly (300–1,000 ms, depending on the trial), which was followed by a visual mask (500 ms), and asked to identify the target from an array of three objects. Feedback was provided for the first three trials, followed by 48 test trials. Percent correct was used to index performance.

VFMT-1T



The VFMT-1T was adapted from the VFMT (Sunday, Lee, & Gauthier, 2018), which used two study target faces on each trial. It used male and female White faces (Figure 2C). Participants were presented with a target face for a short duration (2,000 ms for Trial 1, 1,500 ms for Trial 2, 1,000 ms for Trials 3–14, then 300 ms for Trials 15–end) and asked to identify the target face in a three-alternative forced-choice display. The matching face was a different image of the same person, varying from its original presentation in position, perspective, and similarity to target. Images were manually edited to blur background and hair. The last 12 trials included visual noise overlaid on the three choices to increase difficulty. Feedback was provided for the first three trials, followed by 61 test trials. Percent correct was used to index performance.

AI Questionnaire



An AI experience questionnaire was used to measure “digital media literacy”: the ability to critically evaluate and interpret media content, including distinguishing between authentic and manipulated images. It included seven questions: (1) “How familiar are you with artificial intelligence (e.g., virtual assistants like Siri, AI-powered software like Photoshop AI tools, natural language assistants like ChatGPT, or image generation tools like Midjourney)?” (from 1 = very unfamiliar to 7 = very familiar); (2) “How often do you interact with AI in your daily life?” (from 1 = never to 6 = multiple times a day); (3) “How often do you use AI tools to generate or modify images?” (from 1 = never to 6 = multiple times a day); (4) “How many years have you been using AI tools to generate or modify images (enter 0 for never)?”; (5) “Do you feel confident in your ability to distinguish AI-generated content from human-created content?” (from 1 = not confident to 7 = very confident); (6) “How specifically can you describe important indicators of AI-generated images?” (from 1 = I don’t know any to 7 = I can describe them); and (7) “How familiar are you with the underlying techniques used to generate AI-generated images?” (from 1 = very unfamiliar to 7 = very familiar). Responses to all questions were z-scored and averaged into a single index.
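A minimal sketch of how such an index can be computed, assuming responses are stored one column per question (the column names q1–q7 are hypothetical).

```python
# Sketch: building the AI experience index by z-scoring each of the seven
# questionnaire items across participants and averaging them.
import pandas as pd

def ai_experience_index(responses: pd.DataFrame) -> pd.Series:
    items = [f"q{i}" for i in range(1, 8)]  # hypothetical column names q1..q7
    z = (responses[items] - responses[items].mean()) / responses[items].std(ddof=1)
    return z.mean(axis=1)
```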

<bold>Analyses</bold>

With SEM, latent variables are inferred from scores on several tasks (or indicators). Trials for a given task can also be divided into parcels (instead of using the total score for each test). The use of parcels can be likened to grouping trials in different ways in the computation of split-half reliability. It can limit the influence of measurement error and provide more valid estimates of relations between constructs (Little et al., 2013). How the parcels are constructed is less important than the fact that they are used, especially when the sample size and number of trials per parcel are large (Little et al., 2013; Sterba & MacCallum, 2010). Our decision to parcel tasks and how we created parcels was made before data were collected because we suspected that more indicators could be required to ensure model identification.

In our models (Figure 2E and 2F), each latent variable was specified by four parcels of trials. Our preregistered analyses included a correlated model (M1; see Figure 2E) with three factors: There were four indicators of an f factor (the CFMT+ and the VFMT-1T, each split into two parcels; details below) and four indicators of an o factor (the NOMT-Z and the 3AFC-Match-AG, each split into two parcels). The ability to judge AI faces was indicated by a single task, the AIFT. The AIFT was broken down into four parcels of trials, allowing us to estimate it as a latent variable and take into account measurement error in estimating relationships with this factor (see Chang et al., 2024; Kaltwasser et al., 2014, for a similar approach).
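To make the model structure concrete, here is a sketch of M1 in lavaan-style syntax using the Python package semopy; it includes the correlated errors between parcels of the same task discussed in the next paragraphs. The parcel column names are hypothetical, and the published analyses used the scripts shared on the OSF, not this code.

```python
# Sketch of the correlated-factors model M1 (f, o, and the AIFT factor) in
# lavaan-like syntax via semopy. Parcel names are hypothetical.
import pandas as pd
import semopy

M1_DESC = """
f  =~ cfmt_1 + cfmt_2 + vfmt_1 + vfmt_2
o  =~ nomt_1 + nomt_2 + match_1 + match_2
ai =~ aift_1 + aift_2 + aift_3 + aift_4
ai ~ f + o
f ~~ o
cfmt_1 ~~ cfmt_2
vfmt_1 ~~ vfmt_2
nomt_1 ~~ nomt_2
match_1 ~~ match_2
"""

def fit_m1(parcels: pd.DataFrame):
    model = semopy.Model(M1_DESC)
    model.fit(parcels)
    # Parameter estimates and global fit indices (CFI, RMSEA, etc.)
    return model.inspect(), semopy.calc_stats(model)
```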

It is customary to begin SEM analyses with a model in which different factors are allowed to correlate, as a point of comparison for other models (Dunn & McCray, 2020). M1 is also the model we preregistered. While we can frame o and f as part of a hierarchical model of abilities, few studies have explicitly studied these abilities in a hierarchical framework. We also fit the data to a second model (M2; Figure 2F), which takes a more theoretical approach to account for the correlation between f and o. Instead of allowing the two factors to correlate, in this case, o is defined as a domain-general ability that causes some of the variance in all face and object recognition tasks, whereas f is defined as an ability that causes the variance unique to face recognition tasks (Meyer et al., 2021). The goal of this work is not to provide a test between these two models. Indeed, a strong test would have benefitted from the inclusion of other domain-specific abilities in addition to face recognition. Here, we wanted to ask how well f and o predicted AI face detection ability, whether f and o were two different but correlated abilities, or whether they were part of a hierarchy that explains their shared variance based on o being more general.

When fitting models, we used a nested model comparison to test whether correlated errors between the different parcels of the same task, for the face and object recognition tasks, improved the model fit. These correlated errors are theoretically meaningful as they represent the variance specific to a task, as opposed to that shared with the other indicators of the same factor. If the correlated errors improved the fit, they were included. We also tested whether allowing the weights of the different parcels of the same task (i.e., the two parcels for each of the two object recognition and face recognition tasks, and the four parcels of the AIFT) to vary improved the model fit. The simplest assumption is that these weights should not differ, and so they were set to be equal unless allowing them to vary improved the fit. This was also tested using a nested model comparison. Because these two aspects of the model could interact, we also tested all four combinations (with correlated errors and equal parcel weights; with correlated errors and no equal parcel weights; no correlated errors and equal parcel weights; no correlated errors and no equal parcel weights) and selected the best model based on global model fit and interpretability of local parameters.
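Each nested comparison reduces to a chi-square difference test on the difference in degrees of freedom. A small generic sketch follows; the worked example reproduces the p value reported in the Results for the M1 equality-constraint test.

```python
# Sketch: chi-square difference test between two nested SEMs (the more
# constrained model has the larger chi-square and more degrees of freedom).
from scipy.stats import chi2

def nested_chi_square(chisq_constrained: float, df_constrained: int,
                      chisq_free: float, df_free: int):
    delta = chisq_constrained - chisq_free
    ddf = df_constrained - df_free
    return delta, ddf, chi2.sf(delta, ddf)

# The M1 equality-constraint test reported in the Results corresponds to a
# difference of 13.50 on 7 df:
print(chi2.sf(13.50, 7))  # roughly .06
```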

Because SEM is not familiar to many who study high-level vision, we also provide zero-order correlations between composite scores and a multiple regression in which f and o are used as predictors of AIFT scores. These analyses are more familiar to most readers and provide an accessible way to understand the basic patterns in the data. Zero-order correlations provide a transparent view of the bivariate relationships between tasks, though they cannot disentangle shared variance between predictors. Multiple regression helps address this limitation by examining the unique contribution of face and object recognition abilities in predicting AI face detection, but unlike SEM, it treats observed scores as if they were measured without error. The agreement between these complementary approaches would suggest that the relationships we observe are not artifacts of any particular analytical approach.

<h31 id="xge-155-3-629-d563e657">Results</h31>

Table 1 provides summary statistics for all tasks and aggregates for Study 1. While parceling alters the granularity of reliability estimates in the SEMs, the task-level reliability reflects the overall measurement quality relevant to correlations, regressions, and SEMs. Importantly, the quality of individual differences measurement depends on systematic variance across participants rather than on mean performance levels. Items with below-chance performance can still contribute meaningfully to individual differences if they correlate with the rest of the scale, as they provide discriminative information for higher-ability participants.
[Table 1]

<bold>Correlations and Multiple Regression</bold>

Besides face and object recognition, the only other predictors we considered were age, gender (we coded as 1 for women and 0 for not women, so as to include all participants given there were only four others<anchor name="b-fn2"></anchor><sups>2</sups>), and digital media literacy. We assessed the correlation between these predictors and performance on the AIFT using Bayesian tests (using JASP 0.19, with the default uninformative stretched beta priors of 1.0). We interpret Bayes factors (BF10) as evidence for or against the alternative hypothesis following Wagenmakers et al. (2018), where BF10 &gt; 100 indicates extreme evidence, 30–100 very strong evidence, 10–30 strong evidence, and 3–10 moderate evidence for the alternative hypothesis; values between 1/3 and 3 are considered inconclusive; and BF10 &lt; 1/3 indicates moderate, &lt;1/10 strong, and &lt;1/30 very strong evidence for the null. We found moderate evidence in favor of no correlation of the AIFT with age (r = −.094, 95% credible interval or CI [−0.215, 0.030], BF10 = 0.239), strong evidence in favor of no correlation with experience with AI (r = −.026, 95% CI [−0.149, 0.098], BF10 = 0.086), and inconclusive evidence with gender (r = .132, 95% CI [0.007, 0.252], BF10 = 0.679). We did not include these covariates in the models.
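For readers who want to run comparable default Bayesian correlation tests outside JASP, the pingouin package offers one implementation; a sketch with placeholder data is below. Pingouin's default prior (kappa = 1 in the Ly et al. method it implements) should correspond approximately to JASP's stretched beta prior of width 1.0, though Bayes factors can differ slightly across packages.

```python
# Sketch: default Bayesian Pearson correlation (BF10) with pingouin.
import numpy as np
import pingouin as pg

rng = np.random.default_rng(0)
aift = rng.normal(size=250)  # placeholder AIFT scores, not real data
age = rng.normal(size=250)   # placeholder ages

print(pg.corr(aift, age, method="pearson")[["r", "CI95%", "BF10"]])

# A Bayes factor can also be computed directly from r and n, e.g., for the
# age correlation reported above (r = -.094, n = 250):
print(pg.bayesfactor_pearson(r=-0.094, n=250))  # should be near the reported 0.239
```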

We inspected the confidence question further because of prior work suggesting a negative relationship between confidence and accuracy (Miller et al., 2023). In our study, the evidence instead provided moderate support for no correlation between confidence and AIFT scores (r = .08, 95% CI [−0.04, 0.19], BF10 = 0.184). With our crude measure of confidence based on a single question before the AIFT, we did not replicate the finding that people who make the most errors were the most confident (Miller et al., 2023; Miller’s confidence results were based on ratings given on a trial-by-trial basis).

The two face recognition tasks were correlated (r = .54, 95% CI [0.44, 0.62], BF10 = 2.33 × 10<sups>+17</sups>). For use in a multiple regression analysis, they were z-transformed and averaged into an f index. The two object recognition tasks were also correlated<anchor name="b-fn3"></anchor><sups>3</sups> (r = .29, 95% CI [0.17, 0.39], BF10 = 2,573) and were z-transformed and averaged into an o index. Age was not reliably correlated with either o (r = –.009, 95% CI [−0.123, 0.106], BF10 = 0.074) or f (r = .078, 95% CI [−0.038, 0.190], BF10 = 0.175), with moderate support for the null in both cases. While o showed no relationship with gender (r = .034, 95% CI [−0.081, 0.148], BF10 = 0.087), f was weakly associated with gender (r = .170, 95% CI [0.055, 0.278], BF10 = 4.82), with women performing slightly better than men.

Both f and o were positively correlated with AIFT 1.0 scores (f: r = .20, 95% CI [0.08, 0.32], BF10 = 15.05; o: r = .22, 95% CI [0.10, 0.33], BF10 = 35.66). In addition, f and o were positively correlated with each other (r = .42, 95% CI [0.32, 0.52], BF10 = 1.76 × 10<sups>+9</sups>), which is consistent with the theoretical definition of o as domain general. Compared to a null model (using a default weakly informative Jeffreys–Zellner–Siow prior of .354), the model with f and o predictors was most strongly supported (BF10 = 59.71, R<sups>2</sups> = .064). Across all models, f had a BFinclusion of 2.75, whereas o was a stronger predictor with a BFinclusion of 7.2.

<bold>SEM</bold>

We used SEM to account for measurement error and construct-irrelevant variance, the latter referring to variance unique to a given measure rather than shared across measures of the same construct. We started with the most simple model (M1; Figure 2E), in which f and o are inferred by their effect on the face and object recognition tasks, respectively, and allowed to correlate. We used both factors as predictors of AI faces (the narrow ability related to performance on AIFT trials). We also fit the data to a more theoretical model in which o is a broader ability that encompasses the narrower domain-specific f (M2; Figure 2F).

For model M1 (Figure 3 left), the addition of correlated errors between parcels of the same task improved the fit, nested χ²(4) = 269.51, p &lt; .0001. When equality constraints for the weights of different parcels of the same task were added to the model, they did not significantly affect the fit, nested χ²(7) = 13.50, p = .061. However, the version without these equality constraints had a better global fit (root-mean-square error of approximation [RMSEA] of .039 vs. .045; standardized root-mean-square residual [SRMR] of .042 vs. .052; comparative fit index [CFI] of .987 vs. .975), so we selected this version of the model. Overall, model M1, with correlated errors and no constraints on parcel weights, fit the data well, as indicated by a significant but acceptable chi-square statistic, χ²(47) = 67.52, p = .026; a high CFI = 0.987; a low RMSEA = 0.039 (90% CI [0.014, 0.058]); and a small SRMR = 0.042. All indicators for all three factors had significant loadings. The correlation between o and f was high and significant (r = .81, p &lt; .001). In this model, o was the only significant predictor of AI faces (standardized coefficient β = .49, p = .029).
[Figure 3]

When fitting model M2 (Figure 3 right), adding correlated errors when parcel weights were not equal led to an inadmissible solution with some negative variances (Heywood cases, with variances for CFMT1 and CFMT2 of −0.002 and −0.004, respectively). With equality constraints on parcel weights added, the fit did not significantly improve, nested χ<sups>2</sups>(9) = 13.71, p = .133, suggesting the correlated errors should be kept. When comparing models with and without correlated errors, adding them improved the fit significantly so they were included, nested χ<sups>2</sups>(4) = 208.66, p &lt; 2.2e−16. Overall, M2 with correlated errors and equal parcel weights fit the data well, as indicated by a significant but acceptable chi-square statistic, χ<sups>2</sups>(53) = 81.02, p = .008; a high CFI = 0.974; a low RMSEA = 0.046 (90% CI [0.024, 0.065]); and a small SRMR = 0.052. In this model, all indicators for o and for AI faces had significant loadings, but none of the indicators of f were significant. In other words, the face recognition tasks had significant loadings on o, of about the same magnitude as those from object recognition tasks (loadings between .42 and .55), and no additional significant shared variance that was not already modeled by task-specific variance.

In this context, it is not surprising that o was the only significant predictor of AI faces (standardized coefficient β = .37, p = .002) as there is essentially no evidence for f as a factor. It is likely that the fit for M2 could be further improved by dropping f as a factor, but given our present goals, it is sufficient that both models conclude that o is a significant predictor of AI faces.

<h31 id="xge-155-3-629-d563e901">Discussion</h31>

Zero-order correlations show that performance in both face and object recognition tasks correlates with performance on the AIFT. But these two abilities are correlated with each other, and when both are included in a multiple regression, o receives twice the support as a predictor of AIFT compared to f.

In our preregistration, we planned SEM analyses that would allow us to define constructs at a latent level and account for measurement error (Bollen, 2014). Using this approach, we first tested a relatively atheoretical model with o and f as distinct but correlated abilities. This model (M1) supports o, but not f, as a predictor of AI faces, once the substantial shared variance is controlled for. We also tested a second model (M2), which accounts for the substantial correlation between o and f by treating o as a domain-general ability that contributes to performance across all tasks, both face and object recognition. In this hierarchical framework, f represents a face-specific ability that explains additional variance unique to face recognition tasks beyond what is accounted for by the general object recognition ability (o). In this case, we simply find no evidence for f, but o was again a significant predictor of AI faces.

An astute reader will note that based on zero-order correlations, our two face tests were more strongly correlated (r = .54) than our two object tests (r = .29) and may question the selection of these tasks to operationalize each construct. However, the relative magnitude of these correlations is as expected based on prior empirical work (e.g., Smithson et al., 2022; Sunday, Donnelly, & Gauthier, 2018; Verhallen et al., 2017). Simply put, our two face tests differ in task format but share a stimulus category, while the two object tests differ in both task format and task category. A domain-general ability like o plausibly requires indicators that are farther apart (less strongly correlated) in multivariate space (Little et al., 1999) because it includes more territory in terms of processing demands and content domains. We estimated both o and f with the same kinds of tests: one memory task and one matching task. Even though the AIFT overlaps more in terms of stimuli with indicators of f than with indicators of o, the latter factor was the only predictor of AI faces. This provides a strong validation for o as a useful predictor across tasks and domains.

Nonetheless, the lack of evidence for unique f variance in M2 should be interpreted cautiously. Other work has suggested that a latent variable for f is distinct from latent variables for nonface tasks (e.g., Ćepulić et al., 2018), although that work usually does not directly measure o and/or used novel objects. It is frequent to compare face recognition to object recognition where the latter is represented by objects from a single category (Duchaine & Nakayama, 2006; Meyer et al., 2021; Shakeshaft & Plomin, 2015). Some nonface objects, like cars, are relatively poor representatives of most other categories (Ćepulić et al., 2018; Sunday et al., 2019), in which case measurements from more familiar categories are required to converge on o (Richler et al., 2017).

Finally, even with a general ability like o, we could account for only 23% of the nonmeasurement error variance in the AIFT (14% in M2). Other factors likely contribute to why some people are consistently better at detecting AI faces. In Study 2, we sought to replicate the finding that o predicts this skill, turning to a comparison with general intelligence.

Study 2


<h31 id="xge-155-3-629-d563e1010">Method</h31>

<bold>Participants</bold>

We recruited online from Prolific.com, obtaining data for 291 participants between 18 and 45, fluent in English, living in the United States, and who had 95% approval on the platform, until we had 250 who failed none of the attention checks. The final sample included 137 women, 109 men, and four who did not report gender. Mage was 31.8 (SD = 7.07). The ethnicity composition was as follows: White (62.4%), Black (16.8%), mixed (9.2%), Asian (6.8%), other (0.4%), and unknown (0.08%). We also collected data in a follow-up study with only the AIFT using an unlimited presentation time. We collected data from 51 participants using the same criteria as above. This sample included 26 women, 22 men, and three unknown. Ethnicity was White (62.7%), Black (21.6%), mixed (2.0%), Asian (5.9%), and unknown (7.8%), and Mage = 31.3, SD = 6.89.

<bold>Procedure</bold>

Participants completed six tests in a fixed order, lasting approximately 36 min in total. The tests were ordered as follows: AIFT 2.0, Grammatical Reasoning Test (speed), NOMT (o), Paper Folding (spatial), 3AFC-Match-AG (o), and Vocabulary Test (crystallized). Tests were implemented in jsPsych (Version 7.2; de Leeuw, 2015). A total of 11 attention checks were included throughout the tasks, using the same format as in Study 1. We define g as the variance that would be shared between o and other broad abilities, including processing speed, spatial reasoning and crystallized intelligence (Bates & Gignac, 2022).

AIFT 2.0



We created a new version of the AIFT (Version 2.0) to reduce test duration while preserving reliability. We dropped 15 trials and kept what we considered to be the best 60 trials. Specifically, we dropped the most difficult trials (accuracy &lt; 50%) from the first test session in Study 1, unless they had good item-total (aka item-rest) correlation (&gt;.15) and/or a correlation with o &gt; 5%. Reliability remained high for these 60 trials, using the retest data in Study 1 (λ2 = .80). In Study 2, participants completed the AIFT 2.0 with the same design and instructions as the AIFT 1.0. No feedback was provided, and three attention checks were included.
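A minimal sketch of the item-rest correlation used in this kind of screening, assuming a participants × trials matrix of 0/1 accuracies (variable names are hypothetical).

```python
# Sketch: item-rest (corrected item-total) correlations for screening trials.
# `accuracy` is assumed to be a participants x trials DataFrame of 0/1 scores.
import pandas as pd

def item_rest_correlations(accuracy: pd.DataFrame) -> pd.Series:
    total = accuracy.sum(axis=1)
    # Correlate each trial with the sum of all *other* trials.
    return pd.Series(
        {trial: accuracy[trial].corr(total - accuracy[trial]) for trial in accuracy.columns}
    )
```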

Grammatical Reasoning Test



The Grammatical Reasoning Test was adapted from prior work (Baddeley, 1968). Participants were presented with a statement followed by a pair of letters (either AB or BA; Figure 4B). Participants were asked to judge whether the letter pair was a correct representation of the statement by clicking “true” or “false.” Eight sentences that correctly described the pair AB but incorrectly described the pair BA were “A precedes B,” “B does not precede A,” “B is preceded by A,” “A is not preceded by B,” “B follows A,” “A does not follow B,” “A is followed by B,” and “B is not followed by A.” If the order of A and B was swapped in these sentences, they would correctly describe the pair BA, but incorrectly describe the pair AB. Participants were first shown two example trials with an explanation of why the correct answer was correct. This was followed by two practice trials with feedback. Participants were given 90 s to complete as many of the 32 test trials as possible. Performance is the number of correctly completed trials in 90 s.
[Figure 4]

NOMT-Z



The NOMT-Z was included as in Study 1.

Paper Folding Test



In the Paper Folding Test (Ekstrom et al., 1976), participants viewed diagrams of a piece of paper undergoing a series of folds before a hole is punched through it. They then identified the correct spatial arrangement of hole punches if the folded piece of paper were to be unfolded (Figure 4A). Participants were given five possible answer choices and 20 total trials. The diagram and the answer choices were simultaneously displayed until participant response, and there was unlimited time to respond. Performance was indexed with percent correct.

3AFC-Match-AG



The 3AFC-Match-AG task was included as in Study 1.

Vocabulary Test



Participants completed a vocabulary test based on a selection of the stimuli from prior work (Warrington et al., 1998), where participants had to select a synonym for a target word (Figure 4C). We added a third option to reduce the impact of guessing. The test consisted of 30 trials total and was not speeded. Three very easy trials (e.g., FAST/quick, carrot, potato) were included as attention checks. Performance was indexed with percent correct.

<h31 id="xge-155-3-629-d563e1105">Results</h31>

<bold>Correlations and Multiple Regression</bold>

All measures in Study 2 had acceptable reliability &gt;.7 (Table 2). As in Study 1, the two object recognition tasks were positively correlated (Figure 4; r = .36, 95% CI [0.24, 0.46], BF10 = 1.64 × 10<sups>+6</sups>). They were z-transformed and averaged into an o index for use in a multiple regression. The Vocabulary, Baddeley Grammatical, and Paper Folding tasks were positively correlated (Figure 4; rs = .23–.29, BF10 = 68–4,385) and were first averaged into a g index (without o tasks) for the purpose of zero-order correlations and multiple regression.
[Table 2]

For all three variables (o, g, and AIFT 2.0 scores), we found evidence for a null correlation with age and gender (age, rs = –.05 to .08, BF10 = .08–.17; gender, rs = .00–.05, BF10 = .08–.11). The g and o indices were positively correlated (r = .43, 95% CI [0.34, 0.53], BF10 = 9.35 × 10<sups>+9</sups>), and while we found support for o being positively correlated with AIFT 2.0 scores (r = .20, 95% CI [0.08, 0.32], BF10 = 13.72), we found support for no correlation between g and AIFT 2.0 (r = .08, 95% CI [−0.05, 0.20], BF10 = 0.17).

In a multiple regression analysis, compared to a null model (default Jeffreys–Zellner–Siow prior of .354), the model with only o as a predictor was most strongly supported (BF10 = 19.60, R<sups>2</sups> = .041). Across all models, o had a BFinclusion of 11.8, whereas g had a BFinclusion of 0.36, suggesting almost 32 times more support for o as a predictor of AIFT relative to g. But this interpretation is limited because, as expected from the definition of g as the apex of the hierarchy of broad abilities, including o (Richler et al., 2019; Smithson & Gauthier, 2026), the two o tasks also correlated with many g tasks. In addition, the Paper Folding task, while correlated with the other g tasks, was more strongly correlated with o tasks.

<bold>SEM</bold>

We analyzed Study 2 data based on our preregistered model (M3), which was analogous to M2 from Study 1, except that in this case, the more general factor was g and o was more specific, influencing only the NOMT and 3AFC-Matching tasks. The extensive literature on g (see Deary, 2012; Spearman, 1927) situates it firmly as broader than o, and we therefore saw no reason to include a correlated model. As in Study 1, tasks with a larger number of trials (NOMT-Z, 3AFC-Match-AG, AIFT 2.0) were parceled, and we attempted to fit model M3 with and without correlated method variance and equality constraints on parcel weights, as in Study 1. All of these models led to inadmissible solutions with some negative variances. Based on our inspection of zero-order correlations, we suspected that the Paper Folding task caused these problems, as it was correlated with all tasks, including object recognition tasks. We therefore modified M3 by allowing the Paper Folding task to load on both o and g (M3′; Figure 5). We come back to this decision in the Discussion section.

We compared versions with and without correlated method variance and equality constraints on parcel weights. The addition of correlated errors between parcels of the same task improved the fit, nested χ²(2) = 26.11, p &lt; .0001. When equality constraints for the weights of different parcels of the same task were added to the model, they did not significantly affect the fit, nested χ²(7) = 9.25, p = .24. However, the model without these equality constraints had a better global fit (RMSEA of .025 vs. .027; SRMR of .043 vs. .047; CFI of .992 vs. .989), so we selected this model.

Overall, this model (M3′; Figure 5), with correlated errors and no constraints on parcel weights, fit the data well, as indicated by a nonsignificant chi-square statistic, χ²(35) = 40.44, p = .242; a high CFI = 0.992; a low RMSEA = 0.025 (90% CI [0.000, 0.054]); and a small SRMR = 0.043. All indicators for all three factors had significant loadings except one (3AFC-Match-2 on g, p = .08). In this model, paper folding had significant loadings on both g (.58, p &lt; .001) and o (.48, p &lt; .001). As in Study 1, o was the only significant predictor of AI faces (standardized coefficient β = .43, p = .001), with 18.5% of the variance in AIFT accounted for.
[Figure 5]

<bold>Qualitative Item Analyses</bold>

Humans can use several attributes to distinguish AI from real faces (e.g., symmetry, smoothness of skin, attractiveness), although they often fail to use these cues in a valid manner (e.g., judging less distinctive faces as more human, when the reverse tends to be the case; Miller et al., 2023). Our work was not designed to reveal what features were useful to participants because we used a measurement approach prioritizing the interpretation of individual differences. First, since we were interested in elucidating what drove participants’ ability (rather than the image properties they used), and also because we wanted to avoid response bias (Brady et al., 2023), the AIFT required forced choice for consistent pairings of AI and real faces. As a result, performance on any given trial may be driven by the fact that the AI face looked particularly artificial, the human face looked particularly real, or by a combination of the two. Second, the trials on the AIFT were in the same order for all participants, roughly increasing in difficulty throughout the test, based on pilot results. This is to encourage participants to try their best in this difficult task and to ensure that individual differences are not due to order effects (Goodhew & Edwards, 2019; Richler et al., 2019). However, this limits the ability to interpret relative performance across trials.

Although our task is not ideal for item analyses, we inspected item results and the images across trials to explore possible reasons why specific AIFT trial judgments may be most related to o. While some difficult trials were simply at or near chance, other trials included face pairs for which the AI face was consistently judged to be more likely real than the real face, as in Miller et al. (2023; see Figure 6A for examples). In both studies, we could only account for about 20% of the variance in AI faces. The signal is noisy in individual trials, but whether a trial is related to o is consistent across Studies 1 and 2 (r = .376, 95% CI [0.13, 0.57], BF10 = 11.49; Figure 6C). We found moderate evidence that item difficulty was not associated with how well a particular trial was related to o (r = −.13, 95% CI [−0.36, 0.123], BF10 = .26; see Figure 6D).
[Figure 6]

<bold>Follow-Up Study</bold>

An interesting question is how the ability measured by the AIFT is impacted by the limited presentation time used in Studies 1 and 2 (1,500 ms). Under time pressure, participants could use a different strategy than they would have with no time limit. We addressed this in a follow-up study, collecting data in the AIFT 2.0 with no limit on presentation time (N = 51; see the Participants section for details). In this version, the face pair and response options were shown simultaneously and remained visible until participants responded. Participants were asked to respond fast and accurately and to guess as needed (none missed an attention check under these conditions). We compared their performance with that of the 250 participants with no attention errors in Study 2. Mean performance with unlimited presentation was higher (M = 73%, SD = 13.7) than that for limited presentation time (M = 64.6%, SD = 9.9; t = 3.81, p &lt; .001). Response times in the two studies are difficult to compare because responses in the unlimited condition contain both viewing time and decision time (mean of median responses over trials = 3,180 ms, SD = 1,262), whereas those in Study 2 (mean of median responses over trials = 976 ms, SD = 289) did not reflect the image viewing time. To allow some comparison of the response times, a gross adjustment was made by adding 1,500 ms to response times in Study 2. After this correction, response times in the follow-up study were significantly longer than those in Study 2, by 482 ms (SD = 374, t = 8.175, p &lt; .001).

Does this additional time change the ability measured by the AIFT? Several results suggest that it does not. We found extreme support for the idea that trial difficulty was highly correlated in the two conditions (r = .91, 95% CI [0.85, 0.95], BF10 = 9.632 × 10<sups>+20</sups>; Figure 6E). The difficulty of a given trial in Study 2, with limited encoding time, was not correlated—with moderate support—with the improvement in accuracy for the same trial in the follow-up study, with unlimited time (r = −.14, 95% CI [−0.38, 0.11], BF10 = .29). Remember that trials were roughly ordered by difficulty on the test, and the correlation between trial number and performance was not affected by presentation time (Study 2: r = −.75, 95% CI [−0.84, −0.61], BF10 = 2.635 × 10<sups>+9</sups>; follow-up: r = −.71, 95% CI [−0.81, −0.54], BF10 = 5.496 × 10<sups>+7</sups>). However, we found moderate evidence for the claim that improvement in accuracy, from limited to unlimited presentations, was not correlated with trial number (r = .06, 95% CI [−0.20, 0.30], BF10 = .18). In addition, there was inconclusive evidence suggesting that the additional time spent on a trial under unlimited presentation was correlated with the improvement in performance on that trial (r = .17, 95% CI [−0.09, 0.40], BF10 = 0.37). These results suggest that using 1,500 ms was a good compromise for the AIFT. It eliminates variation in viewing time and results in faster testing but does not seem to qualitatively change what is measured by the test. The 1,500-ms presentation time imposed enough pressure on participants to avoid ceiling effects even for those with the best ability.

<bold>What Cues Do People Use, and Does It Depend on o?</bold>

Miller et al. (2023) reported that AI-generated faces are often perceived as more human than real ones, a phenomenon they called AI hyperrealism. Participants rely on cues like proportionality, familiarity, and the smoothness of the skin but often interpret these cues in the wrong direction. This reflects poor metacognitive insight: People misjudge with high confidence. Another limitation of human ratings is that they are time-intensive to collect. In our study, where participants compared two faces on each trial, attributing judgments to specific cues may be even harder for participants. We therefore turned to an alternative strategy. Beyond psychological impressions, we explored whether low-level visual features—such as artifacts or background context—might serve as additional cues. We hypothesized that large language models (LLMs) could make these judgments, using cues that partially overlap with but also diverge from human strategies.

To generate a list of additional cues, we gave an LLM (ChatGPT-4o) the 60 face pairs from AIFT 2.0, using Prompt 1:<blockquote>You will see two images of faces side by side. One is a real person, the other one is AI. Guess which one is real (left or right), give your confidence on a scale of 1–100%, and tell me what cues you are using.</blockquote>We then compared the LLM’s responses to average human performance on the same trials (from Study 1 and Study 2, as the two data sets were highly correlated in mean trial accuracy; r = .97, 95% CI [0.94, 0.98], BF10 = 6.449 × 10<sups>+31</sups>).
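As an illustration of how a face pair could be submitted to a multimodal model with Prompt 1, the sketch below uses the OpenAI Python SDK; the model string, the composite-image file name, and the helper function are placeholders and assumptions for illustration, not the authors' materials or pipeline.

import base64
from openai import OpenAI

client = OpenAI()  # assumes an API key is available in the environment

PROMPT_1 = ("You will see two images of faces side by side. One is a real "
            "person, the other one is AI. Guess which one is real (left or "
            "right), give your confidence on a scale of 1-100%, and tell me "
            "what cues you are using.")

def encode(path):
    # Encode a local image as a data URL for the chat completions API
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT_1},
            {"type": "image_url", "image_url": {"url": encode("pair_01.jpg")}},
        ],
    }],
)
print(response.choices[0].message.content)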

The LLM made only four errors (93% accuracy) on trials that were also difficult for humans (average accuracy for the 500 human participants: 42%, 45%, 58%, and 62%). Given the small number of errors, there was no reliable correlation between the LLM’s trial-by-trial accuracy and that of humans (r = .22, 95% CI [−0.04, 0.44], BF10 = 0.650). Its confidence ratings were negatively correlated with human accuracy (r = –.36, 95% CI [−0.55, −0.10]), suggesting that the model does this differently than the average human participant. Nonetheless, the cues themselves, and their correlation with human performance, can be used to study how high-o and low-o individuals differ in their decisions.

From the explanations the LLM provided, we synthesized a set of nine cues that either a human or an LLM could plausibly use to judge facial realism: (1) skin texture, (2) lighting and shading, (3) anatomical proportions, (4) facial expression, (5) facial features (eyes, mouth, and teeth), (6) hair, (7) background, (8) clothing and accessories, and (9) reflections or distortions. Cues 1 through 5 were consistent with those reported in Miller et al. (2023), albeit under slightly different labels (e.g., Miller’s “alive in the eyes” aligns with LLM’s references to eye, teeth, and mouth detail). Cues 6 through 9 were novel additions from the LLM responses. We did not ask the LLM to judge more subjective features included by Miller et al. like memorability or familiarity.

Next, we asked the LLM to rate each face in the 60 AIFT 2.0 trials on these nine cues using Prompt 2, which instructed the LLM to evaluate the faces on each cue, assigning each a score between 0 and 1 based on how strongly the cue suggested that the face was real (1) or AI-generated (0). Prompt 2 is included, along with the code for these cue analyses, in the Study 2 materials at the public link. Prompt 2 included information about how each cue can be helpful, based on a summary of the descriptions the LLM provided when answering Prompt 1. For instance, for hair, we had “Suggesting AI: Blending artifacts near the hairline, hair merging unnaturally with background, poorly defined or inconsistent hair strands. Suggesting Real: Messy or naturally disordered hair, realistic hair shine or variation, clothing folds, posture, and accessory integration.” The prompt emphasized using graded, nuanced judgments rather than binary values, unless a cue was clearly and extremely indicative of one category. Values closer to 1 indicate that the LLM found the cue for an image to suggest a real human photograph, and values closer to 0 indicate that it found the cue to suggest an AI-generated face. The ratings we obtained are not independent ratings for each face on each cue (as used by Miller et al., 2023), but ratings of how a cue provides information about the specific AI versus real judgment.

While LLM ratings may not perfectly mirror human judgments, they generally aligned with our impressions—for example, if the model scored a face as expressive or a background as strange, we tended to agree. That said, the LLM’s superior performance suggests it may leverage features or combinations of features that humans overlook or underuse. For our analysis, which examines how o influences cue-based judgments, what matters is not whether the ratings are human-like, but that they are consistent across participants. To assess reliability, we collected two sets of independent ratings from the LLM for the two faces in each of 20 randomly selected trials (40 of the 120 faces) and computed intraclass correlation coefficients (ICC2; Shrout & Fleiss, 1979). Agreement was high across cues (ICC2 = .90–.97 for the nine cues).
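For readers unfamiliar with this reliability index, a minimal implementation of the two-way random-effects, single-rating intraclass correlation (ICC[2,1] in Shrout and Fleiss's notation) is sketched below; the input layout (a faces × rating-sets matrix per cue) and variable names are assumptions for illustration, not the authors' code.

import numpy as np

def icc2_1(x):
    # ICC(2,1): two-way random effects, absolute agreement, single ratings
    x = np.asarray(x, dtype=float)
    n, k = x.shape                        # n targets (faces), k rating sets
    grand = x.mean()
    ms_rows = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)   # between targets
    ms_cols = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)   # between rating sets
    ss_err = np.sum((x - x.mean(axis=1, keepdims=True)
                       - x.mean(axis=0, keepdims=True) + grand) ** 2)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Example: ratings["skin_texture"] would be a 40 x 2 array holding the two
# independent LLM rating sets for the 40 re-rated faces on one cue.
# icc = icc2_1(ratings["skin_texture"])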

We recoded the LLM cue ratings so that higher scores indicated stronger evidence for a face’s true category. That is, for real faces, a higher score still meant “more likely to be human,” while for AI-generated faces, a higher score now meant “more likely to be AI” (i.e., high scores consistently supported an accurate judgment). Each of the 60 trials involved 18 cue ratings—nine for the real face and nine for the AI face—recast so that higher values always favored the correct classification. We then compared these adjusted cue scores to average human performance, using data from all 500 participants in Studies 1 and 2.
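The recoding step can be expressed compactly; the sketch below assumes a hypothetical long-format table of LLM ratings (columns: trial, face_type, cue, rating, with rating scored 0 = looks AI-generated, 1 = looks real) and is only meant to illustrate the flip applied to AI faces.

import pandas as pd

def recode_cue_evidence(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Real faces: keep ratings as-is (1 already means "looks real," i.e., correct).
    # AI faces: flip the scale so 1 means "looks AI," i.e., evidence for the truth.
    out["evidence"] = out["rating"].where(out["face_type"] == "real",
                                          1 - out["rating"])
    return out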

We correlated cue evidence scores for real and AI faces (as rated by the LLM) with human accuracy on each trial. This was done across all 500 participants, as well as separately for the top and bottom 10th percentiles in object recognition ability (n = 51 per group). Positive correlations indicate that stronger LLM-rated cues aligned with higher human accuracy. These analyses can tell us which cues are most diagnostic and whether sensitivity to cues varied with object recognition ability.
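A sketch of these correlations, under the same hypothetical data layout as above (evidence as a trials × 18 table of recoded cue scores, acc as a participants × trials accuracy matrix, and o_scores per participant), might look as follows; it is an illustration, not the authors' code.

import numpy as np
import pandas as pd

def cue_accuracy_correlations(evidence, acc_matrix):
    # Correlate each cue's trial-level evidence with mean human accuracy per trial
    mean_acc = acc_matrix.mean(axis=0)
    return evidence.apply(lambda col: np.corrcoef(col, mean_acc)[0, 1])

all_corr = cue_accuracy_correlations(evidence, acc)          # full sample

cut_low, cut_high = np.percentile(o_scores, [10, 90])        # bottom/top deciles
low_corr = cue_accuracy_correlations(evidence, acc[o_scores <= cut_low])
high_corr = cue_accuracy_correlations(evidence, acc[o_scores >= cut_high])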

In the full sample (N = 500), participants used most cues accurately for real faces, but their use of cues for AI faces was overall in the wrong direction (Figure 7A). High-o and low-o groups were similar in several ways (Figure 7B), but there were some pattern differences. The low-o group appeared to rely more on all cues from real faces (in the same direction as the high-o group, but more so). The cue use for AI faces was more similar across the two groups. No comparisons were statistically significant (bootstrap analyses with 1,000 iterations, all ps &gt; .45). Power analyses indicated that detecting effects of this size would require many more trials: over 1,000 for several cues and more than 35,000 in the most extreme cases. The largest differences in cue use between groups were found for cues associated with real faces. Group differences in two cue uses from real faces, hair and reflections, could be detected with 80% power using 300 trials. This does not mean there are no differences in cue use as a function of o, just that these differences at the level of individual cues are small. Cue use across all cues did differ significantly across groups. Comparing absolute values of each correlation across groups (how much a cue influenced performance regardless of whether it was used correctly or not), we found that 14 of the 18 cues had more influence on performance for low-o than high-o individuals (one value was zero, sign test, p = .013).
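The group comparison and the sign test reported above could be approximated as in the sketch below, reusing low_corr and high_corr from the previous sketch; the bootstrap scheme shown (resampling participants within each group) is an assumption about the general approach, not a reproduction of the authors' analysis.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def boot_corr_diff(cue_evidence, acc_high, acc_low, n_boot=1000):
    # Bootstrap the high- minus low-o difference in one cue's correlation with accuracy
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        hi = acc_high[rng.integers(0, len(acc_high), len(acc_high))].mean(axis=0)
        lo = acc_low[rng.integers(0, len(acc_low), len(acc_low))].mean(axis=0)
        diffs[i] = (np.corrcoef(cue_evidence, hi)[0, 1]
                    - np.corrcoef(cue_evidence, lo)[0, 1])
    # Two-sided p-value: proportion of bootstrap differences crossing zero
    return 2 * min((diffs > 0).mean(), (diffs < 0).mean())

# Sign test on absolute correlations: do more cues influence low-o than high-o
# performance? (The article reports 14 of 18 cues, with one tied value excluded.)
more_low = int(np.sum(np.abs(low_corr) > np.abs(high_corr)))
n_untied = int(np.sum(np.abs(low_corr) != np.abs(high_corr)))
print(stats.binomtest(more_low, n_untied, p=0.5))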
<anchor name="fig7"></anchor> [Figure 7 image: xge_155_3_629_fig7a.gif]

A paradox emerged in the data: Low-o participants showed stronger correct use of diagnostic real-face cues but performed worse overall on the AIFT (55.8% vs. 66.8%). It is important to remember that these correlations reflect patterns of cue dependence, independent of performance level. Because high-o participants are more accurate on the AIFT, we expect that the most important differences between groups (if they are captured by these cues) would be in cues that high-o individuals use better than low-o individuals. The difference, for instance, may lie not so much in low-o participants making better use of generally helpful cues from real faces as in high-o participants being less misled than low-o participants by the hair cues provided by real faces.

We evaluated a cue-dependency hypothesis, which proposes that low-o participants may rely more on obvious cues, performing well when such cues are present and poorly when they are not. High-o individuals, by contrast, may show more stable performance across cue strengths. To test this, we examined how cue evidence predicted the trial-level correlation between object recognition scores and accuracy (Figure 6). The hypothesis predicts a negative relationship: Strong cues reduce the predictive power of o, while subtle cues increase it.

The results (Table 3) support the cue-dependency hypothesis. The only cues that significantly predict whether a trial is related to o are those for real faces (skin, lighting, and hair), and these correlations are all negative. The more obvious a cue is on a given trial, the less performance on that trial depends on o. When obvious diagnostic information is absent, individual differences in visual processing ability (measured by o) become a stronger determinant of performance on the AIFT.
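The trial-level test can be sketched in the same style, reusing the per-trial o-accuracy correlations and the recoded evidence table from the earlier sketches; a negative correlation for a cue is what the cue-dependency hypothesis predicts. Variable names remain hypothetical.

import numpy as np

# Per-trial correlation between o and accuracy (helper sketched earlier)
r_o_by_trial = trial_o_correlations(acc, o_scores)

# Does stronger evidence from a cue on a trial reduce that trial's dependence on o?
cue_dependency = evidence.apply(lambda col: np.corrcoef(col, r_o_by_trial)[0, 1])
print(cue_dependency.sort_values())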
<anchor name="tbl3"></anchor> [Table 3 image: xge_155_3_629_tbl3a.gif]

This pattern is consistent with our earlier finding that low-o participants numerically showed stronger reliance on both real-face cues (accurate dependency) and AI face cues (inaccurate dependency), suggesting they are generally more cue dependent regardless of whether the cues are helpful or misleading. High-o participants, by contrast, may use visual information more selectively, evidenced by their reduced susceptibility to some misleading AI cues. Together, these findings suggest that low-o participants require obvious diagnostic signals to succeed, while high-o participants can maintain good performance even under challenging visual conditions.

Our cue use analyses have limitations: They do not account for cue integration across both faces in a trial, rely on LLM-derived features that may not reflect human perception, and are constrained by a modest sample and correlational design. We find that high-o individuals are less cue dependent than low-o individuals, for the cues we measured, but it is possible that they also used cues that we did not include better than those with low o.

Still, we replicated a key finding from Miller et al. (2023)—participants misused cues when identifying AI versus real faces—despite differences in stimuli (StyleGAN vs. StyleGAN2) and procedure (forced choice vs. single-face ratings). In single-face ratings, it is difficult to assign cue use differentially to real versus AI faces because response biases confound the measurement of cue utilization—participants make binary real/AI judgments about single faces, making it unclear whether apparent “cue use” reflects actual feature detection or general tendencies toward “real” or “AI” response preferences. However, in the context of forced-choice judgments, where each judgment is associated with specific real and AI faces, we found that cues from AI faces were misused more than those from real faces.

We also extend this prior work by revealing individual differences in cue use. High-o participants who performed better overall on the AIFT seemed less cue dependent than low-o participants. Trial-level analyses showed that o became most predictive when diagnostic real-face cues were weak—suggesting high-o individuals excel at resisting misleading information. Low-o participants aligned more closely with real-face cues, but when these are less obvious, they may be more vulnerable to misleading AI cues or simply guess. Future work could test these hypotheses by combining an individual differences approach with a larger number of trials.

<h31 id="xge-155-3-629-d563e1563">Discussion</h31>

Study 2 replicates the key finding from Study 1: o predicts the detection of AI-generated faces. This is a strong result since o is now the more specific ability of the two in the model. An important limitation is that we did not preregister o’s contribution to the Paper Folding task. In hindsight, this correlation is perhaps not too surprising, given that both o and g are, by definition, domain-general constructs measured through tasks that also account for more specific variance. Experts on the structure of intelligence (Schneider & McGrew, 2018) have pointed out that different strategies, imagery versus analytic, can be used on the Paper Folding Test (Burte et al., 2019), which may be why this task tends to load on two broad abilities: fluid intelligence (Gf) and visual ability (Gv). While g is a very robust construct, its estimation becomes more stable with the number of tests included (Floyd et al., 2009). Here, we used a limited range of indicators that targeted g instead of visual ability and that did not allow a separation of fluid intelligence from o, so the variance in this task can only be partitioned between g and o.

Given the limited scope of our models, we acknowledge that a more comprehensive understanding of o and of its place in a hierarchy of abilities will require larger batteries that include more broad abilities. A key theoretical question is whether o should be considered a component of visual ability (Gv, visual processing), as might be suggested by visual ability’s theoretical definition as “broad visual perception” (Carroll, 1993), or whether o represents a distinct broad ability. Behavioral and neuroimaging results indicate that object recognition tasks (similar to our o tasks) and mental rotation are supported by different mechanisms and neural systems (Cheung et al., 2009; Gauthier et al., 2002; Hayward et al., 2006). One study found that the correlation between o and spatial abilities (r = .74) was greatly reduced (r = .33) once g was controlled (Smithson, Chow, & Gauthier, 2024). While these broader questions about the structure of visual abilities warrant further investigation, Study 2 demonstrates that AIFT performance is more specifically linked to object recognition ability than to general intelligence.

General Discussion



We find that o predicts how well people can judge whether a face is AI-generated. However, narrowing further to a face-specific ability does not improve predictions, nor does expanding to a broader ability like general intelligence. This suggests that o is well positioned to explain complex visual skills like AI detection in faces. One reason face-specific abilities may not help in this task is that face recognition relies on efficiently processing individual facial features (Audette et al., 2025; Gold et al., 2012). Since these features were extracted to support other tasks such as recognizing individuals, processing facial expressions, or estimating age, they may not be helpful for the entirely different challenge of separating real from artificial faces. To be clear, the features we use to recognize real faces appear to be useful for AI-generated faces, when the task is to recognize these faces as individuals (Uittenhove et al., 2024). The driving factor here may be the task itself—assessing whether a face is AI-generated differs from identity discrimination and likely relies on different cues. Our cue analysis suggests that for judging real from AI faces, people can use salient cues from real faces correctly, but several of the same cues are used incorrectly when applying them to AI faces.

Why does o predict AI face detection given that it is also measured with tasks that assess identity discrimination? Remember that both f and o have similar zero-order correlations with AIFT performance and that they are correlated with each other. Multiple regression and SEM converge to suggest that there is no additional information from f over o in the prediction of AIFT performance. Face recognition tasks can predict AIFT performance because, and perhaps only because, they tap into domain-general object recognition. However, one could imagine a perceptual ability that is specific to the face domain but general in terms of the dimensions processed. We can make many kinds of judgments about faces beyond identity, for instance age, gender, ethnicity, emotion, social traits, and health (White & Burton, 2022). There is as yet little evidence for shared variance across these face-relevant tasks that is not explained by a more general perceptual ability (although see Walker et al., 2025), but if such an ability existed, it may be a good candidate to account for additional variance in AIFT performance.

There are also interesting reasons why intelligence may not be particularly helpful in predicting AIFT performance. Intelligence is associated with general mechanisms that include working memory, attention control, and executive functioning (Conway et al., 2003; Diamond, 2013; Heitz et al., 2005). While tasks that heavily load on g typically show strong correlations between a person’s accuracy and their confidence levels (Ackerman et al., 2002), perceptual tasks demonstrate much weaker accuracy–confidence relationships (Jin et al., 2022). When evaluating AI faces, humans rely on multiple dimensions—including proportionality, familiarity, attractiveness, and memorability—but weigh and combine these cues in ways that lead to errors. Notably, those who express the highest confidence are frequently incorrect (Miller et al., 2023). The poor insight people have in detecting AI faces suggests it is more of a perceptual than a cognitive task.

A plausible reason that people with high o excel at detecting AI faces is their ability to encode rich and/or precise image representations. People with a higher o have higher sensitivity to object shape across multiple areas in the extrastriate cortex (McGugin et al., 2022). This was measured during an incidental task, suggesting that high-o individuals may automatically encode shape more precisely or at least do so even when they are not asked to attend to shape. This could in turn help them solve a host of complex visual problems, and interestingly, it could be helpful in tasks like AI detection, where most people have little experience with features that are diagnostic or how to use them (Miller et al., 2023). Our cue analysis suggested that high-o individuals were more cue independent, which could mean they were more flexible on the cues they used on each trial and integrated them more effectively. Another relevant finding is that the same cognitive capacity underlies AIFT judgments regardless of whether encoding limitations are present. Additional processing time does not qualitatively change the way in which people make these judgments, which is consistent with our conjecture that individual differences on this task are more influenced by bottom-up perceptual processes than by top-down cognitive strategies. The result also stands in contrast with the finding that time pressure changes the ability that is measured by tests of fluid intelligence (Chuderski, 2013; Partchev & De Boeck, 2012).

We should be curious about other predictors of AI detection for faces, since a large amount of reliable variance in the AIFT remains to be accounted for. Although we found no evidence that experience with AI predicted this ability, other perceptual or cognitive abilities may fare better. We did not test for specific cognitive abilities such as attentional control, perceptual speed, or working memory. These abilities are consistently highly correlated with g (e.g., Ackerman & Beier, 2007; Engle & Kane, 2004), but they could still account for some additional variance. Here, we operationalized a very general g factor—one that captures the variance shared between o tasks and verbal tasks like vocabulary and grammatical reasoning—because we wanted to contrast o with the most general ability possible. General intelligence is relatively independent of the specific tasks used to estimate it (e.g., Floyd et al., 2009), and we selected our indicators because of their prior use and particularly high loadings on g (Bates & Gignac, 2022; Walker et al., 2025). But there are challenges associated with the measurement of a hierarchy of skills with a small battery of tasks. It remains possible that detection of AI-generated faces could be predicted by an attentional control or a working memory factor operationalized by several tasks targeting these processes more directly. Attentional control could help individuals focus on subtle differences between images, while working memory could aid in comparing representations of the faces (we note, however, that this might suggest a stronger influence of unlimited encoding time). In sum, even though g was not a strong predictor, other broad abilities falling under g could still be relevant. After all, o was operationalized here as a broad ability that falls under the g umbrella.

Our work has broader implications and raises many questions for how we understand visual abilities in the context of rapidly advancing AI technology. For example, would people with high o also excel at detecting other AI-generated images, such as landscapes, animals, or nonface objects? Human raters with high o might improve AI training more effectively than an unselected sample. Conversely, our findings suggest that individuals with low o, regardless of intelligence or face recognition skills, could be more vulnerable to visual misinformation. Finally, comparing performance variability in humans and artificial neural networks could offer a way to explore the cognitive mechanisms underlying perception. Unlike humans, AI models can be systematically altered (Mehrer et al., 2020), allowing researchers to test hypotheses about how complex abilities develop. For instance, individual differences in linguistic skills and theory of mind were found to align across LLMs (Kosinski, 2024) and humans (Milligan et al., 2007), highlighting the role of attention mechanisms. Similarly, if AI models mirrored human variability in perceptual tasks, they could help probe internal representations and test hypotheses about abilities like AI-generated face detection (Chow & Palmeri, 2024).

<h31 id="xge-155-3-629-d563e1775">Constraints on Generality</h31>

Our findings that o predicts detection of AI-generated faces are based on U.S. adults aged 18–45 recruited online. We expect the results to generalize to similar internet-literate populations but cannot claim generality to children, older adults, or non-Western groups. The effect was observed with faces generated by StyleGAN compared against Flickr-Faces-HQ real faces, under a forced-choice task without feedback. We expect replication with comparable GAN-era stimuli and procedures, but not necessarily with newer AI generators or different task formats. We have no reason to believe that the results depend on other characteristics of participants, materials, or context.

Footnotes

<anchor name="fn1"></anchor>

<sups> 1 </sups> Indeed, the test likely does not represent the full range, as we tried to avoid trials that were too easy because of obvious artifacts and eventually removed from the AIFT 2.0 several trials that were too difficult to provide discriminative information.

<anchor name="fn2"></anchor>

<sups> 2 </sups> Note that this is only a matter of convenience. The results were almost identical if we coded men = 1 and not-men = 0, or if we excluded participants who did not fit into these two categories.

<anchor name="fn3"></anchor>

<sups> 3 </sups> The higher correlation between face tests than between novel object tests is consistent with prior work and with their conception as a narrow versus a broad ability. In each pair, the tasks are different, but for objects, the category is also different.

References

<anchor name="c1"></anchor>

Ackerman, P. L., & Beier, M. E. (2007). Further explorations of perceptual speed abilities in the context of assessment methods, cognitive abilities, and individual differences during skill acquisition. Journal of Experimental Psychology: Applied, 13(4), 249–272. 10.1037/1076-898X.13.4.249

<anchor name="c2"></anchor>

Ackerman, P. L., Beier, M. E., & Bowen, K. R. (2002). What we really know about our abilities and our knowledge. Personality and Individual Differences, 33(4), 587–605. 10.1016/S0191-8869(01)00174-X

<anchor name="c3"></anchor>

Arcaro, M. J., Schade, P. F., Vincent, J. L., Ponce, C. R., & Livingstone, M. S. (2017). Seeing faces is necessary for face-domain formation. Nature Neuroscience, 20(10), 1404–1412. 10.1038/nn.4635

<anchor name="c4"></anchor>

Audette, P.-L., Côté, L., Blais, C., Duncan, J., Gingras, F., & Fiset, D. (2025). Part-based processing, but not holistic processing, predicts individual differences in face recognition abilities. Cognition, 256, Article 106057. 10.1016/j.cognition.2024.106057

<anchor name="c5"></anchor>

Baddeley, A. D. (1968). A 3 min reasoning test based on grammatical transformation. Psychonomic Science, 10(10), 341–342. 10.3758/BF03331551

<anchor name="c6"></anchor>

Bates, T. C., & Gignac, G. E. (2022). Effort impacts IQ test scores in a minor way: A multi-study investigation with healthy adult volunteers. Intelligence, 92, Article 101652. 10.1016/j.intell.2022.101652

<anchor name="c7"></anchor>

Bollen, K. A. (2014). Structural equations with latent variables. Wiley.

<anchor name="c8"></anchor>

Bowles, D. C., McKone, E., Dawel, A., Duchaine, B., Palermo, R., Schmalzl, L., Rivolta, D., Wilson, C. E., & Yovel, G. (2009). Diagnosing prosopagnosia: Effects of ageing, sex, and participant-stimulus ethnic match on the Cambridge Face Memory Test and Cambridge Face Perception Test. Cognitive Neuropsychology, 26(5), 423–455. 10.1080/02643290903343149

<anchor name="c9"></anchor>

Brady, T. F., Robinson, M. M., Williams, J. R., & Wixted, J. T. (2023). Measuring memory is harder than you think: How to avoid problematic measurement practices in memory research. Psychonomic Bulletin & Review, 30(2), 421–449. 10.3758/s13423-022-02179-w

<anchor name="c10"></anchor>

Breit, M., Scherrer, V., & Preckel, F. (2024). How useful are specific cognitive ability scores? An investigation of their stability and incremental validity beyond general intelligence. Intelligence, 103, Article 101816. 10.1016/j.intell.2024.101816

<anchor name="c11"></anchor>

Burte, H., Gardony, A. L., Hutton, A., & Taylor, H. A. (2019). Knowing when to fold’em: Problem attributes and strategy differences in the Paper Folding test. Personality and Individual Differences, 146, 171–181. 10.1016/j.paid.2018.08.009

<anchor name="c12"></anchor>

Burton, A. M., White, D., & McNeill, A. (2010). The Glasgow Face Matching Test. Behavior Research Methods, 42(1), 286–291. 10.3758/BRM.42.1.286

<anchor name="c13"></anchor>

Canivez, G. L., Watkins, M. W., James, T., Good, R., & James, K. (2014). Incremental validity of WISC-IV(UK) factor index scores with a referred Irish sample: Predicting performance on the WIAT-II(UK). British Journal of Educational Psychology, 84(4), 667–684. 10.1111/bjep.12056

<anchor name="c14"></anchor>

Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge University Press. 10.1017/CBO9780511571312

<anchor name="c15"></anchor>

Ćepulić, D.-B., Wilhelm, O., Sommer, W., & Hildebrandt, A. (2018). All categories are equal, but some categories are more equal than others: The psychometric structure of object and face cognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 44(8), 1254–1268. 10.1037/xlm0000511

<anchor name="c16"></anchor>

Chang, T.-Y., Cha, O., McGugin, R., Tomarken, A., & Gauthier, I. (2024). How general is ensemble perception? Psychological Research, 88(3), 695–708. 10.1007/s00426-023-01883-z

<anchor name="c17"></anchor>

Chang, T.-Y., & Gauthier, I. (2020). Distractor familiarity reveals the importance of configural information in musical notation. Attention, Perception, & Psychophysics, 82(3), 1304–1317. 10.3758/s13414-019-01826-0

<anchor name="c18"></anchor>

Cheung, O. S., Hayward, W. G., & Gauthier, I. (2009). Dissociating the effects of angular disparity and image similarity in mental rotation and object recognition. Cognition, 113(1), 128–133. 10.1016/j.cognition.2009.07.008

<anchor name="c19"></anchor>

Chow, J. K., & Palmeri, T. J. (2024). Manipulating and measuring variation in deep neural network (DNN) representations of objects. Cognition, 252, Article 105920. 10.1016/j.cognition.2024.105920

<anchor name="c20"></anchor>

Chua, K.-W., & Gauthier, I. (2020). Domain-specific experience determines individual differences in holistic processing. Journal of Experimental Psychology: General, 149(1), 31–41. 10.1037/xge0000628

<anchor name="c21"></anchor>

Chuderski, A. (2013). When are fluid intelligence and working memory isomorphic and when are they not? Intelligence, 41(4), 244–262. 10.1016/j.intell.2013.04.003

<anchor name="c22"></anchor>

Conway, A. R., Kane, M. J., & Engle, R. W. (2003). Working memory capacity and its relation to general intelligence. Trends in Cognitive Sciences, 7(12), 547–552. 10.1016/j.tics.2003.10.005

<anchor name="c23"></anchor>

Cooke, D., Edwards, A., Barkoff, S., & Kelly, K. (2024). As good as a coin toss: Human detection of AI-generated images, videos, audio, and audiovisual stimuli. arXiv. 10.48550/arXiv.2403.16760

<anchor name="c24"></anchor>

Cross, C. (2022). Using artificial intelligence (AI) and deepfakes to deceive victims: The need to rethink current romance fraud prevention messaging. Crime Prevention and Community Safety, 24(1), 30–41. 10.1057/s41300-021-00134-w

<anchor name="c25"></anchor>

de Leeuw, J. R. (2015). jsPsych: A JavaScript library for creating behavioral experiments in a Web browser. Behavior Research Methods, 47(1), 1–12. 10.3758/s13428-014-0458-y

<anchor name="c26"></anchor>

Deary, I. J. (2012). Intelligence. Annual Review of Psychology, 63(1), 453–482. 10.1146/annurev-psych-120710-100353

<anchor name="c27"></anchor>

Delavari, P., Ozturan, G., Yuan, L., Yilmaz, Ö., & Oruc, I. (2023). Artificial intelligence, explainability, and the scientific method: A proof-of-concept study on novel retinal biomarker discovery. PNAS Nexus, 2(9), Article pgad290. 10.1093/pnasnexus/pgad290

<anchor name="c28"></anchor>

Dennett, H. W., McKone, E., Tavashmi, R., Hall, A., Pidcock, M., Edwards, M., & Duchaine, B. (2012). The Cambridge Car Memory Test: A task matched in format to the Cambridge Face Memory Test, with norms, reliability, sex differences, dissociations from face memory, and expertise effects. Behavior Research Methods, 44(2), 587–605. 10.3758/s13428-011-0160-2

<anchor name="c29"></anchor>

Diamond, A. (2013). Executive functions. Annual Review of Psychology, 64(1), 135–168. 10.1146/annurev-psych-113011-143750

<anchor name="c30"></anchor>

Dotsch, R., Hassin, R. R., & Todorov, A. (2016). Statistical learning shapes face evaluation. Nature Human Behaviour, 1(1), Article 1. 10.1038/s41562-016-0001

<anchor name="c31"></anchor>

Duchaine, B., & Nakayama, K. (2006). The Cambridge Face Memory Test: Results for neurologically intact individuals and an investigation of its validity using inverted face stimuli and prosopagnosic participants. Neuropsychologia, 44(4), 576–585. 10.1016/j.neuropsychologia.2005.07.001

<anchor name="c32"></anchor>

Duffy, S., August, A., & Wisniewski, K. (2024). Discriminating real from AI-generated faces: Effects of emotion, gender, and age. Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 46). <a href="https://escholarship.org/uc/item/49v4z44k" target="_blank">https://escholarship.org/uc/item/49v4z44k</a>

<anchor name="c33"></anchor>

Dunn, K. J., & McCray, G. (2020). The place of the bifactor model in confirmatory factor analysis investigations into construct dimensionality in language testing. Frontiers in Psychology, 11, Article 1357. 10.3389/fpsyg.2020.01357

<anchor name="c34"></anchor>

Ekstrom, R. B., French, J. W., Harman, H. H., & Dermen, D. (1976). Kit of factor-referenced cognitive tests. Educational Testing Service.

<anchor name="c35"></anchor>

Engle, R. W., & Kane, M. J. (2004). Executive attention, working memory capacity, and a two-factor theory of cognitive control. In B. H. Ross (Ed.), The psychology of learning and motivation: Advances in research and theory (pp. 145–199). Elsevier.

<anchor name="c36"></anchor>

Floyd, R. G., Shands, E. I., Rafael, F. A., Bergeron, R., & McGrew, K. S. (2009). The dependability of general-factor loadings: The effects of factor-extraction methods, test battery composition, test battery size, and their interactions. Intelligence, 37(5), 453–465. 10.1016/j.intell.2009.05.003

<anchor name="c37"></anchor>

Gauthier, I., Hayward, W. G., Tarr, M. J., Anderson, A. W., Skudlarski, P., & Gore, J. C. (2002). BOLD activity during mental rotation and viewpoint-dependent object recognition. Neuron, 34(1), 161–171. 10.1016/S0896-6273(02)00622-0

<anchor name="c38"></anchor>

Glutting, J. J., Watkins, M. W., Konold, T. R., & McDermott, P. A. (2006). Distinctions without a difference: The utility of observed versus latent factors from the WISC-IV in estimating reading and math achievement on the WIAT-II. The Journal of Special Education, 40(2), 103–114. 10.1177/00224669060400020101

<anchor name="c39"></anchor>

Gold, J. M., Mundy, P. J., & Tjan, B. S. (2012). The perception of a face is no more than the sum of its parts. Psychological Science, 23(4), 427–434. 10.1177/0956797611427407

<anchor name="c40"></anchor>

Goodhew, S. C., & Edwards, M. (2019). Translating experimental paradigms into individual-differences research: Contributions, challenges, and practical recommendations. Consciousness and Cognition, 69, 14–25. 10.1016/j.concog.2019.01.008

<anchor name="c41"></anchor>

Gray, K. L. H., Biotti, F., & Cook, R. (2019). Evaluating object recognition ability in developmental prosopagnosia using the Cambridge Car Memory Test. Cognitive Neuropsychology, 36(1–2), 89–96. 10.1080/02643294.2019.1604503

<anchor name="c42"></anchor>

Hayward, W. G., Zhou, G., Gauthier, I., & Harris, I. M. (2006). Dissociating viewpoint costs in mental rotation and object recognition. Psychonomic Bulletin & Review, 13(5), 820–825. 10.3758/BF03194003

<anchor name="c43"></anchor>

Heitz, R. P., Unsworth, N., & Engle, R. W. (2005). Working memory capacity, attention control, and fluid intelligence. In O. Wilhelm & R. W. Engle (Eds.), Handbook of understanding and measuring intelligence (pp. 61–77). Sage Publications. 10.4135/9781452233529.n5

<anchor name="c44"></anchor>

Jin, S., Verhaeghen, P., & Rahnev, D. (2022). Across-subject correlation between confidence and accuracy: A meta-analysis of the confidence database. Psychonomic Bulletin & Review, 29(4), 1405–1413. 10.3758/s13423-022-02063-7

<anchor name="c45"></anchor>

Kaltwasser, L., Hildebrandt, A., Recio, G., Wilhelm, O., & Sommer, W. (2014). Neurocognitive mechanisms of individual differences in face cognition: A replication and extension. Cognitive, Affective, & Behavioral Neuroscience, 14(2), 861–878. 10.3758/s13415-013-0234-y

<anchor name="c46"></anchor>

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020). Analyzing and improving the image quality of StyleGAN. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 8110–8119). <a href="https://openaccess.thecvf.com/content_CVPR_2020/html/Karras_Analyzing_and_Improving_the_Image_Quality_of_StyleGAN_CVPR_2020_paper.html" target="_blank">https://openaccess.thecvf.com/content_CVPR_2020/html/Karras_Analyzing_and_Improving_the_Image_Quality_of_StyleGAN_CVPR_2020_paper.html</a>

<anchor name="c47"></anchor>

Kell, H. J., Lubinski, D., Benbow, C. P., & Steiger, J. H. (2013). Creativity and technical innovation: Spatial ability’s unique role. Psychological Science, 24(9), 1831–1836. 10.1177/0956797613478615

<anchor name="c48"></anchor>

Kietzmann, J., Lee, L. W., McCarthy, I. P., & Kietzmann, T. C. (2020). Deepfakes: Trick or treat? Business Horizons, 63(2), 135–146. 10.1016/j.bushor.2019.11.006

<anchor name="c49"></anchor>

Kosinski, M. (2024). Evaluating large language models in theory of mind tasks. Proceedings of the National Academy of Sciences of the United States of America, 121(45), Article e2405460121. 10.1073/pnas.2405460121

<anchor name="c50"></anchor>

LePine, J. A., Hollenbeck, J. R., Ilgen, D. R., & Hedlund, J. (1997). Effects of individual differences on the performance of hierarchical decision-making teams: Much more than g. Journal of Applied Psychology, 82(5), 803–811. 10.1037/0021-9010.82.5.803

<anchor name="c51"></anchor>

Little, T. D., Lindenberger, U., & Nesselroade, J. R. (1999). On selecting indicators for multivariate measurement and modeling with latent variables: When “good” indicators are bad and “bad” indicators are good. Psychological Methods, 4(2), 192–211. 10.1037/1082-989X.4.2.192

<anchor name="c52"></anchor>

Little, T. D., Rhemtulla, M., Gibson, K., & Schoemann, A. M. (2013). Why the items versus parcels controversy needn’t be one. Psychological Methods, 18(3), 285–300. 10.1037/a0033266

<anchor name="c53"></anchor>

McGill, R. J. (2015). Spearman’s Law of Diminishing Returns (SLODR): Examining effects at the level of prediction. Journal of Psychology and Behavioral Science, 3(1), 24–36. 10.15640/jpbs.v3n1a3

<anchor name="c54"></anchor>

McGugin, R. W., Chow, J. K., & Gauthier, I. (2024). Preregistration: Do face and object recognition abilities predict the ability to judge AI-generated faces? <a href="https://osf.io/ueqjc" target="_blank">https://osf.io/ueqjc</a>

<anchor name="c55"></anchor>

McGugin, R. W., & Gauthier, I. (2024). Preregistration: Does object recognition ability predict the ability to judge AI-generated faces, above and beyond g? <a href="https://osf.io/q2z4f" target="_blank">https://osf.io/q2z4f</a>

<anchor name="c56"></anchor>

McGugin, R. W., & Gauthier, I. (2025a). Do face and object recognition abilities predict the ability to judge AI-generated faces? <a href="https://osf.io/82krc" target="_blank">https://osf.io/82krc</a>

<anchor name="c57"></anchor>

McGugin, R. W., & Gauthier, I. (2025b). Does object recognition ability predict the ability to judge AI-generated faces, above and beyond g? <a href="https://osf.io/csgf5" target="_blank">https://osf.io/csgf5</a>

<anchor name="c58"></anchor>

McGugin, R. W., Ryan, K. F., Tamber-Rosenau, B. J., & Gauthier, I. (2018). The role of experience in the face-selective response in right FFA. Cerebral Cortex, 28(6), 2071–2084. 10.1093/cercor/bhx113

<anchor name="c59"></anchor>

McGugin, R. W., Sunday, M. A., & Gauthier, I. (2022). The neural correlates of domain-general visual ability. Cerebral Cortex, 33(8), 4280–4292. 10.1093/cercor/bhac342

<anchor name="c60"></anchor>

Mehrer, J., Spoerer, C. J., Kriegeskorte, N., & Kietzmann, T. C. (2020). Individual differences among deep neural network models. Nature Communications, 11(1), Article 5725. 10.1038/s41467-020-19632-w

<anchor name="c61"></anchor>

Meyer, K., Sommer, W., & Hildebrandt, A. (2021). Reflections and new perspectives on face cognition as a specific socio-cognitive ability. Journal of Intelligence, 9(2), Article 30. 10.3390/jintelligence9020030

<anchor name="c62"></anchor>

Miller, E. J., Steward, B. A., Witkower, Z., Sutherland, C. A. M., Krumhuber, E. G., & Dawel, A. (2023). AI hyperrealism: Why AI faces are perceived as more real than human ones. Psychological Science, 34(12), 1390–1403. 10.1177/09567976231207095

<anchor name="c63"></anchor>

Milligan, K., Astington, J. W., & Dack, L. A. (2007). Language and theory of mind: Meta-analysis of the relation between language ability and false-belief understanding. Child Development, 78(2), 622–646. 10.1111/j.1467-8624.2007.01018.x

<anchor name="c64"></anchor>

Nightingale, S. J., & Farid, H. (2022). AI-synthesized faces are indistinguishable from real faces and more trustworthy. Proceedings of the National Academy of Sciences of the United States of America, 119(8), Article e2120481119. 10.1073/pnas.2120481119

<anchor name="c65"></anchor>

Partchev, I., & De Boeck, P. (2012). Can fast and slow intelligence be differentiated? Intelligence, 40(1), 23–32. 10.1016/j.intell.2011.11.002

<anchor name="c66"></anchor>

Richler, J. J., Tomarken, A. J., Sunday, M. A., Vickery, T. J., Ryan, K. F., Floyd, R. J., Sheinberg, D., Wong, A. C.-N., & Gauthier, I. (2019). Individual differences in object recognition. Psychological Review, 126(2), 226–251. 10.1037/rev0000129

<anchor name="c67"></anchor>

Richler, J. J., Wilmer, J. B., & Gauthier, I. (2017). General object recognition is specific: Evidence from novel and familiar objects. Cognition, 166, 42–55. 10.1016/j.cognition.2017.05.019

<anchor name="c68"></anchor>

Russell, R., Duchaine, B., & Nakayama, K. (2009). Super-recognizers: People with extraordinary face recognition ability. Psychonomic Bulletin & Review, 16(2), 252–257. 10.3758/PBR.16.2.252

<anchor name="c69"></anchor>

Schneider, W. J., & McGrew, K. S. (2018). The Cattell–Horn–Carroll theory of cognitive abilities. In D. P. Flanagan & E. M. McDonough (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (pp. 73–163). The Guilford Press.

<anchor name="c70"></anchor>

Schönbrodt, F. D., & Perugini, M. (2013). At what sample size do correlations stabilize? Journal of Research in Personality, 47(5), 609–612. 10.1016/j.jrp.2013.05.009

<anchor name="c71"></anchor>

Shakeshaft, N. G., & Plomin, R. (2015). Genetic specificity of face recognition. Proceedings of the National Academy of Sciences of the United States of America, 112(41), 12887–12892. 10.1073/pnas.1421881112

<anchor name="c93"></anchor>

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. 10.1037/0033-2909.86.2.420

<anchor name="c72"></anchor>

Smithson, C. J. R., Chow, J. K., Chang, T.-Y., & Gauthier, I. (2024). Measuring object recognition ability: Reliability, validity, and the aggregate z-score approach. Behavior Research Methods, 56(7), 6598–6612. 10.3758/s13428-024-02372-w

<anchor name="c73"></anchor>

Smithson, C. J. R., Chow, J. K., & Gauthier, I. (2024). Visual and auditory object recognition in relation to spatial abilities. Journal of Vision, 24(10), Article 1385. 10.1167/jov.24.10.1385

<anchor name="c74"></anchor>

Smithson, C. J. R., Eichbaum, Q., & Gauthier, I. (2022). Domain-general object recognition ability predicts supervised category learning on a medical imaging task. Journal of Vision, 22(14), Article 4427. 10.1167/jov.22.14.4427

<anchor name="c76"></anchor>

Smithson, C. J. R., & Gauthier, I. (2025). Domain-general object recognition ability has perception and memory subfactors. Journal of Vision, 25(9), Article 2297. 10.1167/jov.25.9.2297

<anchor name="c75"></anchor>

Smithson, C. J. R., & Gauthier, I. (2026). Is domain-general object recognition ability a novel construct? Annual Review of Psychology. Advance online publication. 10.1146/annurev-psych-020325-034053

<anchor name="c77"></anchor>

Spearman, C. (1927). The abilities of man (Vol. 6). Macmillan.

<anchor name="c78"></anchor>

Sterba, S. K., & MacCallum, R. C. (2010). Variability in parameter estimates and model fit across repeated allocations of items to parcels. Multivariate Behavioral Research, 45(2), 322–358. 10.1080/00273171003680302

<anchor name="c79"></anchor>

Stosic, M. D., Murphy, B. A., Duong, F., Fultz, A. A., Harvey, S. E., & Bernieri, F. (2024). Careless responding: Why many findings are spurious or spuriously inflated. Advances in Methods and Practices in Psychological Science, 7(1). 10.1177/25152459241231581

<anchor name="c80"></anchor>

Sunday, M. A., Dodd, M. D., Tomarken, A. J., & Gauthier, I. (2019). How faces (and cars) may become special. Vision Research, 157, 202–212. 10.1016/j.visres.2017.12.007

<anchor name="c81"></anchor>

Sunday, M. A., Donnelly, E., & Gauthier, I. (2018). Both fluid intelligence and visual object recognition ability relate to nodule detection in chest radiographs. Applied Cognitive Psychology, 32(6), 755–762. 10.1002/acp.3460

<anchor name="c82"></anchor>

Sunday, M. A., Lee, W.-Y., & Gauthier, I. (2018). Age-related differential item functioning in tests of face and car recognition ability. Journal of Vision, 18(1), Article 2. 10.1167/18.1.2

<anchor name="c83"></anchor>

Sunday, M. A., Tomarken, A. J., Cho, S.-J., & Gauthier, I. (2021). Novel and familiar object recognition rely on the same ability. Journal of Experimental Psychology: General, 151(3), 676–694. 10.1037/xge0001100

<anchor name="c84"></anchor>

Uittenhove, K., Otroshi Shahreza, H., Marcel, S., & Ramon, M. (2024). Synthetic and natural face identity processing share common mechanisms. bioRxiv. 10.1101/2024.08.03.605972

<anchor name="c85"></anchor>

Verhallen, R. J., Bosten, J. M., Goodbourn, P. T., Lawrance-Owen, A. J., Bargary, G., & Mollon, J. D. (2017). General and specific factors in the processing of faces. Vision Research, 141, 217–227. 10.1016/j.visres.2016.12.014

<anchor name="c86"></anchor>

Wagenmakers, E.-J., Love, J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Selker, R., Gronau, Q. F., Dropmann, D., Boutin, B., Meerhoff, F., Knight, P., Raj, A., van Kesteren, E.-J., van Doorn, J., Šmíra, M., Epskamp, S., Etz, A., Matzke, D., …Morey, R. D. (2018). Bayesian inference for psychology. Part II: Example applications with JASP. Psychonomic Bulletin & Review, 25(1), 58–76. 10.3758/s13423-017-1323-7

<anchor name="c87"></anchor>

Walker, D. L., Palermo, R., & Gignac, G. E. (2025). The inter-association between face processing, intelligence, and autistic-like nonverbal communication. Quarterly Journal of Experimental Psychology. Advance online publication. 10.1177/17470218251323388

<anchor name="c88"></anchor>

Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional GANs. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8798–8807). 10.1109/CVPR.2018.00917

<anchor name="c89"></anchor>

Warrington, E., McKenna, P., & Orpwood, L. (1998). Single word comprehension: A Concrete and Abstract Word Synonym Test. Neuropsychological Rehabilitation, 8(2), 143–154. 10.1080/713755528

<anchor name="c90"></anchor>

White, D., & Burton, A. M. (2022). Individual differences and the multidimensional nature of face perception. Nature Reviews Psychology, 1(5), 287–300. 10.1038/s44159-022-00041-3

<anchor name="c91"></anchor>

Wilhelm, O., Herzmann, G., Kunina, O., Danthiir, V., Schacht, A., & Sommer, W. (2010). Individual differences in perceiving and recognizing faces-one element of social cognition. Journal of Personality and Social Psychology, 99(3), 530–548. 10.1037/a0019972

<anchor name="c92"></anchor>

Wilmer, J. B., Germine, L., Chabris, C. F., Chatterjee, G., Williams, M., Loken, E., Nakayama, K., & Duchaine, B. (2010). Human face recognition ability is specific and highly heritable. Proceedings of the National Academy of Sciences of the United States of America, 107(11), 5238–5241. 10.1073/pnas.0913053107

Submitted: February 21, 2025 Revised: September 26, 2025 Accepted: October 11, 2025