*Result*: Investigating fine- and coarse-grained structural correspondences between deep neural networks and human object image similarity judgments using unsupervised alignment.

Title:
Investigating fine- and coarse-grained structural correspondences between deep neural networks and human object image similarity judgments using unsupervised alignment.
Authors:
Takahashi S; Graduate School of Arts and Science, University of Tokyo, 3-8-1 Komaba, Meguro-ku, 153-8902, Tokyo, Japan.
Sasaki M; Graduate School of Arts and Science, University of Tokyo, 3-8-1 Komaba, Meguro-ku, 153-8902, Tokyo, Japan.
Takeda K; Graduate School of Arts and Science, University of Tokyo, 3-8-1 Komaba, Meguro-ku, 153-8902, Tokyo, Japan.
Oizumi M; Graduate School of Arts and Science, University of Tokyo, 3-8-1 Komaba, Meguro-ku, 153-8902, Tokyo, Japan. Electronic address: c-oizumi@g.ecc.u-tokyo.ac.jp.
Source:
Neural networks : the official journal of the International Neural Network Society [Neural Netw] 2026 Mar; Vol. 195, pp. 108222. Date of Electronic Publication: 2025 Oct 17.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Pergamon Press; Country of Publication: United States; NLM ID: 8805018; Publication Model: Print-Electronic; Cited Medium: Internet; ISSN: 1879-2782 (Electronic); Linking ISSN: 08936080; NLM ISO Abbreviation: Neural Netw; Subsets: MEDLINE
Imprint Name(s):
Original Publication: New York : Pergamon Press, [c1988-
Contributed Indexing:
Keywords: Deep neural networks; Gromov-Wasserstein optimal transport; Human object representations; Representational similarity analysis; Unsupervised alignment
Entry Date(s):
Date Created: 20251028 Date Completed: 20260124 Latest Revision: 20260128
Update Code:
20260130
DOI:
10.1016/j.neunet.2025.108222
PMID:
41151525
Database:
MEDLINE

*Further Information*

*The learning mechanisms by which humans acquire internal representations of objects are not fully understood. Deep neural networks (DNNs) have emerged as a useful tool for investigating this question, as they have internal representations similar to those of humans as a byproduct of optimizing their objective functions. While previous studies have shown that models trained with various learning paradigms (such as supervised, self-supervised, and CLIP) acquire human-like representations, it remains unclear whether their similarity to human representations is primarily at a coarse category level or extends to finer details. Here, we employ an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport to compare human and model object representations at both fine-grained and coarse-grained levels. The unique feature of this method compared to conventional representational similarity analysis is that it estimates optimal fine-grained mappings between the representation of each object in human and model representations. We used this unsupervised alignment method to assess the extent to which the representation of each object in humans is correctly mapped to the corresponding representation of the same object in models. Using human similarity judgments of 1854 objects from the THINGS dataset, we find that models trained with CLIP consistently achieve strong fine- and coarse-grained matching with human object representations. In contrast, self-supervised models showed limited matching at both fine- and coarse-grained levels, but still formed object clusters that reflected human coarse category structure. Our results offer new insights into the role of linguistic information in acquiring precise object representations and the potential of self-supervised learning to capture coarse categorical structures.
(Copyright © 2025 The Authors. Published by Elsevier Ltd. All rights reserved.)*
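
The abstract describes scoring an unsupervised alignment by whether each human object representation is mapped to the same object (fine-grained) or at least to an object of the same category (coarse-grained) in the model. The sketch below illustrates one plausible way to compute such scores from a transport plan that has already been estimated. It is not the authors' implementation: the function name `matching_rates`, the top-1 (row argmax) criterion, the toy plan, and the category labels are all assumptions made for illustration, and the Gromov-Wasserstein optimization that would produce the plan is not shown.

```python
from typing import List, Sequence, Tuple


def matching_rates(
    plan: Sequence[Sequence[float]], categories: Sequence[str]
) -> Tuple[float, float]:
    """Return (fine, coarse) top-1 matching rates for a square transport plan.

    Assumptions (not taken from the paper):
    - plan[i][j] is the mass moved from human object i to model object j;
    - both sides list the same objects in the same order, so a correct
      fine-grained match lies on the diagonal;
    - categories[i] is a hypothetical coarse category label for object i.
    """
    n = len(plan)
    fine_hits = 0
    coarse_hits = 0
    for i, row in enumerate(plan):
        # Model object receiving the most mass from human object i.
        j = max(range(n), key=row.__getitem__)
        fine_hits += int(j == i)
        coarse_hits += int(categories[j] == categories[i])
    return fine_hits / n, coarse_hits / n


# Toy 4-object plan (hypothetical values, each row sums to 0.25).
plan: List[List[float]] = [
    [0.20, 0.03, 0.01, 0.01],  # object 0 -> 0: same object
    [0.15, 0.08, 0.01, 0.01],  # object 1 -> 0: wrong object, same category
    [0.01, 0.01, 0.21, 0.02],  # object 2 -> 2: same object
    [0.19, 0.01, 0.03, 0.02],  # object 3 -> 0: wrong object, wrong category
]
categories = ["animal", "animal", "tool", "tool"]

fine, coarse = matching_rates(plan, categories)
print(f"fine-grained: {fine:.2f}, coarse-grained: {coarse:.2f}")
```

In this toy case the coarse-grained rate exceeds the fine-grained rate because object 1 is mapped to a different object within its own category. That is the kind of gap the abstract's fine versus coarse distinction is meant to capture.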

*Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.*