Title:
Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding.
Authors:
Liu, Yi¹ (liuyi0089@gmail.com); Li, Chengxin¹ (s23150812015@smail.cczu.edu.cn); Xu, Shoukun¹ (jpuxsk@163.com); Han, Jungong² (jungonghan77@gmail.com)
Source:
International Journal of Computer Vision. Jul 2025, Vol. 133, Issue 7, p4483-4503. 21p.
Database:
Academic Search Index

*Abstract*

*Multi-modal fusion plays a vital role in multi-modal scene understanding. Most existing methods focus on cross-modal fusion between two modalities, overlooking the more complex fusion of many modalities that real-world applications such as autonomous driving require, where visible, depth, event, LiDAR, and other sensors are used together. Moreover, the few existing attempts at multi-modal fusion, e.g., simple concatenation, cross-modal attention, and token selection, cannot adequately mine the intrinsic shared and specific details of multiple modalities. To tackle this challenge, we propose a Part-Whole Relational Fusion (PWRF) framework. For the first time, this framework treats multi-modal fusion as part-whole relational fusion: it routes multiple individual part-level modalities to a fused whole-level modality using the part-whole relational routing ability of Capsule Networks (CapsNets). Through this part-whole routing, PWRF derives modal-shared semantics from the whole-level modal capsules and modal-specific semantics from the routing coefficients. These modal-shared and modal-specific details can then be employed for multi-modal scene understanding tasks, including synthetic multi-modal segmentation and visible-depth-thermal salient object detection in this paper. Experiments on several datasets demonstrate the superiority of the proposed PWRF framework for multi-modal scene understanding. The source code has been released at https://github.com/liuyi1989/PWRF. [ABSTRACT FROM AUTHOR]*
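
*Illustrative Sketch*

To make the part-whole routing idea concrete, below is a minimal sketch of CapsNet-style dynamic routing (Sabour et al., 2017) applied to part capsules pooled from several modalities, as the abstract describes. This is not the authors' released implementation (see the GitHub link above); the function names, tensor shapes, and hyperparameters (`part_whole_routing`, `n_iters`, the capsule dimensions) are illustrative assumptions. In this reading, the whole-level capsules stand in for the modal-shared semantics and the routing coefficients for the modal-specific semantics.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Standard capsule squashing non-linearity (Sabour et al., 2017):
    # shrinks short vectors toward 0 and long vectors toward unit length.
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def part_whole_routing(part_caps, W, n_iters=3):
    """Route part-level modality capsules to whole-level capsules.

    part_caps: (B, N, d_in)          N = modalities x capsules per modality
    W:         (N, K, d_in, d_out)   learned part-to-whole transforms
    Returns:   whole capsules (B, K, d_out)   -> modal-shared semantics
               routing coefficients (B, N, K) -> modal-specific semantics
    """
    B, N, _ = part_caps.shape
    K = W.shape[1]
    # Prediction ("vote") vectors: u_hat[b, n, k] = W[n, k] @ part_caps[b, n]
    u_hat = torch.einsum('bni,nkio->bnko', part_caps, W)
    b_logits = torch.zeros(B, N, K, device=part_caps.device)
    for _ in range(n_iters):
        c = F.softmax(b_logits, dim=-1)           # routing coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)  # weighted vote aggregation
        v = squash(s)                             # whole-level capsules
        # Increase logits where votes agree with the current whole capsule.
        b_logits = b_logits + torch.einsum('bnko,bko->bnk', u_hat, v)
    return v, c

# Toy usage: 3 modalities (e.g. visible, depth, thermal), 8 part capsules each.
B, M, P, d_in, d_out, K = 2, 3, 8, 16, 32, 4
parts = torch.randn(B, M * P, d_in)
W = torch.randn(M * P, K, d_in, d_out) * 0.1
whole, coeffs = part_whole_routing(parts, W)
print(whole.shape, coeffs.shape)  # (2, 4, 32), (2, 24, 4)
```

Reshaping `coeffs` to (B, M, P, K) and aggregating per modality would yield one weight map per input modality, which is one plausible way the routing coefficients could carry modal-specific information; the paper itself should be consulted for how PWRF actually extracts and uses these semantics.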