*Result*: 基于 CUDA 和比特切片的 SM4 算法软件优化和实现.

Title:
基于 CUDA 和比特切片的 SM4 算法软件优化和实现.
Alternate Title:
Software optimization and implementation of SM4 algorithm based on CUDA and bit slicing.
Authors:
吴江雨1, 何鹏1,2,3 penghe@hubu.edu.cn
Source:
Application Research of Computers / Jisuanji Yingyong Yanjiu. Sep2025, Vol. 42 Issue 9, p2825-2833. 9p.
Database:
Academic Search Index

*Further Information*

*The SM4 algorithm, China's national standard symmetric encryption algorithm, plays a crucial role in high-quality, efficient data protection. Current optimizations of SM4 focus mainly on bit slicing and instruction-set optimization. However, bit slicing involves frequent data interaction, while instruction-set optimization depends heavily on the underlying hardware, whose level of support varies across architectures. To address these issues, this paper proposes an improved bit-slicing method that optimizes data orchestration during data processing in order to improve data transmission efficiency. In addition, within the CUDA programming model, it implements SM4 for efficient, general-purpose parallel encryption on a local GPU. Experimental results show that, with bit slicing, the speedup ratio (Ep) reaches 3.03 when processing 32 KB of plaintext. Compared with the general SM4 algorithm, the optimized SM4 algorithm achieves an encryption speed of 14,648 Mbit/s at 2.0 cycles/byte, a performance improvement of 40% to 215%. The proposed scheme substantially improves the encryption and decryption efficiency of SM4 through GPU parallel acceleration. With the improved bit-slicing optimization, it also speeds up the processing of small data while improving security. [ABSTRACT FROM AUTHOR]*

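*For SM4, China's national standard symmetric encryption algorithm, encryption efficiency is a key factor in achieving high-quality, efficient data protection. Current SM4 optimization work centers on bit slicing and instruction-set optimization. These two approaches suffer, respectively, from frequent data interaction and from heavy dependence on the underlying hardware, with differing levels of support across architectures. To address these problems, the paper proposes an improved bit-slicing method for optimizing data orchestration during data processing, thereby improving data transmission efficiency, and implements efficient, general-purpose parallel SM4 encryption on a local GPU within the CUDA programming model. Experimental results show that bit slicing also speeds up small data: the speedup ratio (Ep) reaches 3.03 for 32 KB of plaintext. In addition, compared with the general SM4 algorithm, the optimized SM4 algorithm reaches an encryption speed of 14,648 Mbit/s and 2.0 cycles/byte, a performance improvement of 40% to 215%. With GPU parallel acceleration the scheme greatly improves the encryption and decryption efficiency of the current SM4 algorithm; with the improved bit-slicing optimization it also speeds up small data, and security is well improved. [ABSTRACT FROM AUTHOR]*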
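
The bit-slicing idea named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's improved data-orchestration method or its CUDA kernel, which the abstract does not describe. It shows only the general technique, applied to SM4's linear transform L(B) = B ⊕ (B ⋘ 2) ⊕ (B ⋘ 10) ⊕ (B ⋘ 18) ⊕ (B ⋘ 24): bit i of many 32-bit words is gathered into one "slice", so a single XOR on slices processes every word at once and each rotation becomes a renaming of slice indices. All function names here are hypothetical and chosen for illustration.

```python
import random

WORD_BITS = 32
MASK32 = 0xFFFFFFFF
ROTATIONS = (0, 2, 10, 18, 24)  # rotation amounts of SM4's linear transform L


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    n %= WORD_BITS
    return ((x << n) | (x >> (WORD_BITS - n))) & MASK32


def sm4_linear(word):
    """Word-wise reference: L(B) = B ^ (B<<<2) ^ (B<<<10) ^ (B<<<18) ^ (B<<<24)."""
    out = 0
    for r in ROTATIONS:
        out ^= rotl32(word, r)
    return out


def to_slices(words):
    """Transpose: bit j of slices[i] equals bit i of words[j]."""
    slices = [0] * WORD_BITS
    for j, w in enumerate(words):
        for i in range(WORD_BITS):
            slices[i] |= ((w >> i) & 1) << j
    return slices


def from_slices(slices, count):
    """Inverse transpose: rebuild `count` words from the bit slices."""
    words = [0] * count
    for i, s in enumerate(slices):
        for j in range(count):
            words[j] |= ((s >> j) & 1) << i
    return words


def sm4_linear_bitsliced(slices):
    """Apply L to all packed words at once.

    Output bit i of rotl(x, r) is input bit (i - r) mod 32, so a rotation
    costs nothing here: it only selects a different input slice.
    """
    out = []
    for i in range(WORD_BITS):
        acc = 0
        for r in ROTATIONS:
            acc ^= slices[(i - r) % WORD_BITS]
        out.append(acc)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    words = [rng.getrandbits(32) for _ in range(32)]  # 32 words, one per slice bit
    sliced = from_slices(sm4_linear_bitsliced(to_slices(words)), len(words))
    assert sliced == [sm4_linear(w) for w in words]
    print("bit-sliced result matches word-wise result")
```

The transposition in `to_slices` and `from_slices` is pure data-movement overhead. That overhead is one plausible reading of the "frequent data interaction" cost the abstract attributes to bit slicing, though the abstract itself does not spell this out.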