Department of Computing, Imperial College London, London, U.K.
Department of Electronic and Electrical Engineering, University College London, London, U.K.
Corerain Technologies Ltd., Shenzhen, Guangdong, China
Bayesian neural networks (BayesNNs) have demonstrated their advantages in various safety-critical applications, such as autonomous driving and healthcare, due to their ability to capture and represent model uncertainty. However, standard BayesNNs must be run repeatedly because of the Monte Carlo sampling needed to quantify their uncertainty, which burdens their real-world hardware performance. To address this performance issue, this article systematically exploits the extensive structured sparsity and redundant computation in BayesNNs. Unlike the unstructured or structured sparsity in standard convolutional NNs, the structured sparsity of BayesNNs is introduced by Monte Carlo Dropout and the associated sampling required during uncertainty estimation and prediction, and it can be exploited through both algorithmic and hardware optimizations. We first classify the observed sparsity patterns into three categories: channel sparsity, layer sparsity, and sample sparsity. On the algorithmic side, a framework is proposed to automatically explore these three sparsity categories without sacrificing algorithmic performance. We demonstrate that structured sparsity can be exploited to accelerate CPU designs by up to 49 times and GPU designs by up to 40 times. On the hardware side, a novel hardware architecture is proposed to accelerate BayesNNs; it achieves high hardware performance using runtime-adaptable hardware engines and intelligent skipping support. Implementing the proposed hardware design on an FPGA, our experiments demonstrate that the algorithm-optimized BayesNNs achieve up to 56 times speedup compared with unoptimized BayesNNs. Compared with an optimized GPU implementation, our FPGA design achieves up to 7.6 times speedup and up to 39.3 times higher energy efficiency.
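To make the Monte Carlo Dropout sampling described above concrete, the sketch below runs repeated stochastic forward passes through a toy one-layer network and aggregates them into a predictive mean and a per-output uncertainty estimate. This is a minimal illustration only: the layer sizes, weights, dropout rate, and sample count are hypothetical and are not taken from the paper's models or hardware designs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: a single linear layer with dropout on the weights.
W = rng.standard_normal((8, 4))   # illustrative weight matrix
x = rng.standard_normal(8)        # one input example
p_drop = 0.5                      # dropout probability (illustrative)
T = 100                           # number of Monte Carlo samples

def mc_forward(x):
    # Dropout stays active at inference time: each pass samples a fresh
    # binary mask, so repeated runs of the same input yield different
    # outputs. The zeros in each mask are the sparsity that algorithmic
    # and hardware optimizations can skip.
    mask = rng.random(W.shape) > p_drop
    return x @ (W * mask) / (1.0 - p_drop)

samples = np.stack([mc_forward(x) for _ in range(T)])
mean = samples.mean(axis=0)       # predictive mean over T passes
std = samples.std(axis=0)         # per-output uncertainty estimate
```

The need for T full forward passes per prediction is exactly the repeated-execution cost the abstract refers to, and the dropped (zeroed) positions in each sampled mask are the structured sparsity the proposed framework and hardware exploit.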