Sheng Wu Yi Xue Gong Cheng Xue Za Zhi. 2023 Apr 25; 40(2): 257–264.
PMCID: PMC10162928

Language: Chinese | English

A multi-behavior recognition method for macaques based on improved SlowFast network

Zhong Weifeng

School of Automation, Harbin University of Science and Technology, Harbin 150000, P. R. China

Xu Zhe

School of Automation, Harbin University of Science and Technology, Harbin 150000, P. R. China; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100000, P. R. China

Zhu Xiangyu

School of Automation, Harbin University of Science and Technology, Harbin 150000, P. R. China

Ma Xibo

School of Automation, Harbin University of Science and Technology, Harbin 150000, P. R. China; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100000, P. R. China

Corresponding author: Ma Xibo, Email: xibo.ma@ia.ac.cn

Here, Frame i denotes the i-th frame, and ResFrame i is the i-th residual frame, obtained by subtracting frame i from frame i + 1. In the TAS-MBR network, the initial clip contains 32 video frames; the input frames are converted into residual frames, and the fast pathway samples them at an interval of 2, giving 16 residual frames as input. A Transformer encoder is inserted after residual block 5 to capture the motion relationships between frames.
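The residual-frame conversion above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code; the function name is invented, and applying the temporal stride after the subtraction is an assumption about the order of operations:

```python
import numpy as np

def residual_frames(frames: np.ndarray, stride: int = 2) -> np.ndarray:
    """Convert raw frames to residual frames: ResFrame_i = Frame_{i+1} - Frame_i.

    frames: array of shape (T, H, W, C). Sampling the residuals at the fast
    pathway's interval of 2 turns a 32-frame clip into 16 residual frames.
    """
    # Cast to a signed type so the subtraction does not wrap around for uint8 input
    res = frames[1:].astype(np.int16) - frames[:-1].astype(np.int16)  # T-1 residuals
    return res[::stride]

# A dummy 32-frame clip of 112x112 RGB frames yields 16 residual frames
clip = np.random.randint(0, 256, size=(32, 112, 112, 3), dtype=np.uint8)
print(residual_frames(clip).shape)  # (16, 112, 112, 3)
```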

2.3. Lateral connections

To fuse the motion information extracted by the fast pathway with the semantic information extracted by the slow pathway, the concept of lateral connections was introduced [27]. Lateral connections are placed after residual blocks 2–5: the temporal information from the fast pathway is fused with the semantic information from the corresponding residual block of the slow pathway, and a 3-D convolution matches the feature-map sizes before the maps are added. Overall, the network follows the two-pathway structure of SlowFast. It differs from the SlowFast network of reference [18] in three respects: ① residual blocks 2–5 use two convolutional layers instead of the original three, with different numbers of convolutions and a slow-to-fast convolution kernel ratio of 8:1; ② in the fast pathway, adjacent input frames are subtracted to form residual frames; ③ after residual block 5, the fast pathway processes its feature maps with a Transformer encoder to extract additional temporal information.
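The lateral fusion can be sketched shape-by-shape. This is a simplified illustration, assuming the fast pathway carries 4× as many frames as the slow pathway (16 vs. 4, as in this network) and substituting a plain channel projection for the paper's strided 3-D convolution:

```python
import numpy as np

def lateral_fuse(slow_feat, fast_feat, proj):
    """Fuse fast-pathway features into the slow pathway (lateral connection).

    slow_feat: (C_s, T_s, H, W); fast_feat: (C_f, T_f, H, W) with T_f = alpha * T_s.
    proj: (C_s, C_f) channel projection standing in for a 3-D convolution; it is
    applied after sampling the fast features at temporal stride alpha, so the
    two feature maps match in size before being added.
    """
    alpha = fast_feat.shape[1] // slow_feat.shape[1]
    sampled = fast_feat[:, ::alpha]                        # (C_f, T_s, H, W)
    projected = np.einsum('sf,fthw->sthw', proj, sampled)  # (C_s, T_s, H, W)
    return slow_feat + projected

# Toy feature maps: 4 slow frames vs. 16 fast frames, 8:1 channel ratio
C_s, C_f, T_s, alpha, H, W = 64, 8, 4, 4, 14, 14
slow = np.random.rand(C_s, T_s, H, W)
fast = np.random.rand(C_f, T_s * alpha, H, W)
proj = np.random.rand(C_s, C_f)
fused = lateral_fuse(slow, fast, proj)
print(fused.shape)  # (64, 4, 14, 14)
```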

3. Experiments

3.1. Dataset

The proposed architecture is evaluated on MBVD-9. The MBVD-9 dataset consists of behavior video clips of rhesus and cynomolgus macaques of different sexes, ages, and viewing angles. It covers nine classes of macaque behavior: lying down, squatting, walking, moving up, moving down, hanging, standing upright, clinging, and eating. The videos are in .mp4 format at 15 or 60 frames/s, with 3,849 clips in total, a combined duration of 7.03 h, an average clip length of 6.58 s, and no clip shorter than 30 frames. To keep the data distribution roughly uniform, one quarter of each behavior class was randomly sampled as the test set and the remainder used for training, giving 2,874 training clips and 975 test clips, a 3:1 split.
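The per-class hold-out described above can be sketched as follows. This is a hypothetical helper (function and variable names are not from the paper) that randomly reserves a quarter of each class for testing:

```python
import random
from collections import defaultdict

def stratified_split(clips, test_frac=0.25, seed=0):
    """Hold out `test_frac` of each behavior class as the test set, so the
    class distribution stays roughly the same in both splits.

    clips: list of (clip_id, label) pairs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for clip_id, label in clips:
        by_label[label].append(clip_id)
    train, test = [], []
    for label, ids in by_label.items():
        rng.shuffle(ids)
        k = round(len(ids) * test_frac)
        test += [(i, label) for i in ids[:k]]
        train += [(i, label) for i in ids[k:]]
    return train, test

# 9 behavior classes; with 3,849 clips a 1/4 hold-out gives roughly a 3:1 split
data = [(f"clip{i}", i % 9) for i in range(3849)]
train, test = stratified_split(data)
print(len(train), len(test))
```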

3.2. Experimental setup

The experiments ran on a machine with 24 GB of GPU memory, a 12 TB hard disk, 64 CPU cores, and 4 GPUs. Training used mini-batch stochastic gradient descent with momentum 0.9 and an initial learning rate of 0.001; the learning rate was then updated dynamically, shrinking to 1/10 of its value whenever accuracy stopped improving. The batch size was 16 and training ran for 100 epochs.
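The reduce-on-plateau rule above can be sketched as a simple update function. `update_lr` is a hypothetical name for illustration; a real training loop would more likely use a framework scheduler such as PyTorch's `ReduceLROnPlateau` with `factor=0.1`:

```python
def update_lr(lr, acc_history, factor=0.1):
    """Shrink the learning rate to 1/10 of its value when validation
    accuracy stops rising; otherwise keep it unchanged."""
    if len(acc_history) >= 2 and acc_history[-1] <= acc_history[-2]:
        return lr * factor
    return lr

lr = 1e-3  # initial learning rate; SGD with momentum 0.9, batch size 16, 100 epochs
accs = [0.70, 0.80, 0.79]  # accuracy dropped at the last epoch -> reduce
lr = update_lr(lr, accs)
print(lr)
```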

For data processing, video frames were randomly cropped to 112 × 112. Frames were sampled at different intervals (denoted frameinterval) depending on video length, as given in Eq. (2):

Eq. (2) relates the total number of frames in a video to the sampling interval: the interval grows with the total frame count, so the sampled frames cover the video more evenly. Each sampled frame is horizontally flipped with 50% probability, and 32 frames are fed into the network, with 4 going to the slow pathway and 16 to the fast pathway.
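The spatial augmentation described above (random 112 × 112 crop, 50% horizontal flip) can be sketched in NumPy. The interval formula of Eq. (2) is not reproduced here, so this hypothetical helper only covers the per-clip augmentation step:

```python
import numpy as np

def preprocess_clip(frames, crop=112, rng=None):
    """Apply a random crop to crop x crop and a 50% horizontal flip to a
    sampled clip of shape (T, H, W, C)."""
    if rng is None:
        rng = np.random.default_rng()
    t, h, w, c = frames.shape
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    out = frames[:, y:y + crop, x:x + crop]
    if rng.random() < 0.5:
        out = out[:, :, ::-1]  # flip along the width axis
    return out

# A 32-frame clip is cropped from its original resolution down to 112 x 112
clip = np.zeros((32, 240, 320, 3), dtype=np.uint8)
print(preprocess_clip(clip).shape)  # (32, 112, 112, 3)
```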

3.3. Effectiveness of residual frames and the Transformer

As Table 2 shows, the backbone of SlowFast strongly affects recognition accuracy. Compared with backbones using a 101-layer or a 50-layer ResNet, the lighter backbone of the TAS-MBR network is better suited to the MBVD-9 dataset. TAS-MBR-1, TAS-MBR-2, and TAS-MBR-3 denote, respectively, TAS-MBR without residual frames and without the Transformer, TAS-MBR without the Transformer only, and TAS-MBR without residual frames only. As Table 2 shows, both residual frames and the Transformer improve classification accuracy, confirming their effectiveness in this network. TAS-MBR also clearly outperforms SlowFast with a 50-layer ResNet backbone.

Table 2. Ablation experiment results

Model | Residual frames | Transformer | Accuracy (%)
SlowFast, 101-layer ResNet | – | – | 77.34
SlowFast, 50-layer ResNet | – | – | 81.25
TAS-MBR-1 | – | – | 90.62
TAS-MBR-2 | ✓ | – | 92.45
TAS-MBR-3 | – | ✓ | 93.49
TAS-MBR | ✓ | ✓ | 94.53

3.4. Comparison with other networks

To validate the performance of the TAS-MBR network, it was compared with other action recognition networks on the MBVD-9 dataset: C3D, the two-stream inflated 3D ConvNet (I3D) [26], R3D with separated space-time convolutions (R(2+1)D) [27], temporal segment networks (TSN) [28], the two-stream convolutional neural network, the time-space Transformer (TimeSformer) [29], and TAS-MBR. The average classification accuracy of each network is listed in Table 3. C3D and TimeSformer, both pretrained on large datasets, come close to TAS-MBR without any large-scale pretraining. As Table 3 shows, even though all the other networks were pretrained on large datasets, TAS-MBR still achieves the best result.

Table 3. Comparison of different algorithms on the MBVD-9 dataset

Model | Pretraining dataset | Accuracy (%)
C3D | Sports1M | 94.45
I3D | Kinetics | 92.10
R(2+1)D | Kinetics | 89.54
TSN | Kinetics | 93.33
Two-stream CNN | Kinetics | 88.24
TimeSformer | Kinetics | 94.46
TAS-MBR | – | 94.53

3.5. Per-class classification accuracy of macaque behaviors

After validating the TAS-MBR network, this experiment reports its per-class accuracy on the MBVD-9 dataset. The accuracies for lying down, squatting, walking, moving up, moving down, hanging, standing upright, clinging, and eating are 90.86%, 91.37%, 96.54%, 93.87%, 93.94%, 99.46%, 93.03%, 94.35%, and 93.77%, respectively. TAS-MBR classifies all nine behaviors with over 90% accuracy. Hanging scores highest at 99.46%, possibly because the posture is stretched out and easy for the network to recognize. Lying down scores lowest at 90.86%, possibly because the action is relatively inconspicuous and harder to recognize. As Figure 4 shows, lying down and squatting are easily confused with each other, likely because the postures are similar, which lowers the accuracy of both classes.

Figure 4. Confusion matrix of per-class classification accuracy
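Per-class accuracies like those above are read off a confusion matrix: each class's accuracy is its diagonal entry divided by its row sum. A minimal sketch with toy numbers (not the paper's data):

```python
import numpy as np

def per_class_accuracy(conf):
    """Per-class accuracy from a confusion matrix (rows = true class,
    columns = predicted class): diagonal entries over row sums."""
    conf = np.asarray(conf, dtype=float)
    return np.diag(conf) / conf.sum(axis=1)

# Toy 3-class example: class 1 is confused with class 0 in 2 of its 10 clips
cm = [[10, 0, 0],
      [2, 8, 0],
      [0, 0, 10]]
print(per_class_accuracy(cm))
```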

4. Conclusion

This work recorded macaque behavior in realistic settings and proposed the TAS-MBR convolutional neural network to recognize macaque behaviors accurately and quickly. The main contributions are: ① a macaque behavior dataset, MBVD-9, containing videos of nine behavior classes from three viewing angles; ② an improved fast pathway for SlowFast using residual frames and a Transformer module, which raises classification accuracy; ③ the TAS-MBR network, which achieves the best results on the macaque behavior dataset. Experiments confirm that residual frames and the Transformer help the SlowFast fast pathway capture temporal information and that TAS-MBR classifies macaque behavior accurately.

Important statements

Conflict of interest: All authors declare that no conflict of interest exists.

Author contributions: Zhong Weifeng provided data analysis guidance and reviewed and revised the manuscript; Xu Zhe wrote the manuscript and carried out data processing, algorithm design, and experiment design and analysis; Zhu Xiangyu provided guidance on experiments and algorithm design; Ma Xibo led the project, collected and organized the data, and guided the writing.

Ethics statement: This study was approved by the Animal Ethics Committee of the Institute of Automation, Chinese Academy of Sciences (approval No. IA-202042).

Funding Statement

Major Project of the Ministry of Science and Technology of the People's Republic of China (2016YFA0100902); National Natural Science Foundation of China (82090051, 81871442)

References

1. Harrer S, Shah P, Antony B, et al. Artificial intelligence for clinical trial design. Trends Pharmacol Sci, 2019, 40(8): 577-591. doi: 10.1016/j.tips.2019.05.005.
2. Wang H, Brown P C, Chow E C Y, et al. 3D cell culture models: drug pharmacokinetics, safety assessment, and regulatory consideration. Clin Transl Sci, 2021, 14(5): 1659-1680. doi: 10.1111/cts.13066.
3. Plagenhoef M R, Callahan P M, Beck W D, et al. Aged rhesus monkeys: cognitive performance categorizations and preclinical drug testing. Neuropharmacology, 2021, 187: 108489. doi: 10.1016/j.neuropharm.2021.108489.
4. Wu L, Wu D, Chen J, et al. Intranasal salvinorin A improves neurological outcome in rhesus monkey ischemic stroke model using autologous blood clot. J Cereb Blood Flow Metab, 2021, 41(4): 723-730. doi: 10.1177/0271678X20938137.
5. Tong Anyang, Tang Chao, Wang Wenjian. Human action recognition based on fusion of two-stream network and support vector machine. Pattern Recognition and Artificial Intelligence, 2021, 34(9): 863-870. doi: 10.16451/j.cnki.issn1003-6059.202109009.
6. Klaser A, Marszałek M, Schmid C. A spatio-temporal descriptor based on 3D-gradients//19th British Machine Vision Conference (BMVC), Leeds: British Machine Vision Association, 2008, 275: 1-10.
7. Laptev I, Marszalek M, Schmid C, et al. Learning realistic human actions from movies//IEEE Conference on Computer Vision and Pattern Recognition, Alaska: IEEE, 2008. doi: 10.1109/CVPR.2008.4587756.
8. Dalal N, Triggs B, Schmid C. Human detection using oriented histograms of flow and appearance//European Conference on Computer Vision, Graz: IEEE, 2006: 428-441.
9. Messing R, Pal C, Kautz H. Activity recognition using the velocity histories of tracked keypoints//2009 IEEE 12th International Conference on Computer Vision, Kyoto: IEEE, 2009: 104-111.
10. Wang H, Kläser A, Schmid C, et al. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 2013, 103(1): 60-79. doi: 10.1007/s11263-012-0594-8.
11. Zhou Bo, Li Junfeng. Human action recognition combined with object detection. Acta Automatica Sinica, 2020, 46(9): 1961-1970. doi: 10.16383/j.aas.c180848.
12. Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos//Advances in Neural Information Processing Systems, Montreal: IEEE, 2014: 568-576.
13. Liu P, Lyu M, King I, et al. Selflow: self-supervised learning of optical flow//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, California: IEEE, 2019: 4571-4580.
14. Tran D, Bourdev L, Fergus R. Learning spatiotemporal features with 3D convolutional networks//Proceedings of the IEEE International Conference on Computer Vision, Santiago: IEEE, 2015: 4489-4497.
15. Christoph R, Pinz F A. Spatiotemporal residual networks for video action recognition//Advances in Neural Information Processing Systems, Barcelona: IEEE, 2016: 3468-3476.
16. He K, Zhang X, Ren S. Deep residual learning for image recognition//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas: IEEE, 2016: 770-778.
17. Donahue J, Anne Hendricks L, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas: IEEE, 2015: 2625-2634.
18. Feichtenhofer C, Fan H, Malik J. SlowFast networks for video recognition//Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul: IEEE, 2019: 6202-6211.
19. Li D, Zhang K, Li Z, et al. A spatiotemporal convolutional network for multi-behavior recognition of pigs. Sensors (Basel), 2020, 20(8): 2381-2399. doi: 10.3390/s20082381.
20. Li C, Zhong Q, Xie D, et al. Skeleton-based action recognition with convolutional neural networks//2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong: IEEE, 2017: 597-600.
21. Girdhar R, Carreira J, Doersch C, et al. Video action transformer network//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, California: IEEE, 2019: 244-253.
22. Vaswani A, Shazeer N, Parmar N. Attention is all you need//Advances in Neural Information Processing Systems, California: IEEE, 2017: 5998-6008.
23. Tao L, Wang X, Yamasaki T. Motion representation using residual frames with 3D CNN//IEEE International Conference on Image Processing, Abu Dhabi: IEEE, 2020: 1786-1790.
24. Bala P C, Eisenreich B R, Yoo S B M, et al. Automated markerless pose estimation in freely moving macaques with OpenMonkeyStudio. Nat Commun, 2020, 11(1): 4560. doi: 10.1038/s41467-020-18441-5.
25. Tang D H, Wang C Y, Huang X, et al. Inosine induces acute hyperuricaemia in rhesus monkey (Macaca mulatta) as a potential disease animal model. Pharmaceutical Biology, 2021, 59(1): 175-182.
26. Carreira J, Zisserman A. Quo vadis, action recognition? a new model and the kinetics dataset//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Venice: IEEE, 2017: 6299-6308.
27. Tran D, Wang H, Torresani L, et al. A closer look at spatiotemporal convolutions for action recognition//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT: IEEE, 2018: 6450-6459.
28. Wang L, Xiong Y, Wang Z, et al. Temporal segment networks: towards good practices for deep action recognition//European Conference on Computer Vision, Amsterdam: IEEE, 2016: 20-36.
29. Bertasius G, Wang H, Torresani L. Is space-time attention all you need for video understanding?//International Conference on Machine Learning (ICML), 2021. arXiv: 2102.05095.

Articles from Sheng Wu Yi Xue Gong Cheng Xue Za Zhi = Journal of Biomedical Engineering are provided here courtesy of West China Hospital of Sichuan University