Research on forest fire video recognition based on improved Vision Transformer

ZHANG Min, XIN Ying, HUANG Tianqi

JOURNAL OF NANJING FORESTRY UNIVERSITY ›› 2025, Vol. 49 ›› Issue (4) : 186-194. DOI: 10.12302/j.issn.1000-2006.202407013

Abstract

【Objective】To resolve the limitations of existing forest fire recognition algorithms in temporal feature utilization and computational efficiency, this study proposes a video-based recognition model (C3D-ViT) that enhances both detection accuracy and operational efficiency in practical forest monitoring scenarios.【Method】We present a hybrid architecture integrating a 3D convolutional neural network (3DCNN) with a Vision Transformer (ViT). The framework employs 3D convolution kernels to extract spatiotemporal features from video sequences, which are then tokenized into vector representations. The Vision Transformer's self-attention mechanism globally models feature relationships across temporal and spatial dimensions, and final classification is performed by the MLP Head layer. Comprehensive ablation studies and comparative experiments were conducted against ResNet50, LSTM, YOLOv5, and the baseline 3DCNN and ViT models.【Result】C3D-ViT achieves 96.10% accuracy, outperforming ResNet50 (89.07%), LSTM (93.26%), and YOLOv5 (91.46%), and improving on the original 3DCNN (93.91%) and Vision Transformer (90.43%). The improved C3D-ViT model delivers higher recognition accuracy and remains stable under unfavorable conditions such as occlusion, long distance, and thin smoke, while meeting real-time detection requirements.【Conclusion】The C3D-ViT framework effectively addresses spatiotemporal modeling challenges in wildfire detection through synergistic CNN-Transformer interaction, providing a technically viable solution for next-generation forest fire early warning systems.
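To make the pipeline described in the Method section concrete, the sketch below shows one way a C3D-ViT-style hybrid could be wired up in PyTorch: a 3D-CNN stem extracts spatiotemporal features from a video clip, the feature map is flattened into tokens, a Transformer encoder applies self-attention across them, and an MLP head produces the fire/no-fire classification. The layer counts, channel widths, clip size, and the omission of positional embeddings are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch of the C3D-ViT idea from the abstract (assumed hyperparameters).
import torch
import torch.nn as nn

class C3DViTSketch(nn.Module):
    def __init__(self, num_classes=2, embed_dim=256, depth=4, num_heads=8):
        super().__init__()
        # 3D convolutional stem: (B, 3, T, H, W) -> (B, embed_dim, T', H', W')
        self.c3d = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.BatchNorm3d(128), nn.ReLU(inplace=True),
            nn.Conv3d(128, embed_dim, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.BatchNorm3d(embed_dim), nn.ReLU(inplace=True),
        )
        # Transformer encoder models global relations among the spatiotemporal tokens
        # (positional embeddings omitted here for brevity).
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.mlp_head = nn.Sequential(nn.LayerNorm(embed_dim),
                                      nn.Linear(embed_dim, num_classes))

    def forward(self, clip):                       # clip: (B, 3, T, H, W)
        feat = self.c3d(clip)                      # (B, C, T', H', W')
        tokens = feat.flatten(2).transpose(1, 2)   # (B, T'*H'*W', C)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)   # prepend class token
        encoded = self.encoder(tokens)             # global self-attention
        return self.mlp_head(encoded[:, 0])        # classify from class token

# Example: a batch of two 16-frame 112x112 RGB clips.
if __name__ == "__main__":
    model = C3DViTSketch()
    logits = model(torch.randn(2, 3, 16, 112, 112))
    print(logits.shape)  # torch.Size([2, 2])
```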

Key words

forest fire / deep learning / object detection / 3DCNN / Vision Transformer (ViT)

Cite this article

ZHANG Min, XIN Ying, HUANG Tianqi. Research on forest fire video recognition based on improved Vision Transformer[J]. JOURNAL OF NANJING FORESTRY UNIVERSITY, 2025, 49(4): 186-194. https://doi.org/10.12302/j.issn.1000-2006.202407013
