
Journal of Nanjing Forestry University (Natural Sciences Edition), 2026, Vol. 50, Issue (3): 229-238. DOI: 10.12302/j.issn.1000-2006.202509033


Research article

Wetland area segmentation based on multi-scale features dual-coding Transformer

ZHAO Yuankun, HU Chunhua ^*

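Abstract

【Objective】To efficiently extract regional information from unmanned aerial vehicle (UAV) wetland images, delineate wetland functional zones, and obtain data on woodland, lakes and other areas quickly and accurately, MfdFormer, a segmentation network for UAV wetland remote sensing images based on a dual-encoder structure and Transformer, was designed.【Method】Image data were collected at the Hongze Lake wetland in Jiangsu Province to build a semantic segmentation dataset with five classes: aquaculture areas, cropland, woodland, lakes and other land. MfdFormer adopts a primary-secondary encoder structure and combines Transformer blocks with micro decoding channels to segment images semantically. The primary encoding channel uses pyramid downsampling modules, and the secondary encoding channel consists of semantic-completion sliding-window attention modules and ordinary attention modules. The primary and secondary decoding channels are connected in parallel, which reduces the feature loss caused by the downsampling modules and improves segmentation accuracy while keeping the parameter count and inference latency small. A semantic fusion module is added to the decoding channel to improve discrimination between similar classes.【Result】Trained and tested on the Hongze Lake image data, MfdFormer achieved a mean intersection over union (mIoU) of 88.07% with a network parameter count reported as 2.88 and an inference time of 48.69 ms. The intersection over union for woodland was 93.13%, and the mIoU was 0.68 percentage points higher than that of TopFormer and 0.76 percentage points higher than that of HRNet. On the public UAVid dataset, the network achieved an mIoU of 66.23%, 1.81 percentage points higher than TopFormer, which supports its competitiveness.【Conclusion】The MfdFormer semantic segmentation network achieves fast and accurate region segmentation of UAV images of the Hongze Lake wetland.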
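The mean intersection over union (mIoU) used as the headline metric can be illustrated with a minimal sketch. The code below is not the authors' evaluation script: the function names, the toy label lists, and the English class names are assumptions chosen for illustration, and only the five-class setup comes from the abstract.

```python
# Minimal sketch of per-class IoU and mIoU from flattened label lists.
# All names and the toy data are hypothetical; only the five-class setup
# follows the abstract.

def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def per_class_iou(cm):
    """IoU_c = TP / (TP + FP + FN) for each class c."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true class differs
        union = tp + fp + fn
        ious.append(tp / union if union else float("nan"))
    return ious

def mean_iou(ious):
    """Average IoU over classes that occur in the data (NaN entries skipped)."""
    valid = [v for v in ious if v == v]  # NaN is the only value not equal to itself
    return sum(valid) / len(valid)

# Toy example: ten pixels, five classes (class names are illustrative).
CLASSES = ["aquaculture", "cropland", "woodland", "lake", "other"]
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 1, 1, 1, 2, 2, 3, 0, 4, 4]

cm = confusion_matrix(y_true, y_pred, len(CLASSES))
ious = per_class_iou(cm)
miou = mean_iou(ious)

for name, value in zip(CLASSES, ious):
    print(f"{name:12s} IoU = {value:.4f}")
print(f"mIoU = {miou:.4f}")
```

A difference such as "0.68 percentage points" is then a plain subtraction of two mIoU values expressed in percent, not a relative change.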

Extended abstract

【Objective】Precise delineation of wetland ecosystems from unmanned aerial vehicle (UAV) imagery is essential for ecological resource management, particularly in environments with intricate land-cover patterns and dynamic seasonal variations. Existing semantic segmentation frameworks have two persistent limitations: feature degradation during hierarchical downsampling, and insufficient discriminative power for semantically overlapping categories (e.g., aquaculture ponds vs. natural water bodies). To overcome them, this study introduces MfdFormer, a dual-encoder Transformer-based network optimized for UAV-borne wetland remote sensing. The architecture balances computational efficiency against high-precision segmentation requirements, addressing operational constraints in real-time environmental monitoring.

【Method】The method centers on a hierarchically structured dual-encoding design. The primary encoder employs pyramidal spatial reduction modules with depthwise separable convolutions, compressing the input resolution through four stages while preserving discriminative edge features. The secondary encoder deploys an information-completing multiscale void attention (ICMVA) mechanism, which combines localized window-based self-attention with adaptive semantic gap-filling operations. This dual-path configuration captures fine-grained textures and long-range contextual dependencies concurrently, which is particularly important for distinguishing spectrally similar vegetation types. The decoding phase uses parameter-decoupling micro-decoders that progressively merge multi-scale features through channel-wise attention gating, followed by cross-level feature recalibration using 3×3 depthwise convolution. A semantic fusion module is incorporated into the decoder to improve discrimination between morphologically similar categories.

【Result】MfdFormer was evaluated on the Hongze Lake wetland dataset in Jiangsu Province, which comprises 1,872 annotated UAV images in five land-cover classes. The network achieves a mean intersection over union (mIoU) of 88.07% across all classes, with woodland reaching a class IoU of 93.13%. Under the same test conditions, it surpasses TopFormer by 0.68 percentage points and HRNet by 0.76 percentage points in mIoU. Cross-domain validation on the UAVid urban remote sensing benchmark supports the model's generalizability, with an mIoU 1.81 percentage points higher than TopFormer's. Ablation experiments isolate the contribution of the ICMVA mechanism through component substitution: replacing ICMVA with standard windowed attention degrades performance, most acutely in texture-heterogeneous regions with mixed vegetation canopies and fragmented water bodies, where IoU decreases by 1.16 percentage points.

【Conclusion】Omitting non-critical image information, combined with randomized multi-positional local feature extraction through iterative minimal sampling, enables effective reconstruction of global context, which improves the certainty and accuracy of segmentation boundaries while reducing computational demands. For categories with substantial intra-class shape variance, reducing detailed feature extraction mitigates the risk of overfitting. Multi-dimensional feature fusion shows significant potential for recognizing complex wetland categories, because integrating heterogeneous feature dimensions supports macro-scale object comprehension and thereby improves the segmentation of UAV-acquired wetland imagery. MfdFormer balances segmentation precision and computational efficiency through its dual-branch feature extraction and multi-scale semantic integration. Results across heterogeneous datasets validate its robustness on complex wetland landscapes with irregular boundaries and high intra-class variance, indicating practical value for large-scale wetland resource monitoring.
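Two of the efficiency ideas named in the abstract, depthwise separable convolution and window-based self-attention, can be made concrete with a small counting sketch. It uses the standard textbook formulas rather than anything specific to MfdFormer: the channel counts, kernel size, feature-map size and window size below are hypothetical values chosen for illustration, and bias terms are ignored.

```python
# Back-of-the-envelope counts for two generic efficiency techniques.
# All numeric settings are hypothetical; none are taken from MfdFormer.

def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise

def global_attention_pairs(h, w):
    """Query-key pairs when every token attends to every token."""
    n = h * w
    return n * n

def window_attention_pairs(h, w, win):
    """Query-key pairs when attention is restricted to win x win windows."""
    if h % win or w % win:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (h // win) * (w // win)
    tokens_per_window = win * win
    return num_windows * tokens_per_window ** 2

# Hypothetical layer: 64 -> 128 channels with a 3 x 3 kernel.
standard = conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
print(f"standard conv weights:  {standard}")
print(f"separable conv weights: {separable}")
print(f"reduction factor:       {standard / separable:.2f}")

# Hypothetical 64 x 64 feature map with 8 x 8 attention windows.
full = global_attention_pairs(64, 64)
windowed = window_attention_pairs(64, 64, 8)
print(f"global attention pairs: {full}")
print(f"window attention pairs: {windowed}")
print(f"reduction factor:       {full // windowed}")
```

Window attention trades this saving for a loss of cross-window context, which is the kind of gap that the semantic gap-filling operations described in the abstract are meant to compensate for.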

Cite this article

ZHAO Yuankun, HU Chunhua. Wetland area segmentation based on multi-scale features dual-coding Transformer[J]. Journal of Nanjing Forestry University (Natural Sciences Edition). 2026, 50(3): 229-238 https://doi.org/10.12302/j.issn.1000-2006.202509033

CLC number: X14; TP75; S717

References

[1] CUI L J, LEI Y R, ZHANG M Y, et al. Review on small wetlands: Definition, typology and ecological services[J]. Acta Ecologica Sinica, 2021, 41(5):2077-2085. DOI: 10.5846/stxb202003260699.

[2] YANG N, WANG W X, ZHAO X M. Rivers on fuzzy aerial images extracted by retinex and improved mini spanning tree segmentation algorithm[J]. Journal of Shandong Agricultural University (Natural Science Edition), 2017, 48(6):890-896. DOI: 10.3969/j.issn.1000-2324.2017.06.017.

[3] ZHAO Q Z, JIANG P, WANG X W, et al. Classification of protection forest tree species based on UAV hyperspectral data[J]. Transactions of the Chinese Society for Agricultural Machinery, 2021, 52(11):190-199. DOI: 10.6041/j.issn.1000-1298.2021.11.020.

[4] MARTINS, JUNIOR J M, MENEZES, et al. Image segmentation and classification with SLIC superpixel and convolutional neural network in forest context[C]//IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium. Yokohama, Japan: IEEE, 2019:6543-6546. DOI: 10.1109/igarss.2019.8898969.

[5] TONG X Y, XIA G S, LU Q K, et al. Land-cover classification with high-resolution remote sensing images using transferable deep models[J]. Remote Sensing of Environment, 2020, 237:111322. DOI: 10.1016/j.rse.2019.111322.

[6] RONNEBERGER O, FISCHER P, BROX T. U-Net: Convolutional networks for biomedical image segmentation[C]//Medical Image Computing and Computer-Assisted Intervention-MICCAI 2015. Cham: Springer International Publishing, 2015:234-241. DOI: 10.1007/978-3-319-24574-4_28.

[7] YU C Q, GAO C X, WANG J B, et al. BiSeNet V2: Bilateral network with guided aggregation for real-time semantic segmentation[J]. International Journal of Computer Vision, 2021, 129(11):3051-3068. DOI: 10.1007/s11263-021-01515-2.

[8] HOWARD A, SANDLER M, CHEN B, et al. Searching for MobileNetV3[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea: IEEE, 2019:1314-1324. DOI: 10.1109/iccv.2019.00140.

[9] LIN J R, ZHU H Q, YANG G, et al. Forest image segmentation method based on improved DeepLabv3+[J]. Journal of Forestry Engineering, 2024, 9(3):119-126. DOI: 10.13360/j.issn.2096-1359.202309002.

[10] SONG Y D, HE Z Q, QIAN H, et al. Vision transformers for single image dehazing[J]. IEEE Transactions on Image Processing, 2023, 32:1927-1941. DOI: 10.1109/TIP.2023.3256763.

[11] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6):84-90. DOI: 10.1145/3065386.

[12] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016:770-778. DOI: 10.1109/CVPR.2016.90.

[13] HE Z F, SHI B J, ZHANG Y H, et al. Remote sensing image segmentation of around plateau lakes based on multi-attention fusion[J]. Acta Electronica Sinica, 2023, 51(4):885-895.

[14] XIE E, WANG W, YU Z, et al. SegFormer: Simple and efficient design for semantic segmentation with transformers[C]//35th Conference on Neural Information Processing Systems (NeurIPS 2021). Electric Network: NeurIPS, 2021:12077-12090. DOI: 10.5555/3540261.3541185.

[15] ZHANG W Q, HUANG Z L, LUO G Z, et al. TopFormer: Token pyramid transformer for mobile semantic segmentation[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE, 2022:12073-12083. DOI: 10.1109/CVPR52688.2022.01177.

[16] ZHU X L, ZHANG H M, SHI M, et al. Study on the water quality of Hongze Lake[J]. Modern Chemical Research, 2023(5):83-85. DOI: 10.20087/j.cnki.1672-8114.2023.05.026.

[17] ZHONG H F, SUN H M, HAN D N, et al. Lake water body extraction of optical remote sensing images based on semantic segmentation[J]. Applied Intelligence, 2022, 52(15):17974-17989. DOI: 10.1007/s10489-022-03345-2.

[18] SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: Inverted residuals and linear bottlenecks[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018:4510-4520. DOI: 10.1109/CVPR.2018.00474.

[19] JIAO J Y, TANG Y M, LIN K Y, et al. DilateFormer: Multi-scale dilated transformer for visual recognition[J]. IEEE Transactions on Multimedia, 2023, 25:8906-8919. DOI: 10.1109/TMM.2023.3243616.

[20] CHEN W Y, JIANG Z Y, WANG Z Y, et al. Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019:8916-8925. DOI: 10.1109/CVPR.2019.00913.

[21] WANG L B, LI R, ZHANG C, et al. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2022, 190:196-214. DOI: 10.1016/j.isprsjprs.2022.06.008.

[22] LYU Y, VOSSELMAN G, XIA G S, et al. UAVid: A semantic segmentation dataset for UAV imagery[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2020, 165:108-119. DOI: 10.1016/j.isprsjprs.2020.05.009.