Wetland area segmentation based on multi-scale features dual-coding Transformer

ZHAO Yuankun, HU Chunhua

Journal of Nanjing Forestry University (Natural Sciences Edition) ›› 2026, Vol. 50 ›› Issue (3) : 229-238.

DOI: 10.12302/j.issn.1000-2006.202509033


Abstract

【Objective】Precise delineation of wetland ecosystems from unmanned aerial vehicle (UAV) imagery is essential for ecological resource management, particularly in environments with intricate land-cover patterns and dynamic seasonal variation. Existing semantic segmentation frameworks suffer from two persistent limitations: feature degradation during hierarchical downsampling, and insufficient discriminative power for semantically overlapping categories (e.g., aquaculture ponds vs. natural water bodies). To address them, this study introduces MfdFormer, a dual-encoder Transformer-based network optimized for UAV-borne wetland remote sensing. The architecture balances computational efficiency against high-precision segmentation, a key operational constraint in real-time environmental monitoring.【Method】The method centers on a hierarchically structured dual-encoding design. The primary encoder uses pyramidal spatial reduction modules with depthwise separable convolutions, compressing the input resolution through four stages while preserving discriminative edge features. The secondary encoder uses a novel information-completing multiscale void attention (ICMVA) mechanism, which combines localized window-based self-attention with adaptive semantic gap-filling operations. This dual-path configuration captures fine-grained textures and long-range contextual dependencies concurrently, which is particularly important for distinguishing spectrally similar vegetation types. In the decoding phase, parameter-decoupling micro-decoders progressively merge multi-scale features through channel-wise attention gating, followed by cross-level feature recalibration using 3×3 depthwise convolution.
A semantic fusion module is incorporated into the decoder to improve discrimination between morphologically similar categories.【Result】The model was evaluated on the Hongze Lake wetland dataset from Jiangsu Province, which comprises 1,872 annotated UAV images covering five land-cover classes. MfdFormer achieves a mean intersection-over-union (mIoU) of 88.07% across all classes, with the woodland class reaching a class-specific IoU of 93.13%. Under the same test conditions, it outperforms TopFormer by 0.68 percentage points and HRNet by 0.76 percentage points in mIoU. Cross-domain validation on the UAVid urban remote sensing benchmark further supports the model's generalizability, yielding an mIoU 1.81 percentage points higher than TopFormer's. Ablation experiments isolate the contribution of the ICMVA mechanism through component substitution.
Replacing ICMVA with standard windowed attention degrades performance, most noticeably in texture-heterogeneous regions with mixed vegetation canopies and fragmented hydrological features, where IoU decreases by 1.16 percentage points.【Conclusion】Omitting non-critical image information, combined with randomized multi-position local feature extraction through iterative minimal sampling, enables effective reconstruction of global context, which sharpens segmentation boundaries and improves their accuracy while reducing computational cost. For categories with substantial intra-class shape variance, reducing detailed feature extraction mitigates the risk of overfitting. Multi-dimensional feature fusion shows significant potential for recognizing complex wetland categories: integrating heterogeneous feature dimensions supports macro-scale object comprehension and ultimately improves segmentation of UAV-acquired wetland imagery. MfdFormer balances segmentation precision and computational efficiency through its dual-branch feature extraction and multi-scale semantic integration. Results across heterogeneous datasets indicate robustness on complex wetland landscapes with irregular boundaries and high intra-class variance, giving the method practical value for large-scale wetland resource monitoring.
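
The abstract reports results as class-specific IoU and mIoU. For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition, in which per-class IoU is TP / (TP + FP + FN) and mIoU is the unweighted mean over classes. It is not the authors' evaluation code. The function names, the three-class setup, and the toy labels are illustrative assumptions and do not come from the Hongze Lake dataset.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels; rows index the ground-truth class, columns the prediction."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_iou(cm: List[List[int]]) -> List[float]:
    """IoU_c = TP / (TP + FP + FN); NaN for a class absent from both labels and predictions."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class otherwise
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(cm: List[List[int]]) -> float:
    """Unweighted mean of per-class IoU, skipping classes with undefined IoU."""
    valid = [v for v in per_class_iou(cm) if v == v]  # NaN != NaN
    return sum(valid) / len(valid) if valid else float("nan")


# Toy example (hypothetical labels for 8 pixels and 3 classes).
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
ious = per_class_iou(cm)
miou = mean_iou(cm)
print([round(v, 4) for v in ious], round(miou, 4))
```

Differences such as "0.68 percentage points" are then plain subtractions of two mIoU values expressed in percent, not relative improvements.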

Key words

semantic segmentation / Hongze Lake wetland / area segmentation / dual coding / sliding window attention

Cite this article

ZHAO Yuankun, HU Chunhua. Wetland area segmentation based on multi-scale features dual-coding Transformer[J]. Journal of Nanjing Forestry University (Natural Sciences Edition), 2026, 50(3): 229-238. https://doi.org/10.12302/j.issn.1000-2006.202509033

The full text is translated into English by AI, aiming to facilitate reading and comprehension. The core content is subject to the explanation in Chinese.
