Journal of Nanjing Forestry University (Natural Science Edition) ›› 2022, Vol. 46 ›› Issue (5): 152-160. doi: 10.12302/j.issn.1000-2006.202106023
CHEN Jiankun, MU Fengyun, ZHANG Yongchuan, TIAN Tian, WANG Junxiu
Received: 2021-06-19
Revised: 2021-09-05
Online: 2022-09-30
Published: 2022-10-19
Contact: MU Fengyun
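Abstract:
[Objective] Six PM2.5 concentration prediction models, namely XGBoost, LightGBM, random forest (RF), K-nearest neighbor (KNN), long short-term memory neural network (LSTM), and decision tree (DT), were compared in order to predict ambient PM2.5 concentrations accurately and promptly. [Method] Using air quality monitoring data and meteorological data for the whole of 2020 from Hechuan District, Chongqing, the max-relevance and min-redundancy (MRMR) algorithm was applied for dimensionality reduction to select the optimal feature subset, which served as the input with which each model in turn predicted PM2.5 concentration. Because PM2.5 concentrations differ considerably between seasons, predictions were also made season by season. To examine the performance of each model, its running time and memory usage were measured, and the causes of seasonal differences in prediction performance were discussed on the basis of the correlations between PM2.5 and the feature variables and the importance of those variables. [Result] Overall prediction accuracy ranked from high to low as XGBoost, RF, LightGBM, LSTM, KNN, and DT. All six models predicted more accurately in autumn and winter than in spring and summer. LightGBM greatly reduced training time and memory usage while maintaining accuracy. Feature importance showed that PM10 concentration, air temperature, and air pressure were highly important, whereas O3 concentration, wind direction, and NO2 concentration were relatively less important. [Conclusion] The optimal feature subset selected by MRMR dimensionality reduction predicts PM2.5 concentration well. By comparison, the XGBoost, RF, LightGBM, and LSTM models perform better in PM2.5 concentration prediction, and LightGBM has the best overall performance among them.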
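
The feature-selection step named in the abstract can be illustrated with a minimal greedy sketch. This is not the authors' implementation: the MRMR criterion is defined with mutual information, whereas this sketch swaps in the absolute Pearson correlation as a simpler proxy for both relevance and redundancy, and the feature names and values are hypothetical toy data.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mrmr_select(features, target, k):
    """Greedily pick k feature names by relevance minus mean redundancy."""
    # Relevance: |r| between each feature and the target
    # (assumption: correlation stands in for mutual information).
    relevance = {name: abs(pearson(vals, target))
                 for name, vals in features.items()}
    selected = []
    remaining = sorted(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            # Redundancy: mean |r| with the features already chosen.
            redundancy = sum(abs(pearson(features[name], features[s]))
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data, not values from the study.
toy_features = {
    "PM10": [1, 2, 3, 5, 6],
    "PM10_dup": [1, 2, 3, 4, 6],   # nearly a copy of PM10
    "temp": [2, 1, 4, 3, 5],
}
toy_pm25 = [1, 2, 3, 4, 5]
chosen = mrmr_select(toy_features, toy_pm25, 2)
print(chosen)
```

On this toy input the sketch keeps `PM10` and then prefers `temp` over the near-duplicate `PM10_dup`, which is the behavior the redundancy penalty is meant to produce.
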
CHEN Jiankun, MU Fengyun, ZHANG Yongchuan, TIAN Tian, WANG Junxiu. Comparative analysis of hourly PM2.5 prediction based on multiple machine learning models[J]. Journal of Nanjing Forestry University (Natural Science Edition), 2022, 46(5): 152-160. DOI: 10.12302/j.issn.1000-2006.202106023.
Table 1 Sample of hourly station monitoring data (the PM2.5, PM10, CO, SO2, NO2, and O3 columns are mass concentrations/(μg·m-3))
Time | PM2.5 | PM10 | CO | SO2 | NO2 | O3 | Air temperature/℃ | Humidity/% | Wind speed/(m·s-1) | Wind direction/(°) | Air pressure/kPa |
---|---|---|---|---|---|---|---|---|---|---|---|
2020-01-23 00:00:00 | 86 | 116 | 1.256 | 20 | 34 | 26 | 12.57 | 77.83 | 0.19 | 4 | 99.360 98 |
2020-01-23 01:00:00 | 88 | 119 | 1.350 | 22 | 48 | 10 | 12.67 | 79.87 | 0.15 | 11 | 99.364 36 |
2020-01-23 02:00:00 | 87 | 118 | 1.292 | 20 | 39 | 10 | 12.24 | 81.78 | 0.22 | 40 | 99.356 69 |
2020-01-23 03:00:00 | 86 | 116 | 1.230 | 19 | 35 | 13 | 12.12 | 81.43 | 0.16 | 36 | 99.355 83 |
2020-01-23 04:00:00 | 79 | 109 | 1.194 | 18 | 26 | 23 | 11.85 | 83.48 | 0.18 | 30 | 99.311 74 |
2020-01-23 05:00:00 | 92 | 115 | 1.194 | 19 | 28 | 18 | 11.79 | 83.86 | 0.18 | 67 | 99.277 15 |
2020-01-23 06:00:00 | 81 | 108 | 1.166 | 19 | 27 | 16 | 11.64 | 85.29 | 0.16 | 48 | 99.220 48 |
2020-01-23 07:00:00 | 82 | 109 | 1.200 | 18 | 32 | 11 | 11.58 | 85.88 | 0.23 | 31 | 99.237 00 |
2020-01-23 08:00:00 | 79 | 102 | 1.202 | 19 | 35 | 5 | 11.59 | 85.55 | 0.20 | 50 | 99.308 14 |
2020-01-23 09:00:00 | 80 | 98 | 1.248 | 17 | 30 | 9 | 11.60 | 86.42 | 0.20 | 83 | 99.375 48 |
2020-01-23 10:00:00 | 88 | 105 | 1.253 | 17 | 28 | 11 | 11.79 | 86.28 | 0.20 | 112 | 99.445 52 |
2020-01-23 11:00:00 | 95 | 121 | 1.375 | 16 | 33 | 12 | 12.48 | 84.48 | 0.27 | 158 | 99.490 83 |
2020-01-23 12:00:00 | 97 | 123 | 1.385 | 15 | 39 | 10 | 12.71 | 83.67 | 0.26 | 159 | 99.495 35 |
2020-01-23 13:00:00 | 99 | 125 | 1.330 | 15 | 32 | 19 | 13.29 | 81.55 | 0.43 | 136 | 99.416 11 |
2020-01-23 14:00:00 | 109 | 142 | 1.399 | 14 | 40 | 18 | 14.24 | 76.73 | 0.27 | 286 | 99.329 84 |
2020-01-23 15:00:00 | 101 | 135 | 1.332 | 14 | 29 | 38 | 14.78 | 72.54 | 0.34 | 296 | 99.249 16 |
2020-01-23 16:00:00 | 83 | 120 | 1.245 | 12 | 25 | 60 | 13.74 | 75.04 | 0.43 | 292 | 99.227 36 |
2020-01-23 17:00:00 | 92 | 133 | 1.272 | 14 | 27 | 58 | 13.57 | 76.78 | 0.32 | 295 | 99.229 41 |
2020-01-23 18:00:00 | 58 | 90 | 1.232 | 11 | 26 | 61 | 13.32 | 75.50 | 0.27 | 298 | 99.303 92 |
2020-01-23 19:00:00 | 89 | 132 | 1.299 | 13 | 35 | 50 | 12.58 | 80.99 | 0.39 | 256 | 99.384 96 |
2020-01-23 20:00:00 | 107 | 139 | 1.358 | 15 | 41 | 34 | 11.89 | 84.82 | 0.25 | 333 | 99.472 48 |
2020-01-23 21:00:00 | 113 | 142 | 1.417 | 16 | 50 | 16 | 11.32 | 87.04 | 0.24 | 11 | 99.582 50 |
2020-01-23 22:00:00 | 114 | 144 | 1.330 | 17 | 43 | 14 | 10.98 | 89.70 | 0.18 | 18 | 99.644 05 |
2020-01-23 23:00:00 | 113 | 140 | 1.293 | 16 | 36 | 11 | 10.80 | 90.43 | 0.14 | 13 | 99.655 27 |
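
Since the abstract describes predicting PM2.5 season by season, a record like those in Table 1 has to be assigned to a season before modeling. The following is a minimal sketch of parsing one table row and labeling its season. The month-to-season mapping (meteorological seasons, with December to February as winter) and the field names are assumptions; the source does not state which boundaries or names were used.

```python
from datetime import datetime

# Assumed meteorological seasons; the source does not give month boundaries.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

# Hypothetical field names for the eleven numeric columns of Table 1.
FIELDS = ["PM2.5", "PM10", "CO", "SO2", "NO2", "O3", "temp_C",
          "humidity_pct", "wind_speed", "wind_dir_deg", "pressure_kPa"]


def parse_row(line):
    """Parse one pipe-delimited Table 1 row into a dict with a season label."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    when = datetime.strptime(cells[0], "%Y-%m-%d %H:%M:%S")
    record = {"time": when, "season": SEASON_BY_MONTH[when.month]}
    for name, cell in zip(FIELDS, cells[1:]):
        # Digit-group spaces (e.g. "99.360 98") are removed before conversion.
        record[name] = float("".join(cell.split()))
    return record


row = parse_row("2020-01-23 00:00:00 | 86 | 116 | 1.256 | 20 | 34 | 26 | "
                "12.57 | 77.83 | 0.19 | 4 | 99.360 98 |")
print(row["season"], row["PM2.5"], row["pressure_kPa"])
```

Grouping parsed records by `record["season"]` would then give one training set per season under the assumed boundaries.
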
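Table 2 The six prediction models and their parameter settings
Model | Model characteristics | Parameter settings |
---|---|---|
XGBoost | An ensemble machine learning algorithm based on decision trees; built on the gradient boosting framework, it combines multiple weak classifiers into a strong classifier[ | n_estimators=300, booster="dart", max_depth=9, learning_rate=0.1, reg_lambda=0.2 |
Random forest (RF) | Based on the classification tree algorithm; it improves prediction accuracy by aggregating a large number of classification trees, tolerates outliers and noise well, is not prone to overfitting during prediction, and is a tool for nonlinear modeling[ | n_estimators=600, oob_score=True |
LightGBM | A gradient boosting decision tree framework offering fast training, high efficiency, low memory usage, high accuracy, support for parallel and GPU learning, and the ability to handle large-scale data[ | max_depth=9, num_leaves=30, subsample=0.5, learning_rate=0.1, min_data_in_leaf=21 |
K-nearest neighbor (KNN) | The nearest-neighbor algorithm classifies each record in a data set. For prediction, given an input X, KNN searches the historical inputs for the K nearest feature values and then computes a weighted estimate over those K values to obtain the predicted value[ | K=11, weights='distance' |
Decision tree (DT) | A machine learning algorithm for classification and regression tasks; it accepts one or more input variables, has a single output, and is characterized by fast training and low memory usage[ | max_depth=19, min_samples_leaf=12, min_samples_split=6 |
Long short-term memory neural network (LSTM) | A special kind of RNN in which the repeating unit is different; the repeating unit is called a memory block. It mainly comprises three gates (forget gate, input gate, output gate) and one memory cell. The cell state in the network controls the information passed to the next time step. It has clear advantages for nonlinear time-series data[ | output_dim=60, activation='relu', epochs=30, batch_size=72, loss='mae', optimizer='adam' |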
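
The parameter settings in Table 2 can be kept in one place as plain keyword dictionaries, sketched below. The grouping and the dictionary name `MODEL_PARAMS` are illustrative assumptions, and `K` is written as `n_neighbors` on the assumption that a scikit-learn style KNN estimator is used; the source does not name the libraries or estimator classes.

```python
# Hyperparameters transcribed from Table 2 (values only; no training here).
MODEL_PARAMS = {
    "XGBoost": {"n_estimators": 300, "booster": "dart", "max_depth": 9,
                "learning_rate": 0.1, "reg_lambda": 0.2},
    "RF": {"n_estimators": 600, "oob_score": True},
    "LightGBM": {"max_depth": 9, "num_leaves": 30, "subsample": 0.5,
                 "learning_rate": 0.1, "min_data_in_leaf": 21},
    # Assumption: K in the table corresponds to n_neighbors.
    "KNN": {"n_neighbors": 11, "weights": "distance"},
    "DT": {"max_depth": 19, "min_samples_leaf": 12, "min_samples_split": 6},
    # LSTM settings are copied as listed, mixing layer and training options.
    "LSTM": {"output_dim": 60, "activation": "relu", "epochs": 30,
             "batch_size": 72, "loss": "mae", "optimizer": "adam"},
}

for name, params in MODEL_PARAMS.items():
    print(name, params)
```

If scikit-learn is the library in use, a dictionary such as `MODEL_PARAMS["KNN"]` could be unpacked into an estimator, for example `KNeighborsRegressor(**MODEL_PARAMS["KNN"])`; the choice of regressor classes is likewise an assumption.
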
Table 5 Prediction results of each model in different seasons
Model | Spring R2 | Spring σRMSE | Spring σMAE | Spring σMAPE/% | Summer R2 | Summer σRMSE | Summer σMAE | Summer σMAPE/% | Autumn R2 | Autumn σRMSE | Autumn σMAE | Autumn σMAPE/% | Winter R2 | Winter σRMSE | Winter σMAE | Winter σMAPE/% |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
XGBoost | 0.905 | 9.127 | 5.479 | 14.852 | 0.900 | 6.364 | 4.107 | 21.752 | 0.922 | 8.367 | 4.680 | 16.131 | 0.948 | 8.085 | 5.720 | 9.697 |
RF | 0.901 | 9.298 | 5.573 | 15.247 | 0.906 | 6.174 | 4.066 | 21.736 | 0.915 | 8.718 | 4.823 | 16.636 | 0.946 | 8.267 | 5.952 | 10.041 |
LightGBM | 0.903 | 9.214 | 5.754 | 16.440 | 0.902 | 6.295 | 4.156 | 22.912 | 0.915 | 8.717 | 4.952 | 17.790 | 0.942 | 8.567 | 6.096 | 10.339 |
DT | 0.861 | 11.019 | 6.453 | 17.430 | 0.876 | 7.098 | 4.681 | 24.701 | 0.895 | 9.687 | 5.451 | 18.936 | 0.927 | 9.585 | 6.848 | 11.441 |
KNN | 0.890 | 9.805 | 6.157 | 17.669 | 0.889 | 6.693 | 4.472 | 25.308 | 0.905 | 9.210 | 5.322 | 19.870 | 0.939 | 8.753 | 6.311 | 11.238 |
LSTM | 0.889 | 9.845 | 6.028 | 17.639 | 0.885 | 6.822 | 4.455 | 24.485 | 0.932 | 7.816 | 5.033 | 18.477 | 0.937 | 8.877 | 6.342 | 10.949 |
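
For readers unfamiliar with the four scores reported in Table 5, the sketch below computes them with their common textbook definitions. It assumes R2 is the coefficient of determination (1 minus residual over total sum of squares) and that MAPE is expressed in percent; the source does not spell out its formulas, and the sample values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, RMSE, MAE, and MAPE (%) for paired observations."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)                      # residual SS
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)      # total SS
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        # MAPE is undefined when an observed value is zero.
        "MAPE": 100 * sum(abs(e) / abs(o)
                          for e, o in zip(errors, observed)) / n,
    }


# Hypothetical concentrations in μg·m-3, not values from the study.
m = regression_metrics([80, 100, 120, 100], [82, 98, 118, 104])
print(m)
```

Because MAPE divides by the observed value, low-concentration periods inflate it more than RMSE or MAE, which is worth keeping in mind when comparing the seasonal columns.
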
[1] | FUZZI S, BALTENSPERGER U, CARSLAW K, et al. Particulate matter,air quality and climate:lessons learned and future needs[J]. Atmos Chem Phys, 2015, 15(14):8217-8299.DOI:10.5194/acp-15-8217-2015. |
[2] | CESARI D, DE BENEDETTO G E, BONASONI P, et al. Seasonal variability of PM2.5 and PM10 composition and sources in an urban background site in southern Italy[J]. Sci Total Environ, 2018, 612:202-213.DOI:10.1016/j.scitotenv.2017.08.230. |
[3] | MANISALIDIS I, STAVROPOULOU E, STAVROPOULOS A, et al. Environmental and health impacts of air pollution:a review[J]. Front Public Health, 2020, 8:14.DOI:10.3389/fpubh.2020.00014. |
[4] | KIM K H, KABIR E, KABIR S. A review on the human health impact of airborne particulate matter[J]. Environ Int, 2015, 74:136-143.DOI:10.1016/j.envint.2014.10.005. |
[5] | LIU H Y, DUNEA D, IORDACHE S, et al. A review of airborne particulate matter effects on young children's respiratory symptoms and diseases[J]. Atmosphere, 2018, 9(4):150.DOI:10.3390/atmos9040150. |
[6] | CHOI S, KIM K H, KIM K, et al. Association between post-diagnosis particulate matter exposure among 5-year cancer survivors and cardiovascular disease risk in three metropolitan areas from south Korea[J]. Int J Environ Res Public Health, 2020, 17(8):2841.DOI:10.3390/ijerph17082841. |
[7] | ATKINSON R W, KANG S, ANDERSON H R, et al. Epidemiological time series studies of PM2.5 and daily mortality and hospital admissions:a systematic review and meta-analysis[J]. Thorax, 2014, 69(7):660-665.DOI:10.1136/thoraxjnl-2013-204492. |
[8] | ZHANG Y, BOCQUET M, MALLET V, et al. Real-time air quality forecasting,part I: history,techniques,and current status[J]. Atmos Environ, 2012, 60:632-655.DOI:10.1016/j.atmosenv.2012.06.031. |
[9] | LI F, ZHU B, AN J L, et al. Modeling study of a severe haze episode occurred over the Yangtze River Delta and its surrounding regions during early December,2013[J]. China Environ Sci, 2015, 35(7):1965-1974.DOI:10.3969/j.issn.1000-6923.2015.07.008. |
[10] | ZHOU G Q, XIE Y, WU J B, et al. WRF-Chem based PM2.5 forecast and bias analysis over the East China region[J]. China Environ Sci, 2016, 36(8):2251-2259.DOI:10.3969/j.issn.1000-6923.2016.08.002. |
[11] | DENNIS R L, BYUN D W, NOVAK J H, et al. The next generation of integrated air quality modeling:EPA's models-3[J]. Atmos Environ, 1996, 30(12):1925-1938.DOI:10.1016/1352-2310(95)00174-3. |
[12] | CHEN Q Q, TAYLOR D. Transboundary atmospheric pollution in southeast Asia:current methods,limitations and future developments[J]. Crit Rev Environ Sci Technol, 2018, 48(16/17/18):997-1029.DOI:10.1080/10643389.2018.1493337. |
[13] | SHIMADERA H, KOJIMA T, KONDO A. Evaluation of air quality model performance for simulating long-range transport and local pollution of PM2.5 in Japan[J]. Adv Meteorol, 2016, 2016:5694251.DOI:10.1155/2016/5694251. |
[14] | ZHENG Y, ZHU C Z. A prediction method of atmospheric PM2.5 based on DBNs[J]. J Shandong Univ (Eng Sci), 2014, 44(6):19-25.DOI:10.6040/j.issn.1672-3961.1.2014.180. |
[15] | QU Y, QIAN X, SONG H Q, et al. Machine-learning-based model and simulation analysis of PM2.5 concentration prediction in Beijing[J]. Chin J Eng, 2019, 41(3):401-407.DOI:10.13374/j.issn2095-9389.2019.03.014. |
[16] | LI J X, LIU X S, LIU J, et al. Prediction of PM2.5 concentration based on MRMR-HK-SVM model[J]. China Environ Sci, 2019, 39(6):2304-2310.DOI:10.3969/j.issn.1000-6923.2019.06.009. |
[17] | SONG G J, GUO X D, YANG X, et al. ARIMA-SVM combination prediction of PM2.5 concentration in Shenyang[J]. China Environ Sci, 2018, 38(11):4031-4039.DOI:10.3969/j.issn.1000-6923.2018.11.005. |
[18] | KANG J F, HUANG L X, ZHANG C Y, et al. Hourly PM2.5 prediction and its comparative analysis under multi-machine learning model[J]. China Environ Sci, 2020, 40(5):1895-1905.DOI:10.19674/j.cnki.issn1000-6923.2020.0213. |
[19] | KAYES I, SHAHRIAR S A, HASAN K, et al. The relationships between meteorological parameters and air pollutants in an urban environment[J]. Glob J Environ Sci Manag, 2019, 5(3):265-278.DOI:10.22034/GJESM.2019.03.01. |
[20] | WANG L M, WU X H, ZHAO T L, et al. A scheme for rolling statistical forecasting of PM2.5 concentrations based on distance correlation coefficient and support vector regression[J]. Acta Sci Circumstantiae, 2017, 37(4):1268-1276.DOI:10.13671/j.hjkxxb.2016.0345. |
[21] | JUHOS I, MAKRA L, TÓTH B. Forecasting of traffic origin NO and NO2 concentrations by support vector machines and neural networks using principal component analysis[J]. Simul Model Pract Theory, 2008, 16(9):1488-1502.DOI:10.1016/j.simpat.2008.08.006. |
[22] | WANG Z S, LI Y T, CHEN T, et al. Spatial-temporal characteristics of PM2.5 in Beijing in 2013[J]. Acta Geogr Sin, 2015, 70(1):110-120.DOI:10.11821/dlxb201501009. |
[23] | GUO L L, ZHAO C J. Optimizing parameters of support vector machine's model based on genetic algorithm[J]. Comput Eng Appl, 2009, 45(8):55-57.DOI:10.3778/j.issn.1002-8331.2009.08.017. |
[24] | CHEN T, TONG H, BENESTY M. Xgboost: eXtreme gradient boosting[M]. London: Sage Publications, 2016:931-961. |
[25] | FANG K N, WU J B, ZHU J P, et al. A review of technologies on random forests[J]. Stat Inf Forum, 2011, 26(3):32-38.DOI:10.3969/j.issn.1007-3116.2011.03.006. |
[26] | KE G, MENG Q, FINLEY T, et al. LightGBM: a highly efficient gradient boosting decision tree[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems, New York: ACM, 2017:3149-3157. |
[27] | SANG Y B. Research of classification algorithm based on K nearest neighbor[D]. Chongqing: Chongqing University, 2009. |
[28] | CARRERA-GARCÍA L, MUCHART J, LAZARO J J, et al. Pediatric SMA patients with complex spinal anatomy:implementation and evaluation of a decision-tree algorithm for administration of nusinersen[J]. Eur J Paediatr Neurol, 2021, 31:92-101.DOI:10.1016/j.ejpn.2021.02.009. |
[29] | LI X, PENG L, YAO X J, et al. Long short-term memory neural network for air pollutant concentration predictions:method development and evaluation[J]. Environ Pollut, 2017, 231(Pt 1):997-1004.DOI:10.1016/j.envpol.2017.08.114. |
[30] | PENG H C, LONG F H, DING C. Feature selection based on mutual information:criteria of max-dependency,max-relevance,and min-redundancy[J]. IEEE Trans Pattern Anal Mach Intell, 2005, 27(8):1226-1238.DOI:10.1109/TPAMI.2005.159. |
[31] | BAE JE, CHOI H, SHIN D W, et al. Fine particulate matter (PM2.5) inhibits ciliogenesis by increasing SPRR3 expression via c-Jun activation in RPE cells and skin keratinocytes[J]. Sci Rep, 2019, 9(1):3994.DOI:10.1038/s41598-019-40670-y. |
[32] | Ministry of Environmental Protection. HJ 633-2012: Technical regulation on ambient air quality index (on trial)[EB/OL]. [2021-05-10]. http://www.mee.gov.cn/ywgz/fgbz/bz/bzwb/jcffbz/201203/W020120410332725219541.pdf. |
[33] | MA X Y, JIA H L, SHA T, et al. Spatial and seasonal characteristics of particulate matter and gaseous pollution in China: implications for control policy[J]. Environ Pollut, 2019, 248:421-428.DOI:10.1016/j.envpol.2019.02.038. |
[34] | ZENG Z L, GUO J P, MA D X. Research of aerosol three-dimensional distribution based on multi-satellite data over Jiangxi[J]. J Atmos Environ Opt, 2016, 11(5):391-400.DOI:10.3969/j.issn.1673-6141.2016.05.007. |