Journal of Nanjing Forestry University (Natural Sciences Edition), 2022, Vol. 46, Issue 5: 152-160. doi: 10.12302/j.issn.1000-2006.202106023

• Research article •

Comparative analysis of hourly PM2.5 prediction based on multiple machine learning models

CHEN Jiankun (陈建坤), MU Fengyun (牟凤云), ZHANG Yongchuan (张用川), TIAN Tian (田甜), WANG Junxiu (王俊秀)

  1. School of Smart City, Chongqing Jiaotong University, Chongqing 400074, China
  • Received: 2021-06-19 Revised: 2021-09-05 Online: 2022-09-30 Published: 2022-10-19
  • Contact: MU Fengyun
  • Funding:
    National Key Research and Development Program of China (2019YFB2102503); Natural Science Foundation of Chongqing (cstc2019jcyj-msxmX0626)

Abstract:

【Objective】Six PM2.5 concentration prediction models were compared: XGBoost, LightGBM, random forest (RF), K-nearest neighbor (KNN), long short-term memory neural network (LSTM), and decision tree (DT), with the aim of predicting ambient PM2.5 concentrations accurately and in a timely manner.【Method】Using air quality monitoring data and meteorological data for Hechuan District, Chongqing, covering the whole of 2020, the maximum relevance minimum redundancy (MRMR) algorithm was applied for dimensionality reduction to select the optimal feature subset, which served as the input to each model. PM2.5 concentration was then predicted with each model in turn. Because PM2.5 concentrations differ considerably between seasons, predictions were also made season by season. To assess the performance of the models, the running time and memory usage of each model were measured, and the causes of the seasonal differences in prediction performance were examined on the basis of the correlation between PM2.5 and the feature variables and the importance of the feature variables.【Result】Overall prediction accuracy ranked, from highest to lowest, XGBoost, RF, LightGBM, LSTM, KNN, and DT. All six models predicted more accurately in autumn and winter than in spring and summer. The LightGBM model substantially reduced training time and memory usage while maintaining accuracy. Feature importance analysis showed that PM10 concentration, air temperature, and air pressure were highly important, whereas O3 concentration, wind direction, and NO2 concentration were relatively less important.【Conclusion】The optimal feature subset selected by MRMR dimensionality reduction predicts PM2.5 concentration well. By comparison, the XGBoost, RF, LightGBM, and LSTM models perform better in PM2.5 concentration prediction, and among them LightGBM shows the better overall performance.
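
The abstract does not state which MRMR variant was used, so the following is only a minimal sketch of the general idea: greedily pick the feature whose relevance to PM2.5 is high and whose redundancy with already-selected features is low. It assumes absolute Pearson correlation for both relevance and redundancy (mutual information is a common alternative) and the difference form of the score. The function names, feature names, and synthetic data are hypothetical and are not taken from the paper.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (norm_x * norm_y)


def mrmr_select(features, target, k):
    """Greedy MRMR sketch (assumed variant: |Pearson r|, difference form).

    features: dict mapping feature name -> list of values
    target:   list of target values (here, PM2.5 concentration)
    k:        number of features to keep
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    remaining = set(features)

    def score(name):
        if not selected:
            return relevance[name]  # first pick: relevance only
        redundancy = sum(
            abs(pearson(features[name], features[s])) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic hourly data, invented purely for illustration.
rng = random.Random(0)
n_hours = 500
pm10 = [rng.gauss(60, 20) for _ in range(n_hours)]
temperature = [rng.gauss(18, 8) for _ in range(n_hours)]
demo_features = {
    "PM10": pm10,
    "PM10_dup": [v + rng.gauss(0, 5) for v in pm10],  # near-copy of PM10
    "temperature": temperature,
    "noise": [rng.gauss(0, 1) for _ in range(n_hours)],
}
demo_pm25 = [0.6 * p - 0.8 * t + rng.gauss(0, 5) for p, t in zip(pm10, temperature)]

print(mrmr_select(demo_features, demo_pm25, k=2))
```

On this synthetic data the second pick is `temperature` rather than the near-copy of PM10, because the copy's redundancy with the first pick outweighs its relevance. This is the behavior that distinguishes MRMR from ranking features by relevance alone.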

Key words: PM2.5 prediction, machine learning, maximum relevance minimum redundancy (MRMR), meteorological factors

CLC number: