Title:
Integration of WRF-Chem Model-Based, Satellite-Based, and Ground-Based Observation Data to Predict PM2.5 Concentration by Machine Learning Approach.
Source:
Atmosphere; Nov2025, Vol. 16 Issue 11, p1304, 31p
Database:
Complementary Index

*Further Information*

*Fine particulate matter (PM2.5) is a critical environmental and health concern in northern Thailand, where haze episodes are strongly influenced by biomass burning, meteorological variability, and complex topography. This study aims to (1) analyze and select input variables for PM2.5 prediction by integrating WRF-Chem outputs, satellite data, and ground observations, and (2) evaluate the predictive performance of four machine learning (ML) algorithms, namely Random Forest (RF), XGBoost, CNN3D, and ConvLSTM, during the 2024 haze season (January–May). The dataset included hourly PM2.5 observations from 54 stations, WRF-Chem-simulated PM2.5 and meteorological variables, satellite-based fire data, and geographical data. To improve consistency with ground-based data, WRF-Chem PM2.5 values were bias-corrected for the training and validation phases prior to ML training. Of the five methods tested for bias correction (Linear Regression, RF, XGBoost, Artificial Neural Network (ANN), and Convolutional Neural Network (CNN)), RF achieved the best performance (R = 0.78, RMSE = 29.28 µg/m³); the RF-corrected WRF-Chem PM2.5 was then used as an input to the forecasting stage. Variable selection was supported by correlation, variance inflation factor (VIF), feature importance, and SHAP analyses. The results indicate that RF provided the most reliable predictions, achieving a correlation of R = 0.867 and the lowest RMSE of 27.6 µg/m³ when using the SHAP+VIF-selected input set (seven variables: PM2.5_lag1, PM2.5_lag24, T2, RH2, Precip, Burned Area, NDVI). Notably, RF remained the top performer under high-pollution conditions, predicting PM2.5 more accurately than the other algorithms in the Air Quality Index (AQI) categories "Unhealthy for Sensitive Groups" (high) and "Unhealthy" (very high). Overall, RF performed best in both the bias-correction and forecasting stages, XGBoost ranked second, and CNN3D and ConvLSTM performed considerably worse. These findings emphasize the effectiveness of ensemble tree-based algorithms combined with bias-corrected WRF-Chem outputs and strategic variable selection in supporting accurate hourly PM2.5 predictions for air quality management in biomass burning regions. [ABSTRACT FROM AUTHOR]

Copyright of Atmosphere is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)*
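
The abstract names two lagged inputs (PM2.5_lag1, PM2.5_lag24) and reports skill as R and RMSE. The sketch below illustrates one way such lag features and metrics could be computed for a single station's hourly series. It is not the authors' code: the function names (`add_lags`, `rmse`, `pearson_r`), the synthetic series, and the lag-1 persistence baseline are all assumptions introduced here for illustration.

```python
import math

def add_lags(series, lags=(1, 24)):
    """Pair each hourly value with its lagged values (hypothetical helper).

    Returns a list of (features, target) tuples, starting at the first
    hour for which every requested lag is available.
    """
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        features = [series[t - k] for k in lags]
        rows.append((features, series[t]))
    return rows

def rmse(obs, pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def pearson_r(obs, pred):
    """Pearson correlation coefficient between observations and predictions."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    return cov / (sd_o * sd_p)

# Synthetic 72-hour series with a daily cycle and a slow trend.
# These numbers are made up; they are not data from the study.
pm25 = [20.0 + 10.0 * math.sin(2 * math.pi * h / 24) + 0.1 * h for h in range(72)]

rows = add_lags(pm25)                      # features are [lag1, lag24]
obs = [target for _, target in rows]
persistence = [features[0] for features, _ in rows]  # assumed baseline: predict lag-1 value

print(f"rows: {len(rows)}")
print(f"persistence RMSE: {rmse(obs, persistence):.2f}")
print(f"persistence R:    {pearson_r(obs, persistence):.3f}")
```

In a fuller pipeline, the lagged columns would sit alongside the meteorological and fire variables as model inputs, and the same two metrics would be computed on held-out predictions rather than on a persistence baseline.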