*Result*: Inference of drowning sites of cases in the Pearl river based on microbial community profiling and random forest algorithm.
*Further Information*
*Accurate inference of drowning sites remains a critical challenge in forensic investigations, particularly for corpses recovered from dynamic aquatic environments. Conventional methods, such as diatom testing, are limited by the absence or scarcity of diatoms in certain water bodies, labor-intensive morphological identification, and challenges in distinguishing morphologically similar species. In this study, we explored the feasibility of inferring drowning sites in human cases by integrating pulmonary microbial community profiling with machine learning. A total of 56 lung tissue samples from confirmed drowning victims were collected from four regions of the Pearl River's Guangzhou section: the central urban waterfront (site1), the mid-reach brackish transition zone (site2), the southern estuarine outflow zone (site3), and the eastern tributary confluence (site4). High-throughput sequencing of the 16S rRNA gene (V3-V4 region) was performed to characterize microbial community composition. Significant spatial heterogeneity in pulmonary microbiota was observed across drowning sites, as demonstrated by alpha diversity analysis, unweighted UniFrac-based principal coordinates analysis, and differential abundance testing. Linear discriminant analysis effect size (LEfSe) further identified 111 differentially abundant microbial taxa, providing biological interpretation of spatial microbial variation among groups. To enable drowning site inference, microbial features at the genus level were subjected to feature engineering using a hybrid strategy combining variance thresholding and the Boruta algorithm. Through this process, 32 genera, including Ralstonia, Sphingomonas, Akkermansia, and Faecalibacterium, were selected as key microbial markers for geolocation.
Multiple classification models, including Random Forest (RF), Decision Tree (DT), Support Vector Machine (SVM), and Logistic Regression (LR), were constructed and compared. The RF model exhibited the best predictive performance, achieving a test-set accuracy of 92.3% and a macro-average area under the receiver operating characteristic curve (AUC) of 0.949. External validation using five independent cases further confirmed the model's practical utility, correctly predicting the drowning sites for four of the five victims. Overall, this study preliminarily demonstrates the feasibility of inferring drowning locations through pulmonary microbiome analysis combined with machine learning, a novel application of this approach to human cases. Future efforts should expand geographic sampling and integrate environmental metadata to enhance methodological robustness. [ABSTRACT FROM AUTHOR]*
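
The macro-average AUC reported above is, for a four-class problem such as site1 to site4, commonly computed one-vs-rest: each site is treated in turn as the positive class, a binary AUC is computed from the predicted probability for that site, and the four values are averaged with equal weight. The abstract does not say how the authors computed it, so the following is only a minimal sketch under that assumption. The function names `binary_auc` and `macro_ovr_auc` are hypothetical, and the toy labels and probabilities are invented for illustration and are not data from the study.

```python
def binary_auc(labels, scores):
    """Binary AUC via the rank-sum (Mann-Whitney U) formulation.

    labels: list of 0/1 values; scores: predicted scores for the positive class.
    Tied scores receive their average rank.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined without both classes present")
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def macro_ovr_auc(y_true, proba, classes):
    """Unweighted mean of the one-vs-rest AUCs, one per class."""
    aucs = []
    for idx, cls in enumerate(classes):
        labels = [1 if y == cls else 0 for y in y_true]
        scores = [row[idx] for row in proba]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / len(aucs)


# Toy example (invented values): five samples, four candidate sites.
classes = ["site1", "site2", "site3", "site4"]
y_true = ["site1", "site1", "site2", "site3", "site4"]
proba = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],  # a site1 sample scored highest for site2
    [0.3, 0.4, 0.2, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.2, 0.6],
]
macro = macro_ovr_auc(y_true, proba, classes)
print(round(macro, 4))  # 0.8958
```

In this toy case the per-site AUCs are 5/6, 3/4, 1, and 1, so the macro average is 43/48. Because every site counts equally regardless of how many samples it contributes, a poorly separated site with few cases lowers the macro average as much as a well-sampled one would.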