Machine Learning Study Highlights Power of Data Augmentation in Lung Cancer Prediction

Published Date: September 10, 2025
By News Release

A new study examines how data augmentation techniques can improve the performance of machine learning models used for lung cancer risk prediction. The work addresses one of the most common obstacles in applying AI to medical datasets: class imbalance, where one category of cases vastly outnumbers the other and pushes a model toward the majority class, producing skewed and unreliable outcomes.

Lung cancer remains the world’s leading cause of cancer deaths, claiming an estimated 1.8 million lives each year. While survival rates improve significantly when the disease is caught early, most patients are diagnosed at a late stage due to the absence of clear early symptoms. Machine learning approaches hold promise for earlier identification of high-risk individuals by analyzing structured patient data such as demographics, smoking history, and symptoms. But in practice, imbalanced datasets, in which one class (typically the cancer-positive cases) is far rarer than the other, often make predictions for the rarer class unreliable.

To explore solutions, researchers tested a variety of resampling strategies combined with different machine learning classifiers. The dataset included 309 patient records with 16 attributes ranging from lifestyle habits like smoking and alcohol consumption to clinical indicators such as coughing, wheezing, chest pain, and fatigue. The researchers applied nine augmentation methods—including SMOTE, ADASYN, Borderline SMOTE, SMOTENC, and K-Means SMOTE—alongside ten classification models such as Logistic Regression, SVM, Random Forest, XGBoost, and Multi-Layer Perceptron (MLP).
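The core idea behind SMOTE-style augmentation can be shown in a few lines. The sketch below is not the study's code: the function name, the toy points, and the choice of k are all hypothetical, and it uses only the Python standard library. It creates synthetic minority-class records by interpolating between a sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to it.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment between them.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical toy data: four minority-class points with two numeric features.
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like_oversample(minority_points, n_new=6)
```

Because every synthetic point lies between two real ones, plain interpolation yields fractional values for categorical attributes. Variants such as SMOTENC are designed for that mix of feature types.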

Using 5-fold stratified cross-validation and hyperparameter tuning, the team systematically compared how each pairing performed on key metrics such as accuracy, recall, F1-score, and AUC-ROC. Among all tested combinations, K-Means SMOTE coupled with MLP stood out, achieving 93.55% accuracy and an AUC-ROC of 96.76%. According to the study, this pairing was particularly effective at balancing sensitivity and specificity and captured subtle patterns in the data without apparent overfitting.
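To make the evaluation setup concrete, here is a minimal sketch of stratified fold assignment and of three of the metrics named above, written with the standard library only. The helper names and toy labels are hypothetical and the study's actual pipeline is not described in this release. A common precaution, assumed here rather than confirmed for the study, is to apply oversampling only to the training portion of each split so that synthetic records never leak into the held-out fold.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to folds so each fold keeps roughly the overall class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def binary_metrics(y_true, y_pred):
    """Accuracy, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


# Hypothetical labels: 40 positives and 10 negatives.
labels = [1] * 40 + [0] * 10
folds = stratified_folds(labels)
```

With these toy labels each of the five folds receives eight positives and two negatives, so every held-out fold reflects the same 80/20 split as the full set.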

Other augmentation-classifier combinations also performed well. SMOTE paired with XGBoost produced an AUC-ROC of 95.83%, while Random Oversampling with SVM yielded 96.06%. In contrast, some pairings, such as ADASYN combined with Decision Trees, lagged behind, highlighting that not all methods are equally effective for handling imbalance in small clinical datasets.

To make predictions more transparent, the study incorporated LIME (Local Interpretable Model-agnostic Explanations), an explainable AI tool that highlights which features influenced a model’s individual decisions. Across different models, clinically relevant factors such as smoking, coughing, fatigue, yellow fingers, and shortness of breath consistently emerged as key predictors. These insights add a layer of interpretability that is essential for clinician trust and potential integration into real-world workflows.

Despite its encouraging results, the study’s authors caution that the findings are preliminary. The dataset was small and atypically skewed: lung cancer cases made up nearly 88% of entries, far above real-world prevalence, so the under-represented class was the cancer-free group. This raises concerns about overestimating performance. Additionally, important variables such as smoking intensity, family history, and genetic markers were not included. Without external validation on larger and more representative cohorts, the models cannot yet be considered clinically deployable.
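A short calculation shows why this skew matters when reading accuracy figures. The sketch below uses illustrative counts only, derived by rounding 88% of 309 records; the exact class counts are not given in this release and the function name is hypothetical. It computes what a trivial model that labels every record as cancer-positive would score.

```python
def always_positive_baseline(total, positive_share):
    """Metrics for a trivial model that labels every record as positive."""
    positives = round(positive_share * total)
    negatives = total - positives
    accuracy = positives / total   # all positives right, all negatives wrong
    sensitivity = 1.0              # every true positive is flagged
    specificity = 0.0              # no true negative is ever recognised
    return {
        "positives": positives,
        "negatives": negatives,
        "accuracy": accuracy,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Assumed split: about 88% of 309 records are positive.
baseline = always_positive_baseline(total=309, positive_share=0.88)
```

Under these assumed counts the trivial model reaches about 88% accuracy while its balanced accuracy is only 0.5, which is why metrics such as recall, F1-score, and AUC-ROC are reported alongside accuracy.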

Still, the research underscores the importance of carefully selecting augmentation techniques to strengthen machine learning performance in healthcare. By showing that specific augmentation-classifier pairings can both boost accuracy and provide interpretable outputs, the study lays groundwork for developing more reliable tools to aid in early lung cancer detection—a critical step in improving patient outcomes.