Researchers from Southeast University and the Jiangsu Provincial Center for Disease Prevention and Control compare logistic regression with seven machine learning methods, including GA-RF, GRNN, and PNN, on SNP data from 1,338 noise-exposed workers. They apply cross-validation and hyperparameter tuning, then compare accuracy, AUC, and F-score for predicting noise-induced hearing loss.

Key points

  • Dataset of 1,338 noise-exposed workers genotyped at 88 SNP loci.
  • GA-RF achieved top accuracy (84.4%), F-score (0.773), R² (0.757), and AUC (0.752).
  • GRNN and PNN are neural networks whose hyperparameters were optimized; GRNN reached 97.5% accuracy on selected SNP combinations.
  • Classical ML (DT, GBDT, KNN, XGBoost) showed varied improvements over logistic regression.
  • Logistic regression’s AUC capped at 0.704, while ML methods uncovered nonlinear SNP interactions.

Why it matters: Applying advanced machine learning to high-dimensional SNP datasets reveals nuanced genetic risk factors for occupational hearing loss, surpassing traditional statistical models. This approach enables earlier, more precise identification of susceptible workers, paving the way for personalized prevention strategies in occupational health.

Q&A

  • What is noise-induced hearing loss?
  • What role do SNP loci play here?
  • How does GA-RF work?
  • Why use GRNN and PNN?
  • What metrics evaluate model performance?

Machine Learning in Genetic Risk Prediction

Introduction: The application of machine learning (ML) to genetic data is transforming our understanding of disease susceptibility. In occupations with high noise exposure, identifying workers at risk of noise-induced hearing loss (NIHL) requires analyzing complex genetic variations known as single nucleotide polymorphisms (SNPs). This course explains how ML methods handle high-dimensional SNP datasets to predict health outcomes.

Key Concepts

  • Single Nucleotide Polymorphism (SNP): A SNP is a single-base difference in DNA sequence among individuals. SNPs in specific genes can influence how the inner ear responds to noise and oxidative stress.
  • High-Dimensional Data: Genotyping dozens to hundreds of SNPs creates a dataset with many features per individual. Traditional statistical models can struggle with such data because of multicollinearity among SNPs and nonlinear interactions between them.
  • Machine Learning Models: ML algorithms like Random Forests (RF), Decision Trees (DT), and neural networks (PNN, GRNN) can learn patterns from complex datasets. They can capture interactions among SNPs and environmental factors without those interactions being specified in advance.
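
To make the idea of SNP features concrete, here is a minimal Python sketch of one common way to turn genotypes into numbers: additive coding, which counts copies of a chosen allele (0, 1, or 2). The article does not say which coding the study used, so the coding scheme, the function names, and the example genotypes are all illustrative assumptions.

```python
def encode_genotype(genotype, counted_allele):
    """Count copies of counted_allele (0, 1, or 2) in a two-letter genotype such as 'AG'.

    Additive coding is an assumption for illustration; the source does not
    state how genotypes were encoded.
    """
    if len(genotype) != 2:
        raise ValueError("expected a two-letter genotype, got %r" % (genotype,))
    return sum(1 for allele in genotype if allele == counted_allele)


def encode_worker(genotypes, counted_alleles):
    """Encode one worker's genotypes (one per SNP locus) as a numeric feature vector."""
    if len(genotypes) != len(counted_alleles):
        raise ValueError("need one counted allele per locus")
    return [encode_genotype(g, a) for g, a in zip(genotypes, counted_alleles)]


# Hypothetical worker genotyped at three made-up loci, counting the 'G' allele at each.
features = encode_worker(["AA", "AG", "GG"], ["G", "G", "G"])
print(features)
```

With 88 loci per worker, each row of the resulting table would hold 88 such counts, which is what the models described here take as input under this assumed coding.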

Popular ML Approaches

  1. Random Forest and GA-RF: Random Forest builds an ensemble of decision trees, each trained on a bootstrap sample of the data. GA-RF adds a genetic algorithm that searches over hyperparameters (e.g., tree depth, feature subsets), which can improve predictive accuracy on SNP data.
  2. Neural Networks (PNN & GRNN): Probabilistic Neural Networks classify samples based on estimated probability density functions for each class. Generalized Regression Neural Networks estimate continuous outputs via kernel smoothing. Both models handle nonlinearity and high feature counts.
  3. Gradient Boosting (XGBoost, GBDT): These methods iteratively fit tree-based learners to residual errors, building strong predictors. Hyperparameter tuning is critical to avoid overfitting.
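
The kernel-based idea behind PNN and GRNN from item 2 can be sketched in a few lines of plain Python. This is a simplified illustration with a Gaussian kernel and a single smoothing parameter `sigma`, not the study's implementation; the toy genotype vectors, the labels, and the value of `sigma` are assumptions.

```python
import math


def gaussian_kernel(x, xi, sigma):
    """Gaussian kernel weight between a query x and a training sample xi."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def grnn_predict(x, train_X, train_y, sigma):
    """GRNN-style output: kernel-weighted average of the training targets."""
    weights = [gaussian_kernel(x, xi, sigma) for xi in train_X]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed; fall back to the plain mean of the targets.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


def pnn_predict(x, train_X, train_y, sigma):
    """PNN-style output: the class with the largest average kernel density at x."""
    density_sum = {}
    class_count = {}
    for xi, yi in zip(train_X, train_y):
        density_sum[yi] = density_sum.get(yi, 0.0) + gaussian_kernel(x, xi, sigma)
        class_count[yi] = class_count.get(yi, 0) + 1
    return max(density_sum, key=lambda c: density_sum[c] / class_count[c])


# Toy data (assumed): rows are SNP vectors coded 0/1/2, labels are 1 = case, 0 = control.
train_X = [[0, 0, 1], [0, 1, 0], [2, 2, 1], [2, 1, 2]]
train_y = [0, 0, 1, 1]
sigma = 1.0  # smoothing hyperparameter; in practice it would be tuned, e.g. by cross-validation

print(pnn_predict([2, 2, 2], train_X, train_y, sigma))
print(grnn_predict([2, 2, 2], train_X, train_y, sigma))
```

The only thing to tune in this sketch is `sigma`: a small value makes predictions follow the nearest training samples closely, while a large value smooths them toward the overall average.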

Model Evaluation Metrics

  • Accuracy: Percentage of correctly predicted cases and controls.
  • Recall (Sensitivity): Ability to identify actual hearing loss cases.
  • Precision: Proportion of predicted cases that are true cases.
  • F-score: Harmonic mean of recall and precision.
  • AUC (ROC): Measures discriminative power across all thresholds.
  • Nagelkerke R²: Pseudo R² for binary outcomes, indicating model fit.
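
The metrics above can be computed directly from their definitions. The sketch below uses only the Python standard library; the labels and predicted probabilities are made-up values for illustration, and the 0.5 decision threshold is an assumption.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, recall, precision, and F-score from binary labels (1 = case, 0 = control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(y_true), "recall": recall,
            "precision": precision, "f_score": f_score}


def roc_auc(y_true, scores):
    """AUC as the chance that a random case outscores a random control (ties count half)."""
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def nagelkerke_r2(y_true, probs):
    """Nagelkerke pseudo R-squared: Cox-Snell R-squared divided by its maximum value."""
    n = len(y_true)
    ll_model = sum(math.log(p if t == 1 else 1 - p) for t, p in zip(y_true, probs))
    base_rate = sum(y_true) / n  # the intercept-only model predicts the case rate
    ll_null = sum(math.log(base_rate if t == 1 else 1 - base_rate) for t in y_true)
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    return cox_snell / (1 - math.exp(2 * ll_null / n))


# Made-up labels and predicted probabilities for eight workers.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
probs = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
y_pred = [1 if p >= 0.5 else 0 for p in probs]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, probs)
r2 = nagelkerke_r2(y_true, probs)
print(metrics, auc, r2)
```

Accuracy, recall, precision, and F-score all depend on the chosen threshold, whereas AUC and Nagelkerke R² are computed from the predicted probabilities themselves.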

Applications in Occupational Health

By applying ML to SNP panels and exposure data, occupational health practitioners can:

  • Screen workers for genetic susceptibility to NIHL.
  • Implement targeted preventive measures (e.g., personalized hearing protection).
  • Monitor high-risk individuals more closely over time.

Understanding ML fundamentals and evaluating models carefully help ensure reliable disease risk predictions from genetic data, supporting personalized medicine in workplace safety.

Comparison between logistic regression and machine learning algorithms on prediction of noise-induced hearing loss and investigation of SNP loci