Researchers at Khalifa University and ASPIREPMRIAD applied nested cross-validation on de-identified SEHA EHR data, training nine ML models with both automated and expert-driven feature selection. A Naive Bayes classifier achieved 0.96 AUC, highlighting dental and respiratory codes for cost-effective early mucopolysaccharidosis detection.

Key points

  • Domain-expert feature selection identifies dental and respiratory codes (e.g., acute gingivitis, bronchitis) critical for MPS prediction.
  • Naive Bayes classifier achieves 0.96 AUC, 0.93 accuracy, and 0.91 F1-score using EHR-derived features.
  • Nested cross-validation with SMOTE balancing validates nine ML models across five feature selection strategies on 1186 EHR covariates.

Why it matters: This non-invasive, AI-driven screening could improve rare disease diagnostics by flagging mucopolysaccharidosis risk from routine EHR data, enabling earlier intervention and better outcomes.

Q&A

  • What is mucopolysaccharidosis?
  • Why choose Naive Bayes for diagnosis?
  • What is nested cross-validation?
  • How does feature selection improve model accuracy?

Machine Learning for Rare Disease Diagnosis

Introduction: Rare diseases affect millions globally but often suffer delayed diagnosis. Machine learning (ML) uses statistical algorithms to find patterns in data, offering a new way to flag potential cases early.

Electronic Health Records (EHR): Doctors record diagnoses, lab results, and symptoms in digital systems called EHR. These records hold thousands of coded entries per patient, forming rich data for ML models.

Key Steps in ML Diagnosis:

  • Data Collection: Extract de-identified records from hospitals, ensuring patient privacy.
  • Data Preprocessing: Convert diagnosis codes into numerical features and balance the classes with techniques like SMOTE so the model is not biased toward the majority class.
  • Feature Selection: Reduce thousands of codes to the most informative ones. Methods include statistical tests (chi-square), regularization (LASSO), and expert input.
  • Model Training: Use algorithms such as Naive Bayes, decision trees, and support vector machines. Nested cross-validation tests and tunes models on separate data splits.
  • Model Interpretation: Explain predictions using Shapley additive explanations (SHAP) and local interpretable model-agnostic explanations (LIME).

Understanding Naive Bayes: This algorithm estimates the probability that a patient has a rare disease by treating each feature as independent of the others given the diagnosis, then combining the per-feature evidence with the disease's baseline rate. For example, if acute gingivitis appears more often in MPS patients than in other patients, the presence of that code raises the predicted risk.
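
The calculation can be sketched in a few lines of plain Python. This is a minimal Bernoulli Naive Bayes written for illustration, not the study's implementation; the function names, the two-column toy data, and the smoothing value are all assumptions.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and per-feature presence probabilities.

    X: list of binary rows (1 = code present), y: list of 0/1 labels.
    alpha is Laplace smoothing, so no probability is exactly 0 or 1.
    """
    model = {}
    n_features = len(X[0])
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (prior, probs)
    return model

def predict_proba(model, x):
    """Return P(class = 1 | x) under the feature-independence assumption."""
    log_post = {}
    for c, (prior, probs) in model.items():
        lp = math.log(prior)
        for xj, pj in zip(x, probs):
            # Each feature contributes its own factor, independently.
            lp += math.log(pj if xj else 1.0 - pj)
        log_post[c] = lp
    m = max(log_post.values())
    exp = {c: math.exp(v - m) for c, v in log_post.items()}
    return exp[1] / (exp[0] + exp[1])

# Hypothetical toy data: columns are [gingivitis_code, bronchitis_code].
X = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 0], [1, 0]]
y = [1, 1, 1, 0, 0, 0, 0, 0]

model = fit_bernoulli_nb(X, y)
p_with = predict_proba(model, [1, 1])      # both codes present
p_without = predict_proba(model, [0, 0])   # neither code present
print(round(p_with, 3), round(p_without, 3))
```

In this toy data the patient with both codes receives a much higher predicted risk than the patient with neither, which is the behavior the paragraph describes.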

Why Feature Selection Matters: With thousands of diagnosis codes and only a small number of confirmed cases, an algorithm can easily fit noise, leading to overfitting. By selecting only key features, such as specific dental or respiratory codes, models learn to focus on the most relevant signs of a disease.
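
As an illustration of the statistical route, the sketch below ranks binary code features by a Pearson chi-square statistic against the diagnosis label and keeps the top k. The feature names, the data, and the helper names are hypothetical, and the study's actual selection strategies may differ.

```python
def chi2_score(feature, labels):
    """Pearson chi-square statistic for a binary feature vs. a binary label."""
    n = len(labels)
    # Observed counts of the 2x2 contingency table, keyed by (feature, label).
    obs = {(f, c): 0 for f in (0, 1) for c in (0, 1)}
    for f, c in zip(feature, labels):
        obs[(f, c)] += 1
    score = 0.0
    for f in (0, 1):
        for c in (0, 1):
            row_total = obs[(f, 0)] + obs[(f, 1)]
            col_total = obs[(0, c)] + obs[(1, c)]
            expected = row_total * col_total / n
            if expected > 0:
                score += (obs[(f, c)] - expected) ** 2 / expected
    return score

def select_top_k(X, y, names, k):
    """Return the k feature names with the highest chi-square score."""
    scores = {
        name: chi2_score([row[j] for row in X], y)
        for j, name in enumerate(names)
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data: one row per patient, one column per diagnosis code.
names = ["gingivitis", "bronchitis", "ankle_sprain"]
X = [[1, 1, 0], [1, 0, 1], [1, 1, 0], [0, 0, 1],
     [0, 1, 0], [0, 0, 1], [0, 0, 0], [1, 0, 0]]
y = [1, 1, 1, 0, 0, 0, 0, 0]

top = select_top_k(X, y, names, k=2)
print(top)
```

A filter like this scores each code on its own, so it cannot see interactions between codes; that is one reason to combine it with regularization or expert input.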

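Training Strategy: Nested Cross-Validation: To gauge generalization, the data is split into outer folds for testing, and each outer training set is split again into inner folds for hyperparameter tuning. Because the outer test fold never influences tuning, this double validation gives a less optimistically biased performance estimate than tuning and testing on the same splits.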
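
The two loops can be sketched as follows. This is a simplified illustration, assuming a small Bernoulli Naive Bayes whose smoothing strength alpha is the only hyperparameter, plain (unstratified) folds, accuracy as the score, and synthetic data; none of these choices are taken from the study.

```python
import math
import random

def fit_nb(X, y, alpha):
    """Bernoulli Naive Bayes; alpha (> 0) is the smoothing hyperparameter."""
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = (len(rows) + alpha) / (len(X) + 2 * alpha)
        if rows:
            probs = [(sum(col) + alpha) / (len(rows) + 2 * alpha)
                     for col in zip(*rows)]
        else:
            probs = [0.5] * len(X[0])  # class absent from this split
        model[c] = (prior, probs)
    return model

def predict(model, x):
    def log_post(c):
        prior, probs = model[c]
        return math.log(prior) + sum(
            math.log(p if xj else 1.0 - p) for xj, p in zip(x, probs))
    return 1 if log_post(1) > log_post(0) else 0

def accuracy(model, X, y):
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)

def k_folds(indices, k):
    """Split a list of indices into k roughly equal folds."""
    return [indices[i::k] for i in range(k)]

def take(data, idx):
    return [data[i] for i in idx]

def nested_cv(X, y, alphas, outer_k=4, inner_k=3, seed=0):
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    outer_scores = []
    for test_idx in k_folds(idx, outer_k):
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]

        # Inner loop: pick alpha using ONLY the outer-training data.
        def inner_score(alpha):
            scores = []
            for val_idx in k_folds(train_idx, inner_k):
                val_set = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val_set]
                m = fit_nb(take(X, fit_idx), take(y, fit_idx), alpha)
                scores.append(accuracy(m, take(X, val_idx), take(y, val_idx)))
            return sum(scores) / len(scores)

        best_alpha = max(alphas, key=inner_score)
        # Outer loop: refit on all outer-training data, score the held-out fold.
        m = fit_nb(take(X, train_idx), take(y, train_idx), best_alpha)
        outer_scores.append(accuracy(m, take(X, test_idx), take(y, test_idx)))
    return outer_scores

# Synthetic data: four binary codes, more frequent in the positive class.
rng = random.Random(42)
X, y = [], []
for i in range(60):
    label = 1 if i % 3 == 0 else 0
    p = 0.8 if label else 0.2  # hypothetical code frequencies
    X.append([1 if rng.random() < p else 0 for _ in range(4)])
    y.append(label)

scores = nested_cv(X, y, alphas=[0.1, 1.0, 10.0])
print(scores, sum(scores) / len(scores))
```

The mean of the outer scores is the reported estimate. Any step that learns from the data, including feature selection and class balancing, would need to sit inside the training side of each split to keep that estimate honest.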

Balancing Classes: Rare diseases have far fewer cases than common conditions, so a model can score well while ignoring them. Techniques like SMOTE generate synthetic examples of the minority class by interpolating between existing minority cases, which helps models learn the patterns of the rare class.
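
The core idea behind SMOTE can be sketched without any library: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line between them. The function below is a simplified stand-in written for illustration, not the reference SMOTE implementation, and the data points are hypothetical.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        # Sort the remaining minority rows by squared Euclidean distance.
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class rows with two numeric features.
minority = [[1.0, 0.9], [0.8, 1.0], [0.9, 0.7], [1.0, 1.0]]
new_rows = smote_like(minority, n_new=6)
for row in new_rows:
    print([round(v, 3) for v in row])
```

Two caveats apply. Interpolating between binary code indicators produces fractional values, so the result may need rounding or a variant designed for categorical features. Synthetic samples should also be generated from training data only, so that no information from the test fold leaks into the model.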

Interpretable Models: In healthcare, trust is critical. SHAP and LIME highlight which features contributed most to a prediction, allowing doctors to verify that the model relies on medically credible signs.
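
SHAP is built on Shapley values, which split a prediction among the features that produced it. For a handful of features they can be computed exactly by enumerating every feature subset, as in the sketch below. The scoring function, its weights, and the feature order are hypothetical; this is not the shap library's API, and real tools approximate the calculation because exact enumeration grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, relative to a baseline input.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                with_i = value(set(S) | {i})
                without_i = value(set(S))
                total += w * (with_i - without_i)
        phi.append(total)
    return phi

def risk_score(z):
    """Hypothetical score over [gingivitis, bronchitis, otitis] indicators,
    with an interaction between the first and third codes."""
    return 2.0 * z[0] + 1.0 * z[1] + 0.5 * z[0] * z[2]

patient = [1, 1, 1]    # all three codes present
baseline = [0, 0, 0]   # reference patient with none of the codes
phi = shapley_values(risk_score, patient, baseline)
print(phi)
```

The contributions add up to the difference between the patient's score and the baseline score, which is the property that lets a clinician read them as a breakdown of one prediction.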

Case Study: Mucopolysaccharidosis (MPS): A group of inherited metabolic disorders in which deficiencies of lysosomal enzymes cause glycosaminoglycans to build up in tissues. Early signs include dental anomalies and frequent infections. In the study summarized here, a Naive Bayes classifier using such codes from EHR data reached 0.96 AUC, pointing to cost-effective early screening.

Benefits for Patients:

  1. Faster Diagnosis: Automated scans of existing records can alert physicians before symptoms worsen.
  2. Non-Invasive Screening: No need for genetic tests or biopsies until high-risk patients are identified.
  3. Scalable Solution: Hospitals can deploy ML dashboards to monitor thousands of patients in real time.

Key Terms:

  • Overfitting: When a model learns noise instead of general patterns, performing poorly on new data.
  • Regularization: Techniques like L1 (LASSO) penalize complex models to prevent overfitting.
  • Hyperparameters: Settings such as tree depth or regularization strength that are fixed before training rather than learned from the data; these can be optimized via Bayesian methods.

Real-World Implementation: Hospitals can integrate an ML pipeline that updates models with new records and flags high-risk patients, prompting follow-up exams and specialist referrals.

Conclusion: Machine learning can improve rare disease diagnostics by mining routine data for hidden patterns. With transparent, validated models, clinicians can catch diseases earlier and improve patient outcomes.

Comparison of machine learning models for mucopolysaccharidosis early diagnosis using UAE medical records