Researchers at Inonu University apply four machine learning algorithms (Random Forest, SVM, XGBoost and KNN) to complete blood count (CBC) parameters to predict polycythaemia vera (PV). After balancing the dataset with SMOTE and training on hemoglobin, hematocrit, white blood cell and platelet values, the XGBoost model attains an area under the ROC curve (AUC) of 0.99 and 94% accuracy, suggesting that AI could reduce reliance on expensive or invasive diagnostics such as JAK2 mutation assays and bone marrow biopsy.

Key points

  • The XGBoost model classifies polycythaemia vera (PV) with an AUC of 0.99 and 94% accuracy based on CBC features.
  • SMOTE oversampling addresses the 82:1402 class imbalance before an 80:20 train-test split.
  • Platelet count (PLT) contributed 42.4% to model predictions, highlighting its diagnostic value.

Why it matters: This study suggests that machine learning on a routine CBC could screen for polycythaemia vera accurately, reducing diagnostic cost and invasiveness.

Q&A

  • What is the Synthetic Minority Oversampling Technique (SMOTE)?
  • How does XGBoost differ from other machine learning models?
  • Why use complete blood count (CBC) parameters for disease prediction?
  • What are the standard diagnostic tests for polycythaemia vera?

Machine Learning in Biomedical Research

Machine learning (ML) applies algorithms and statistical models to extract patterns from data and make predictions without being explicitly programmed for each task. In biomedical research, ML lets researchers analyze complex datasets, such as genomic sequences, imaging scans, and laboratory test results, to identify disease markers, predict outcomes, and personalize treatments. Common ML tasks include classification (e.g., disease vs. healthy), regression (e.g., predicting biomarker levels), and clustering (e.g., grouping similar patient profiles).

ML methods can be broadly categorized into supervised learning, where models learn from labeled examples, and unsupervised learning, where models identify hidden structures in unlabeled data. Supervised models require careful splitting of data into training, validation, and test sets so that overfitting can be detected and performance is measured on data the model has not seen. Unsupervised methods such as principal component analysis or clustering help uncover novel biological relationships. Interpretability tools such as SHAP (SHapley Additive exPlanations) provide insights into feature contributions, making ML outputs more transparent to clinicians and researchers.
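
The data-splitting step can be illustrated with a minimal sketch. It assumes a stratified split, so that each subset keeps roughly the same class proportions; the function name `stratified_split`, the 60/20/20 fractions, and the toy labels are illustrative assumptions, not details from the study.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists with similar class proportions."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy labels (hypothetical): 1 = disease, 0 = healthy
labels = [1] * 10 + [0] * 90
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting within each class matters most when one class is rare: a purely random split could leave the validation or test set with very few positive cases.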

Complete Blood Count (CBC) Parameters

A complete blood count (CBC) is one of the most common and inexpensive diagnostic tests in healthcare. It measures key blood components:

  • Hemoglobin (HGB): the oxygen-carrying protein in red blood cells.
  • Hematocrit (HCT): the proportion of blood volume occupied by red cells.
  • White Blood Cell Count (WBC): the number of immune cells that fight infections.
  • Platelet Count (PLT): the number of platelets, small cell fragments that help blood clot.
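
For modelling, these four values are typically arranged as one fixed-order feature vector per patient. The sketch below shows one possible representation; the class name `CBCRecord`, the units in the comments, and the sample values are assumptions for illustration and are not taken from the study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CBCRecord:
    hgb: float  # hemoglobin, assumed g/dL
    hct: float  # hematocrit, assumed % of blood volume
    wbc: float  # white blood cell count, assumed 10^9 cells/L
    plt: float  # platelet count, assumed 10^9 cells/L

    def to_features(self):
        """Return the values in a fixed order so every row lines up column-wise."""
        return [self.hgb, self.hct, self.wbc, self.plt]

# Placeholder values for illustration only, not patient data
record = CBCRecord(hgb=14.0, hct=42.0, wbc=7.0, plt=250.0)
features = record.to_features()
print(features)
```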

Normal CBC ranges vary by age and sex, and they can shift with aging due to changes in bone marrow activity, immune system modulation, and overall physiological resilience. In longevity research, tracking CBC parameters over time helps monitor systemic inflammation, anemia prevalence, and clotting propensity—key factors linked to healthy aging and disease risk.

Applying Machine Learning to CBC Data

ML models can reveal patterns in CBC data that correlate with specific diseases or aging processes. Here is a typical workflow:

  1. Data Collection: Assemble a large dataset of CBC results tagged with outcomes, such as disease status or age-related decline metrics.
  2. Data Preprocessing: Clean and normalize data, address missing values through imputation, and encode categorical variables if present.
  3. Handling Imbalance: Apply techniques like SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic samples for underrepresented classes, so that the model does not simply favor the majority class.
  4. Model Selection and Training: Choose appropriate algorithms—decision trees, XGBoost, support vector machines, neural networks—and train them on feature sets.
  5. Model Evaluation: Use cross-validation and holdout test sets to calculate metrics like accuracy, precision, recall, F1-score, and ROC AUC, ensuring robust performance.
  6. Feature Importance and Interpretation: Employ tools like SHAP values or permutation importance to identify which CBC parameters most influence predictions, guiding biological understanding.
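
The imbalance-handling step is the least intuitive part of this workflow, so here is a simplified sketch of the idea behind SMOTE: each synthetic sample is placed at a random position on the line segment between a minority-class point and one of its nearest minority-class neighbours. This is a pure-Python illustration under stated assumptions, not the implementation used in the study; the function name `smote`, its parameters, and the toy points are hypothetical.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other points
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Abstract 2-D toy points standing in for minority-class feature vectors
minority = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 3.0)]
new_points = smote(minority, n_synthetic=6, k=2)
print(len(new_points))
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region the minority class already occupies. A common practice is to oversample only the training portion, so that evaluation is done on real measurements.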

In diseases like polycythaemia vera, ML models trained on CBC parameters can achieve high classification performance (an AUC of 0.99 and 94% accuracy in the study summarized above), potentially reducing reliance on expensive tests. Extending this approach to longevity science, researchers can use longitudinal CBC data to predict risks of age-related conditions, monitor responses to lifestyle interventions, and personalize health maintenance strategies.
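
Accuracy and ROC AUC, the two figures reported for the XGBoost model, measure different things: accuracy depends on a decision threshold, while AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes both from scratch; the labels and scores are made-up toy values, and the 0.5 threshold is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A positive scoring above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical): 3 positives, 5 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 decision threshold

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(acc, round(auc, 3))  # 0.75 0.933
```

In this toy case one positive falls below the threshold and one negative falls above it, giving an accuracy of 0.75, while the ranking is correct for 14 of the 15 positive-negative pairs. On imbalanced data the two metrics can diverge considerably, which is why both are usually reported.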

By integrating ML-driven analysis of routine blood tests into clinical practice, healthcare providers could implement early warning systems, tailor screening schedules, and optimize preventive care. Because a CBC is inexpensive and widely available, this could broaden access to advanced diagnostics and support more proactive and personalized longevity research and healthcare.

Moreover, coupling ML predictions with digital health platforms and wearable devices can create continuous monitoring systems, alerting individuals and clinicians to early deviations from healthy aging biomarkers. This synergy between data science and everyday health data holds promise for proactive longevity management.