An interdisciplinary team from the University of Hong Kong and Shenzhen University introduces OMMT-PredNet, a multimodal deep learning framework that fuses high-resolution oral images with encoded clinical data. It concurrently detects epithelial dysplasia and predicts time-to-event malignant transformation, enabling non-invasive oral cancer screening and personalized risk stratification.

Key points

  • OMMT-PredNet integrates ResNet50 with dual CBAM modules to spotlight lesion texture and spatial features in oral images without manual ROI annotation.
  • A textual feature encoder transforms encoded demographics, clinical subtype, and lesion characteristics into embeddings, which are concatenated with image features for multimodal fusion.
  • Multi-task learning uses cross-entropy for dysplasia classification, BCE with logits for malignant transformation scoring, and Cox proportional hazards loss for time-to-event risk prediction (AUCs 0.9592 and 0.9219).

Why it matters: This multimodal AI approach streamlines non-invasive oral cancer screening, improving early detection and personalized monitoring over conventional biopsy-based methods.

Q&A

  • What is oral epithelial dysplasia?
  • How does CBAM enhance model accuracy?
  • What role does Cox proportional hazards play in prediction?
  • Why fuse images and clinical text?

Multimodal AI in Medical Imaging for Risk Prediction

Introduction
Multimodal AI combines image-based and textual data to improve diagnostic accuracy in healthcare. This approach leverages deep learning techniques to extract meaningful patterns from diverse inputs, such as high-resolution clinical photographs and electronic health records (EHRs), enabling more comprehensive disease assessment.

Deep Learning Foundations
Deep learning uses neural networks with multiple layers to learn hierarchical representations of data. Convolutional Neural Networks (CNNs) specialize in image analysis by applying filters that detect edges, textures, and shapes. Fully connected layers process textual or structured data by learning weighted combinations of input features.
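The filter operation at the heart of a CNN can be shown with a minimal pure-Python sketch (illustrative only, not the paper's code): a small vertical-edge kernel slides over a grayscale patch, producing strong responses where intensity changes from left to right.

```python
def conv2d(image, kernel):
    """Valid 2D convolution (no padding, stride 1) of image with kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# A 4x4 patch with a sharp vertical boundary between dark (0) and bright (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Sobel-like vertical-edge kernel: responds where intensity rises left-to-right.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
edges = conv2d(image, kernel)
```

A trained CNN learns thousands of such kernels automatically, detecting not just edges but textures and shapes at increasing levels of abstraction.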

Key Components

  • ResNet50 Backbone: A 50-layer CNN designed with residual connections to mitigate vanishing gradient issues. Each residual block learns only the residual (the difference between its input and the desired output), which is added back through a skip connection, enabling deeper networks without performance degradation.
  • Convolutional Block Attention Module (CBAM): Applies channel attention to weigh feature maps and spatial attention to highlight important regions. CBAM refines CNN outputs by focusing on lesion-specific textures while suppressing irrelevant background.
  • Textual Feature Encoding: Patient demographics, clinical history, lesion subtype, and other categorical data are converted into numerical embeddings via fully connected layers with ReLU activation. Dropout layers prevent overfitting by randomly disabling neurons during training.
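The CBAM mechanism above can be sketched in a heavily simplified pure-Python form. This is an illustrative reduction, not the module's real implementation: it assumes a single average-pooling branch with a sigmoid gate and omits CBAM's shared MLP, max-pooling branch, and 7x7 convolution.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(fmap):
    """Gate each channel by a sigmoid of its global average (MLP omitted)."""
    weights = []
    for ch in fmap:
        avg = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        weights.append(sigmoid(avg))
    return [[[v * w for v in row] for row in ch]
            for ch, w in zip(fmap, weights)]

def spatial_attention(fmap):
    """Gate each location by a sigmoid of its cross-channel average
    (the 7x7 conv of the real module omitted)."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    gate = [[sigmoid(sum(fmap[k][i][j] for k in range(c)) / c)
             for j in range(w)] for i in range(h)]
    return [[[fmap[k][i][j] * gate[i][j] for j in range(w)]
             for i in range(h)] for k in range(c)]

def cbam(fmap):
    """CBAM ordering: channel attention first, then spatial attention."""
    return spatial_attention(channel_attention(fmap))

# Two 2x2 channels: one strongly activated, one near zero.
features = [
    [[4.0, 4.0], [4.0, 4.0]],
    [[0.1, 0.1], [0.1, 0.1]],
]
refined = cbam(features)
```

The strongly activated channel retains most of its magnitude while the weak one is suppressed, which is the behavior that lets the network emphasize lesion-specific features over background.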

Multimodal Fusion

After independent encoding, image and textual feature vectors are concatenated into a unified representation. This fusion integrates visual and clinical information, providing the AI with a holistic view of disease markers and risk factors.
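Concatenation fusion itself is simple; the sketch below uses short toy vectors (the dimensions are illustrative, not taken from the paper, where the image branch would output a much longer vector such as 2048 dimensions):

```python
# Hypothetical encoded outputs from the two branches.
image_features = [0.12, 0.87, 0.05, 0.44]   # stands in for the CNN image vector
clinical_features = [0.90, 0.33]            # stands in for the clinical embedding

# Fusion by concatenation: the combined vector feeds the shared task heads.
fused = image_features + clinical_features
```

Because the downstream layers are trained on the fused vector end-to-end, the model can learn interactions between visual findings and clinical risk factors that neither branch captures alone.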

Multi-Task Learning

  1. Dysplasia Classification: Cross-entropy loss trains the model to categorize epithelial dysplasia into clinically relevant grades.
  2. Risk Scoring: Binary cross-entropy with logits loss calibrates the predicted probability of malignant transformation at each evaluation time point.
  3. Survival Analysis: Cox proportional hazards loss handles censored time-to-event data, teaching the model to predict hazard functions and survival probabilities over time.
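The Cox loss in step 3 can be written out as a small pure-Python sketch of the negative log partial likelihood (a generic textbook form, not the authors' implementation; patient data below are made up):

```python
import math

def cox_ph_loss(risk_scores, times, events):
    """Negative log partial likelihood for Cox proportional hazards.

    risk_scores: model outputs (log-hazard ratios), one per patient
    times:       follow-up time (event or censoring time)
    events:      1 if malignant transformation observed, 0 if censored
    Censored patients contribute only through the risk sets of others.
    """
    loss, n_events = 0.0, 0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i == 0:
            continue  # censored: no direct term
        # Risk set: everyone still under observation at time t_i.
        log_risk_sum = math.log(sum(
            math.exp(s) for s, t in zip(risk_scores, times) if t >= t_i))
        loss -= risk_scores[i] - log_risk_sum
        n_events += 1
    return loss / max(n_events, 1)

# Three toy patients: the high-risk patient transforms early, one is censored.
scores = [2.0, 0.5, -1.0]   # predicted log-hazards
times = [6.0, 18.0, 24.0]   # months of follow-up
events = [1, 0, 1]          # patient 2 censored at 18 months
loss = cox_ph_loss(scores, times, events)
```

The key property is that censored patients are never penalized directly, yet they still raise the denominator for events that occur while they remain at risk, which is exactly how the loss extracts signal from incomplete follow-up.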

Model Training and Validation

Data are split into training and test sets at the patient level, ensuring no overlap. Five-fold cross-validation with bootstrapping estimates robust performance metrics including AUC, concordance index (c-Index), sensitivity, specificity, and Brier scores.
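A patient-level split can be sketched as follows (an illustrative stdlib-only example; the field names and split fraction are assumptions, not details from the paper). The point is that all images from one patient land in the same subset, preventing leakage:

```python
import random

def patient_level_split(records, test_fraction=0.2, seed=42):
    """Split records so no patient appears in both train and test."""
    patients = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_fraction))
    test_ids = set(patients[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test

# Toy dataset: three images per patient (structure is illustrative).
records = [{"patient_id": p, "image": f"img_{p}_{k}.png"}
           for p in range(10) for k in range(3)]
train, test = patient_level_split(records)
overlap = {r["patient_id"] for r in train} & {r["patient_id"] for r in test}
```

A naive image-level split would let near-duplicate photographs of the same lesion appear on both sides, inflating the measured AUC and c-Index; splitting on patient IDs avoids that.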

Clinical Applications

  • Non-Invasive Triage: AI flags high-risk lesions for biopsy, reducing unnecessary procedures.
  • Personalized Monitoring: Time-dependent risk curves guide follow-up intervals tailored to individual patients.
  • Resource Optimization: Automated image analysis eases diagnostic workload in settings with limited specialist access.

Future Directions

Integration of molecular biomarkers, deployment on portable devices, and expansion to other precancerous conditions can further enhance real-world impact. Community-driven data sharing and federated learning may improve generalizability across diverse populations.
