Researchers from Kyoto University and Osaka University, together with US collaborators, introduce MLOmics, an open-access cancer multi-omics database. It integrates mRNA, miRNA, DNA methylation, and copy number variation (CNV) datasets through standardized preprocessing, feature alignment, and statistical feature selection. The resource supports pan-cancer classification, subtype clustering, and imputation with uniform datasets and fair benchmarking.

Key points

  • Integrates 8,314 TCGA patient samples across 32 cancer types with mRNA, miRNA, methylation, and CNV omics profiles.
  • Implements standardized preprocessing including FPKM conversion, limma normalization, GAIA CNV annotation, and unified gene ID alignment.
  • Delivers 20 ready-to-use datasets for classification, clustering, and imputation with rigorous benchmarking using statistical and deep learning baselines.

Why it matters: By providing uniform, task-ready multi-omics datasets, MLOmics accelerates reproducible cancer ML research and enables robust model evaluation.

Q&A

  • What is multi-omics?
  • How does MLOmics preprocess omics data?
  • What are the Original, Aligned, and Top feature scales?
  • Which machine learning tasks does MLOmics support?

Multi-omics Data Integration

Multi-omics combines various layers of biological data—such as genomics, transcriptomics (gene expression), epigenomics (methylation), and proteomics—to give a holistic view of cellular processes. By integrating these datasets, researchers can capture complex interactions between DNA, RNA, proteins, and chemical modifications that underlie health and disease.

  • Genomics: DNA sequence and copy number data revealing genetic variations.
  • Transcriptomics: mRNA and microRNA expression levels indicating gene activity.
  • Epigenomics: DNA methylation patterns that regulate gene expression.
  • Proteomics: Protein abundance and modifications reflecting functional outcomes.

Integration involves aligning features across samples, normalizing measurements, and selecting relevant signals—transforming raw data into analysis-ready tables that machine learning algorithms can use to identify patterns and build predictive models.
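The alignment and normalization steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MLOmics pipeline itself: the toy `layers` data, the gene and sample IDs, and the helper names `shared_genes`, `zscore_columns`, and `build_table` are all hypothetical. It keeps only genes present in every omics layer, then z-scores each resulting column.

```python
from statistics import mean, pstdev

# Hypothetical toy data: {omics layer: {sample ID: {gene ID: value}}}.
layers = {
    "mRNA": {
        "S1": {"TP53": 5.0, "BRCA1": 2.0, "EGFR": 9.0},
        "S2": {"TP53": 7.0, "BRCA1": 4.0, "EGFR": 3.0},
        "S3": {"TP53": 9.0, "BRCA1": 6.0, "EGFR": 6.0},
    },
    "methylation": {
        "S1": {"TP53": 0.2, "BRCA1": 0.8, "MYC": 0.5},
        "S2": {"TP53": 0.4, "BRCA1": 0.6, "MYC": 0.1},
        "S3": {"TP53": 0.6, "BRCA1": 0.4, "MYC": 0.9},
    },
}


def shared_genes(layers):
    """Genes measured in every sample of every layer, sorted for a stable order."""
    gene_sets = [
        set(profile) for samples in layers.values() for profile in samples.values()
    ]
    return sorted(set.intersection(*gene_sets))


def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) variance."""
    stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, (m, s) in zip(row, stats)]
        for row in rows
    ]


def build_table(layers):
    """Return (sample IDs, column names, z-scored matrix) over the shared genes."""
    genes = shared_genes(layers)
    # Keep only samples that were profiled in every layer.
    samples = sorted(set.intersection(*(set(s) for s in layers.values())))
    columns = [f"{layer}:{gene}" for layer in layers for gene in genes]
    rows = [[layers[layer][s][g] for layer in layers for g in genes] for s in samples]
    return samples, columns, zscore_columns(rows)


samples, columns, table = build_table(layers)
```

Here `EGFR` and `MYC` are dropped because each appears in only one layer, so the final table has one column per (layer, shared gene) pair. Real pipelines would more likely use a dataframe library and resolve differing gene ID naming schemes before intersecting.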

Machine Learning in Cancer Genomics

Machine learning applies computational models to detect patterns in large-scale biological datasets. In cancer genomics, these methods help classify tumor types, predict patient outcomes, and discover biomarkers by learning from multi-omics profiles.

  1. Data Preprocessing: Convert raw counts to standardized units (e.g., FPKM for gene expression), filter noise, and handle missing values.
  2. Feature Engineering: Align genes across omics layers, normalize values (z-score), and select significant features via statistical tests like ANOVA.
  3. Model Training: Use algorithms such as random forests, support vector machines, or deep neural networks to learn relationships between omics features and cancer subtypes.
  4. Evaluation: Assess performance with precision, recall, and F1-score for classification; silhouette score or survival-analysis (e.g., log-rank) p-values for clustering; and mean squared error for imputation.
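Three of the steps above reduce to short formulas, sketched here in plain Python. This is an illustrative sketch rather than the code MLOmics uses: the function names (`fpkm`, `anova_f`, `top_features`, `macro_f1`), the toy matrix, and the labels are assumptions made for the example. It shows the standard FPKM definition, a one-way ANOVA F-statistic for ranking features, and a macro-averaged F1 score.

```python
from statistics import mean


def fpkm(count, gene_length_bp, total_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return count * 1e9 / (gene_length_bp * total_fragments)


def anova_f(values, labels):
    """One-way ANOVA F-statistic for one feature (assumes >= 2 groups, n > groups)."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    n, k = len(values), len(groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_features(matrix, labels, names, top_k):
    """Rank features by F-statistic and keep the top_k highest-scoring names."""
    scores = {
        name: anova_f([row[j] for row in matrix], labels)
        for j, name in enumerate(names)
    }
    return sorted(names, key=lambda name: scores[name], reverse=True)[:top_k]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
    return mean(f1s)


# Hypothetical toy data: 4 samples x 2 genes, two cancer-type labels.
names = ["geneA", "geneB"]
labels = ["BRCA", "BRCA", "LUAD", "LUAD"]
matrix = [[1.0, 5.0], [2.0, 5.5], [9.0, 5.2], [10.0, 5.3]]

selected = top_features(matrix, labels, names, top_k=1)
score = macro_f1(["A", "A", "B", "B"], ["A", "B", "B", "B"])
```

In the toy data, `geneA` separates the two labels cleanly while `geneB` does not, so `geneA` is the feature kept. In practice a library routine such as scikit-learn's `f_classif` would typically replace the hand-written F-statistic, and p-values would usually be corrected for multiple testing before selecting features.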

By standardizing workflows and providing ready-made datasets, platforms like MLOmics lower the barrier to entry—empowering researchers without advanced bioinformatics expertise to develop robust, reproducible cancer models.

MLOmics: Cancer Multi-Omics Database for Machine Learning