Preface
This notebook entry was largely generated by ChatGPT from the scripts that resulted from a long and successful ChatGPT session. I’ve added links to data files. The work was motivated by previous exploratory work that ran nf-core/differentialabundance on labeled data from Study1 (referred to as ‘A’ here), Study5 (referred to as ‘B’ here), and a data set combining the two studies’ data (referred to as ‘C’ here). The original exploration studied the overlap between the three significance-filtered DEG results from the respective nf-core/differentialabundance runs. The present work takes a more comprehensive meta-analysis approach (described technically below) to produce an efficient panel of just six genes, which appears to be a promising basis for a classifier, subject to further validation.
See the README file in the code base for more technical details, including descriptions of two follow-on scripts for plotting and concrete model exploration.
High-Level Overview
This project implements a two-script pipeline to develop a small, robust, and reproducible classifier from oyster gene-expression studies. The pipeline is explicitly designed to separate feature discovery from classifier construction, reducing overfitting across batches (domains).
- Script 1 (R): Performs a meta-analysis of differential-expression (DE) results across two studies to rank features (genes) by reproducibility, effect consistency, and heterogeneity. Outputs prioritized gene tiers.
- Script 2 (Python): Trains a sparse classifier using stability selection with L1 logistic regression, validates generalization across batches using Leave-One-Group-Out (LOSO) cross-validation, and selects the minimal panel of genes meeting robustness criteria.
Script 1 (R): Meta-analysis + Tiering
Inputs
- Per-study DE results for A, B, and the combined C (columns: gene_id, log2FC, SE, p, padj, baseMean).
- Feature universe = intersection of A and B gene IDs.
Outputs
- meta_ranked.csv: meta-estimates, heterogeneity (I²), reproducibility score.
- tiers.csv: Tier 1–3 assignments + flags.
- panel_candidates_tier12.txt: Tier 1 ∪ Tier 2 shortlist.
Steps (CS/ML analogy)
- Feature universe alignment → avoid schema drift.
- Meta-estimates → Fixed Effect (weighted mean) or Random Effect (adds between-study variance).
- Heterogeneity (I²) → decide FE vs RE.
- Sign consistency → require agreement across A & B.
- Scoring function: score = -log10(p_meta) − λ·I² + bonuses
- Tiering →
  - Tier 1: replicated, low I².
  - Tier 2: weaker but supportive.
  - Tier 3: only combined C supports.

Analogy: Feature screening with cross-domain stability checks (the meta-analysis math is sketched below).
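To make the meta-estimate, heterogeneity, and scoring steps concrete, here is a minimal Python sketch of the same math (Script 1 itself is R; meta_gene, lambda_het, and the omitted bonus terms are illustrative assumptions, not the script's actual names or settings):

```python
import numpy as np
from scipy.stats import norm

def meta_gene(beta_a, se_a, beta_b, se_b, lambda_het=1.0):
    """FE/RE meta-estimate, I², and reproducibility score for one gene."""
    betas = np.array([beta_a, beta_b])              # per-study log2FC
    ses = np.array([se_a, se_b])                    # per-study SE
    w = 1.0 / ses**2                                # inverse-variance weights
    beta_fe = np.sum(w * betas) / np.sum(w)         # fixed-effect estimate
    se_fe = np.sqrt(1.0 / np.sum(w))
    q = np.sum(w * (betas - beta_fe) ** 2)          # Cochran's Q
    k = len(betas)
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0
    # DerSimonian-Laird random-effects variant, used when I² is high:
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    beta_re = np.sum(betas / (ses**2 + tau2)) / np.sum(1.0 / (ses**2 + tau2))
    p_meta = 2.0 * norm.sf(abs(beta_fe / se_fe))    # FE z-test
    score = -np.log10(p_meta) - lambda_het * (i2 / 100.0)  # bonuses omitted
    return {"beta_fe": beta_fe, "beta_re": beta_re, "I2": i2, "p_meta": p_meta,
            "sign_consistent": bool(np.sign(beta_a) == np.sign(beta_b)),
            "score": score}
```

With only two studies, Q has a single degree of freedom (k − 1 = 1), so I² jumps from 0 as soon as the two log2FC estimates disagree beyond their standard errors.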
Script 2 (Python): Classifier Training with Stability Selection + LOSO
Inputs
- Expression: DESEQ2_NORM_all.vst.tsv (genes × samples, VST).
- Metadata: rnaseq_diffabundance_study1and5_samplesheet_filled.csv.
- Optional: panel_candidates_tier12.txt.
Outputs
- stability_selection.csv: feature stability probabilities.
- panel_size_sweep_metrics.csv: metrics vs panel size.
- final_panel_gene_list.txt: chosen set.
- loso_by_heldout.csv: batch-wise performance.
Steps
- Data plumbing → transpose to samples × genes, align to metadata.
- Optional restriction → Tier1 ∪ Tier2 features only.
- Stability selection → repeated subsampling + sparse logistic regression (L1/EN); see the sketch after this list.
- Redundancy pruning → remove one of each feature pair with |r| > 0.9.
- Panel-size sweep → evaluate 6–24 genes, pick smallest within ΔAUROC ≤ 0.02 of best.
- Calibration → Platt scaling, equal-weight averaging across batches.
Analogy: Sparse model with domain generalization checks.
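A hedged sketch of the stability-selection and pruning steps, assuming X is a samples × genes NumPy array of VST values and y a binary label array; stability_select, prune_redundant, and all defaults are illustrative, not the script's actual identifiers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def stability_select(X, y, n_resamples=500, subsample_frac=0.7, C=0.1, seed=0):
    """Per-gene selection frequency under repeated subsampling + L1 logistic."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    hits = np.zeros(p)
    for _ in range(n_resamples):
        # In practice subsamples should be stratified by class and batch.
        idx = rng.choice(n, size=int(subsample_frac * n), replace=False)
        Xs = StandardScaler().fit_transform(X[idx])
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(Xs, y[idx])
        hits += (np.abs(clf.coef_[0]) > 1e-8)   # count nonzero coefficients
    return hits / n_resamples

def prune_redundant(X, order, r_max=0.9):
    """One possible (greedy) pruning: walk genes in priority order and keep a
    gene only if its |r| with every already-kept gene stays at or below r_max."""
    kept = []
    for j in order:
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= r_max for k in kept):
            kept.append(j)
    return kept
```

Ranking genes by the returned frequency, pruning, and then sweeping panel sizes k = 6–24 yields the selection rule above: keep the smallest k whose AUROC is within 0.02 of the best.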
ASCII Flow
Files
│
├── DESEQ2_NORM_all.vst.tsv
├── rnaseq_diffabundance_study1and5_samplesheet_filled.csv
└── DE tables: A, B, C
↓
[Script 1: R] Meta-analysis + Tiering
→ meta_ranked.csv
→ tiers.csv
→ panel_candidates_tier12.txt
↓
[Script 2: Python] Classifier + Validation
→ stability_selection.csv
→ panel_size_sweep_metrics.csv
→ final_panel_gene_list.txt
→ loso_by_heldout.csv
Glossary
- VST: Variance-Stabilizing Transform (normalize variance across counts).
- Meta-analysis: Combine effect sizes across studies; FE assumes same effect, RE allows heterogeneity.
- I²: percentage of variability in effect estimates attributable to between-study heterogeneity rather than chance.
- LOSO: Leave-One-Group-Out cross-validation (train on 2 batches, test on 1).
- Stability selection: Feature selection under resampling + sparsity.
- Calibration (Platt): Adjust raw scores to calibrated probabilities (a LOSO + Platt sketch follows this glossary).
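To illustrate the LOSO and Platt entries, a minimal sketch assuming batch labels live in a groups array aligned with the rows of X (all names here are assumptions):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score

def loso_auroc(X, y, groups):
    """Hold out one batch at a time; Platt-calibrate on the training batches."""
    per_batch = {}
    for train, test in LeaveOneGroupOut().split(X, y, groups):
        base = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
        # Platt scaling = sigmoid calibration fit on inner CV folds of the
        # training batches, so the held-out batch is never touched.
        model = CalibratedClassifierCV(base, method="sigmoid", cv=3)
        model.fit(X[train], y[train])
        prob = model.predict_proba(X[test])[:, 1]
        # Assumes each held-out batch contains both classes.
        per_batch[groups[test][0]] = roc_auc_score(y[test], prob)
    # Equal-weight average over held-out batches, not over pooled samples:
    return per_batch, float(np.mean(list(per_batch.values())))
```

Averaging per-batch scores with equal weight keeps the larger batch from dominating the headline metric, matching the batch-imbalance mitigation below.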
Key Risks and Mitigations
- Data leakage → run selection within folds; treat the R shortlist as a prior only (see the nesting sketch after this list).
- Batch imbalance → equal-weight metrics; optional downsampling of large batch A.
- Overfitting via correlation → redundancy pruning.
- Parameter sensitivity → document thresholds and rerun sensitivity checks.
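The leakage mitigation can be made mechanical by re-running selection inside every fold; a sketch reusing the illustrative helpers from the earlier blocks:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def nested_loso_panels(X, y, groups, panel_size=6):
    """Derive the gene panel from training batches only, once per LOSO fold,
    so the held-out batch never influences feature selection."""
    for train, test in LeaveOneGroupOut().split(X, y, groups):
        freq = stability_select(X[train], y[train])  # selection on train only
        order = np.argsort(freq)[::-1]               # rank genes by stability
        panel = prune_redundant(X[train], order)[:panel_size]
        yield train, test, panel                     # fit/score downstream
```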
Adaptable Knobs
- Tiering (R): I² cutoff, λ in score, p-value thresholds.
- Classifier (Python): # resamples, subsample fraction, L1/EN penalty, correlation threshold, panel sizes, ΔAUROC tolerance, whether to restrict to Tier1∪Tier2 (an example config follows).
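One convenient pattern is to collect all of these knobs in a single config; names and defaults below are illustrative, not the scripts' actual parameters:

```python
# Hypothetical config gathering the tunable knobs listed above.
CONFIG = {
    # Script 1 (R) tiering
    "i2_cutoff": 50.0,         # max I² (%) tolerated for Tier 1
    "lambda_het": 1.0,         # λ penalty on I² in the score
    "p_meta_max": 0.05,        # meta p-value threshold
    # Script 2 (Python) classifier
    "n_resamples": 500,        # stability-selection resamples
    "subsample_frac": 0.7,     # fraction of samples per resample
    "penalty": "l1",           # or elastic net
    "r_max": 0.9,              # redundancy-pruning correlation cutoff
    "panel_sizes": range(6, 25),
    "delta_auroc": 0.02,       # tolerance vs best panel in the sweep
    "restrict_tier12": True,   # limit features to Tier1 ∪ Tier2
}
```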
Runtime Notes
- R script: seconds–minutes (cheap).
- Python script: resample-heavy:
- Tier1∪Tier2 (hundreds of features, ~500 resamples): 5–20 min.
- Full set (10k+ genes, 1000 resamples): 30–90+ min.
- Main cost drivers: # resamples, feature count, panel-size sweep.
Equations (Minimal Reference)
- FE meta-estimate: β_FE = Σᵢ (βᵢ / SEᵢ²) / Σᵢ (1 / SEᵢ²)
- Heterogeneity: I² = max(0, (Q − (k − 1)) / Q) × 100%, where Q = Σᵢ wᵢ (βᵢ − β_FE)² is Cochran's Q, wᵢ = 1/SEᵢ², and k is the number of studies.
- L1 logistic: min_β [ logloss(Xβ, y) + α‖β‖₁ ]
Additional Considerations
- Run stability selection inside each LOSO fold for strict no-leakage.
- Save seeds, configs, hashes for reproducibility.
- Evaluate calibration with reliability curves, not just AUROC (see the sketch after this list).
- Optionally benchmark against linear SVMs.
- If class priors differ by batch, consider prior correction.
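A minimal sketch of the reliability-curve check, assuming y_true and y_prob are collected across LOSO folds (names are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

def reliability_plot(y_true, y_prob, n_bins=10):
    """Plot observed vs predicted positive rates; a well-calibrated model
    hugs the diagonal even when its AUROC is unchanged."""
    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    plt.plot(mean_pred, frac_pos, marker="o", label="model")
    plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
    plt.xlabel("Mean predicted probability")
    plt.ylabel("Observed fraction positive")
    plt.legend()
    plt.show()
```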