Context
This note documents how the 2-stage LOSO pipeline was designed and then revised to reduce leakage risk. It is based on:
ChatGPT_Minimal_Biomarker_Longest_Gene_selection_strategy_commented.pdf(main design thread),ChatGPT_Minimal_Biomarker_Gene_classifier_explanation_2_commented.pdf(summary + explicit risk/mitigation framing),- current code in this directory, and
previous_versioncode representing the pre-pivot implementation.
Primary goal: a small, reproducible gene panel and an auditable classifier that generalizes across batches/studies.
What the initial design proposed
From the main transcript, the intended architecture was:
- Stage 1 (R): cross-study ranking/tiering using FE/RE meta-analysis, heterogeneity (
I²), sign consistency, and tier rules. - Stage 2 (Python): stability selection + redundancy pruning + panel-size sweep + LOSO evaluation.
- select the smallest panel size within
ΔAUROC <= 0.02of the best-performing size.
This exact structure appears in the implemented scripts:
run_tiering.R(meta + tiering),lasso_prune_onefold.py(stability selection + pruning + per-fold model fitting),1_run_loso_pipeline.py(orchestrates fold-wise execution),2_summarize_loso_results.py(aggregate reporting).
Why we pivoted
The secondary transcript emphasized key risks (especially leakage and domain imbalance), and that framing drove implementation changes.
Pre-pivot behavior (previous_version/lasso_prune_loso_size.py)
The old script:
- loaded all samples,
- ran stability selection on the full dataset,
- ranked/pruned genes globally,
- then performed LOSO evaluation.
That means held-out groups influenced feature ranking before evaluation. Even with LOSO scoring later, this leakage of held-out (test) data into feature ranking can make performance estimates optimistic.
Current leakage-aware behavior
The revised pipeline enforces fold isolation:
1_run_loso_pipeline.pydefines train/test samples per held-out batch.- It calls
run_tiering.Rwith--train_samples, so tiering/meta-ranking are fold-specific. - It calls
lasso_prune_onefold.pyfor that fold only. - In
lasso_prune_onefold.py, stability selection, pruning; held-out data are only used for fold evaluation outputs and panel-size evaluation.
This converts the process from “global feature discovery + LOSO score” to nested per-fold feature discovery + fold-held-out testing, which better matches the intended domain-generalization objective.
Implemented risk mitigations (from transcript to code)
1) Leakage control
- Implemented via fold-specific training sample lists (
train_samples.txt) and per-fold outputs inresults_08_Dec_2025/loso_*. - The Dec-2025 directory name is intentional and reflects when that analysis run was produced, even though this notebook write-up is dated 2026-06-09.
- Tiering and feature ranking are no longer run once on all samples.
2) Cross-domain reproducibility emphasis
run_tiering.Rcomputes FE/RE meta quantities and heterogeneity (I²), then tier assignments using sign and significance rules.- Tier 1/2 outputs constrain candidate features passed to modeling (
panel_candidates.txt).
3) Minimal-panel selection discipline
lasso_prune_onefold.pyevaluates fixed panel sizes[6, 8, 10, 12, 16, 24]and records AUROC/AUPR/Brier.- Final fold model is refit on the best fold-specific size and exported with coefficients/predictions.
4) Probability calibration and reporting
- Calibrated classifiers (
CalibratedClassifierCV) are used for fold models. 2_summarize_loso_results.pyconsolidates fold metrics, coefficients, pooled predictions, ROC, and calibration plots.
Outcome
The implementation now matches the key methodological intent from both transcripts:
- retain reproducibility-first screening,
- build small interpretable panels,
- and evaluate generalization with stricter fold isolation to reduce leakage risk.