2-stage LOSO classifier pipeline (leakage-aware revision)

Context

This note documents how the 2-stage LOSO pipeline was designed and then revised to reduce leakage risk. It is based on:

ChatGPT_Minimal_Biomarker_Longest_Gene_selection_strategy_commented.pdf (main design thread),
ChatGPT_Minimal_Biomarker_Gene_classifier_explanation_2_commented.pdf (summary + explicit risk/mitigation framing),
current code in this directory, and
previous_version code representing the pre-pivot implementation.

Primary goal: a small, reproducible gene panel and an auditable classifier that generalizes across batches/studies.

What the initial design proposed

From the main transcript, the intended architecture was:

Stage 1 (R): cross-study ranking/tiering using FE/RE meta-analysis, heterogeneity (I²), sign consistency, and tier rules.
Stage 2 (Python): stability selection + redundancy pruning + panel-size sweep + LOSO evaluation.
select the smallest panel size within ΔAUROC <= 0.02 of the best-performing size.

This exact structure appears in the implemented scripts:

run_tiering.R (meta + tiering),
lasso_prune_onefold.py (stability selection + pruning + per-fold model fitting),
1_run_loso_pipeline.py (orchestrates fold-wise execution),
2_summarize_loso_results.py (aggregate reporting).

Why we pivoted

The secondary transcript emphasized key risks (especially leakage and domain imbalance), and that framing drove implementation changes.

Pre-pivot behavior (`previous_version/lasso_prune_loso_size.py`)

The old script:

loaded all samples,
ran stability selection on the full dataset,
ranked/pruned genes globally,
then performed LOSO evaluation.

That means held-out groups influenced feature ranking before evaluation. Even with LOSO scoring later, this leakage of held-out (test) data into feature ranking can make performance estimates optimistic.

Current leakage-aware behavior

The revised pipeline enforces fold isolation:

1_run_loso_pipeline.py defines train/test samples per held-out batch.
It calls run_tiering.R with --train_samples, so tiering/meta-ranking are fold-specific.
It calls lasso_prune_onefold.py for that fold only.
In lasso_prune_onefold.py, stability selection, pruning; held-out data are only used for fold evaluation outputs and panel-size evaluation.

This converts the process from “global feature discovery + LOSO score” to nested per-fold feature discovery + fold-held-out testing, which better matches the intended domain-generalization objective.

Implemented risk mitigations (from transcript to code)

1) Leakage control

Implemented via fold-specific training sample lists (train_samples.txt) and per-fold outputs in results_08_Dec_2025/loso_*.
The Dec-2025 directory name is intentional and reflects when that analysis run was produced, even though this notebook write-up is dated 2026-06-09.
Tiering and feature ranking are no longer run once on all samples.

2) Cross-domain reproducibility emphasis

run_tiering.R computes FE/RE meta quantities and heterogeneity (I²), then tier assignments using sign and significance rules.
Tier 1/2 outputs constrain candidate features passed to modeling (panel_candidates.txt).

3) Minimal-panel selection discipline

lasso_prune_onefold.py evaluates fixed panel sizes [6, 8, 10, 12, 16, 24] and records AUROC/AUPR/Brier.
Final fold model is refit on the best fold-specific size and exported with coefficients/predictions.

4) Probability calibration and reporting

Calibrated classifiers (CalibratedClassifierCV) are used for fold models.
2_summarize_loso_results.py consolidates fold metrics, coefficients, pooled predictions, ROC, and calibration plots.

Outcome

The implementation now matches the key methodological intent from both transcripts:

retain reproducibility-first screening,
build small interpretable panels,
and evaluate generalization with stricter fold isolation to reduce leakage risk.