Preface
This notebook entry was largely generated by ChatGPT from the scripts that resulted from a long and successful ChatGPT session. I’ve added links to data files. The work was motivated by previous exploratory work that ran nf-core/differentialabundance on labeled data from Study1 (referred to as ‘A’ here), Study5 (referred to as ‘B’ here), and a data set combining the two studies’ data (referred to as ‘C’ here). The original exploration studied the overlap between the three significance-filtered DEG results from the respective nf-core/differentialabundance runs. The present work takes a more comprehensive meta-analysis approach (described technically below) to produce an efficient panel of just six genes, which appears to be a promising basis for a classifier, subject to further validation.
See the README file in the code base for more technical details, including descriptions of two follow-on scripts for plotting and concrete model exploration.
High-Level Overview
This project implements a two-script pipeline to develop a small, robust, and reproducible classifier from oyster gene-expression studies. The pipeline is explicitly designed to separate feature discovery from classifier construction, reducing overfitting across batches (domains).
- Script 1 (R): Performs a meta-analysis of differential-expression (DE) results across two studies to rank features (genes) by reproducibility, effect consistency, and heterogeneity. Outputs prioritized gene tiers.
- Script 2 (Python): Trains a sparse classifier using stability selection with L1 logistic regression, validates generalization across batches using Leave-One-Group-Out (LOSO) cross-validation, and selects the minimal panel of genes meeting robustness criteria.
Script 1 (R): Meta-analysis + Tiering
Inputs
- Per-study DE results for A, B, and the combined C (columns: gene_id, log2FC, SE, p, padj, baseMean).
- Feature universe = intersection of A and B gene IDs.
Outputs
- meta_ranked.csv: meta-estimates, heterogeneity (I²), reproducibility score.
- tiers.csv: Tier 1–3 assignments + flags.
- panel_candidates_tier12.txt: Tier 1 ∪ Tier 2 shortlist.
Steps (CS/ML analogy)
- Feature universe alignment → avoid schema drift.
- Meta-estimates → Fixed Effect (weighted mean) or Random Effect (adds between-study variance).
- Heterogeneity (I²) → decide FE vs RE.
- Sign consistency → require agreement across A & B.
- Scoring function: score = -log10(p_meta) − λ·I² + bonuses
- Tiering →
  - Tier 1: replicated, low I².
  - Tier 2: weaker but supportive.
  - Tier 3: only combined C supports.

Analogy: Feature screening with cross-domain stability checks (the meta-analysis math is sketched below).
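To make the meta-estimate, heterogeneity, and scoring steps concrete, here is a minimal Python sketch of the same math (Script 1 itself is R; meta_gene, lambda_het, and the omitted bonus terms are illustrative assumptions, not the script's actual names or settings):

```python
import numpy as np
from scipy.stats import norm

def meta_gene(beta_a, se_a, beta_b, se_b, lambda_het=1.0):
    """FE/RE meta-estimate, I², and reproducibility score for one gene."""
    betas = np.array([beta_a, beta_b])              # per-study log2FC
    ses = np.array([se_a, se_b])                    # per-study SE
    w = 1.0 / ses**2                                # inverse-variance weights
    beta_fe = np.sum(w * betas) / np.sum(w)         # fixed-effect estimate
    se_fe = np.sqrt(1.0 / np.sum(w))
    q = np.sum(w * (betas - beta_fe) ** 2)          # Cochran's Q
    k = len(betas)
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0
    # DerSimonian-Laird random-effects variant, used when I² is high:
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    beta_re = np.sum(betas / (ses**2 + tau2)) / np.sum(1.0 / (ses**2 + tau2))
    p_meta = 2.0 * norm.sf(abs(beta_fe / se_fe))    # FE z-test
    score = -np.log10(p_meta) - lambda_het * (i2 / 100.0)  # bonuses omitted
    return {"beta_fe": beta_fe, "beta_re": beta_re, "I2": i2, "p_meta": p_meta,
            "sign_consistent": bool(np.sign(beta_a) == np.sign(beta_b)),
            "score": score}
```

With only two studies, Q has a single degree of freedom (k − 1 = 1), so I² jumps from 0 as soon as the two log2FC estimates disagree beyond their standard errors.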
Script 2 (Python): Classifier Training with Stability Selection + LOSO
Inputs
- Expression: DESEQ2_NORM_all.vst.tsv (genes × samples, VST).
- Metadata: rnaseq_diffabundance_study1and5_samplesheet_filled.csv.
- Optional: panel_candidates_tier12.txt.
Outputs
- stability_selection.csv: feature stability probabilities.
- panel_size_sweep_metrics.csv: metrics vs panel size.
- final_panel_gene_list.txt: chosen set.
- loso_by_heldout.csv: batch-wise performance.
Steps
- Data plumbing → transpose to samples × genes, align to metadata.
- Optional restriction → Tier1 ∪ Tier2 features only.
- Stability selection → repeated subsampling + sparse logistic regression (L1/EN); see the sketch after this list.
- Redundancy pruning → remove one of each feature pair with |r| > 0.9.
- Panel-size sweep → evaluate 6–24 genes, pick smallest within ΔAUROC ≤ 0.02 of best.
- Calibration → Platt scaling, equal-weight averaging across batches.
Analogy: Sparse model with domain generalization checks.
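A hedged sketch of the stability-selection and pruning steps, assuming X is a samples × genes NumPy array of VST values and y a binary label array; stability_select, prune_redundant, and all defaults are illustrative, not the script's actual identifiers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def stability_select(X, y, n_resamples=500, subsample_frac=0.7, C=0.1, seed=0):
    """Per-gene selection frequency under repeated subsampling + L1 logistic."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    hits = np.zeros(p)
    for _ in range(n_resamples):
        # In practice subsamples should be stratified by class and batch.
        idx = rng.choice(n, size=int(subsample_frac * n), replace=False)
        Xs = StandardScaler().fit_transform(X[idx])
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(Xs, y[idx])
        hits += (np.abs(clf.coef_[0]) > 1e-8)   # count nonzero coefficients
    return hits / n_resamples

def prune_redundant(X, order, r_max=0.9):
    """One possible (greedy) pruning: walk genes in priority order and keep a
    gene only if its |r| with every already-kept gene stays at or below r_max."""
    kept = []
    for j in order:
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= r_max for k in kept):
            kept.append(j)
    return kept
```

Ranking genes by the returned frequency, pruning, and then sweeping panel sizes k = 6–24 yields the selection rule above: keep the smallest k whose AUROC is within 0.02 of the best.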
ASCII Flow
Files
│
├── DESEQ2_NORM_all.vst.tsv
├── rnaseq_diffabundance_study1and5_samplesheet_filled.csv
└── DE tables: A, B, C
↓
[Script 1: R] Meta-analysis + Tiering
→ meta_ranked.csv
→ tiers.csv
→ panel_candidates_tier12.txt
↓
[Script 2: Python] Classifier + Validation
→ stability_selection.csv
→ panel_size_sweep_metrics.csv
→ final_panel_gene_list.txt
→ loso_by_heldout.csv
Glossary
- VST: Variance-Stabilizing Transform (normalize variance across counts).
- Meta-analysis: Combine effect sizes across studies; FE assumes same effect, RE allows heterogeneity.
- I²: percentage of variability in effect estimates attributable to between-study heterogeneity rather than chance.
- LOSO: Leave-One-Group-Out cross-validation (train on 2 batches, test on 1).
- Stability selection: Feature selection under resampling + sparsity.
- Calibration (Platt): Adjust raw scores to calibrated probabilities (a LOSO + Platt sketch follows this glossary).
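To illustrate the LOSO and Platt entries, a minimal sketch assuming batch labels live in a groups array aligned with the rows of X (all names here are assumptions):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score

def loso_auroc(X, y, groups):
    """Hold out one batch at a time; Platt-calibrate on the training batches."""
    per_batch = {}
    for train, test in LeaveOneGroupOut().split(X, y, groups):
        base = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
        # Platt scaling = sigmoid calibration fit on inner CV folds of the
        # training batches, so the held-out batch is never touched.
        model = CalibratedClassifierCV(base, method="sigmoid", cv=3)
        model.fit(X[train], y[train])
        prob = model.predict_proba(X[test])[:, 1]
        # Assumes each held-out batch contains both classes.
        per_batch[groups[test][0]] = roc_auc_score(y[test], prob)
    # Equal-weight average over held-out batches, not over pooled samples:
    return per_batch, float(np.mean(list(per_batch.values())))
```

Averaging per-batch scores with equal weight keeps the larger batch from dominating the headline metric, matching the batch-imbalance mitigation below.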
Key Risks and Mitigations
- Data leakage → run selection within folds; treat the R shortlist as a prior only (see the nesting sketch after this list).
- Batch imbalance → equal-weight metrics; optional downsampling of large batch A.
- Overfitting via correlation → redundancy pruning.
- Parameter sensitivity → document thresholds and rerun sensitivity checks.
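The leakage mitigation can be made mechanical by re-running selection inside every fold; a sketch reusing the illustrative helpers from the earlier blocks:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def nested_loso_panels(X, y, groups, panel_size=6):
    """Derive the gene panel from training batches only, once per LOSO fold,
    so the held-out batch never influences feature selection."""
    for train, test in LeaveOneGroupOut().split(X, y, groups):
        freq = stability_select(X[train], y[train])  # selection on train only
        order = np.argsort(freq)[::-1]               # rank genes by stability
        panel = prune_redundant(X[train], order)[:panel_size]
        yield train, test, panel                     # fit/score downstream
```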
Adaptable Knobs
- Tiering (R): I² cutoff, λ in score, p-value thresholds.
- Classifier (Python): # resamples, subsample fraction, L1/EN penalty, correlation threshold, panel sizes, ΔAUROC tolerance, whether to restrict to Tier1∪Tier2 (an example config follows).
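One convenient pattern is to collect all of these knobs in a single config; names and defaults below are illustrative, not the scripts' actual parameters:

```python
# Hypothetical config gathering the tunable knobs listed above.
CONFIG = {
    # Script 1 (R) tiering
    "i2_cutoff": 50.0,         # max I² (%) tolerated for Tier 1
    "lambda_het": 1.0,         # λ penalty on I² in the score
    "p_meta_max": 0.05,        # meta p-value threshold
    # Script 2 (Python) classifier
    "n_resamples": 500,        # stability-selection resamples
    "subsample_frac": 0.7,     # fraction of samples per resample
    "penalty": "l1",           # or elastic net
    "r_max": 0.9,              # redundancy-pruning correlation cutoff
    "panel_sizes": range(6, 25),
    "delta_auroc": 0.02,       # tolerance vs best panel in the sweep
    "restrict_tier12": True,   # limit features to Tier1 ∪ Tier2
}
```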
Runtime Notes
- R script: seconds–minutes (cheap).
- Python script: resample-heavy:
- Tier1∪Tier2 (hundreds of features, ~500 resamples): 5–20 min.
- Full set (10k+ genes, 1000 resamples): 30–90+ min.
- Main cost drivers: # resamples, feature count, panel-size sweep.
Equations (Minimal Reference)
- FE meta-estimate: β_FE = Σᵢ (βᵢ / SEᵢ²) / Σᵢ (1 / SEᵢ²)
- Heterogeneity: I² = max(0, (Q − (k − 1)) / Q) × 100%, where Q = Σᵢ wᵢ (βᵢ − β_FE)² is Cochran's Q, wᵢ = 1/SEᵢ², and k is the number of studies.
- L1 logistic: min_β [ logloss(Xβ, y) + α‖β‖₁ ]
Additional Considerations
- Run stability selection inside each LOSO fold for strict no-leakage.
- Save seeds, configs, hashes for reproducibility.
- Evaluate calibration with reliability curves, not just AUROC (see the sketch after this list).
- Optionally benchmark against linear SVMs.
- If class priors differ by batch, consider prior correction.
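A minimal sketch of the reliability-curve check, assuming y_true and y_prob are collected across LOSO folds (names are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

def reliability_plot(y_true, y_prob, n_bins=10):
    """Plot observed vs predicted positive rates; a well-calibrated model
    hugs the diagonal even when its AUROC is unchanged."""
    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    plt.plot(mean_pred, frac_pos, marker="o", label="model")
    plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
    plt.xlabel("Mean predicted probability")
    plt.ylabel("Observed fraction positive")
    plt.legend()
    plt.show()
```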