Preface

This notebook entry is largely generated by ChatGPT based on the scripts that resulted from a long and successful ChatGPT session. I’ve added links to data files. The work was motivated by previous exploratory work that involved running nf-core/differentialabundance on labeled data from Study1 (referred to as ‘A’ here), Study5 (referred to as ‘B’ here), and a data set that combined the two studies’ data (referred to as ‘C’ here). The original exploration examined the overlap between the three significance-filtered DEG results from the respective nf-core/differentialabundance runs. The present work takes a more comprehensive meta-analysis approach (described technically below) to produce an efficient panel of just six genes, which appears to be a promising basis for a classifier, subject to further validation.

See the README file in the code base for more technical details, including descriptions of two follow-on scripts for plotting and concrete model exploration.

High-Level Overview

This project implements a two-script pipeline to develop a small, robust, and reproducible classifier from oyster gene-expression studies. The pipeline is explicitly designed to separate feature discovery from classifier construction, reducing overfitting across batches (domains).

  • Script 1 (R): Performs a meta-analysis of differential-expression (DE) results across two studies to rank features (genes) by reproducibility, effect consistency, and heterogeneity. Outputs prioritized gene tiers.

  • Script 2 (Python): Trains a sparse classifier using stability selection with L1 logistic regression, validates generalization across batches with Leave-One-Group-Out (LOSO) cross-validation, and selects the minimal panel of genes meeting robustness criteria.


Script 1 (R): Meta-analysis + Tiering

Inputs

  • Per-study DE results: A, B, and combined C (columns: gene_id, log2FC, SE, p, padj, baseMean).
  • Feature universe = intersection of A and B gene IDs.

Outputs

  • meta_ranked.csv: meta-estimates, heterogeneity (I²), reproducibility score.
  • tiers.csv: Tier 1–3 assignments + flags.
  • panel_candidates_tier12.txt: Tier1 ∪ Tier2 shortlist.

Steps (CS/ML analogy)

  1. Feature universe alignment → avoid schema drift.

  2. Meta-estimates → fixed-effect (inverse-variance weighted mean) or random-effect (adds between-study variance); see the sketch after the analogy below.

  3. Heterogeneity (I²) → decide FE vs RE.

  4. Sign consistency → require agreement across A & B.

  5. Scoring function

    score = -log10(p_meta) - λ·I² + bonuses
    
  6. Tiering

    • Tier 1: replicated, low I².
    • Tier 2: weaker but supportive.
    • Tier 3: only combined C supports.

Analogy: Feature screening with cross-domain stability checks.
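
For concreteness, here is a minimal Python/numpy sketch of steps 2–5 (the real Script 1 is in R). It assumes betas and ses are 2 × n_genes arrays of log2FC and SE values from studies A and B, aligned to the shared feature universe; the λ default and the sign-agreement bonus are illustrative placeholders, not the values used in the actual script, which additionally uses I² to decide between FE and RE estimates.

    import numpy as np
    from scipy.stats import norm

    def fixed_effect_meta(betas, ses):
        # Inverse-variance (fixed-effect) meta-estimate; betas/ses are (k studies x n genes).
        w = 1.0 / ses**2
        beta_fe = (w * betas).sum(axis=0) / w.sum(axis=0)
        se_fe = np.sqrt(1.0 / w.sum(axis=0))
        p_meta = 2.0 * norm.sf(np.abs(beta_fe / se_fe))
        return beta_fe, se_fe, p_meta

    def i_squared(betas, ses, beta_fe):
        # Cochran's Q and I^2 = % of variability due to between-study heterogeneity.
        w = 1.0 / ses**2
        q = (w * (betas - beta_fe)**2).sum(axis=0)
        k = betas.shape[0]
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(q > 0, np.maximum(0.0, (q - (k - 1)) / q) * 100.0, 0.0)

    def reproducibility_score(betas, ses, lam=0.01, bonus=1.0):
        # score = -log10(p_meta) - lambda * I^2 + bonus for sign agreement across A and B.
        beta_fe, _, p_meta = fixed_effect_meta(betas, ses)
        i2 = i_squared(betas, ses, beta_fe)
        same_sign = np.sign(betas[0]) == np.sign(betas[1])
        return -np.log10(p_meta) - lam * i2 + np.where(same_sign, bonus, 0.0)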


Script 2 (Python): Classifier Training with Stability Selection + LOSO

Inputs

  • DESEQ2_NORM_all.vst.tsv: VST-normalized expression matrix (genes × samples).
  • rnaseq_diffabundance_study1and5_samplesheet_filled.csv: sample metadata with labels and batch assignments.
  • panel_candidates_tier12.txt: Tier1 ∪ Tier2 shortlist from Script 1 (optional feature restriction).

Outputs

  • stability_selection.csv: per-gene selection frequencies.
  • panel_size_sweep_metrics.csv: performance across candidate panel sizes.
  • final_panel_gene_list.txt: the selected minimal panel.
  • loso_by_heldout.csv: validation metrics per held-out batch.

Steps

  1. Data plumbing → transpose to samples × genes, align to metadata (see the pandas sketch after this list).
  2. Optional restriction → Tier1 ∪ Tier2 features only.
  3. Stability selection → repeated subsampling + sparse logistic regression (L1/EN); see the sketch following the analogy below.
  4. Redundancy pruning → drop features correlated at |r| > 0.9 with an already-kept feature.
  5. Panel-size sweep → evaluate 6–24 genes, pick smallest within ΔAUROC ≤ 0.02 of best.
  6. Calibration → Platt scaling, equal-weight averaging across batches.
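
A minimal pandas sketch of steps 1–2 (data plumbing and the optional tier restriction), placed after the list for readability. File names are taken from the ASCII flow below; the metadata column names ("sample", "condition", "batch") and the label value are assumptions, not the actual samplesheet schema.

    import pandas as pd

    vst = pd.read_csv("DESEQ2_NORM_all.vst.tsv", sep="\t", index_col=0)   # genes x samples
    meta = pd.read_csv("rnaseq_diffabundance_study1and5_samplesheet_filled.csv")

    X = vst.T                                    # transpose to samples x genes
    meta = meta.set_index("sample")              # assumed sample-ID column name
    shared = X.index.intersection(meta.index)    # keep samples present in both tables
    X, meta = X.loc[shared], meta.loc[shared]

    # Step 2: optional restriction to the Script 1 shortlist.
    with open("panel_candidates_tier12.txt") as fh:
        shortlist = [line.strip() for line in fh if line.strip()]
    X = X.loc[:, X.columns.intersection(shortlist)]

    y = (meta["condition"] == "treated").astype(int).values   # assumed label encoding
    groups = meta["batch"].values                              # batch/study label per sample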

Analogy: Sparse model with domain generalization checks.
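
Below is a hedged sklearn sketch of steps 3–4 (stability selection and redundancy pruning). The subsample fraction, regularization strength C, and the choice of plain L1 over elastic net are illustrative knobs; the real script's settings may differ.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)

    def stability_selection(X, y, n_resamples=500, frac=0.5, C=0.1):
        # Fraction of random subsamples in which each gene gets a nonzero L1 coefficient.
        n, p = X.shape
        counts = np.zeros(p)
        for _ in range(n_resamples):
            idx = rng.choice(n, size=int(frac * n), replace=False)
            if len(np.unique(y[idx])) < 2:   # skip degenerate single-class subsamples
                continue
            Xs = StandardScaler().fit_transform(X[idx])
            clf = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
            clf.fit(Xs, y[idx])
            counts += np.abs(clf.coef_[0]) > 1e-8
        return counts / n_resamples

    def prune_redundant(X, ranked, r_max=0.9):
        # Greedy pruning: keep a gene only if |r| <= r_max against every gene already kept.
        kept = []
        for j in ranked:
            if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= r_max for k in kept):
                kept.append(j)
        return kept

    # freqs = stability_selection(X.values, y)
    # candidates = prune_redundant(X.values, np.argsort(freqs)[::-1])
    # The panel-size sweep (step 5) then evaluates the top 6-24 pruned candidates.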


ASCII Flow

Files
│
├── DESEQ2_NORM_all.vst.tsv
├── rnaseq_diffabundance_study1and5_samplesheet_filled.csv
└── DE tables: A, B, C
      ↓
[Script 1: R] Meta-analysis + Tiering
  → meta_ranked.csv
  → tiers.csv
  → panel_candidates_tier12.txt
      ↓
[Script 2: Python] Classifier + Validation
  → stability_selection.csv
  → panel_size_sweep_metrics.csv
  → final_panel_gene_list.txt
  → loso_by_heldout.csv

Glossary

  • VST: Variance-Stabilizing Transform (stabilizes variance across the range of counts).
  • Meta-analysis: Combine effect sizes across studies; FE assumes same effect, RE allows heterogeneity.
  • I²: % of variability due to between-study heterogeneity.
  • LOSO: Leave-One-Group-Out cross-validation (train on 2 batches, test on 1).
  • Stability selection: Feature selection under resampling + sparsity.
  • Calibration (Platt): Adjust scores to calibrated probabilities.
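
To make the last glossary entry concrete, here is a small sketch of Platt scaling: a one-dimensional logistic regression mapping raw panel scores to probabilities. The variable names are illustrative; sklearn's CalibratedClassifierCV(method="sigmoid") wraps the same idea.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def platt_scale(scores, y):
        # Fit p(y=1 | score) = sigmoid(a*score + b); use held-out scores to avoid leakage.
        platt = LogisticRegression()
        platt.fit(np.asarray(scores).reshape(-1, 1), y)
        return lambda s: platt.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]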

Key Risks and Mitigations

  • Data leakage → run selection within folds (see the sketch after this list); treat R shortlist as prior only.
  • Batch imbalance → equal-weight metrics; optional downsampling of large batch A.
  • Overfitting via correlation → redundancy pruning.
  • Parameter sensitivity → document thresholds and rerun sensitivity checks.
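
The first mitigation is the one most easily gotten wrong, so here is a sketch of what "selection within folds" means using sklearn's LeaveOneGroupOut. It reuses the illustrative stability_selection helper sketched earlier and an arbitrary panel size of 12; X, y, and groups are the numpy arrays from the plumbing sketch.

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut
    from sklearn.metrics import roc_auc_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # X: samples x genes array (e.g. X.values); y: labels; groups: batch per sample.
    logo = LeaveOneGroupOut()
    for train, test in logo.split(X, y, groups):
        # All selection happens on the training batches; the held-out batch is scored once.
        freqs = stability_selection(X[train], y[train])
        panel = np.argsort(freqs)[::-1][:12]
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        model.fit(X[train][:, panel], y[train])
        auroc = roc_auc_score(y[test], model.predict_proba(X[test][:, panel])[:, 1])
        print(f"held-out batch {np.unique(groups[test])[0]}: AUROC = {auroc:.3f}")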

Adaptable Knobs

  • Tiering (R): I² cutoff, λ in score, p-value thresholds.
  • Classifier (Python): # resamples, subsample fraction, L1/EN penalty, correlation threshold, panel sizes, ΔAUROC tolerance, whether to restrict to Tier1∪Tier2.
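
One convenient way to keep these knobs auditable is a single configuration block saved alongside the outputs. The names below are illustrative, not the actual parameters of either script, and values not stated in the text (cutoffs, λ, p-value threshold) are placeholders.

    CONFIG = {
        # Script 1 (R) tiering
        "i2_cutoff": 40.0,          # max I^2 (%) for Tier 1 (placeholder)
        "lambda_penalty": 0.01,     # λ in the scoring function (placeholder)
        "padj_threshold": 0.05,     # placeholder
        # Script 2 (Python) classifier
        "n_resamples": 500,
        "subsample_fraction": 0.5,
        "penalty": "l1",            # or elastic net
        "corr_threshold": 0.9,
        "panel_sizes": list(range(6, 25, 2)),
        "delta_auroc_tol": 0.02,
        "restrict_to_tier12": True,
    }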

Runtime Notes

  • R script: seconds–minutes (cheap).
  • Python script: resample-heavy;
    • Tier1∪Tier2 (hundreds of features, ~500 resamples): 5–20 min.
    • Full set (10k+ genes, 1000 resamples): 30–90+ min.
    • Main cost drivers: # resamples, feature count, panel-size sweep.

Equations (Minimal Reference)

  • FE meta:
    β_FE = Σ(β_i / SE_i²) / Σ(1 / SE_i²)
  • I²:
    I² = max(0, (Q - (k-1)) / Q) × 100%
  • L1 logistic:
    min_β [ logloss(Xβ, y) + α‖β‖₁ ]
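
For reference, the α in the L1 objective maps onto sklearn's inverse regularization parameter: sklearn scales the data-fit term by C rather than the penalty by α, so stronger sparsity means smaller C. The C value below is illustrative.

    from sklearn.linear_model import LogisticRegression

    # Equivalent formulation up to constant scaling: larger alpha <-> smaller C.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)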

Additional Considerations

  • Run stability selection inside each LOSO fold for strict no-leakage.
  • Save seeds, configs, hashes for reproducibility.
  • Evaluate calibration with reliability curves, not just AUROC (see the sketch after this list).
  • Optionally benchmark against linear SVMs.
  • If class priors differ by batch, consider prior correction.
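
A sketch of the reliability-curve check mentioned above, using sklearn's calibration_curve; y_true and y_prob stand for held-out labels and calibrated probabilities from the final panel model.

    import matplotlib.pyplot as plt
    from sklearn.calibration import calibration_curve

    # y_true: held-out labels; y_prob: calibrated probabilities for those samples.
    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)

    plt.plot(mean_pred, frac_pos, marker="o", label="panel model")
    plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
    plt.xlabel("Mean predicted probability")
    plt.ylabel("Observed fraction positive")
    plt.legend()
    plt.savefig("reliability_curve.png", dpi=150)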