Validation & Pitfalls

Overview

Validation is critical for ensuring biomarker panels generalize beyond the discovery cohort. This guide covers validation strategies and common pitfalls encountered in this project, with emphasis on what actually went wrong and how to avoid it.

Validation Strategies

1. Leave-One-Study-Out (LOSO) Cross-Validation

What it is: Train the classifier on all datasets except one, then test it on the held-out dataset. Repeat until every dataset has served as the test set once.

Why it's essential:

Tests cross-study generalization
Avoids overfitting to study-specific effects
Simulates real-world scenario: predict phenotypes in new, unseen studies

Implementation:

datasets = [dataset1, dataset2, dataset3, dataset4, dataset5]

for test_idx, test_dataset in enumerate(datasets):
    # Training set: all except test_dataset
    train_datasets = [ds for i, ds in enumerate(datasets) if i != test_idx]

    X_train = combine_expression_data(train_datasets, gene_panel)
    y_train = combine_labels(train_datasets)

    # Fit classifier
    classifier.fit(X_train, y_train)

    # Test on held-out study
    X_test = test_dataset.expression[gene_panel]
    y_test = test_dataset.labels

    accuracy = classifier.score(X_test, y_test)
    print(f"LOSO Fold {test_idx+1} (test={test_dataset.name}): {accuracy:.3f}")
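The loop above relies on project helpers (combine_expression_data, combine_labels) that are not shown here. As a self-contained sketch of the same idea, the example below uses scikit-learn's LeaveOneGroupOut on synthetic data. The data, the study sizes, and the choice of logistic regression are assumptions for illustration only, not the project's actual classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 5 studies x 20 samples, 30 panel genes.
rng = np.random.default_rng(0)
n_studies, n_per_study, n_genes = 5, 20, 30
study = np.repeat(np.arange(n_studies), n_per_study)   # study ID per sample
y = np.tile([0, 1], n_studies * n_per_study // 2)      # balanced phenotype labels
X = rng.normal(size=(study.size, n_genes))
X[y == 1, :5] += 1.0                                   # weak trait signal in 5 genes
X += study[:, None] * 0.5                              # study-specific shift

# Scaling lives inside the pipeline, so it is refit on the training studies of each fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

fold_accuracy = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=study):
    model.fit(X[train_idx], y[train_idx])
    held_out = int(study[test_idx][0])
    fold_accuracy[held_out] = model.score(X[test_idx], y[test_idx])

for held_out, acc in fold_accuracy.items():
    print(f"LOSO fold (test=study {held_out}): {acc:.3f}")
print(f"Mean accuracy: {np.mean(list(fold_accuracy.values())):.3f}")
```

Keeping preprocessing inside the pipeline means nothing from the held-out study influences the fitted model.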

Interpretation:

✅ Good: Consistent accuracy (e.g., 0.80-0.90) across all folds
⚠️ Warning: One fold clearly lower than the others → investigate study-specific effects in the held-out study
❌ Poor: High variance in accuracy across folds, or mean accuracy < 0.70 → the panel does not generalize
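These rules of thumb can be turned into a small helper. The sketch below is one possible reading: the 0.70 mean threshold comes from the text, while max_sd and outlier_gap are illustrative assumptions that should be tuned to the project.

```python
from statistics import mean, stdev

def summarize_loso(fold_accuracy, poor_mean=0.70, max_sd=0.10, outlier_gap=0.10):
    """Classify LOSO results as 'good', 'warning', or 'poor'.

    poor_mean follows the text; max_sd and outlier_gap are assumed cut-offs.
    """
    values = list(fold_accuracy.values())
    avg, sd = mean(values), stdev(values)
    # Folds well below the mean point to study-specific effects.
    low_folds = [name for name, acc in fold_accuracy.items() if avg - acc > outlier_gap]
    if avg < poor_mean or sd > max_sd:
        verdict = "poor"
    elif low_folds:
        verdict = "warning"
    else:
        verdict = "good"
    return {"mean": avg, "sd": sd, "low_folds": low_folds, "verdict": verdict}

print(summarize_loso({"study1": 0.85, "study2": 0.88, "study3": 0.82, "study4": 0.86}))
```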

Big Lesson #4: Training Set in Test Set

Never evaluate cross-study performance on a study that was also used for training. Doing so is data leakage and produces falsely optimistic results.

✅ Correct: LOSO (train on studies 1,2,3; test on study 4)
❌ Wrong: Train on studies 1,2,3,4; test on studies 1,2,3,4
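A cheap guard against this mistake is to check that the training and test study lists are disjoint before fitting. The helper name below is hypothetical, not part of the project code.

```python
def check_no_study_overlap(train_studies, test_studies):
    """Raise if any study appears in both the training and the test split."""
    overlap = set(train_studies) & set(test_studies)
    if overlap:
        raise ValueError(f"Studies in both train and test: {sorted(overlap)}")

# Passes silently: the held-out study is not part of training.
check_no_study_overlap(["study1", "study2", "study3"], ["study4"])
```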

Related: Issue #44

2. Within-Study Cross-Validation

What it is: Hold out samples within a single study (k-fold or train/test split)

When to use:

Initial model development and parameter tuning
Assessing overfitting within a study
When you only have one study available (but be cautious about generalizability)

Limitations:

Does NOT test cross-study generalization
May overestimate performance if study-specific effects are strong
Use as a first step, but follow with LOSO
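For reference, a minimal within-study sketch using stratified k-fold cross-validation in scikit-learn. The synthetic single-study data and the logistic regression model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic single study (assumption): 40 samples, 25 genes, 3 informative genes.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 25))
X[y == 1, :3] += 1.0

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Within-study CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A good score here says nothing yet about other studies; it only shows the model is not grossly overfit within this one.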

3. Independent Validation Cohort

Gold standard: Test on completely independent dataset never used during development

Challenges:

Requires finding additional suitable datasets
May have different experimental designs or phenotype definitions
Labor-intensive to process and integrate

Status in this project: Issue #54 (find more datasets) - postponed

Common Pitfalls

1. Data Leakage

Definition: Information from the test set "leaks" into the training process, causing overly optimistic performance estimates

Leakage Scenario 1: Normalization Across Training+Test

❌ Wrong:

# Normalize all data together
all_data_normalized = vst_transform(combine(train_data, test_data))

# Then split
X_train = all_data_normalized[train_indices]
X_test = all_data_normalized[test_indices]  # LEAKAGE!

✅ Correct:

# Fit normalization on training data only
vst_params = fit_vst(train_data)

# Apply to train and test separately
X_train = apply_vst(train_data, vst_params)
X_test = apply_vst(test_data, vst_params)
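fit_vst and apply_vst above are placeholders. The same fit-on-train, apply-to-both pattern can be shown in runnable form with per-gene standardization, which is used here only as a simple stand-in for VST. The function names and the synthetic data are assumptions.

```python
import numpy as np

def fit_scaler(train):
    """Learn per-gene mean and standard deviation from training samples only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0, ddof=1)
    std[std == 0] = 1.0          # avoid division by zero for constant genes
    return mean, std

def apply_scaler(data, params):
    """Apply previously fitted parameters without re-estimating anything."""
    mean, std = params
    return (data - mean) / std

rng = np.random.default_rng(0)
train_data = rng.normal(loc=5.0, scale=2.0, size=(30, 4))
test_data = rng.normal(loc=6.0, scale=2.0, size=(10, 4))   # shifted "new study"

params = fit_scaler(train_data)            # test data is never seen here
X_train = apply_scaler(train_data, params)
X_test = apply_scaler(test_data, params)

print("train gene means:", X_train.mean(axis=0).round(3))
print("test gene means: ", X_test.mean(axis=0).round(3))
```

The test means are not expected to be zero: any shift in the new study stays visible instead of being normalized away with knowledge of the test set.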

Leakage Scenario 2: Feature Selection on All Data

❌ Wrong:

# Select features using all data
selected_genes = select_top_degs(all_data, phenotypes)  # LEAKAGE!

# Then split for training/testing
X_train = train_data[selected_genes]
X_test = test_data[selected_genes]

✅ Correct:

# Select features using training data only
selected_genes = select_top_degs(train_data, train_phenotypes)

# Apply to test
X_train = train_data[selected_genes]
X_test = test_data[selected_genes]
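In a LOSO setting this means feature selection is repeated inside every fold. The sketch below assumes a simple Welch-style t statistic as a stand-in for select_top_degs and uses synthetic data; it is not the project's DEG method.

```python
import numpy as np

def select_top_genes(X, y, k):
    """Rank genes by a Welch-style t statistic and return indices of the top k."""
    a, b = X[y == 0], X[y == 1]
    diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = np.abs(diff) / se
    return np.argsort(t)[::-1][:k]

# Synthetic data (assumption): 4 studies x 12 samples, 200 genes, 10 with signal.
rng = np.random.default_rng(0)
study = np.repeat(np.arange(4), 12)
y = np.tile([0, 1], 24)
X = rng.normal(size=(48, 200))
X[y == 1, :10] += 1.5

for held_out in np.unique(study):
    train = study != held_out
    # Selection sees training studies only; the held-out study is just subset.
    genes = select_top_genes(X[train], y[train], k=10)
    X_train, X_test = X[train][:, genes], X[~train][:, genes]
    n_true = int(np.sum(genes < 10))
    print(f"held-out study {held_out}: {n_true}/10 selected genes carry true signal")
```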

In this project: LOSO avoids leakage by keeping the test study completely separate during training. This holds only if normalization and feature selection are also fit on the training studies alone.

2. Batch Effects

Definition: Non-biological variation between studies due to technical differences (sequencing platform, library prep, time, lab)

Big Lessons #1 & #2: When Batch Effects Dominate

Study-specific effects were consistently stronger than trait effects
Attempted corrections (ComBat, removeBatchEffect) had little effect on trait separation
Pivot: Abandoned the pooled (integrated) analysis in favor of post-data integration, where each study is analyzed separately and results are combined afterwards

Observations:

PCA plots showed clustering by study, not by phenotype
Trait-based separation was weak even after batch correction
Integration worked only when combining very similar studies (e.g., study 4 injected + study 5)
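Whether a principal component tracks study or phenotype can also be quantified rather than judged by eye. The sketch below computes the fraction of PC1 variance explained by each label on synthetic data with a deliberately strong study shift. The data and both helper functions are assumptions for illustration.

```python
import numpy as np

def pc_scores(X, n_components=2):
    """Project samples (rows) onto the top principal components via SVD."""
    centered = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def r_squared(values, labels):
    """Fraction of variance in `values` explained by group membership."""
    total = np.sum((values - values.mean()) ** 2)
    between = sum(
        np.sum(labels == g) * (values[labels == g].mean() - values.mean()) ** 2
        for g in np.unique(labels)
    )
    return between / total

# Synthetic data (assumption): 2 studies, balanced trait within each study.
rng = np.random.default_rng(0)
study = np.repeat([0, 1], 20)
trait = np.tile([0, 1], 20)
X = rng.normal(size=(40, 50))
X[study == 1] += 3.0          # strong study effect on all genes
X[trait == 1, :5] += 0.5      # weak trait effect on 5 genes

pc1 = pc_scores(X)[:, 0]
r2_study, r2_trait = r_squared(pc1, study), r_squared(pc1, trait)
print(f"PC1 variance explained by study: {r2_study:.2f}, by trait: {r2_trait:.2f}")
```

A study value near 1 with a trait value near 0 corresponds to the pattern described above, where PCA clusters by study rather than phenotype.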

Batch Correction Attempts

Tried:

Issue #18: ComBat and removeBatchEffect

Outcome: ❌ Insufficient improvement

Lesson: Batch correction is not a panacea. When study effects are much stronger than trait effects, correction may not recover the trait signal, as was the case here.

Post-Data Integration Solution

Instead of correcting batches and pooling:

Analyze each study independently
Identify genes significant in multiple studies
Build classifier on reproducible genes

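Advantage: Each study is analyzed on its own, so between-study batch effects never enter a single model, and requiring reproducibility across studies filters out study-specific signals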
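The second step, finding genes significant in multiple studies, reduces to counting. A minimal sketch, assuming each study's significant genes are already available as a set; the gene and study names are made up, and direction of effect is ignored for brevity.

```python
from collections import Counter

def reproducible_genes(sig_by_study, min_studies=2):
    """Return genes significant in at least `min_studies` studies, with counts."""
    counts = Counter(gene for genes in sig_by_study.values() for gene in set(genes))
    return {gene: n for gene, n in counts.items() if n >= min_studies}

sig_by_study = {
    "study1": {"geneA", "geneB", "geneC"},
    "study2": {"geneB", "geneC", "geneD"},
    "study3": {"geneC", "geneE"},
}
panel = reproducible_genes(sig_by_study, min_studies=2)
print(sorted(panel.items()))  # [('geneB', 2), ('geneC', 3)]
```

In practice one would probably also require a consistent direction of change across studies before keeping a gene.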

Related: Timeline - February 2025

3. Normalization Timing & Method

Understanding the Pipeline

In nf-core/differentialabundance:

Raw counts are input
PCAs are generated before differential abundance analysis (on raw or lightly filtered counts)
Normalization (VST, rlog, TPM) happens during DESeq2 analysis
Batch correction (if enabled) is applied post-normalization

Normalization Insight from July 2025

Understanding when normalization happens is critical for interpreting QC plots:

Initial PCAs show raw/lightly-processed data
Don't expect trait separation in initial PCAs if normalization hasn't been applied
Post-analysis PCAs on normalized data are more informative

Related: Issue #34, Timeline - July 2025

Normalization Methods

VST (Variance-Stabilizing Transformation):

Recommended for visualization and clustering
Requires sufficient genes (> 1000) for stable estimation; DESeq2's vst() by default fits its dispersion trend on a subset of 1000 genes (nsub)
Breaks down with small gene sets (< 100)

VST with Small Gene Sets

In the stepwise approach, Step 1 filtering left only ~50 genes. With so few genes, DESeq2 cannot reliably estimate the dispersion-mean trend that VST depends on, so the transformation was not trustworthy.

Symptom: Unreliable dispersion estimates, poor model fit
Solution: Use an alternative normalization (TMM, TPM) or abandon the stepwise approach
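As one illustration of the TPM alternative, the sketch below converts a samples x genes count matrix to TPM with NumPy. The counts and gene lengths are hypothetical. Because TPM is a within-sample proportion, it is probably safest to compute it on the full gene set before filtering down to a small panel.

```python
import numpy as np

def counts_to_tpm(counts, gene_lengths_bp):
    """Convert a samples x genes count matrix to transcripts per million."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float) / 1000.0
    rate = counts / lengths_kb                       # reads per kilobase
    return rate / rate.sum(axis=1, keepdims=True) * 1e6

counts = np.array([[100, 200, 50], [10, 400, 90]])   # 2 samples x 3 genes
gene_lengths_bp = np.array([1000, 2000, 500])        # hypothetical lengths
tpm = counts_to_tpm(counts, gene_lengths_bp)
log_tpm = np.log2(tpm + 1)                           # common log transform before modelling
print(tpm.round(1))
```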

Related: Stepwise Pipeline

4. Oversimplified Phenotype Definitions

Big Lesson #2: Traits Were Oversimplified

Initial approach used generalized trait labels across studies:

"Stress" vs. "Control"
"Resistant" vs. "Sensitive"

Problem: Different studies measured different stressors with different designs:

Study 1: Disease challenge, time point A
Study 2: Temperature stress, time point B
Study 3: Disease challenge, time point C, different infection route

Result: Trait effects were much weaker than study-specific effects

Solution: Use more specific, harmonized phenotype definitions

Compare only studies with similar experimental designs
Within-study trait comparisons first, then assess reproducibility across studies
Document and respect phenotype heterogeneity

5. Small Sample Sizes

Challenges:

Insufficient power to detect true DEGs
High variance in estimates
Overfitting risk (model learns noise, not signal)

Mitigations:

Use regularized models (LASSO, elastic net) to reduce overfitting
Focus on genes reproducible across studies (increases effective n)
Report confidence intervals, not just point estimates
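For accuracy on a small test set, a Wilson score interval is one reasonable way to report uncertainty. The sketch below is a plain-Python implementation; the choice of the Wilson method is an assumption, not something prescribed by the project.

```python
from math import sqrt

def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half = z * sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    return center - half, center + half

low, high = wilson_interval(8, 10)
print(f"accuracy 0.80 (8/10), 95% CI: {low:.2f}-{high:.2f}")  # 0.49-0.94
```

With only 10 test samples, an observed accuracy of 0.80 is compatible with a true accuracy anywhere from about 0.49 to 0.94, which is why point estimates alone can mislead.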

In this project: Combined datasets (e.g., Issue #34 merged study 4 + 5) to increase sample size

6. Innate vs. Reactive Biomarkers

Big Lesson #3: Biomarkers in Controls

Biomarkers may differ constitutively (innate) between resistant and sensitive individuals, even in the absence of stress.

Implication: If you first filter for stress-responsive genes (control vs. treated), you will remove innate biomarkers.

Detection:

Compare resistant vs. sensitive in control samples only
If significant → innate biomarker
If not significant in controls but significant in treated → reactive biomarker
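The decision rule above can be written as a small function. This sketch assumes the resistant vs. sensitive comparison has already been run separately in control and treated samples and that adjusted p-values are available; the function name, gene names, and alpha are illustrative.

```python
def classify_biomarker(p_control, p_treated, alpha=0.05):
    """Label a gene from resistant-vs-sensitive tests in control and treated samples.

    p_control and p_treated are assumed to be multiple-testing-adjusted p-values.
    """
    if p_control < alpha:
        return "innate"            # already differs without stress
    if p_treated < alpha:
        return "reactive"          # differs only under stress
    return "not a biomarker"

results = {
    "geneA": classify_biomarker(p_control=0.001, p_treated=0.002),
    "geneB": classify_biomarker(p_control=0.40, p_treated=0.003),
    "geneC": classify_biomarker(p_control=0.60, p_treated=0.55),
}
print(results)
```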

Pipeline choice:

✅ Two-step classifier preserves innate biomarkers
❌ Stepwise approach removes innate biomarkers by design

Related:

Validation Checklist

Before claiming a validated biomarker panel:

[ ] Cross-study validation: LOSO accuracy > 70% (ideally > 80%)
[ ] No data leakage: Normalization, feature selection, and hyperparameter tuning are fit on the training studies only, separately within each fold
[ ] Batch effect assessment: Confirmed genes are reproducible across studies, not driven by batch
[ ] Sample size adequacy: Each phenotype group has n > 10 per study (if possible)
[ ] Phenotype definition: Clear, specific, and documented; comparable across studies used
[ ] Innate/reactive characterization: Assessed whether biomarkers are pre-existing or induced
[ ] Biological validation: Genes have known or plausible functional roles in stress response
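Some checklist items can be checked mechanically. As an example, the sample size item (n > 10 per phenotype group per study) might be verified as follows, assuming sample metadata is available as a list of dicts with hypothetical "study" and "phenotype" keys.

```python
from collections import Counter

def undersized_groups(samples, min_n=11):
    """Return (study, phenotype) groups with fewer than `min_n` samples."""
    counts = Counter((s["study"], s["phenotype"]) for s in samples)
    return {group: n for group, n in counts.items() if n < min_n}

samples = (
    [{"study": "study1", "phenotype": "resistant"}] * 12
    + [{"study": "study1", "phenotype": "sensitive"}] * 8
    + [{"study": "study2", "phenotype": "resistant"}] * 11
    + [{"study": "study2", "phenotype": "sensitive"}] * 15
)
print(undersized_groups(samples))  # {('study1', 'sensitive'): 8}
```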

Troubleshooting Guide

Poor LOSO Performance

Symptom: Accuracy < 0.70 or high variance across folds

Possible causes:

Study-specific effects too strong → Filter datasets or use more stringent reproducibility criteria in Step 1
Phenotype heterogeneity → Ensure comparable phenotype definitions
Small sample sizes → Combine compatible studies or seek additional data
Overfitting → Increase regularization penalty or reduce feature set

Genes Don't Validate

Symptom: DEGs from discovery cohort not significant in validation cohort