Pipeline: Stepwise Differential Abundance
Overview
The stepwise differential abundance approach attempts to identify resilience biomarkers by filtering genes in two sequential steps:
- Step 1: Identify stress-responsive genes (control vs. treated)
- Step 2: Identify resistance-associated genes (from Step 1 genes, compare resistant vs. sensitive)
Status: ⚠️ Partially validated - works in some contexts but has critical limitations
When to Use This Approach
✅ Use when:
- You want to identify genes that are both stress-responsive AND differentiate phenotypes
- You have clear control and treatment groups within studies
- Your biological hypothesis is that biomarkers are reactive (induced by stress)
❌ Don't use when:
- You suspect innate biomarkers (constitutively different between resistant/sensitive)
- You have very small gene sets (< 100 genes) after filtering
- Study design doesn't include both controls and treated samples
The Pipeline
Prerequisites
- Per-dataset gene count matrices
- Metadata with phenotype and treatment labels
- Reference genome/transcriptome annotation
Step 1: Identify Stress-Responsive Genes
Goal: Find genes differentially expressed between control and treated samples (regardless of phenotype)
Method: Use DESeq2 or similar differential expression tool
# Pseudocode example
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = metadata,
design = ~ treatment # control vs. treated
)
dds <- DESeq(dds)
results_step1 <- results(dds, alpha = 0.05)
# Filter for significant genes
stress_responsive <- results_step1[results_step1$padj < 0.05, ]
Output: List of stress-responsive genes (typically 50-5000 genes depending on study and threshold)
Step 2: Identify Resistance-Associated Genes
Goal: From Step 1 genes, find those that differentiate resistant from sensitive phenotypes
Method: Subset counts to Step 1 genes, then run differential expression between phenotypes
# Subset to stress-responsive genes only
counts_filtered <- counts[rownames(counts) %in% rownames(stress_responsive), ]
# Compare resistant vs. sensitive on filtered genes
dds2 <- DESeqDataSetFromMatrix(
countData = counts_filtered,
colData = metadata,
design = ~ phenotype # resistant vs. sensitive
)
dds2 <- DESeq(dds2)
results_step2 <- results(dds2, alpha = 0.05)
Output: Candidate resilience biomarkers
Implementation Example: Dataset 1
From Issue #41 and notebook post:
Results:
- Step 1 produced ~50 stress-responsive genes
- Step 2 identified only 1 significant gene
Analysis in: analyses/stepwise_differentialabundance/
Critical Limitations
1. DESeq2 Breaks Down with Small Gene Sets
Small Gene Set Problem
When Step 1 produces < 100 genes, DESeq2's variance stabilizing transformation (VST) becomes unreliable. With only ~50 genes:
- Not enough data for robust dispersion estimation
- VST assumes thousands of genes for stable transformation
- Results may be spurious
Observed: DESeq2 on 50 genes produced weak/unconvincing results
2. Removes Innate Biomarkers
Big Lesson #3: Filtering Out Innate Signals
Biomarkers may exist in control groups if they represent innate resilience (genes constitutively different in resistant vs. sensitive individuals, even before stress).
The stepwise approach removes these by design because Step 1 filters for differential expression between control/treated.
If a gene is expressed at different baseline levels in resistant vs. sensitive oysters but doesn't change with stress, it will be filtered out in Step 1.
See Issue #53 and innate gene expression analysis
3. Study-Specific Effects Still Dominate
Even with stepwise filtering:
- Batch effects persist
- Per-dataset analysis required (not cross-study integration)
- Results may not generalize to other studies
Decision Tree: Should You Use This Pipeline?
graph TD
A[Start: Biomarker Discovery Goal] --> B{Do you have control<br/>and treated groups?}
B -->|No| C[Use classifier approach]
B -->|Yes| D{Do you believe biomarkers<br/>are reactive to stress?}
D -->|No / Unsure| C
D -->|Yes| E{Will Step 1 yield<br/>> 100 genes?}
E -->|Unsure / No| C
E -->|Yes| F[Use stepwise approach]
F --> G{Does Step 2 yield<br/>convincing results?}
G -->|No| C
G -->|Yes| H[Validate with LOSO]
C --> I[Two-Step Classifier Pipeline]
Alternative: Two-Step Classifier
If the stepwise approach doesn't work for your data, consider the Two-Step Classifier, which:
- Doesn't filter out innate biomarkers
- Handles any dataset design (doesn't require controls)
- Uses reproducibility scoring instead of sequential filtering
- Produced the validated 6-gene panel in this project
Validation Requirements
If you proceed with stepwise approach:
- Within-study validation: Can the genes predict phenotypes in held-out samples from the same study?
- Cross-study validation: Use Leave-One-Study-Out (LOSO) - see Validation & Pitfalls
- Biological validation: Are genes functionally related to stress response pathways?
Related Resources
- Issue #41: Stepwise approach
- Notebook: Stepwise on dataset 1
- Issue #53: Innate vs. reactive
- nf-core/differentialabundance docs
Next: See the more successful Two-Step Classifier approach, or learn about Validation & Pitfalls