Pipeline: Stepwise Differential Abundance

Overview

The stepwise differential abundance approach attempts to identify resilience biomarkers by filtering genes in two sequential steps:

Step 1: Identify stress-responsive genes (control vs. treated)
Step 2: Identify resistance-associated genes (from Step 1 genes, compare resistant vs. sensitive)

Status: ⚠️ Partially validated - works in some contexts but has critical limitations

When to Use This Approach

✅ Use when:

You want to identify genes that are both stress-responsive AND differentiate phenotypes
You have clear control and treatment groups within studies
Your biological hypothesis is that biomarkers are reactive (induced by stress)

❌ Don’t use when:

You suspect innate biomarkers (constitutively different between resistant/sensitive)
You have very small gene sets (< 100 genes) after filtering
Study design doesn’t include both controls and treated samples

The Pipeline

Prerequisites

Per-dataset gene count matrices
Metadata with phenotype and treatment labels
Reference genome/transcriptome annotation

Step 1: Identify Stress-Responsive Genes

Goal: Find genes differentially expressed between control and treated samples (regardless of phenotype)

Method: Use DESeq2 or similar differential expression tool

# Pseudocode example
dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData = metadata,
  design = ~ treatment  # control vs. treated
)
dds <- DESeq(dds)
results_step1 <- results(dds, alpha = 0.05)

# Filter for significant genes
stress_responsive <- results_step1[results_step1$padj < 0.05, ]

Output: List of stress-responsive genes (typically 50-5000 genes depending on study and threshold)

Step 2: Identify Resistance-Associated Genes

Goal: From Step 1 genes, find those that differentiate resistant from sensitive phenotypes

Method: Subset counts to Step 1 genes, then run differential expression between phenotypes

# Subset to stress-responsive genes only
counts_filtered <- counts[rownames(counts) %in% rownames(stress_responsive), ]

# Compare resistant vs. sensitive on filtered genes
dds2 <- DESeqDataSetFromMatrix(
  countData = counts_filtered,
  colData = metadata,
  design = ~ phenotype  # resistant vs. sensitive
)
dds2 <- DESeq(dds2)
results_step2 <- results(dds2, alpha = 0.05)

Output: Candidate resilience biomarkers

Implementation Example: Dataset 1

From Issue #41 and notebook post:

Results:

Step 1 produced ~50 stress-responsive genes
Step 2 identified only 1 significant gene

Analysis in: analyses/stepwise_differentialabundance/

Critical Limitations

1. DESeq2 Breaks Down with Small Gene Sets

!!! danger “Small Gene Set Problem” When Step 1 produces < 100 genes, DESeq2’s variance stabilizing transformation (VST) becomes unreliable. With only ~50 genes:

- Not enough data for robust dispersion estimation
- VST assumes thousands of genes for stable transformation
- Results may be spurious

Observed: DESeq2 on 50 genes produced weak/unconvincing results

2. Removes Innate Biomarkers

!!! failure “Big Lesson #3: Filtering Out Innate Signals” Biomarkers may exist in control groups if they represent innate resilience (genes constitutively different in resistant vs. sensitive individuals, even before stress).

**The stepwise approach removes these by design** because Step 1 filters for differential expression between control/treated.

If a gene is expressed at different baseline levels in resistant vs. sensitive oysters but doesn't change with stress, it will be filtered out in Step 1.

See Issue #53 and innate gene expression analysis

3. Study-Specific Effects Still Dominate

Even with stepwise filtering:

Batch effects persist
Per-dataset analysis required (not cross-study integration)
Results may not generalize to other studies

Decision Tree: Should You Use This Pipeline?

graph TD
    A[Start: Biomarker Discovery Goal] --> B{Do you have control<br/>and treated groups?}
    B -->|No| C[Use classifier approach]
    B -->|Yes| D{Do you believe biomarkers<br/>are reactive to stress?}
    D -->|No / Unsure| C
    D -->|Yes| E{Will Step 1 yield<br/>> 100 genes?}
    E -->|Unsure / No| C
    E -->|Yes| F[Use stepwise approach]
    F --> G{Does Step 2 yield<br/>convincing results?}
    G -->|No| C
    G -->|Yes| H[Validate with LOSO]
    C --> I[Two-Step Classifier Pipeline]

Alternative: Two-Step Classifier

If the stepwise approach doesn’t work for your data, consider the Two-Step Classifier, which:

Doesn’t filter out innate biomarkers
Handles any dataset design (doesn’t require controls)
Uses reproducibility scoring instead of sequential filtering
Produced the validated 6-gene panel in this project

Validation Requirements

If you proceed with stepwise approach:

Within-study validation: Can the genes predict phenotypes in held-out samples from the same study?
Cross-study validation: Use Leave-One-Study-Out (LOSO) - see Validation & Pitfalls
Biological validation: Are genes functionally related to stress response pathways?

Next: See the more successful Two-Step Classifier approach, or learn about Validation & Pitfalls