Big Lessons Learned
Distilled, numbered insights for researchers adapting this work to new species or stressors.
These lessons emerged from 10+ months of iterative analysis. Each one represents a decision point where the team had to abandon a promising approach and rethink the strategy.
Lesson 1: Integrated Analysis Fails with Noisy, Weak-Signal Data
→ Use post-data integration instead
When datasets are noisy — too much within- and across-study variation — and the biological signal is not strong enough to overcome that noise, pooling all samples together makes things worse, not better.
What happened: Pooling all datasets produced PCAs dominated by study-specific clustering. Trait-based (tolerant vs. sensitive) separation was invisible.
What to do instead:
- Analyze each dataset independently
- Identify genes that are differentially expressed in multiple datasets
- Score genes by reproducibility and consistency across studies
- Build a classifier on the reproducible signal
See: Process Narrative — Phase 1, Issue #12
Lesson 2: Trait Definitions Must Be Specific and Comparable
→ Generic “stress vs. control” is insufficient
Early attempts used broad trait labels (e.g., “stressed” vs. “control”) across studies that had different stressors, time points, and experimental designs. The result: study effects dominated trait effects in every analysis.
What happened: Batch correction (COMBAT, RemoveBatchEffect) could not recover trait signal because the trait definitions were fundamentally incompatible across studies.
What to do instead:
- Define phenotypes precisely (e.g., “tolerant to P. marinus injection at Day 7” — not “stressed”)
- Compare only studies with harmonized phenotype definitions
- Validate phenotype comparability before combining any data
See: Process Narrative — Phase 2, Research Context
Lesson 3: Innate Biomarkers Live in Control Groups
→ Stepwise filtering removes them by design
The intuitive hypothesis is that resilience biomarkers are reactive — genes that change in response to stress differently in tolerant vs. sensitive individuals. But the data showed the opposite: the strongest signals were innate, present even in unstressed controls.
What happened: The stepwise approach (Step 1: filter for stress-responsive genes; Step 2: compare tolerant vs. sensitive from Step 1 genes) systematically discarded genes that were constitutively different between phenotypes. Only 1 significant gene survived this filter in Dataset 1.
What to do instead:
- Test both hypotheses: compare tolerant vs. sensitive in both control and treated samples
- If the signal is similar in both conditions → innate biomarker
- Use the two-step classifier approach, which does not filter based on stress response
See: Process Narrative — Phase 5, Issue #53, Notebook: Innate gene expression
Lesson 4: Training-Set Leakage Inflates Accuracy
→ Prevent leakage at every step: feature selection, normalization, and tuning
Data leakage — where information from the test set influences the training process — is the most common source of falsely optimistic accuracy in biomarker discovery.
What happened: Early evaluations included training samples in the test set during cross-study validation. This inflated apparent accuracy.
Common leakage points:
- Normalization: If you normalize train + test together, test data influenced the normalization parameters → leakage
- Feature selection: If you select genes using all data before splitting → leakage
- Hyperparameter tuning: If you tune regularization using the test set → leakage
What to do instead:
- Use Leave-One-Study-Out (LOSO) validation: train on all studies except one, test on the held-out study
- Fit normalization parameters on training data only; apply to test
- Select features using training data only
- Never include the training study in the test set for cross-study evaluation
See: Validation & Pitfalls, Issue #44
Lesson 5: Pipeline Internals Matter
→ PCAs in nf-core/differentialabundance are generated before normalization
A subtle but important technical lesson: when using nf-core/differentialabundance, the QC PCA plots are generated from raw (or lightly filtered) counts before DESeq2 normalization occurs. This means early PCAs do not reflect the data that will actually be used in differential abundance analysis.
What happened: Initial PCAs showed poor trait separation, leading to early concern about data quality. Upon understanding the pipeline order, it became clear that these PCAs were not representing the normalized data.
What to do:
- Do not draw conclusions about biological signal from pre-normalization PCAs
- Interpret post-analysis PCAs (generated after normalization) for biology
- If you want normalized PCAs, explicitly request them after running DESeq2
See: Research Context — Normalization Timing, Issue #34
Lesson 6: Technology Differences Require Different Parameters
→ TAG-seq ≠ standard RNA-seq
Not all RNA-seq data is the same. When one study used TAG-seq (3′ tag sequencing) and was processed with standard RNA-seq parameters, the results contained GC bias artifacts that undermined downstream analysis.
What happened: The Johnson dataset (Dataset 5) used TAG-seq technology. Standard FastP parameters for full-length RNA-seq were inappropriate: wrong adapter trimming, incorrect quality filters, and GC bias.
What to do:
- Identify the library type for every dataset before processing
- Use technology-appropriate parameters (especially for adapter trimming and quality filtering)
- Run QC checks (MultiQC, FastQC) and look for GC bias before proceeding to differential abundance
See: Process Narrative — Phase 3, Issue #26, Notebook: TAG-seq FastP params
Summary Table
| Lesson | Core Failure | Better Approach |
|---|---|---|
| 1. Integrated analysis | Study effects dominate trait signal | Post-data integration: analyze separately, compare |
| 2. Trait definitions | Incomparable phenotype labels across studies | Specific, harmonized phenotype definitions |
| 3. Innate biomarkers | Stepwise filter removes constitutive signals | Test tolerant vs. sensitive in both control and treated |
| 4. Leakage | Training data in test set → inflated accuracy | LOSO; fit normalization and feature selection on training only |
| 5. Pipeline internals | Pre-norm PCAs misinterpreted | Understand and use post-normalization QC outputs |
| 6. Library type | Wrong parameters for TAG-seq | Identify and use technology-appropriate processing |
Next: See Methods & Pipelines for how these lessons shaped the validated analysis approaches.