Research Context & Problem Framing

Why this study, why these data, and what “resilience” means here.

What is “Resilience” in This Project?

In this research, resilience refers to the ability of Crassostrea virginica (Eastern oyster) to tolerate or resist infection by Perkinsus marinus, the parasite that causes Dermo disease. It generalizes two specific traits, resistance and tolerance. (Discussion)

Initial efforts purposely conflated the traits of resistance (low infection intensity; antonym: susceptibility) and tolerance (survival in the presence of significant infection; antonym: sensitivity) in an attempt to discover any overarching resilience markers.

Later efforts distinguished these traits, noting correlations between markers for each.

The Datasets

This project integrates multiple RNA-seq datasets from C. virginica exposed to P. marinus:

Primary Datasets

Dataset	Species	Stressor	Library Type	Notes
Dataset 1	C. virginica	P. marinus injection, distinct doses	Standard RNA-seq	Day 7 samples used for primary analyses
Dataset 2	C. virginica	P. marinus outplanted exposure	TAG-seq	Requires different FastP parameters; GC bias discovered (issue #26)
Dataset 3	C. virginica, C. gigas	P. marinus injection	Standard RNA-seq	Day 7 samples used for primary analyses
Dataset 4	C. virginica	P. marinus injection	Standard RNA-seq	Injected group samples; combined with Dataset 5 in July 2025

Additional datasets were analyzed for specific sub-questions (methylation, de novo transcriptome, cross-species comparisons).

Dataset Characteristics & Challenges

Batch effects: Study-specific effects consistently stronger than trait effects across all integration attempts
Sample size limitations: Variable n across studies (typically 5–15 samples per phenotype group)
Time point variation: Sampling at different days post-exposure across studies
Technology differences: Dataset 2 used TAG-seq (3′ tag RNA-seq) rather than standard RNA-seq, requiring separate parameter optimization (issue #26, issue #28)
Phenotype comparability: Tolerant/sensitive labels defined differently across studies; required harmonization before cross-study analysis

Key Constraints & Design Decisions

1. Integration vs. Post-Integration Analysis

!!! example “Big Lesson #1: When Integrated Analysis Fails” Integrated data analysis does not work when the data are noisy (too much within-study and/or across-study variation) and the trait signal is not strong enough to stand out from that variation.

**Initial approach (Failed):**
- Pooled all datasets together
- Attempted batch correction (ComBat, removeBatchEffect)
- Expected to see trait-based clustering

**Outcome:** Study-specific effects dominated; trait separation was poor

**Pivot:** Adopted a post-integration approach:
1. Analyze each dataset independently
2. Compare results across datasets
3. Identify reproducible signatures
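
The three steps above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it assumes each dataset has already been analyzed on its own and reduced to a mapping of significant genes to log2 fold changes. The function name `reproducible_signatures`, the `min_datasets` threshold, and the toy gene IDs and values are all hypothetical.

```python
from collections import defaultdict


def reproducible_signatures(results_by_dataset, min_datasets=2):
    """Return genes that are significant, in one consistent direction,
    in at least `min_datasets` independently analyzed datasets.

    results_by_dataset: {dataset_name: {gene_id: log2_fold_change}}, where
    each inner dict holds only genes that passed that dataset's own
    significance filter (assumed to have been applied upstream).
    """
    calls_by_gene = defaultdict(dict)  # gene -> {dataset: "up" or "down"}
    for dataset, hits in results_by_dataset.items():
        for gene, lfc in hits.items():
            if lfc != 0:
                calls_by_gene[gene][dataset] = "up" if lfc > 0 else "down"

    signatures = {}
    for gene, calls in calls_by_gene.items():
        directions = set(calls.values())
        # Keep the gene only if enough datasets agree and none disagree.
        if len(calls) >= min_datasets and len(directions) == 1:
            signatures[gene] = (directions.pop(), sorted(calls))
    return signatures


# Hypothetical toy input: values are invented for illustration only.
toy_results = {
    "dataset_1": {"geneA": 1.8, "geneB": -1.2, "geneC": 0.9},
    "dataset_3": {"geneA": 2.1, "geneB": 1.0},
    "dataset_4": {"geneA": 1.1, "geneC": 1.4},
}
shared = reproducible_signatures(toy_results, min_datasets=2)
for gene, (direction, datasets) in sorted(shared.items()):
    print(gene, direction, datasets)
```

In the toy input, `geneB` is dropped because its direction flips between datasets. Requiring directional agreement is a design choice made for this sketch; a looser rule that only counts significance could be substituted.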

2. Fixed vs. Random Effects

Early attempts didn’t properly account for:

Sample size differences across studies
Control group considerations
Study as a random effect in mixed models
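
One common way to express "study as a random effect" is a random-intercept model. The formulation below is a generic sketch for a single gene, not the model the project fitted; every symbol is introduced here for illustration.

```latex
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{ij}$ is the normalized expression of the gene in sample $i$ from study $j$, $x_{ij}$ is a trait indicator (for example, 1 for tolerant and 0 for sensitive), $\beta_1$ is the fixed trait effect of interest, and $u_j$ is a study-specific intercept drawn from a shared distribution. Treating study as a fixed effect would instead estimate a separate, unconstrained offset for each study.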

3. Normalization Timing

!!! info “When Normalization Happens” In the nf-core/differentialabundance pipeline:

- PCAs are generated **before** differential abundance analysis
- Normalization (VST, TPM) happens during analysis
- Batch correction (if applied) occurs on normalized counts

Understanding this order is critical for interpreting preliminary QC plots.

4. Innate vs. Reactive Biomarkers

!!! success “Big Lesson #3: Don’t Filter Out Innate Signals” Biomarkers may exist in control groups if they represent innate resilience traits (genes that are constitutively different in resistant vs. sensitive individuals).

**Implication:** The stepwise approach (filter controls first) may inadvertently remove true biomarkers

**Resolution:** Developed an alternative classifier approach that preserves innate signals (see [Two-Step Classifier](/field-guide-src/docs/pipelines/classifier-path/))

See issue #53 and the notebook post for the full innate vs. reactive analysis.
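
The innate vs. reactive distinction can be illustrated with a small sketch. It assumes log2-scale normalized expression for one gene, grouped by condition (control or treated) and phenotype (tolerant or sensitive), and it uses a plain difference of group means with an arbitrary cutoff as a stand-in for a proper statistical test. The function `classify_marker`, the `threshold` value, and the toy numbers are hypothetical and are not taken from the project's classifier.

```python
from statistics import mean


def classify_marker(expr, threshold=1.0):
    """Label one gene as 'innate', 'reactive', or 'neither'.

    expr maps (condition, phenotype) -> list of log2 expression values.
    A gene counts as innate if the phenotypes already differ in controls,
    and as reactive if they differ only after exposure.
    """
    control_gap = mean(expr[("control", "tolerant")]) - mean(expr[("control", "sensitive")])
    treated_gap = mean(expr[("treated", "tolerant")]) - mean(expr[("treated", "sensitive")])
    if abs(control_gap) >= threshold:
        return "innate"  # filtering on controls first would discard this gene
    if abs(treated_gap) >= threshold:
        return "reactive"
    return "neither"


# Hypothetical toy values, invented for illustration only.
gene_innate = {
    ("control", "tolerant"): [8.0, 8.2, 7.9],
    ("control", "sensitive"): [5.9, 6.1, 6.0],
    ("treated", "tolerant"): [8.1, 8.3, 8.0],
    ("treated", "sensitive"): [6.0, 6.2, 5.9],
}
gene_reactive = {
    ("control", "tolerant"): [5.0, 5.2, 5.1],
    ("control", "sensitive"): [5.1, 5.0, 5.2],
    ("treated", "tolerant"): [7.5, 7.8, 7.6],
    ("treated", "sensitive"): [5.2, 5.0, 5.3],
}
print(classify_marker(gene_innate))
print(classify_marker(gene_reactive))
```

The first toy gene shows why removing anything that differs in controls is risky: its phenotype gap is present before exposure and persists afterward.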

Biological Context

Perkinsus marinus (Dermo)

Major pathogen in oyster aquaculture
Causes Dermo disease (mortality and reduced growth)
Variable infection response across oyster populations
Understanding resistance is key to breeding programs

Molecular Signatures of Resilience

This project aims to identify gene expression patterns that:

Predict tolerance/resistance phenotypes
Are reproducible across independent studies
Are biologically interpretable (pathway-informed)
Could be assayable in breeding programs (minimal gene panels)

Research Questions Evolution

Initial Questions (December 2024)

Can we identify shared stress response genes across multiple studies?
Do RNA-seq datasets cluster by trait when integrated?

Refined Questions (January-April 2025)

How do batch effects and study design differences limit integration?
Can batch correction methods recover trait-based signal?
Which normalization approaches work best for meta-analysis?

Current Questions (August-September 2025)

What is the minimal gene set that distinguishes tolerant from sensitive oysters?
Are biomarkers innate or reactive (expressed before vs. after stress)?
Can classifiers trained on one study predict phenotypes in independent studies?

Success Criteria

A successful biomarker panel should:

✅ Show reproducible differential expression across datasets
✅ Maintain predictive accuracy in Leave-One-Study-Out (LOSO) validation
✅ Be small enough for practical assay development (< 10 genes)
✅ Have biological interpretability (annotatable, pathway-linked)
✅ Distinguish phenotypes in both control and treated conditions
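
The Leave-One-Study-Out criterion above can be made concrete with a small splitting helper. This is a sketch under assumptions: each sample carries a study label, and the function name `leave_one_study_out` and the toy labels are hypothetical. It produces only the train/test index splits; fitting and scoring a classifier on each fold is left out.

```python
def leave_one_study_out(study_labels):
    """Yield (held_out_study, train_indices, test_indices) once per study.

    Every sample from the held-out study goes into the test set, so the
    model never sees that study's batch characteristics during training.
    """
    for held_out in sorted(set(study_labels)):
        train = [i for i, s in enumerate(study_labels) if s != held_out]
        test = [i for i, s in enumerate(study_labels) if s == held_out]
        yield held_out, train, test


# Hypothetical sample-to-study assignment, for illustration only.
studies = ["dataset_1", "dataset_1", "dataset_3", "dataset_3", "dataset_4"]
folds = list(leave_one_study_out(studies))
for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)
```

If scikit-learn is available, `sklearn.model_selection.LeaveOneGroupOut` provides the same kind of split when study labels are passed as `groups`.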

Next: Read the Process Narrative to see how these constraints shaped the project’s decisions, or jump to Methods & Pipelines for validated workflows.