Process Narrative
A chronological account of decisions, pivots, and surprises — the honest story of what happened.
This narrative documents the evolution of analysis approaches from December 2024 through September 2025, organized by phase. Each phase highlights the key decision or pivot that defined that period.
Phase 1 (Dec 2024 – Jan 2025): Integrated Analysis Attempt
Why Integration Seemed Obvious
The natural starting point was to pool all available datasets together and run a combined analysis — this approach maximizes sample size and, in theory, enables detection of weak signals.
Week of Dec 4, 2024
Attempted: Running nf-core pipeline on 4 combined datasets
Outcome: ❌ Failed - hit compute resource limits, and too many nuances went unaccounted for in the pooled run
Key Insight
"Too many nuances that can't be accounted for when everything is pooled"
January 2025: Differential Abundance Beginnings
Jan 3 activities:
- Issue #3: Created merged metadata
- Issue #4: Differential abundance initial approach
- Toy example: ✅ Success
- GTF file issues encountered
- Applied mutual information to Perkinsus datasets
Outcome: Merged counts table obtained, but separation by trait was weak
Jan 16:
Attempted: Subset data to improve separation
Outcome: ❌ Didn't help significantly
Considerations identified:
- Fixed vs. random effects
- Sample size limitations
- Control group treatment
- Batch corrections needed
Big Lesson #1: When Integration Fails
Integrated data analysis does not work when the data are noisy (too much within- and/or across-study variation) and the signal is not strong enough.
Pivot discussed: meta-analysis approach (inspired by a BMC paper co-authored by Erin Witkop and Dina)
Phase 2 (Feb 2025): Confronting Batch Effects
Key finding: Study-specific effects are much stronger than trait effects
Attempted:
- Issue #18: Batch correction
- ComBat
- removeBatchEffect (limma)
Outcome: ❌ Little effect on improving trait-based separation
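ComBat and limma's removeBatchEffect both treat batch as a covariate in a linear model on log-scale expression (ComBat additionally shrinks the batch parameters with an empirical-Bayes step). A minimal numpy sketch of the core linear-model idea, on simulated data rather than the project's datasets, is per-batch mean centering:

```python
import numpy as np

def remove_batch_means(logexpr, batches):
    """Subtract each batch's mean from its samples (genes x samples).

    A simplified stand-in for limma::removeBatchEffect, which fits
    batch as a covariate; ComBat adds empirical-Bayes shrinkage of
    the per-batch location/scale parameters on top of this idea.
    """
    corrected = logexpr.astype(float).copy()
    for b in np.unique(batches):
        idx = batches == b
        corrected[:, idx] -= corrected[:, idx].mean(axis=1, keepdims=True)
    return corrected

rng = np.random.default_rng(0)
# 100 genes x 12 samples, two studies with a strong study offset
expr = rng.normal(0, 1, (100, 12))
batches = np.array([0] * 6 + [1] * 6)
expr[:, batches == 1] += 3.0  # simulated batch (study) effect

corrected = remove_batch_means(expr, batches)
```

Note the lesson this phase surfaced: if trait and batch are confounded (each study contributes mostly one trait level), centering out the batch also centers out the trait signal, which is consistent with the weak gains observed here.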
Big Lesson #2: Traits Were Oversimplified
Generalized trait definitions across studies were insufficient. Study-specific effects dominated.
Alternative approach: limit within-study variation by subsetting to common time points, then compare DEGs against control groups
Phase 3 (Apr – Jun 2025): Shift to Per-Dataset Independent Analysis; TAG-seq Discovery
Strategy Shift: Independent Dataset Analysis
New approach: Running nf-core/differentialabundance on each dataset independently
TAG-seq Parameter Problems (April 2025)
- Issue #26: What flags to use? Discovered GC bias
- Critical finding: Johnson dataset used TAG-seq; initial RNA-seq analysis parameters were inappropriate
- Issue #28: Rerun Johnson data with different parameters
Deferred Work (April 2025)
- Issue #29 & #31: Ran differential abundance on all datasets together but didn't interpret results yet
Could return to this
Results exist but interpretation was postponed to focus on per-dataset approach
Per-Dataset Deep Dives (June 2025)
Issue #32: Attempted differentialabundance on datasets separately
- Steve: Focused on study 5
- Shelly: Focused on study 1
- Goal: Understand parameters and how to best run differential abundance
Phase 4 (Jul 2025): Normalization and Study-Combination Experiments
Study Combination Experiment
Issue #34: Combine study 4 (injected group) + study 5
Research question: Will study 4 injected group cluster with resistance or susceptible group from study 5?
Learning: Gained deeper understanding of the differentialabundance pipeline
- PCAs are generated before any differential abundance analysis happens
- Normalization timing matters for interpretation
Batch correction attempt:
- Compared PCAs with and without batch correction on top 500 most variable genes
- Result: Starting to see evidence of innate trait
Could return to this
Analysis exists showing 567 genes with significant differential abundance (DESeq). Could revisit to see if these genes show greater clustering than the top 500 most variable.
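The PCA comparison described above (samples projected onto components computed from the top-N most variable genes, before any DE testing) can be sketched in a few lines. This is an illustrative numpy version on simulated data, not the pipeline's actual code:

```python
import numpy as np

def pca_top_variable(logexpr, n_genes=500, n_components=2):
    """Project samples onto PCs computed from the most variable genes.

    logexpr: genes x samples matrix of log-scale expression.
    Mirrors the differentialabundance pipeline's habit of plotting
    PCAs on the top-N most variable genes prior to DE analysis.
    """
    variances = logexpr.var(axis=1)
    top = np.argsort(variances)[::-1][:n_genes]   # most variable genes
    X = logexpr[top].T                            # samples x genes
    X = X - X.mean(axis=0)                        # center each gene
    U, S, Vt = np.linalg.svd(X, full_matrices=False)  # PCA via SVD
    return U[:, :n_components] * S[:n_components]     # sample coordinates

rng = np.random.default_rng(0)
expr = rng.normal(0, 1, (1000, 10))               # toy data
coords = pca_top_variable(expr, n_genes=200)      # 10 samples x 2 PCs
```

Swapping the top-500-variable gene set for the 567 significant DESeq genes is a one-line change here (pass their row indices instead of `top`), which is what the "could return to this" note above proposes.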
Outstanding Questions
Issue #39: Compare DE results from papers and create list of DEGs/markers
Remaining to be done
Systematic comparison with published literature still pending
Phase 5 (Aug 2025): Stepwise Differential Abundance; Discovery of Innate-Signal Problem
GSEA Integration
Activity: Run differentialabundance with GSEA
- Created GMT file with gene descriptions
- Issue #45: Understand GSEA (not completed)
Stepwise Differential Abundance
Issue #41: Two-step approach
Steps:
- Step 1: Controls vs. treated (identify stress-responsive genes)
- Step 2: Resistant vs. sensitive (from step 1 genes)
Implementation: analyses/stepwise_differentialabundance/
Shelly's results on dataset 1:
- Only 1 significant gene found
- Problem: DESeq2 isn't ideal for heavily pared-down gene sets (the variance-stabilizing transformation, VST, doesn't work well with only ~50 genes)
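The two-step filter can be sketched as follows. This is a toy version on simulated data: Welch t-statistics with a crude |t| cutoff stand in for DESeq2's shrinkage and Wald testing, which is what the project actually used.

```python
import numpy as np

def welch_t(a, b):
    """Welch t-statistic per gene (genes are rows, samples columns)."""
    ma, mb = a.mean(axis=1), b.mean(axis=1)
    va, vb = a.var(axis=1, ddof=1), b.var(axis=1, ddof=1)
    return (ma - mb) / np.sqrt(va / a.shape[1] + vb / b.shape[1])

def stepwise_de(logexpr, is_control, is_resistant, t_cut=2.0):
    """Step 1: controls vs. treated -> stress-responsive genes.
    Step 2: resistant vs. sensitive, tested only on step-1 genes."""
    treated = ~is_control
    t1 = welch_t(logexpr[:, is_control], logexpr[:, treated])
    responsive = np.where(np.abs(t1) > t_cut)[0]
    res, sen = treated & is_resistant, treated & ~is_resistant
    t2 = welch_t(logexpr[np.ix_(responsive, np.flatnonzero(res))],
                 logexpr[np.ix_(responsive, np.flatnonzero(sen))])
    return responsive[np.abs(t2) > t_cut]

rng = np.random.default_rng(0)
expr = rng.normal(0, 1, (300, 24))
is_control = np.array([True] * 8 + [False] * 16)
is_resistant = np.array([False] * 16 + [True] * 8)
expr[0, ~is_control] += 3.0     # gene 0 is stress-responsive...
expr[0, is_resistant] += 3.0    # ...and also differs by phenotype
hits = stepwise_de(expr, is_control, is_resistant)
```

The structure makes the failure mode visible: a gene that differs between phenotypes but not between controls and treated never survives step 1, which is exactly Big Lesson #3 below.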
Big Lesson #3: Biomarkers in Controls
Are we removing biomarkers that are innate? If resilience biomarkers are constitutively expressed (present in controls), the stepwise filtering approach removes them!
Phase 6 (Aug – Sep 2025): Two-Step Classifier Development; 6-Gene Panel Validation
Classifier Development Begins
Issue #42: Validate SR320 classification results
Question: Are the ~50 markers convincing about the difference between sensitive vs. resistant?
Issue #43: SR320's AI model
Issue #44: Combined datasets 1 & 5
Comparison: Integrated data analysis vs. post-data integration approaches
Result: ✅ 6-gene classifier completed!
Pipeline:
- Step 1: Rank genes based on:
- Reproducibility across datasets
- Consistency of directionality in expression differences
- Heterogeneity assessment
- Step 2: Logistic regression to find minimum gene set for good classification
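The two steps above can be sketched end to end. Everything here is a hedged illustration on made-up inputs: the scoring formula (direction consistency x mean effect, penalized by cross-study variance) is one plausible realization of the ranking criteria, not necessarily the exact one used, and the greedy forward selection is a simple way to search for a minimal gene set.

```python
import numpy as np

def rank_genes(effects):
    """Rank genes by cross-study reproducibility.

    effects: studies x genes matrix of per-study effect sizes
    (e.g. log2 fold changes, resistant vs. sensitive). The score
    rewards consistent direction across studies and penalizes
    heterogeneity (spread of the effect across studies)."""
    direction = np.abs(np.sign(effects).mean(axis=0))  # 1.0 = all agree
    magnitude = np.abs(effects.mean(axis=0))
    heterogeneity = effects.std(axis=0)
    score = direction * magnitude / (1.0 + heterogeneity)
    return np.argsort(score)[::-1]

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain gradient-descent logistic regression (weights, bias)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def minimal_panel(X, y, ranked, max_genes=6):
    """Greedily add top-ranked genes until training accuracy is
    perfect (or max_genes is reached); returns chosen gene indices."""
    chosen = []
    for g in ranked[:max_genes]:
        chosen.append(g)
        w, b = fit_logistic(X[:, chosen], y)
        if (((X[:, chosen] @ w + b) > 0) == y).all():
            break
    return chosen
```

In the real pipeline the stopping criterion would be cross-validated performance rather than training accuracy; training accuracy is used here only to keep the sketch self-contained.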
Six-Gene Classifier Panel Identified
Strong separation between tolerant and sensitive phenotypes achieved
Big Lesson #4: Training Set in Test Set
Including training samples in the test set is only defensible when exploring within a single study. When trying to predict phenotypes in other studies, never include the training data in the test set (this causes overfitting and data leakage).
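Leave-One-Study-Out validation is the mechanical enforcement of this lesson: every sample from one study is held out together, so training data can never leak into the test set. A self-contained sketch with simulated data, hypothetical study labels, and a toy nearest-centroid classifier standing in for the real model:

```python
import numpy as np

def loso_accuracy(X, y, study, fit, predict):
    """Leave-One-Study-Out: for each study, train on all other
    studies and test only on the held-out study."""
    accs = {}
    for s in np.unique(study):
        test = study == s
        model = fit(X[~test], y[~test])
        accs[s] = float((predict(model, X[test]) == y[test]).mean())
    return accs

# A tiny nearest-centroid classifier as the plug-in model
def fit_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_centroids(model, X):
    classes = sorted(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (30, 6))
y = np.tile([0, 1], 15)
X[y == 1] += 2.0                  # separable phenotypes
study = np.repeat([0, 1, 2], 10)  # three hypothetical studies
accs = loso_accuracy(X, y, study, fit_centroids, predict_centroids)
```

Any classifier with the same `fit`/`predict` shape (including the logistic-regression panel) can be dropped into `loso_accuracy` unchanged.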
September 2025: Validation & Characterization
Cross-Study Comparison
Issue #36: Run differentialabundance independently for each dataset and compare DEGs
Theme: Post-data integration → do we see more overlap?
Could return to this
Good question about post-data integration vs. integrated data analysis, but uncertain about subsetting approach mentioned in the issue
Status: Not completed
6-Gene Panel Characterization
Issue #49: Plot 6 genes to gain confidence
Goal: Visualize how well these 6 genes distinguish phenotypes
Issue #53: Innate vs. reactive gene expression
Critical insight: Biomarkers may be constitutively different in resistant vs. sensitive oysters (innate), not just reactive to stress
Key Takeaways
See Big Lessons Learned for a distilled list of insights from this project.
Successful Methodologies
✅ Per-dataset differential abundance analysis
✅ Post-data integration (compare results across datasets)
✅ Two-step classifier pipeline (reproducibility + logistic regression)
✅ Leave-One-Study-Out (LOSO) validation
✅ Innate vs. reactive DEG characterization
Open Questions
- Systematic comparison with published DEG lists
- Optimal visualization of 6-gene panel across studies
- Additional dataset integration for broader validation
Next: Read the distilled Big Lessons Learned, or jump to Methods & Pipelines