Year in Review: Timeline of Discovery

This timeline documents the evolution of analysis approaches from December 2024 through September 2025, highlighting methodological pivots, key lessons, and the path to a validated 6-gene classifier.


December 2024: Initial Methods Exploration

Week of Dec 4

Attempted: Running nf-core pipeline on 4 combined datasets

Outcome: ❌ Failed: compute resource limits and too many unaccounted-for nuances

!!! quote "Key Insight"
    "Too many nuances that can't be accounted for when everything is pooled"


January 2025: Differential Abundance Beginnings

Jan 3

Activities:

  • Issue #3: Created merged metadata
  • Issue #4: Differential abundance initial approach
    • Toy example: ✅ Success
    • GTF file issues encountered
    • Applied mutual information to Perkinsus datasets

Outcome: Merged counts table obtained, but separation by trait was weak

Related:

  • Notebook: 2025-01-03_RNAseq_all_AI_diffexp.ipynb
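The mutual-information step can be sketched as follows. This is a toy illustration with simulated counts and made-up trait labels, assuming scikit-learn's `mutual_info_classif` as the estimator (the notebook's actual implementation may differ):

```python
# Toy sketch: ranking genes by mutual information with a binary trait.
# All data here are simulated; sample sizes and gene counts are made up.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_samples, n_genes = 40, 100
counts = rng.poisson(lam=20, size=(n_samples, n_genes)).astype(float)
trait = np.array([0] * 20 + [1] * 20)   # e.g. susceptible vs. resistant
counts[trait == 1, 0] += 30             # spike one gene so it tracks the trait

logc = np.log1p(counts)                 # log-transform counts before scoring
mi = mutual_info_classif(logc, trait, random_state=0)
top = np.argsort(mi)[::-1][:10]         # indices of the 10 highest-MI genes
print(top[0])                           # the spiked gene ranks first
```

With a strong simulated signal the spiked gene dominates the ranking; on the real merged counts table the separation by trait was weak, as noted above.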

Jan 16

Attempted: Subset data to improve separation

Outcome: ❌ Didn’t help significantly

Considerations identified:

  • Fixed vs. random effects
  • Sample size limitations
  • Control group treatment
  • Batch corrections needed

!!! failure "Big Lesson #1: When Integration Fails"
    Integrated data analysis does not work when the data are noisy (too much within- and/or across-study variation) and the signal is not strong enough.

Pivot discussed: meta-analysis approach (inspired by a BMC paper by Erin Witkop, with Dina as a co-author)


February 2025: Confronting Batch Effects

Key finding: Study-specific effects are much stronger than trait effects

Attempted:

  • Issue #18: Batch correction
    1. ComBat
    2. removeBatchEffect (limma)

Outcome: ❌ Little effect on improving trait-based separation
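What additive batch correction does can be sketched with simple per-batch mean-centering, which is the spirit of limma's removeBatchEffect. The actual runs used ComBat and removeBatchEffect in R; the code and data below are illustrative only:

```python
# Minimal removeBatchEffect-style sketch (illustration only): subtract per-batch
# mean shifts from log-expression so all batches share a common center.
import numpy as np

def center_batches(logexpr, batch):
    """Remove additive batch shifts: per gene, subtract each batch's mean
    and add back the grand mean (matrix is samples x genes)."""
    out = logexpr.copy()
    grand = logexpr.mean(axis=0)
    for b in np.unique(batch):
        mask = batch == b
        out[mask] -= logexpr[mask].mean(axis=0) - grand
    return out

rng = np.random.default_rng(1)
expr = rng.normal(5.0, 1.0, size=(12, 3))
batch = np.array([0] * 6 + [1] * 6)
expr[batch == 1] += 2.0                 # simulate a strong study/batch offset

corrected = center_batches(expr, batch)
# After correction the per-gene batch means coincide
print(np.allclose(corrected[batch == 0].mean(axis=0),
                  corrected[batch == 1].mean(axis=0)))
```

Note this only removes additive shifts; as observed here, removing the batch mean does little when the trait signal itself is weak relative to study variation.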

!!! failure "Big Lesson #2: Traits Were Oversimplified"
    Generalized trait definitions across studies were insufficient. Study-specific effects dominated.

Alternative approach: limit within-study variation by subsetting to common time points and comparing DEGs against control groups


April 2025: Per-Dataset Analysis & TAG-seq Issues

Strategy Shift: Independent Dataset Analysis

New approach: Running DifferentialAbundance on each dataset independently

TAG-seq Parameter Problems

  • Issue #26: What flags to use? Discovered GC bias
  • Critical finding: Johnson dataset used TAG-seq; initial RNA-seq analysis parameters were inappropriate
  • Issue #28: Rerun Johnson data with different parameters

Deferred Work

  • Issue #29 & #31: Ran differential abundance on all datasets together but didn’t interpret results yet

!!! note "Could return to this"
    Results exist, but interpretation was postponed to focus on the per-dataset approach.


June 2025: Focused Dataset Analysis

Per-Dataset Deep Dives

Issue #32: Attempted differentialabundance on datasets separately

  • Steve: Focused on study 5
  • Shelly: Focused on study 1
  • Goal: Understand parameters and how to best run differential abundance

July 2025: Combining Studies & Understanding Normalization

Study Combination Experiment

Issue #34: Combine study 4 (injected group) + study 5

Research question: Will the study 4 injected group cluster with the resistant or the susceptible group from study 5?

Learning: Gained deeper understanding of the differentialabundance pipeline

  • PCAs are generated before any differential abundance analysis happens
  • Normalization timing matters for interpretation

Batch correction attempt:

  • Compared PCAs with and without batch correction on top 500 most variable genes
  • Result: starting to see evidence of an innate trait signal
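The top-500-variable-genes PCA can be sketched like this (toy data; not the pipeline's own code, which performs this step in R before any differential expression testing):

```python
# Sketch: PCA on the top-N most variable genes, mirroring the QC step that
# differentialabundance runs before any DE testing. Data are simulated.
import numpy as np

rng = np.random.default_rng(2)
logexpr = rng.normal(size=(20, 1000))   # samples x genes, log-scale toy data
n_top = 500

variances = logexpr.var(axis=0)
top_idx = np.argsort(variances)[::-1][:n_top]   # top-500 most variable genes
sub = logexpr[:, top_idx]

centered = sub - sub.mean(axis=0)       # PCA via SVD on the centered matrix
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pcs = u * s                             # sample scores; pcs[:, 0] is PC1
var_explained = s**2 / (s**2).sum()
print(pcs.shape)
```

Because the gene subset is chosen by variance alone, study/batch effects can dominate these PCs, which is why comparing the PCA with and without batch correction was informative.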

!!! note "Could return to this"
    Analysis exists showing 567 genes with significant differential abundance (DESeq2). Could revisit to see whether these genes cluster more strongly than the top 500 most variable.

Outstanding Questions

Issue #39: Compare DE results from papers and create list of DEGs/markers

!!! warning "Remaining to be done"
    Systematic comparison with the published literature is still pending.


August 2025: GSEA & Stepwise Approach Development

GSEA Integration

Activity: Run differentialabundance with GSEA

  • Created GMT file with gene descriptions
  • Issue #45: Understand GSEA (not completed)

Stepwise Differential Abundance

Issue #41: Two-step approach

Steps:

  1. Controls vs. treated (identify stress-responsive genes)
  2. Resistant vs. sensitive (tested within the step 1 gene set)

Implementation: analyses/stepwise_differentialabundance/
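The two steps can be sketched as a simple filter over per-step significance calls; the p-value arrays here are toy stand-ins for DESeq2 results, and the comments mark where the filtering can discard innate biomarkers:

```python
# Hypothetical sketch of the two-step filter: step 1 keeps stress-responsive
# genes (controls vs. treated); step 2 tests resistant vs. sensitive within
# that subset. P-values are simulated stand-ins for DESeq2 output.
import numpy as np

rng = np.random.default_rng(3)
n_genes = 2000
genes = np.array([f"g{i}" for i in range(n_genes)])
p_step1 = rng.uniform(size=n_genes)     # controls vs. treated
p_step2 = rng.uniform(size=n_genes)     # resistant vs. sensitive

alpha = 0.05
stress_responsive = p_step1 < alpha     # step 1: stress-responsive genes only
# Caveat: constitutively expressed (innate) biomarkers never pass step 1,
# so this filter silently removes them.
final = genes[stress_responsive & (p_step2 < alpha)]
print(len(final))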

Shelly’s results on dataset 1:

  • Only 1 significant gene found
  • Problem: DESeq2 is not well suited to heavily pared-down gene sets; the variance-stabilizing transformation (VST) performs poorly with only ~50 genes

!!! danger "Big Lesson #3: Biomarkers in Controls"
    Are we removing biomarkers that are innate? If resilience biomarkers are constitutively expressed (present in controls), the stepwise filtering approach removes them!

Classifier Development Begins

Issue #42: Validate SR320 classification results

Question: Do the ~50 candidate markers convincingly distinguish sensitive from resistant phenotypes?

!!! note "Revisit: Make plots"
    Visualization is needed to assess how convincing the candidate markers are.

Issue #43: SR320’s AI model

Issue #44: Combined datasets 1 & 5

Comparison: Integrated data analysis vs. post-data integration approaches

Result: ✅ 6-gene classifier completed!

Pipeline:

  • Step 1: Rank genes based on:
    • Reproducibility across datasets
    • Consistency of directionality in expression differences
    • Heterogeneity assessment
  • Step 2: Logistic regression to find minimum gene set for good classification
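The two-step pipeline can be sketched as follows. This is a hedged illustration on simulated effect sizes and expression data (the heterogeneity assessment is omitted here): step 1 ranks genes by cross-dataset sign consistency and mean effect, and step 2 greedily grows a logistic-regression panel until accuracy stops improving.

```python
# Sketch of the two-step panel selection with simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n_genes, n_datasets = 50, 3

# Step 1: rank genes. Toy per-dataset log-fold-changes; a consistent sign
# across datasets (reproducibility + directionality) lifts a gene's rank.
lfc = rng.normal(0, 1, size=(n_genes, n_datasets))
lfc[:8] += 2.5                          # 8 genes with a reproducible effect
sign_consistency = np.abs(np.sign(lfc).sum(axis=1)) / n_datasets
score = sign_consistency * np.abs(lfc.mean(axis=1))
ranked = np.argsort(score)[::-1]

# Step 2: greedy forward selection with logistic regression on toy expression.
n_samples = 60
y = np.array([0] * 30 + [1] * 30)
X = rng.normal(size=(n_samples, n_genes))
X[y == 1, :8] += 1.5                    # informative genes separate classes

panel = []
best_acc = 0.0
for g in ranked[:20]:
    trial = panel + [int(g)]
    acc = LogisticRegression(max_iter=1000).fit(X[:, trial], y).score(X[:, trial], y)
    if acc > best_acc:                  # keep the gene only if it helps
        panel, best_acc = trial, acc
    if best_acc >= 0.99:                # stop once separation is strong
        break
print(len(panel))
```

Training accuracy is used here purely for compactness; as Big Lesson #4 warns, cross-study claims require held-out studies.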

!!! success "Six-Gene Classifier Panel Identified"
    Strong separation between tolerant and sensitive phenotypes achieved.

!!! warning "Big Lesson #4: Training Set in Test Set"
    Include training samples in the test set only when exploring within a single study. When predicting phenotypes in other studies, the training data must be excluded from the test set to avoid overfitting and leakage.
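A leakage-free cross-study setup can be sketched with scikit-learn's `LeaveOneGroupOut`, which is exactly a leave-one-study-out split when the group label is the study. The data below are simulated:

```python
# Sketch of leave-one-study-out (LOSO) validation: each study is held out in
# turn, so training data never leaks into the test set. Toy data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 6))            # e.g. a 6-gene panel
y = np.tile([0, 1], 15)
X[y == 1] += 1.0                        # make the classes separable
study = np.repeat([1, 2, 3], 10)        # study label per sample

logo = LeaveOneGroupOut()
accs = []
for train_idx, test_idx in logo.split(X, y, groups=study):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
print(len(accs))                        # one accuracy per held-out study
```

Per-held-out-study accuracies give an honest picture of cross-study generalization, which pooled or within-study scores cannot.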


September 2025: Validation & Characterization

Cross-Study Comparison

Issue #36: Run differentialabundance independently for each dataset and compare DEGs

Theme: Post-data integration → do we see more overlap?

!!! note "Could return to this"
    Good question about post-data integration vs. integrated data analysis, but the subsetting approach mentioned in the issue is uncertain.

Status: Not completed

Integration Attempts

Issue #46: Integrate all data and run through differentialabundance pipeline

!!! note "Could revisit"
    Another integration attempt; not completed.

Issue #47: (Not pursued; no need to revisit)

6-Gene Panel Characterization

Issue #49: Plot 6 genes to gain confidence

Goal: Visualize how well these 6 genes distinguish phenotypes

Issue #51: Replot heatmap with improved clustering and labels

!!! note "Revisit this"
    Visualization improvements pending.

Issue #52: Coverage density plots

Status: Still needs notebook entry

Innate vs. Reactive Analysis

Issue #53: Innate vs. reactive gene expression

Critical insight: Biomarkers may be constitutively different in resistant vs. sensitive oysters (innate), not just reactive to stress

Future Datasets

Issue #54: Find more datasets

Status: Postponed


Key Takeaways

Four Big Lessons

  1. Integrated analysis fails with noisy, weak-signal data → pivot to post-data integration
  2. Oversimplified trait definitions → study effects dominate trait effects
  3. Innate biomarkers exist in controls → don’t filter them out in stepwise approaches
  4. Training/test set leakage → critical for cross-study validation

Successful Methodologies

✅ Per-dataset differential abundance analysis
✅ Post-data integration (compare results across datasets)
✅ Two-step classifier pipeline (reproducibility + logistic regression)
✅ Leave-One-Study-Out (LOSO) validation
✅ Innate vs. reactive DEG characterization

Open Questions

  • Systematic comparison with published DEG lists
  • Optimal visualization of 6-gene panel across studies
  • Additional dataset integration for broader validation

Next: Explore the validated pipelines in detail: Decision Tree Two-Step Classifier