Proposed Field Guide Outline

Audience: Researchers with similar omics datasets (potentially different species or stressors) who want to learn from this project’s lessons, methods, and pipelines for biomarker discovery.

0. Start Here (Landing Page — revised to be succinct)

Format: Abstract-style, ~250–300 words total.

Content:

Purpose — This field guide documents considerations for building a biomarker discovery pipeline from RNA-seq meta-analysis, with an emphasis on lessons learned, reproducibility, and practical recommendations for researchers working with noisy multi-study omics data.
Grant mission — One to two sentences drawn from the project narrative: developing molecular resilience biomarkers to support selective breeding and management of shellfish aquaculture under disease pressure. (Full narrative: ProjectSummaryandNarrative.pdf)
What was done — Two to three sentences: integrated RNA-seq datasets from Crassostrea virginica exposed to Perkinsus marinus, conducted differential abundance analyses, and developed a two-step classifier pipeline.
Key results — One to two sentences: identified a validated 6-gene classifier panel that distinguishes tolerant from sensitive oyster phenotypes, confirmed via Leave-One-Study-Out (LOSO) cross-validation.
Intended application — One sentence: this guide is a reusable template for researchers facing batch effects, weak signals, and overfitting risks in multi-study biomarker discovery.
How to navigate this guide — A short table or bullet list pointing to the main sections below.

1. Research Context & Problem Framing

Why this study, why these data, and what “resilience” means here.

Background: Dermo disease (P. marinus) and its impact on oyster aquaculture
Biological question: Can gene expression predict tolerance/resistance phenotypes?
Dataset overview: study descriptions, species, stressors, sample sizes
Phenotype definition (tolerant vs. sensitive) and how it evolved

2. Process Narrative (replaces “Year in Review”)

A chronological account of decisions, pivots, and surprises — the honest story of what happened.

Phase 1 (Dec 2024 – Jan 2025): Integrated analysis attempt; why it failed
Phase 2 (Feb 2025): Confronting batch effects; ComBat (sva) and removeBatchEffect (limma) trials
Phase 3 (Apr – Jun 2025): Shift to per-dataset independent analysis; TAG-seq parameter discovery
Phase 4 (Jul 2025): Normalization and study-combination experiments
Phase 5 (Aug 2025): Stepwise differential abundance; discovery of innate-signal problem
Phase 6 (Aug – Sep 2025): Two-step classifier development; 6-gene panel validation

3. Big Lessons Learned

Distilled, numbered insights for researchers adapting this work to new species or stressors.

Integrated analysis fails with noisy, weak-signal data → analyze each dataset independently and integrate the results afterward
Trait definitions must be specific and comparable → generic “stress vs. control” is insufficient
Innate biomarkers live in control groups → stepwise filtering removes them by design
Training-set leakage inflates accuracy → prevent leakage at every step (feature selection, normalization, tuning)
Pipeline internals matter → PCAs in nf-core/differentialabundance are generated before normalization
Technology differences require different parameters → TAG-seq ≠ standard RNA-seq

4. Methods & Pipelines

Practical, decision-first guidance for implementing the analyses.

4a. Pipeline Decision Guide

Flowchart: Which approach to choose and when (box diagram/flowchart graphic)

4b. Stepwise Differential Abundance Pipeline

Step 1: control vs. treated (stress-responsive genes)
Step 2: tolerant vs. sensitive (from Step 1 genes)
Normalization approach; known failure modes
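
The two steps above can be sketched as a pair of set filters. This is a minimal illustration, not the project's actual pipeline: the DEResult type, the function names, the thresholds (padj <= 0.05, |log2FC| >= 1), and the toy gene values are all assumptions made for the example. It also simplifies Step 2 by filtering an existing result table rather than re-running the contrast on the Step 1 genes. The point it shows is the known failure mode: a gene that separates phenotypes but does not respond to the stressor never reaches Step 2.

```python
from typing import Dict, NamedTuple, Set


class DEResult(NamedTuple):
    log2fc: float  # log2 fold change for the contrast
    padj: float    # multiple-testing-adjusted p-value


def significant(results: Dict[str, DEResult],
                padj_max: float = 0.05,
                min_abs_lfc: float = 1.0) -> Set[str]:
    """Genes passing both thresholds (threshold values are illustrative)."""
    return {gene for gene, r in results.items()
            if r.padj <= padj_max and abs(r.log2fc) >= min_abs_lfc}


def stepwise_candidates(step1: Dict[str, DEResult],
                        step2: Dict[str, DEResult]) -> Set[str]:
    """Keep Step 2 hits only among genes that already passed Step 1."""
    stress_responsive = significant(step1)
    restricted = {g: r for g, r in step2.items() if g in stress_responsive}
    return significant(restricted)


# Toy values invented for illustration; not results from this project.
step1 = {  # control vs. treated
    "geneA": DEResult(2.1, 0.001),
    "geneB": DEResult(0.1, 0.9),    # not stress-responsive
    "geneC": DEResult(-1.5, 0.01),
}
step2 = {  # phenotype contrast
    "geneA": DEResult(1.4, 0.02),
    "geneB": DEResult(2.0, 0.001),  # separates phenotypes, but fails Step 1
    "geneC": DEResult(0.2, 0.7),
}

kept = stepwise_candidates(step1, step2)
dropped_by_design = significant(step2) - kept
print(kept)               # {'geneA'}
print(dropped_by_design)  # {'geneB'}
```

In this toy case geneB behaves like an innate marker: it differs between phenotypes regardless of exposure, so the Step 1 filter removes it before the phenotype contrast is evaluated.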

4c. Two-Step Classifier Pipeline (primary validated approach)

Step 1: Reproducibility and directionality scoring across datasets (box diagram/flowchart graphic)
Step 2: Logistic regression to minimize feature set
How the 6-gene panel was identified
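
One way to picture Step 1 is a per-gene score built from the sign of the log2 fold change in each dataset. The sketch below is an assumption about how such a score could be computed, not the scoring actually used in this project: GeneScore, score_genes, the min_datasets cutoff, and the toy fold changes are all hypothetical. Direction consistency is defined here as |sum of signs| / number of datasets, so 1.0 means the gene moves the same way everywhere it is observed.

```python
from typing import Dict, List, NamedTuple


class GeneScore(NamedTuple):
    gene: str
    n_datasets: int               # datasets in which the gene was observed
    direction_consistency: float  # 1.0 = same direction in every dataset


def score_genes(per_dataset_lfc: Dict[str, Dict[str, float]],
                min_datasets: int = 2) -> List[GeneScore]:
    """Rank genes by how reproducibly they change in the same direction."""
    signs: Dict[str, List[int]] = {}
    for lfc_by_gene in per_dataset_lfc.values():
        for gene, lfc in lfc_by_gene.items():
            if lfc != 0:
                signs.setdefault(gene, []).append(1 if lfc > 0 else -1)
    scores = [GeneScore(g, len(s), abs(sum(s)) / len(s))
              for g, s in signs.items() if len(s) >= min_datasets]
    # Most consistent first, then most widely observed, then by name.
    return sorted(scores,
                  key=lambda s: (-s.direction_consistency, -s.n_datasets, s.gene))


# Toy log2 fold changes invented for illustration.
per_dataset = {
    "dataset1": {"geneA": 1.8, "geneB": -1.2, "geneC": 0.9},
    "dataset2": {"geneA": 1.1, "geneB": 1.5},
    "dataset3": {"geneA": 2.0, "geneB": -0.8, "geneC": 1.3},
}

ranked = score_genes(per_dataset)
for s in ranked:
    print(s.gene, s.n_datasets, round(s.direction_consistency, 2))
```

A shortlist from a ranking like this would then be passed to Step 2, where logistic regression reduces it to a minimal feature set. That step is not sketched here.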

4d. Validation & Pitfalls

Avoiding overfitting; train/test hygiene
Leave-One-Study-Out (LOSO) protocol
Detecting and handling batch effects
Cross-study generalizability considerations
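
The LOSO protocol and the train/test hygiene points above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names and toy samples are hypothetical, and a nearest-centroid classifier stands in for the project's logistic regression to keep the example dependency-free. What it is meant to show is the fold structure: one whole study is held out per fold, and scaling parameters are fitted on the training studies only, so nothing from the held-out study leaks into training.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterator, List, Tuple

Sample = Tuple[str, List[float], int]  # (study_id, expression values, label)


def loso_splits(samples: List[Sample]) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held_out_study, train, test), holding out one whole study per fold."""
    for held_out in sorted({s[0] for s in samples}):
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


def fit_scaler(train: List[Sample]) -> Tuple[List[float], List[float]]:
    """Per-gene mean and SD, estimated from training samples only."""
    cols = list(zip(*[x for _, x, _ in train]))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]


def scale(x: List[float], scaler: Tuple[List[float], List[float]]) -> List[float]:
    means, sds = scaler
    return [(v - m) / s for v, m, s in zip(x, means, sds)]


def fit_centroids(train: List[Sample], scaler) -> Dict[int, List[float]]:
    """Stand-in classifier: one scaled centroid per class."""
    by_label = defaultdict(list)
    for _, x, y in train:
        by_label[y].append(scale(x, scaler))
    return {y: [mean(c) for c in zip(*rows)] for y, rows in by_label.items()}


def predict(x: List[float], centroids: Dict[int, List[float]], scaler) -> int:
    z = scale(x, scaler)
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(z, centroids[y])))


def loso_accuracy(samples: List[Sample]) -> Dict[str, float]:
    scores = {}
    for held_out, train, test in loso_splits(samples):
        scaler = fit_scaler(train)               # never sees the held-out study
        centroids = fit_centroids(train, scaler)
        hits = sum(predict(x, centroids, scaler) == y for _, x, y in test)
        scores[held_out] = hits / len(test)
    return scores


# Toy data invented for illustration: gene 1 separates the classes,
# gene 2 carries a study-specific offset (a stand-in for a batch effect).
toy = [
    ("studyA", [1.0, 5.0], 0), ("studyA", [1.2, 5.1], 0),
    ("studyA", [3.0, 5.0], 1), ("studyA", [3.2, 4.9], 1),
    ("studyB", [1.1, 6.0], 0), ("studyB", [0.9, 6.1], 0),
    ("studyB", [3.1, 6.0], 1), ("studyB", [2.9, 5.9], 1),
    ("studyC", [1.0, 4.0], 0), ("studyC", [1.1, 4.1], 0),
    ("studyC", [3.0, 4.0], 1), ("studyC", [3.1, 3.9], 1),
]

scores = loso_accuracy(toy)
print(scores)
```

Reporting accuracy per held-out study, rather than one pooled number, makes it easier to see when a panel generalizes to some studies but not others. In a real analysis, feature selection and any tuning would also need to sit inside the fold loop.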

5. Analysis Code & Source Code

Entry point for reproducibility — where to find and how to run everything.

GitHub repository structure overview
Key analysis scripts with brief descriptions
How to run pipelines from scratch (environment setup, inputs, outputs)
Links to relevant notebook entries (organized by pipeline/phase, not just chronologically)

6. Glossary

Terms defined in the context of this project.

Domain terms (e.g., Dermo, tolerance phenotype)
Statistical/computational methods (DESeq2, VST, LOSO, logistic regression)
Acronyms and abbreviations

7. Sources & References

Primary source material underlying the guide.

Notebook posts index (analysis logs)
GitHub issues index (decision trail)
External publications cited

Current Section	Proposed Section	Change
Start Here	Start Here	Rewritten as concise abstract
Problem Framing	Research Context & Problem Framing	Expanded with dataset details
Year in Review	Process Narrative	Renamed; framed as narrative
(none)	Big Lessons Learned	New section
Pipelines (3 sub-pages)	Methods & Pipelines (4 sub-pages)	Restructured; decision guide added
(none)	Analysis Code & Source Code	New section
Glossary	Glossary	Unchanged
Sources (posts + issues)	Sources & References	Moved to end; scoped as references

This outline is a proposal for review. Full implementation will proceed after approval.