Proposed Field Guide Outline
Audience: Researchers with similar omics datasets (potentially different species or stressors) who want to learn from this project’s lessons, methods, and pipelines for biomarker discovery.
0. Start Here (Landing Page — revised to be succinct)
Format: Abstract-style, ~250–300 words total.
Content:
- Purpose: This field guide documents considerations for building a biomarker discovery pipeline from RNA-seq meta-analysis, with an emphasis on lessons learned, reproducibility, and practical recommendations for researchers working with noisy multi-study omics data.
- Grant mission: One to two sentences drawn from the project narrative: developing molecular resilience biomarkers to support selective breeding and management of shellfish aquaculture under disease pressure. (Full narrative: ProjectSummaryandNarrative.pdf)
- What was done: Two to three sentences: integrated RNA-seq datasets from Crassostrea virginica exposed to Perkinsus marinus, conducted differential abundance analyses, and developed a two-step classifier pipeline.
- Key results: One to two sentences: identified a validated 6-gene classifier panel that distinguishes tolerant from sensitive oyster phenotypes, confirmed via Leave-One-Study-Out (LOSO) cross-validation.
- Intended application: One sentence: this guide is a reusable template for researchers facing batch effects, weak signals, and overfitting risks in multi-study biomarker discovery.
- How to navigate this guide: A short table or bullet list pointing to the main sections below.
1. Research Context & Problem Framing
Why this study, why these data, and what “resilience” means here.
- Background: Dermo disease (P. marinus) and its impact on oyster aquaculture
- Biological question: Can gene expression predict tolerance/resistance phenotypes?
- Dataset overview: study descriptions, species, stressors, sample sizes
- Phenotype definition (tolerant vs. sensitive) and how it evolved
2. Process Narrative (replaces “Year in Review”)
A chronological account of decisions, pivots, and surprises — the honest story of what happened.
- Phase 1 (Dec 2024 – Jan 2025): Integrated analysis attempt; why it failed
- Phase 2 (Feb 2025): Confronting batch effects; ComBat and removeBatchEffect trials
- Phase 3 (Apr – Jun 2025): Shift to per-dataset independent analysis; TAG-seq parameter discovery
- Phase 4 (Jul 2025): Normalization and study-combination experiments
- Phase 5 (Aug 2025): Stepwise differential abundance; discovery of innate-signal problem
- Phase 6 (Aug – Sep 2025): Two-step classifier development; 6-gene panel validation
3. Big Lessons Learned
Distilled, numbered insights for researchers adapting this work to new species or stressors.
- Integrated analysis fails with noisy, weak-signal data → analyze each dataset independently and integrate the results afterward (post-analysis integration)
- Trait definitions must be specific and comparable → generic “stress vs. control” is insufficient
- Innate biomarkers live in control groups → stepwise filtering removes them by design
- Training-set leakage inflates accuracy → prevent leakage at every step (feature selection, normalization, tuning)
- Pipeline internals matter → PCAs in `nf-core/differentialabundance` are generated before normalization
- Technology differences require different parameters → TAG-seq ≠ standard RNA-seq
4. Methods & Pipelines
Practical, decision-first guidance for implementing the analyses.
4a. Pipeline Decision Guide
- Flowchart: Which approach to choose and when (box diagram/flowchart graphic)
4b. Stepwise Differential Abundance Pipeline
- Step 1: control vs. treated (stress-responsive genes)
- Step 2: resistant vs. sensitive (from Step 1 genes)
- Normalization approach; known failure modes
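The two filtering steps above can be sketched in a few lines. This is a minimal illustration, not the project's code: the gene names, fold changes, adjusted p-values, thresholds, and the `significant` helper are all invented for the example. It also shows one known failure mode, where a gene that separates phenotypes but does not respond to stress is removed at Step 1.

```python
# Hypothetical differential-abundance results: gene -> (log2 fold change, adjusted p-value).
# All gene names and numbers are made up for illustration.
step1_control_vs_treated = {
    "geneA": (2.1, 0.001),
    "geneB": (0.2, 0.60),
    "geneC": (-1.8, 0.01),
    "geneD": (0.1, 0.90),
}
step2_resistant_vs_sensitive = {
    "geneA": (1.5, 0.02),
    "geneB": (0.3, 0.70),
    "geneC": (0.1, 0.80),
    "geneD": (2.4, 0.001),  # differs by phenotype but does not respond to stress
}


def significant(results, padj_max=0.05, abs_lfc_min=1.0):
    """Return the genes passing both thresholds (threshold values are assumptions)."""
    return {
        gene
        for gene, (lfc, padj) in results.items()
        if padj <= padj_max and abs(lfc) >= abs_lfc_min
    }


# Step 1: keep stress-responsive genes only.
stress_responsive = significant(step1_control_vs_treated)

# Step 2: test resistant vs. sensitive, restricted to the Step 1 genes.
step2_restricted = {
    gene: stats
    for gene, stats in step2_resistant_vs_sensitive.items()
    if gene in stress_responsive
}
candidates = significant(step2_restricted)

# Failure mode: phenotype-associated genes that never enter Step 2.
dropped_by_design = significant(step2_resistant_vs_sensitive) - stress_responsive

print(sorted(candidates))
print(sorted(dropped_by_design))
```

In this toy input, `geneD` is the kind of innate signal that stepwise filtering cannot recover, because it is excluded before the phenotype comparison is made.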
4c. Two-Step Classifier Pipeline (primary validated approach)
- Step 1: Reproducibility and directionality scoring across datasets (box diagram/flowchart graphic)
- Step 2: Logistic regression to minimize feature set
- How the 6-gene panel was identified
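One possible reading of the Step 1 scoring is sketched below. The outline does not define the scoring rule, so everything here is an assumption: reproducibility is taken as the number of datasets in which a gene has a nonzero fold change, directionality as the fraction of those datasets agreeing with the majority sign, and the cutoffs (`min_datasets`, `min_directionality`) and data are invented.

```python
def score_gene(lfcs):
    """Score one gene from per-dataset log2 fold changes (tolerant vs. sensitive).

    None means the gene was not detected in that dataset.
    Returns (reproducibility, directionality) under the assumed definitions.
    """
    observed = [x for x in lfcs if x is not None and x != 0]
    if not observed:
        return 0, 0.0
    n_up = sum(1 for x in observed if x > 0)
    n_down = len(observed) - n_up
    reproducibility = len(observed)
    directionality = max(n_up, n_down) / len(observed)
    return reproducibility, directionality


# Hypothetical fold changes across four datasets.
lfc_by_gene = {
    "geneA": [1.2, 0.8, 1.5, 0.9],    # consistent direction everywhere
    "geneB": [1.1, -0.7, 0.9, -1.3],  # direction flips between datasets
    "geneC": [2.0, None, None, 1.7],  # consistent but seen in only two datasets
}

scores = {gene: score_gene(lfcs) for gene, lfcs in lfc_by_gene.items()}

min_datasets = 3           # assumed cutoff
min_directionality = 0.75  # assumed cutoff
shortlist = sorted(
    gene
    for gene, (rep, direction) in scores.items()
    if rep >= min_datasets and direction >= min_directionality
)
print(shortlist)
```

A shortlist like this would then feed Step 2, where logistic regression reduces the feature set further.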
4d. Validation & Pitfalls
- Avoiding overfitting; train/test hygiene
- Leave-One-Study-Out (LOSO) protocol
- Detecting and handling batch effects
- Cross-study generalizability considerations
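The LOSO protocol and the train/test hygiene point can be illustrated together. The sketch below uses only the standard library and invented data. A midpoint-threshold rule on a single gene stands in for the logistic regression named in this outline, and the helper names (`loso_splits`, `fit_scaler`, `fit_threshold`) are hypothetical. The point it shows is that scaling parameters and the decision rule are learned from the training studies only, then applied unchanged to the held-out study.

```python
from statistics import mean, pstdev


def loso_splits(studies):
    """Yield (held_out_study, train_idx, test_idx), holding out one study at a time."""
    for held_out in sorted(set(studies)):
        train = [i for i, s in enumerate(studies) if s != held_out]
        test = [i for i, s in enumerate(studies) if s == held_out]
        yield held_out, train, test


def fit_scaler(values):
    """Learn centering and scaling parameters from training samples only."""
    sd = pstdev(values)
    return mean(values), sd if sd > 0 else 1.0


def fit_threshold(scaled, labels):
    """Placeholder classifier: midpoint between the two class means."""
    pos = mean(x for x, y in zip(scaled, labels) if y == 1)
    neg = mean(x for x, y in zip(scaled, labels) if y == 0)
    return (pos + neg) / 2, pos > neg


# Invented toy data: one gene, three studies, 1 = tolerant, 0 = sensitive.
studies = ["s1", "s1", "s2", "s2", "s3", "s3"]
expr = [5.0, 1.0, 6.0, 2.0, 7.0, 3.0]
labels = [1, 0, 1, 0, 1, 0]

accuracy = {}
for held_out, train, test in loso_splits(studies):
    # Fit normalization on training studies only, so the held-out study cannot leak in.
    mu, sd = fit_scaler([expr[i] for i in train])
    train_scaled = [(expr[i] - mu) / sd for i in train]
    threshold, pos_is_higher = fit_threshold(train_scaled, [labels[i] for i in train])

    correct = 0
    for i in test:
        x = (expr[i] - mu) / sd  # reuse training parameters on held-out samples
        predicted = 1 if (x > threshold) == pos_is_higher else 0
        correct += predicted == labels[i]
    accuracy[held_out] = correct / len(test)

print(accuracy)
```

In a real analysis, feature selection and any hyperparameter tuning would sit inside the same loop, for the same reason the scaler does.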
5. Analysis Code & Source Code
Entry point for reproducibility — where to find and how to run everything.
- GitHub repository structure overview
- Key analysis scripts with brief descriptions
- How to run pipelines from scratch (environment setup, inputs, outputs)
- Links to relevant notebook entries (organized by pipeline/phase, not just chronologically)
6. Glossary
Terms defined in the context of this project.
- Domain terms (e.g., Dermo, tolerance phenotype)
- Statistical/computational methods (DESeq2, VST, LOSO, logistic regression)
- Acronyms and abbreviations
7. Sources & References
Primary source material underlying the guide.
- Notebook posts index (analysis logs)
- GitHub issues index (decision trail)
- External publications cited
Navigation Changes Summary
| Current Section | Proposed Section | Change |
|---|---|---|
| Start Here | Start Here | Rewritten as concise abstract |
| Problem Framing | Research Context & Problem Framing | Expanded with dataset details |
| Year in Review | Process Narrative | Renamed; framed as narrative |
| (none) | Big Lessons Learned | New section |
| Pipelines (3 sub-pages) | Methods & Pipelines (4 sub-pages) | Restructured; decision guide added |
| (none) | Analysis Code & Source Code | New section |
| Glossary | Glossary | Unchanged |
| Sources (posts + issues) | Sources & References | Moved to end; scoped as references |
This outline is a proposal for review. Full implementation will proceed after approval.