GitHub Issues Index
This index organizes the GitHub issues from the Cvirg_Pmarinus_RNAseq repository that were referenced in the project timeline and field guide. Issues are grouped thematically for easier navigation.
All issue links point to: https://github.com/Resilience-Biomarkers-for-Aquaculture/Cvirg_Pmarinus_RNAseq/issues/[number]
Data Preparation & Integration
Issue #3: Created Merged Metadata
Timeline: January 2025
Status: Completed
Summary: Initial effort to create unified metadata across multiple RNA-seq datasets for integrated analysis.
Issue #9: Add Study 5
Timeline: January 2025
Status: Completed
Summary: Incorporated an additional dataset (Study 5) into the analysis pipeline to increase sample size and assess reproducibility.
Issue #54: Find More Datasets
Timeline: September 2025
Status: Postponed
Summary: Exploration of additional RNA-seq datasets for broader validation. Deferred to focus on existing dataset analysis.
Differential Abundance Analysis
Issue #4: Differential Abundance Initial Approach
Timeline: January 2025
Status: Completed
Summary: First attempt at differential abundance analysis. Toy example succeeded, but encountered GTF file issues and weak trait separation in full datasets.
Key finding: Mutual information analysis on Perkinsus datasets showed insufficient separation between tolerant/sensitive groups.
Issue #12: Decide Best Methods and Execute
Timeline: January 2025
Status: Completed
Summary: Decision point for selecting optimal differential abundance methodologies after initial exploration revealed data integration challenges.
Issue #29: Differential Abundance on All Datasets Together
Timeline: April 2025
Status: Deferred
Summary: Ran differential abundance on all datasets together but results were not yet interpreted. Could return to this analysis.
Issue #31: Interpret Combined Dataset Results
Timeline: April 2025
Status: Deferred
Related: Analysis results
Summary: Follow-up to #29; interpretation was postponed to focus on per-dataset approaches.
Issue #32: Run Differential Abundance on Datasets Separately
Timeline: June 2025
Status: Completed
Summary: Pivot to per-dataset analysis. Steve focused on Study 5, Shelly focused on Study 1. Goal was understanding optimal parameters for differential abundance pipeline.
Issue #36: Compare DEGs Across Datasets
Timeline: September 2025
Status: Not completed
Summary: Compare differential abundance results run independently for each dataset. Theme: post-data integration approach. Question remains about reproducibility vs. integrated data analysis, but subsetting approach is uncertain.
Issue #46: Integrate All Data Through Differential Abundance Pipeline
Timeline: September 2025
Status: Not completed
Summary: Another attempt at integrated data analysis. Could revisit.
Batch Effects & Normalization
Issue #18: Batch Correction Attempts
Timeline: February 2025
Status: Completed
Summary: Attempted batch correction using COMBAT and RemoveBatchEffect methods. Results showed little improvement in trait-based separation.
Key lesson: Study-specific effects were stronger than trait effects; batch correction couldn’t recover sufficient signal.
Issue #34: Combine Study 4 Injected + Study 5
Timeline: July 2025
Status: Completed
Summary: Experimental combination of compatible studies (Study 4 injected group + Study 5) to increase sample size.
Research question: Would Study 4 injected samples cluster with resistant or susceptible phenotype from Study 5?
Learning: Gained understanding of normalization timing in differentialabundance pipeline (PCAs before analysis, normalization during). Started seeing evidence of innate trait.
Could return to: Revisit analysis on 567 significant DEGs (by DESeq) to see if clustering improves compared to top 500 most variable genes.
Technical Issues & Parameters
Issue #26: Parameter Selection & GC Bias
Timeline: April 2025
Status: Completed
Summary: Identified that Johnson dataset used TAG-seq (not standard RNA-seq) and discovered GC bias. Determined that initial analysis parameters were inappropriate for TAG-seq data.
Issue #28: Rerun Johnson Data with Different Parameters
Timeline: April 2025
Status: Completed
Related: Notebook post
Summary: Reprocessed TAG-seq data with appropriate FastP parameters to address issues identified in #26.
GSEA & Pathway Analysis
Issue #41: Stepwise Differential Abundance Approach
Timeline: August 2025
Status: Completed
Summary: Developed two-step approach: (1) Controls vs. treated, (2) Resistant vs. sensitive from step 1 genes.
Implementation: analyses/stepwise_differentialabundance/
Results on Dataset 1: Only 1 significant gene. DESeq2 struggled with small gene set (~50 genes).
Related: Stepwise notebook
Issue #45: Understand GSEA
Timeline: August-September 2025
Status: Not completed
Summary: Goal was to better understand and apply Gene Set Enrichment Analysis (GSEA) for pathway-level interpretation. Deferred.
Classifier Development & Validation
Issue #42: Validate SR320 Classification Results
Timeline: August 2025
Status: In progress
Summary: Validation of classification results. Question: Are the ~50 candidate markers convincing?
Action needed: Make plots to assess marker quality.
Issue #43: SR320’s AI Model
Timeline: August 2025
Status: Completed
Summary: Development of machine learning classifier for phenotype prediction.
Issue #44: Combined Datasets 1 & 5
Timeline: August 2025
Status: ✅ Completed - 6-gene classifier success
Summary: Comparison of integrated data analysis vs. post-data integration approaches using datasets 1 and 5.
Pipeline:
- Step 1: Rank genes by reproducibility, directionality consistency, and heterogeneity
- Step 2: Logistic regression for minimal gene set
Result: 6-gene classifier panel with strong separation between tolerant and sensitive phenotypes.
Key lesson: Only include training set in test set if exploring within study; for cross-study prediction, keep training and test separate (LOSO validation).
Related: Gene classifier notebook
Issue #47: (Details Unknown)
Timeline: September 2025
Status: Not pursued
Summary: No need to revisit.
Issue #49: Plot 6 Genes to Gain Confidence
Timeline: September 2025
Status: In progress
Summary: Visualization of 6-gene panel performance to assess how well genes distinguish phenotypes across studies.
Related: Six-gene biomarker exploration
Issue #51: Replot Heatmap with Improved Clustering
Timeline: September 2025
Status: To revisit
Summary: Improve heatmap visualizations with better clustering algorithms and labels for clearer interpretation of gene panel performance.
Issue #52: Coverage Density Plots
Timeline: September 2025
Status: Needs notebook entry
Summary: Generate coverage density plots for quality control and validation. Still requires documentation in a notebook post.
Issue #53: Innate vs. Reactive Gene Expression
Timeline: September 2025
Status: Completed
Summary: Critical analysis determining whether biomarkers are innate (constitutively different in controls) or reactive (induced by stress).
Key insight: Biomarkers may exist in control groups if they represent innate resilience traits. The stepwise approach may remove these.
Related: Innate gene expression notebook
Literature Comparison
Issue #39: Compare DE Results from Papers
Timeline: July 2025
Status: Remaining to be done
Summary: Systematic comparison of project DEG results with published literature on oyster stress response. Create consolidated list of known DEGs/markers for validation.
Navigation
Main Guide:
- Start Here - Field guide overview
- Timeline - Detailed project history
- Pipelines - Analysis workflows
Other Sources:
- Notebook Posts Index - Analysis notebooks organized by theme