Analysis Code & Source Code
Entry point for reproducibility — where to find and how to run everything.
GitHub Repository Structure
The primary analysis code lives in the Cvirg_Pmarinus_RNAseq repository:
Cvirg_Pmarinus_RNAseq/
├── analyses/
│   ├── stepwise_differentialabundance/   # Stepwise DE pipeline (Datasets 1 & 5)
│   └── Study1and5ThreeWay/
│       └── two_step_gene_expression_classifier/   # Two-step classifier (R + Python)
├── data/   # Input count matrices and metadata
└── README.md
The main project repository (this site) contains the field guide source and notebook posts.
Key Analysis Scripts
Two-Step Gene Expression Classifier
Location: analyses/Study1and5ThreeWay/two_step_gene_expression_classifier/
| Script | Language | Purpose |
|---|---|---|
| Step 1 script | R | Meta-analysis gene ranking: reproducibility, directionality consistency, heterogeneity |
| Step 2 script | Python | Logistic regression with LASSO; LOSO cross-validation |
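As a rough illustration of the Step 1 criteria, the sketch below scores one gene across studies for directionality consistency and heterogeneity. It is not the repository's R script, and Python is used here only for illustration. The function name `summarize_gene`, the input layout (per-study log2 fold changes with standard errors), and the choice of statistics (sign agreement with an inverse-variance pooled effect, Cochran's Q, and I²) are all assumptions. The reproducibility criterion is not covered.

```python
from math import fsum

def summarize_gene(log2fc, se):
    """Summarize one gene's effect across studies.

    log2fc, se: per-study log2 fold changes and their standard errors
    (hypothetical inputs; the real Step 1 script may use different ones).
    """
    weights = [1.0 / (s * s) for s in se]            # inverse-variance weights
    pooled = fsum(w * x for w, x in zip(weights, log2fc)) / fsum(weights)

    # Directionality consistency: share of studies whose sign matches the pooled effect.
    same_sign = sum(1 for x in log2fc if x * pooled > 0)
    consistency = same_sign / len(log2fc)

    # Heterogeneity: Cochran's Q and the derived I^2 (floored at 0).
    q = fsum(w * (x - pooled) ** 2 for w, x in zip(weights, log2fc))
    df = len(log2fc) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {"pooled_log2fc": pooled, "consistency": consistency, "Q": q, "I2": i2}


# Toy values for illustration only, not project data.
toy = {
    "geneA": ([1.2, 0.9, 1.1], [0.3, 0.3, 0.3]),     # consistent direction
    "geneB": ([1.5, -1.0, 0.2], [0.3, 0.3, 0.3]),    # mixed direction
}
summary = {g: summarize_gene(fc, se) for g, (fc, se) in toy.items()}

# Rank: most direction-consistent first, then least heterogeneous.
ranking = sorted(summary, key=lambda g: (-summary[g]["consistency"], summary[g]["I2"]))
print(ranking)
```

Sorting on consistency first and heterogeneity second is one possible ordering; a weighted composite score would be an equally plausible design.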
Related notebook: Two-script pipeline for gene-expression classifier
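The Step 2 idea, L1-penalized logistic regression evaluated with LOSO cross-validation, can be sketched with scikit-learn as follows. This is a minimal example on synthetic data, not the repository's script. It assumes LOSO means leave-one-study-out, implemented here with `LeaveOneGroupOut`. The data shapes, labels, the value of `C`, and the scaling step are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 40 samples x 20 genes from 4 studies (not project data).
n_samples, n_genes = 40, 20
X = rng.normal(size=(n_samples, n_genes))
y = np.tile([0, 1], n_samples // 2)               # hypothetical binary labels
X[y == 1, :3] += 1.5                              # make the first 3 genes informative
study = np.repeat(np.arange(4), n_samples // 4)   # study ID per sample

# L1 penalty gives LASSO-style sparsity; C is an arbitrary illustrative value.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
)

fold_accuracy = []
selected_per_fold = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=study):
    model.fit(X[train_idx], y[train_idx])         # scaler is fit on training studies only
    fold_accuracy.append(model.score(X[test_idx], y[test_idx]))
    coef = model.named_steps["logisticregression"].coef_.ravel()
    selected_per_fold.append({int(i) for i in np.flatnonzero(coef)})

print("accuracy per held-out study:", fold_accuracy)
print("genes selected in every fold:", sorted(set.intersection(*selected_per_fold)))
```

Keeping the scaler inside the pipeline avoids leaking information from the held-out study into training. Intersecting the nonzero coefficients across folds is one simple way to look for genes that are selected regardless of which study is held out.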
Stepwise Differential Abundance
Location: analyses/stepwise_differentialabundance/
Implements the two-step filtering approach: (1) control vs. treated, then (2) resistant vs. sensitive on the filtered gene set. See Stepwise Pipeline for methodology and known limitations.
Related notebook: Step-wise differential abundance on Dataset 1
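The two-step filtering logic can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the pipeline itself. The result tables, the thresholds (`padj_max`, `abs_lfc_min`), and the `significant` helper are hypothetical. The sketch also reuses precomputed statistics for the second contrast, whereas a real run would presumably re-fit that contrast on the reduced gene set, which changes the adjusted p-values.

```python
# Hypothetical per-gene results from two separate DE contrasts
# (toy numbers for illustration, not project output).
control_vs_treated = {
    "gene1": {"log2fc": 2.1, "padj": 0.001},
    "gene2": {"log2fc": 0.2, "padj": 0.600},
    "gene3": {"log2fc": -1.8, "padj": 0.010},
}
resistant_vs_sensitive = {
    "gene1": {"log2fc": 1.4, "padj": 0.020},
    "gene2": {"log2fc": 3.0, "padj": 0.001},
    "gene3": {"log2fc": 0.1, "padj": 0.900},
}

def significant(results, padj_max=0.05, abs_lfc_min=1.0, universe=None):
    """Return genes passing the thresholds, optionally restricted to `universe`."""
    genes = results if universe is None else (g for g in results if g in universe)
    return {
        g for g in genes
        if results[g]["padj"] <= padj_max and abs(results[g]["log2fc"]) >= abs_lfc_min
    }

# Step 1: genes responding to treatment.
step1 = significant(control_vs_treated)
# Step 2: resistant vs. sensitive, considered only among the step 1 genes.
step2 = significant(resistant_vs_sensitive, universe=step1)

print(sorted(step1))   # ['gene1', 'gene3']
print(sorted(step2))   # ['gene1']
```

In the toy data, `gene2` differs strongly between resistant and sensitive samples but never reaches step 2 because it did not pass step 1. This is the kind of trade-off a stepwise filter introduces.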
How to Run Pipelines from Scratch
Environment Setup
R dependencies (for Step 1 and nf-core/differentialabundance post-processing):
install.packages(c("ggplot2", "dplyr", "BiocManager"))
# DESeq2 and limma are distributed through Bioconductor, not CRAN:
BiocManager::install(c("DESeq2", "limma"))
Python dependencies (for Step 2 classifier):
pip install scikit-learn pandas numpy
nf-core pipelines (for upstream processing):
# Install Nextflow
curl -s https://get.nextflow.io | bash
# Run nf-core/rnaseq (read alignment and quantification)
nextflow run nf-core/rnaseq -profile docker --input samplesheet.csv --genome GRCv ...
# Run nf-core/differentialabundance (per-dataset DE analysis)
nextflow run nf-core/differentialabundance -profile docker ...
See nf-core/rnaseq docs and nf-core/differentialabundance docs for full parameter documentation.
Key Parameter Considerations
- TAG-seq datasets require different FastP parameters (see Lesson 6)
- PCAs in differentialabundance are generated before normalization; do not interpret them as post-normalization QC
- VST normalization requires > 1000 genes for stable estimation; the stepwise pipeline may violate this
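A simple pre-flight check can flag a filtered count matrix that falls below the gene-count threshold mentioned above. The sketch below is only an illustration. The helper names, the tab-delimited layout with a single header row, and the toy matrix are assumptions, and the 1000-gene cutoff is taken directly from the note above rather than from any tool's documentation.

```python
import csv
import io

VST_MIN_GENES = 1000  # threshold stated in the note above

def count_genes(handle, delimiter="\t"):
    """Count data rows (genes) in a delimited count matrix with one header row."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)                       # skip the header row
    return sum(1 for row in reader if row)   # ignore blank lines

def vst_is_safe(n_genes, minimum=VST_MIN_GENES):
    """True when the matrix has more genes than the stated minimum."""
    return n_genes > minimum

# Toy matrix standing in for a stepwise-filtered gene set (not project data).
toy = io.StringIO("gene_id\ts1\ts2\ngeneA\t10\t12\ngeneB\t0\t3\n")
n = count_genes(toy)
print(n, vst_is_safe(n))
```

For a real file, replace the `io.StringIO` object with an open file handle and run the check before launching the second step of the stepwise pipeline.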
Notebook Posts by Pipeline Phase
For detailed documentation of what was run and why, see notebook posts organized by project phase:
| Phase | Relevant Notebooks |
|---|---|
| Phase 1: Initial Integration | Differential abundance workflow exploration |
| Phase 2: Batch Effects | PCA and annotation comparison |
| Phase 3: TAG-seq & Per-Dataset | Reprocess TAG-seq with FastP params |
| Phase 4: Normalization | Differential abundance on C.virg data |
| Phase 5: Stepwise DE | Step-wise differential abundance Dataset 1, GMT file for GSEA |
| Phase 6: Classifier & Validation | Two-script gene classifier pipeline, Innate gene expression, Six-gene biomarker exploration, Common genes per LOSO fold |
All notebook posts are also indexed in Sources & References.
Next: Browse the complete Sources & References index, or see the Glossary for term definitions.