Analysis Code & Source Code

Entry point for reproducibility — where to find and how to run everything.

GitHub Repository Structure

The primary analysis code lives in the Cvirg_Pmarinus_RNAseq repository:

Cvirg_Pmarinus_RNAseq/
├── analyses/
│   ├── stepwise_differentialabundance/   # Stepwise DE pipeline (Datasets 1 & 5)
│   └── Study1and5ThreeWay/
│       └── two_step_gene_expression_classifier/  # Two-step classifier (R + Python)
├── data/                                 # Input count matrices and metadata
└── README.md

The main project repository (this site) contains the field guide source and notebook posts.

Key Analysis Scripts

Two-Step Gene Expression Classifier

Location: analyses/Study1and5ThreeWay/two_step_gene_expression_classifier/

Script	Language	Purpose
Step 1 script	R	Meta-analysis gene ranking: reproducibility, directionality consistency, heterogeneity
Step 2 script	Python	Logistic regression with LASSO; LOSO cross-validation

Related notebook: Two-script pipeline for gene-expression classifier
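
The cross-validation scheme used in Step 2 can be illustrated with a minimal sketch. It assumes "LOSO" means leave-one-study-out (each study is held out once while the model is trained on the rest); the actual Step 2 script may define its folds differently. The function name and the sample-to-study labels below are hypothetical.

```python
from collections import defaultdict


def leave_one_study_out(study_labels):
    """Yield (held_out_study, train_idx, test_idx) once per study.

    ASSUMPTION: LOSO is read as leave-one-study-out; this is not
    confirmed by the project scripts.
    """
    by_study = defaultdict(list)
    for idx, study in enumerate(study_labels):
        by_study[study].append(idx)

    for held_out in sorted(by_study):
        test_idx = by_study[held_out]
        # Training samples are all samples from every other study.
        train_idx = [
            i
            for study, members in sorted(by_study.items())
            if study != held_out
            for i in members
        ]
        yield held_out, train_idx, test_idx


# Hypothetical labels, one per sample (not taken from the project metadata).
studies = ["study1", "study1", "study5", "study5", "study5", "studyX"]
folds = list(leave_one_study_out(studies))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

In a full pipeline, each fold's training indices would be used to fit the L1-penalized model, for example scikit-learn's LogisticRegression(penalty="l1", solver="liblinear"), and the held-out study would be used only for evaluation. scikit-learn's LeaveOneGroupOut splitter produces the same kind of split.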

Stepwise Differential Abundance

Location: analyses/stepwise_differentialabundance/

Implements the two-step filtering approach: (1) control vs. treated, then (2) resistant vs. sensitive on the filtered gene set. See Stepwise Pipeline for methodology and known limitations.

Related notebook: Step-wise differential abundance on Dataset 1
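
The two-step logic described above can be sketched as follows. This is an illustration only: it assumes step 1 keeps genes below an adjusted p-value cutoff, and the function names, the cutoff, and the toy values are all hypothetical rather than taken from the repository.

```python
def stepwise_de(step1_padj, run_step2, alpha=0.05):
    """Two-step filter: keep step-1 hits, then test only those in step 2.

    step1_padj: gene -> adjusted p-value for control vs. treated
    run_step2:  callable taking the filtered gene list and returning
                gene -> adjusted p-value for resistant vs. sensitive
    alpha:      hypothetical significance cutoff (an assumption)
    """
    kept = sorted(g for g, p in step1_padj.items() if p < alpha)
    # Step 2 sees only the genes that survived step 1.
    step2_padj = run_step2(kept)
    hits = sorted(g for g in kept if step2_padj[g] < alpha)
    return kept, hits


# Toy values for illustration; not real results.
toy_step1 = {"geneA": 0.001, "geneB": 0.20, "geneC": 0.03, "geneD": 0.04}
toy_step2_table = {"geneA": 0.01, "geneC": 0.60, "geneD": 0.02}


def toy_step2(genes):
    # Stand-in for a real DE test restricted to the given genes.
    return {g: toy_step2_table[g] for g in genes}


kept, hits = stepwise_de(toy_step1, toy_step2)
print("entered step 2:", kept)
print("passed both steps:", hits)
```

Passing step 2 as a callable makes the dependency explicit: the second comparison never sees genes removed in step 1, which is why the size of the filtered set matters for downstream normalization.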

How to Run Pipelines from Scratch

Environment Setup

R dependencies (for Step 1 and nf-core/differentialabundance post-processing):

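# CRAN packages (BiocManager is needed to install from Bioconductor)
install.packages(c("ggplot2", "dplyr", "BiocManager"))
# DESeq2 and limma are distributed through Bioconductor, not CRAN
BiocManager::install(c("DESeq2", "limma"))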

Python dependencies (for Step 2 classifier):

pip install scikit-learn pandas numpy

nf-core pipelines (for upstream processing):

# Install Nextflow
curl -s https://get.nextflow.io | bash

# Run nf-core/rnaseq (read alignment and quantification)
nextflow run nf-core/rnaseq -profile docker --input samplesheet.csv --genome GRCv ...

# Run nf-core/differentialabundance (per-dataset DE analysis)
nextflow run nf-core/differentialabundance -profile docker ...

See nf-core/rnaseq docs and nf-core/differentialabundance docs for full parameter documentation.
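
The samplesheet.csv passed to nf-core/rnaseq can be generated programmatically. The sketch below assumes the four-column layout (sample, fastq_1, fastq_2, strandedness) used by recent nf-core/rnaseq releases; check the docs for the pipeline version in use. The sample names and FASTQ paths are hypothetical placeholders.

```python
import csv
import io

# Column layout assumed from recent nf-core/rnaseq releases.
FIELDS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def build_samplesheet(rows):
    """Return samplesheet CSV text for a list of sample dicts."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "sample": row["sample"],
            "fastq_1": row["fastq_1"],
            # Left empty for single-end libraries.
            "fastq_2": row.get("fastq_2", ""),
            # "auto" asks the pipeline to infer strandedness.
            "strandedness": row.get("strandedness", "auto"),
        })
    return buf.getvalue()


# Hypothetical samples and paths, for illustration only.
rows = [
    {"sample": "ctrl_rep1",
     "fastq_1": "reads/ctrl_rep1_R1.fastq.gz",
     "fastq_2": "reads/ctrl_rep1_R2.fastq.gz"},
    {"sample": "treated_rep1",
     "fastq_1": "reads/treated_rep1.fastq.gz"},
]
sheet = build_samplesheet(rows)
print(sheet)
```

Writing the returned text to samplesheet.csv gives a file that can be supplied through the --input option shown above.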

Key Parameter Considerations

TAG-seq datasets require different FastP parameters (see Lesson 6)
PCAs in differentialabundance are generated before normalization — do not interpret them as post-normalization QC
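VST normalization requires > 1000 genes for stable estimation (DESeq2's vst() fits its dispersion trend on a default subsample of 1000 genes); the filtered gene sets produced by the stepwise pipeline may fall below this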
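
The gene-count concern in the last point can be checked before normalization with a small guard. This is a sketch only: it treats 1000 as the minimum number of genes, which is an assumption about the exact boundary, and the helper name and gene counts are hypothetical.

```python
def flag_small_gene_sets(gene_counts, min_genes=1000):
    """Return the names of gene sets too small for stable VST.

    gene_counts: gene-set name -> number of genes in that set
    min_genes:   assumed minimum, based on the guidance above
    """
    return sorted(
        name for name, n in gene_counts.items() if n < min_genes
    )


# Hypothetical counts, for illustration only.
toy_counts = {"all_genes": 30000, "after_step1_filter": 850}
flagged = flag_small_gene_sets(toy_counts)
print("gene sets below the VST threshold:", flagged)
```

Running a check like this after each filtering step makes it visible when a reduced gene set should not be passed to VST unchanged.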

Notebook Posts by Pipeline Phase

For detailed documentation of what was run and why, see notebook posts organized by project phase:

Phase	Relevant Notebooks
Phase 1: Initial Integration	Differential abundance workflow exploration
Phase 2: Batch Effects	PCA and annotation comparison
Phase 3: TAG-seq & Per-Dataset	Reprocess TAG-seq with FastP params
Phase 4: Normalization	Differential abundance on C.virg data
Phase 5: Stepwise DE	Step-wise differential abundance Dataset 1, GMT file for GSEA
Phase 6: Classifier & Validation	Two-script gene classifier pipeline, Innate gene expression, Six-gene biomarker exploration, Common genes per LOSO fold

All notebook posts are also indexed in Sources & References.

Next: Browse the complete Sources & References index, or see the Glossary for term definitions.