Pipeline: Two-Step Classifier
Overview
The two-step classifier approach identifies a minimal gene panel that distinguishes tolerant from sensitive phenotypes through:
- Step 1: Rank genes based on reproducibility, directionality consistency, and heterogeneity across datasets
- Step 2: Use logistic regression to find the minimum gene set that achieves good classification
Status: ✅ Validated - Produced the working 6-gene classifier panel
When to Use This Approach
✅ Use when:
- You want a minimal gene panel for practical application (< 10 genes)
- You have multiple independent datasets to assess reproducibility
- You need cross-study validation (LOSO)
- You don’t want to exclude innate biomarkers (constitutively different)
✅ Advantages over stepwise:
- Doesn’t require control groups
- Preserves innate biomarkers
- Reproducibility-focused (prioritizes genes consistent across studies)
- Designed for minimal feature sets (logistic regression handles small gene counts)
The Pipeline
Prerequisites
- Multiple RNA-seq datasets with comparable phenotype labels (tolerant/sensitive)
- Gene count matrices for each dataset
- Metadata with phenotype labels
- Shared gene annotation across datasets
Input Data Structure
This approach uses post-data integration:
- Analyze each dataset independently first
- Compare results across datasets
- Identify reproducible signals
Dataset 1: [counts, metadata with phenotype]
Dataset 2: [counts, metadata with phenotype]
Dataset 3: [counts, metadata with phenotype]
...
Important: Do NOT pool datasets before analysis (see Problem Framing - Big Lesson #1)
Step 1: Reproducibility-Based Gene Ranking
Goal
Identify genes that consistently differentiate phenotypes across multiple studies
Criteria
For each gene, assess:
- Reproducibility: Gene is differentially expressed in multiple datasets
- Directionality consistency: Direction of fold change is same across datasets (e.g., always upregulated in tolerant)
- Heterogeneity: Variance in expression within phenotype groups
Implementation Approach
# Pseudocode for Step 1
for each_dataset in datasets:
# Run differential expression
deg_results[dataset] = DESeq2(
counts=dataset.counts,
design="~ phenotype"
)
# Score genes across datasets
gene_scores = {}
for gene in all_genes:
# Count how many datasets show significant DE
reproducibility = count_significant_across_datasets(gene, deg_results)
# Check directionality consistency
fold_changes = [deg_results[ds][gene].log2FC for ds in datasets]
directionality = 1.0 if all_same_sign(fold_changes) else 0.0
# Assess heterogeneity
heterogeneity = calculate_within_group_variance(gene, datasets)
# Combined score
gene_scores[gene] = (
reproducibility * directionality / heterogeneity
)
# Rank genes by score
ranked_genes = sort_descending(gene_scores)
Output
Ranked list of candidate genes, prioritizing those that:
- Appear as DEGs in multiple datasets
- Show consistent direction of change
- Have low within-group variance (high between-group separation)
Step 2: Logistic Regression for Minimal Panel
Goal
Find the smallest set of genes that achieves good classification accuracy
Method
Use logistic regression with regularization (LASSO or elastic net) to select features:
# Pseudocode for Step 2
# Start with top N genes from Step 1 (e.g., top 50-100)
candidate_genes = ranked_genes[:100]
# Prepare training data (combined datasets)
X_train = combined_expression_matrix[candidate_genes]
y_train = phenotype_labels # tolerant vs. sensitive
# Fit logistic regression with LASSO regularization
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(
penalty='l1', # LASSO for feature selection
solver='liblinear',
cv=5, # cross-validation to find optimal regularization
max_iter=1000
)
clf.fit(X_train, y_train)
# Extract selected genes (non-zero coefficients)
selected_genes = candidate_genes[clf.coef_[0] != 0]
print(f"Minimal gene panel: {len(selected_genes)} genes")
print(selected_genes)
Regularization Parameter Selection
- Use cross-validation to find optimal L1 penalty
- Balance between:
- Fewer genes (higher penalty) → simpler assay, may sacrifice accuracy
- More genes (lower penalty) → better accuracy, more complex assay
Implementation Example: 6-Gene Classifier
From Issue #44 and notebook post:
Datasets Used
- Dataset 1: C. virginica with P. marinus exposure
- Dataset 5: C. virginica with P. marinus exposure
Results
Step 1 output: Ranked ~1000s of genes by reproducibility
Step 2 output: 6-gene panel with strong phenotype separation
Performance
✅ Strong separation between tolerant and sensitive phenotypes
✅ Genes are reproducible across independent datasets
✅ Small enough for practical assay development
Related notebook: Two-script pipeline for gene classifier
Validation: Leave-One-Study-Out (LOSO)
!!! warning “Big Lesson #4: Avoid Training Set in Test Set” When evaluating cross-study performance:
- **Within-study validation:** OK to include training samples if only evaluating performance within same study
- **Cross-study validation:** **Never include training study in test set** → use LOSO
Including training data in test set leads to overfitting and falsely optimistic accuracy.
LOSO Procedure
For each dataset:
- Train classifier on all OTHER datasets
- Test on held-out dataset (never seen during training)
- Report accuracy, sensitivity, specificity
# LOSO Cross-Validation
for test_dataset in all_datasets:
training_datasets = [ds for ds in all_datasets if ds != test_dataset]
# Train on all except test_dataset
X_train = combine_datasets(training_datasets, selected_genes)
y_train = get_labels(training_datasets)
clf.fit(X_train, y_train)
# Test on held-out dataset
X_test = test_dataset[selected_genes]
y_test = test_dataset['phenotype']
accuracy = clf.score(X_test, y_test)
print(f"Test on {test_dataset.name}: Accuracy = {accuracy:.3f}")
Expected Performance
- High accuracy (> 80%) across LOSO folds suggests robust, generalizable panel
- Variable accuracy suggests study-specific effects; may need batch correction or dataset filtering
See Validation & Pitfalls for more on LOSO and other validation strategies.
Innate vs. Reactive Biomarkers
A key advantage of this approach: it captures innate biomarkers
!!! success “Innate Biomarker Preservation” Unlike the stepwise approach, the classifier method does not filter based on stress response.
Genes that are constitutively different between resistant/sensitive oysters (even in controls) will be:
- Identified in Step 1 if reproducible across datasets
- Selected in Step 2 if predictive of phenotype
Analysis: Issue #53 and notebook
To characterize whether your biomarkers are innate or reactive:
- Compare gene expression in controls only (resistant vs. sensitive)
- If significant → innate biomarker
- If not significant in controls but significant in treated → reactive biomarker
Workflow Summary
graph TD
A[Multiple Datasets<br/>Independent DE Analysis] --> B[Step 1: Gene Scoring]
B --> C[Reproducibility:<br/>Present in multiple datasets]
B --> D[Directionality:<br/>Consistent fold change sign]
B --> E[Heterogeneity:<br/>Low within-group variance]
C --> F[Ranked Gene List]
D --> F
E --> F
F --> G[Step 2: Logistic Regression<br/>with LASSO]
G --> H[Minimal Gene Panel<br/>e.g., 6 genes]
H --> I[LOSO Validation]
I --> J{Accuracy > 80%<br/>across folds?}
J -->|Yes| K[Validated Classifier]
J -->|No| L[Refine: Add features,<br/>adjust penalty, filter datasets]
Code & Resources
Primary analysis repository:
Cvirg_Pmarinus_RNAseq/analyses/
Key notebook:
Two-script pipeline for gene-expression classifier
Related issues:
Additional exploration:
Practical Considerations
Minimum Dataset Requirements
- At least 2 independent studies (for reproducibility assessment)
- Comparable phenotype definitions (resistant/sensitive)
- Sufficient sample size (n > 10 per group per study recommended)
Computational Requirements
- Moderate: Can run on laptop/workstation
- No need for HPC (unlike full nf-core pipelines)
- Python/R with standard ML libraries (scikit-learn, glmnet)
Assay Development
Once you have a minimal panel:
- 6-10 genes → feasible for qPCR assay
- < 20 genes → feasible for targeted RNA-seq panel
- > 20 genes → may need full RNA-seq (cost-prohibitive for routine screening)
Next: Learn about validation strategies and common pitfalls in Validation & Pitfalls