Main Goals:
- Understand how Diff. Abundance pipeline works
- Understand the differences in immune responses that lead to tolerant and sensitive phenotypes
- compare fed vs. injected animals from study 4 (Steve)
- compare breeds in dosed animals from study 1 (Shelly)
For Study 1 Prosteu et al. 2023, RNAseq analysis included:
- Preliminary RNAseq analysis: PCA of VST transformed data
- Separation of treatment groups
- WGCNA to identify modules related to traits
- DESeq on dose 10^8 samples only
- threshold: transcripts must have >=10 counts across all samples
- comparison of family variable only
- GLM shrinkage applied to Log2 fold change estimates
- BH FDR correction
Question to answer:
- Does breed impact expression under dose-independent infection?
- Dosed vs. control
- Ideally we’d want to test: expression ~ Breed + treatment + Breed:treatment
- Since testing for interactive effects is not straight forward in this pipeline (), we’ll select only dosed animals and compare breeds
- another option to try that we may be able to extract the effects from breed: id,variable,reference,target,blocking treatment_control_dosed_blockrep,treatment,Control,Dose,breed
Prepare input files for pipeline
GTF
The GTF had some errors that Emma found previously. One problem was empty gene_ids and one problem was omission of ‘transcript’ in the 3rd field (see github issue #25). A workaround for transcript not being in the 3rd field is changing the feature type that the pipeline uses from transcript to gene using the --features_gtf_feature_type
parameter.
Correcting the GTF for empty gene_ids:
# check the gene_id fields in entire file
zcat GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz | awk -F"\t" '{print $9}'| awk -F";" '{print $1}' | awk '{print $1, substr($2,1,4)}' | sort | uniq -c | less
5
26 gene_id ""
4 gene_id "ATP
12 gene_id "COX
5 gene_id "CYT
1530292 gene_id "LOC
4 gene_id "ND1
4 gene_id "ND2
4 gene_id "ND3
8 gene_id "ND4
4 gene_id "ND5
4 gene_id "ND6
1175 gene_id "Trn
only 26 lines have empty gene_ids
#check the lines that don't have 'gene_id'
zcat GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz | grep -v 'gene_id'
#gtf-version 2.2
#!genome-build C_virginica-3.0
#!genome-build-accession NCBI_Assembly:GCF_002022765.2
#!annotation-source NCBI Crassostrea virginica Annotation Release 100
###
it’s only the header that doesn’t have gene_id in the line. For lines without gene_id, create one using the position
# make gene ID the position
zcat GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz | awk -F'\t' '{if ($9 ~ /gene_id ""/) {$9 = gensub(/gene_id ""/, "gene_id \"LOC_" $4 "-" $5 "\"", 1, $9);}OFS="\t"; print}' \
#check example that previously had 'gene_id ""'
| grep 'pos:15923..15925'
NC_007175.2 RefSeq exon 15892 15957 . + . gene_id "LOC_15892-15957"; transcript_id "unknown_transcript_1"; anticodon "(pos:15923..15925)"; gbkey "tRNA"; product "tRNA-Ala"; exon_number "1";
#that code works so save new GTF file
cat /gscratch/srlab/elstrand/genomes/C_virginica/GCF_002022765.2_C_virginica-3.0_genomic.gtf | awk -F'\t' '{if ($9 ~ /gene_id ""/) {$9 = gensub(/gene_id ""/, "gene_id \"LOC_" $4 "-" $5 "\"", 1, $9);}OFS="\t"; print}' > /gscratch/srlab/strigg/GENOMES/GCF_002022765.2_C_virginica-3.0_genomic_noEmptyGeneIDs.gtf
#check new GTF doesn't have empty gene_id fields
grep 'gene_id ""' /gscratch/srlab/strigg/GENOMES/GCF_002022765.2_C_virginica-3.0_genomic_noEmptyGeneIDs.gtf
Sampleheet We’re modifying the samplesheet used in RNAseq pipeline because we need to add the meta data that the contrasts file will use to the samplesheet data
I subsetted the samples and excluded the controls for study 1.
Here is Steve’s samplesheet for example: rnaseq_difabundance_study4_s3_samplesheet.csv. There are repeated columns in this file (Collection_Interval_Days, Collection_Date, Treatment)
Contrasts file
id,variable,reference,target
breed_dose_120_84,BREED,ABC_VIMS_Family_2017120,ABC_VIMS_Family_2017084
breed_dose_90_84,BREED,ABC_VIMS_Family_2017090,ABC_VIMS_Family_2017084
breed_dose_90_89,BREED,ABC_VIMS_Family_2017090,ABC_VIMS_Family_2017089
breed_dose_120_89,BREED,ABC_VIMS_Family_2017120,ABC_VIMS_Family_2017089
breed_dose_120_90,BREED,ABC_VIMS_Family_2017120,ABC_VIMS_Family_2017090
breed_dose_89_84,BREED,ABC_VIMS_Family_2017089,ABC_VIMS_Family_2017084
Here are some examples:
- Emma’s
id,variable,reference,target
condition_test_experiment,condition,test,experiment
-Steve’s
id,variable,reference,target,blocking
D6_treatment_control_fed,treatment,Control,Fed,batch
D6_treatment_control_inj,treatment,Control,Injected,batch
Gene counts matrix Steve used merged gene counts from all studies https://raw.githubusercontent.com/Resilience-Biomarkers-for-Aquaculture/Cvirg_Pmarinus_RNAseq/refs/heads/main/data/rnaseq_gene_counts/merged_gene_counts.tsv
Emma’s salmon.merged.gene_counts_length_scaled.tsv
Running the pipeline
- Our matrix is merged_gene_counts so we should be using
gene
for the feature type. So we have to setfeatures_gtf_feature_type
because the default is transcript
screen -S diffabund
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=16GB --time=24:00:00
mamba activate nextflow
#download gene counts matrix
wget https://raw.githubusercontent.com/Resilience-Biomarkers-for-Aquaculture/Cvirg_Pmarinus_RNAseq/refs/heads/main/data/rnaseq_gene_counts/salmon.merged.gene_counts_length_scaled_dataset1.tsv
#download sample contrasts
wget https://raw.githubusercontent.com/Resilience-Biomarkers-for-Aquaculture/Cvirg_Pmarinus_RNAseq/refs/heads/main/data/differential_abundance_sheets/study1_contrasts.csv
#download samplesheet
wget https://raw.githubusercontent.com/Resilience-Biomarkers-for-Aquaculture/Cvirg_Pmarinus_RNAseq/refs/heads/main/data/differential_abundance_sheets/study1_samplesheet.csv
nextflow run nf-core/differentialabundance \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
-resume \
--study_name dataset_1 \
--input study1_samplesheet.csv \
--contrasts study1_contrasts.csv \
--outdir /gscratch/scrubbed/strigg/analyses/20250701_diffabund \
--matrix /gscratch/scrubbed/strigg/analyses/20250701_diffabund/salmon.merged.gene_counts_length_scaled_dataset1.tsv \
--feature_type gene \
--features_gtf_feature_type gene \
--gtf /gscratch/srlab/strigg/GENOMES/GCF_002022765.2_C_virginica-3.0_genomic_noEmptyGeneIDs.gtf
- pipeline results report here: 20250701_diffabund/report
attempt with thresholding
nextflow run nf-core/differentialabundance \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
-resume \
--study_name dataset_1 \
--input study1_samplesheet.csv \
--contrasts study1_contrasts.csv \
--outdir /gscratch/scrubbed/strigg/analyses/20250701_diffabund \
--matrix /gscratch/scrubbed/strigg/analyses/20250701_diffabund/salmon.merged.gene_counts_length_scaled_dataset1.tsv \
--feature_type gene \
--features_gtf_feature_type gene \
--filtering_min_abundance 10 \
--filtering_min_proportion 0.75 \
--gtf /gscratch/srlab/strigg/GENOMES/GCF_002022765.2_C_virginica-3.0_genomic_noEmptyGeneIDs.gtf
-
pipeline results report here: 20250701_diffabund/read_thresh10_minSamp075/
-
complete results here: https://gannet.fish.washington.edu/metacarcinus/USDA_MetaOmics/Cvirg_RNAseq/20250701_diffabund/read_thresh10_minSamp075/
Attempt combining resistant and sensitive families as one
84 and 89 = resistant; 90 and 120 = sensitive made a trait column in the samplesheet that indicates resistant or sensitive
contrasts file:
id,variable,reference,target
trait_dose,trait,sensitive,resistant
run pipeline:
nextflow run nf-core/differentialabundance \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
-resume \
--study_name dataset_1 \
--input study1_samplesheet.csv \
--contrasts study1_contrasts.csv \
--outdir /gscratch/scrubbed/strigg/analyses/20250702_diffabund \
--matrix /gscratch/scrubbed/strigg/analyses/20250701_diffabund/salmon.merged.gene_counts_length_scaled_dataset1.tsv \
--feature_type gene \
--features_gtf_feature_type gene \
--filtering_min_abundance 10 \
--filtering_min_proportion 0.75 \
--gtf /gscratch/srlab/strigg/GENOMES/GCF_002022765.2_C_virginica-3.0_genomic_noEmptyGeneIDs.gtf
-
pipeline results report here: 20250701_diffabund/read_thresh10_minSamp075/
-
complete results here: https://gannet.fish.washington.edu/metacarcinus/USDA_MetaOmics/Cvirg_RNAseq/20250702_diffabund/
Pipeline Errors
Error in checkListIsSubset(val, samples[[var]], "contrast levels", "sample metadata variable") :
Not all contrast levels (ABC_VIMS_Family_2017190) are available in the sample metadata variabl
e (ABC_VIMS_Family_2017084,ABC_VIMS_Family_2017089,ABC_VIMS_Family_2017120,ABC_VIMS_Family_2017084
,ABC_VIMS_Family_2017084,ABC_VIMS_Family_2017089,ABC_VIMS_Family_2017089,ABC_VIMS_Family_2017090,A
BC_VIMS_Family_2017090,ABC_VIMS_Family_2017120,ABC_VIMS_Family_2017089,ABC_VIMS_Family_2017084,ABC
_VIMS_Family_2017089,ABC_VIMS_Family_2017090,ABC_VIMS_Family_2017090,ABC_VIMS_Family_2017120,ABC_V
IMS_Family_2017084,ABC_VIMS_Family_2017084,ABC_VIMS_Family_2017089,ABC_VIMS_Family_2017090,ABC_VIM
S_Family_2017120,ABC_VIMS_Family_2017120,ABC_VIMS_Family_2017084,ABC_VIMS_Family_2017089,ABC_VIMS_
Family_2017089,ABC_VIMS_Family_2017090,ABC_VIMS_Family_2017084,ABC_VIMS_Family_2017090,ABC_VIMS_Fa
mily_2017120,ABC_VIMS_Family_2017120,ABC_VIMS_Family_2017084,ABC_VIMS_Family_2017084,ABC_VIMS_Fami
ly_2017089,ABC_VIMS_Family_2017090,ABC_VIMS_Family_2017090,ABC_VIMS_Family_2017120,ABC_VIMS_Family
_2017089,ABC_VI
Calls: validate_inputs -> read_contrasts -> checkListIsSubset
Execution halted
This error came from an error in my contrast file which had ABC_VIMS_Family_2017190 instead of ABC_VIMS_Family_2017090. I corrected the contrasts file and reran the pipeline with the resume option
I got another error of file name collision and I think it’s because my IDs in my contrast file were all the same so i made them distinct and ran it again.
old contrasts file: id,variable,reference,target breed_dose,BREED,ABC_VIMS_Family_2017120,ABC_VIMS_Family_2017084 breed_dose,BREED,ABC_VIMS_Family_2017190,ABC_VIMS_Family_2017084 breed_dose,BREED,ABC_VIMS_Family_2017190,ABC_VIMS_Family_2017089 breed_dose,BREED,ABC_VIMS_Family_2017120,ABC_VIMS_Family_2017089 breed_dose,BREED,ABC_VIMS_Family_2017120,ABC_VIMS_Family_2017190 breed_dose,BREED,ABC_VIMS_Family_2017089,ABC_VIMS_Family_2017084
#checking fields in GTF file zcat GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz | awk -F’\t’ ‘{print NF}’ | sort | uniq -c 5 1 1531542 9
#checking fields in GFF file zcat GCF_002022765.2_C_virginica-3.0_genomic.gff.gz | awk -F’\t’ ‘{print NF}’ | sort | uniq -c
zcat GCF_963853765.1_xbMagGiga1.1_genomic.gtf.gz | awk -F’\t’ ‘{print $3,$4,$5,$9}’ | awk -F’;’ ‘{print $1}’ | awk ‘{if(($1~/gene/) | ($1~/transcript/)) print $0}’ | awk ‘{if (NR % 2 == 1) {v1 = $2; v2 = $3} else {if ($2 != v1 | $3 != v2) {print “Mismatch at line “ NR “: Expected”, v1, v2, “but got”, $2, $3}}}END{if (NR % 2 == 1) {print “Warning: Last line has no pair”} else {print “All pairs matched”}}’ |
Dataset 4: 2021 Proestou
Condition: Fed/Inj/Control Batch: sampling Time: days
expression ~ condition + time + condition:time + 1 | batch |
different lines in contrast file for each different comparison. For example:
- For day 6:
- expression ~ condition + batch
id,variable,reference,target,blocking D6_treatment_control_fed_blockrep,treatment,Control,Fed,replicate;batch D6_treatment_control_inj_blockrep,treatment,Control,Injected,replicate;batch
id,variable,reference,target,blocking D28_treatment_control_fed_block,treatment,Control,Fed,batch D28_treatment_control_inj_block,treatment,Control,Injected,batch D28_treatment_control_fed_block,treatment,Fed,Injected, batch
id,variable,reference,target,blocking