RNASeq workflow for meta analysis with Cvig immunity
Summary
We are analyzing 5 datasets of Eastern Oyster, C. virginica, exposed to Perkinsus marinus infections.
Reference genome
All datasets work with the Eastern Oyster, C. virginica, so I need to download the genome, link2.
cd /mmfs1/gscratch/scrubbed/elstrand/genomes
mkdir C_gigas
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0/GCF_002022765.2_C_virginica-3.0_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0/GCF_002022765.2_C_virginica-3.0_genomic.gff.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0/GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz
## modifying the empty gene id columns
zcat GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz | awk -F"\t" '{if($9 ~/gene_id ""/ ) gsub(/gene_id ""/,"gene_id \"unknown_transcript_1\"",$9);print $0}' | gzip > mod_GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz
Final script
I was messing with haivng the fastp arguments with $4 but was getting errors with that. So changed the clipping parameters as needed.
rnaseq_Cvir.sh
#!/bin/bash
#SBATCH --account=srlab
#SBATCH --error=/gscratch/scrubbed/elstrand/Cvir_disease_meta/scripts/output/"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=/gscratch/scrubbed/elstrand/Cvir_disease_meta/scripts/output/"%x_output.%j" #once your job is completed, any final job report comments will be put in this file
#SBATCH --partition=cpu-g2-mem2x
#SBATCH --nodes=1
#SBATCH --time=1-20:00:00
#SBATCH --mem=200G
#SBATCH --ntasks=7
#SBATCH --cpus-per-task=2
#conda activate nextflow
## paths defined in the sbatch command
samplesheet=$1
output=$2
multiqc_title=$3
# fastp=$4
## paths for all datasets
genome="/gscratch/scrubbed/elstrand/genomes/C_virginica/GCF_002022765.2_C_virginica-3.0_genomic.fna.gz"
gff="/gscratch/scrubbed/elstrand/genomes/C_virginica/GCF_002022765.2_C_virginica-3.0_genomic.gff.gz"
gtf="/gscratch/scrubbed/elstrand/genomes/C_virginica/mod_GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz"
nextflow run nf-core/rnaseq -resume \
-c /gscratch/scrubbed/elstrand/uw_hyak_srlab_estrand.config \
-profile singularity \
--input ${samplesheet} \
--outdir ${output} \
--gtf ${gtf} \
--gff ${gff} \
--fasta ${genome} \
--trimmer fastp \
--extra_fastp_args '--cut_mean_quality 30 --trim_front1 10 --trim_front2 10' \
--aligner star_salmon \
--skip_pseudo_alignment \
--multiqc_title ${multiqc_title} \
--deseq2_vst
How to run (example from dataset #4):
conda activate nextflow
cd /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset4
sbatch -J Cvir_disease_rnaseq_dataset4 ../scripts/rnaseq_Cvir.sh \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets/samplesheet_rnaseq_dataset4.csv \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset4 \
Cvir_disease_rnaseq_dataset4 \
'--cut_mean_quality 30 --trim_front1 0 --trim_front2 0'
How to check output while script is running:
cd /gscratch/scrubbed/elstrand/Cvir_disease_meta/scripts/output
### two log files are generated from each sbatch script
### the titles are based on the -J job name indicated above
Cvir_disease_rnaseq_dataset4_error.22760577
Cvir_disease_rnaseq_dataset4_output.22760577
less <output_file>
## Shift + G to get to the bottom of the file to see something like this:
[ec/008475] NFC<E2><80><A6>ginica-3.0_genomic.fna.gz) | 1 of 1 <E2><9C><94>
[c0/569e4f] NFC<E2><80><A6>ginica-3.0_genomic.gtf.gz) | 1 of 1 <E2><9C><94>
[c0/63882a] NFC<E2><80><A6>virginica-3.0_genomic.fna) | 1 of 1 <E2><9C><94>
[93/224f7b] NFC<E2><80><A6>-3.0_genomic.filtered.gtf) | 1 of 1 <E2><9C><94>
[9a/2ccc66] NFC<E2><80><A6>virginica-3.0_genomic.fna) | 1 of 1 <E2><9C><94>
[d9/a97eb2] NFC<E2><80><A6>virginica-3.0_genomic.fna) | 1 of 1 <E2><9C><94>
[33/76468b] NFC<E2><80><A6>virginica-3.0_genomic.fna) | 1 of 1 <E2><9C><94>
[- ] NFC<E2><80><A6>_SETSTRANDEDNESS:CAT_FASTQ -
[0b/e866d0] NFC<E2><80><A6>P:FASTQC_RAW (SRX18040063) | 61 of 61 <E2><9C><94>
[cc/41ac6d] NFC<E2><80><A6>_FASTP:FASTP (SRX18040033) | 61 of 61 <E2><9C><94>
[26/ac53fa] NFC<E2><80><A6>:FASTQC_TRIM (SRX18040033) | 61 of 61 <E2><9C><94>
[16/899d3d] NFC<E2><80><A6>EX (genome.transcripts.fa) | 1 of 1 <E2><9C><94>
[dc/663c4e] NFC<E2><80><A6>FQ_SUBSAMPLE (SRX18040033) | 61 of 61 <E2><9C><94>
[ce/ab5214] NFC<E2><80><A6>SALMON_QUANT (SRX18040033) | 61 of 61 <E2><9C><94>
[73/e2416c] NFC<E2><80><A6>R:STAR_ALIGN (SRX18040038) | 22 of 62, failed: 1, retries: 1
[43/19aa0b] NFC<E2><80><A6>AMTOOLS_SORT (SRX18040038) | 16 of 21
[f4/a1733e] NFC<E2><80><A6>MTOOLS_INDEX (SRX18040056) | 0 of 16
[- ] NFC<E2><80><A6>TS_SAMTOOLS:SAMTOOLS_STATS -
[- ] NFC<E2><80><A6>SAMTOOLS:SAMTOOLS_FLAGSTAT -
[- ] NFC<E2><80><A6>SAMTOOLS:SAMTOOLS_IDXSTATS -
[a2/5d5395] NFC<E2><80><A6>SALMON_QUANT (SRX18040038) | 16 of 21
[4f/4acd78] NFC<E2><80><A6>RKDUPLICATES (SRX18040056) | 0 of 16
Plus 30 more processes waiting for tasks<E2><80><A6>
How I got there
Notes in chronological order of trial and error.
Datasets
See ABOUT page for table of datasets
Set # | Species | Data Type | stress class | stressor | Phenotype | Phenotpe Summary | Reference | DOI | SRA IDs | SRA links | meta data file | counts table | DEG data |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | C. virginica | T | disease | Perkinsus | infection tolerance | resilience | Proestou et al. 2023 | https://doi.org/10.3389/fgene.2023.1054558 | SRX180s | https://www.ncbi.nlm.nih.gov/sra?LinkName=bioproject_sra_all&from_uid=894694 | SraRunTable.csv | ||
2 | C. virginica | E,T | disease | Perkinsus | infected | sensitivity | Johnson et al. 2020 | https://doi.org/10.3389/fmars.2020.00598 | SRX765s | https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP246310&o=acc_s%3Aa | SraRunTable(1).csv | ||
3 | C. gigas & virginica | T | disease | Perkinsus | infection tolerance | resilience | Chan et al. 2021 | 10.3389/fgene.2021.795706 | SRX564s | https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SAMN11031730&o=acc_s%3Aa | SraRunTable(2).csv | ||
4 | C. virginica | T | disease | Perkinsus | infection | sensitivity | Sullivan and Proestou 2021 | https://doi.org/10.1016/j.aquaculture.2021.736831 | SRX984s | https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP301630&o=acc_s%3Aa | SraRunTable(3).csv |
Shelly downloaded set 1-4 and I downloaded #5 on 1-28-2025 once we added the fifth dataset. I followed her instructions in this post.
I went to SRA Run selector webpage and used PRJNA590205 to find the associated files. I then clicked download accession list which downloads a .txt file. I copied these into an excel and saved as Cvir_prkns_dataset5_SRAids.csv
. I used scp C:/Users/EmmaStrand/MyProjects/Resilience_Biomarkers_Aquaculture/Cvirg_Pmarinus_RNAseq/data/Cvir_prkns_dataset5_SRAids.csv elstrand@klone.hyak.uw.edu:/mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset5
Custom config for nextflow: (/gscratch/scrubbed/elstrand/uw_hyak_srlab_estrand.config
)
nano uw_hyak_srlab_estrand.config
### paste below information
params {
config_profile_description = 'UW Hyak Roberts labs cluster profile provided by nf-core/configs.'
config_profile_contact = 'Emma Strand emma.strand@gmgi.org'
config_profile_url = 'https://faculty.washington.edu/sr320/'
}
process {
executor = 'slurm'
queue = { task.attempt == 1 ? 'ckpt' : 'cpu-g2-mem2x' }
maxRetries = 1
clusterOptions = { "-A srlab" }
scratch = '/gscratch/scrubbed/srlab/'
resourceLimits = [
cpus: 16,
memory: '150.GB',
time: '72.h'
]
}
executor {
queuesize = 50
submitRateLimit = '1 sec'
}
singularity {
enabled = true
autoMounts = true
cacheDir = '/gscratch/scrubbed/srlab/.apptainer'
}
trace {
enabled = true
file = 'pipeline_trace.txt'
fields = 'task_id,hash,name,status,exit,submit,duration,realtime,%cpu,rss,vmem,rchar,wchar,disk,queue,hostname,cpu_model'
}
debug {
cleanup = false
}
k8s {
projectDir = '/gscratch/scrubbed/elstrand'
}
Running fetchngs after activating conda: conda activate nextflow
fetchngs.sh
#!/bin/bash
#SBATCH --job-name=fetchngs
#SBATCH --account=srlab
#SBATCH --error="%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output="%x_output.%j" #once your job is completed, any final job report comments will be put in this file
#SBATCH --partition=cpu-g2-mem2x
#SBATCH --nodes=1
#SBATCH --time=1-20:00:00
#SBATCH --mem=50G
#SBATCH --ntasks=7
#SBATCH --cpus-per-task=2
#SBATCH --chdir=/mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset5
## conda environment needs to be activated prior
# run nextflow pipeline
nextflow run nf-core/fetchngs -resume \
-c /gscratch/scrubbed/elstrand/uw_hyak_srlab_estrand.config \
--input /mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset5/Cvir_prkns_dataset5_SRAids.csv \
--outdir /mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset5 \
--download_method sratools
After this finished ~22 min then I renamed the data files and moved output to a new fetchnegs folder.
Workflow
This workflow is fully documented in a separate post here.
Using the mamba environment I created for CGigas thermal tolerance dataset #1.
Path to fastq files: /gscratch/scrubbed/strigg/analyses/20241108_RNAseq
Path to samplesheet created by fetchNGS: /gscratch/scrubbed/strigg/analyses/20241108_RNAseq/samplesheet/samplesheet.csv
Path to Emma’s analysis: /mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_2024Nov11
Path to C.virginica genome: /mmfs1/gscratch/scrubbed/elstrand/genomes/C_virginica/GCF_002022765.2_C_virginica-3.0_genomic.fna.gz
Create new samplesheet to fit rnaseq workflow:
scp xxx@klone.hyak.uw.edu:/gscratch/scrubbed/strigg/analyses/20241108_RNAseq/samplesheet/samplesheet.csv C:\Users\EmmaStrand\Downloads
scp xxx@klone.hyak.uw.edu:/gscratch/scrubbed/strigg/analyses/20241205_FetchNGS/samplesheet/samplesheet.csv C:\Users\EmmaStrand\Downloads
scp xxx@klone.hyak.uw.edu:/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset5/fetchngs/samplesheet/samplesheet.csv C:\Users\EmmaStrand\Downloads
scp C:\Users\EmmaStrand\MyProjects\Resilience_Biomarkers_Aquaculture\samplesheet_rnaseq_Cvir_disease_set1.csv xxx@klone.hyak.uw.edu:/mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_2024Nov11/
scp C:\Users\EmmaStrand\MyProjects\Resilience_Biomarkers_Aquaculture\Cvirg_Pmarinus_RNAseq\data\rnaseq samplesheets\*.csv xxx@klone.hyak.uw.edu:/mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets
scp C:\Users\EmmaStrand\MyProjects\Resilience_Biomarkers_Aquaculture\Cvirg_Pmarinus_RNAseq\data\rnaseq samplesheets\*.csv xxx@klone.hyak.uw.edu:/mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets
scp C:\Users\EmmaStrand\MyProjects\Resilience_Biomarkers_Aquaculture\Cvirg_Pmarinus_RNAseq\data\rnaseq_samplesheets\samplesheet_rnaseq_dataset5.csv xxx@klone.hyak.uw.edu:/mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets
R script on my computer to fix this sample sheet. I realized after that I over-wrote the .csv file but just moving on to uploading.
library(tidyverse)
fetchNGS_samplehseet <-
read.csv("C:/Users/EmmaStrand/MyProjects/Resilience_Biomarkers_Aquaculture/samplesheet_rnaseq_Cvir_disease_set1.csv")
rnaseq_samplesheet <- fetchNGS_samplehseet %>%
dplyr::select(sample, fastq_1, fastq_2) %>%
mutate(strandedness = "auto")
rnaseq_samplesheet %>%
write.csv("C:/Users/EmmaStrand/MyProjects/Resilience_Biomarkers_Aquaculture/samplesheet_rnaseq_Cvir_disease_set1.csv",
row.names = FALSE)
See notes after 12-4-2024 after how I used a common sbatch script to run multiple datasets.
Nextflow RNAsew workflow (01-rnaseq.sh
):
Start conda environment before running sbatch.
#!/bin/bash
#SBATCH --job-name=rnaseq
#SBATCH --account=srlab
#SBATCH --error=output/"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=output/"%x_output.%j" #once your job is completed, any final job report comments will be put in this file
#SBATCH --partition=cpu-g2-mem2x
#SBATCH --nodes=1
#SBATCH --time=1-20:00:00
#SBATCH --mem=150G
#SBATCH --ntasks=7
#SBATCH --cpus-per-task=2
#SBATCH --chdir=/mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_2024Nov11/scripts
#conda activate nextflow
samplesheet="/mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_2024Nov11/samplesheet_rnaseq_Cvir_disease_set1.csv"
output="/mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_2024Nov11/"
genome="/mmfs1/gscratch/scrubbed/elstrand/genomes/C_virginica/GCF_002022765.2_C_virginica-3.0_genomic.fna.gz"
gff="/mmfs1/gscratch/scrubbed/elstrand/genomes/C_virginica/GCF_002022765.2_C_virginica-3.0_genomic.gff.gz"
gtf="/mmfs1/gscratch/scrubbed/elstrand/genomes/C_virginica/mod_GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz"
nextflow run nf-core/rnaseq \
-resume \
-profile singularity \
--input ${samplesheet} \
--outdir ${output} \
--gtf ${gtf} \
--gff ${gff} \
--fasta ${genome} \
--trimmer fastp \
--extra_fastp_args '--cut_mean_quality 30 --trim_front1 10 --trim_front2 10' \
--aligner star_salmon \
--skip_pseudo_alignment \
--multiqc_title Cvir_disease_2024Nov11 \
--deseq2_vst
2024-11-11 / 2024-11-13
First attempt to sbatch rnaseq.sh above with all four datasets at once. Should we also think about running these separately? UW was doing maintenance so checked output on 11-13. I hadn’t activated the nextflow conda environment so I activated that (conda activate nextflow
) and used sbatch 01-rnaseq.sh
. This started running immediately.
2024-11-14
Run failed with this error:
ERROR ~ Error executing process > 'NFCORE_RNASEQ:PREPARE_GENOME:MAKE_TRANSCRIPTS_FASTA (rsem/GCF_002022765.2_C_virginica-3.0_genomic.fna)'
Caused by:
Process `NFCORE_RNASEQ:PREPARE_GENOME:MAKE_TRANSCRIPTS_FASTA (rsem/GCF_002022765.2_C_virginica-3.0_genomic.fna)` terminated with an error exit status (255)
"rsem-extract-reference-transcripts rsem/genome 0 GCF_002022765.2_C_virginica-3.0_genomic.filtered.gtf None 0 rsem/GCF_002022765.2_C_virginica-3.0_genomic.fna" failed! Plase check if you provide correct parameters/options for the pipeline!
Stop at line : NC_007175.2 RefSeq exon 3430 3495 . + . gene_id ""; transcript_id "unknown_transcript_1"; anticodon "(pos:3461..3463)"; gbkey "tRNA"; product "tRNA-Ile"; exon_number "1";
Error Message: Attribute gene_id's value should be surrounded by double quotes and cannot be empty!
I found Sam’s post on rnaseq workflow. He documented the same issue which is how RSEM handles GTF strand parsing. So this is an issue specific to the gtf file we’re using.
Sam says: “Even though the GFF spec indicates that strand column can be one of +, -, ., or ?, RSEM only parses for + or -. And, as it turns out, our genome GFF has some . for strand info. Looking through our “merged” GenSAS GFF, it turns out there are two sets of annotations that only have . for strand info (GenSAS_5d82b316cd298-trnascan & RNAmmer-1.2). So, the decision needed to be made if we should convert these sets strands to an “artificial” value (e.g. set all of them to +), or eliminate them from the input GFF. I ended up converting GenSAS_5d82b316cd298-trnascan strand to + and eliminated RNAmmer-1.2 from the final input GFF.”
I might have an issue with .
and gene_id “” that are empty? Pause to consult with Shelly. I think I can follow Sam’s notes to convert .
to another character since RSEM only parses +
or -
, but I’m surprised that gene ids are empty here.
2024-11-20
We figured out that my issue is probably only the blank gene ID issue instead of the ‘+’/’.’ issue. Shelly ran the below code and figured out that there are only ‘+’ and ‘-‘ characters in the strand column (column #7).
(base) [strigg@klone-login03 C_virginica]$ zcat GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz | awk -F"\t" '{print $7}
' | sort | uniq -c
5
761431 +
770111 -
To address the blank gene IDs issue, we changed the one gene with several exons to read gene_id=”unknown_transcript_1”
## check that changes worked
check for the changes: zcat GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz | awk -F"\t" '{if($9 ~/gene_id ""/ ) gsub(/gene_id ""/,"gene_id \"unknown_transcript_1\"",$9);print $0}' | grep "unknown_t" | less
## save as new gtf file
zcat GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz | awk -F"\t" '{if($9 ~/gene_id ""/ ) gsub(/gene_id ""/,"gene_id \"unknown_transcript_1\"",$9);print $0}' | gzip > mod_GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz
Changing the path for the gtf file in the script above to this modified version. Started 01-rnaseq.sh
2024-11-25
The first iteration timed out so changed from 1d20h to 4d20h. Started again 11:20 am. Stopped this bc stalled at an early step.
2024-12-2
Trying this again after klone issue worked out. Bumped memory up to 150 GB instead of 100 GB. Started running right away - directory was clean of previous nextflow content beforehand.
2024-12-4
Stopped and separating each dataset.
I made a new structured in /gscratch/scrubbed/elstrand/Cvir_disease_meta
that has samplesheets for each dataset and output directories for each dataset. The scripts folder has the common script and output for each dataset.
(base) [elstrand@klone-login03 Cvir_disease_meta]$ ls
dataset1 dataset2 dataset3 dataset4 samplesheets scripts
(base) [elstrand@klone-login03 samplesheets]$ ls
samplesheet_rnaseq_Cvir_disease_full.csv samplesheet_rnaseq_Cvir_disease_set2.csv samplesheet_rnaseq_Cvir_disease_set4.csv
samplesheet_rnaseq_Cvir_disease_set1.csv samplesheet_rnaseq_Cvir_disease_set3.csv
rnaseq.sh
:
#!/bin/bash
#SBATCH --account=srlab
#SBATCH --error=/gscratch/scrubbed/elstrand/Cvir_disease_meta/scripts/output/"%x_error.%j" #if your job fails, the error report will be put in this file
#SBATCH --output=/gscratch/scrubbed/elstrand/Cvir_disease_meta/scripts/output/"%x_output.%j" #once your job is completed, any final job report comments will be put in this file
#SBATCH --partition=cpu-g2-mem2x
#SBATCH --nodes=1
#SBATCH --time=1-20:00:00
#SBATCH --mem=200G
#SBATCH --ntasks=7
#SBATCH --cpus-per-task=2
#conda activate nextflow
## paths defined in the sbatch command
samplesheet=$1
output=$2
multiqc_title=$3
## paths for all datasets
genome="/gscratch/scrubbed/elstrand/genomes/C_virginica/GCF_002022765.2_C_virginica-3.0_genomic.fna.gz"
gff="/gscratch/scrubbed/elstrand/genomes/C_virginica/GCF_002022765.2_C_virginica-3.0_genomic.gff.gz"
gtf="/gscratch/scrubbed/elstrand/genomes/C_virginica/mod_GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz"
nextflow run nf-core/rnaseq -resume \
-c /gscratch/scrubbed/elstrand/uw_hyak_srlab_estrand.config \
-profile singularity \
--input ${samplesheet} \
--outdir ${output} \
--gtf ${gtf} \
--gff ${gff} \
--fasta ${genome} \
--trimmer fastp \
--extra_fastp_args '--cut_mean_quality 30 --trim_front1 10 --trim_front2 10' \
--aligner star_salmon \
--skip_pseudo_alignment \
--multiqc_title ${multiqc_title} \
--deseq2_vst
To run - change the samplesheet (line 1), output directory (line 2), and multiqc report name (line 3):
If cd into directory and run script from there, the work/ directory that generates go into this dataset directory rather than the scripts general directory.
conda activate nextflow
cd /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset1
sbatch -J Cvir_disease_rnaseq_dataset1 ../scripts/rnaseq.sh \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets/samplesheet_rnaseq_Cvir_disease_set1.csv \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset1 \
Cvir_disease_rnaseq_dataset1
12-4-2024: I started Cvir_disease_rnaseq_dataset1 ~1:45 pm. This ended up stalling so I ditched sbatch and trying to use our shared config that uses chkpt too.
12-5-2024
Starting an interactive node in a screen for dataset1 with config file from UW so it taps into ckpt resources
## start interactive
screen -S rnaseq
## (ctrl-A d to detach)
## screen -r rnaseq to reattach
## grab a node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=150GB --time=2-12:00:00
## start conda environment
conda activate nextflow
## navigate to output
cd /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset1
## run nextflow on dataset 1
nextflow run nf-core/rnaseq -resume \
-c /gscratch/scrubbed/elstrand/uw_hyak_srlab_estrand.config \
--input /gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets/samplesheet_rnaseq_Cvir_disease_set1.csv \
--outdir /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset1 \
--gtf /gscratch/scrubbed/elstrand/genomes/C_virginica/mod_GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz \
--gff /gscratch/scrubbed/elstrand/genomes/C_virginica/GCF_002022765.2_C_virginica-3.0_genomic.gff.gz \
--fasta /gscratch/scrubbed/elstrand/genomes/C_virginica/GCF_002022765.2_C_virginica-3.0_genomic.fna.gz \
--trimmer fastp \
--extra_fastp_args '--cut_mean_quality 30 --trim_front1 10 --trim_front2 10' \
--aligner star_salmon \
--skip_pseudo_alignment \
--multiqc_title Cvir_disease_rnaseq_dataset1 \
--deseq2_vst
Error:
ERROR ~ Error executing process > 'NFCORE_RNASEQ:PREPARE_GENOME:GUNZIP_GTF'
Caused by:
java.io.FileNotFoundException: /gscratch/scrubbed/srlab/.apptainer/.depot.galaxyproject.org-singularity-ubuntu-22.04.img.lock (Permission denied)
### also got the below
ERROR ~ Error executing process > 'NFCORE_RNASEQ:PREPARE_GENOME:GUNZIP_FASTA'
I think this is actually because of a singularity issue rather than gtf or genome file issue. I don’t know why I wouldn’t have ran into this while running an sbatch script though? I tried module load singularity/1.1.5
. Shelly’s scripts don’t use a container so maybe I don’t need this? Update: didn’t fix the issue. I get the same error.
Removed -profile singularity \
and re-ran. Update: This didn’t fix the issue.
I copied the config Shelly was using and changed the directories to be my user instead and that worked!
This still stalled but we figured out that we had the wrong files downloaded with fetchNGS.
Correct files now in gscratch/scrubbed/strigg/analyses/20241205_FetchNGS
12-06-2024
With newly downloaded datasets, trying this again with the new config.
## start interactive
screen -S rnaseq
## (ctrl-A d to detach)
## screen -r rnaseq to reattach
## grab a node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=150GB --time=2-12:00:00
## start conda environment
conda activate nextflow
## navigate to output
cd /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset3
## run nextflow on dataset 3
nextflow run nf-core/rnaseq -resume \
-c /gscratch/scrubbed/elstrand/uw_hyak_srlab_estrand.config \
--input /gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets/samplesheet_rnaseq_dataset3.csv \
--outdir /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset3 \
--gtf /gscratch/scrubbed/elstrand/genomes/C_virginica/mod_GCF_002022765.2_C_virginica-3.0_genomic.gtf.gz \
--gff /gscratch/scrubbed/elstrand/genomes/C_virginica/GCF_002022765.2_C_virginica-3.0_genomic.gff.gz \
--fasta /gscratch/scrubbed/elstrand/genomes/C_virginica/GCF_002022765.2_C_virginica-3.0_genomic.fna.gz \
--trimmer fastp \
--extra_fastp_args '--cut_mean_quality 30 --trim_front1 10 --trim_front2 10' \
--aligner star_salmon \
--skip_pseudo_alignment \
--multiqc_title Cvir_disease_rnaseq_dataset3 \
--deseq2_vst
This finished in 1h 52m 56s 12-6-2024!
Transferring data to Gannet:
rsync --archive --verbose --progress multiqc/star_salmon/multiqc_report.html emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3
rsync --archive --verbose --progress pipeline_info/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3
rsync --archive --verbose --progress pipeline_trace.txt emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3
rsync --archive --verbose --progress pipeline_trace.txt emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3
rsync --archive --verbose --progress star_salmon/deseq2_qc/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3
rsync --archive --verbose --progress log/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3
rsync --archive --verbose --progress *.tsv emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3
I realized that we aren’t including the correct flag to fully analyze novel splice variants / de novo transcripts. Need to add the below flags (come back to filling this in)
--extra_star_align_args ''
Explanation of 2-pass to get novel transcripts:
Running dataset 1 but trying with sbatch script and config file. Added -c /gscratch/scrubbed/elstrand/uw_hyak_srlab_estrand.config
to the sbatch script.
conda activate nextflow
cd /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset1
sbatch -J Cvir_disease_rnaseq_dataset1 ../scripts/rnaseq.sh \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets/samplesheet_rnaseq_dataset1.csv \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset1 \
Cvir_disease_rnaseq_dataset1
This worked for now! It’s running so I will continue to check in on the output file: /gscratch/scrubbed/elstrand/Cvir_disease_meta/scripts/output/Cvir_disease_rnaseq_dataset1_output.22760577
Running dataset 2 with sbatch script with config file.
conda activate nextflow
cd /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset2
sbatch -J Cvir_disease_rnaseq_dataset2 ../scripts/rnaseq.sh \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets/samplesheet_rnaseq_dataset2.csv \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset2 \
Cvir_disease_rnaseq_dataset2
Running dataset 4 with sbatch script with config file.
conda activate nextflow
cd /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset4
sbatch -J Cvir_disease_rnaseq_dataset4 ../scripts/rnaseq.sh \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets/samplesheet_rnaseq_dataset4.csv \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset4 \
Cvir_disease_rnaseq_dataset4
12-9-2024
Datasets 2 and 4 finished but dataset1 was killed b/c of a time limit. restarting this script.
Transferring data to Gannet:
rsync --archive --verbose --progress multiqc/star_salmon/multiqc_report.html emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset2/pipeline_info
rsync --archive --verbose --progress pipeline_info/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset2/pipeline_info
rsync --archive --verbose --progress pipeline_trace.txt emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset2/pipeline_info
rsync --archive --verbose --progress star_salmon/deseq2_qc/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset2/deseq2_qc
rsync --archive --verbose --progress star_salmon/log/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset2/log
rsync --archive --verbose --progress star_salmon/*.tsv emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset2
rsync --archive --verbose --progress multiqc/star_salmon/multiqc_report.html emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset4/pipeline_info
rsync --archive --verbose --progress pipeline_info/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset4/pipeline_info
rsync --archive --verbose --progress pipeline_trace.txt emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset4/pipeline_info
rsync --archive --verbose --progress star_salmon/deseq2_qc/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset4/deseq2_qc
rsync --archive --verbose --progress star_salmon/log/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset4/log
rsync --archive --verbose --progress star_salmon/*.tsv emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset4
Rm dataset1 repo and make new one – wasn’t sure if rnaseq pipeline stopped mid sample b/c of time limit so don’t want to risk it.
conda activate nextflow
cd /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset1
sbatch -J Cvir_disease_rnaseq_dataset1 ../scripts/rnaseq.sh \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets/samplesheet_rnaseq_dataset1.csv \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset1 \
Cvir_disease_rnaseq_dataset1
Re-started at ~1:30 pm after I changed the time limit to 4d-20 hr.. It was 1d-20hr so are these files just really big and need more than 2 days? That would be surprising..
2024-12-11
This ran in 5 hr! Not sure what happened the first time around.
Moving data to Gannet.
rsync --archive --verbose --progress multiqc/star_salmon/multiqc_report.html emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset1/pipeline_info
rsync --archive --verbose --progress pipeline_info/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset1/pipeline_info
rsync --archive --verbose --progress pipeline_trace.txt emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset1/pipeline_info
rsync --archive --verbose --progress star_salmon/deseq2_qc/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset1/deseq2_qc
rsync --archive --verbose --progress star_salmon/log/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset1/log
rsync --archive --verbose --progress star_salmon/*.tsv emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset1
Now moving onto interpretation of this data via differential abundance workflow and/or custom script in R.
Running Cgigas from dataset 3 separately so the reference is Cgigas. I copied the rnaseq script and changed the genome paths.
scp C:\Users\EmmaStrand\MyProjects\Resilience_Biomarkers_Aquaculture\Cvirg_Pmarinus_RNAseq\data\rnaseq_samplesheets\samplesheet_rnaseq_dataset3_Cgigas.csv xxx@klone.hyak.uw.edu:/mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets
genome=”/mmfs1/gscratch/scrubbed/elstrand/genomes/C_gigas/GCF_963853765.1_xbMagGiga1.1_genomic.fna.gz” gff=”/mmfs1/gscratch/scrubbed/elstrand/genomes/C_gigas/GCF_963853765.1_xbMagGiga1.1_genomic.gff.gz” gtf=”/mmfs1/gscratch/scrubbed/elstrand/genomes/C_gigas/GCF_963853765.1_xbMagGiga1.1_genomic.gtf.gz”
conda activate nextflow
cd /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset3_Cgigas
sbatch -J Cvir_disease_rnaseq_dataset3_Cgigas ../scripts/rnaseq_Cgigas.sh \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets/samplesheet_rnaseq_dataset3_Cgigas.csv \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset3_Cgigas \
Cvir_disease_rnaseq_dataset3_Cgigas
2024-12-13
This ran in 16 mintues! That is so fast.. double check on multiqc that everything ran..
rsync --archive --verbose --progress multiqc/star_salmon/multiqc_report.html emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3_Cgigas/pipeline_info
rsync --archive --verbose --progress pipeline_info/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3_Cgigas/pipeline_info
rsync --archive --verbose --progress pipeline_trace.txt emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3_Cgigas/pipeline_info
rsync --archive --verbose --progress star_salmon/deseq2_qc/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3_Cgigas/deseq2_qc
rsync --archive --verbose --progress star_salmon/log/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3_Cgigas/log
rsync --archive --verbose --progress star_salmon/*.tsv emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3_Cgigas
rsync --archive --verbose --progress ../scripts/output/Cvir_disease_rnaseq_dataset3_Cgigas_* emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset3_Cgigas
Steve was getting this warning:
WARNING: you specified the options for cutting by quality, but forogt to enable any of cut_front/cut_tail/cut_right. This will have no effect.
Running Cvir_Prkns_rnaseq_dataset3_Cgigas in srun to see if I also get this warning and it doesn’t show up in my output files..
## start interactive
screen -S rnaseq
## (ctrl-A d to detach)
## screen -r rnaseq to reattach
## screen -X -S [session # you want to kill] kill
## grab a node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=150GB --time=2-12:00:00
## start conda environment
conda activate nextflow
## navigate to output
cd /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset3_Cgigas
## run nextflow on dataset 3
nextflow run nf-core/rnaseq -resume \
-c /gscratch/scrubbed/elstrand/uw_hyak_srlab_estrand.config \
--input /gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets/samplesheet_rnaseq_dataset3_Cgigas.csv \
--outdir /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset3_Cgigas \
--gtf /gscratch/scrubbed/elstrand/genomes/C_gigas/GCF_963853765.1_xbMagGiga1.1_genomic.gtf.gz \
--gff /gscratch/scrubbed/elstrand/genomes/C_gigas/GCF_963853765.1_xbMagGiga1.1_genomic.gff.gz \
--fasta /gscratch/scrubbed/elstrand/genomes/C_gigas/GCF_963853765.1_xbMagGiga1.1_genomic.fna.gz \
--trimmer fastp \
--extra_fastp_args '--cut_mean_quality 30 --trim_front1 10 --trim_front2 10' \
--aligner star_salmon \
--skip_pseudo_alignment \
--multiqc_title Cvir_disease_rnaseq_dataset3_Cgigas \
--deseq2_vst
I still can’t find this warning.. Maybe it’s a version issue?
Checking rnaseq workflow outputs via multiqc
Notes:
- Re-run dataset3 b/c the trimming is too large compared to the length of the sequences (cutting 10 off 50 bp long)
Re-run dataset3 with new parameters
I first changed parameters in the rnaseq file to be trimming defined by sbatch call. Then I moved the Cgigas data, samplesheet and script to its own folder.
Cgigas dataset3:
Only ‘–cut_mean_quality 30’ without the trim parameters
conda activate nextflow
cd /gscratch/scrubbed/elstrand/Cgigas_disease_meta/dataset3_Cgigas
sbatch -J Cgigas_disease_rnaseq_dataset3 ../scripts/rnaseq_Cgigas.sh \
/gscratch/scrubbed/elstrand/Cgigas_disease_meta/samplesheets/samplesheet_rnaseq_dataset3_Cgigas.csv \
/gscratch/scrubbed/elstrand/Cgigas_disease_meta/dataset3_Cgigas \
Cgigas_disease_rnaseq_dataset3
Cvir dataset3:
I then changed the samplesheet on github to reflect Cvir only instead of both and scp C:\Users\EmmaStrand\MyProjects\Resilience_Biomarkers_Aquaculture\Cvirg_Pmarinus_RNAseq\data\rnaseq_samplesheets\samplesheet_rnaseq_dataset3.csv xxx@klone.hyak.uw.edu:/mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets
Only ‘–cut_mean_quality 30’ without the trim parameters
conda activate nextflow
cd \gscratch\scrubbed\elstrand\Cvir_disease_meta\dataset3
sbatch -J Cvir_disease_rnaseq_dataset3 ../scripts/rnaseq_Cvir.sh \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/samplesheets/samplesheet_rnaseq_dataset3.csv \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset3 \
Cvir_disease_rnaseq_dataset3
Running dataset 5
1-28-2025 - 1-29-2025
If re-running this, remove the pipeline_trace.txt file. It won’t re-run if there is already an existing file
conda activate nextflow
cd /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset5/rnaseq
sbatch -J Cvir_disease_rnaseq_dataset5 ../../scripts/rnaseq_Cvir.sh \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset5/rnaseq/samplesheet_rnaseq_dataset5.csv \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset5 \
Cvir_disease_rnaseq_dataset5 \
'--cut_mean_quality 30 --trim_front1 10 --trim_front2 10'
I ran into this error. Figured out that I needed ${fastp} in the script rather than $4 in the actual nextflow code. Changed that and re-ran, but still getting this same error even with ${fastp}..
WARN: The following invalid input values have been detected:
* --cut_mean_quality: 30
* --trim_front1: 10
* --trim_front2: 10
ERROR ~ Validation of pipeline parameters failed!
Moved '--cut_mean_quality 30 --trim_front1 10 --trim_front2 10'
to the slurm script and ran again with this embedded in nextflow.
conda activate nextflow
cd /gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset5/rnaseq
sbatch -J Cvir_disease_rnaseq_dataset5 ../../scripts/rnaseq_Cvir.sh \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset5/samplesheet_rnaseq_dataset5.csv \
/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset5 \
Cvir_disease_rnaseq_dataset5
I’m getting this error: ERROR ~ Validation of pipeline parameters failed!
but no mention of which parameters. This is because I need to modify the gtf file so did that again and re-ran the script. I STILL get the parameters issue. Cannot validate pipeline parameters.. Is the config file wrong? I changed the paths of the .apptainer back to srlab to mimic Shelly’s and only changing the directory at the bottom of the file. Re-ran script. I realized rnaseq/
was in the samplesheet path when it shouldn’t have been. Fixed and re-ran. Still got this issue.. Removed the srlab directories from my config file so it’s all directed to my user on scrubbed. Tried again.
Got this error:
* --input (/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset5/samplesheet_rnaseq_dataset5.csv): Validation of file failed:
-> Entry 1: Error for field 'fastq_2' (/mmfs1/gscratch/scrubbed/elstrand/Cvir_disease_meta/dataset5/fastq/SRX7172518_SRR10482877_2.fastq.gz): FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'
I had changed the fastq folder to raw_data but fastq is in the path in the samplesheet. So I changed that folder back. Re-ran again and it worked!!
This output was in dataset5 and then afterwards I moved this to rnaseq. This dataset took 3 hours to complete.
Moving to gannet:
rsync --archive --verbose --progress multiqc/star_salmon/multiqc_report.html emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset5/pipeline_info
rsync --archive --verbose --progress pipeline_info/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset5/pipeline_info
rsync --archive --verbose --progress pipeline_trace.txt emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset5/pipeline_info
rsync --archive --verbose --progress star_salmon/deseq2_qc/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset5/deseq2_qc
rsync --archive --verbose --progress star_salmon/log/ emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset5/log
cd \gscratch\scrubbed\elstrand\Cvir_disease_meta\dataset5\rnaseq\star_salmon
rsync --archive --verbose --progress *.tsv emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset5
cd \gscratch\scrubbed\elstrand\Cvir_disease_meta\scripts\output
rsync --archive --verbose --progress * emma.strand@gannet.fish.washington.edu:/volume2/web/emma.strand/rnaseq/Cvir_Prkns_rnaseq_dataset5
Moved genomes, scripts, and config file to /gscratch/srlab/elstrand