Data needed:
NOTE: the fastq files linked below were incorrect and contained errors. SJW reprocessed the reads to correct these errors on in November 2024.
- Fastqs:
- Genome: https://gannet.fish.washington.edu/seashell/bu-github/ceasmallr/data/genome/Cvirginica_v300.fa
- samplesheet:
- Correct samplesheet: Cvirg_methylseq/20250506_methylseq/samplesheet.csv
- INCORRECT samplesheet: Cvirg_methylseq/samplesheet.csv
Methods
Step 1. Change paths in samplesheet.csv
# make new analysis directory and change into it
### analysis with correct reads ###
mkdir /gscratch/scrubbed/strigg/analyses/20250506_methylseq
# create samplesheet with paths of correct reads
cat /gscratch/scrubbed/strigg/analyses/20250502_methylseq/data/00.00-trimming-fastp/fastq_pairs.txt |\
awk -F"_" '{print $1",""/gscratch/scrubbed/strigg/analyses/20250502_methylseq/data/00.00-trimming-fastp/"$0}' |\ # print sample name without suffix, print a comma then the file path, then the file name
awk '{print $1}' | \ # only print the sample name and file path
awk -F"," '{print $1","$2","$2}' | \ # print file path in column 2 and 3
awk 'BEGIN{FS=OFS=","}{gsub(/R1/, "R2", $3)}1'| \ #sub R2 for R1 in column 3
awk 'BEGIN{print "sample,fastq_1,fastq_2"}1 \ #add header
> samplesheet.csv
### analysis with incorrect reads ###
mkdir 20250502_methylseq
cd /gscratch/scrubbed/strigg/analyses/20250502_methylseq
# copy samplesheet to klone
wget https://gannet.fish.washington.edu/metacarcinus/USDA_MetaOmics/Cvirg_methylseq/samplesheet.csv
# change paths of fastqs in samplesheet
sed 's/\/gscratch\/scrubbed\/sr320\/github\/ceasmallr\/data/\/gscratch\/scrubbed\/strigg\/analyses\/20250502_methylseq\/data/g' samplesheet.csv > ss.csv
#rename samplesheet
mv ss.csv samplesheet.csv
Step 2. Copy input data (fastqs and genome) to analysis directory
#get genome file
wget https://gannet.fish.washington.edu/seashell/bu-github/ceasmallr/data/genome/Cvirginica_v300.fa
### correctly trimmed reads ###
#copy reads to klone from gannet
rsync --verbose --progress --archive shellytrigg@gannet.fish.washington.edu:/volume2/web/gitrepos/ceasmallr/output/00.00-trimming-fastp .
### incorrect reads ###
rsync --verbose --progress --archive shellytrigg@gannet.fish.washington.edu:/volume2/web/seashell/bu-github/ceasmallr/data .
Step 3. Run methylseq pipeline
# start screen session; I already had one started so rejoined it
screen -r methylseq
# start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=50GB --time=2-00:00:00
# activate nextflow environment
mamba activate nextflow
# run methylseq pipeline from /gscratch/scrubbed/strigg/analyses/20250506_methylseq
nextflow run nf-core/methylseq \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
--input /gscratch/scrubbed/strigg/analyses/20250506_methylseq/samplesheet.csv \
--outdir /gscratch/scrubbed/strigg/analyses/20250506_methylseq \
--fasta /gscratch/scrubbed/strigg/analyses/20250502_methylseq/data/genome/Cvirginica_v300.fa \
-resume \
-with-report nf_report.html \
-with-trace \
-with-timeline nf_timeline.html \
--skip_trimming \
--nomeseq
### pipeline with incorrect reads ###
# ran in /gscratch/scrubbed/strigg/analyses/20250502_methylseq
nextflow run nf-core/methylseq \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
--input /gscratch/scrubbed/strigg/analyses/20250502_methylseq/samplesheet.csv \
--outdir /gscratch/scrubbed/strigg/analyses/20250502_methylseq \
--fasta /gscratch/scrubbed/strigg/analyses/20250502_methylseq/data/genome/Cvirginica_v300.fa \
-resume \
-with-report nf_report.html \
-with-trace \
-with-timeline nf_timeline.html \
--skip_trimming \
--nomeseq
this same error occured twice so will try pipeline without preseq
ERROR ~ Error executing process > 'NFCORE_METHYLSEQ:METHYLSEQ:PRESEQ_LCEXTRAP (CF01-CM01-Zygote)'
Caused by:
Process `NFCORE_METHYLSEQ:METHYLSEQ:PRESEQ_LCEXTRAP (CF01-CM01-Zygote)` terminated with an error exit status (1)
Command executed:
preseq \
lc_extrap \
\
-pe \
-output CF01-CM01-Zygote.lc_extrap.txt \
CF01-CM01-Zygote.sorted.bam
cp .command.err CF01-CM01-Zygote.command.log
cat <<-END_VERSIONS > versions.yml
"NFCORE_METHYLSEQ:METHYLSEQ:PRESEQ_LCEXTRAP":
preseq: $(echo $(preseq 2>&1) | sed 's/^.*Version: //; s/Usage:.*$//')
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
INFO: Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO: Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
INFO: Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
ERROR: max count before zero is less than min required count (4) duplicates removed
Work dir:
/mmfs1/gscratch/scrubbed/strigg/analyses/20250506_methylseq/work/e5/64279ad453959ee490b0430a9eb77e
Container:
/gscratch/scrubbed/srlab/.apptainer/depot.galaxyproject.org-singularity-preseq-3.2.0--hdcf5f25_6.img
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
-- Check '.nextflow.log' file for details
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting
nextflow run nf-core/methylseq \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
--input /gscratch/scrubbed/strigg/analyses/20250506_methylseq/samplesheet.csv \
--outdir /gscratch/scrubbed/strigg/analyses/20250506_methylseq \
--fasta /gscratch/scrubbed/strigg/analyses/20250502_methylseq/data/genome/Cvirginica_v300.fa \
-resume \
-with-report nf_report.html \
-with-trace \
-with-timeline nf_timeline.html \
--skip_trimming \
--nomeseq \
--skip_preseq
that didn’t work so i added the preseq ignore option to my config file as described here: https://github.com/nf-core/methylseq/issues/161
process {
errorStrategy = 'retry'
maxSubmitAwait = '60 min'
maxRetries = 2
executor = 'slurm'
queue = { task.attempt < 4 ? (task.attempt < 3 ? 'ckpt' : 'cpu-g2' ) : 'cpu-g2-mem2x' }
clusterOptions = { task.attempt < 4 ? (task.attempt < 3 ? "-A srlab" : "-A coenv" ) : "-A srlab" }
scratch = '/gscratch/scrubbed/srlab/'
resourceLimits = [
cpus: 16,
memory: '150.GB',
time: '72.h'
]
withName:preseq {
errorStrategy = 'ignore'
}
withName: BISMARK_ALIGN {
ext.args = {
[
Adding that parameter did not stop preseq from running unfortunately. I tried updating nextflow with nextflow self-update
and running the pipeline again
now I got the error terminated with an error exit status (141)
will try running whole pipeline in a new dir tomorrow.
2025-05-08
# start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=50GB --time=2-00:00:00
# activate nextflow environment
mamba activate nextflow
# run methylseq pipeline from /gscratch/scrubbed/strigg/analyses/20250508_methylseq
nextflow run nf-core/methylseq \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
--input /gscratch/scrubbed/strigg/analyses/20250508_methylseq/samplesheet.csv \
--outdir /gscratch/scrubbed/strigg/analyses/20250508_methylseq \
--fasta /gscratch/scrubbed/strigg/analyses/20250502_methylseq/data/genome/Cvirginica_v300.fa \
-resume \
-with-report nf_report.html \
-with-trace \
-with-timeline nf_timeline.html \
--skip_trimming \
--nomeseq
got another error about job submission. there is still something weird going on because no ckpt jobs are running and hyakalloc shows thousands of idle cpus.
I attempted to update the methylseq container because I was getting a message that the was a more recent version available.
I ran nextflow pull https://github.com/nf-core/methylseq
.
I no longer get the note about the container but I am still getting errors every time the pipeline tries to submit to ckpt.
#pipeline trace output for one sample
15 90/f379f5 NFCORE_METHYLSEQ:METHYLSEQ:FASTQC (CF08-CM04-Larvae) FAILED 1 - - - - - ckpt-all - -
48 d8/ee9d82 NFCORE_METHYLSEQ:METHYLSEQ:FASTQC (CF08-CM04-Larvae) FAILED 1 - - - - - ckpt-all - -
90/f379f5 .command.run sbatch script header
#!/bin/bash
#SBATCH -J nf-NFCORE_METHYLSEQ_METHYLSEQ_FASTQC_(CF08-CM04-Larvae)
#SBATCH -o /mmfs1/gscratch/scrubbed/strigg/analyses/20250508_methylseq/work/90/f379f5ddf821f9fba8b08d4a8baf91/.command.log
#SBATCH --no-requeue
#SBATCH --signal B:USR2@30
#SBATCH -c 6
#SBATCH -t 08:00:00
#SBATCH --mem 36864M
#SBATCH -p ckpt-all
#SBATCH -A srlab
[d8/ee9d82] .command.run sbatch script header
#!/bin/bash
#SBATCH -J nf-NFCORE_METHYLSEQ_METHYLSEQ_FASTQC_(CF08-CM04-Larvae)
#SBATCH -o /mmfs1/gscratch/scrubbed/strigg/analyses/20250508_methylseq/work/d8/ee9d8285ef39131054efec88d2535e/.command.log
#SBATCH --no-requeue
#SBATCH --signal B:USR2@30
#SBATCH -c 12
#SBATCH -t 16:00:00
#SBATCH --mem 73728M
#SBATCH -p ckpt-all
#SBATCH -A srlab
[d8/ee9d82] NOTE: Error submitting process 'NFCORE_METHYLSEQ:METHYLSEQ:FASTQC (CF08-CM04-Larvae)' for execution -- Execution is retried (
2)
I figured out the problem by going into the work directory with where the error occured and tried running the sbatch script by running sbatch .command.run
. This gave the following error:
sbatch: error: When submitting a Checkpoint job, you cannot specify --no-requeue
sbatch: error: Batch job submission failed: Unspecified error
You can see in the headers of the sbatch scripts they have a --norequeue
parameter. Interestingly this was not a problem on 2025-04-22 when I ran the methylseq pipeline and it completely all jobs submitted to ckpt. There must have been an update to the ckpt partition that is now causing this to happen.
2025-05-09
Modified the .config file to get around the --norequeue
default as suggested by Carson Miller:
A quick “hack” to workaround this for the time being would be to do include the following in your nextflow.config
.
process {
clusterOptions = { "--requeue" }
errorStrategy = { task.exitStatus in ((130..145)) ? 'ignore' : 'retry' }
}
Basically this will include --requeue
in the SLURM submission so that you can still use checkpoint queue. The caveat is that if a job is pre-empted, Nextflow will think that the job failed and will try to create a new job, but the SLURM scheduler will also resubmit the job (even though this won’t be tracked by Nextflow), so what you’ll end up with is duplication of jobs and a potentially very large overhead. So I added the errorStrategy line so that Nextflow will ignore pre-emptions, with the idea that I will just resume the pipeline to clean up tasks that were pre-empted.
Edited .config file:
params {
config_profile_description = 'UW Hyak Roberts labs cluster profile provided by nf-core/configs.'
config_profile_contact = 'Shelly A. Wanamaker @shellywanamaker'
config_profile_url = 'https://faculty.washington.edu/sr320/'
}
process {
errorStrategy = {task.exitStatus in ((130..145)) ? 'ignore' : 'retry'}
maxSubmitAwait = '60 min'
maxRetries = 2
executor = 'slurm'
queue = { task.attempt < 4 ? (task.attempt < 3 ? 'ckpt-all' : 'cpu-g2' ) : 'cpu-g2-mem2x' }
clusterOptions = { task.attempt < 4 ? (task.attempt < 3 ? "-A srlab --requeue" : "-A coenv --requeue" ) : "-A srlab --requeue" }
scratch = '/gscratch/scrubbed/srlab/'
resourceLimits = [
cpus: 16,
memory: '150.GB',
time: '72.h'
]
withName:preseq {
errorStrategy = 'ignore'
}
withName: BISMARK_ALIGN {
ext.args = {
[
'--score_min L,0,-0.8',
'--non_directional'
].join(' ')
}
ext.prefix = { "${meta.id}" }
}
withName: BISMARK_METHYLATIONEXTRACTOR {
ext.args = {
[
'--merge_non_CpG',
'--comprehensive',
'--multicore 48',
'--buffer_size 75%'
].join(' ')
}
ext.prefix = { "${meta.id}" }
}
withName: BISMARK_COVERAGE2CYTOSINE {
ext.args = {
[
'--merge_CpG',
'--zero_based'
].join(' ')
}
ext.prefix = { "${meta.id}" }
}
}
executor {
queuesize = 50
submitRateLimit = '1 sec'
}
singularity {
enabled = true
autoMounts = true
cacheDir = '/gscratch/scrubbed/srlab/.apptainer'
}
trace {
enabled = true
file = 'pipeline_trace.txt'
fields = 'task_id,hash,name,status,exit,submit,duration,realtime,%cpu,rss,vmem,rchar,wchar,disk,queue,hostname,cpu_model'
}
debug {
cleanup = false
}
Re-ran the pipeline in a new directory
# start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=50GB --time=2-00:00:00
# activate nextflow environment
mamba activate nextflow
# run methylseq pipeline from /gscratch/scrubbed/strigg/analyses/20250509_methylseq
nextflow run nf-core/methylseq \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
--input /gscratch/scrubbed/strigg/analyses/20250509_methylseq/samplesheet.csv \
--outdir /gscratch/scrubbed/strigg/analyses/20250509_methylseq \
--fasta /gscratch/scrubbed/strigg/analyses/20250502_methylseq/data/genome/Cvirginica_v300.fa \
-resume \
-with-report nf_report.html \
-with-trace \
-with-timeline nf_timeline.html \
--skip_trimming \
--nomeseq
Results
Pipeline initally failed because the fastq files had errors as noted in Sam’s November 2024 notebook entry.
Command error:
INFO: Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO: Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
INFO: Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
Failed to process file EF07-EM01-Zygote_2.gz
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Ran out of data in the middle of a fastq entry. Your file is probably truncated
at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:187)
at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:129)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:77)
at java.base/java.lang.Thread.run(Thread.java:833)
- See nf_report.html