11-24-2024
The goal here was to replicate the use of fetchngs
that Shelly Wanamaker documented in another notebook entry, with input being the ids.csv
file that she created manually from Roberto’s C. gigas thermotolerance study.
Method
I used Seqera’s fetchngs pipeline. I noted during the launch steps that it provides a strandedness value of auto by default, recalling that a missing strandedness field had been a detail to overcome in the rnaseq work that I hope to replicate. I was also encouraged that the default sample_mapping_fields value seemed to match the headings in the samplesheet.csv that Wanamaker’s task produced.
Generated parameters
The following are the parameters generated by the Seqera workflow:
{
"custom_config_base": "https://raw.githubusercontent.com/nf-core/configs/master",
"validation-skip-duplicate-check": false,
"sample_mapping_fields": "experiment_accession,run_accession,sample_accession,experiment_alias,run_alias,sample_alias,experiment_title,sample_title,sample_description",
"monochrome-logs": false,
"validation-S3Path-check": false,
"plaintext_email": false,
"nf_core_rnaseq_strandedness": "auto",
"monochrome_logs": false,
"max_cpus": 16,
"custom_config_version": "master",
"max_memory": "128.GB",
"validationFailUnrecognisedParams": false,
"monochromeLogs": false,
"max_time": "240.h",
"validate_params": true,
"validationShowHiddenParams": false,
"validationLenientMode": false,
"version": false,
"outdir": "s3://steveyost-seqera",
"publish_dir_mode": "copy",
"validationS3PathCheck": false,
"input": "https://raw.githubusercontent.com/Resilience-Biomarkers-for-Aquaculture/Cgigas_denovotranscript/refs/heads/main/data/20240925/ids.csv",
"download_method": "ftp",
"help": false,
"validation-fail-unrecognised-params": false,
"validationSkipDuplicateCheck": false,
"validation-show-hidden-params": false,
"validationSchemaIgnoreParams": "",
"validation-lenient-mode": false,
"force_sratools_download": false,
"validation-schema-ignore-params": "",
"skip_fastq_download": false
}
Result
The pipeline run succeeded, producing files under the following output paths in the designated AWS S3 bucket:

samplesheet: samplesheet.csv, id_mappings.csv, and multiqc_config.yml.
fastq: 72 fastq.gz files, 2 for each sample, plus an md5 directory with a checksum file for each fastq.gz file.
metadata: 36 runinfo_ftp.tsv files, one for each sample.
pipeline_info: several files.
Wall time for the run was 7 m 53 s, with 2.0 CPU hours and an estimated cost of $0.056.
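Since each fastq.gz comes with a checksum file in the md5 directory, a local spot-check could look like the sketch below. The sidecar layout ("HASH  FILENAME", md5sum’s conventional output) and the example filenames are assumptions on my part, not something I verified against the pipeline docs:

```python
import hashlib

def md5_matches(fastq_path, md5_path):
    """Compare a local fastq.gz against its published .md5 sidecar.

    Assumes the sidecar follows md5sum's usual 'HASH  FILENAME' layout.
    """
    h = hashlib.md5()
    with open(fastq_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    expected = open(md5_path).read().split()[0]
    return h.hexdigest() == expected
```

After syncing a pair of files locally, something like md5_matches("SRR8856685_1.fastq.gz", "SRR8856685_1.fastq.gz.md5") would confirm the download.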
I noted that during the pipeline run’s NFCORE_FETCHNGS:SRA:SRA_FASTQ_FTP task, one wget operation failed, but succeeded upon the automatic retry.
Missing strandedness column
I noted that the output samplesheet.csv didn’t have a strandedness column despite the assurance above. Reading the documentation, I saw that the --nf_core_rnaseq_strandedness parameter applies only when the --nf_core_pipeline parameter’s value is rnaseq. I believe we do want to produce a sample sheet that can be used as direct input to rnaseq, so I ran the pipeline again, this time also with --nf_core_pipeline rnaseq. (I previously had not set a value in the nf_core_pipeline pulldown on the Launch Run parameters page.) This time I used the skip_fastq_download option, since we only needed a new sample sheet output. This succeeded, overwriting the files in the S3 samplesheet path, and the samplesheet.csv file now has a strandedness column with all values set to auto. Wall time was 4 m 43 s, with 0.1 CPU hours and an estimated cost of $0.002.
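A quick way to confirm that claim is to collect the distinct strandedness values from the new samplesheet. This is a hypothetical helper of mine, not part of either pipeline:

```python
import csv

def strandedness_values(path):
    """Return the set of distinct values in a samplesheet's strandedness column."""
    with open(path, newline="") as f:
        return {row.get("strandedness") for row in csv.DictReader(f)}
```

For the re-run above I would expect strandedness_values("samplesheet.csv") to return {"auto"}.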
Below are the parameters used given the above changes; note the "nf_core_pipeline": "rnaseq" entry:
{
"custom_config_base": "https://raw.githubusercontent.com/nf-core/configs/master",
"validation-skip-duplicate-check": false,
"sample_mapping_fields": "experiment_accession,run_accession,sample_accession,experiment_alias,run_alias,sample_alias,experiment_title,sample_title,sample_description",
"monochrome-logs": false,
"validation-S3Path-check": false,
"plaintext_email": false,
"nf_core_rnaseq_strandedness": "auto",
"monochrome_logs": false,
"max_cpus": 16,
"custom_config_version": "master",
"max_memory": "128.GB",
"validationFailUnrecognisedParams": false,
"monochromeLogs": false,
"nf_core_pipeline": "rnaseq",
"max_time": "240.h",
"validate_params": true,
"validationShowHiddenParams": false,
"validationLenientMode": false,
"version": false,
"outdir": "s3://steveyost-seqera",
"publish_dir_mode": "copy",
"validationS3PathCheck": false,
"input": "https://raw.githubusercontent.com/Resilience-Biomarkers-for-Aquaculture/Cgigas_denovotranscript/refs/heads/main/data/20240925/ids.csv",
"download_method": "ftp",
"help": false,
"validation-fail-unrecognised-params": false,
"validationSkipDuplicateCheck": false,
"validation-show-hidden-params": false,
"validationSchemaIgnoreParams": "",
"validation-lenient-mode": false,
"force_sratools_download": false,
"validation-schema-ignore-params": "",
"skip_fastq_download": true
}
Error downstream
When I tried to run rnaseq using the output samplesheet.csv, it produced errors like the following:
--input (s3://steveyost-seqera/samplesheet/samplesheet.csv): Validation of file failed:
-> Entry 1: Error for field 'fastq_2' (ftp.sra.ebi.ac.uk/vol1/fastq/SRR885/005/SRR8856685/SRR8856685_2.fastq.gz): FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'
-> Entry 1: Error for field 'fastq_1' (ftp.sra.ebi.ac.uk/vol1/fastq/SRR885/005/SRR8856685/SRR8856685_1.fastq.gz): FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'
I realized that when running fetchngs with the "skip_fastq_download": true option, it writes a samplesheet.csv that refers to the original remote fastq.gz files. While the error message didn’t seem accurate (indeed the filename does match the required regex ^\S+\.f(ast)?q\.gz$), I wondered whether the lack of a protocol prefix (e.g. ftp://) in the URI was the problem, so I did another run without the "skip_fastq_download": true option, causing a re-fetch of all the fastq files.
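As a sanity check on that regex claim, the pattern can be exercised directly. The wrinkle (my hypothesis, not confirmed from the pipeline source) is that a CSV field still wrapped in double quotes fails the match even though the bare path passes:

```python
import re

# The extension check quoted in rnaseq's error message.
PATTERN = re.compile(r"^\S+\.f(ast)?q\.gz$")

# The bare path from the error does satisfy the pattern...
path = "ftp.sra.ebi.ac.uk/vol1/fastq/SRR885/005/SRR8856685/SRR8856685_2.fastq.gz"
print(bool(PATTERN.match(path)))        # True

# ...but the same path wrapped in CSV double quotes does not,
# because the trailing quote breaks the .gz$ anchor.
print(bool(PATTERN.match(f'"{path}"'))) # False
```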
No, that wasn’t it. I think that, surprisingly, quoted fields are not allowed; in fact the constraints are listed here, and the sheet also must have only four columns. So I had ChatGPT write a Python script to convert samplesheet.csv to conform.
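The script was along these lines: a minimal sketch that keeps only the four columns rnaseq expects (column names assumed to match the fetchngs output) and lets Python’s csv writer emit them unquoted:

```python
import csv

# Columns the rnaseq samplesheet is expected to contain (assumed names).
KEEP = ["sample", "fastq_1", "fastq_2", "strandedness"]

def convert(src_path, dst_path):
    """Trim a fetchngs samplesheet down to rnaseq's four columns, unquoted."""
    with open(src_path, newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)          # handles the quoted input fields
        writer = csv.writer(dst)              # minimal quoting: plain paths stay bare
        writer.writerow(KEEP)
        for row in reader:
            writer.writerow([row[c] for c in KEEP])
```

DictReader strips the surrounding quotes on read, and the default QUOTE_MINIMAL writer only re-adds them for fields containing commas or quotes, so ordinary paths come out bare.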