11-24-2024

The goal here was to replicate the use of fetchngs that Shelly Wanamaker documented in another notebook entry, with input being the ids.csv file that she created manually from Roberto’s C. gigas thermotolerance study.

Method

I used Seqera’s fetchngs pipeline. I noted during the launch steps that it provides a strandedness value of auto by default, recalling that a missing strandedness field had been a detail to overcome in the rnaseq work that I hope to replicate. I also noted with encouragement that the default sample_mapping_fields value seemed to match the headings in the samplesheet.csv that Wanamaker’s task produced.

Generated parameters

The following are the parameters generated by the Seqera workflow:

{
    "custom_config_base": "https://raw.githubusercontent.com/nf-core/configs/master",
    "validation-skip-duplicate-check": false,
    "sample_mapping_fields": "experiment_accession,run_accession,sample_accession,experiment_alias,run_alias,sample_alias,experiment_title,sample_title,sample_description",
    "monochrome-logs": false,
    "validation-S3Path-check": false,
    "plaintext_email": false,
    "nf_core_rnaseq_strandedness": "auto",
    "monochrome_logs": false,
    "max_cpus": 16,
    "custom_config_version": "master",
    "max_memory": "128.GB",
    "validationFailUnrecognisedParams": false,
    "monochromeLogs": false,
    "max_time": "240.h",
    "validate_params": true,
    "validationShowHiddenParams": false,
    "validationLenientMode": false,
    "version": false,
    "outdir": "s3://steveyost-seqera",
    "publish_dir_mode": "copy",
    "validationS3PathCheck": false,
    "input": "https://raw.githubusercontent.com/Resilience-Biomarkers-for-Aquaculture/Cgigas_denovotranscript/refs/heads/main/data/20240925/ids.csv",
    "download_method": "ftp",
    "help": false,
    "validation-fail-unrecognised-params": false,
    "validationSkipDuplicateCheck": false,
    "validation-show-hidden-params": false,
    "validationSchemaIgnoreParams": "",
    "validation-lenient-mode": false,
    "force_sratools_download": false,
    "validation-schema-ignore-params": "",
    "skip_fastq_download": false
}

Result

The pipeline run succeeded, producing files in the following output paths in the designated AWS S3 bucket:

  • samplesheet: samplesheet.csv, id_mappings.csv, and multiqc_config.yml.
  • fastq: 72 fastq.gz files, 2 for each sample, and an md5 directory with a checksum file for each fastq.gz file.
  • metadata: 36 runinfo_ftp.tsv files, one for each sample.
  • pipeline_info: several files

Wall time for the run was 7 m 53 s, with 2.0 CPU hours, and $0.056 estimated cost.

I noted that during the pipeline run’s NFCORE_FETCHNGS:SRA:SRA_FASTQ_FTP task, one wget operation failed but then succeeded on the automatic retry.

Missing strandedness column

I noted that the output samplesheet.csv didn’t have a strandedness column, despite the default value of auto noted above. Reading the documentation, I saw that the --nf_core_rnaseq_strandedness parameter applies only when the --nf_core_pipeline parameter’s value is rnaseq. I believe we do want to produce a sample sheet that can be used as direct input to rnaseq, so I ran the pipeline again, this time also with --nf_core_pipeline rnaseq. (Previously I had not set a value in the nf_core_pipeline pulldown on the Launch Run parameters page.) I also used the skip_fastq_download option, since we needed only a new sample sheet as output. This run succeeded, overwriting the files in the S3 samplesheet path, and the samplesheet.csv file now has a strandedness column with all values set to auto. Wall time was 4 m 43 s, with 0.1 CPU hours, and $0.002 estimated cost.
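A check like this on the strandedness column can be scripted. The following is a minimal sketch, not something that was part of these runs: the function name strandedness_values is hypothetical, the rows are made-up toy data rather than the real sample sheet, and it assumes the sheet has been read into a string (for example after copying it down from S3).

```python
import csv
import io


def strandedness_values(csv_text):
    """Return the set of values found in the 'strandedness' column,
    or None if the sheet has no such column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "strandedness" not in reader.fieldnames:
        return None
    return {row["strandedness"] for row in reader}


# Toy rows for illustration only (not the real sample sheet).
example = (
    '"sample","fastq_1","fastq_2","strandedness"\n'
    '"S1","a_1.fastq.gz","a_2.fastq.gz","auto"\n'
    '"S2","b_1.fastq.gz","b_2.fastq.gz","auto"\n'
)
print(strandedness_values(example))
```

A return value of None would correspond to the first run's sheet (no column at all), while a single-element set containing auto would correspond to the second.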

Below are the parameters used given the above changes. Note the "nf_core_pipeline": "rnaseq" entry, and that "skip_fastq_download" is now true:

{
    "custom_config_base": "https://raw.githubusercontent.com/nf-core/configs/master",
    "validation-skip-duplicate-check": false,
    "sample_mapping_fields": "experiment_accession,run_accession,sample_accession,experiment_alias,run_alias,sample_alias,experiment_title,sample_title,sample_description",
    "monochrome-logs": false,
    "validation-S3Path-check": false,
    "plaintext_email": false,
    "nf_core_rnaseq_strandedness": "auto",
    "monochrome_logs": false,
    "max_cpus": 16,
    "custom_config_version": "master",
    "max_memory": "128.GB",
    "validationFailUnrecognisedParams": false,
    "monochromeLogs": false,
    "nf_core_pipeline": "rnaseq",
    "max_time": "240.h",
    "validate_params": true,
    "validationShowHiddenParams": false,
    "validationLenientMode": false,
    "version": false,
    "outdir": "s3://steveyost-seqera",
    "publish_dir_mode": "copy",
    "validationS3PathCheck": false,
    "input": "https://raw.githubusercontent.com/Resilience-Biomarkers-for-Aquaculture/Cgigas_denovotranscript/refs/heads/main/data/20240925/ids.csv",
    "download_method": "ftp",
    "help": false,
    "validation-fail-unrecognised-params": false,
    "validationSkipDuplicateCheck": false,
    "validation-show-hidden-params": false,
    "validationSchemaIgnoreParams": "",
    "validation-lenient-mode": false,
    "force_sratools_download": false,
    "validation-schema-ignore-params": "",
    "skip_fastq_download": true
}
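Because the two parameter dumps are long and nearly identical, comparing them programmatically makes the differences easier to see. The sketch below is an illustration under assumptions: diff_params is a hypothetical helper, and the two dictionaries are abbreviated to a few keys taken from the dumps above rather than the full sets.

```python
import json


def diff_params(old, new):
    """Return {key: (old_value, new_value)} for every key that was added,
    removed, or changed. An absent key is reported as None."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Abbreviated copies of the two parameter sets (only a few keys shown).
first_run = json.loads('{"download_method": "ftp", "skip_fastq_download": false}')
second_run = json.loads(
    '{"download_method": "ftp", "nf_core_pipeline": "rnaseq", "skip_fastq_download": true}'
)

for key, (before, after) in diff_params(first_run, second_run).items():
    print(f"{key}: {before!r} -> {after!r}")
```

With the full dumps saved to files, the same function could be applied to the result of json.load on each file.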

Error downstream

When I tried to run rnaseq using the output samplesheet.csv, it produced errors like the following:

--input (s3://steveyost-seqera/samplesheet/samplesheet.csv): Validation of file failed:
	-> Entry 1: Error for field 'fastq_2' (ftp.sra.ebi.ac.uk/vol1/fastq/SRR885/005/SRR8856685/SRR8856685_2.fastq.gz): FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'
	-> Entry 1: Error for field 'fastq_1' (ftp.sra.ebi.ac.uk/vol1/fastq/SRR885/005/SRR8856685/SRR8856685_1.fastq.gz): FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'

I realized that when fetchngs is run with the "skip_fastq_download": true option, it writes a samplesheet.csv that refers to the original remote fastq.gz files. The error message didn’t seem accurate (the filename does match the required regex ^\S+\.f(ast)?q\.gz$), but I wondered whether the lack of a protocol (e.g. ftp://) in the URI prefix was the problem. So I did another run without the "skip_fastq_download": true option, causing a re-fetch of all the fastq files.
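The claim that the filename matches the regex is easy to confirm. This is a small sketch using Python's re module with the pattern and the fastq_1 value quoted in the error message above; the helper name looks_like_fastq is hypothetical, and the check says nothing about how rnaseq itself applies the pattern.

```python
import re

# Pattern quoted above: no whitespace, ending in .fq.gz or .fastq.gz
FASTQ_PATTERN = re.compile(r"^\S+\.f(ast)?q\.gz$")


def looks_like_fastq(value):
    """Return True if the value satisfies the quoted pattern."""
    return FASTQ_PATTERN.match(value) is not None


# The fastq_1 value from the error message, with no protocol prefix.
reported = "ftp.sra.ebi.ac.uk/vol1/fastq/SRR885/005/SRR8856685/SRR8856685_1.fastq.gz"

print(looks_like_fastq(reported))
print(looks_like_fastq("name with space.fastq.gz"))
```

The first call reports a match and the second does not, which is consistent with the observation that the pattern itself was not what rejected the entry.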

No, that wasn’t it. I think that, surprisingly, quotes are not allowed, and in fact the constraints are listed here, which also means the sheet must have only four columns. So I had ChatGPT write a Python script to convert samplesheet.csv to conform.
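The script itself isn't reproduced in this entry, so the following is only a sketch of what such a conversion might look like, not the script that was actually used. It assumes the four target columns are sample, fastq_1, fastq_2, and strandedness, and that the fetchngs sheet already contains columns with those names; convert_samplesheet is a hypothetical name and the rows are toy data.

```python
import csv
import io

# Assumed target columns for the rnaseq sample sheet.
RNASEQ_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def convert_samplesheet(csv_text):
    """Keep only the four assumed columns and write them back without
    quoting (quotes are added only if a value contains a comma or quote).
    Raises KeyError if an expected column is missing from the input."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(RNASEQ_COLUMNS)
    for row in reader:
        writer.writerow([row[col] for col in RNASEQ_COLUMNS])
    return out.getvalue()


# Toy input: fully quoted, with one extra metadata column.
source = (
    '"sample","fastq_1","fastq_2","strandedness","run_accession"\n'
    '"S1","a_1.fastq.gz","a_2.fastq.gz","auto","SRR0000001"\n'
)
print(convert_samplesheet(source), end="")
```

For a real file, the text could be read with open(path, newline="") and the result written to a new samplesheet.csv before uploading it for the rnaseq run.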