RNAseq workflow with de novo transcriptome assembly

This is a description of the nf-core/denovotranscript pipeline for de novo transcriptome assembly of paired-end short reads from bulk RNA-seq.

nf-core/transfuse metro map

Read QC of raw reads (FastQC)
Adapter and quality trimming (fastp)
Read QC of trimmed reads (FastQC)
Remove rRNA or mitochondrial DNA (optional) (SortMeRNA)
Transcriptome assembly using any combination of the following:
- Trinity with normalised reads (default=True)
  - normalisation evens out the very non-uniform coverage of RNAseq datasets (e.g. highly vs lowly expressed genes); very high coverage in some regions also brings more sequencing errors, which can complicate assembly down the road
  - normalisation leaves poorly covered regions unchanged and down-samples reads in high-coverage regions
- Trinity with non-normalised reads
- rnaSPAdes medium filtered transcripts outputted (default=True)
- rnaSPAdes soft filtered transcripts outputted
- rnaSPAdes hard filtered transcripts outputted
  - Trinity and rnaSPAdes differ in resource use, with rnaSPAdes requiring less processing power and time. rnaSPAdes can also sometimes recover more transcripts than Trinity, especially for low-coverage genes, although this is dataset dependent (Bushmanova et al. 2019)
  - some do’s and don’ts for de novo transcriptome assembly
Redundancy reduction with Evidential Gene tr2aacds. A transcript to gene mapping is produced from Evidential Gene’s outputs using gawk.
Assembly completeness QC (BUSCO)
Other assembly quality metrics (rnaQUAST)
Transcriptome quality assessment with TransRate, including the use of reads for assembly evaluation. This step is not performed if profile is set to conda or mamba.
Pseudo-alignment and quantification (Salmon)
HTML report for raw reads, trimmed reads, BUSCO, and Salmon (MultiQC)
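
The read normalisation described under the Trinity step above can be illustrated with a toy sketch. This is a simplified illustration of the idea (keep reads from poorly covered regions, down-sample reads from high-coverage regions), not the implementation Trinity uses. The function name `normalise_reads`, the k-mer size, and the `target_cov` value are assumptions chosen for the example.

```python
import random
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalise_reads(reads, k=5, target_cov=2, seed=0):
    """Toy k-mer based down-sampling (illustrative only).

    Each read is kept with probability min(1, target_cov / c), where c is
    the median abundance of its k-mers across all reads. Reads from poorly
    covered regions (c <= target_cov) are therefore always kept.
    """
    rng = random.Random(seed)
    counts = Counter(km for read in reads for km in kmers(read, k))
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            # Read is shorter than k, so coverage cannot be estimated: keep it.
            kept.append(read)
            continue
        coverage = median(counts[km] for km in read_kmers)
        keep_prob = min(1.0, target_cov / coverage)
        if rng.random() < keep_prob:
            kept.append(read)
    return kept
```

Calling `normalise_reads(["ACGTACGTAC"] * 50 + ["TTGGCCAATT"])` keeps the single low-coverage read and discards most copies of the highly covered one.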

1. Set-up samplesheet

Prepare a samplesheet with your input data, where each row represents one pair of FASTQ files (paired-end reads). It should look as follows:

samplesheet.csv:

sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz
CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz
TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz,AEG588A4_S4_L003_R2_001.fastq.gz
TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz,AEG588A5_S5_L003_R2_001.fastq.gz
TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,AEG588A6_S6_L003_R2_001.fastq.gz
TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,AEG588A6_S6_L004_R2_001.fastq.gz
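
Before launching the pipeline it can help to sanity-check the samplesheet. The following is a minimal sketch, assuming the three-column layout shown above. The specific rules (no spaces in sample names, files ending in `.fastq.gz` or `.fq.gz`) are assumptions typical of nf-core samplesheets and should be checked against the pipeline's own documentation. `check_samplesheet` is a hypothetical helper, not part of the pipeline.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "fastq_1", "fastq_2"]
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")  # assumed accepted extensions


def check_samplesheet(text):
    """Return a list of human-readable problems found in samplesheet text."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != REQUIRED_COLUMNS:
        return ["header must be exactly: " + ",".join(REQUIRED_COLUMNS)]

    problems = []
    seen_pairs = set()
    # Data rows start on line 2 of the file (line 1 is the header).
    for line_no, row in enumerate(reader, start=2):
        sample = (row.get("sample") or "").strip()
        fq1 = (row.get("fastq_1") or "").strip()
        fq2 = (row.get("fastq_2") or "").strip()

        if not sample or " " in sample:
            problems.append(f"line {line_no}: sample name is empty or contains spaces")
        for column, path in (("fastq_1", fq1), ("fastq_2", fq2)):
            if not path.endswith(FASTQ_SUFFIXES):
                problems.append(f"line {line_no}: {column} is not a gzipped FASTQ file")
        if fq1 and fq1 == fq2:
            problems.append(f"line {line_no}: fastq_1 and fastq_2 are the same file")
        if (fq1, fq2) in seen_pairs:
            problems.append(f"line {line_no}: duplicate FASTQ pair")
        seen_pairs.add((fq1, fq2))
    return problems
```

An empty list means no problems were found, for example `check_samplesheet(open("samplesheet.csv").read())`.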

2. Set-up config file with parameter choices

Run with rnaSPAdes using all 3 filters (soft, medium and hard).
No quantification here, because quantification will be run through the nf-core RNAseq pipeline.
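
One way to record these choices is a Nextflow parameters file. The sketch below writes one as JSON. The parameter names (`rnaspades`, `soft_filtered_transcripts`, `hard_filtered_transcripts`) are assumptions based on the pipeline options listed above and should be verified against the nf-core/denovotranscript parameter documentation. The `outdir` value is a placeholder. No option for disabling quantification is set here, because the name of such an option is not known from these notes, and Trinity is left at its default.

```python
import json

# Assumed parameter names: verify against the pipeline's parameter docs.
params = {
    "input": "samplesheet.csv",
    "outdir": "results",                # placeholder output directory
    "rnaspades": True,                  # medium filtered transcripts
    "soft_filtered_transcripts": True,  # also output soft filtered set
    "hard_filtered_transcripts": True,  # also output hard filtered set
}


def params_json(values):
    """Serialise parameter choices as JSON text for a params file."""
    return json.dumps(values, indent=2) + "\n"


def write_params(path, values):
    """Write the parameter choices to a JSON file."""
    with open(path, "w") as handle:
        handle.write(params_json(values))
```

After `write_params("params.json", params)`, the file can be passed to Nextflow with its `-params-file` option.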

3. Create slurm script

4. Assess transcriptomes and decide which is best to use as the reference in the reference-based RNAseq pipeline

The plan is to start with the Day 30 RNAseq datasets from Roberto’s 2023 thermotolerance study: https://doi.org/10.1016/j.cbd.2023.101089