This is a description of the nf-core/denovotranscript pipeline for de novo transcriptome assembly of paired-end short reads from bulk RNA-seq.
- Read QC of raw reads (FastQC)
- Adapter and quality trimming (fastp)
- Read QC of trimmed reads (FastQC)
- Remove rRNA or mitochondrial DNA (optional) (SortMeRNA)
- Transcriptome assembly using any combination of the following:
  - Trinity with normalised reads (default=True)
    - Normalization helps with the very non-uniform coverage of RNA-seq datasets (e.g. highly vs lowly expressed genes). High coverage in some regions may also mean more sequencing errors, which complicates assembly down the road.
    - Normalization leaves poorly covered regions unchanged but down-samples reads in high-coverage regions.
  - Trinity with non-normalised reads
  - rnaSPAdes, medium filtered transcripts outputted (default=True)
  - rnaSPAdes, soft filtered transcripts outputted
  - rnaSPAdes, hard filtered transcripts outputted
  - Differences between Trinity and rnaSPAdes include processing power and time, with rnaSPAdes requiring less. rnaSPAdes can sometimes recover more transcripts than Trinity, especially for low-coverage genes, but this is dataset dependent (Bushmanova et al. 2019).
  - Some do’s and don’ts for de novo transcriptome assembly
- Redundancy reduction with Evidential Gene tr2aacds. A transcript to gene mapping is produced from Evidential Gene’s outputs using gawk.
- Assembly completeness QC (BUSCO)
- Other assembly quality metrics (rnaQUAST)
- Transcriptome quality assessment with TransRate, including the use of reads for assembly evaluation. This step is not performed if profile is set to conda or mamba.
- Pseudo-alignment and quantification (Salmon)
- HTML report for raw reads, trimmed reads, BUSCO, and Salmon (MultiQC)
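The normalization notes under Trinity (poorly covered regions untouched, high-coverage regions down-sampled) can be illustrated with a toy sketch. This is not Trinity's implementation: the per-read coverage values, the `target` threshold and the function names are all hypothetical, and the rule of keeping a read with probability `target / coverage` is a simplifying assumption chosen only to show the behaviour described above.

```python
import random


def keep_probability(coverage: float, target: float) -> float:
    """Chance of retaining a read with the given estimated coverage (toy rule)."""
    if coverage <= target:
        return 1.0  # poorly covered regions are left unchanged
    return target / coverage  # high-coverage regions are down-sampled


def normalise(reads, target, seed=0):
    """Return the (read_id, coverage) pairs that survive down-sampling.

    `reads` is a hypothetical list of (read_id, estimated_coverage) tuples;
    a real tool would estimate coverage from k-mer counts.
    """
    rng = random.Random(seed)  # seeded so the toy example is reproducible
    return [
        (read_id, cov)
        for read_id, cov in reads
        if rng.random() < keep_probability(cov, target)
    ]


# Ten reads from a lowly expressed gene, one thousand from a highly expressed one.
low = [(f"low_{i}", 5.0) for i in range(10)]
high = [(f"high_{i}", 1000.0) for i in range(1000)]
kept = normalise(low + high, target=50.0)
kept_low = [r for r in kept if r[0].startswith("low_")]
kept_high = [r for r in kept if r[0].startswith("high_")]
```

In this sketch every low-coverage read is retained, while only a small fraction of the high-coverage reads survives, which is the flattening effect the notes describe.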
1. Set up the samplesheet
Prepare a samplesheet with your input data that looks as follows. Each row represents one pair of FASTQ files (paired-end reads).
samplesheet.csv:
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz
CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz
TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz,AEG588A4_S4_L003_R2_001.fastq.gz
TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz,AEG588A5_S5_L003_R2_001.fastq.gz
TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,AEG588A6_S6_L003_R2_001.fastq.gz
TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,AEG588A6_S6_L004_R2_001.fastq.gz
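Before launching a run it can help to sanity-check the samplesheet. The sketch below is not part of the pipeline: the function name `read_samplesheet` and the file-extension check are assumptions made for illustration. It verifies the three-column header shown above and groups rows by sample, which makes repeated sample names (such as `TREATMENT_REP3`, listed for two lanes) easy to spot.

```python
import csv
import io
from collections import defaultdict

REQUIRED_COLUMNS = ["sample", "fastq_1", "fastq_2"]
# Assumption: inputs are gzipped FASTQ files, as in the example rows.
ALLOWED_SUFFIXES = (".fastq.gz", ".fq.gz")


def read_samplesheet(handle):
    """Parse a samplesheet and return {sample: [(fastq_1, fastq_2), ...]}."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != REQUIRED_COLUMNS:
        raise ValueError(f"expected header {REQUIRED_COLUMNS}, got {reader.fieldnames}")
    runs = defaultdict(list)
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for col in ("fastq_1", "fastq_2"):
            value = row[col] or ""
            if not value.endswith(ALLOWED_SUFFIXES):
                raise ValueError(f"line {line_no}: {col}={value!r} is not a gzipped FASTQ path")
        runs[row["sample"]].append((row["fastq_1"], row["fastq_2"]))
    return dict(runs)


EXAMPLE = """sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,AEG588A6_S6_L003_R2_001.fastq.gz
TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,AEG588A6_S6_L004_R2_001.fastq.gz
"""

runs = read_samplesheet(io.StringIO(EXAMPLE))
for sample, pairs in runs.items():
    print(sample, len(pairs), "read pair file(s)")
```

For a file on disk, pass an open handle instead, for example `read_samplesheet(open("samplesheet.csv", newline=""))`.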
2. Set up a config file with parameter choices
- Run rnaSPAdes and output transcripts from all three filters (soft, medium and hard).
- Skip quantification, because it will be run separately through the nf-core/rnaseq pipeline.
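One way to record these choices is a parameters file passed to Nextflow with `-params-file`, which accepts JSON. The sketch below only generates such a file. The parameter names `assemblers`, `soft_filtered_transcripts` and `hard_filtered_transcripts` are assumptions that should be checked against the pipeline's parameter documentation, and the paths are placeholders. Medium filtered transcripts are described above as the default, so only the soft and hard outputs are switched on.

```python
import json

# ASSUMED parameter names: verify each key against the pipeline's
# parameter documentation before use. Paths are placeholders.
params = {
    "input": "samplesheet.csv",         # samplesheet from step 1
    "outdir": "results",                # placeholder output directory
    "assemblers": "rnaspades",          # assumed: restrict assembly to rnaSPAdes
    "soft_filtered_transcripts": True,  # assumed: also output soft filtered set
    "hard_filtered_transcripts": True,  # assumed: also output hard filtered set
    # No key is set for skipping quantification: the option name is not
    # given in these notes, so look it up rather than guessing.
}


def render_params(values: dict) -> str:
    """Serialise the parameter choices as pretty-printed JSON."""
    return json.dumps(values, indent=2)


def write_params(path: str, values: dict) -> None:
    """Write the parameter choices to `path` for use with -params-file."""
    with open(path, "w") as handle:
        handle.write(render_params(values) + "\n")


print(render_params(params))
```

Calling `write_params("params.json", params)` would produce a file that could then be supplied as `nextflow run nf-core/denovotranscript -profile <profile> -params-file params.json`, with the profile chosen to suit your environment.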