Motivation

Based on successful nf-core/differentialabundance runs, we wished to augment them with gene ontology information in order to examine relationships with biological functions of the differentially expressed genes. nf-core/differentialabundance supports this, and requires one or more gene matrix files. The present task was to create a GMT file for C. Virginica, because none existed, and re-run nf-core/differentialabundance.

Results

Here is the differential abundance summary report including GSEA results in the Gene set analysis section. The link will download the HTML file for local display. The GMT file used, and its precursor files, are in our Git repository here, all prefixed with 4706222.C_virginica, based on the GOA file name from which they were derived.

Method

Because apparently no GMT file existed for C. Virginica, I used ChatGPT to give me the following recommendations:

Download C. virginica GO GAF from UniProt or NCBI
Parse GAF to map genes → GO terms
Write out a standard tab-delimited GMT I found a .goa file for C. Virginica. and added the file to our Git data directory.

I used the following scripts (created mostly by ChatGPT) in the following order:

goa_to_gmt.py to create a base GMT file from the downloaded GOA file. Noting that all evidence codes were IEA, meaning Inferred from Electronic Annotation, I had it filter only ND (No biological data available) codes. ChatGPT noted:
You can proceed with IEA-only gene sets, with the understanding that:
- The annotations are computational predictions, not experimentally validated.
- Enrichment results will reflect putative function, useful for hypothesis generation but not definitive.
This is a common reality when working with non-model species like Crassostrea virginica.
add_go_names_to_gmt.py to add gene ontology terms. Output file was 4706222.C_virginica_named.gmt
filter_gmt_by_geneset_size.py to filter out gene sets < 10 (not enough signal) or > 500 (too broad), producing 4706222.C_virginica_named_filtered.gmt
filter_gmt_by_go_depth.py to produce 4706222.C_virginica_named_filtered_depth3.gmt.
After running GSEA within an nf-core/differentialabundance pipeline, the GSEA section only lists the GO terms, like GO:0031032, not the function, so I used enrich_gmt_with_go_namespaces.py to have the GO IDs include encoded gene function, producing 4706222.C_virginica_named_filtered_depth3_enrichednames.gmt. I used this GMT file in the differentialabundance pipeline to produce the results linked in the Results section above.