  • Sun Jan 01 2023

MarFERReT, an open-source, version-controlled reference library of marine microbial eukaryote functional genes

R D Groussman et al. Sci Data. 2023.

Abstract

Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of about half of eukaryotic environmental transcripts. Here, we introduce Marine Functional EukaRyotic Reference Taxa (MarFERReT), a marine microbial eukaryotic sequence library designed for use with taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 publicly accessible marine eukaryote genomes and transcriptomes and assessed their sequence quality and cross-contamination issues, selecting 800 validated entries for inclusion in MarFERReT. Version 1.1 of MarFERReT contains reference sequences from 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa, totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. The MarFERReT project repository hosts containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT.

© 2023. The Author(s).

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1

Diagrammatic overview of MarFERReT validation and build processes. Boxes represent the data sets involved in building MarFERReT and the border style indicates the data type: external sequence inputs (dashed line), external taxonomic and functional annotation resources (dotted lines), internal data products (single solid line) and output MarFERReT data products (double lines). Arrows indicate processes. (a) Candidate entry and NCBI taxID validation: (1) candidate entries were identified from primary data sources and downloaded as nucleotide and protein reference sequences; (2) six-frame translation and frame-selection of nucleotide sequences into protein sequences; (3) functional annotation of protein sequences with Pfam protein families using HMMER 3.3; (4) curation of NCBI Taxonomy IDs (taxIDs) for MarFERReT candidate entries and additional incorporation of matched IDs and classification from the PR2 Taxonomy ecosystem; (5) candidate entries are assessed for potential cross-contamination with evidence from external studies and by taxonomic analysis of ribosomal protein sequences. Validated entries accepted for the quality-controlled build are recorded in the entry metadata. (b) Quality-controlled MarFERReT build with validated entries. For the set of 800 validated entries, the same methods used in (a) were used for (1) aggregating nucleotide and protein data and (2) translating nucleotide to protein sequences; (3) intra-taxa clustering at the strain or species level: protein sequence data sharing the same NCBI taxIDs are pooled together and clustered at 99% identity using updated taxIDs contained in the metadata; (4) final Pfam annotation of the clustered protein sequences; (5) identification of core transcribed genes from functional annotations of transcriptome-derived entries.
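Step (2) of the caption, six-frame translation followed by frame selection, can be illustrated with a minimal Python sketch. It is not the MarFERReT build script: the function names are hypothetical, the standard genetic code is assumed, and "pick the frame with the longest stop-free stretch" is an assumed selection rule chosen only for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translate(nt):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    nt = nt.upper()
    reverse_complement = nt.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(nt[offset:])
        frames[f"-{offset + 1}"] = translate(reverse_complement[offset:])
    return frames


def longest_orf(protein):
    """Longest stretch between stop codons (assumed selection criterion)."""
    return max(protein.split("*"), key=len)


def select_frame(nt):
    """Pick the frame whose longest stop-free stretch is longest (hypothetical rule)."""
    frames = six_frame_translate(nt)
    best = max(frames, key=lambda name: len(longest_orf(frames[name])))
    return best, longest_orf(frames[best])
```

A length-only rule like this can prefer a reverse-strand frame on very short inputs, which is one reason production pipelines usually combine it with minimum-length filters.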

Fig. 2

Cladogram of 800 validated MarFERReT entries and summary of reference metadata. (a) Cladogram of hierarchical taxonomic ranks of marine eukaryotes within the NCBI Taxonomy framework using the NCBI CommonTree tool. Each tip is a unique taxon included in MarFERReT, defined by its NCBI taxID identifier. Branches are colored by taxonomic lineage with size of the closed circle at each tip proportional to the number of validated entries in each taxon. Concentric rings describe metadata and statistics for each taxon. From innermost ring outward: year of publication or data release for sequence data (average year of release for multiple entries), number of clustered sequences in taxon, raw input format of sequence data: transcriptome, transcriptome shotgun assembly; genome, genome-derived gene models; SAG, single-cell amplified genome; SAT, single-cell amplified transcriptome; or a combination of types (mixed), and source of data: NCBI, METdb, JGI PhycoCosm, or MMETSP. (b) Number of clustered sequences in MarFERReT build by year of data release, and (c) Histogram showing distribution of clustered sequence count for 453 taxa in the final build.

Fig. 3

Validation of candidate entry sequences for cross-contamination. (a) Circular tree from hierarchical clustering of a binary distance matrix, constructed from the presence/absence of approximately 12,000 Pfam protein families in 874 candidate entries; entries with low sequence or low Pfam flags were excluded. Grey points at the tips indicate the 102 entries excluded from the final build, with the other points marking the flag(s) for excluded entries. Contam > 50% (RP63), cross-contamination estimates over 50% from this study; Contam > 50% (VanVlierberghe), cross-contamination estimates over 50% from van Vlierberghe et al.; Contam (Lasek), reported contamination for ciliate entries from Lasek-Nesselquist and Johnson. (b) Histogram of Pfam domains in annotated candidate entry sequences; red dotted line indicates the 500 Pfam minimum threshold for inclusion. (c) Histogram of estimated cross-contamination in entries from ribosomal protein analysis; red dotted line indicates the 50% cutoff threshold for exclusion.
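The presence/absence distance in (a) and the two thresholds in (b) and (c) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: "binary distance" is taken here to mean the Jaccard distance between Pfam sets, and the function and parameter names are hypothetical. Only the 500-Pfam minimum and the 50% contamination cutoff come from the caption.

```python
def binary_distance(pfams_a, pfams_b):
    """Jaccard distance between two sets of Pfam IDs (assumed meaning of 'binary')."""
    union = pfams_a | pfams_b
    if not union:
        return 0.0
    return 1.0 - len(pfams_a & pfams_b) / len(union)


def distance_matrix(pfams_by_entry):
    """Pairwise distances for a dict of {entry_name: set of Pfam IDs}."""
    names = sorted(pfams_by_entry)
    matrix = [
        [binary_distance(pfams_by_entry[x], pfams_by_entry[y]) for y in names]
        for x in names
    ]
    return names, matrix


def passes_thresholds(n_pfams, contamination_pct, min_pfams=500, max_contam=50.0):
    """Keep entries with at least 500 Pfam domains and contamination not over 50%."""
    return n_pfams >= min_pfams and contamination_pct <= max_contam
```

The resulting matrix could be passed to any hierarchical clustering routine to produce a tree like the one in panel (a); which linkage method was used is not stated in the caption.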

Fig. 4

Schematic of case study 1 and 2 use of MarFERReT for annotation of environmental metatranscriptomes. Example workflow showing how MarFERReT data products can be used to annotate unknown assembled sequences and assess taxonomic bins. Boxes indicate datasets; box borders indicate environmental contig sequence data (dashed line), taxonomic and functional annotation from external resources (dotted lines), MarFERReT data products (double lines) and taxonomic and functional annotation results (bold lines). The red diamond indicates a user-constructed DIAMOND database for lowest common ancestor determination; prior to this step the user could combine MarFERReT with other libraries for expanded taxonomic coverage as shown in case study 1. Arrows represent processes: (1) construction of a DIAMOND database using MarFERReT proteins and data, and taxonomy files from NCBI Taxonomy; matching PR2 Taxonomy identifiers and classifications are also provided in the metadata for alternative classification approaches; (2) taxonomic annotation of environmental contigs using a DIAMOND database built from MarFERReT proteins; (3) functional annotation of environmental contigs with HMMER 3.3 on Pfam protein family hmm profiles; (4) completeness assessment of taxonomically- and functionally-annotated metatranscriptome bins using MarFERReT core transcribed genes.
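Two ideas in this workflow, lowest common ancestor (LCA) assignment in step (2) and completeness assessment in step (4), can be illustrated with a short Python sketch. In practice DIAMOND computes the LCA itself from the taxonomy files; the code below only shows the underlying logic. The function names, the root-to-tip lineage representation, and the completeness definition (percentage of core Pfam families observed in a bin) are assumptions for illustration, not the published implementation.

```python
def lowest_common_ancestor(lineages):
    """Deepest taxon shared by all lineages, each given as a root-to-tip list."""
    if not lineages:
        return None
    lca = None
    # zip stops at the shortest lineage, so ranks are compared position by position.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca


def completeness(bin_pfams, core_pfams):
    """Percentage of core transcribed gene families found in a taxonomic bin."""
    if not core_pfams:
        return 0.0
    return 100.0 * len(bin_pfams & core_pfams) / len(core_pfams)
```

For example, a contig whose hits fall in two different diatom genera would be assigned to their shared parent lineage rather than to either genus, and a bin containing half of the core families would score 50% under this definition.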
