The Cryptic Bacterial Microproteome - PubMed
- ️Mon Jan 01 2024
The Cryptic Bacterial Microproteome
Igor Fesenko et al. bioRxiv. 2024.
Abstract
Microproteins encoded by small open reading frames (smORFs) comprise the "dark matter" of proteomes. Although functional microproteins have been identified in diverse organisms from all three domains of life, bacterial smORFs remain poorly characterized. In this comprehensive study of intergenic smORFs (ismORFs, 15-70 codons) in 5,668 bacterial genomes of the family Enterobacteriaceae, we identified 67,297 clusters of ismORFs subject to purifying selection. The ismORFs mainly code for hydrophobic, potentially transmembrane, unstructured, or minimally structured microproteins. Using AlphaFold Multimer, we predicted interactions of some of the predicted microproteins encoded by transcribed ismORFs with proteins encoded by neighboring genes, revealing the potential of microproteins to regulate the activity of various proteins, particularly under stress. We compiled a catalog of predicted microprotein families with different levels of evidence from synteny analysis, structure prediction, and transcription and translation data. This study offers a resource for investigation of biological functions of microproteins.
Keywords: bacteria; bioinformatics; de novo; evolutionary analysis of smORFs; intergenic; microproteins; protein structure prediction; small open reading frames (smORFs).
Conflict of interest statement
Declaration of interests: The authors declare no competing interests.
Figures
![Figure 1.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/960c/11188072/49a19e2cd8a7/nihpp-2024.02.17.580829v1-f0001.gif)
(A) Pipeline for identification and analysis of intergenic small open reading frames (ismORFs). More than 17 million intergenic sequences from 23 bacterial genera in the family Enterobacteriaceae were used to predict intergenic smORFs. Predicted microproteins were clustered with the CD-HIT tool into 947,440 clusters; clusters with a median length of 15 to 70 aa and at least three sequences were selected for further analysis. Signatures of evolutionary selection were used to define the set of potentially coding clusters (Entero67K); (B) The lengths of the selected intergenic sequences and genes from RefSeq annotation across all genera; (C) The GC content of annotated genes and intergenic sequences across all genera; (D) The lengths of annotated small proteins and putative microproteins encoded by intergenic smORFs across all genera (Mann-Whitney U test, P < 10^-10 for all comparisons).
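The smORF prediction step in panel (A) can be illustrated with a minimal sketch. It is not the authors' pipeline, whose tooling is not named in this caption. It assumes the 15-70 codon window from the abstract excludes the stop codon, uses the ATG, GTG, and TTG start codons mentioned in the Figure 3 caption, uses the standard stop codons, and keeps only the longest ORF per stop codon. The function names are hypothetical.

```python
# Minimal six-frame smORF scan over one intergenic sequence.
# Assumptions (not from the source): length excludes the stop codon,
# and the first in-frame start after a stop defines the ORF.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_smorfs(seq, min_codons=15, max_codons=70):
    """Return (strand, start, end, orf) tuples for ORFs in the size window.

    Coordinates are 0-based, end-exclusive, and relative to the scanned
    strand (the reverse complement for strand "-").
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # first start codon seen since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i
                elif codon in STOP_CODONS:
                    if start is not None:
                        n_codons = (i - start) // 3  # stop codon excluded
                        if min_codons <= n_codons <= max_codons:
                            hits.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return hits
```

A real analysis would additionally need genome coordinates for each intergenic region and a clustering step such as CD-HIT, neither of which is modeled here.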
![Figure 2.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/960c/11188072/357383911c1b/nihpp-2024.02.17.580829v1-f0002.gif)
(A) Sankey plot showing the proportion of clustered ismORFs and those that were predicted as coding. The microprotein families have evidence of transcription (“Transcribed ismORFs”, 15 species), are located in syntenic intergenic regions (“Syntenic ismORFs”), and have predicted structure (“Structured ismORFs”). (B) RNAcode and MegaX tools were used for the prediction of the coding potential of intergenic smORFs. Based on this analysis, the “Entero67K” set of potentially coding microproteins (15–70 aa) was established. (C) The calculated evolutionary rates for ismORFs classified as “coding” (“EvolScore”), all clustered ismORFs, shuffled microprotein clusters, and annotated small proteins (SmallProt); EvolScore thresholds are shown by dashed lines; (D) The length of annotated small proteins in the “SmallProt” dataset and ismORFs from “Entero67K” families; (E) A boxplot showing the number of ismORFs from Entero67K families in enterobacterial genomes; (F) The number of Syn-IGS blocks (Y axis) in relation to the minimal proportion of genomes (X axis) containing an intact ismORF (Entero67K) in the corresponding Syn-IGS. The plot was built for intergenic sequences containing ismORFs from one Entero67K cluster; (G) The phylogenomic tree of the genus Cronobacter. The species with intact ismORFs from the syntenic cluster “Cronobacter_15372” are shown in red; (H) Non-redundant nucleotide and translated sequences of the corresponding Syn-IGS (for the cluster “Cronobacter_15372”) are shown.
![Figure 3.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/960c/11188072/f008e381eff0/nihpp-2024.02.17.580829v1-f0003.gif)
(A) Proportion of “ATG”, “GTG”, and “TTG” start codons in translatable small proteins and Entero67K ismORFs; (B) The comparison of TIS read coverage for the known small protein AppX and microprotein NC_000913.3_ORF.34481. The microprotein has signals from both TIS and RPF data and is located downstream of the protein-coding gene MgtA. In Salmonella spp., two small proteins, MgtR and MgtU, interacting with MgtA, are located in this position; (C) The evolutionary rates of potentially translated ismORFs without predicted coding potential (“intergenic”), ismORFs from the Entero67K set, predicted by different methods (RNAcode and EvolScore) and small proteins (SmallProt). The evolutionary rates significantly differ: dN - Kruskal-Wallis rank sum test, P < 0.001; dS - Kruskal-Wallis rank sum test, P < 10^-15; Kd - Kruskal-Wallis rank sum test, P < 10^-5; dN/dS - Kruskal-Wallis rank sum test, P < 10^-15.
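The start-codon proportions compared in panel (A) amount to a simple tally over ORF sequences. The sketch below is illustrative only: the function name and the input format (a list of nucleotide strings beginning at the start codon) are assumptions, not taken from the source.

```python
from collections import Counter


def start_codon_proportions(orfs, codons=("ATG", "GTG", "TTG")):
    """Return the fraction of ORFs that begin with each listed start codon.

    ORFs starting with any other triplet are ignored, so the returned
    fractions sum to 1 whenever at least one ORF matches.
    """
    counts = Counter(orf[:3].upper() for orf in orfs if len(orf) >= 3)
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}
```

Applying the same function to two sets, for example annotated small proteins and candidate ismORFs, would give the two distributions that such a panel contrasts.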
![Figure 4.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/960c/11188072/6c5de3050088/nihpp-2024.02.17.580829v1-f0004.gif)
(A) The diagram showing conservation of the “MicroProt 6K” families in comparison to small protein families (SmallProt); (B) Chord diagram showing the top 20 interconnections between different genera based on similarity between small protein clusters (SmallProt) revealed by HHsuite; (C) Chord diagram showing the top 20 interconnections between different genera based on similarity between microprotein clusters (Entero67K) revealed by HHsuite; (D) Multiple sequence alignments between the UniRef profile representative from Vibrio sp. (UniRef100_A0A7X5BB38) and microprotein clusters from Klebsiella and Pectobacterium; (E) Small proteins (SmallProt) annotation word cloud. Protein domain names were obtained from the “Pfam” database; (F) The scatterplot shows the mean proportion of predicted (TMHMM 2.0) transmembrane domains in Entero67K (“real”) and “shuffled” microproteins (predicted on shuffled intergenic sequences) for each genus and the GC content of intergenic sequences across Enterobacteriaceae genera. Entero67K microproteins are shown in red. Genus Buchnera and Candidatus spp. have the lowest GC content of IGs and the highest percent of “shuffled” microproteins with predicted TM domains; (G) The orientation and location of ismORFs from the “Entero 67K” set in relation to nearby genes; collinear (I, III), convergent (II) and divergent (IV) orientations.
![Figure 5.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/960c/11188072/e96f6114f36b/nihpp-2024.02.17.580829v1-f0005.gif)
(A) AlphaFold2 was used to predict the structures of microproteins. Structures with pLDDT > 80 were selected for further analysis. The predicted structures were searched against the PDB database; (B) Examples of the most frequent secondary elements and structural motifs identified in a set of microproteins; (C) An example of a microprotein structure (NC_013353.1_ORF.61612) that resembles the SH3 domain with a beta-barrel-like fold. This structure has hits from the PDB database (for example, 2cud_A); (D, E, F, G, H) Examples of structures that cannot be identified in the PDB database from clusters NC_018522.1_ORF.60593, NZ_CP043332.1_ORF.30507, NZ_CP011254.1_ORF.115569, NC_015663.1_ORF.39606, and CP056474.1_ORF.21448, respectively.
![Figure 6.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/960c/11188072/2529b9f24bd1/nihpp-2024.02.17.580829v1-f0006.gif)
(A, B, C, D) Examples of microprotein oligomerization predicted by AFM. Structures with pLDDT > 80 were selected for this analysis. An oligomeric structure is illustrated under each monomer. Protein cartoon structures are colored by pLDDT except in the last row, where homo-oligomers are colored by chain. (E) Structure of NZ_CP011254.1_ORF.115569 dimer superimposed with AF-A0A2D5VJZ3; (F) Structure of NC_017641.1_ORF.45846 superimposed with DUF3572; (G) Dimeric structure of NZ_LR134492.1_ORF.72629 with a beta-barrel-like structure aligned with human SAGA-associated factor 29 using the flexible FATCAT method. In (E), (F), and (G), the chains of the query dimers are colored green and purple, and the target protein is colored gray.
![Figure 7.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/960c/11188072/ff16a793c6e4/nihpp-2024.02.17.580829v1-f0007.gif)
(A) Heatmaps showing transcription regulation patterns of DE-smORF; (B) Scatterplot showing pLDDT and ipTM+pTM for experimentally confirmed annotated small protein-protein pairs; the vertical line denotes the ipTM+pTM threshold equal to 0.75; (C) Complex of the small protein Bfd and bacterioferritin Bfr. The interaction of these proteins was confirmed experimentally; the microprotein is shown in cyan; (D) Complex of the small protein YbdD and the inner membrane protein CstA; the microprotein is shown in cyan; (E) Scatterplot showing pLDDT and ipTM+pTM for microprotein-protein pairs in E. coli UPEC536; the vertical and horizontal lines denote thresholds; (F) Complex of the microprotein NC_008253.1_ORF.30477 and the protein KduI; the microprotein is shown in magenta; (G) Complex of the microprotein NZ_CP009792.1_ORF.22445 and shikimate kinase AroK; the microprotein is shown in magenta. All protein complexes presented above are predicted with AFM.
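The threshold lines in panels (B) and (E) correspond to a two-score filter over predicted complexes, which might be sketched as follows. The ipTM+pTM cutoff of 0.75 comes from the caption; the pLDDT cutoff for panel (E) is not stated there, so the default of 80 is an assumption borrowed from the Figure 5 and 6 captions. Whether each comparison is strict or inclusive is also assumed. The class, field names, and example identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictedComplex:
    """Scores for one predicted microprotein-protein pair (hypothetical schema)."""
    microprotein: str
    partner: str
    plddt: float      # mean pLDDT of the prediction, on a 0-100 scale
    iptm_ptm: float   # combined ipTM+pTM confidence, on a 0-1 scale


def confident_pairs(predictions, min_iptm_ptm=0.75, min_plddt=80.0):
    """Keep predictions that pass both the ipTM+pTM and pLDDT cutoffs.

    The 0.75 cutoff is from the figure caption; the pLDDT default and
    the choice of >= versus > are assumptions made for this sketch.
    """
    return [
        p for p in predictions
        if p.iptm_ptm >= min_iptm_ptm and p.plddt > min_plddt
    ]
```

Keeping both cutoffs as parameters makes it easy to see how the number of retained pairs changes when either threshold is moved.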