pubmed.ncbi.nlm.nih.gov

The Cryptic Bacterial Microproteome - PubMed

  • ️Mon Jan 01 2024

The Cryptic Bacterial Microproteome

Igor Fesenko et al. bioRxiv. 2024.

Abstract

Microproteins encoded by small open reading frames (smORFs) comprise the "dark matter" of proteomes. Although functional microproteins were identified in diverse organisms from all three domains of life, bacterial smORFs remain poorly characterized. In this comprehensive study of intergenic smORFs (ismORFs, 15-70 codons) in 5,668 bacterial genomes of the family Enterobacteriaceae, we identified 67,297 clusters of ismORFs subject to purifying selection. The ismORFs mainly code for hydrophobic, potentially transmembrane, unstructured, or minimally structured microproteins. Using AlphaFold Multimer, we predicted interactions of some of the predicted microproteins encoded by transcribed ismORFs with proteins encoded by neighboring genes, revealing the potential of microproteins to regulate the activity of various proteins, particularly, under stress. We compiled a catalog of predicted microprotein families with different levels of evidence from synteny analysis, structure prediction, and transcription and translation data. This study offers a resource for investigation of biological functions of microproteins.
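The ismORF definition used in the abstract (15-70 codons within an intergenic sequence) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names are hypothetical, the start codons are the three reported in Figure 3A, the stop codons are the standard ones, and counting sense codons without the stop codon is an assumption.

```python
# Sketch: enumerate candidate small ORFs in one intergenic sequence.
# Assumptions (not from the source): the 15-70 window counts sense codons
# and excludes the stop codon; every in-frame start codon is reported.

START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_CODONS, MAX_CODONS = 15, 70

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_smorfs(seq):
    """Return (strand, start, end, nt_sequence) for each candidate smORF.

    Coordinates are 0-based, end-exclusive, and relative to the scanned
    strand (the reverse complement for strand "-").
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            starts = []  # start codons seen since the last in-frame stop
            for pos in range(frame, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if codon in STOP_CODONS:
                    for start in starts:
                        n_codons = (pos - start) // 3
                        if MIN_CODONS <= n_codons <= MAX_CODONS:
                            hits.append((strand, start, pos + 3, s[start:pos + 3]))
                    starts = []
                elif codon in START_CODONS:
                    starts.append(pos)
    return hits
```

Reporting every in-frame start rather than only the longest ORF keeps candidates whose upstream start would exceed the 70-codon limit; a real pipeline would likely add further filters.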

Keywords: bacteria; bioinformatics; de novo; evolutionary analysis of smORFs; intergenic; microproteins; protein structure prediction; small open reading frames (smORFs).


Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

Figure 1. Intergenic sequences and intergenic small ORFs in enterobacteria

(A) Pipeline for identification and analysis of intergenic small open reading frames (ismORFs). More than 17 million intergenic sequences from 23 bacterial genera in the family Enterobacteriaceae were used to predict intergenic smORFs. Predicted microproteins were clustered with the CD-HIT tool, and the 947,440 clusters with a median length of 15 to 70 aa and at least three sequences were selected for further analysis. Signatures of evolutionary selection were used to define the set of potentially coding clusters (Entero67K); (B) The lengths of the selected intergenic sequences and genes from RefSeq annotation across all genera; (C) The GC content of annotated genes and intergenic sequences across all genera; (D) The lengths of annotated small proteins and putative microproteins encoded by intergenic smORFs across all genera (Mann-Whitney U test, P < 10^−10 for all comparisons).
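The cluster selection step in panel (A) can be sketched as a simple filter, reading the caption as keeping clusters whose median member length is 15 to 70 aa and that contain at least three sequences. The function name and the input layout (a dict of cluster id to member sequences) are hypothetical; parsing CD-HIT output is not shown.

```python
from statistics import median


def select_clusters(clusters, min_len=15, max_len=70, min_members=3):
    """Keep clusters with enough members and a median length in range.

    clusters: dict mapping a cluster id to a list of amino-acid sequences
    (an assumed layout, not the CD-HIT file format).
    """
    selected = {}
    for cluster_id, seqs in clusters.items():
        if len(seqs) < min_members:
            continue  # too few sequences to assess conservation
        if min_len <= median(len(s) for s in seqs) <= max_len:
            selected[cluster_id] = seqs
    return selected
```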

Figure 2. Families of ismORF-encoded predicted microproteins

(A) Sankey plot showing the proportion of clustered ismORFs and those that were predicted as coding. The microprotein families have evidence of transcription (“Transcribed ismORFs”, 15 species), are located in syntenic intergenic regions (“Syntenic ismORFs”), and have a predicted structure (“Structured ismORFs”); (B) RNAcode and MegaX tools were used to predict the coding potential of intergenic smORFs. Based on this analysis, the “Entero67K” set of potentially coding microproteins (15–70 aa) was established; (C) The calculated evolutionary rates for ismORFs classified as “coding” (“EvolScore”), all clustered ismORFs, shuffled microprotein clusters, and annotated small proteins (SmallProt); EvolScore thresholds are shown by dashed lines; (D) The length of annotated small proteins in the “SmallProt” dataset and ismORFs from “Entero67K” families; (E) A boxplot showing the number of ismORFs from Entero67K families in enterobacterial genomes; (F) The number of Syn-IGS blocks (Y axis) in relation to the minimal proportion of genomes (X axis) containing an intact ismORF (Entero67K) in the corresponding Syn-IGS; the plot was built for intergenic sequences containing ismORFs from one Entero67K cluster; (G) The phylogenomic tree of the genus Cronobacter. The species with intact ismORFs from the syntenic cluster “Cronobacter_15372” are shown in red; (H) Non-redundant nucleotide and translated sequences of the corresponding Syn-IGS (for the cluster “Cronobacter_15372”).

Figure 3. Analysis of potentially translated ismORFs

(A) Proportion of “ATG”, “GTG”, and “TTG” start codons in translatable small proteins and Entero67K ismORFs; (B) The comparison of TIS read coverage for the known small protein AppX and microprotein NC_000913.3_ORF.34481. The microprotein has signals from both TIS and RPF data and is located downstream of the protein-coding gene MgtA. In Salmonella spp., two small proteins, MgtR and MgtU, interacting with MgtA, are located in this position; (C) The evolutionary rates of potentially translated ismORFs without predicted coding potential (“intergenic”), ismORFs from the Entero67K set predicted by different methods (RNAcode and EvolScore), and small proteins (SmallProt). The evolutionary rates differ significantly: dN - Kruskal-Wallis rank sum test, P < 0.001; dS - Kruskal-Wallis rank sum test, P < 10^−15; Kd - Kruskal-Wallis rank sum test, P < 10^−5; dN/dS - Kruskal-Wallis rank sum test, P < 10^−15.

Figure 4. Analysis of the Entero67K families of predicted microproteins

(A) Diagram showing conservation of the “MicroProt 6K” families in comparison to small protein families (SmallProt); (B) Chord diagram showing the top 20 interconnections between different genera based on similarity between small protein clusters (SmallProt) revealed by HHsuite; (C) Chord diagram showing the top 20 interconnections between different genera based on similarity between microprotein clusters (Entero67K) revealed by HHsuite; (D) Multiple sequence alignments between the UniRef profile representative from Vibrio sp. (UniRef100_A0A7X5BB38) and microprotein clusters from Klebsiella and Pectobacterium; (E) Small proteins (SmallProt) annotation word cloud. Protein domain names were obtained from the “Pfam” database; (F) Scatterplot showing the mean proportion of predicted (TMHMM 2.0) transmembrane domains in Entero67K (“real”) and “shuffled” microproteins (predicted on shuffled intergenic sequences) for each genus and the GC content of intergenic sequences across Enterobacteriaceae genera. Entero67K microproteins are shown in red. Genus Buchnera and Candidatus spp. have the lowest GC content of IGs and the highest percentage of “shuffled” microproteins with predicted TM domains; (G) The orientation and location of ismORFs from the “Entero67K” set in relation to nearby genes: collinear (I, III), convergent (II), and divergent (IV) orientations.

Figure 5. Prediction of microprotein structures using AlphaFold2

(A) AlphaFold2 was used to predict the structures of microproteins. Structures with pLDDT > 80 were selected for further analysis. The predicted structures were searched against the PDB database; (B) Examples of the most frequent secondary elements and structural motifs identified in a set of microproteins; (C) An example of a microprotein structure (NC_013353.1_ORF.61612) that resembles the SH3 domain with a beta-barrel-like fold. This structure has hits in the PDB database (for example, 2cud_A); (D, E, F, G, H) Examples of structures with no match in the PDB database, from clusters NC_018522.1_ORF.60593, NZ_CP043332.1_ORF.30507, NZ_CP011254.1_ORF.115569, NC_015663.1_ORF.39606, and CP056474.1_ORF.21448, respectively.

Figure 6. Prediction of microprotein homo-oligomers using AlphaFold2 Multimer

(A, B, C, D) Examples of microprotein oligomerization predicted by AFM. Structures with pLDDT > 80 were selected for this analysis. An oligomeric structure is illustrated under each monomer. Protein cartoon structures are colored by pLDDT except in the last row, where homo-oligomers are colored by chain; (E) Structure of the NZ_CP011254.1_ORF.115569 dimer superimposed with AF-A0A2D5VJZ3; (F) Structure of NC_017641.1_ORF.45846 superimposed with DUF3572; (G) Dimeric structure of NZ_LR134492.1_ORF.72629 with a beta-barrel-like structure aligned with human SAGA-associated factor 29 using the flexible FATCAT method. In (E), (F), and (G), the chains of the query dimers are colored green and purple, and the target protein is colored gray.

Figure 7. Analysis of protein-protein interactions with AlphaFold Multimer

(A) Heatmaps showing transcription regulation patterns of DE-smORF; (B) Scatterplot showing pLDDT and ipTM+pTM for experimentally confirmed annotated small protein-protein pairs; the vertical line denotes the ipTM+pTM threshold equal to 0.75; (C) Complex of the small protein Bfd and bacterioferritin Bfr. The interaction of these proteins was confirmed experimentally; the microprotein is shown in cyan; (D) Complex of the small protein YbdD and the inner membrane protein CstA; the microprotein is shown in cyan; (E) Scatterplot showing pLDDT and ipTM+pTM for microprotein-protein pairs in E. coli UPEC536; the vertical and horizontal lines denote thresholds; (F) Complex of the microprotein NC_008253.1_ORF.30477 and the protein KduI; the microprotein is shown in magenta; (G) Complex of the microprotein NZ_CP009792.1_ORF.22445 and shikimate kinase AroK; the microprotein is shown in magenta. All protein complexes presented above are predicted with AFM.
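The two-threshold selection suggested by panels (B) and (E) can be sketched as follows. Only the ipTM+pTM cutoff of 0.75 comes from the caption; the pLDDT cutoff of 80 is borrowed from Figures 5 and 6 and is an assumption for this panel, as is treating the 0.75 cutoff as inclusive. The class and function names are hypothetical, and the combined score is taken as an input rather than computed.

```python
from dataclasses import dataclass


@dataclass
class PairPrediction:
    pair_id: str     # hypothetical label, e.g. "microprotein|partner"
    plddt: float     # mean pLDDT of the predicted complex, 0-100
    iptm_ptm: float  # combined ipTM+pTM score, 0-1


def confident_pairs(predictions, score_cutoff=0.75, plddt_cutoff=80.0):
    """Return predictions that pass both cutoffs.

    score_cutoff is inclusive and plddt_cutoff is exclusive; both
    conventions are assumptions, not taken from the source.
    """
    return [
        p for p in predictions
        if p.iptm_ptm >= score_cutoff and p.plddt > plddt_cutoff
    ]
```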


