fLPS: Fast discovery of compositional biases for the protein universe - PubMed

Sun Jan 01 2017

fLPS: Fast discovery of compositional biases for the protein universe

Paul M Harrison. BMC Bioinformatics. 2017.

Abstract

Background: Proteins often contain regions that are compositionally biased (CB), i.e., they are made from a small subset of amino-acid residue types. These CB regions can be functionally important, e.g., the prion-forming and prion-like regions that are rich in asparagine and glutamine residues.

Results: Here I report a new program fLPS that can rapidly annotate CB regions. It discovers both single-residue and multiple-residue biases. It works through a process of probability minimization. First, contigs are constructed for each amino-acid type out of sequence windows with a low degree of bias; second, these contigs are searched exhaustively for low-probability subsequences (LPSs); third, such LPSs are iteratively assessed for merger into possible multiple-residue biases. At each of these stages, efficiency measures are taken to avoid or delay probability calculations unless/until they are necessary. On a current desktop workstation, the fLPS algorithm can annotate the biased regions of the yeast proteome (>5700 sequences) in <1 s, and of the whole current TrEMBL database (>65 million sequences) in as little as ~1 h, which is >2 times faster than the commonly used program SEG, using default parameters. fLPS discovers both shorter CB regions (of the sort that are often termed 'low-complexity sequence'), and milder biases that may only be detectable over long tracts of sequence.
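As a rough illustration of the quantity being minimized, the sketch below computes a binomial tail probability for one residue type in one window. It is not taken from the fLPS source code: the function name `binomial_tail`, the example window, and the background frequency of 0.04 are assumptions made for this example.

```python
from math import comb


def binomial_tail(n, k, f):
    """Probability of seeing at least k residues of one type in a window
    of n residues, given a background frequency f for that type."""
    return sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k, n + 1))


# Hypothetical window and background frequency (assumed values).
window = "QQNQQNNQQQ"
p_value = binomial_tail(len(window), window.count("Q"), 0.04)
print(f"P-value for Q bias in window: {p_value:.3g}")
```

A lower value indicates a stronger bias, so a region that minimizes this probability is the most biased stretch for that residue type.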

Conclusions: fLPS can readily handle very large protein data sets, such as might come from metagenomics projects. It is useful in searching for proteins with similar CB regions, and for making functional inferences about CB regions for a protein of interest. The fLPS package is available from: http://biology.mcgill.ca/faculty/harrison/flps.html , or https://github.com/pmharrison/flps , or is a supplement to this article.

Keywords: Annotation; Bias; Composition; Intrinsic disorder; Low-complexity; Prion; Protein.

Conflict of interest statement

Competing interests

The author declares that he has no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
The algorithm. Three stages of bias annotation are depicted: *QUICK SCAN:* For each amino-acid residue type, from the maximum window size M down to the minimum m, the sequence is scanned for windows that have numbers of amino-acids greater than the expectation for a high binomial P-value threshold (=0.001). These windows are merged into a contig if they overlap each other. *MINIMIZE:* For each contig, the lowest-probability subsequences (LPSs) are computed by searching down from the contig length to the minimum m. *MERGE:* LPSs for different residue types are then sorted together in increasing order of binomial P-value and iteratively assessed for merger into multiple-residue LPSs. LPSs are merged if the merged LPS would have a lower P-value. This assessment entails checking whether the multiple-residue LPSs can be trimmed or extended, as depicted. Mergers of LPSs are assessed until no more can be performed. *OUTPUT:* Both single- and multiple-residue LPSs are output in increasing order of binomial P-value
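The QUICK SCAN and MINIMIZE stages described in this caption can be sketched for a single residue type as follows. This is a simplified illustration under stated assumptions, not the fLPS implementation: it scans with one fixed window size instead of stepping from M down to m, it recomputes every probability rather than applying the efficiency measures mentioned in the abstract, and it omits the MERGE stage. The function names, the background frequency `f`, and the 0-based, end-exclusive coordinates are all choices made for this example.

```python
from math import comb


def binomial_tail(n, k, f):
    """P(X >= k) for X ~ Binomial(n, f)."""
    return sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k, n + 1))


def quick_scan(seq, residue, f, window, threshold=0.001):
    """Find windows whose binomial P-value is at or below the threshold
    and merge overlapping ones into contigs (start, end), end exclusive."""
    contigs = []
    for start in range(len(seq) - window + 1):
        count = seq[start:start + window].count(residue)
        if binomial_tail(window, count, f) <= threshold:
            end = start + window
            if contigs and start < contigs[-1][1]:
                # Overlaps the previous contig: extend it.
                contigs[-1] = (contigs[-1][0], end)
            else:
                contigs.append((start, end))
    return contigs


def minimize(seq, residue, f, contig, min_len):
    """Exhaustively search a contig, from its full length down to min_len,
    for the subsequence with the lowest binomial P-value.
    Returns (p_value, start, end) or None if the contig is too short."""
    c_start, c_end = contig
    best = None
    for length in range(c_end - c_start, min_len - 1, -1):
        for start in range(c_start, c_end - length + 1):
            count = seq[start:start + length].count(residue)
            p = binomial_tail(length, count, f)
            if best is None or p < best[0]:
                best = (p, start, start + length)
    return best


# Hypothetical sequence with a glutamine tract (assumed data).
seq = "ACDEFGHIKL" + "Q" * 12 + "MNPRSTVWYA"
for contig in quick_scan(seq, "Q", f=0.05, window=8):
    print(contig, minimize(seq, "Q", 0.05, contig, min_len=5))
```

The coarse scan keeps the expensive exhaustive search confined to the contigs, which is the general idea behind delaying probability calculations until they are needed.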

**Fig. 2**
Output example. An example of the fLPS output in (a) short and (b) long formats, with a graphic of the LPSs in (c) (this is not part of the actual output of the program). The output is for protein CRPAK_HUMAN, human cysteine-rich PAK1 inhibitor. (a) The short format is: sequence name; type of bias (SINGLE-residue, MULTIPLE-residue or WHOLE-sequence); ordinal number of the LPS for the sequence (they are sorted in increasing order of binomial P-value); start residue in sequence; end residue in sequence; total number of bias residues in the LPS; binomial P-value for the LPS; CB signature (the single-letter amino-acid codes of the residues are listed in order of precedence within curly brackets). (b) Two examples of the extra fields in long output, corresponding to the short output in (a). The long format has the additional fields: sum of log(P) (the sum of the log P-values of each of the constituent biases in the LPS, prior to merging); start residue of a core subsequence with the highest density of bias residues; end residue of the core subsequence; the core subsequence; up to 10 residues of N-terminal sequence context for the LPS; the LPS subsequence; up to 10 residues of C-terminal sequence context. Each LPS is listed on one line, except that in long format there is an optional summary footer that can be output using the ‘-d’ option. This begins with the ‘<’ symbol and contains these fields: sequence name; sequence length; number of SINGLE-residue LPSs; number of MULTIPLE-residue LPSs; number of WHOLE-sequence biases. For brevity, most of the fields that duplicate the short format shown in (a) are omitted from the long format in (b). (c) A graphic of the LPSs. Bias type information, etc. as in (a)
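The short-format fields listed in this caption could be read into a record along the following lines. This is a minimal sketch: the caption gives the field order but not the delimiter, so tab separation is an assumption here, and the `LpsRecord` class, the `parse_short_line` function, and the sample line (with invented values for a placeholder protein name) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LpsRecord:
    name: str        # sequence name
    bias_type: str   # SINGLE, MULTIPLE or WHOLE
    rank: int        # ordinal number of the LPS within the sequence
    start: int       # start residue in sequence
    end: int         # end residue in sequence
    bias_count: int  # total number of bias residues in the LPS
    p_value: float   # binomial P-value
    signature: str   # CB signature without the curly brackets


def parse_short_line(line, sep="\t"):
    """Parse one short-format line; the separator is an assumption."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) < 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    return LpsRecord(
        name=fields[0],
        bias_type=fields[1],
        rank=int(fields[2]),
        start=int(fields[3]),
        end=int(fields[4]),
        bias_count=int(fields[5]),
        p_value=float(fields[6]),
        signature=fields[7].strip("{}"),
    )


# Invented example line, not real fLPS output.
record = parse_short_line("EXAMPLE_PROT\tSINGLE\t1\t10\t40\t18\t1.5e-12\t{C}")
print(record)
```

Stripping the curly brackets from the signature makes it easier to compare signatures across proteins; check the delimiter against real output before relying on this.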

**Fig. 3**
Timing. All three of the protein data sets analyzed here were downloaded from the UniProt website in August 2016 [11]. All programs were run on an Intel Xeon E5-1650 v2 CPU @ 3.5 GHz in an Apple Mac Pro with 32 GB of RAM, running Mac OS X version 10.12.6. The program fLPS was tested with different maximum window size (M) values, with all other parameters set to defaults (dark grey bars). The CPU time in seconds is the sum of the user time and system time. For tests on the yeast (*Saccharomyces cerevisiae*) proteome (5782 sequences) and SwissProt (551,705 sequences) the time depicted is the mean from ten runs. For TrEMBL (65,378,749 sequences), the time for just one run is shown. Timings for SEG (light grey bars, default parameter values, and two other recommended parameter sets except for the TrEMBL run) are provided for comparison [4]

**Fig. 4**
Amount of annotated bias. (a) Here, fLPS was run with the parameters listed, with all other parameters set to defaults. The total numbers of residues in multiple-residue LPSs (dark grey bars) and in single-residue regions (white bars) are shown. The total number of residues annotated as low-complexity by SEG (light grey, with default parameters, and up to two other recommended parameter sets) is provided for approximate comparison. (b) The fractions of the proteins from the databases that are annotated with CB. Programs are run as for part (a) and for Fig. 3

**Fig. 5**
Behaviour of the algorithm with different M values. Here, as an example, I use the annotation of multiple-residue LPSs in the protein RNQ1_YEAST (Rnq1p, which underlies the [RNQ+]/[PIN+] prion in *S. cerevisiae* [19]). fLPS has been run with different maximum window size (M) values, and with other parameters set to defaults. With a sufficiently large M (≥80), one large LPS is annotated with signature {QNSG}. For smaller M values, this LPS is broken into smaller LPSs, as depicted by the boxes at the bottom of the figure. The endpoints of LPSs are numbered at the ends of a box. At the top of the figure, the LPS (for M ≥ 80) is highlighted in orange within the RNQ1_YEAST sequence. The endpoints of LPSs for different M values are labelled above the orange text, with the first numeral of the residue position aligned to the position in the sequence

**Fig. 6**
Most common short biased tracts in TrEMBL. The fifty most common CB regions of ≤100 residues in length and binomial P-value ≤1e-10, from the run of fLPS on TrEMBL with M = 25 and all other parameters at default values. The sequence names, binomial P-values and signatures of four random examples are shown, along with the LPSs delimited by ‘|’ with up to ten residues of sequence context at either end. TrEMBL was downloaded from the UniProt website in August 2016 [11]
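The selection described in this caption (regions of ≤100 residues with binomial P-value ≤1e-10, then the fifty most common) could be reproduced roughly as below. This sketch assumes that regions are grouped by their CB signature and that start and end positions are 1-based and inclusive; neither point is stated in the caption. The function name `common_signatures` and the input tuples are hypothetical.

```python
from collections import Counter


def common_signatures(records, max_len=100, max_p=1e-10, top=50):
    """Count CB signatures among regions that pass the length and
    P-value cut-offs; records are (start, end, p_value, signature)."""
    counts = Counter(
        signature
        for start, end, p_value, signature in records
        if end - start + 1 <= max_len and p_value <= max_p
    )
    return counts.most_common(top)


# Invented records for illustration only.
records = [
    (1, 20, 1e-12, "Q"),
    (5, 30, 1e-15, "Q"),
    (1, 200, 1e-30, "QN"),  # too long
    (1, 10, 1e-5, "S"),     # P-value too high
    (3, 50, 1e-11, "SG"),
]
print(common_signatures(records))
```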

Cited by

A unified view of low complexity regions (LCRs) across species.
Lee B, Jaberi-Lashkari N, Calo E. Elife. 2022 Sep 13;11:e77058. doi: 10.7554/eLife.77058. PMID: 36098382. Free PMC article.
fLPS 2.0: rapid annotation of compositionally-biased regions in biological sequences.
Harrison PM. PeerJ. 2021 Oct 28;9:e12363. doi: 10.7717/peerj.12363. eCollection 2021. PMID: 34760378. Free PMC article.
Integrative analysis and prediction of human R-loop binding proteins.
Kumar A, Fournier LA, Stirling PC. G3 (Bethesda). 2022 Jul 29;12(8):jkac142. doi: 10.1093/g3journal/jkac142. PMID: 35666183. Free PMC article.
Feature architecture aware phylogenetic profiling indicates a functional diversification of type IVa pili in the nosocomial pathogen Acinetobacter baumannii.
Iruegas R, Pfefferle K, Göttig S, Averhoff B, Ebersberger I. PLoS Genet. 2023 Jul 27;19(7):e1010646. doi: 10.1371/journal.pgen.1010646. eCollection 2023 Jul. PMID: 37498819. Free PMC article.
A Census of Human Methionine-Rich Prion-like Domain-Containing Proteins.
Aledo JC. Antioxidants (Basel). 2022 Jun 29;11(7):1289. doi: 10.3390/antiox11071289. PMID: 35883780. Free PMC article.

References

1. An L, Fitzpatrick D, Harrison PM. Emergence and evolution of yeast prion and prion-like proteins. BMC Evol Biol. 2016;16:24. doi: 10.1186/s12862-016-0594-3.
2. An L, Harrison PM. The evolutionary scope and neurological disease linkage of yeast-prion-like proteins in humans. Biol Direct. 2016;11:32. doi: 10.1186/s13062-016-0134-5.
3. Harbi D, Kumar M, Harrison PM. LPS-annotate: complete annotation of compositionally biased regions in the protein knowledgebase. Database (Oxford). 2011;2011:baq031. doi: 10.1093/database/baq031.
4. Wootton JC, Federhen S. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 1996;266:554–571. doi: 10.1016/S0076-6879(96)66035-2.
5. Promponas VJ, Enright AJ, Tsoka S, Kreil DP, Leroy C, Hamodrakas S, Sander C, Ouzounis CA. CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts. Bioinformatics. 2000;16(10):915–922. doi: 10.1093/bioinformatics/16.10.915.
