UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches
Thu Jan 01 2015
Baris E Suzek et al. Bioinformatics. 2015.
Abstract
Motivation: UniRef databases provide full-scale clustering of UniProtKB sequences and are utilized for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap threshold. Our hypothesis is that these improvements would enhance the speed and sensitivity of similarity searches and improve the consistency of annotation within clusters.
Results: Intra-cluster molecular function consistency was examined by analysis of Gene Ontology terms. Results show that UniRef clusters bring together proteins of identical molecular function in more than 97% of the clusters, implying that clusters are useful for annotation and can also be used to detect annotation inconsistencies. To examine coverage in similarity results, BLASTP searches against UniRef50 followed by expansion of the hit lists with cluster members demonstrated advantages compared with searches against UniProtKB sequences: the searches are more concise (∼7 times shorter hit list before expansion), faster (∼6 times) and more sensitive in detection of remote similarities (>96% recall at e-value <0.0001). Our results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforce their reliability for use in functional annotation.
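The expansion step described above (search the representatives, then replace each hit with the members of its cluster) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `expand_hits`, the cluster IDs and the accessions are all hypothetical, and the cluster membership map is assumed to be already loaded into memory.

```python
from typing import Dict, Iterable, List


def expand_hits(representative_hits: Iterable[str],
                cluster_members: Dict[str, List[str]]) -> List[str]:
    """Replace each cluster-level hit with all members of that cluster.

    The order of the original hit list is preserved and duplicates are dropped.
    """
    expanded: List[str] = []
    seen = set()
    for rep in representative_hits:
        # Assumption: a hit with no entry in the map is kept as-is (singleton).
        for member in cluster_members.get(rep, [rep]):
            if member not in seen:
                seen.add(member)
                expanded.append(member)
    return expanded


# Hypothetical toy data; these IDs are invented for illustration only.
clusters = {
    "UniRef50_A": ["P00001", "P00002", "P00003"],
    "UniRef50_B": ["P00004"],
}
hits = ["UniRef50_A", "UniRef50_B", "UniRef50_C"]
expanded = expand_hits(hits, clusters)
print(expanded)
```

The short hit list is what the search returns; the expanded list is what would be compared against a direct UniProtKB search.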
© The Author 2014. Published by Oxford University Press.
Figures

The categories of UniRef clusters based on intra-cluster functional consistency. The upper left panel shows an example of the GO term hierarchy used. The other panels illustrate the UniRef clusters in categories based on their intra-cluster consistency: I (all members have identical GO terms), II-1 (all members share common GO terms and some have additional less or equally specific GO terms that are not children of the shared GO terms), II-2 (all members share common GO terms and some have additional more specific GO terms), III (only some members share common GO terms, but all members' GO terms can be traced to a common non-root parent GO term, which is a child of one of the shared GO terms) and IV (members do not have any common GO term and the existing ones cannot be traced to a common non-root parent GO term)
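One way to read these categories is as a decision procedure over the GO term sets of a cluster's members. The sketch below is a simplified interpretation, not the authors' implementation. It assumes every member has at least one GO term, uses a made-up toy hierarchy (`root`, `A`, `A1`, `A2`, `B`), assigns II-2 only when every additional term descends from a shared term, and for category III checks only for a common non-root ancestor (counting a term as its own ancestor).

```python
from typing import Dict, List, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return all ancestors of a term (excluding the term itself)."""
    found: Set[str] = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def classify_cluster(member_terms: List[Set[str]],
                     parents: Dict[str, Set[str]],
                     root: str) -> str:
    """Assign a cluster to category I, II-1, II-2, III or IV (simplified)."""
    if all(terms == member_terms[0] for terms in member_terms):
        return "I"
    shared = set.intersection(*member_terms)
    if shared:
        extras = set.union(*member_terms) - shared
        # Assumption: II-2 requires every extra term to descend from a shared term.
        if all(ancestors(term, parents) & shared for term in extras):
            return "II-2"
        return "II-1"
    # No shared term: look for a non-root ancestor common to every term.
    lineages = [ancestors(term, parents) | {term}
                for terms in member_terms for term in terms]
    common = set.intersection(*lineages) - {root}
    return "III" if common else "IV"


# Hypothetical toy hierarchy: child -> set of direct parents.
toy_parents = {"A": {"root"}, "B": {"root"}, "A1": {"A"}, "A2": {"A"}}

print(classify_cluster([{"A1"}, {"A1"}], toy_parents, "root"))
print(classify_cluster([{"A"}, {"A", "A1"}], toy_parents, "root"))
print(classify_cluster([{"A"}, {"A", "B"}], toy_parents, "root"))
print(classify_cluster([{"A1"}, {"A2"}], toy_parents, "root"))
print(classify_cluster([{"A1"}, {"B"}], toy_parents, "root"))
```

A real analysis would load the hierarchy from a GO release and would need a rule for clusters whose additional terms are a mix of more and less specific ones.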

Example UniProtKB/Swiss-Prot (query) and UniProtKB (target) pairs for distant similarity detection analysis, where Pfam domains common to query and targets span more than 80% of target protein sequences

Growth of UniRef databases and UniProt Knowledgebase

The size distribution of UniRef clusters follows a power law distribution

Distribution of UniRef90 cluster specificity for the complete set of clusters (top bars) and for those containing only model organisms (bottom bars). UniRef50 clusters follow a similar distribution

Precision and recall (Equations (2) and (3)) of UniRef50-based BLASTP searches expanded using cluster memberships at different e-value thresholds
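Equations (2) and (3) are not reproduced in this excerpt, so the sketch below falls back on the standard set-based definitions of precision and recall, applied at several e-value cut-offs. The reference set, the accessions and the e-values are hypothetical, and using a fixed reference set across thresholds is an assumption of this example rather than a statement about the paper's protocol.

```python
from typing import Dict, Set, Tuple


def hits_below(evalues: Dict[str, float], threshold: float) -> Set[str]:
    """Return accessions whose e-value is strictly below the threshold."""
    return {acc for acc, evalue in evalues.items() if evalue < threshold}


def precision_recall(predicted: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Standard definitions: TP / |predicted| and TP / |reference|."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical expanded hit list (accession -> e-value) and reference hits.
expanded_evalues = {"P1": 1e-30, "P2": 1e-6, "P3": 1e-3, "P4": 0.5}
reference_hits = {"P1", "P2", "P3"}

for threshold in (1e-4, 1e-2, 1.0):
    predicted = hits_below(expanded_evalues, threshold)
    precision, recall = precision_recall(predicted, reference_hits)
    print(f"e-value < {threshold:g}: precision={precision:.2f}, recall={recall:.2f}")
```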

The percentage difference in distant similarities detected by UniRef50- versus UniProtKB-based searches based on the dataset constructed using Pfam domains

ROC50 values for UniRef50- versus UniProtKB-based searches based on the dataset constructed using Pfam domains
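For readers unfamiliar with ROC50, the sketch below uses a commonly cited definition of the ROC-n score: the number of true positives ranked ahead of each of the first n false positives is summed, then divided by n times the total number of true positives. The excerpt does not state how the paper computed it, so the padding rule (when fewer than n false positives are found) and the function name `roc_n` are assumptions of this example.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], total_positives: int, n: int = 50) -> float:
    """ROC-n score for a ranked hit list (True = true positive, best hit first)."""
    if total_positives == 0 or n <= 0:
        return 0.0
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked above this false positive
            if false_seen == n:
                break
    # Assumption: if the list ends before n false positives, the remaining
    # slots are credited with all true positives found so far.
    area += (n - false_seen) * true_seen
    return area / (n * total_positives)


# Hypothetical ranking: two true hits, one false, one true, one false.
score = roc_n([True, True, False, True, False], total_positives=3, n=2)
print(round(score, 4))
```

A score of 1.0 means every true positive ranks above the first false positive; lower scores mean false positives appear earlier in the list.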