The Pfam protein families database

Nucleic Acids Res. 2012 Jan;40(Database issue):D290-301.

doi: 10.1093/nar/gkr1065. Epub 2011 Nov 29.

Marco Punta, Penny C Coggill, Ruth Y Eberhardt, Jaina Mistry, John Tate, Chris Boursnell, Ningze Pang, Kristoffer Forslund, Goran Ceric, Jody Clements, Andreas Heger, Liisa Holm, Erik L L Sonnhammer, Sean R Eddy, Alex Bateman, Robert D Finn

PMID: 22127870
PMCID: PMC3245129
DOI: 10.1093/nar/gkr1065

Marco Punta et al. Nucleic Acids Res. 2012 Jan.

Abstract

Pfam is a widely used database of protein families, currently containing more than 13,000 manually curated protein families as of release 26.0. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/). Here, we report on changes that have occurred since our 2010 NAR paper (release 24.0). Over the last 2 years, we have generated 1840 new families and increased coverage of the UniProt Knowledgebase (UniProtKB) to nearly 80%. Notably, we have taken the step of opening up the annotation of our families to the Wikipedia community, by linking Pfam families to relevant Wikipedia pages and encouraging the Pfam and Wikipedia communities to improve and expand those pages. We continue to improve the Pfam website and add new visualizations, such as the 'sunburst' representation of taxonomic distribution of families. In this work we additionally address two topics that will be of particular interest to the Pfam community. First, we explain the definition and use of family-specific, manually curated gathering thresholds. Second, we discuss some of the features of domains of unknown function (also known as DUFs), which constitute a rapidly growing class of families within Pfam.

Figures

**Figure 1.**
New Pfam features since release 24.0. (A) The Pfam-A family page for Avidin (PF01382), showing the embedded contents of the associated Wikipedia article. The ‘infobox’ is highlighted. (B) The ‘sunburst’ representation of the tree showing the species distribution of the Pfam-A family Peptidase_M10 (PF00413). (C) The PfamAlyzer applet, showing the results of searching for all architectures that include the domains IMPDH and CBS. The PfamAlyzer applet allows querying of Pfam for proteins with particular domains, domain combinations or architectures.

**Figure 2.**
Pfam users in the world. A world map showing usage of the Pfam website at the Wellcome Trust Sanger Institute, UK. Usage statistics were obtained from our Google Urchin tracking database and plotted using the Google map API. Circle size is proportional to the number of visits from each country for those with >5000 visits. Countries contributing <5000 visits are all shown with the same-sized marker. Data refer to the period between 1 and 30 June 2011.

**Figure 3.**
Heat map showing sequence gathering threshold (GA) changes between Pfam releases 24.0 and 26.0. Yellow squares represent high density; red squares represent low density. Squares on the diagonal correspond to GAs that are unchanged; squares in the region above the diagonal are GAs that have increased; and squares below the diagonal are GAs that have decreased. For the sake of clarity, we chose to show a zoomed-in version of the complete plot, which also includes a number of points outside of the range seen here. The plot was created using R (21).

**Figure 4.**
Distribution of sequence gathering (GA) thresholds and of corresponding E-values. (A) Distribution of sequence GAs for all Pfam-A families. Note that intervals are such that, for example, ‘25–26’ translates into 25 ≤ sequence GA(bits) < 26. (B) Same as the histogram in panel (A), with log10(E-values) in place of GAs. E-values are calculated from GAs according to the following formula: E = N × exp[−λ·(x − τ)], where x is the bit score GA, λ and τ are parameters derived from the HMM model (λ is the slope parameter, τ is the location parameter) and N is the database size (in this case the size of UniProtKB) (22). (C) Box-plot of all Pfam families’ GAs (left side; median = 22.1, 25th percentile = 20.8, 75th percentile = 25.0), and for all families excluding those where both sequence and domain thresholds equal 25.0 or 27.0 (right side; median = 21.2, 25th percentile = 20.6, 75th percentile = 22.8). (D) Same as (C) with log10(E-values) in place of GAs. E-values calculated as in panel (B). Left side: median = 0.096, 25th percentile = 0.012, 75th percentile = 0.24. Right side: median = 0.18, 25th percentile = 0.057, 75th percentile = 0.27. Note that values reported here for median and percentiles are for E-values and not log10(E-values).
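The conversion from a gathering threshold to an E-value described in panel (B) can be sketched as follows. This is a minimal illustration of the formula E = N × exp[−λ·(x − τ)] only; the function name and all numeric values (λ, τ and the database size) are hypothetical placeholders, not parameters taken from any real Pfam HMM or UniProtKB release.

```python
import math


def ga_to_evalue(bit_score, lam, tau, db_size):
    """Convert a bit-score threshold to an E-value.

    Implements E = N * exp(-lam * (x - tau)), where x is the bit score,
    lam is the slope parameter, tau is the location parameter and
    N is the database size.
    """
    return db_size * math.exp(-lam * (bit_score - tau))


if __name__ == "__main__":
    # Hypothetical, illustrative values only (not from a real Pfam model).
    lam = 0.7            # assumed slope parameter
    tau = -5.0           # assumed location parameter
    db_size = 1_000_000  # assumed number of sequences in the database

    for ga in (20.0, 22.0, 25.0):
        e = ga_to_evalue(ga, lam, tau, db_size)
        print(f"GA = {ga:5.1f} bits  E = {e:.3g}  log10(E) = {math.log10(e):.2f}")
```

Because the score enters through a decaying exponential, raising the threshold by a fixed number of bits lowers the E-value by a fixed factor, which is why the histogram in panel (B) is plotted on a log10 scale.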

**Figure 5.**
DUF families’ statistics. (A) Comparison between number of DUFs added (blue) and number of DUFs renamed or otherwise removed (red) since Pfam 22.0 (data shown for releases 23.0–26.0, as indicated by labels on the graph). (B) Number of PIR representative clusters of genomes (23) in DUF families. We used Representative Proteomes version 2.0, comprising a total of 671 clusters for a 35% membership cut-off. (C) Co-occurrence between DUFs and other families. The term ‘architecture’ refers to a combination of families occurring within the same protein sequence. Note that we only considered architectures with at least five member sequences. (D) DUF families and protein structure. ‘Families that have structure’ means that a PDB structure is available for a member of the family; ‘families in a clan that has structure’ means that a PDB structure is available for a member of the same clan.

Cited by

Genome-Wide Identification of the Maize Chitinase Gene Family and Analysis of Its Response to Biotic and Abiotic Stresses.
Wang T, Wang C, Liu Y, Zou K, Guan M, Wu Y, Yue S, Hu Y, Yu H, Zhang K, Wu D, Du J. Genes (Basel). 2024 Oct 15;15(10):1327. doi: 10.3390/genes15101327. PMID: 39457451.
Evidence for suppression of immunity as a driver for genomic introgressions and host range expansion in races of Albugo candida, a generalist parasite.
McMullan M, Gardiner A, Bailey K, Kemen E, Ward BJ, Cevik V, Robert-Seilaniantz A, Schultz-Larsen T, Balmuth A, Holub E, van Oosterhout C, Jones JD. Elife. 2015 Feb 27;4:e04550. doi: 10.7554/eLife.04550. PMID: 25723966.
Insight into neutral and disease-associated human genetic variants through interpretable predictors.
van den Berg BA, Reinders MJ, de Ridder D, de Beer TA. PLoS One. 2015 Mar 31;10(3):e0120729. doi: 10.1371/journal.pone.0120729. eCollection 2015. PMID: 25826299.
Evolutionary Remodeling of the Cell Envelope in Bacteria of the Planctomycetes Phylum.
Mahajan M, Seeger C, Yee B, Andersson SGE. Genome Biol Evol. 2020 Sep 1;12(9):1528-1548. doi: 10.1093/gbe/evaa159. PMID: 32761170.
Structures of apo- and ssDNA-bound YdbC from Lactococcus lactis uncover the function of protein domain family DUF2128 and expand the single-stranded DNA-binding domain proteome.
Rossi P, Barbieri CM, Aramini JM, Bini E, Lee HW, Janjua H, Xiao R, Acton TB, Montelione GT. Nucleic Acids Res. 2013 Feb 1;41(4):2756-68. doi: 10.1093/nar/gks1348. Epub 2013 Jan 8. PMID: 23303792.

References

1. Heger A, Holm L. Exhaustive enumeration of protein domain families. J. Mol. Biol. 2003;328:749–767.
2. The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010;38:D142–D148.
3. Coggill P, Finn RD, Bateman A. Identifying protein domains with the Pfam database. Curr. Protoc. Bioinformatics. 2008;Chapter 2 Unit 2 5.
4. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, et al. The Pfam protein families database. Nucleic Acids Res. 2010;38:D211–D222.
5. Daub J, Gardner PP, Tate J, Ramskold D, Manske M, Scott WG, Weinberg Z, Griffiths-Jones S, Bateman A. The RNA WikiProject: community annotation of RNA families. RNA. 2008;14:2462–2464.
