The Pfam protein families database - PubMed
Nucleic Acids Res. 2012 Jan;40(Database issue):D290-301.
doi: 10.1093/nar/gkr1065. Epub 2011 Nov 29.
Marco Punta, Penny C Coggill, Ruth Y Eberhardt, Jaina Mistry, John Tate, Chris Boursnell, Ningze Pang, Kristoffer Forslund, Goran Ceric, Jody Clements, Andreas Heger, Liisa Holm, Erik L L Sonnhammer, Sean R Eddy, Alex Bateman, Robert D Finn
- PMID: 22127870
- PMCID: PMC3245129
- DOI: 10.1093/nar/gkr1065
The Pfam protein families database
Marco Punta et al. Nucleic Acids Res. 2012 Jan.
Abstract
Pfam is a widely used database of protein families, currently containing more than 13,000 manually curated protein families as of release 26.0. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/). Here, we report on changes that have occurred since our 2010 NAR paper (release 24.0). Over the last 2 years, we have generated 1840 new families and increased coverage of the UniProt Knowledgebase (UniProtKB) to nearly 80%. Notably, we have taken the step of opening up the annotation of our families to the Wikipedia community, by linking Pfam families to relevant Wikipedia pages and encouraging the Pfam and Wikipedia communities to improve and expand those pages. We continue to improve the Pfam website and add new visualizations, such as the 'sunburst' representation of taxonomic distribution of families. In this work we additionally address two topics that will be of particular interest to the Pfam community. First, we explain the definition and use of family-specific, manually curated gathering thresholds. Second, we discuss some of the features of domains of unknown function (also known as DUFs), which constitute a rapidly growing class of families within Pfam.
Figures

New Pfam features since release 24.0. (A) The Pfam-A family page for Avidin (PF01382), showing the embedded contents of the associated Wikipedia article. The ‘infobox’ is highlighted. (B) The ‘sunburst’ representation of the tree showing the species distribution of the Pfam-A family Peptidase_M10 (PF00413). (C) The PfamAlyzer applet, showing the results of searching for all architectures that include the domains IMPDH and CBS. The PfamAlyzer applet allows querying of Pfam for proteins with particular domains, domain combinations or architectures.

Pfam users in the world. A world map showing usage of the Pfam website at the Wellcome Trust Sanger Institute, UK. Usage statistics were obtained from our Google Urchin tracking database and plotted using the Google map API. Circle size is proportional to the number of visits from each country for those with >5000 visits. Countries contributing <5000 visits are all shown with the same sized marker. Data refer to the period between 1 and 30 June 2011.

Heat map showing sequence gathering threshold (GA) changes between Pfam releases 24.0 and 26.0. Yellow squares represent high density; red squares represent low density. Squares on the diagonal correspond to GAs that are unchanged; squares in the region above the diagonal are GAs that have increased; and squares below the diagonal are GAs that have decreased. For the sake of clarity, we chose to show a zoomed-in version of the complete plot, which also includes a number of points outside of the range seen here. The plot was created using R (21).
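
The diagonal reading of the heat map can be sketched in a few lines. This is only an illustration: it assumes the release 24.0 GA is on the horizontal axis and the release 26.0 GA on the vertical axis, and the family names and GA values are made up, not taken from Pfam.

```python
def classify_ga_change(ga_old, ga_new, tol=1e-9):
    """Classify one family's sequence GA change between two releases.

    ga_old: GA in the earlier release (assumed horizontal axis), in bits.
    ga_new: GA in the later release (assumed vertical axis), in bits.
    """
    if abs(ga_new - ga_old) <= tol:
        return "unchanged"  # point lies on the diagonal
    # Above the diagonal means the newer GA is larger than the older one.
    return "increased" if ga_new > ga_old else "decreased"


# Hypothetical (old GA, new GA) pairs in bits; not real Pfam values.
example_gas = {
    "family_a": (25.0, 25.0),
    "family_b": (20.5, 22.0),
    "family_c": (27.0, 21.3),
}

for name, (old, new) in example_gas.items():
    print(name, classify_ga_change(old, new))
```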

Distribution of sequence gathering (GA) thresholds and of corresponding E-values. (A) Distribution of sequence GAs for all Pfam-A families. Note that intervals are such that, for example, ‘25–26’ translates into 25 ≤ sequence GA(bits) < 26. (B) Same as the histogram in panel (A), with log10(E-values) in place of GAs. E-values are calculated from GAs according to the following formula: E = N × exp[−λ·(x − τ)], where x is the bit score GA, λ and τ are parameters derived from the HMM model (λ is the slope parameter, τ is the location parameter) and N is the database size (in this case the size of UniProtKB) (22). (C) Box-plot of all Pfam families’ GAs (left side; median = 22.1, 25th percentile = 20.8, 75th percentile = 25.0), and for all families excluding those where both sequence and domain thresholds equal 25.0 or 27.0 (right side; median = 21.2, 25th percentile = 20.6, 75th percentile = 22.8). (D) Same as (C) with log10(E-values) in place of GAs. E-values calculated as in panel (B). Left side: median = 0.096, 25th percentile = 0.012, 75th percentile = 0.24. Right side: median = 0.18, 25th percentile = 0.057, 75th percentile = 0.27. Note that values reported here for median and percentiles are for E-values and not log10(E-values).
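
The interval convention in panel (A) and the E-value formula in panel (B) can be written out as a short sketch. The function names are hypothetical, and the values of λ, τ and N below are placeholders chosen only for illustration; in practice λ and τ come from each family's HMM and N from the size of the searched database.

```python
import math


def ga_bin_label(ga_bits):
    """Return the histogram interval for a GA, e.g. '25–26' for 25 <= GA < 26."""
    lower = math.floor(ga_bits)
    return f"{lower}–{lower + 1}"


def evalue_from_ga(x, lam, tau, n):
    """E = N * exp(-lambda * (x - tau)), with x the bit score GA."""
    return n * math.exp(-lam * (x - tau))


def log10_evalue_from_ga(x, lam, tau, n):
    """log10 of the same E-value, computed without forming E itself."""
    return math.log10(n) - lam * (x - tau) / math.log(10)


# Placeholder parameters (assumptions, not Pfam or UniProtKB values).
LAM = 0.69   # slope parameter
TAU = -4.0   # location parameter
N = 1.0e7    # database size

for ga in (20.8, 22.1, 25.0):
    print(ga_bin_label(ga), evalue_from_ga(ga, LAM, TAU, N))
```

Because the exponent is linear in x, a higher GA always gives a smaller E-value for fixed λ, τ and N, and at x = τ the formula returns N.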

DUF families’ statistics. (A) Comparison between number of DUFs added (blue) and number of DUFs renamed or otherwise removed (red) since Pfam 22.0 (data shown for releases 23.0–26.0, as indicated by labels on the graph). (B) Number of PIR representative clusters of genomes (23) in DUF families. We used Representative Proteomes version 2.0, comprising a total of 671 clusters for a 35% membership cut-off. (C) Co-occurrence between DUFs and other families. The term ‘architecture’ refers to a combination of families occurring within the same protein sequence. Note that we only considered architectures with at least five member sequences. (D) DUF families and protein structure. ‘Families that have structure’ means that a PDB structure is available for a member of the family; ‘families in a clan that has structure’ means that a PDB structure is available for a member of the same clan.
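
The architecture filter mentioned in panel (C) can be illustrated as follows. This is a minimal sketch under stated assumptions: an architecture is represented as the ordered tuple of families found on one protein, and the protein identifiers and the family name "DUF_X" are hypothetical.

```python
from collections import Counter


def frequent_architectures(protein_to_families, min_members=5):
    """Count architectures and keep those with at least min_members sequences."""
    counts = Counter(tuple(fams) for fams in protein_to_families.values())
    return {arch: n for arch, n in counts.items() if n >= min_members}


# Hypothetical annotations: five proteins share one architecture, two another.
proteins = {f"prot{i}": ["DUF_X", "CBS"] for i in range(5)}
proteins.update({f"solo{i}": ["DUF_X"] for i in range(2)})

kept = frequent_architectures(proteins)
print(kept)
```

Only the architecture shared by five proteins passes the cut-off here; the two-member architecture is dropped.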
Similar articles
- The Pfam protein families database. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A. Nucleic Acids Res. 2008 Jan;36(Database issue):D281-8. doi: 10.1093/nar/gkm960. Epub 2007 Nov 26. PMID: 18039703. Free PMC article.
- The Pfam protein families database. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A. Nucleic Acids Res. 2010 Jan;38(Database issue):D211-22. doi: 10.1093/nar/gkp985. Epub 2009 Nov 17. PMID: 19920124. Free PMC article.
- Pfam: clans, web tools and services. Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D247-51. doi: 10.1093/nar/gkj149. PMID: 16381856. Free PMC article.
- Pfam 10 years on: 10,000 families and still growing. Sammut SJ, Finn RD, Bateman A. Brief Bioinform. 2008 May;9(3):210-9. doi: 10.1093/bib/bbn010. Epub 2008 Mar 15. PMID: 18344544. Review.
- Domain of unknown function (DUF) proteins in plants: function and perspective. Luo C, Akhtar M, Min W, Bai X, Ma T, Liu C. Protoplasma. 2024 May;261(3):397-410. doi: 10.1007/s00709-023-01917-8. Epub 2023 Dec 30. PMID: 38158398. Review.
Cited by
- Wang T, Wang C, Liu Y, Zou K, Guan M, Wu Y, Yue S, Hu Y, Yu H, Zhang K, Wu D, Du J. Genes (Basel). 2024 Oct 15;15(10):1327. doi: 10.3390/genes15101327. PMID: 39457451. Free PMC article.
- McMullan M, Gardiner A, Bailey K, Kemen E, Ward BJ, Cevik V, Robert-Seilaniantz A, Schultz-Larsen T, Balmuth A, Holub E, van Oosterhout C, Jones JD. Elife. 2015 Feb 27;4:e04550. doi: 10.7554/eLife.04550. PMID: 25723966. Free PMC article.
- Insight into neutral and disease-associated human genetic variants through interpretable predictors. van den Berg BA, Reinders MJ, de Ridder D, de Beer TA. PLoS One. 2015 Mar 31;10(3):e0120729. doi: 10.1371/journal.pone.0120729. eCollection 2015. PMID: 25826299. Free PMC article.
- Evolutionary Remodeling of the Cell Envelope in Bacteria of the Planctomycetes Phylum. Mahajan M, Seeger C, Yee B, Andersson SGE. Genome Biol Evol. 2020 Sep 1;12(9):1528-1548. doi: 10.1093/gbe/evaa159. PMID: 32761170. Free PMC article.
- Rossi P, Barbieri CM, Aramini JM, Bini E, Lee HW, Janjua H, Xiao R, Acton TB, Montelione GT. Nucleic Acids Res. 2013 Feb 1;41(4):2756-68. doi: 10.1093/nar/gks1348. Epub 2013 Jan 8. PMID: 23303792. Free PMC article.