pubmed.ncbi.nlm.nih.gov

A probabilistic model of local sequence alignment that simplifies statistical significance estimation - PubMed

  • ️Tue Jan 01 2008

A probabilistic model of local sequence alignment that simplifies statistical significance estimation

Sean R Eddy. PLoS Comput Biol. 2008.

Abstract

Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (lambda) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty ("Forward" scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores ("Viterbi" scores) are Gumbel-distributed with constant lambda = log 2, and the high scoring tail of Forward scores is exponential with the same constant lambda. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.

PubMed Disclaimer

Conflict of interest statement

The author has declared that no competing interests exist.

Figures

Figure 1
Figure 1. A generative probabilistic model of local alignment.

(A) an example of a core model with five consensus positions. Each consensus position of the query is modeled by a node containing three states: a match state (M) that emits the consensus position (squares), a mute delete state (D) that emits nothing and deletes the consensus position (circles), and an insert state (I) that emits one or more residues after the consensus position (diamonds). For clarity, the emission probability distributions on match and insert states have been omitted in the figure. Nodes are numbered 1…N for a query of length N consensus positions. The three states in each node have seven transition probabilities (arrows), implementing a probabilistic model of traditional sequence alignment with affine gap penalties: the MD and MI probabilities correspond to gap-open costs, and DD and II probabilities correspond to gap-extend costs. The core model starts and ends with mute begin (B) and end (E) states (circles). Bold arrows indicate the consensus (all match) path through the model. Blue states and transitions are either modified or removed in a configured search profile. (B) The search profile, with extra states and state transitions (magenta) enabling a model of local or glocal alignment, and unihit versus multihit alignment, as described in the text. States N, C, and J emit on transition (indicated by x's on their transition arrows), in order to be able to generate ≥0 residues rather than ≥1. (C) A partial view of the implicit probabilistic model, showing three of the possible 15 ij query segments (1…5, 2…4, and 3) for an N = 5-node model, with uniform entry probabilities of formula image and exit probabilities of 1.0. The presence of the remaining 12 query fragments in the model is indicated by dashed entry/exit transitions and vertical ellipses.

Figure 2
Figure 2. Viterbi scores follow Gumbel distributions with constant λ.

(A) A histogram showing the distribution of formula image estimates determined by maximum likelihood Gumbel fits to multihit local Viterbi scores of n = 105 i.i.d random sequences of length L = 400, for 9318 profile HMMs built from Pfam 22.0 seed alignments. The sharp black peak is from prototype HMMER3, with mean 0.6928 and standard deviation 0.0114, and extreme outliers indicated by arrows. The broader grey histogram is from old HMMER2, for comparison. The conjectured λ = log 2 is shown as a vertical dotted red line. (B,C) log survival plots (P(V>t) on a log scale, versus score threshold t) showing observed versus expected distributions for multihit local Viterbi scores for two typical Pfam models, RRM_1 and Caudal_act, for n = 108 i.i.d. random sequences of length L = 400. On a log survival plot, the high-scoring tail of a Gumbel distribution is a straight line with slope −λ. Black circles show the observed data. The black lines show maximum likelihood fitted Gumbel distributions, with formula image estimates as indicated. The red lines show the conjectured λ = log 2 Gumbel distributions, with μ fitted by maximum likelihood. (D,E) log survival plots for the extreme outliers DUF851 and Sulfakinin, as described in the text.

Figure 3
Figure 3. Forward scores follow exponential tails with constant λ.

(A) a graph showing how formula image estimates for exponential distributions asymptote towards the conjectured log 2 as the fitted tail mass decreases. Each open circle is the mean of 9,318 formula image estimates, one for each Pfam 22.0 model, fitted by maximum likelihood to the high-scoring tail of multihit local Forward scores for n = 105 i.i.d random sequences of length L = 400, with varying tail mass from 1.0 to 0.0001. Variation in formula image is represented by plotting quartiles (black bars) and most extreme outliers (grey triangles) in addition to the means. formula image approaches the conjectured log 2 as fitted tail mass decreases, but beyond a certain point, variance increases. A tail mass of 0.001 was chosen as an appropriate tradeoff, and the mean (0.6993), standard deviation (0.0338), and outliers at that choice are annotated. (B,C) log survival plots showing observed (black circles) and expected (red lines) P(F>t) distributions versus score t for multihit local Forward scores for the “typical” RRM_1 and Caudal_act models, for n = 108 i.i.d. random sequences of length L = 400. On a log survival plot, an exponential tail is a straight line of slope −λ. Black circles show the observed data. The black lines show maximum likelihood exponentials fitted to the 0.001 high-scoring tail, with formula image estimates as indicated. The red lines show the conjectured λ = log 2 exponential tails. (D,E) log survival plots for the extreme outliers Ribosomal_L12 and XYPPX.

Figure 4
Figure 4. Target length modeling makes distributions length-independent.

Log survival plots for multihit local Viterbi scores (left; [A,B,E,F]) and multihit local Forward scores (right; [C,D,G,H]) for the two “typical” models RRM_1 and Caudal_act, for n = 106 i.i.d. random sequences of various lengths, for either old HMMER2 scoring (top; [A–D]) or the new target length model in prototype HMMER3 (bottom; [E–H]). Eleven target sequence lengths are used, ranging from 25 to 25,600 in steps of two-fold, with L = 25 shown in red, L = 400 shown in black, L = 25,600 shown in cyan, and other lengths shown in grey. Each line is the observed log survival plot, collected in 0.1 bit intervals. The grey inset in the HMMER2 Viterbi scores (A,B) shows the length dependence predicted by Karlin/Altschul statistics, with location increasing by one bit for each doubling in target sequence length. The HMMER3 results (bottom; [E–H]) show that both Viterbi and Forward scores are essentially independent of target sequence length in the new parameterization of the target length model, even for the previously problematic multihit Forward scores.

Figure 5
Figure 5. Accuracy of E-value determination.

Plots of predicted E-value versus actual rank, for multihit local Viterbi scores (A,C) and multihit local Forward scores (B,D), using models with either the standard profile HMM multinomial parameterization used in the rest of the paper (A,B) or “entropy-weighted” models of reduced information content (C,D). Each plotted point (open circles) is the mean of 9,318 profile HMM searches of n = 105 target sequences of three different target lengths: L = 100 (red), L = 400 (black), and L = 1,600 (cyan). The extreme outliers for each point are shown by squares and dotted vertical lines. (Interquartile ranges are smaller than the circles plotted for the means.) The expected result, of E-value equal to observed rank, is shown as a black line. Displayed text shows means and standard deviations for predicted E-values of the top-ranked score in each search, which should be (and is) about 1.0.

Similar articles

Cited by

References

    1. Durbin R, Eddy SR, Krogh A, Mitchison GJ. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge, UK: Cambridge University Press; 1998. p. 356.
    1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res. 1997;25:3389–3402. - PMC - PubMed
    1. Krogh A, Brown M, Mian IS, Sjölander K, Haussler D. Hidden Markov models in computational biology: applications to protein modeling. J Mol Biol. 1994;235:1501–1531. - PubMed
    1. Altschul SF, Boguski MS, Gish W, Wooton JC. Issues in searching molecular sequence databases. Nature Genetics. 1994;6:119–129. - PubMed
    1. Mitrophanov AY, Borodovsky M. Statistical significance in biological sequence analysis. Brief Bioinform. 2006;7:2–24. - PubMed

Publication types

MeSH terms