bmcbioinformatics.biomedcentral.com

Dinucleotide controlled null models for comparative RNA gene prediction - BMC Bioinformatics

  • ️Washietl, Stefan
  • ️Tue May 27 2008

Requirements for an accurate null model

An optimal null model preserves all the features of the original data with the exception of the signal under question that needs to be removed efficiently. In our case, the data are multiple alignments of homologous sequences and the signal of interest is an evolved RNA secondary structure. Correlations arising from base-pairing patterns need to be removed. Currently, alignments are usually randomized by shuffling the alignment columns (see ref. 5 for a discussion of this method). Although the shuffling approach has its limitations and considering dinucleotides seems difficult, it is an appealing approach because it is relatively simple, fast, and extremely conservative. Changing the order of the columns does not change the mutational patterns within the columns and thus the underlying phylogenetic tree is exactly preserved.

In this paper we attempt to simulate new alignments from scratch. Even the most sophisticated model cannot capture all evolutionary processes and therefore a simulation approach will inevitably change the original data more than shuffling does. So much care has to be taken to preserve all the relevant characteristics of the data. To qualitatively assess the most important parameters that need to be considered in our model, we performed a series of simulation experiments. Using a simple tree with four taxa we simulated alignments under the HKY evolutionary model [25]. We systematically varied model and tree parameters to study how they affect thermodynamic RNA consensus structure predictions in the alignment. We used RNAalifold [26] to predict consensus secondary structures which is the basis of the AlifoldZ [5] and RNAz [6] gene finders.

Not surprisingly, base composition is one of the parameters affecting the predicted folding energies strongest (Fig. 1A). High G+C content leads to more stable RNA predictions, while high A+T content gives less stable predictions. As mentioned in the introduction and in fact the main motivation of this paper, also dinucleotide content affects folding energies. We used our simulation algorithm that is described below to simulate alignments of the same mononucleotide content but varying dinucleotide content. Fig. 1B shows for example that a three times enriched ApT content lead to more stable predictions. The excess of some other dinucleotides like for example GpT can cause the opposite effect leading to less stable predictions.

Figure 1
figure 1

Parameters effecting thermodynamic consensus RNA structure predictions. As a basic parameter set we used equal base frequencies of 0.25, a transition/transversion rate ratio κ = 1, and the following tree ((A:0.09,B:0.09):0.09,(C:0.09,D:0.09):0.09) One parameter was varied at a time while others were kept constant. If necessary branch lengths were adjusted to keep a mean pairwise sequence identity (MPI) of 0.75 ± 0.01. 1000 alignments of length 80 were simulated under each condition. Cumulative histograms for the RNAalifold consensus folding energies are shown. Please note that we plot negative minimum free energies, i.e. higher values correspond to more stable folds. (A) Base frequencies were varied to get high and low G+C content. (B) Two specific dinucleotide frequencies were elevated 3-fold while the mononucleotide content was kept constant. (C) Branch lengths were equally scaled to produce alignments with lower or higher MPI identity than for the basic tree. (D) The transition/transversion rate ratio was varied. κ = 1 means equal rates, while κ > 1 gives more transition than transversions. (E) The alignment of size 80 was divided into a central block of 40 and two anking regions of 20. We set 100% conservation in the central block and low conservation in the anks (rate "high-low-high") and the other way round ("low-high-low"). The total average MPI was always 0.75. (F) We tested all possible topologies of this 4 taxa tree and adjusted the branch lengths to give a MPI of 0.75. For one given topology, all the branch lengths were of the same length.

Full size image

Another major parameter that needs to be controlled is the sequence diversity of the alignment. Variation of the branch lengths of the tree gives alignments with different sequence diversity which we usually measure as the mean pairwise sequence identity (MPI, also sometimes refered to as average pairwise sequence identity APSI). High diversity (i.e. low MPI) makes it difficult to predict a consensus structure if there is no selection pressure for it. On the other hand, almost perfectly conserved sequences fold readily in some random structure even if there is no natural RNA structure present. Therefore we observe a strong dependency on the MPI (Fig. 1C).

One well known characteristic of natural mutation processes are the different rates for transitions and transversions [27]. Interestingly, this also affects the consensus structure predictions. A model with equal transition/transversion rates (parameter κ = 1 in the HKY model) gives less stable predictions than a model with more realistic rates (e.g κ = 4, Fig. 1D). This parameter affects the type of column patterns observed in the simulated alignments which in turn affects how well they can form consensus base pairs.

Natural mutation processes are not homogeneous across all sites, in particular in functional genomic regions. It was observed previously that mutation patterns within an alignment can affect structure predictions [5]. For example, an alignment containing a 100% conserved block with low mutation rate that is flanked by highly divergent regions of high mutation rate can have different folding energies compared to an alignment with homogeneous rates but the same overall MPI (Fig. 1E). The same is true for patterns of insertions and deletions which was also already discussed in reference 5 and which we do not show here explicitly again.

We also tested the effect of different tree topologies, but did not find a significant influence of this parameter at least in our four taxa example.

Taken together, an accurate randomization procedure needs to generate alignments that preserve (i) mono- and dinucleotide content, (ii) mean pairwise sequence identity, (iii) transition/transversion rate ratio (iv) site-specific mutation rates, and (v) gap patterns.

In the next section we describe a model that is capable of simulating alignments under these constraints.

Algorithm

Model

Sequence evolution is usually described by a time-continuous Markov process [27, 28]. The most commonly used models assume that all sites of a sequence evolve independently from each other rendering it impossible to model dinucleotide dependencies between neighbouring pairs. Various evolutionary models have been proposed in the past years to overcome this limitation [2936]. We make use of the recently introduced framework called SISSI (SImulating Site-Specific Interactions). SISSI allows to define site dependencies of arbitrary complexity in the form of a "neighbourhood system" that also may include overlapping dependencies [37]. Given the requirements of our specific problem, we extended and simplified several aspects of SISSI as necessary.

Following the general framework of SISSI, we introduce a site-specific rate matrix Q k for every site k = 1, ..., l in a sequence x = (x1, ..., xl). This matrix defines the substitution process at site k, where the substitution of a given nucleotide x k ∈ A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ = {A,C,G,U} by another one depends on the states xk-1, x k , xk+1(Fig. 2).

Figure 2
figure 2

Site dependencies for overlapping dinucleotides (red-gray): The substitution process of a given nucleotide x k at site k by another one depends on the states xk-1, x k , xk+1, the subsequence s k . Q k has the dimension 64 × 64, where only one mutation is allowed at the current site k. The substitution rate for the whole sequence q(x) is the sum of each rate q(k) = Q k (s k , s k ) multiplied with a site-specific scaling factor f k , with k = 1, ..., l.

Full size image

Thus, the instantaneous rate matrix Q k has the dimension | A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ |3 × | A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ |3 = 64 × 64. The stationary distribution of Q k determines the equilibrium dinucleotide content of our system (see the next section for how the required trinucleotide frequencies of Q k are calculated from the dinucleotide frequencies).

To be able to control the transition/transversion rate ratio and the site-specific mutation rates, we have to add two additional parameters. Let s k = (xk-1, x k , xk+1) represent the current triplet of sequence x and y = (y1, y2, y3) another triplet in A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ 3. First, we introduce a general parameter r(s k , y) ≥ 0 to incorporate the additional mechanistic rates. Second, we introduce a site-specific scaling factor f k with k = 1, ..., l, such that:

1 l ⋅ ∑ 1 l f k = 1 . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqcfa4aaSaaaeaacqaIXaqmaeaacqWGSbaBaaGccqGHflY1daaeWbqaaiabdAgaMnaaBaaaleaacqWGRbWAaeqaaOGaeyypa0JaeGymaedaleaacqaIXaqmaeaacqWGSbaBa0GaeyyeIuoakiabc6caUaaa@3BC7@

(1)

We impose the usual restriction, that only one substitution per unit time is admissible [38, 39]. Moreover, Q k only allows for substitutions at site k. The diagonal elements of our instantaneous rate matrix Q k are defined by the mathematical requirement that the sum of each row is zero.

The entries of Q k are thus given by

Q k ( s k , y ) = f k ⋅ { r ( s k , y ) ⋅ π k ( y ) if  H ( s k , y ) = 1  and  x k ≠ y 2 − ∑ z ∈ A 3 z ≠ s k Q k ( s k , z ) if  H ( s k , y ) = 0 0 otherwise MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyuae1aaSbaaSqaaiabdUgaRbqabaGccqGGOaakcqWHZbWCdaWgaaWcbaGaem4AaSgabeaakiabcYcaSiabhMha5jabcMcaPiabg2da9iabdAgaMnaaBaaaleaacqWGRbWAaeqaaOGaeyyXIC9aaiqaaeaafaqaaeWacaaabaGaemOCai3aaSbaaSqaaiabcIcaOiabhohaZnaaBaaameaacqWHRbWAaeqaaSGaeiilaWIaeCyEaKNaeiykaKcabeaakiabgwSixlabec8aWnaaBaaaleaacqWGRbWAaeqaaOGaeiikaGIaeCyEaKNaeiykaKcabaGaeeyAaKMaeeOzayMaeeiiaaIaemisaGKaeiikaGIaeC4Cam3aaSbaaSqaaiabdUgaRbqabaGccqGGSaalcqWH5bqEcqGGPaqkcqGH9aqpcqaIXaqmcqqGGaaicqqGHbqycqqGUbGBcqqGKbazcqqGGaaicqWG4baEdaWgaaWcbaGaem4AaSgabeaakiabgcMi5kabdMha5naaBaaaleaacqaIYaGmaeqaaaGcbaGaeyOeI0YaaabuaeaacqWGrbqudaWgaaWcbaGaem4AaSgabeaakiabcIcaOiabhohaZnaaBaaaleaacqWGRbWAaeqaaOGaeiilaWIaeCOEaONaeiykaKcalqaabeqaaiabhQha6jabgIGioprtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab=bq8bnaaCaaameqabaGaeG4mamdaaaWcbaGaeCOEaONaeyiyIKRaeC4Cam3aaSbaaWqaaiabdUgaRbqabaaaaSqab0GaeyyeIuoaaOqaaiabbMgaPjabbAgaMjabbccaGiabdIeaijabcIcaOiabhohaZnaaBaaaleaacqWGRbWAaeqaaOGaeiilaWIaeCyEaKNaeiykaKIaeyypa0JaeGimaadabaGaeGimaadabaGaee4Ba8MaeeiDaqNaeeiAaGMaeeyzauMaeeOCaiNaee4DaCNaeeyAaKMaee4CamNaeeyzaugaaaGaay5Eaaaaaa@A77B@

(2)

where π k (y) is the stationary frequency of y and the Hamming distance H(s k , y) counts the number of differences between the sites of the triplets s k and y.

In principle, we can choose any rate for the parameter r(s k , y). However, based on the requirement that we want to use the counted dinucleotide content as the stationary distribution, we choose r(s k , y) so that the model becomes reversible. Any parameter of the commonly used independent nucleotide substitution models, like HKY [25] or the general time-reversible model GTR [40] can be chosen for r(s k , y). For our application, we use a transition/transversion rate ratio and set r(s k , y)= κ for transitions and r(s k , y) = 1 for transversions.

The restriction that a substitution is only possible at site k leads to sparse rate matrices. Q k has only | A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ |4 non-zero entries. Hence, we can write Q k in the form of 16 submatrices, which describe the admissible substitutions for site k depending on the left y1 and right y3 neighbours,

y 1 A y 3 y 1 C y 3 y 1 G y 3 y 1 U y 3 y 1 A y 3 y 1 C y 3 y 1 G y 3 y 1 U y 3 ( ∗ π y 1 C y 3 κ π y 1 G y 3 π y 1 U y 3 π y 1 A y 3 ∗ π y 1 G y 3 κ π y 1 U y 3 κ π y 1 A y 3 π y 1 C y 3 ∗ π y 1 U y 3 π y 1 A y 3 κ π y 1 C y 3 π y 1 G y 3 ∗ ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabeGaaaqaauaabeqafeaaaaqaaaqaaiabdMha5naaBaaaleaacqaIXaqmaeqaaOGaeeyqaeKaemyEaK3aaSbaaSqaaiabiodaZaqabaaakeaacqWG5bqEdaWgaaWcbaGaeGymaedabeaakiabboeadjabdMha5naaBaaaleaacqaIZaWmaeqaaaGcbaGaemyEaK3aaSbaaSqaaiabigdaXaqabaGccqqGhbWrcqWG5bqEdaWgaaWcbaGaeG4mamdabeaaaOqaaiabdMha5naaBaaaleaacqaIXaqmaeqaaOGaeeyvauLaemyEaK3aaSbaaSqaaiabiodaZaqabaaaaaGcbaqbaeqabiqaaaqaauaabeqabqaaaaqaaiadaciP+daadMha5nacaciP+daaBaaaleacaciP+daacWaGasQ=aaaIXaqmaeqcaciP+daaaOGamaiGK6paaeyqaeKamaiGK6paamyEaK3aiaiGK6paaSbaaSqaiaiGK6paaiadaciP+daaiodaZaqajaiGK6paaaaakeaacWafaoyEaK3aiqbGBaaaleacuaOamqbGigdaXaqajqbGaOGamqbGboeadjadua4G5bqEdGafaUbaaSqaiqbGcWafaI4mamdabKafacaakeaacWaPaoyEaK3aiqkGBaaaleacKcOamqkGigdaXaqajqkGaOGamqkGbEeahjadKc4G5bqEdGaPaUbaaSqaiqkGcWaPaI4mamdabKaPacaakeaacWaGaInaaaWG5bqEdGaGaInaaaWgaaWcbGaGaInaaaGamaiGydaaaGymaedabKaGaInaaaaakiadaci2aaaabwfavjadaci2aaaadMha5nacaci2aaaaBaaaleacaci2aaaacWaGaInaaaaIZaWmaeqcaci2aaaaaaaaaOqaamaabmaabaqbaeqabqabaaaaaeaacqGHxiIkaeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGdbWqcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaH6oWAcqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGhbWrcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGvbqvcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGdbWqcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqGHxiIkaeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGhbWrcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaH6oWAcqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGvbqvcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaH6oWAcqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGbbqqcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGdbWqcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqGHxiIkaeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGvbqvcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGbbqqcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaH6oWAcqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGdbWqcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqaHapaCdaWgaaWcbaGaemyEaK3aaSbaaWqaaiabigdaXaqabaWccqqGhbWrcqWG5bqEdaWgaaadbaGaeG4mamdabeaaaSqabaaakeaacqGHxiIkaaaacaGLOaGaayzkaaaaaaaaaaa@14F0@

(3)

Finally, we scale Q k such that the number of substitutions d k equals 1:

d k = − ∑ z ∈ A 3 π k ( z ) ⋅ Q k ( z , z ) = 1 . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemizaq2aaSbaaSqaaiabdUgaRbqabaGccqGH9aqpcqGHsisldaaeqbqaaiabec8aWnaaBaaaleaacqWGRbWAaeqaaOGaeiikaGIaeCOEaONaeiykaKIaeyyXICTaemyuae1aaSbaaSqaaiabdUgaRbqabaGccqGGOaakcqWH6bGEcqGGSaalcqWH6bGEcqGGPaqkcqGH9aqpcqaIXaqmaSqaaiabhQha6jabgIGioprtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab=bq8bnaaCaaameqabaGaeG4mamdaaaWcbeqdcqGHris5aOGaeiOla4caaa@56CC@

(4)

and thus the total instantaneous substitution rate for a sequence x can be written as the sum over each rate Q k (s k , s k ) multiplied with the site-specific scaling factor f k , with k = 1, ..., l (Fig. 2),

q ( x ) = − ∑ k = 1 l f k ⋅ Q k ( s k , s k ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyCaeNaeiikaGIaeCiEaGNaeiykaKIaeyypa0JaeyOeI0YaaabCaeaacqWGMbGzdaWgaaWcbaGaem4AaSgabeaakiabgwSixlabdgfarnaaBaaaleaacqWGRbWAaeqaaOGaeiikaGIaeC4Cam3aaSbaaSqaaiabhUgaRbqabaGccqGGSaalcqWHZbWCdaWgaaWcbaGaeC4AaSgabeaakiabcMcaPaWcbaGaem4AaSMaeyypa0JaeGymaedabaGaemiBaWganiabggHiLdGccqGGUaGlaaa@4B34@

(5)

Without dependencies on the neighbours, Q k is of dimension 4 × 4 and the model reduces essentially to a HKY model with a specific rate for each site. We use this mononucleotide variant later in this paper for testing and comparison to the dinucleotide model.

Simulation

For the simulation process, we essentially used the same algorithm described previously [37] with some modifications. During the simulation process, we pick a site k with the relative mutability

P ( k ) = | f k ⋅ Q k ( s k , s k ) | q ( x ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiuaaLaeiikaGIaem4AaSMaeiykaKIaeyypa0tcfa4aaSaaaeaadaabdaqaaiabdAgaMnaaBaaabaGaem4AaSgabeaacqGHflY1cqWGrbqudaWgaaqaaiabdUgaRbqabaGaeiikaGIaeC4Cam3aaSbaaeaacqWGRbWAaeqaaiabcYcaSiabhohaZnaaBaaabaGaem4AaSgabeaacqGGPaqkaiaawEa7caGLiWoaaeaacqWGXbqCcqGGOaakcqWH4baEcqGGPaqkaaGaeiilaWcaaa@4ADF@

(6)

and for the chosen site k, the nucleotide x k will replaced by a new nucleotide y2 ∈ A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8haXheaaa@3749@ from the corresponding triplet y with probability:

P ( x k → y 2 ) = f k ⋅ Q k ( s k , y ) | f k ⋅ Q k ( s k , s k ) | = Q k ( s k , y ) | Q k ( s k , s k ) | MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiuaaLaeiikaGIaemiEaG3aaSbaaSqaaiabdUgaRbqabaGccqGHsgIRcqWG5bqEdaWgaaWcbaGaeGOmaidabeaakiabcMcaPiabg2da9KqbaoaalaaabaGaemOzay2aaSbaaeaacqWGRbWAaeqaaiabgwSixlabdgfarnaaBaaabaGaem4AaSgabeaacqGGOaakcqWHZbWCdaWgaaqaaiabdUgaRbqabaGaeiilaWIaeCyEaKNaeiykaKcabaWaaqWaaeaacqWGMbGzdaWgaaqaaiabdUgaRbqabaGaeyyXICTaemyuae1aaSbaaeaacqWGRbWAaeqaaiabcIcaOiabhohaZnaaBaaabaGaem4AaSgabeaacqGGSaalcqWHZbWCdaWgaaqaaiabdUgaRbqabaGaeiykaKcacaGLhWUaayjcSdaaaiabg2da9maalaaabaGaemyuae1aaSbaaeaacqWGRbWAaeqaaiabcIcaOiabhohaZnaaBaaabaGaem4AaSgabeaacqGGSaalcqWH5bqEcqGGPaqkaeaadaabdaqaaiabdgfarnaaBaaabaGaem4AaSgabeaacqGGOaakcqWHZbWCdaWgaaqaaiabdUgaRbqabaGaeiilaWIaeC4Cam3aaSbaaeaacqWGRbWAaeqaaiabcMcaPaGaay5bSlaawIa7aaaaaaa@737C@

(7)

In the most general SISSI framework Q k needs to be updated for all k sites every time one nucleotide in x is substituted. However, in our special case we can use the same instantaneous rate matrix Q k for each site with special conditions for r(s k , y). As a consequence, we can fix q(x) and do not need to sum over each rate of the site, which improves the running time of the algorithm.

Parameter estimation

The idea of our randomization procedure is to estimate a tree under the model described in the previous section and simulate sequences along this tree. Ideally, all parameters are estimated simultaneously within a maximum likelihood framework. One problem is the high number of parameters since we want to estimate a specific rate for each site. A more fundamental issue is, however, that our model includes overlapping dependencies which breaks the independence assumption necessary for basic maximum likelihood estimation. Other possible techniques like Markov chain Monte Carlo in a Bayesian framework are not a viable alternative either. Speed is a critical issue as the algorithm is meant to be applied to data on a genome wide scale.

Facing these difficulties, we have developed heuristic approximations to estimate the parameters and use a distance based approach to estimate the tree. The method is fast and yet surprisingly accurate for our application.

Equilibrium frequencies

The stationary frequencies of our model are set in a way that in equilibrium we obtain a dinucleotide frequency that is the same as the dinucleotide content of the alignment to be randomized. To this end, we first count the dinucleotide frequencies as an average of all sequences in the original alignment (see Methods on how gaps are treated). Then, we calculate the corresponding trinucleotide frequencies needed for Q k as a function of the single and dinucleotide frequencies using an approximation based on simple conditional probabilities [30, 32]:

π ( α β γ ) = π ( α β ) π ( β γ ) π ( β ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeqiWdaNaeiikaGIaeqySdeMaeqOSdiMaeq4SdCMaeiykaKIaeyypa0tcfa4aaSaaaeaacqaHapaCcqGGOaakcqaHXoqycqaHYoGycqGGPaqkcqaHapaCcqGGOaakcqaHYoGycqaHZoWzcqGGPaqkaeaacqaHapaCcqGGOaakcqaHYoGycqGGPaqkaaaaaa@4893@

(8)

where π(αβγ) are the trinucleotide frequencies, π(αβ) and π(βγ) the counted dinucleotide frequencies and π(β) = Σ α π(αβ) = Σ α π(βα) with α, β, γ ∈ {A, C, G, U}.

Fig. 3A shows an example of the dinucleotide frequency distribution of 1000 simulated alignments. We counted the dinucleotide frequencies of an alignment of 7 5.8 rRNA sequences and set the trinucleotide parameters of our model accordingly. On average, we get the same dinucleotide frequencies in the simulated alignments as in the original one.

Figure 3
figure 3

Key concepts of the algorithm shown on an example alignment of 5.8S rRNA. (A) Distribution of dinucleotide frequencies of 1000 simulated alignments are shown as box-plots (the line in the box indicates the median, the borders of the box the 25th and 75th quartile, and the dotted lines 1.5× the interquartile range). Red circles show the frequencies observed in the original alignment. (B) Relationship between the number of substitutions and observed differences empirically determined by sampling of 25 points. Each point shows the average of 10 simulations. Note that the short distances are sampled more densely. These settings are the default values in our program and used throughout the paper. (C) Distribution of mean pairwise identities for 1000 random samples. The MPI of the original alignment is shown in red. (D) Comparison of site-wise MPIs in the original alignment and the average of the corresponding sites of 1000 random alignments.

Full size image

Distances and tree construction

To build a distance based tree, we first have to estimate the number of substitutions that have taken place between two sequences. In other words, we have to estimate the genetic or evolutionary distance d from the Hamming distances p under our model. Both distances are different because back mutations have taken place that are not directly visible. To estimate the relationship between d and p, we simulate sequence pairs separated by different branch lengths d and calculate the corresponding Hamming distances p (Fig. 3B). We fit an exponential function to this curve:

p = a ^ ⋅ ( 1 − e b ^ ⋅ d ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiCaaNaeyypa0JafmyyaeMbaKaacqGHflY1cqGGOaakcqaIXaqmcqGHsislcqWGLbqzdaahaaWcbeqaaiqbdkgaIzaajaGaeyyXICTaemizaqgaaOGaeiykaKcaaa@3C48@

(9)

Using this function, all pairwise distances are calculated for the sequences in the original alignment. From this distance matrix a tree is constructed using the BIONJ algorithm [41]. BIONJ is a variant of the well known neighbour joining algorithm and currently one of the most accurate algorithms for distance based tree building.

Given that the distances and the tree are accurately estimated, we observe on average the same mean pairwise identity in the simulated alignment as in the original one. Fig. 3C shows the distribution of MPIs of 1000 simulations of our example rRNA alignment. The average MPI of the simulations is exactly the same as the MPI 0.73 of the original alignment.

Site-specific rates

Setting different mutation rates at different sites gives us the possibility to preserve natural mutation patterns of the original alignment. The problem of finding accurate site-specific rates is illustrated in Fig. 3D. For each site in the alignment, the MPI of this site is plotted against the average MPI observed in the simulated alignments on the same site. If we consider equal rates for all sites, each site will have the same average MPI which is of course equal to the overall MPI of 0.73 of the whole alignment. Ideally, the average MPI for each simulated site is the same as the original MPI at this site. In this case, the points in the plot are on a diagonal indicating that we have found accurate estimates for the rates.

Simple estimates for site-specific rates in combination with distance based trees have been described previously [42]. The method includes fits to a gamma distribution which requires data of at least 1000 nucleotides and 30 sequences to get reasonable results. Here we use a different approach that also gives good results for smaller alignments.

The substitution rate at a site is of course related to the observed sequence diversity at this site. If a site is highly conserved the rate is low, whereas high sequence diversity indicates a high mutation rate. So in a first step, we calculate the average number of pairwise differences ⟨p k ⟩ for each site k in the alignment with n sequences:

〈 p k 〉 = 2 n ( n − 1 ) ∑ i n ∑ j > i n δ i j k ; with  δ i j k = { 1 if nucleotides in sequences  i , j  differ at site  k 0 otherwise MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabeGaaaqaaiabgMYiHlabdchaWnaaBaaaleaacqWGRbWAaeqaaOGaeyOkJeVaeyypa0tcfa4aaSaaaeaacqaIYaGmaeaacqWGUbGBcqGGOaakcqWGUbGBcqGHsislcqaIXaqmcqGGPaqkaaGcdaaeWbqaamaaqahabaGaeqiTdq2aa0baaSqaaiabdMgaPjabdQgaQbqaaiabdUgaRbaaaeaacqWGQbGAcqGH+aGpcqWGPbqAaeaacqWGUbGBa0GaeyyeIuoakiabcUda7aWcbaGaemyAaKgabaGaemOBa4ganiabggHiLdaakeaacqqG3bWDcqqGPbqAcqqG0baDcqqGObaAcqqGGaaicqaH0oazdaqhaaWcbaGaemyAaKMaemOAaOgabaGaem4AaSgaaOGaeyypa0ZaaiqaaeaafaqaaeGacaaabaGaeGymaedabaGaeeyAaKMaeeOzayMaeeiiaaIaeeOBa4MaeeyDauNaee4yamMaeeiBaWMaeeyzauMaee4Ba8MaeeiDaqNaeeyAaKMaeeizaqMaeeyzauMaee4CamNaeeiiaaIaeeyAaKMaeeOBa4MaeeiiaaIaee4CamNaeeyzauMaeeyCaeNaeeyDauNaeeyzauMaeeOBa4Maee4yamMaeeyzauMaee4CamNaeeiiaaIaemyAaKMaeiilaWIaemOAaOMaeeiiaaIaeeizaqMaeeyAaKMaeeOzayMaeeOzayMaeeyzauMaeeOCaiNaeeiiaaIaeeyyaeMaeeiDaqNaeeiiaaIaee4CamNaeeyAaKMaeeiDaqNaeeyzauMaeeiiaaIaem4AaSgabaGaeGimaadabaGaee4Ba8MaeeiDaqNaeeiAaGMaeeyzauMaeeOCaiNaee4DaCNaeeyAaKMaee4CamNaeeyzaugaaaGaay5Eaaaaaaaa@A812@

(10)

p k ⟩ are observed differences ignoring multiple substitutions. If we naively choose our rates proportional to ⟨p k ⟩ we would underestimate high rates while overestimating low rates. We therefore use the relationship in equation 9 to correct for this effect and calculate estimates f ^ k MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmOzayMbaKaadaWgaaWcbaGaem4AaSgabeaaaaa@2EC5@ for the rates at site k as follows:

f ^ k = 1 b ^ ⋅ ln ( 1 − 〈 p k 〉 a ^ ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmOzayMbaKaadaWgaaWcbaGaem4AaSgabeaakiabg2da9KqbaoaalaaabaGaeGymaedabaGafmOyaiMbaKaaaaGccqGHflY1cyGGSbaBcqGGUbGBdaqadaqaaiabigdaXiabgkHiTKqbaoaalaaabaGaeyykJeUaemiCaa3aaSbaaeaacqWGRbWAaeqaaiabgQYiXdqaaiqbdggaHzaajaaaaaGccaGLOaGaayzkaaaaaa@43FD@

(11)

It must be pointed out that the site-specific rates change the relationship between genetic distance and observed differences (Fig. 3B). For correcting the site-specific rates we use the estimates for a ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmyyaeMbaKaaaaa@2D30@ and b ^ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmOyaiMbaKaaaaa@2D32@ from our model without site-specific rates. So this is only an approximation and one could think about iteratively refining the estimates. However, we found that this approach already yields accurate rates within one step as can be seen in Fig. 3D. Using the model with site-specific rates, the simulated alignments have on average almost exactly the same site-wise MPI as the original one.

The reader will notice that the first three points deviate from the diagonal. This illustrates a limitation of our method. With our simulation procedure we can on average only reach the level of saturation even if we use very high rates. It is possible, however, that the original data contains sites below the level of saturation. For example in a four way alignment a column can be ACGT, i.e. MPI = 0. However, we cannot simulate on average columns with MPI = 0, since the MPI is bounded below by zero and our simulations will always contain columns with MPI > 0. In practice this does not seem to cause any obvious problems in particular when we have many sequences where it is unlikely to see columns below saturation.

Gaps

Gaps have been ignored completely so far. There are evolutionary models including deletions and insertions [4347] and, in principle, it would be possible to combine the insertion-deletion dynamics with our model. However, this does not appear practical in our case. Existing algorithms for joint estimation of phylogenies and alignments are not only very time-consuming [47], it also seems difficult to estimate reasonable indel model parameters on relatively short alignment blocks which hold only little information. Moreover, alignment programs produce gap patterns that do not necessarily reflect phylogenetically reasonable insertion/deletion events and thus cannot always be captured by an idealized model that is motivated by evolutionary processes and ignores algorithmic idiosyncrasies of alignment programs.

So we follow here a very pragmatic strategy that has also been used previously [5]: We keep exactly the same gap pattern in our randomized alignments as in the original alignment. To this end, we simply treat gaps as missing data and simulate nucleotide characters for the gapped positions. This is done in a way that the overall characteristics are not changed when they are replaced with gaps again at the end (see Methods for details).

Transition/transversion rate ratio

The transition/transversion rate ratio κ is a parameter in our model that cannot be simply counted as in the case of the dinucleotide frequencies, or empirically determined like the branch lengths. Given that the influence of this parameter is not that critical as for example the branch length or base composition (see Fig. 1), one possibility might be to use a fixed transition/transversion ratio if a reasonable average value is known for the genome at hand. Alternatively, we found that a good estimate can be obtained by using maximum likelihood on an independent mononucleotide model. We used here the HKY model with -distributed rates which is closest to our dinucleotide model.

Putting it together

Fig. 4 gives a short outline over the whole randomization procedure. We start by parametrization our model: we count the dinucleotides and calculate the corresponding stationary trinucleotide frequencies. A transition/transversion rate ratio for the alignment is estimated using maximum likelihood under a HKY+Γ model. Having set these parameters, we empirically estimate the relationship between substitutions and observed differences with equal rates for each site. This first estimate is used to calculate the site-specific rates, which are then used for the second estimation. In the next step, the pairwise distances between all sequences are calculated. For the calculation of the site-specific rates and the pairwise distances gap characters are treated in a special way as missing data (see Methods). From the distance matrix a tree is built using the BIONJ algorithm. An ancestral sequence is sampled from a first order Markov model parametrized according to the dinucleotide frequency in the original alignment. This is used as a starting sequence for the simulation that is guided by the tree. Finally, the gap pattern of the original alignment is introduced into the simulated one. Fig. 5 shows our rRNA example and two randomized versions obtained by this procedure.

Figure 4
figure 4

Overview of the algorithm. Left: The steps of the randomization procedure are shown. Right: In combination with RNAalifold consensus folding the randomization procedure can be used to calculate z-scores and to predict significant RNA structures. See text for details.

Full size image

Figure 5
figure 5

Example of randomized alignments. Part of the example alignment used in Fig. 3 are shown. The grey bars indicate the level of local conservation. Exactly conserved sites are marked by asterisks.

Full size image

Implementation

We implemented our method in ANSI C in a program called SISSIz. The source code is available under the GNU Public Licence for download [48].

Some words on running time: One might suspect that the randomization algorithm including two times the sampling procedure to estimate the parameters of equation 9 and the maximum likelihood estimation of the transition/transversion rate ratio is relatively slow. Indeed, it is much slower than for example randomization by shuffling, but still very fast. To build the model for our example of 7 rRNAs of 158 length takes 0.2 seconds on a modern Intel Core 2 Quad CPU at 2.4 GHz. To simulate 1000 alignments using this model takes another 0.6 seconds.

Testing

Randomizing vertebrate genomic alignments

We tested our randomization method on vertebrate genomic alignments. In a setting similar to recent genomic screens in vertebrates [11, 20], we extracted Multiz [49] alignment blocks from human chromosome 1. We randomly selected 1000 alignment blocks between 70 and 120 nt in length and between 4 and 10 sequences without considering annotation information of any sort. These alignments are meant to represent an unbiased "genomic background" that may also contain functional elements like coding exons or structured RNAs depending on their frequency in the genome.

The alignments were randomized using our new simulation procedure with both the dinucleotide and the mononucleotide model. In addition, we shuffled the alignments using shuffle-aln.pl. The global distribution of dinucleotides for the original and randomized data is shown in Fig. 6. As expected, the shuffling approach and the mononucleotide simulation give the same results. The dinucleotide distribution obtained by these methods, however, differs from the distribution in the native alignments. One can see for example the well known under-representation of CpGs in the native genomic data. Using our dinucleotide based model, we obtain simulated alignments which are statistically indistinguishable from the native data in terms of their average dinucleotide content.

Figure 6
figure 6

Dinucleotide frequencies of genomic alignments. 1000 vertebrate genome alignments were randomized using three different methods. The dinucleotide frequency of the native and randomized data is shown as box-plots.

Full size image

Also the observed sequence diversity of the simulated alignments closely follows the original data as shown in Fig. 7. 98.7% of the simulated alignments are within a range of ± 0.05 mean pairwise identity compared to the original alignments. It must be noted, that the distribution in Fig. 7 has a mean of +0.007 which indicates a subtle bias of the simulations towards higher MPIs. We suspect that this is an indirect result of the way we estimate site-specific rates and related to the issue of sites below saturation discussed before. However, this deviation does not have any practical consequences since it represents a conservative bias in the context of RNA folding controls and, more importantly, seems to be too small to have any noticeable effect at all.

Figure 7
figure 7

Mean pairwise identity in randomized genomic alignments. The distribution of the difference of the mean pairwise identity between the original genomic alignments and the simulated ones (dinucleotide model) is shown.

Full size image

Influence of randomization procedure on RNA predictions

The main motivation of this paper is to provide dinucleotide based controls for comparative RNA gene predictions. Therefore, we ran RNAalifold and RNAz on the alignments to demonstrate how different randomization procedures affect the results. Fig. 8A shows the distribution of RNAalifold consensus MFEs on the genomic alignments and their different randomizations. One can see that the genomic alignments show the most stable structures. There is a clear difference between the native genomic alignments and the shuffled and mononucleotide simulated ones. However, the folding energies of the dinucleotide simulated alignments are much closer to the native data. This difference between the di- and mononucleotide simulations reflects the bias caused by the genomic dinucleotide content. The difference between the native and the dinucleotide controls indicates the existence of RNA signals in the genome or, alternatively, another as yet unidentified bias.

Figure 8
figure 8

Influence of the randomization procedure on RNA predictions. (A) Cumulative frequency distribution of RNAalifold consensus folding energies for the native and randomized alignments. (B) Cumulative frequency distribution of RNAz scores. The "decision-value" is the result of the support vector machine classification. Positive values indicate a potential functional RNA while negative values indicate no significant fold. The positive tail is magnified.

Full size image

Clearly, the differences shown here in these cumulative histograms might appear very subtle. The results for the RNAz predictions, however, show that such differences can strongly affect the statistics of RNA gene predictions (Fig. 8B). On this particular test set, RNAz predicts RNA signals in 4.3% of the native alignments. Using the conventional shuffling strategy or mononucleotide based simulation one would estimate a false positive rate of 0.8% or 0.7%, respectively. Using the more conservative dinucleotide based model the estimate would be 2.1%, i.e. three times higher. This is consistent with the results obtained by Babak et al. using their dinucleotide shuffling approach on pairwise alignments.

Calculating z-scores to predict structural RNAs

We can directly assess the significance of a predicted RNA by calculating a z-score. The folding energy of the native data m and the mean μ and standard deviation σ of randomized data is calculated. The significance of the native fold can then be expressed as z = (m - μ)/σ, i.e. the number of standard deviations from the mean. This score has been repeatedly used on single sequences applying mono- or dinucleotide shuffling or simulation using a zero or first order Markov model [23, 24]. Using shuffled alignments as null model, this approach is implemented in the RNA gene finding program AlifoldZ [5]. The same strategy can be used in combination with our new dinucleotide base randomization strategy without any further modifications (Fig. 4).

To test the effectiveness of this approach, we conducted a benchmark similar to those used previously [5, 6] for testing AlifoldZ and RNAz. We used multiple sequence alignments of eight different structural RNA families taken from the Rfam database [50]. The alignments contained three to six sequences and had a mean pairwise identity between 50% and 100% (see Methods for details). For the tests of AlifoldZ and RNAz, shuffled alignments were used as negative controls. For obvious reasons, this is not possible here. So we used genomic alignments from random locations of the human genome (see Methods). Using the "genomic background" as negative controls in this test implies the assumption that the genome does not contain any structural RNAs at all, which is clearly not valid. However, if we assume true structural RNAs to be sparse in the genome this conservative assumption seems to be a sensible choice.

We calculated z-scores with a sample size of 1000 randomizations for both sets of true structured RNAs and the genomic background using three different randomization methods: Shuffling (AlifoldZ), simulation using a mononucleotide model (SISSIz mono) and simulation using the dinucleotide model (SISSIz di). The results are summarized in Tab. 1.

Table 1 z-scores and classification performance

Full size table

Using mononucleotide based randomization the z-scores of the genomic background are approximately half a standard deviation from zero (-0.44 and -0.58, for shuffling and mononucleotide simulation respectively). This shows the relatively strong "bias" of the genomic background that causes false positive predictions as shown in the previous section and in reference 21. Albeit the signal does not vanish completely, the dinucleotide based z-scores are much closer to zero (-0.15).

The z-scores of the structural RNAs in this test set are on average well below -4 indicating a clear structural signal. Also here, we observe that mononucleotide simulated z-scores are lower than the dinucleotide simulated ones. In this case, a dinucleotide content that favors stable RNA structures is clearly not only a general background effect of the genomic base composition but a feature of structural RNAs. However, this signal is lost if the more conservative dinucleotide based null model is used.

There is also a clear difference between the two monunucleotide randomization procedures: Shuffling leads to more significant z-scores than simulation. The main reason is the fact that simulation results in higher standard deviations than shuffling which in turn lead to more conservative z-scores.

This shows that there are many effects that have to be taken into account. To assess the overall classification performance we generated receiver operating characteristic curves based on the three different z-scores, as well as the support vector machine score from RNAz (Fig. 9). In addition, we calculated the sensitivity at two different levels of specificity (0.01 and 0.05) for all four approaches (Tab. 1).

Figure 9
figure 9

Accuracy of z -score based classification of structured RNAs. As positive examples, alignments from eight different classes of structural RNAs were used. As negative examples, random locations from genome wide vertebrate alignments were chosen. ROC curves are shown in dependence on the null model used. In addition, the results of the RNAz support vector machine are shown. The region of high specificity which is of special interest is magnified.

Full size image

The ROC curve shows that all the methods perform very well on this test set. The curve further suggests that there is not much difference between them. However, differences become evident when looking at the region of high specificity, the only relevant region for practical applications (see inset Fig. 9). Here, the dinucleotide based approach generally outperforms the mononucleotide based methods. The improvement is small but clearly noticeable: At a false positive rate of 0.01%, dinucleotide based simulation shows the highest sensitivity for 7 of the 8 RNA classes. For example, in the tRNA group the sensitivity is 13% higher than AlifoldZ and RNAz. The latter performs significantly worse than all other methods at this level. At a false positive rate of 0.05%, dinucleotide simulation still performs slightly better than mononucleotide shuffling/simulation but is on the same level as RNAz that performs significantly better here.