  • ️Mon Jan 01 2007

The (in)dependence of alternative splicing and gene duplication

David Talavera et al. PLoS Comput Biol. 2007.

Abstract

Alternative splicing (AS) and gene duplication (GD) are both processes that diversify the protein repertoire. Recent examples have shown that sequence changes introduced by AS may be comparable to those introduced by GD. In addition, the two processes are inversely correlated at the genomic scale: large gene families are depleted in splice variants and vice versa. Altogether, these data strongly suggest that the effects of the two phenomena are interchangeable. Here, we tested the extent to which this applies with respect to various protein characteristics. The amounts of AS and GD per gene are anticorrelated even when accounting for different gene functions or degrees of sequence divergence. In contrast, the two processes appear to be independent in their influence on variation in mRNA expression. Further, we conducted a detailed comparison of the effect of sequence changes in both alternative splice variants and gene duplicates on protein structure, in particular the size, location, and types of sequence substitutions and insertions/deletions. We find that, in general, alternative splicing affects protein sequence and structure in a more drastic way than gene duplication and subsequent divergence. Our results reveal an interesting paradox between the anticorrelation of AS and GD at the genomic level, and their impact at the protein level, which shows little or no equivalence in terms of effects on protein sequence, structure, and function. We discuss possible explanations that relate to the order of appearance of AS and GD in a gene family, and to the selection pressure imposed by the environment.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. Equivalence between Alternative Splicing and Gene Duplication

(A) The alignment shows an example of molecular equivalence between the effects of AS and GD. The human U2AF35 gene has two known splice variants, Hs_U2AF35a and Hs_U2AF35b, that differ along the region marked with a red box. The fugu orthologue Fr_U2AF35-a does not have known splice variants, but instead has a paralogue, Fr_U2AF35-b [9]. All sequences have kindly been provided by T. R. Pacheco and M. Carmo-Fonseca. For some residues (bold, highlighted in light blue), the substitutions amongst the human splice variants are equivalent to those in the fugu GD. The cartoon illustrates the relationship between the human and fugu sequences. The names of genes and their protein products are denoted in small and capital letters, respectively. At the molecular level, AS and GD show equivalent changes to sequence, and therefore are likely to have interchangeable effects on structure and function of the proteins. In this work we study whether such molecular interchangeability holds in general. (B) We compared the characteristics of two types of sequence changes, indels and substitutions, between AS (both shown in dark blue) and GD (shown in dark and light blue). On top, we illustrate an indel event (the deleted stretch is highlighted in red, and two dotted lines denote its location); at the bottom, we illustrate substitution events (red lines represent residue matches between sequences, linked by dotted lines; the continuous lines between alternative splice isoforms represent the boundaries of the interchanged stretches). (C) We used this protocol in all sequence comparisons between AS and GD. Changes between alternative splice isoforms are obtained after comparing the SwissProt [44] reference isoform with the remaining isoforms. Changes between duplicates are obtained by comparing the SwissProt [44] reference isoforms of the genes that are part of one GD family.

Figure 2. The Relationship between AS and GD at the Genomic Level

(A) The diagram shows the uneven distribution of AS amongst GD families of different sizes for the human genome. Information on AS has been taken from the AltSplice database [43]. GD families were obtained by clustering all sequences of more than 40%, 60%, 80%, or 90% seq.id., respectively, using CD-HIT [47]. The dashed line marks the expected fraction of genes with AS, given an unbiased distribution of all known genes with splice variants across the whole genome. In accordance with previous results [12,13], for large GD families we observe fewer genes with AS than expected at random. (B) The cartoons illustrate that alternative splice isoforms and gene duplicates may be expressed in the same number and/or types of tissues. Here, we compared the extent of coexpression amongst alternative splice variants (AS coexpression) and gene duplicates (GD coexpression). (C) Coexpression levels amongst gene duplicates (GD coexpression) are estimated as the average pairwise PC between expression patterns of all genes within a GD family. GD coexpression amongst duplicates of >40% seq.id. (white diamonds) is more similar to the overall AS coexpression (red line indicating the value displayed in Figure 2D) than GD coexpression amongst duplicates of >80% seq.id. In other words, coexpression of alternative splice variants is similar to coexpression amongst gene duplicates of >40% seq.id. As this dataset [17] is too small for GD80 families to be split into further subsets, we examined GD coexpression in an additional dataset [53] (black diamonds). For both 40% and 80% seq.id., expression variation amongst gene duplicates with alternative splice variants (AS+) is slightly higher than variation amongst gene duplicates without alternative splice variants (AS−). p-Values are based on t-test calculations. Data on alternative splice variants was taken from the AltSplice database [43]. Further details and results are provided in Table S4 and Figure S10A and S10B. 
(D) Coexpression levels amongst alternative splice variants (AS coexpression) are estimated as the average pairwise PC between the expression patterns of all exon junctions of a gene. High PC indicates little variation (high coexpression), and vice versa. The figure shows average AS coexpression across all genes in the dataset [17], and across subsets of the genes: GD families (GD+) and singletons (GD−) as defined by >40% and >80% seq.id., respectively. The overall AS coexpression is marked as a red diamond and indicated as a red line in Figure 2C. Further details are provided in Table S4 and Figure S10A and S10B. p-Values are based on t-test calculations. Gene duplicates of high seq.id. (>80%) have slightly lower AS coexpression than singletons (p-value < 0.001).
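
The coexpression measure described in panels (C) and (D) can be sketched as follows. This is a minimal illustration, assuming that "PC" stands for the Pearson correlation coefficient and that each expression pattern is a list of values measured across the same tissues; the function names `pearson` and `mean_pairwise_pc` are hypothetical and not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def mean_pairwise_pc(profiles):
    """Average correlation over all unordered pairs of profiles.

    For GD coexpression the profiles would be the genes of one family;
    for AS coexpression, the exon junctions of one gene (assumption).
    """
    pairs = list(combinations(profiles, 2))
    if not pairs:
        raise ValueError("need at least two profiles")
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)
```

A high value returned by `mean_pairwise_pc` corresponds to little expression variation among the members (high coexpression), in line with the caption.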

Figure 3. Global and Local Sequence Identity in AS and GD Substitutions

AS data were obtained by querying SwissProt [44] database version 40, with the keywords VARSPLIC and HUMAN. GD data were obtained by clustering the SwissProt [44] data using CD-HIT [47] to 40% or 80% seq.id. (GD40 and GD80, respectively). We focus on AS+/GD+ cases, i.e., those sequences with both AS and GD, in Figure 3A–3C, and discuss the AS−/GD+ versus AS+/GD− case in Figure 3D. (A) Global seq.id. The seq.id. in GD families depends on the cutoff used for clustering, e.g., GD40 (dark red) or GD80 (light violet), respectively. The global seq.id. between alternative splice isoforms (light green) is very high (>90% seq.id.), reflecting the underlying nature of AS changes. (B) Local seq.id. in alternative splice isoforms (dark green) is measured between substituted stretches, usually arising from mutually exclusive exons. The local seq.id. between gene duplicates is obtained using a moving window (GD80: light violet, GD40: dark red) and reporting the seq.id. observed in all possible window positions. (C) Local seq.id. in AS and GD at equivalent positions. The graph compares local seq.id. found in alternative splice variants of a gene with the local seq.id. of a duplicate of the same gene. The AS local seq.id. was computed between substituted sequence stretches. For GD, we mapped the sequence positions of the AS event to the aligned GD, and computed the seq.id. between the GD, considering only the aligned positions within that region. The comparison is shown for AS and GD40 (red) and GD80 (blue), respectively. The diagonal separates the plot into two halves: the upper half corresponds to the region for which GD seq.id. is higher than that for AS; the lower half corresponds to the opposite. For both types of gene families (GD40 and GD80), most substitutions show higher seq.id. amongst gene duplicates than amongst alternative splice variants, and this bias is significant (GD80: 111 of 142, χ² test p-value < 1.9 × 10⁻¹¹; and GD40: 492 of 786, χ² test p-value < 6.5 × 10⁻¹⁵, respectively). This result confirms the overall distributions examined in Figure 3B: changes in AS are stronger and more localized than those in GD. (D) Local seq.id. in AS−/GD+ and AS+/GD− substitutions. To compute local seq.id. in AS−/GD+ families, we first align two GD, then slide a 100-aa window over the sequence of one protein, and compute the seq.id. at all sequence positions of the window. The results of all the possible comparisons are plotted for GD40 (dark red) and GD80 (light violet) families. For genes with AS but no duplicates (AS+/GD−) (dark green), local seq.id. was computed between the two substituted stretches resulting from AS events. As for AS+/GD+ families (Figure 3B), we find that, in general, local seq.id.s are substantially lower for AS events (AS+/GD−) than for GD (AS−/GD+ families). The overlap between the AS and GD40 families is higher than that between AS and GD80 families, which may partly be due to differences in the structure constraints applying to the proteins in each set.
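
The moving-window calculation in panels (B) and (D) can be sketched as follows. This is only an illustration under stated assumptions: the two duplicates are given as a pairwise alignment with `-` for gaps, the window slides over the residues of the first protein, and a gap in the second protein counts as a mismatch. The caption does not specify how gaps were handled, and the function name `local_identities` is hypothetical.

```python
def local_identities(aln_a, aln_b, window=100):
    """Sequence identity in every window of `window` residues of protein A.

    aln_a, aln_b: aligned sequences of equal length, '-' marks a gap.
    Returns one identity fraction per window position (empty list if
    protein A is shorter than the window).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only alignment columns in which protein A has a residue, so the
    # window is measured in residues of A (assumption, see lead-in).
    matches = [1 if a == b else 0 for a, b in zip(aln_a, aln_b) if a != "-"]
    if len(matches) < window:
        return []
    running = sum(matches[:window])
    identities = [running / window]
    for i in range(window, len(matches)):
        # Slide by one residue: add the entering column, drop the leaving one.
        running += matches[i] - matches[i - window]
        identities.append(running / window)
    return identities
```

For example, `local_identities("ACDE-FG", "ACXEHFG", window=3)` yields four window values, the last of which is 1.0 because the final three residues of the first sequence are all matched.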

Figure 4. The Distribution of Nonconservative Changes along Sequences

The maximal mismatch distance between nonconservative substitutions is much smaller in AS than in GD. The maximal mismatch distance is the number of residues between the two most distant, nonconservative substitutions, normalized by sequence length. Nonconservative mismatches have a negative value in the Blosum62 matrix [65] and were chosen for their stronger impact on protein structure and function. The plot depicts AS data in green, and GD data for families at 80% and 40% seq.id. in light violet and dark red, respectively. We observe that nonconservative substitutions in AS are much more localized than those in GD.
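
The maximal mismatch distance defined above can be sketched as follows. This is a minimal illustration, assuming two gap-free aligned sequences of equal length and normalization by that aligned length; the caption does not say which length was used. The scoring function is passed in as a parameter, and the name `max_mismatch_distance` is hypothetical.

```python
def max_mismatch_distance(seq_a, seq_b, score):
    """Normalized distance between the two most distant nonconservative
    substitutions.

    seq_a, seq_b: aligned sequences of equal length (no gaps assumed).
    score: callable (a, b) -> substitution score; a negative score marks
           a nonconservative mismatch, as with Blosum62 in the caption.
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    positions = [
        i
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b and score(a, b) < 0
    ]
    if len(positions) < 2:
        return 0.0  # fewer than two nonconservative sites: no distance
    return (positions[-1] - positions[0]) / len(seq_a)
```

In practice `score` would look up a Blosum62 matrix, for example one loaded through Biopython's `Bio.Align.substitution_matrices` module; any callable with the same shape works for the sketch.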

Figure 5. The 3-D Distribution of Physico–Chemical Changes in the Affected Residues of AS and GD

The example of human mitogen-activated protein kinase 9 (MAPK9) illustrates how differences between AS and GD in the distribution of sequence changes result in different distributions of physico–chemical properties across the 3-D structure. The original structure of MAPK9 was homology-modelled after MAPK10 and is shown in blue; the residue changes are indicated following a colour scale related to the associated difference in hydrophobicity (we use the absolute value of the difference in order to avoid too many colours; the colour scale goes from blue to red, where the latter corresponds to the largest change). For comparison purposes, the location of the AS changes in the three structures is indicated by a yellow box. As a hydrophobicity measure, we used the free energy of water to octanol transfer [77]. (A) Alternative splice isoforms of MAPK9. (B) Gene duplicates of high seq.id. (MAPK10; isoform alpha2, 84% seq.id. to MAPK9). (C) Gene duplicates of medium seq.id. (MAPK13; 46% seq.id. to MAPK9). We observe, in accordance with the results from the sequence analysis, that while AS changes are confined to a very specific location, GD changes are spread all over the protein surface. As expected, the number of changes between MAPK9 and MAPK13 is the largest. Neither of MAPK9's paralogues (MAPK10 and MAPK13) shows a set of residue changes identical to that in the alternative splice variant.

Figure 6. The Size Distribution of Insertions/Deletions in AS and GD

All analyses of indels have been made for gene families with both AS and GD (i.e., AS+/GD+). (A) AS indels are longer than GD indels. Indels for GD were obtained from the alignments of GD families at 40% (dark red) and 80% (light violet) seq.id. Information on AS indels (green) was obtained from the SwissProt record of the corresponding protein. Indel size distributions for both GD40 and GD80 are very similar, with most of the indels being shorter than five residues. In contrast, many AS indels are longer than 100 residues. (B,C) Size distribution for external and internal indels in AS and GD. External indels (B) lie at the N- or C-terminal ends of the protein; internal indels (C) lie in the middle. AS and GD40 indel sizes are different depending on the position of the indels in the sequence. While AS indels are generally larger than GD indels (also see Figure 6A), external indels (B) are larger than internal ones (C), both for AS and GD. The shift in indel sizes implies that large indels (as often introduced by AS) are better-tolerated at the N- and C-termini of proteins, where they are less likely to induce important structural changes.

Figure 7. The Overlap between AS and GD Insertions/Deletions

The overlap between AS and GD indels is very small. For the frequency distribution of the overlap between AS and GD indels, AS indels were taken as reference. GD data at 80% seq.id. are shown in light violet, while GD data at 40% seq.id. are shown in dark and light blue for both all indels and only short indels (≤30aa), respectively. Given the small overlap, AS and GD indels are likely to affect different locations in protein structure.
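
One way to quantify the overlap with AS indels as reference is sketched below. This is an assumption-laden illustration, not the paper's stated procedure: indels are represented as half-open residue intervals `(start, end)` in a shared coordinate system, and the overlap is the fraction of the AS indel covered by any GD indel. The function name `overlap_fraction` is hypothetical.

```python
def overlap_fraction(as_indel, gd_indels):
    """Fraction of an AS indel's positions that fall inside any GD indel.

    as_indel: (start, end) half-open interval of the reference AS indel.
    gd_indels: iterable of (start, end) half-open GD indel intervals,
               mapped to the same coordinates (assumption).
    """
    start, end = as_indel
    if end <= start:
        raise ValueError("AS indel must have positive length")
    covered = set()
    for gd_start, gd_end in gd_indels:
        # range() is empty when the two intervals do not intersect.
        covered.update(range(max(start, gd_start), min(end, gd_end)))
    return len(covered) / (end - start)
```

A value near 0 for most AS indels would correspond to the small overlap reported in the caption; restricting `gd_indels` to intervals of at most 30 residues would mimic the short-indel subset.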

References

    1. Lynch M, Conery JS. The evolutionary fate and consequences of duplicate genes. Science. 2000;290:1151–1155.
    2. Chothia C, Gough J, Vogel C, Teichmann SA. Evolution of the protein repertoire. Science. 2003;300:1701–1703.
    3. Koonin EV, Aravind L, Kondrashov AS. The impact of comparative genomics on our understanding of evolution. Cell. 2000;101:573–576.
    4. Graveley BR. Alternative splicing: Increasing diversity in the proteomic world. Trends Genet. 2001;17:100–107.
    5. Brett D, Pospisil H, Valcarcel J, Reich J, Bork P. Alternative splicing and genome complexity. Nat Genet. 2002;30:29–30.
