Downloaded from genome.cshlp.org on June 23, 2011 - Published by Cold Spring Harbor Laboratory Press

Itzkovitz and Alon

[...] earlier than the average alternative code (15 codons vs. 23 codons). Conservative estimates suggest that such a difference, equivalent to a relative fitness advantage of about 10⁻⁴, is readily selectable (see Methods).

Interestingly, the ability to abort translation after frame shift is closely related to the ability to include arbitrary parallel codes (Fig. 4).
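As a rough illustration of the frame-shift abort property, the sketch below estimates how often a +1 or -1 frame shift lands on a stop codon in the standard code. It assumes, purely for simplicity, that in-frame sense codons are drawn independently and uniformly; the analysis in the paper relies on real codon usages, so the value printed here is not comparable to the 15-codon figure quoted above. The names `STANDARD` and `shifted_stop_probability` are illustrative and do not come from the paper.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered U, C, A, G at each position; "*" marks a stop.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
STANDARD = {a + b + c: AMINO[16 * i + 4 * j + k]
            for i, a in enumerate(BASES)
            for j, b in enumerate(BASES)
            for k, c in enumerate(BASES)}


def shifted_stop_probability(code, usage=None):
    """Average probability that a +1 or -1 frame-shifted codon is a stop.

    Assumption (not from the paper): consecutive in-frame codons are
    independent draws from `usage`, which defaults to a uniform
    distribution over the sense codons of `code`.
    """
    sense = [c for c, aa in code.items() if aa != "*"]
    stops = {c for c, aa in code.items() if aa == "*"}
    if usage is None:
        usage = {c: 1.0 / len(sense) for c in sense}
    plus = minus = 0.0
    for c1, c2 in product(sense, repeat=2):
        p = usage[c1] * usage[c2]
        if c1[1:] + c2[0] in stops:   # +1 shift: last two bases of c1, first base of c2
            plus += p
        if c1[2] + c2[:2] in stops:   # -1 shift: last base of c1, first two bases of c2
            minus += p
    return (plus + minus) / 2


p = shifted_stop_probability(STANDARD)
print(f"shifted-stop probability per codon: {p:.4f}")
# Geometric approximation that ignores correlations between successive shifted codons.
print(f"rough expected number of codons until a stop: {1 / p:.1f}")
```

Passing a dictionary of codon frequencies as `usage` would replace the uniform assumption with a specific codon usage.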
Robustness to frame-shift errors occurs because the frame-shifted codons for abundant amino acids overlap with the stop codons, which increases the probability that a stop is encountered upon frame shift. As mentioned above, it is precisely this property that allows the real genetic code to include arbitrary sequences within protein-coding regions, including those with stop sequences, with a significantly higher probability than alternative codes.

The present optimality features are also shared by almost all of the nonuniversal codes, such as those found in mitochondria (Osawa et al. 1992; Knight et al. 2001) (see Supplemental material). For example, the fraction of alternative genetic codes with higher probabilities of encountering frame-shifted stop codons is lower than 0.05 for all nonuniversal codes except the flatworm mitochondrial code (see Supplemental Table 3). The optimality is also found for a range of different codon usages (Muto and Osawa 1987), specifically those that represent a GC content of <70% (see Supplemental material). This range of GC contents is also the range that supports the optimality of previously known features such as robustness to misread errors (Archetti 2004).

Discussion

In summary, we found that the genetic code is nearly optimal for encoding additional information in parallel to its main function of encoding the amino acid sequence of proteins. This optimality is related to the identity of the stop codons in the universal code: when frame shifted, the stop codons overlap with codons of abundant amino acids. We showed that this optimality is strongly tied to a second useful property: minimization of the effect of translational frame-shift errors. Robustness to frame-shift errors may be a reasonable inherent constraint on the early genetic code.
One may therefore propose that the ability to carry parallel codes may have emerged as a side effect that was later exploited to allow genes and mRNA molecules to support a wide range of signals to regulate and modify biological processes in cells (Kirschner et al. 2005). Alternatively, the ability to include arbitrary parallel sequences within coding regions may have contributed to the selection of the early genetic code. For example, early RNA molecules that had the ability to both specify peptides and to include sequences that conferred useful RNA structure may have had an advantage over RNAs that were less effective at simultaneously fulfilling both objectives.

Whereas many of the currently known regulatory codes reside in nontranslated regions of the genome (Robison et al. 1998; Lieb et al. 2001), the present findings support the view that protein-coding regions can carry abundant parallel codes. It would be interesting to use information-theoretical approaches (Gusev et al. 1999; Wan and Wootton 2000; Troyanskaya et al. 2002) to search for such codes in genomes.

Methods

Alternative genetic codes

The alternative genetic codes were obtained by independently permuting the nucleotides in the three codon positions while preserving the amino acid assignment (Fig. 1). These permutations preserve both the number of codons per amino acid and the effect of misread errors on the translated protein, as defined in Freeland and Hurst (1998) and Gilis et al. (2001) (Fig. 1E,F). There are 4! = 24 possible permutations of the four nucleotides. There are, therefore, 24³ = 13,824 alternative codes. We additionally impose the wobble constraint for base pairing in the third codon position, which states that any two codons differing only in U-C in the third letter cannot be distinguished by the translation apparatus (Crick 1968; Osawa et al. 1992). This results in two allowed permutations in the third letter: the identity permutation and the A↔G permutation.
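The permutation scheme described above can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it builds the standard code table, applies one nucleotide permutation per codon position, and counts the resulting codes with and without the wobble constraint. The names `STANDARD`, `ALL_PERMS`, `WOBBLE_PERMS` and `alternative_codes` are assumptions introduced here for illustration.

```python
from collections import Counter
from itertools import permutations, product

BASES = "UCAG"
# Standard genetic code, codons ordered U, C, A, G at each position; "*" marks a stop.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
STANDARD = {a + b + c: AMINO[16 * i + 4 * j + k]
            for i, a in enumerate(BASES)
            for j, b in enumerate(BASES)
            for k, c in enumerate(BASES)}

# All 4! = 24 relabelings of the four nucleotides.
ALL_PERMS = [dict(zip(BASES, p)) for p in permutations(BASES)]
# Wobble constraint: only the identity and the A<->G swap in the third position.
WOBBLE_PERMS = [dict(zip(BASES, BASES)), dict(zip(BASES, "UCGA"))]


def alternative_codes(third_perms=WOBBLE_PERMS):
    """Yield codon -> amino acid tables obtained by permuting each codon position."""
    for p1, p2, p3 in product(ALL_PERMS, ALL_PERMS, third_perms):
        yield {p1[a] + p2[b] + p3[c]: aa for (a, b, c), aa in STANDARD.items()}


codes = list(alternative_codes())
unconstrained = sum(1 for _ in alternative_codes(ALL_PERMS))
print(len(ALL_PERMS), unconstrained, len(codes))  # 24 13824 1152

# Every permuted code keeps the number of codons per amino acid.
assert all(Counter(c.values()) == Counter(STANDARD.values()) for c in codes)
```

The first table produced (identity permutation at every position) is the standard code itself, so the real code is a member of the ensemble it is compared against.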
The ensemble of alternative codes therefore contains 24 × 24 × 2 = 1152 codes. In the Supplemental material, we show that relaxing the wobble constraint does not change any of the present conclusions (Supplemental Fig. 1).

Inclusion of arbitrary sequences within protein-coding sequences

We calculated the probability of encountering every n-mer in a coding sequence for each alternative code for n = 4–25. This was done by scanning all codon combinations in all three possible frame shifts, which can include the n-mer sequence, and summing the probabilities

Figure 4. The parallel coding property is strongly tied to the translational frame-shift robustness property. Each point represents one of the alternative codes. The x-axis shows the probability of encountering a stop codon upon a frame-shifted event (average over +1 and −1 frame shift). The y-axis is the average probability of appearance of the 10% most difficult 6-mers. The arrow indicates the real code. The correlation between the two properties is 0.8. The real code is on the Pareto front, meaning that no alternative code is better than the real code in both properties. Similar results are obtained for n-mers of other sizes. Note that due to symmetries in the alternative codes with respect to the features studied (Supplemental material), multiple alternative codes often have the same values.

Genome Research, www.genome.org
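The Pareto-front statement in the Figure 4 caption can be made concrete with a small check. This is a sketch under stated assumptions: each code is reduced to a pair of scores where higher is better in both coordinates, and the numbers used are hypothetical toy values, not data from the paper. The function name `on_pareto_front` is likewise illustrative.

```python
def on_pareto_front(point, others):
    """Return True if no point in `others` is strictly better in both coordinates."""
    x, y = point
    return not any(ox > x and oy > y for ox, oy in others)


# Hypothetical toy values, NOT results from the paper. Each pair is
# (frame-shifted stop probability, mean probability of the hardest n-mers).
real = (0.07, 0.30)
alternatives = [(0.05, 0.25), (0.08, 0.20), (0.04, 0.35)]

print(on_pareto_front(real, alternatives))
```

A code can sit on the front while being beaten in one property by some alternative, as with the second and third toy points; it is excluded only when a single alternative beats it in both.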