Downloaded from genome.cshlp.org on June 23, 2011 - Published by Cold Spring Harbor Laboratory Press

Genetic code optimality for additional information

with the most abundant codons (Table 1). Therefore, n-mers with the letters UGA can be included with high probability in protein sequences without generating an in-frame stop.
The same idea applies to the other two stop codons in the real code; this property occurs in only very few of the alternative genetic codes. In short, optimality for including arbitrary n-mer sequences within coding regions is due to stop codons that do not overlap each other, but which do overlap codons for abundant amino acids.

We calculated the probability of including all n-mer sequences for each alternative genetic code by summing up, for every n-mer sequence, the probabilities of all codon combinations that contain it (Fig. 2A; for details see Methods). The codon probabilities were determined according to the known amino acid frequencies in proteins (Table 1). The results presented in the main text are for uniform codon usage, but they apply to a wide range of different codon usages (Supplemental material).

We find that the real code shows significantly higher probabilities to include arbitrary sequences. The average of the logarithm of all n-mer probabilities is significantly higher in the real code than in the vast majority of alternative codes (Table 2), with a P-value < 0.05 for n-mer sequences with n greater than seven. In addition, the real code shows significantly higher probabilities to include the most difficult sequences (n-mers with the lowest probability of appearing in a coding region) than the vast majority of alternative codes (Fig. 2E; Table 2; Supplemental Fig. 4). For example, the average probability of including the 20% most difficult sequences is exceeded by only 3% of the alternative codes for 8-mers and 1% of the alternative codes for 9-mers. This property can be seen when examining the distribution of the n-mer probabilities of appearing within protein-coding sequences. In the real code there are significantly fewer n-mers with low probabilities (Fig. 2E).

The optimality of the real genetic code relative to alternative codes seems to increase with the length of the n-mers (Fig. 2F).
This is because as the length of the n-mers increases, the fraction of n-mers that include stop codons increases dramatically. Above n = 16, more than half of all n-mers include at least one stop codon. The real genetic code is able to include all n-mers with n < 11 in at least one, and often many, combinations of amino acid codons. For n-mers of any length, the real code appears to exceed almost all of the alternative codes in its ability to include a large fraction of possible n-mers within coding regions (Fig. 2F; Table 2).

Robustness to translational frame-shift errors

How did such near optimality for parallel codes evolve? One possibility is that the ability to include parallel codes within protein-coding sequences conferred a selection advantage during the early evolution of the genetic code. Alternatively, the genetic code might have been fixed in evolution before most parallel codes existed. We therefore sought a different selection pressure on the code, which could have existed in the early stages of the evolution of the genetic code. One such inherent feature of protein translation is frame-shift translation errors (Parker 1989; Farabaugh and Bjork 1999; Seligmann and Pollock 2004). In these errors, the ribosome shifts the reading frame, either forward or backward. This results in a nonsense translated peptide, and usually loss of protein function. These errors occur in ribosomes nearly as frequently as misread errors (3 × 10^-5 per codon, compared with misread errors of 10^-4 per codon [Parker 1989]). These errors have a relatively large effect on fitness because they result in a nonsense polypeptide. Frame-shift errors may thus pose a selectable constraint on the genetic code: Codes that are able to abort translation more rapidly following frame-shift errors have an advantage (Seligmann and Pollock 2004).

To abort translation after a frame shift, the ribosome must encounter a stop codon in the shifted frame.
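As a concrete illustration of a stop codon appearing in a shifted frame, the following minimal Python sketch scans a made-up mRNA string in each of the three reading frames. The sequence and the function name `first_stop_in_frame` are hypothetical and chosen only for illustration; they do not come from the paper.

```python
# Stop codons of the standard genetic code (RNA alphabet).
STOPS = {"UAA", "UAG", "UGA"}

# Hypothetical toy message: AUG CUG ACC UAG in the original frame.
MRNA = "AUGCUGACCUAG"


def first_stop_in_frame(mrna, offset):
    """Return the index (counted in codons) of the first stop codon when
    reading `mrna` from nucleotide `offset`, or None if no stop is met."""
    for codon_index, start in enumerate(range(offset, len(mrna) - 2, 3)):
        if mrna[start:start + 3] in STOPS:
            return codon_index
    return None


if __name__ == "__main__":
    for offset in (0, 1, 2):
        print(offset, first_stop_in_frame(MRNA, offset))
```

For this toy sequence the original frame (offset 0) reaches its stop at codon index 3, a shift by one nucleotide (offset 1) reads UGC UGA and stops at codon index 1, and offset 2 contains no stop at all.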
It has been suggested that codon usage in some organisms may be biased toward codons that can form stop codons upon translational frame shift (Seligmann and Pollock 2004). Here, we consider whether robustness to translational frame-shift errors may be linked to the structure of the genetic code. We tested all alternative codes for the mean probability of encountering a stop in a frame-shifted protein-coding message. We find that the real genetic code encounters a stop more rapidly on average than 99.3% of the alternative codes (Fig. 3). The real code aborts translation eight codons ear-

Table 1. Amino acid abundance (average amino acid frequency over 134 organisms, sorted in decreasing order by codon abundance)

Amino acid   Abundance   # codons   Codon abundance
Glu          6.5         2          3.2
Lys          6.0         2          3.0
Asp          5.3         2          2.6
Met          2.3         1          2.3
Ile          6.8         3          2.3
Asn          4.4         2          2.2
Phe          4.3         2          2.1
Ala          8.2         4          2.0
Gln          3.6         2          1.8
Gly          6.9         4          1.7
Val          6.9         4          1.7
Leu          10.1        6          1.7
Tyr          3.3         2          1.6
Thr          5.3         4          1.3
Trp          1.1         1          1.1
Ser          6.5         6          1.1
Pro          4.3         4          1.1
His          2.1         2          1.0
Arg          5.2         6          0.9
Cys          1.1         2          0.6

Codon abundance is the amino acid frequency divided by the number of codons for that amino acid.

Table 2. Significance of the genetic code in representing arbitrary sequences

n-mer size   P-value, average log-probabilities   P-value, 20%
5            0.110                                0.054
6            0.097                                0.045
7            0.083                                0.028
8            0.049                                0.031
9            0.043                                0.010
12           0.028                                0.004
15           0.016                                0.004
18           0.012                                0.006
20           0.026                                0.006
22           0.021                                0.004
25           0.029                                0.009

Shown are the fractions of alternative codes for which the average of the logarithm of the probabilities of all n-mers is equal to or higher than that of the real code. Also shown are the fractions of alternative genetic codes for which the average probability of the 20% most-difficult n-mer sequences is equal to or higher than in the real genetic code. Similar results are obtained for larger fractions of the most difficult n-mer sequences. Results for n > 8 are based on 10^5 randomly sampled n-mers.
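The statement in the main text that the fraction of n-mers containing a stop codon grows quickly with n can be checked by direct counting. The sketch below is not the authors' code; it assumes that "include a stop codon" means that one of the triplets UAA, UAG, or UGA occurs anywhere in the n-mer, in any frame, and it counts stop-free n-mers with a small dynamic program over the last two nucleotides. The function name `fraction_with_stop` is hypothetical.

```python
from itertools import product

BASES = "ACGU"
STOPS = {"UAA", "UAG", "UGA"}


def fraction_with_stop(n):
    """Fraction of all 4**n n-mers that contain at least one stop triplet
    (UAA, UAG, or UGA) at any position."""
    if n < 3:
        return 0.0
    # counts[s]: number of stop-free strings of the current length
    # whose last two nucleotides are s (starts at length 2).
    counts = {a + b: 1 for a, b in product(BASES, repeat=2)}
    for _ in range(n - 2):
        extended = dict.fromkeys(counts, 0)
        for suffix, count in counts.items():
            for base in BASES:
                if suffix + base not in STOPS:
                    extended[suffix[1] + base] += count
        counts = extended
    return 1.0 - sum(counts.values()) / 4 ** n


if __name__ == "__main__":
    for n in (5, 8, 12, 16, 20, 25):
        print(n, round(fraction_with_stop(n), 3))
```

Under this counting assumption the fraction is 3/64 for n = 3 and first exceeds one half at n = 16, which is in line with the threshold quoted in the text.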
Genome Research 409 www.genome.org
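The idea of scoring a code by how readily a frame-shifted message meets a stop can be sketched for the real code alone. The Python below is an illustrative approximation, not the procedure from the paper's Methods: it assumes codons are drawn independently, with each amino acid's Table 1 abundance split evenly among its codons (uniform codon usage), and it computes the probability that the triplet straddling two adjacent codons is a stop. The function names are hypothetical; the codon table is the standard genetic code.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
# ('*' marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {codon for codon, aa in CODE.items() if aa == "*"}

# Amino acid abundances from Table 1, keyed by one-letter code.
FREQ = {"E": 6.5, "K": 6.0, "D": 5.3, "M": 2.3, "I": 6.8, "N": 4.4, "F": 4.3,
        "A": 8.2, "Q": 3.6, "G": 6.9, "V": 6.9, "L": 10.1, "Y": 3.3, "T": 5.3,
        "W": 1.1, "S": 6.5, "P": 4.3, "H": 2.1, "R": 5.2, "C": 1.1}


def codon_probabilities():
    """Probability of each sense codon: amino acid abundance divided by its
    number of codons (the 'codon abundance' of Table 1), normalized to 1."""
    n_codons = {aa: sum(1 for a in CODE.values() if a == aa) for aa in FREQ}
    raw = {c: FREQ[aa] / n_codons[aa] for c, aa in CODE.items() if aa != "*"}
    total = sum(raw.values())
    return {c: weight / total for c, weight in raw.items()}


def shifted_stop_probability(shift):
    """Probability that the triplet starting `shift` nucleotides into a pair
    of adjacent in-frame codons is a stop (shift 1 or 2; shift 0 gives 0)."""
    probs = codon_probabilities()
    return sum(p1 * p2
               for c1, p1 in probs.items()
               for c2, p2 in probs.items()
               if (c1 + c2)[shift:shift + 3] in STOPS)


if __name__ == "__main__":
    for shift in (1, 2):
        p = shifted_stop_probability(shift)
        print(f"shift {shift}: P(stop) = {p:.4f}, rough mean wait = {1 / p:.1f} codons")
```

The quantity 1/p is only a rough estimate of how many shifted codons are read before a stop, because neighbouring shifted triplets share a codon and are therefore not independent. Comparing codes as in the text would require repeating this calculation for each alternative code, which this sketch does not attempt.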