Downloaded from genome.cshlp.org on June 23, 2011 - Published by Cold Spring Harbor Laboratory Press
Itzkovitz and Alon
coding regions have no in-frame stop codons. The sequence can, however, appear in one of the two other frames. Overall, the probability that this 5-mer appears in coding regions will tend to be lower than that of 5-mers that do not include stop codons. Each genetic code has n-mer sequences, such as the abovementioned sequence UGACA in the real genetic code, which are difficult to include in coding regions: these “difficult” sequences contain stop codons, and thus cannot appear in at least one of the three frames, since protein-coding regions do not contain stop codons.
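The frame argument above can be checked mechanically. The following is a minimal sketch, not the authors' code: the names `REAL_STOPS` and `allowed_frames` are illustrative assumptions, and the only rule applied is the one stated in the text, that a coding region contains no in-frame stop codon.

```python
# Stop codons of the real genetic code (RNA alphabet), as listed in the text.
REAL_STOPS = frozenset({"UAA", "UAG", "UGA"})


def allowed_frames(nmer, stops=REAL_STOPS):
    """Return the frame offsets (0, 1, 2) in which `nmer` has no in-frame stop.

    Offset f means the first base of `nmer` sits at codon position f, so the
    complete codons inside the n-mer start at indices i with (i + f) % 3 == 0.
    """
    frames = []
    for offset in range(3):
        first = (-offset) % 3  # index of the first complete codon in this frame
        codons = [nmer[i:i + 3] for i in range(first, len(nmer) - 2, 3)]
        if not any(codon in stops for codon in codons):
            frames.append(offset)
    return frames


# UGACA is excluded only from the frame in which UGA is read as a codon.
print(allowed_frames("UGACA"))  # prints [1, 2]
```

The `stops` argument accepts any alternative set of three stop codons, so the same check applies to hypothetical genetic codes as well as to the real one.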
We find that the real genetic code is able to include even the most difficult n-mers because it has a special property: its stop codons, when frame shifted, tend to form abundant codons. Hence, n-mers that cannot be included in one frame shift can be included with high probability in other frame shifts.

To understand the relation between the stop codons and the ability of the genetic code to include arbitrary n-mers, consider the 5-mer S = AAAAA (Fig. 2C). This 5-mer can appear within a coding sequence in one of the three reading frames: AAA|AAN, NNA|AAA|ANN, or NAA|AAA. Alternative genetic codes that assign one of their stop codons as AAA (Fig. 3D) can never include S in a protein-coding sequence. The problem is that the stop codon AAA overlaps with itself when frame shifted; hence, strings such as S include a stop codon in each of the three frames, precluding their presence in a coding region. Another example is the 5-mer S = CCGGU. In an alternative code with stop codons CCA, CCG, and CGG, this n-mer can only appear in one of the three reading frames (Fig. 2D). This is because two of the stop codons, CCG and CGG, overlap each other. In contrast, the real genetic code has the stop codons UAA, UAG, and UGA, which do not overlap with themselves or with each other, no matter how they are frame shifted. Furthermore, frame-shifted versions of the real stop codons overlap with the codons of the most abundant amino acids. For example, the UGA stop codon in a −1 frame-shift message results in the di-codon NNU|GAN, where N is any nucleotide (Fig. 2B). The GAN codons encode Asp and Glu, which are among the three amino acids

Figure 3. Optimality of the genetic code for minimizing the impact of frame-shift translation errors. (A) Distribution of the average number of translated codons until a stop codon is encountered after a frame-shift event for the alternative genetic codes.
This number corresponds to the mean length of the nonsense polypeptide translated after a frame-shift event, and is the inverse of the frame-shifted stop probability, averaged over the +1 and −1 frame-shifts. (B) In the real code, frame-shifted stop codons overlap with abundant codons. Codons with a two-letter overlap with a stop codon are marked by + for a +1 frame-shift and – for a −1 frame-shift. Abundant codons are shown in heavier font. For example, the stop codon UAA, when frame shifted, results in codons such as AAN (green box) or NUA (blue boxes), which are relatively abundant. (C) The “best code,” which achieves the highest frame-shifted stop probability in both a +1 frame-shift and a −1 frame-shift. Stop codons are CAA, CAG, and CGA. In the “best code,” a stop codon has an overlap of two positions with codons of Glycine, instead of codons of Serine and Arginine as in the real code. (D) The “worst code,” with the lowest frame-shifted stop probability. Stop codons are AUA, AUG, and AAA. Note that the stop codons overlap either with themselves (AAA) or with codons for nonabundant amino acids (those with light font), in contrast to B and C.

408 Genome Research www.genome.org
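The inverse relation in panel A can be illustrated with a small calculation. This is a sketch under a strong simplifying assumption that is not the paper's: every sense codon is taken to be equally likely, whereas the figure is based on actual codon abundances. The function name `shifted_stop_probability` and its `shift` parameter are hypothetical. Each pair of adjacent sense codons is read in one of the two shifted frames, and the fraction of pairs that yields a stop codon is counted.

```python
from itertools import product

BASES = "UCAG"
REAL_STOPS = frozenset({"UAA", "UAG", "UGA"})


def shifted_stop_probability(stops, shift):
    """Fraction of sense di-codons whose shifted reading is a stop codon.

    shift=1 reads bases 2-4 of the six-base di-codon; shift=2 reads bases 3-5.
    Assumption (not from the paper): all sense codons are equally likely.
    """
    all_codons = ["".join(p) for p in product(BASES, repeat=3)]
    sense = [c for c in all_codons if c not in stops]
    hits = sum((first + second)[shift:shift + 3] in stops
               for first, second in product(sense, repeat=2))
    return hits / len(sense) ** 2


# Average the two shifted frames, then invert to get the mean number of
# codons translated before a stop is reached (mean of a geometric distribution).
p = (shifted_stop_probability(REAL_STOPS, 1)
     + shifted_stop_probability(REAL_STOPS, 2)) / 2
mean_length = 1 / p
print(f"p = {p:.4f}, mean nonsense length = {mean_length:.1f} codons")
```

Under uniform codon usage the differences between codes are muted. The ranking of the real, “best,” and “worst” codes in the figure depends on which codons are abundant, which this sketch deliberately leaves out.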