Downloaded from genome.cshlp.org on June 23, 2011 - Published by Cold Spring Harbor Laboratory Press

Letter

The genetic code is nearly optimal for allowing additional information within protein-coding sequences

Shalev Itzkovitz1,2 and Uri Alon1,2,3

1 Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot 76100, Israel; 2 Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel

DNA sequences that code for proteins need to convey, in addition to the protein-coding information, several different signals at the same time.
These “parallel codes” include binding sequences for regulatory and structural proteins, signals for splicing, and RNA secondary structure. Here, we show that the universal genetic code can efficiently carry arbitrary parallel codes much better than the vast majority of other possible genetic codes. This property is related to the identity of the stop codons. We find that the ability to support parallel codes is strongly tied to another useful property of the genetic code: minimization of the effects of frame-shift translation errors. Whereas many of the known regulatory codes reside in nontranslated regions of the genome, the present findings suggest that protein-coding regions can readily carry abundant additional information.

[Supplemental material is available online at www.genome.org.]

The genetic code is the mapping of 64 three-letter codons to 20 amino acids and a stop signal (Woese 1965; Crick 1968; Knight et al. 2001). The genetic code has been shown to be nonrandom in at least two ways. First, the assignment of amino acids to codons appears to be optimal for minimizing the effect of translational misread errors. This optimality is achieved by mapping close codons (codons that differ by one letter) to either the same amino acids or to chemically related ones (Woese 1965). This feature has been attributed to an adaptive selection of a code, so that errors that misread a codon by one letter would result in minimal effects on the translated protein (Freeland and Hurst 1998; Freeland et al. 2000; Gilis et al. 2001; Wagner 2005b). Second, amino acids with simple chemical structure tend to have more codons assigned to them (Hasegawa and Miyata 1980; Dufton 1997; Di Giulio 2005). There exist a large number of alternative genetic codes that are equivalent to the real code in these two prominent features (Fig. 1). Here we ask whether the real code stands out among these alternative codes as being optimal for other properties.
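
As a rough illustration of the first property, the sketch below tabulates the standard genetic code and counts, for each codon position, how many single-letter changes of a sense codon leave the encoded amino acid unchanged. It is only a minimal sketch: it counts a change as harmless only when the amino acid is identical and ignores chemical similarity between amino acids, so it is not the error measure used in the studies cited above. The names `CODON_TABLE`, `single_letter_neighbors`, and `synonymous_counts` are illustrative and do not come from the paper.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in the conventional UCAG ordering; '*' marks a stop codon.
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def single_letter_neighbors(codon):
    """Yield (position, neighbor) for the nine codons that differ from `codon` by one letter."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield pos, codon[:pos] + base + codon[pos + 1:]


def synonymous_counts():
    """Count, per codon position, single-letter changes of sense codons that keep the amino acid."""
    counts = [0, 0, 0]
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # skip stop codons: only sense codons are misread into proteins
            continue
        for pos, neighbor in single_letter_neighbors(codon):
            if CODON_TABLE[neighbor] == aa:
                counts[pos] += 1
    return counts


if __name__ == "__main__":
    counts = synonymous_counts()
    n_changes = 9 * sum(aa != "*" for aa in CODON_TABLE.values())
    print("synonymous changes per position:", counts)
    print("overall synonymous fraction:", sum(counts) / n_changes)
```

Running the script prints the per-position counts and the overall fraction of silent single-letter changes; the silent changes concentrate at the third codon position.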
We consider the ability of the genetic code to support, in addition to the protein-coding sequence, additional information that can carry biologically meaningful signals. These signals can include binding sequences of regulatory proteins that bind within coding regions (Robison et al. 1998; Stormo 2000; Lieb et al. 2001; Kellis et al. 2003). Such binding sites are typically sequences of length 6–20 bp. In addition to regulatory proteins, there are binding sites of structural proteins such as DNA- and mRNA-binding proteins (Draper 1999). Histones, for example, bind with a code that has a periodicity of about 10 bp over a site of about 150 bp (Satchwell et al. 1986; Trifonov 1989; Segal et al. 2006). Other codes include splicing signals (Cartegni et al. 2002) that include specific 6–8 bp sequences within coding regions and mRNA secondary structure signals (Zuker and Stiegler 1981; Shpaer 1985; Konecny et al. 2000; Katz and Burge 2003). The latter often correspond to sequences of several dozen base pairs or longer. Since we do not know all of these additional codes, and different organisms can use a vast array of different codes, we tested the ability of the genetic code to support arbitrary sequences of any length in parallel to the protein-coding sequence.

We find that the universal genetic code can allow arbitrary sequences of nucleotides within coding regions much better than the vast majority of other possible genetic codes. We further find that the ability to support parallel codes is strongly correlated with an additional property: minimization of the effects of frame-shift translation errors. Selection for either or both of these traits may have helped to shape the universal genetic code.

Results

Ability to include additional sequences

We first considered the ability of the genetic code to support, in addition to the protein-coding sequence, additional sequences that can carry biological signals.
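
One simple way to make the idea of a sequence carried in parallel to a coding sequence concrete is to ask in which reading frames a short nucleotide sequence would place a stop codon in frame. The sketch below does this for the stop codons of the standard code, under the illustrative assumption that a frame is unusable whenever the sequence contains a complete in-frame stop codon. It is not the scoring procedure of the paper, and the function name `frames_blocked_by_stop` is hypothetical.

```python
# Stop codons of the standard genetic code (RNA alphabet).
STOP_CODONS = frozenset({"UAA", "UAG", "UGA"})


def frames_blocked_by_stop(seq, stop_codons=STOP_CODONS):
    """For each of the three reading frames, report whether `seq` contains an in-frame stop.

    Frame `offset` means that the first `offset` letters of `seq` belong to a
    codon that started upstream, so complete codons begin at index `offset`.
    Only codons lying entirely inside `seq` are examined.
    """
    blocked = []
    for offset in range(3):
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        blocked.append(any(codon in stop_codons for codon in codons))
    return blocked


if __name__ == "__main__":
    for example in ("UGACA", "AUGGCC"):
        print(example, frames_blocked_by_stop(example))
```

For the 5-mer UGACA, only the frame in which the sequence starts at a codon boundary reads the stop codon UGA; the other two frames read GAC and ACA and are not blocked under this assumption.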
For this purpose, we studied the properties of all alternative genetic codes that share the known optimality features of the real code (Fig. 1). Each alternative code has the same number of codons per each amino acid and the same impact of misread errors as in the real code. We tested the ability of the genetic codes to include arbitrary sequences, denoted n-mers, within protein-coding regions. As an example, consider the 5-mer “UGACA.” This sequence may be a protein-binding site, which should appear within a protein-coding region. This 5-mer sequence can appear within a coding sequence in one of the three reading frames: UGA|CAN,

3 Corresponding author. E-mail uri.alon@weizmann.ac.il; fax 972-8-934125.
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5987307. Freely available online through the Genome Research Open Access option.

Genome Research 17:405–412 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07; www.genome.org