Codon usage
Codon usage, or codon preference, is a statistical property of protein-coding DNA sequences, i.e. open reading frames. Because the genetic code is degenerate, most amino acids can be encoded by several codons. Natural genes do not use these synonymous codons at random, but show a preference for particular codons for the same amino acid.
In the same sense, the average codon usage of whole genomes is strongly biased. Each individual genome uses a preferred set of codons.
The Kazusa codon usage database contains codon usage tables created from complete genomes found in GenBank.
Codon usage preference is an important factor in gene expression. It has been shown to correlate with the abundance of the tRNAs for a given amino acid: more frequent codons tend to have more abundant corresponding tRNAs. In general, a gene should therefore favor the frequent codons of the expression host in order to increase expression efficiency.
Note that the average codon usage, computed from all open reading frames of a genome, may not be optimal, because it includes codons from very weakly expressed or highly regulated genes. Therefore, if possible, it is advisable to create a codon usage table from a set of genes that are known to be well expressed.
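Building such a table from a chosen set of genes amounts to counting codons. The following is a minimal sketch, assuming the input sequences are already in frame; the function name codon_usage and the two toy sequences are hypothetical and stand in for a real set of well-expressed genes.

```python
from collections import Counter

def codon_usage(orfs):
    """Count codons over a set of in-frame coding sequences.

    Returns {codon: (frequency per thousand codons, absolute count)}.
    A trailing partial codon is ignored.
    """
    counts = Counter()
    for seq in orfs:
        seq = seq.upper().replace("T", "U")  # report codons as RNA
        usable = len(seq) - len(seq) % 3
        for i in range(0, usable, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: (1000.0 * n / total, n) for codon, n in counts.items()}

# Hypothetical toy input, not real genes.
genes = ["ATGAAAGCGCTGTAA", "ATGGCGAAACTGAAATAA"]
table = codon_usage(genes)
```

With real data, the input would be the coding sequences of the selected genes, and the result could be written out in the same per-thousand format as a genome-wide table.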
A codon usage table typically lists every codon, followed by its frequency per thousand codons and, in parentheses, its absolute count. For instance, this is the codon usage table of E. coli, as downloaded from the Kazusa database:
UUU 24.4( 56791)  UCU 13.1( 30494)  UAU 21.6( 50400)  UGU  5.9( 13662)
UUC 13.9( 32513)  UCC  9.7( 22637)  UAC 11.7( 27239)  UGC  5.5( 12777)
UUA 17.4( 40627)  UCA 13.1( 30502)  UAA  2.0(  4664)  UGA  1.1(  2674)
UUG 12.9( 30084)  UCG  8.2( 19071)  UAG  0.3(   751)  UGG 13.4( 31207)

CUU 14.5( 33816)  CCU  9.5( 22121)  CAU 12.4( 28919)  CGU 15.9( 37134)
CUC  9.5( 22074)  CCC  6.2( 14379)  CAC  7.3( 17117)  CGC 14.0( 32720)
CUA  5.6( 12951)  CCA  9.1( 21237)  CAA 14.4( 33607)  CGA  4.8( 11216)
CUG 37.4( 87261)  CCG 14.5( 33795)  CAG 26.7( 62329)  CGG  7.9( 18434)

AUU 29.6( 68942)  ACU 13.1( 30518)  AAU 29.3( 68348)  AGU 13.2( 30749)
AUC 19.4( 45213)  ACC 18.9( 44139)  AAC 20.3( 47233)  AGC 14.3( 33255)
AUA 13.3( 31065)  ACA 15.1( 35293)  AAA 37.2( 86726)  AGA  7.1( 16583)
AUG 23.7( 55356)  ACG 13.6( 31794)  AAG 15.3( 35652)  AGG  4.0(  9238)

GUU 21.6( 50261)  GCU 18.9( 44034)  GAU 33.7( 78663)  GGU 23.7( 55283)
GUC 13.1( 30515)  GCC 21.6( 50411)  GAC 17.9( 41619)  GGC 20.6( 47962)
GUA 13.1( 30461)  GCA 23.0( 53619)  GAA 35.1( 81727)  GGA 13.6( 31729)
GUG 19.9( 46309)  GCG 21.1( 49169)  GAG 19.4( 45154)  GGG 12.3( 28720)
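A table in this layout can be read with a single regular expression. The sketch below assumes each entry has the form "codon frequency(count)" as shown above; the names ENTRY and parse_codon_table are illustrative and not part of any published tool.

```python
import re

# One entry looks like "UUU 24.4( 56791)": codon, per-thousand
# frequency, then the absolute count in parentheses.
ENTRY = re.compile(r"([ACGU]{3})\s+(\d+(?:\.\d+)?)\(\s*(\d+)\)")

def parse_codon_table(text):
    """Return {codon: (frequency per thousand, absolute count)}."""
    return {
        codon: (float(freq), int(count))
        for codon, freq, count in ENTRY.findall(text)
    }

# Three entries copied from the E. coli table above.
sample = "UUU 24.4( 56791) UCU 13.1( 30494) UAG 0.3( 751)"
table = parse_codon_table(sample)
```

Because the pattern tolerates any amount of whitespace, the same function should work whether the entries arrive on one line or in rows of four.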
This format is very useful for importing into various software packages, but somewhat difficult to interpret since it does not separate codons according to the encoded amino acid. Therefore, Entelechon has written a converter that prints the relative codon frequencies for each amino acid.
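The idea behind such a conversion can be sketched as follows. This is not Entelechon's converter, only an illustration under the assumption of the standard genetic code; the names GENETIC_CODE and relative_usage are hypothetical, and every amino acid in the input is assumed to have a nonzero total frequency.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# for the first, second and third base; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(bases): aa
    for bases, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def relative_usage(per_thousand):
    """Group codons by amino acid and normalise each group to sum to 1."""
    grouped = {}
    for codon, freq in per_thousand.items():
        grouped.setdefault(GENETIC_CODE[codon], {})[codon] = freq
    return {
        aa: {c: f / sum(codons.values()) for c, f in codons.items()}
        for aa, codons in grouped.items()
    }

# Lysine and phenylalanine entries copied from the E. coli table above.
rel = relative_usage({"AAA": 37.2, "AAG": 15.3, "UUU": 24.4, "UUC": 13.9})
```

For lysine, for example, AAA then accounts for 37.2 / (37.2 + 15.3), or roughly 71%, of the lysine codons.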
One important application of codon usage tables is gene optimization, i.e. the selection of codons for a given protein sequence for the purpose of increasing the expression efficiency. Entelechon’s backtranslation tool allows you to translate a protein sequence into a DNA sequence with optimal codon usage for a given expression host. For a more sophisticated gene optimization, Entelechon has created a dedicated software package called Leto.
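A very simple form of backtranslation can be sketched as below. It is not the algorithm used by Entelechon's tools; it assumes that each codon is drawn at random in proportion to its frequency in the host, and the names USAGE and backtranslate are hypothetical.

```python
import random

# Per-thousand frequencies copied from the E. coli table above, for
# three amino acids only; a real table would cover all twenty.
USAGE = {
    "M": {"AUG": 23.7},
    "K": {"AAA": 37.2, "AAG": 15.3},
    "F": {"UUU": 24.4, "UUC": 13.9},
}

def backtranslate(protein, usage, rng=None):
    """Pick one codon per residue, weighted by host codon frequency."""
    rng = rng or random.Random()
    codons = []
    for aa in protein:
        choices = usage[aa]  # raises KeyError for residues not in the table
        picked = rng.choices(list(choices), weights=list(choices.values()))[0]
        codons.append(picked)
    return "".join(codons).replace("U", "T")  # report DNA rather than RNA

dna = backtranslate("MKFK", USAGE, random.Random(1))
```

Passing a seeded random.Random makes the result reproducible; without it, repeated calls generally return different but equally valid DNA sequences.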
Note that when optimizing a gene sequence for expression, you should not just use the most frequent codon for each amino acid, for two reasons. First, this would lead to a strong bias in the nucleotide selection, often creating artificially high or low GC contents and repetitive motifs. Second, in most cases other codons besides the most frequent one are represented with a reasonably high frequency, and discarding those would lead to a suboptimal exploitation of the pool of available tRNAs.
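The first effect is easy to check numerically. The sketch below compares the GC content of a short leucine stretch written with only CUG, the most frequent leucine codon in the E. coli table above, against a stretch that mixes leucine codons. The function name gc_content and the two example sequences are made up for illustration.

```python
def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

# Six leucine codons each: only the top codon vs. a mix of codons.
only_top = "CUG" * 6
mixed = "CUGUUACUUCUGUUGCUC"

gc_top = gc_content(only_top)
gc_mixed = gc_content(mixed)
```

Here the single-codon stretch is two-thirds G and C, and it is also a perfect repeat, whereas the mixed stretch has a lower GC content (8 of 18 bases) and no repeated motif of that kind.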
Therefore, a gene containing only the most abundant codons would actually perform worse than a gene that uses all but the least frequent codons. As a rule of thumb, you should discard only those codons that occur with less than 50% of the theoretical frequency, i.e. the frequency each codon would have if all codons for that amino acid were used equally. For example, for an amino acid encoded by four codons, the theoretical frequency for each would be 25%, so any codon used for less than 12.5% of that amino acid should be discarded.
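The rule of thumb can be written as a small filter. This is a sketch under the assumption that the input holds the frequencies of all codons for one amino acid; the name usable_codons and the cutoff parameter are illustrative. The leucine values are copied from the E. coli table above.

```python
def usable_codons(freqs, cutoff=0.5):
    """Keep codons used at least `cutoff` times as often as the uniform share.

    `freqs` maps every codon of ONE amino acid to its frequency.
    With four codons the uniform share is 25%, so the default cutoff
    gives a threshold of 12.5%.
    """
    total = sum(freqs.values())
    threshold = cutoff / len(freqs)
    return sorted(c for c, f in freqs.items() if f / total >= threshold)

# Leucine has six codons, so the threshold is 0.5 / 6, about 8.3%.
LEU = {"UUA": 17.4, "UUG": 12.9, "CUU": 14.5,
       "CUC": 9.5, "CUA": 5.6, "CUG": 37.4}
kept = usable_codons(LEU)  # CUA (about 5.8% of leucine codons) is dropped
```

The remaining five codons would then all stay available to an optimizer, rather than only CUG.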