
Codon usage and gene synthesis

In many cases, the wish to adapt the codon usage of a given gene is the main reason for ordering a synthetic gene. More often than not, however, important principles of codon bias are ignored in the gene design process. This short article clarifies some of the issues that arise in codon optimization.

Most biologists take for granted that individual species have a more or less distinct codon bias, and that in order to express a gene efficiently, the gene should follow that same codon bias.

As a first approximation, this is true, and it works surprisingly well in many organisms. For instance, genes with a codon frequency matching that of Escherichia coli usually show very good expression yields in E. coli compared with non-adapted genes from foreign organisms.

However, the relationship is not as direct as it may seem. First, the pool of tRNA molecules often correlates with the codon bias, but not perfectly. And while it can be assumed that diffusion of tRNAs to the ribosome is one important choke point for protein synthesis, sometimes other factors are much more important, in particular the folding of the nascent protein chain.

Speaking of which: in several instances, it has been shown that a protein domain folds correctly only while the remaining part of the protein is still absent. A brief pause in translation is therefore required to give such a domain time to fold. One way of achieving such a delay is to incorporate rare codons. This shows that low-frequency codons are sometimes not only desirable but essential for the production of a correctly folded protein.

And second, some organisms don’t show much of a codon preference at all. It seems there are species which aren’t very picky about the codons they process in protein synthesis. In such cases, blindly optimizing against a more or less random average codon usage will not do much good. The difficulty is knowing whether a particular organism belongs to this group or not (but see below).

Now what does this mean for your protein? If no kinetic data on its translation and folding is available (and as Murphy has it, it never is for your protein), the best educated guess is still to go for high-abundance codons. Which brings me to the next point: how can the optimal codon frequency be determined? Well, the answer isn’t as obvious as it seems. The very useful Kazusa codon usage database lists the codon frequencies of virtually all organisms found in GenBank. However, care must be taken to interpret the data correctly.
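As a rough illustration of "going for high abundance codons", the following Python sketch picks the most frequent codon for each amino acid from a table of codon counts and uses it to back-translate a protein. The function names and the toy counts are invented for this example and are not taken from any particular database or tool. It is a naive baseline that deliberately ignores everything else discussed in this article, such as rare codons needed for folding.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def preferred_codons(codon_counts):
    """Return the most frequent codon for each amino acid in codon_counts."""
    best = {}
    for codon, count in codon_counts.items():
        aa = GENETIC_CODE[codon]
        if aa not in best or count > codon_counts[best[aa]]:
            best[aa] = codon
    return best


def back_translate(protein, codon_counts):
    """Encode a protein using only the most abundant codon per amino acid."""
    best = preferred_codons(codon_counts)
    return "".join(best[aa] for aa in protein)


# Hypothetical counts for a handful of codons, for illustration only.
toy_counts = {"ATG": 50, "AAA": 70, "AAG": 30, "CTG": 90, "TTA": 10, "CTT": 20}
print(back_translate("MKL", toy_counts))  # ATGAAACTG
```

In practice the counts would come from a carefully chosen set of coding sequences, which is exactly where the difficulties described next come in.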

First, the codon usages of many organisms are based on very few coding sequences (CDS). If a codon usage is derived from, say, 7 CDS, it really isn’t very meaningful. In such cases, it is more reliable to move up the phylogenetic tree and look for closely related organisms for which more data is available.
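To make the sample-size point concrete, here is a minimal Python sketch that builds a codon usage table from a list of coding sequences and flags the table as unreliable when too few sequences went into it. The threshold MIN_CDS and all function names are assumptions made for this example; the article itself only says that 7 CDS are too few and does not name a cutoff.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

MIN_CDS = 50  # arbitrary illustrative threshold, not a recommended value


def codon_counts(cds_list):
    """Count codons over all sequences, ignoring incomplete trailing codons."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in GENETIC_CODE:  # skips codons with ambiguous bases
                counts[codon] += 1
    return counts


def relative_usage(counts):
    """Convert counts to fractions among the synonymous codons of each amino acid."""
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[GENETIC_CODE[codon]] += n
    return {codon: n / totals[GENETIC_CODE[codon]] for codon, n in counts.items()}


def is_reliable(cds_list, min_cds=MIN_CDS):
    """Crude check: was the table built from enough coding sequences?"""
    return len(cds_list) >= min_cds


# Two made-up coding sequences, far too few for a meaningful table.
toy_cds = ["ATGAAAAAGAAATAA", "ATGAAGCTGTAA"]
usage = relative_usage(codon_counts(toy_cds))
print(usage["AAA"], is_reliable(toy_cds))  # 0.5 False
```

When the check fails, the same functions can be run on the pooled CDS of closely related organisms instead.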

Second, keep in mind that the Kazusa database averages over all CDS it finds, treating all of them equally. Now, for many organisms the available CDS have a strong bias, due to the preferences of the researchers annotating them. And even if that weren’t the case, clearly not all CDS are created equal: the codon bias of a rarely expressed regulatory gene cannot be compared to that of a constitutively produced housekeeping protein.

Upon closer inspection, it turns out that many weakly or irregularly expressed genes in fact have a codon usage that deviates considerably from the ‘consensus codon usage’ of their organism. It seems that, on top of the natural evolutionary drift of these genes, which are under weak selection pressure for high expression yields, nature sometimes deliberately lowers their expression level by picking rare codons.

So it would be a good idea to base the codon usage table on abundant and strongly expressed genes only. However, even that strategy can fail, as some organisms seem to have two or more distinct codon usages for highly expressed genes. In other words, their housekeeping genes segregate into two clusters, each with a distinct codon usage. Averaging over both clusters would yield a codon usage table that matches neither of them.
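A small numerical sketch in Python may help to show why averaging over two clusters is misleading. The per-gene values below are invented, and splitting at the largest gap is only a crude one-dimensional stand-in for a real clustering of full codon usage profiles.

```python
from statistics import mean

# Hypothetical fraction of AAA among lysine codons (AAA vs. AAG),
# one value per highly expressed gene.
cluster_a = [0.90, 0.88, 0.92]  # genes that strongly prefer AAA
cluster_b = [0.10, 0.12, 0.08]  # genes that strongly prefer AAG


def split_at_largest_gap(values):
    """Split sorted values into two groups at the widest gap between neighbours."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


all_genes = cluster_a + cluster_b
pooled = mean(all_genes)
low, high = split_at_largest_gap(all_genes)

print(f"pooled AAA fraction: {pooled:.2f}")         # about 0.50
print(f"cluster means: {mean(low):.2f}, {mean(high):.2f}")  # about 0.10 and 0.90
```

The pooled value of about 0.5 suggests no preference between the two lysine codons, although every single gene in the sample has a strong preference one way or the other.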

So what are the options if you need high expression yields for a synthetic gene? The good news is that Entelechon’s bioinformatics has strong tools to support a sound and solid codon optimization. First, if the expression yield of the gene is very important, we can perform an in-depth analysis of the genome of the target organism. This includes looking for CDS, analyzing their relevance, clustering them according to similar codon usage, and looking up expression data on the genes (where available). This results in a high quality ‘personal codon usage table’ which will usually lead to very good expression yields. We call this service Codon Census.

Second, a good correlate of the optimal codon usage is often the codon bias of ribosomal genes. Therefore, we can identify ribosomal genes for your target organism (if it has been sequenced and annotated to a reasonable degree) and build a codon usage table from these genes. Although their number is relatively small, the resulting codon usage table works quite well in most cases. This will usually also indicate whether the organism in question has a codon bias at all: comparing the ribosomal genes against an average of the complete genome or against random samples of genes provides this essential information.
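The comparison of ribosomal genes against the genome average could be sketched in Python as follows. The distance measure (mean absolute difference of synonymous codon fractions) and the example tables are assumptions chosen for simplicity; they are not a description of the method Entelechon actually uses.

```python
def usage_distance(table_a, table_b):
    """Mean absolute difference between two relative codon usage tables.

    Each table maps a codon to its fraction among the synonymous codons
    of its amino acid. Codons missing from a table count as 0.0.
    """
    codons = set(table_a) | set(table_b)
    total = sum(abs(table_a.get(c, 0.0) - table_b.get(c, 0.0)) for c in codons)
    return total / len(codons)


# Hypothetical fractions for two amino acids (Lys: AAA/AAG, Leu: CTG/CTT).
ribosomal = {"AAA": 0.80, "AAG": 0.20, "CTG": 0.75, "CTT": 0.25}
genome = {"AAA": 0.55, "AAG": 0.45, "CTG": 0.50, "CTT": 0.50}

distance = usage_distance(ribosomal, genome)
print(f"ribosomal vs. genome: {distance:.2f}")  # 0.25
```

To judge whether such a distance is meaningful, the same function can be applied to random samples of genes of the same size as the ribosomal set. If the ribosomal genes are no further from the genome average than the random samples are, the organism probably has little codon preference.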

Keep in mind that gene optimization isn’t just about codon usage. A number of other important factors, such as mRNA secondary structure, splice sites, mRNA-destabilizing motifs, and repeats, play a crucial role as well. Therefore, Entelechon uses an advanced multi-objective optimization algorithm to improve all relevant features of a given gene during optimization, thus helping you to get the best out of your custom gene synthesis.
