When expressing a gene heterologously, i.e. in an organism other than its source organism, it is advisable to optimize the gene sequence for maximum expression yield. With highly efficient and affordable gene synthesis readily available, the effort of optimizing a gene and synthesizing the resulting sequence usually pays off well in the form of significantly increased expression yields.
Both the gene and the amino acid sequence can be optimized. Due to the degeneracy of the genetic code – i.e. the fact that most amino acids are encoded by more than one codon – it is possible to alter, and in fact, optimize, the DNA sequence without changing the encoded amino acid sequence. Therefore, these two levels of optimization – the DNA and protein – need to be discussed separately:
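The degeneracy described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code (NCBI translation table 1); the function names `translate` and `synonymous_codons` are hypothetical helpers chosen for this illustration and are not part of any particular software package.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA coding sequence into a one-letter amino acid string."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def synonymous_codons(codon):
    """Return all codons that encode the same amino acid as the given codon."""
    aa = CODON_TABLE[codon.upper()]
    return sorted(c for c, a in CODON_TABLE.items() if a == aa)


# Two different DNA sequences that encode the same peptide (Met-Lys-Leu).
print(translate("ATGAAACTG"), translate("ATGAAGTTA"))
print(synonymous_codons("CTG"))  # leucine has six codons
print(synonymous_codons("ATG"))  # methionine has only one
```

Amino acids with a single codon, such as methionine and tryptophan, leave no room for change at the DNA level, whereas leucine, serine and arginine each offer six alternatives.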
Optimization of the protein sequence
The protein sequence encoded by a gene is usually well adapted to the purpose of the protein in the wildtype source organism. This does not necessarily mean that it works in an optimal way in the intended expression system. Therefore, it may be useful or even necessary to alter the protein sequence strategically.
One of the most important optimizations is the removal, alteration or addition of export signals. Some proteins carry export signals that target them to specific cellular compartments or to the extracellular medium. In heterologous expression, these signals may interfere with expression in the target host. On the other hand, an export signal that is compatible with the target host can be used to export the protein deliberately into the medium, which may simplify purification drastically.
Another aspect of the protein sequence is its solubility and its propensity to fold correctly. The expression system may have properties that lead to incorrect folding or alter the solubility of the protein. For instance, many proteins require chaperone proteins for correct folding, and these chaperones may be absent in the expression host.
To account for these differences, it may be helpful to create a fusion protein: the additional fused protein domain may assist the folding of the protein or change its solubility. Note that maximum solubility is not always the desired goal. Especially in bacterial expression systems, it might be easier to drive the expression product into inclusion bodies and purify the protein from there. If the target protein needs to be correctly folded, this of course requires a refolding step, the development of which may be the most costly part of the whole expression project.
Optimization of the DNA sequence
In addition to changes to the protein sequence, a number of optimizations can be performed at the DNA level without changing the protein sequence, by means of conservative codon exchanges (i.e. replacing one codon with another codon that encodes the same amino acid).
The parameters that can be optimized are:
- Codon usage
- Local GC content
- Absence of splice sites
- mRNA secondary structure
- Polyadenylation signals
- Secondary open reading frames in addition to the intended ORF
- Repetitive sequence motifs
- Codon tandem repeats
- Sequence motifs that destabilize the resulting mRNA
This relatively large number of optimization parameters means that we are dealing with a multi-objective optimization problem. Such an optimization cannot simply maximize a single ‘fitness score’, but has to take into consideration a number of individual scores that cannot be readily integrated into an all-encompassing fitness function.
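To make the idea of several individual scores concrete, the following Python sketch computes three of the listed parameters for a coding sequence and returns them side by side rather than as one combined number. It is an illustration under stated assumptions: the function names, the window size, the use of the spread between the lowest and highest window as a GC measure, and the choice of `AATAAA` (the canonical eukaryotic polyadenylation signal) as the only motif checked are all choices made for this example.

```python
def local_gc(dna, window=30):
    """GC fraction of every sliding window (a single window if the sequence is short)."""
    dna = dna.upper()
    if not dna:
        return []
    window = min(window, len(dna))
    return [
        sum(base in "GC" for base in dna[i:i + window]) / window
        for i in range(len(dna) - window + 1)
    ]


def codon_tandem_repeats(dna):
    """Number of positions where a codon is directly followed by the same codon."""
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    return sum(a == b for a, b in zip(codons, codons[1:]))


def count_motif(dna, motif):
    """Count occurrences of a motif, including overlapping ones."""
    dna, motif = dna.upper(), motif.upper()
    return sum(dna.startswith(motif, i) for i in range(len(dna) - len(motif) + 1))


def score_vector(dna, window=30):
    """Individual scores for a non-empty sequence, kept separate on purpose."""
    gc = local_gc(dna, window)
    return {
        "gc_spread": max(gc) - min(gc),           # variation of local GC content
        "codon_tandem_repeats": codon_tandem_repeats(dna),
        "polya_signals": count_motif(dna, "AATAAA"),
    }


print(score_vector("ATGCTGCTGGGC", window=6))
```

The scores have different units and scales, which is why adding them up into one number would require arbitrary weights.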
A multi-objective optimization must carefully balance all parameters and avoid favoring one or two parameters at the expense of the others. This is usually done in a so-called Pareto optimization, which considers two solutions as equivalent (or ‘non-dominated’ in the Pareto terminology) if each of them is better than the other in at least one parameter.
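The Pareto comparison can be written down in a few lines. The sketch below assumes that every score is to be minimized and that each solution is represented by a tuple of scores; `dominates` and `pareto_front` are hypothetical names used only for this illustration.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and strictly
    better somewhere (all scores are assumed to be minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the score tuples that are not dominated by any other tuple."""
    return [s for s in scores if not any(dominates(other, s) for other in scores)]


# Each tuple holds two scores of one candidate, e.g. (GC deviation, repeat count).
candidates = [(1, 5), (2, 2), (5, 1), (3, 3)]
print(pareto_front(candidates))
```

In this example `(3, 3)` is dropped because `(2, 2)` is better in both scores, while the remaining three candidates each win in at least one score against every other candidate and therefore stay on the front.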
The Pareto optimization can ideally be performed using a genetic algorithm. Such an algorithm creates a number of alternative ‘solutions’ – in this case sequences – that are derived from a set of parent solutions by mutagenesis and crossing over. The solutions are then evaluated and the best ones are selected.
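The loop described above can be sketched in Python as follows. This is a toy illustration and not the algorithm used by any particular software: it assumes the standard genetic code, uses only two objectives (deviation of the overall GC content from 50 % and the number of codon tandem repeats, both minimized), and all function names, rates and population sizes are arbitrary choices for the example.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def split_codons(dna):
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def translate(codons):
    return "".join(CODON_TABLE[c] for c in codons)


def mutate(codons, rng, rate=0.1):
    """Mutagenesis by conservative codon exchange: the protein stays unchanged."""
    return [rng.choice(SYNONYMS[CODON_TABLE[c]]) if rng.random() < rate else c
            for c in codons]


def crossover(a, b, rng):
    """Single-point crossing over at a codon boundary."""
    if len(a) < 2:
        return list(a)
    point = rng.randrange(1, len(a))
    return list(a[:point]) + list(b[point:])


def objectives(codons):
    """Two toy scores evaluated on the whole sequence, both to be minimized."""
    dna = "".join(codons)
    gc = (dna.count("G") + dna.count("C")) / len(dna)
    repeats = sum(x == y for x, y in zip(codons, codons[1:]))
    return (abs(gc - 0.5), repeats)


def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def select(pool):
    """Keep one sequence per non-dominated score tuple."""
    best = {}
    for codons in sorted(pool):
        best.setdefault(objectives(codons), codons)
    return [list(c) for s, c in best.items()
            if not any(dominates(other, s) for other in best)]


def optimize(dna, generations=50, pop_size=40, seed=0):
    rng = random.Random(seed)
    population = [split_codons(dna.upper())]
    for _ in range(generations):
        children = [mutate(crossover(rng.choice(population),
                                     rng.choice(population), rng), rng)
                    for _ in range(pop_size)]
        # Parents compete with their children; only non-dominated ones survive.
        population = select({tuple(c) for c in population + children})
    return ["".join(c) for c in population]


start = "ATGAAAAAAAAACTGCTGCTGGGT"
for candidate in optimize(start):
    print(candidate, objectives(split_codons(candidate)))
```

The result is a set of alternative sequences rather than a single winner; all of them encode the same protein as the starting sequence, and choosing among them remains a separate decision.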
It is important that the whole sequence is considered during this selection step. It is not possible to optimize one part of a sequence after the other, in independent steps. This would reduce the complexity of the optimization problem drastically, but would introduce an artificial asymmetry which would create invalid – i.e. less than optimal – solutions.
Therefore, the optimization of a gene sequence is a rather complex and difficult matter, both conceptually and from a computational perspective. The optimization needs to look at the largest possible number of solutions, and therefore requires much computation time. It is not unusual for the optimization of a reasonably long DNA sequence to take several hours.
Entelechon has created a dedicated software package that implements such a gene optimization, using a genetic algorithm. This package, Leto, performs a Pareto optimization, taking into account the above-mentioned optimization parameters.
Additional optimization aspects
In addition to the optimization parameters discussed above, changes to untranslated parts of a sequence can affect the expression yield. Most notably, the selection of a suitable promoter sequence can have a huge impact. Most commercial expression vectors include very efficient and well-characterized promoters. If such a commercial option is not available, the search for an efficient promoter may be as important as the actual gene optimization. At Entelechon, if there is no suitable promoter available for a given expression system, we use a dedicated bioinformatics approach to retrieve well-expressed target genes from public databases and extract optimal consensus promoters.
Introns are another aspect of untranslated regions. In most cases, it is desirable to remove introns, since they increase the cost of gene synthesis and make the handling of a gene sequence more complicated. However, in some expression systems, introns from genes of the target host may increase the expression efficiency, sometimes drastically.