Gene optimization

When expressing a gene heterologously, i.e. in an organism other than the source organism, it is advisable to optimize the gene sequence for maximum expression yield. With highly efficient and affordable gene synthesis readily available, the effort of gene optimization and subsequent synthesis of the resulting sequence usually pays off in the form of significantly increased expression yields.

Both the DNA sequence and the amino acid sequence can be optimized. Due to the degeneracy of the genetic code – i.e. the fact that most amino acids are encoded by more than one codon – it is possible to alter, and in fact optimize, the DNA sequence without changing the encoded amino acid sequence. Therefore, these two levels of optimization – DNA and protein – need to be discussed separately:

Optimization of the protein sequence

The protein sequence encoded by a gene is usually well adapted to the purpose of the protein in the wildtype source organism. This does not necessarily mean that it works in an optimal way in the intended expression system. Therefore, it may be useful or even necessary to alter the protein sequence strategically.

One of the most important optimizations is the removal, alteration or addition of export signals. Some proteins include export signals that target them to specific cellular compartments or to the extracellular medium. In heterologous expression, these signals may interfere with expression in the target host. On the other hand, an export signal compatible with the target host can be used to export the protein deliberately into the medium, which may simplify its purification drastically.

Another aspect of the protein sequence is its solubility and propensity to fold correctly. The expression system may have properties that lead to incorrect folding or alter the solubility of the protein. For instance, a lot of proteins require the presence of chaperone proteins for correct folding, and these chaperones may be absent in the expression host.

To account for these differences, it may be helpful to create a fusion protein: the additional fused protein domain may help the protein to fold or may change its solubility. Note that maximum solubility is not always the desired goal. Especially in bacterial expression systems, it might be easier to drive the expression product into inclusion bodies and purify the protein from there. If the target protein needs to be folded correctly, this of course requires a refolding step, the development of which may be the most costly part of the whole expression project.

Optimization of the DNA sequence

In addition to changes to the protein sequence, there are a number of optimizations that can be performed on the DNA level, without changing the protein sequence, by conservative codon exchanges (i.e. replacing one codon by another codon encoding the same amino acid).
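A conservative codon exchange can be illustrated with a few lines of Python. This is only a minimal sketch: the helper names (`translate`, `synonymous_codons`, `exchange_codon`) and the short example sequence are made up for illustration, and the standard genetic code is assumed.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA sequence whose length is a multiple of three."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def synonymous_codons(codon):
    """Return all codons that encode the same amino acid as `codon`."""
    aa = CODON_TABLE[codon]
    return sorted(c for c, a in CODON_TABLE.items() if a == aa)


def exchange_codon(dna, index, new_codon):
    """Replace the codon at position `index`, but only conservatively."""
    old = dna[index * 3:index * 3 + 3]
    if CODON_TABLE[old] != CODON_TABLE[new_codon]:
        raise ValueError("exchange would change the amino acid")
    return dna[:index * 3] + new_codon + dna[index * 3 + 3:]


original = "ATGCTTAAA"                      # hypothetical example: Met-Leu-Lys
recoded = exchange_codon(original, 1, "CTG")  # CTT and CTG both encode Leu
print(translate(original), translate(recoded))
```

The DNA sequence changes while the translated protein stays the same, which is exactly the freedom that DNA-level optimization exploits.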

The parameters that can be optimized are:

  • Codon usage
  • Local GC content
  • Absence of splice sites
  • mRNA secondary structure
  • Polyadenylation signals
  • Secondary open reading frames in addition to the intended ORF
  • Repetitive sequence motifs
  • Codon tandem repeats
  • Sequence motifs that destabilize the resulting mRNA

This relatively large number of optimization parameters means that we deal with a multi-objective optimization. Such an optimization cannot simply maximize a single ‘fitness score’, but has to take into consideration a number of individual scores that cannot be readily integrated into an all-encompassing fitness function.
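Two of the individual scores from the list above, local GC content and codon tandem repeats, could be computed roughly as follows. The function names, the window size and the repeat threshold are assumptions chosen for illustration, not values taken from any particular optimization software.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def local_gc(seq, window=10, step=1):
    """GC fraction of each window of length `window` along the sequence."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


def codon_tandem_repeats(dna, min_repeats=3):
    """Return (codon_index, codon, run_length) for runs of identical codons."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    runs = []
    start = 0
    for i in range(1, len(codons) + 1):
        # A run ends at the end of the sequence or when the codon changes.
        if i == len(codons) or codons[i] != codons[start]:
            if i - start >= min_repeats:
                runs.append((start, codons[start], i - start))
            start = i
    return runs


print(local_gc("GGGGAAAA", window=4, step=2))
print(codon_tandem_repeats("ATGCAGCAGCAGTAA"))
```

Each function yields its own score, and a sequence that looks good in one of them may look poor in the other.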

A multi-objective optimization must carefully balance all parameters and avoid favoring one or two of them at the expense of the others. This is usually done by a so-called Pareto optimization, which treats two solutions as equivalent (or ‘non-dominated’ in Pareto terminology) if each of them is better than the other in at least one parameter.
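The Pareto comparison can be written down compactly. The sketch below assumes that every solution is described by a tuple of scores in which higher is better; the function names and the example score tuples are hypothetical.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` in every score and better in one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(solutions):
    """Keep only the solutions that are not dominated by any other solution."""
    return [
        s for s in solutions
        if not any(dominates(other, s) for other in solutions)
    ]


# Hypothetical score tuples, e.g. (codon usage score, GC content score).
candidates = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]
print(pareto_front(candidates))
```

Here (0.4, 0.4) is dropped because (0.5, 0.5) is better in both scores, while the remaining three solutions each win in at least one score and are therefore kept side by side.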

The Pareto optimization can ideally be performed using a genetic algorithm. Such an algorithm creates a number of alternative ‘solutions’ – in this case sequences – that are derived from a set of parent solutions by mutagenesis and crossing over. The solutions are then evaluated and the best ones are selected.

It is important that the whole sequence is considered during this selection step. It is not possible to optimize one part of a sequence after the other, in independent steps. This would reduce the complexity of the optimization problem drastically, but would introduce an artificial asymmetry which would create invalid – i.e. less than optimal – solutions.
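The following toy genetic algorithm sketches the idea. It is not the algorithm used by any particular product: the two objectives (GC content close to 50% and short homopolymer runs), the population size, the mutation rate and all function names are assumptions made for illustration. Note that every candidate is scored as a whole sequence.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def codons_of(dna):
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def translate(dna):
    return "".join(CODON_TABLE[c] for c in codons_of(dna))


def mutate(dna, rng, rate=0.1):
    """Exchange each codon for a random synonymous codon with probability `rate`."""
    return "".join(
        rng.choice(SYNONYMS[CODON_TABLE[c]]) if rng.random() < rate else c
        for c in codons_of(dna)
    )


def crossover(a, b, rng):
    """Single-point crossover at a codon boundary (needs at least two codons)."""
    point = 3 * rng.randrange(1, len(a) // 3)
    return a[:point] + b[point:]


def scores(dna):
    """Two toy objectives on the whole sequence; higher is better."""
    gc = (dna.count("G") + dna.count("C")) / len(dna)
    longest = run = 1
    for prev, cur in zip(dna, dna[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return (-abs(gc - 0.5), -longest)


def dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and a != b


def optimize(dna, generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [dna] + [mutate(dna, rng, rate=0.5) for _ in range(pop_size - 1)]
    front = [dna]
    for _ in range(generations):
        children = [
            mutate(crossover(rng.choice(population), rng.choice(population), rng), rng)
            for _ in range(pop_size)
        ]
        candidates = sorted(set(population + children))
        scored = {c: scores(c) for c in candidates}
        # Selection: keep the non-dominated sequences, fill up with the rest.
        front = [c for c in candidates
                 if not any(dominates(scored[o], scored[c]) for o in candidates)]
        rest = [c for c in candidates if c not in front]
        rng.shuffle(rest)
        population = (front + rest)[:pop_size]
    return front


start = "ATG" + "AAA" * 3 + "TTT" * 2 + "GGG" * 2 + "CTT" * 2 + "TAA"
best = optimize(start)
print(len(best), "non-dominated sequences, all encoding", translate(best[0]))
```

Because mutation only exchanges synonymous codons and crossover happens at codon boundaries between sequences encoding the same protein, every candidate still translates to the original amino acid sequence.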

Therefore, the optimization of a gene sequence is a rather complex and difficult matter, both conceptually and from a computational perspective. The optimization needs to look at the largest possible number of solutions, and therefore requires much computation time. It’s not unusual for the optimization of a reasonably long DNA sequence to take several hours.

Entelechon has created a dedicated software package that implements such a gene optimization, using a genetic algorithm. This package, Leto, performs a Pareto optimization, taking into account the above-mentioned optimization parameters.

Additional optimization aspects

In addition to the optimization parameters discussed above, changes to untranslated parts of a sequence can affect the expression yield. Most notably, the selection of a suitable promoter sequence can have a huge impact. Most commercial expression vectors include very efficient and well-characterized promoters. If such a commercial option is not available, the search for an efficient promoter may be as important as the actual gene optimization. At Entelechon, if there is no suitable promoter available for a given expression system, we use a dedicated bioinformatics approach to retrieve well-expressed genes from public databases and extract optimal consensus promoters.

Introns are another untranslated part of a gene sequence. In most cases, it is desirable to remove introns, since they increase the cost of gene synthesis and make the handling of a gene sequence more complicated. However, in some expression systems, introns from genes of the target host may increase the expression efficiency – sometimes drastically.

Blosum62

Blosum62 is a substitution matrix for pairwise protein sequence alignments. You will encounter Blosum62 in a number of bioinformatics applications that align protein sequences or analyze the homology between sequences. Most notably, Blosum62 is the default matrix used by BLAST for protein homology detection.

When comparing two sequences and looking at their homology, it is important to have a metric for how closely related two sequence symbols – amino acids in the case of proteins – are. Among the 20 standard amino acids, some are more closely related than others when it comes to their physicochemical properties. For example, there are hydrophobic and hydrophilic amino acids, and the hydrophobic ones are more closely related to each other than to the hydrophilic ones.

If two sequences are evolutionarily related, it is plausible to assume that any amino acid changes have a high probability of being conservative, i.e. of replacing one amino acid with a closely related, similar one.

Therefore, when assigning a match or mismatch score to a pair of amino acids, we need a table where we can look up the score for any pair of two amino acids. That score should reflect the similarity of both amino acids.

This is what Blosum62 does: it contains similarity scores for all pairs of amino acids, assigning higher (better) scores to similar amino acids. You can find the Blosum62 matrix, for example, at Expasy.
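Looking up scores in such a table is straightforward. The sketch below uses only a small excerpt of Blosum62 (the entries for leucine, isoleucine, aspartate and glutamate); a real application would load the complete matrix. The function names are hypothetical.

```python
# Excerpt of Blosum62; keys are alphabetically sorted amino acid pairs.
BLOSUM62_EXCERPT = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("I", "L"): 2, ("D", "E"): 2,                  # conservative exchanges
    ("D", "L"): -4, ("D", "I"): -3,                # non-conservative exchanges
    ("E", "L"): -3, ("E", "I"): -3,
}


def pair_score(a, b):
    """Score for aligning amino acid `a` with `b`; the matrix is symmetric."""
    return BLOSUM62_EXCERPT[tuple(sorted((a, b)))]


def alignment_score(seq1, seq2):
    """Sum of pair scores for two aligned sequences of equal length, no gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(alignment_score("LID", "IIE"))   # similar residues score well
print(alignment_score("LD", "DL"))     # dissimilar residues are penalized
```

Exchanging leucine for isoleucine (both hydrophobic) still yields a positive score, whereas pairing leucine with the charged aspartate is penalized.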

Blosum62 was derived empirically: a large set of homologous protein sequences was aligned and the observed substitutions were analyzed. Only blocks that showed a good alignment were used to calculate the scores. The name ‘Blosum’ stands for ‘BLOcks SUbstitution Matrix’. The ‘62’ means that sequences within a block that shared at least 62% identity were clustered and counted as a single sequence, so that closely related sequences do not dominate the statistics.
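Substitution matrices of this kind are commonly expressed as rounded log-odds scores: the observed frequency of an aligned pair is compared with the frequency expected by chance. The sketch below assumes half-bit units (a scaling factor of 2, as generally stated for Blosum62); the function name and the frequencies in the example are invented for illustration and are not taken from the actual block data.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score of an amino acid pair.

    pair_freq: observed frequency of the aligned pair in the blocks
    freq_a, freq_b: background frequencies of the two amino acids
    scale: 2.0 gives scores in half-bit units (an assumption of this sketch)
    """
    return round(scale * math.log2(pair_freq / (freq_a * freq_b)))


# Hypothetical numbers: a pair seen eight times more often than by chance.
print(log_odds_score(0.02, 0.05, 0.05))
```

A pair that occurs more often than expected by chance receives a positive score, a pair that occurs less often receives a negative one.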

Biojava

Biojava is an open-source project that creates a framework for biology-oriented Java-based applications. You will find the project at this home page. The framework is relatively easy to integrate into Java projects, as it has few dependencies on other libraries.

The project uses Ant as its build tool, which makes it somewhat difficult to integrate into a Maven-based workflow.

The two major disadvantages of Biojava are bulk and version problems. The package is relatively large and adds a significant size overhead to small bioinformatics projects. Moreover, the various release iterations of Biojava are not entirely compatible, often introducing hard-to-find bugs and discrepancies in dependent projects. Also, a significant part of the functionality has been moved into the BiojavaX project.

Biojava provides excellent implementations for a number of boilerplate bioinformatics problems such as alignments. Therefore, it is ideally suited for a rapid start on a complex problem. However, for limited or for very specialized problems, it may be worthwhile to provide one’s own implementations of standard bioinformatics problems. This removes yet another dependency, reduces the risk of version conflicts, and makes it easier to control the source code.

RNA secondary structure

The path from a DNA template to a protein leads through two steps, transcription and translation. During transcription, an RNA molecule is derived from the DNA template.

In eukaryotes, internal non-coding parts, so-called introns, are removed by a process called splicing. In addition, a cap is added to the 5′ end of the RNA and a poly-A tail is appended to the 3′ end.

Once these processes are finished in the cell’s nucleus, the RNA – now called messenger RNA or mRNA – is exported into the cytoplasm. The mRNA is single-stranded. Its free bases are therefore capable of forming bonds between adenosine and uracil, guanosine and cytidine, and uracil and guanosine residues, which leads to a complex, non-linear structure.

Note that virtually all RNA single strands form such secondary structures. Some RNA molecules have evolved to maintain a very specific structure, such as tRNA. Others have a more or less purposeless structure.

Knowledge about the secondary structure can serve two purposes: First it can help to infer or understand the function of an RNA molecule. Second, it can predict whether an mRNA will cause problems during translation: Extended secondary structures can interfere with the translation process at the ribosome.

Typical structures

RNA forms a number of typical structures. The most prominent is the so-called hairpin: Two stretches of complementary nucleotides form a base-paired double helix, ending in a small loop of free, unpaired nucleotides.

The hairpin stem is the double-helical part, the hairpin loop is the circular ‘end’. A stem may also contain unpaired nucleotides, a so-called ‘bulge’.

Predicting RNA secondary structure

Predictions of RNA secondary structures are based on a model of the base interactions. Theoretically, each potential interaction between any two nucleotides must be considered. To accelerate the prediction, certain heuristics about base interactions are applied, such as steric limitations of hairpin loops. In addition, the thermodynamic stability of the different base pairings is usually considered, in order to get a realistic model of secondary structure formation.
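As a strongly simplified illustration, the sketch below uses a Nussinov-style dynamic programming scheme that merely maximizes the number of base pairs. It ignores thermodynamics entirely, considers only nested pairs, and enforces a minimum hairpin loop size as a stand-in for the steric limitation. The function name and the default loop size of three nucleotides are assumptions of this sketch.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in `rna`.

    min_loop is the minimum number of unpaired nucleotides enclosed by a pair.
    """
    n = len(rna)
    if n == 0:
        return 0
    # dp[i][j] = maximum number of pairs within rna[i..j] (inclusive)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                    # case 1: j stays unpaired
            for k in range(i, j - min_loop):       # case 2: j pairs with k
                if (rna[k], rna[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


print(max_base_pairs("GGGAAAUCCC"))   # a small hairpin with a three-pair stem
```

Real prediction tools replace the simple pair count with an energy model, but the recursion over all possible pairing partners is the same basic idea.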

For performance reasons, prediction algorithms usually exclude pseudoknots. These are structures where two parts of the RNA form a hairpin stem, and another two parts form a second hairpin stem – one of them being located in between the first two parts, the other outside.

Pseudoknot of an RNA molecule: Two hairpin stems are 'interlocked'.

Note that the prediction of any RNA secondary structure is just an approximation, using a number of simplifying assumptions. In addition, experimental conditions for RNA folding such as ion concentration or temperature are difficult to control to a degree where absolute reproducibility is possible. Depending on the length and nucleotide sequence of an RNA molecule, the prediction may be close to a structure found in nature or very far off.

NCBI – National Center for Biotechnology Information

The National Center for Biotechnology Information, NCBI, is a US-based organization founded in 1988 as a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). NCBI is one of the most important public resources for DNA and protein sequence databases, other life-science databases, and bioinformatics tools and services.

The home page of NCBI is http://www.ncbi.nlm.nih.gov/.

Two of the most important services of NCBI are a public web-based BLAST service for sequence homology searches, and Entrez, a web-based system for the retrieval of database contents. Entrez allows users to download sequence information, literature references and other biological information stored in NCBI databases.

In addition, NCBI provides PubMed, a database of scientific articles, and GenBank, a database of DNA sequences. GenBank is a comprehensive resource for virtually all publicly available DNA sequences.

FASTA file format

FASTA (pronounced ‘fast a’) is a file format for DNA and protein sequences. The format is very convenient as it is a text-based format and can be read and written with a simple text editor or word processor.

Description of the format

Each FASTA file may contain one or more sequences. Each sequence begins with a description line, identified by a ‘>’ (greater-than) character at the beginning of the line. The rest of the line may contain arbitrary characters and is interpreted as the description of the sequence. Note that there are some subformats of FASTA which contain data fields in the description line. These fields are separated by a ‘|’ (pipe) character.

The following lines contain the sequence in plain text format. For DNA, valid characters are the IUPAC one-letter nucleotide codes (A, C, G, T and the ambiguity codes B, D, R, etc.), and ‘-’ (minus) for gaps. For proteins, valid characters are the IUPAC one-letter amino acid codes, ‘-’ (minus) for gaps and ‘*’ (asterisk) for translation stops.
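A simple check of a DNA sequence line against this alphabet might look as follows. The function name is hypothetical; the set contains the IUPAC nucleotide codes plus the gap character, and lower case letters are accepted as well.

```python
# IUPAC nucleotide codes (including U and the ambiguity codes) plus the gap.
IUPAC_DNA = set("ACGTURYSWKMBDHVN-")


def is_valid_dna(sequence):
    """True if every character is an IUPAC nucleotide code or a gap."""
    return all(char in IUPAC_DNA for char in sequence.upper())


print(is_valid_dna("ACGTRYN-acgt"))   # only valid codes
print(is_valid_dna("ACGTX"))          # 'X' is not a nucleotide code
```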

Each line, including the description line, should be at most 80 characters long. However, many programs read and write FASTA files with longer lines, and if you intend to write software that processes FASTA files, you should accept longer lines.
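When writing FASTA, the line length limit is easy to respect by splitting the sequence into fixed-size pieces. A minimal sketch, with a hypothetical function name and the 80-character limit as the default:

```python
def format_fasta(description, sequence, width=80):
    """Return one FASTA record with sequence lines of at most `width` characters."""
    lines = [">" + description]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# A short width is used here only to make the wrapping visible.
print(format_fasta("sequence 1", "ACGT" * 5, width=8), end="")
```

This sketch wraps only the sequence lines; a long description is written as a single line.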

After the first sequence, additional sequences may follow, indicated by additional description lines. Thus, a typical FASTA file looks like this:

>sequence 1
AGCCACATTGACACGGAGA
ACCCCACATTTATAGAGGA
ACCAGAG
>sequence 2
ACCACAGATTGAGTTAGAC
CAGTTAATGAGAACACCAC

Note that both upper and lower case characters are allowed. However, some programs interpret lower and/or upper case characters in a special way. For instance, there is an option in BLAST to filter (ignore) lower case characters in FASTA files.

Advantages

The file format is text-based and easy to understand. It’s easy to create standard-compliant FASTA files manually. Text files and FASTA files can be transformed into each other, simply by adding or removing the description line.

It is easy to write parsers for FASTA files and even easier to write software that creates FASTA files.
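A minimal parser can be sketched in a few lines. The function name `read_fasta` is hypothetical; the parser accepts lines of any length, skips empty lines and yields one record at a time, so it never needs to hold the whole file in memory.

```python
import io


def read_fasta(handle):
    """Yield (description, sequence) tuples from an open FASTA text file."""
    description, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                              # tolerate empty lines
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:].strip(), []
        elif description is None:
            raise ValueError("sequence data before the first description line")
        else:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


example = io.StringIO(
    ">sequence 1\n"
    "AGCCACATTGACACGGAGA\n"
    "ACCCCACATTTATAGAGGA\n"
    "ACCAGAG\n"
    ">sequence 2\n"
    "ACCACAGATTGAGTTAGAC\n"
    "CAGTTAATGAGAACACCAC\n"
)
records = list(read_fasta(example))
for description, sequence in records:
    print(description, len(sequence))
```

For a file on disk, `read_fasta(open(path))` would be used in the same way, with `path` being the name of the FASTA file.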

The ability to store more than one sequence in a FASTA file facilitates the distribution of sequence libraries. Therefore, not surprisingly, FASTA is a format of choice for the transfer of whole genome sequences and similar large bulk data.

The simplicity of the format makes processing of large amounts of data straightforward. In particular, software can process FASTA files in small chunks, since the format does not introduce dependencies between different parts of a file.

FASTA files can be easily merged, simply by appending the content of one to that of another file.

Disadvantages

The FASTA format does not provide a standardized way to encode metadata for sequences, such as accession numbers, descriptions or especially annotations of parts of a sequence. People have worked around this limitation by introducing pseudoformats for the description lines. However, these formats are neither well defined nor well documented, and they are not consistent across different software packages.

It is difficult to see whether a FASTA file is intact or has been damaged. For instance, a shortened FASTA file usually looks perfectly valid.

There are many poor implementations of software that reads or writes FASTA files, which leads to non-standard files or to errors when reading a FASTA file.

BLAST – Install locally

You can download the BLAST software for Windows or Linux and install it on your computer. The software is provided by NCBI under this link:

http://www.ncbi.nlm.nih.gov/BLAST/download.shtml

Make sure you download the ‘blast’ package, not the ‘netblast’ or ‘wwwblast’ package. This will retrieve an executable which contains the compressed BLAST software package. For Windows, the executable has a name similar to

blast-2.2.17-ia32-win32.exe

Save this file to a directory where you want to install BLAST. Then double click on it in the explorer – this will uncompress all contained files into the same directory where you have stored the file. The directory structure should look like this:

Directory structure created by BLAST installation

The bin directory contains the executables that you will need to perform actual BLAST searches. Make sure that you are in that directory when you execute BLAST or that your system path contains the BLAST bin directory.

BLAST is a command-line program. Therefore, in order to use it, you will need to go to the command line. Under Windows, you can do that by clicking on ‘Start’, then ‘Run…’, then enter ‘cmd’ and press ‘Ok’.

To perform your first sequence search, enter this command line:

blastall -pblastn -d..\data\UniVec

‘-pblastn’ means that we will use the blastn program out of the BLAST family of sequence-specific programs. ‘-d..\data\UniVec’ specifies the database to use. See the BLAST parameters article for more information.

This will start BLAST which in turn will wait for your input. Enter this sequence:

atcgctgacgagctattacgtagctgcgcgtcagtcgatgcgcgctagc

You need to tell BLAST that the input is finished. Press CTRL+Z twice to do that. BLAST will then run a homology search of the entered sequence against the UniVec database (a database of common vector and oligo sequences that can be used for the detection of DNA cross contamination; UniVec comes with the BLAST installation).

The result will look similar to this one:

BLASTN 2.2.18 [Mar-02-2008]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
“Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs”,  Nucleic Acids Res. 25:3389-3402.

Query=
(49 letters)

Database: UniVec (build 4.0)
2416 sequences; 597,480 total letters

Searching…………………………………………..done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

gnl|uv|U59231.1:5112-7487 Cloning vector cLHYGpk                       24   1.0
gnl|uv|NGB00020.1:2420-2635 New England BioLabs cloning vector p…    24   1.0
gnl|uv|NGB00021.1:2421-2633 New England BioLabs cloning vector p…    24   1.0
gnl|uv|NGB00023.1:2420-2630 New England BioLabs cloning vector p…    24   1.0
gnl|uv|NGB00022.1:2398-2663 New England BioLabs cloning vector p…    24   1.0
gnl|uv|U89960.1:1471-2257 Cloning vector pEG202 (pLexA)                24   1.0
gnl|uv|AF234290.1:269-4296 Binary vector pCAMBIA-0380                  22   4.1
gnl|uv|U16857.1:2678-3227 Fusion cloning vector pTRXFUS                22   4.1
gnl|uv|U09365.1:9777-10659 Binary vector Bin19                         22   4.1

>gnl|uv|U59231.1:5112-7487 Cloning vector cLHYGpk
Length = 2376

Score = 24.3 bits (12), Expect = 1.0
Identities = 12/12 (100%)
Strand = Plus / Plus

Query: 34   gtcgatgcgcgc 45
            ||||||||||||
Sbjct: 1499 gtcgatgcgcgc 1510

>gnl|uv|NGB00020.1:2420-2635 New England BioLabs cloning vector
pLITMUS28
Length = 216

Score = 24.3 bits (12), Expect = 1.0
Identities = 12/12 (100%)
Strand = Plus / Minus

Query: 14 tattacgtagct 25
          ||||||||||||
Sbjct: 34 tattacgtagct 23

>gnl|uv|NGB00021.1:2421-2633 New England BioLabs cloning vector
pLITMUS29
Length = 213

Score = 24.3 bits (12), Expect = 1.0
Identities = 12/12 (100%)
Strand = Plus / Minus

Query: 14 tattacgtagct 25
          ||||||||||||
Sbjct: 33 tattacgtagct 22

>gnl|uv|NGB00023.1:2420-2630 New England BioLabs cloning vector
pLITMUS39
Length = 211

Score = 24.3 bits (12), Expect = 1.0
Identities = 12/12 (100%)
Strand = Plus / Minus

Query: 14 tattacgtagct 25
          ||||||||||||
Sbjct: 34 tattacgtagct 23

>gnl|uv|NGB00022.1:2398-2663 New England BioLabs cloning vector
pLITMUS38
Length = 266

Score = 24.3 bits (12), Expect = 1.0
Identities = 12/12 (100%)
Strand = Plus / Minus

Query: 14 tattacgtagct 25
          ||||||||||||
Sbjct: 56 tattacgtagct 45

>gnl|uv|U89960.1:1471-2257 Cloning vector pEG202 (pLexA)
Length = 787

Score = 24.3 bits (12), Expect = 1.0
Identities = 12/12 (100%)
Strand = Plus / Plus

Query: 23  gctgcgcgtcag 34
           ||||||||||||
Sbjct: 403 gctgcgcgtcag 414

>gnl|uv|AF234290.1:269-4296 Binary vector pCAMBIA-0380
Length = 4028

Score = 22.3 bits (11), Expect = 4.1
Identities = 11/11 (100%)
Strand = Plus / Minus

Query: 20   gtagctgcgcg 30
            |||||||||||
Sbjct: 2245 gtagctgcgcg 2235

>gnl|uv|U16857.1:2678-3227 Fusion cloning vector pTRXFUS
Length = 550

Score = 22.3 bits (11), Expect = 4.1
Identities = 11/11 (100%)
Strand = Plus / Plus

Query: 1   atcgctgacga 11
           |||||||||||
Sbjct: 182 atcgctgacga 192

>gnl|uv|U09365.1:9777-10659 Binary vector Bin19
Length = 883

Score = 22.3 bits (11), Expect = 4.1
Identities = 11/11 (100%)
Strand = Plus / Minus

Query: 22  agctgcgcgtc 32
           |||||||||||
Sbjct: 362 agctgcgcgtc 352

As you can see, BLAST prints a list of matches against the database, including a convenient alignment of matching subsequences. See the BLAST output article for more information.
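If you want to process such a report in a script, the score lines can be extracted with a regular expression. This is only a sketch for the plain text format shown above; the pattern and the function name are assumptions, and other BLAST versions or output formats may need a different pattern.

```python
import re

# Matches lines such as "Score = 24.3 bits (12), Expect = 1.0".
SCORE_LINE = re.compile(
    r"Score\s*=\s*([\d.]+) bits \((\d+)\),\s*Expect\s*=\s*([\de.+-]+)"
)


def parse_scores(report):
    """Return a list of (bit_score, raw_score, e_value) tuples from a report."""
    hits = []
    for bits, raw, expect in SCORE_LINE.findall(report):
        if expect.startswith("e"):        # some reports print "e-100" for 1e-100
            expect = "1" + expect
        hits.append((float(bits), int(raw), float(expect)))
    return hits


report = """Score = 24.3 bits (12), Expect = 1.0
Identities = 12/12 (100%)
Score = 22.3 bits (11), Expect = 4.1
Identities = 11/11 (100%)"""
print(parse_scores(report))
```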

Now that you have performed your first local BLAST search, you may want to read up on how to create BLAST databases from existing sequence files, or how to download readymade databases.
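The interactive search described above can also be started from a script. The sketch below assumes that the legacy `blastall` executable is on the system path and uses its `-p` (program), `-d` (database) and `-i` (input file) options; the function names are hypothetical. If no input file is given, the query is passed on standard input, just as in the interactive session.

```python
import subprocess


def build_blastall_command(program, database, query_file=None):
    """Assemble the blastall command line as a list of arguments."""
    command = ["blastall", "-p", program, "-d", database]
    if query_file is not None:
        command += ["-i", query_file]
    return command


def run_blastall(program, database, query_sequence):
    """Run blastall with the query on standard input and return the report."""
    result = subprocess.run(
        build_blastall_command(program, database),
        input=query_sequence + "\n",
        capture_output=True,
        text=True,
        check=True,          # raise an error if blastall reports a failure
    )
    return result.stdout


print(build_blastall_command("blastn", r"..\data\UniVec"))
```

Calling `run_blastall("blastn", r"..\data\UniVec", "atcgctgacgagctattacgtagctgcgcgtcagtcgatgcgcgctagc")` from the bin directory should then return a report similar to the one shown above, provided BLAST is installed as described.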

BLAST

Use the BLAST tool to find sequences that are similar to a given protein or DNA sequence. BLAST stands for ‘Basic Local Alignment Search Tool’. The reference for BLAST is Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ., ‘Basic local alignment search tool.’, J Mol Biol. 1990 Oct 5;215(3):403-10.

You can use BLAST in two ways: either go to the public BLAST website maintained by NCBI, or download the BLAST tool, install it locally on your computer and perform BLAST searches offline.