Multiplex protein quantification - assay development
Entelechon is a proud member of the International Association Synthetic Biology.

 

Online tool for the calculation of randomized mutant libraries

Entelechon has published a free online tool for calculating the complexity of randomized mutant libraries. You enter the number of randomized amino acid positions and the number of codon variants at each position, and the tool estimates the total number of possible combinations. Based on this complexity, it tests whether a synthetic library will contain enough molecules to cover each variant at least once.

In addition, the tool compares the performance of codon-precision libraries with conventional libraries based on single nucleotide randomizations. This allows the user to choose the most economic and efficient approach for the problem at hand.
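The underlying arithmetic can be sketched in a few lines of Python. This is only an illustration and not the code of the online tool: the function names are made up, and the coverage estimate assumes that every variant is equally likely to be drawn (a Poisson approximation). The comparison uses NNK randomization (4 × 4 × 2 = 32 codons per position) as an example of a conventional nucleotide-based library.

```python
import math

def library_complexity(variants_per_position):
    """Number of distinct sequences: the product of the variants at each position."""
    return math.prod(variants_per_position)

def expected_coverage(complexity, molecules):
    """Expected fraction of variants present at least once,
    assuming all variants are equally likely (Poisson approximation)."""
    return 1.0 - math.exp(-molecules / complexity)

def molecules_needed(complexity, coverage=0.95):
    """Library size needed to reach the given expected coverage."""
    return math.ceil(-complexity * math.log(1.0 - coverage))

positions = 4
codon_precision = library_complexity([20] * positions)  # one codon per amino acid
nnk = library_complexity([32] * positions)              # NNK: 32 codons per position

print(codon_precision, molecules_needed(codon_precision))
print(nnk, molecules_needed(nnk))
```

Under these assumptions, the DNA-level complexity of the NNK library grows much faster with the number of randomized positions, so more molecules are needed to reach the same coverage.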

The tool can be found here.

From protein to DNA

If you want to produce a given protein by recombinant expression, you may not have a corresponding DNA sequence at hand. And even if you do, its codon bias may not match that of your expression host. In such cases, you need to backtranslate the protein sequence into a DNA sequence. This can be done in a number of ways, and there are many tools out there which do the job.

We have created two software tools specifically for the purpose of protein expression. The first, the backtranslation tool, is an online tool that lets you adjust the codon frequency for each amino acid in detail. It can download codon frequency tables for a wide range of potential expression hosts.
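As a rough illustration of frequency-based backtranslation, here is a minimal Python sketch. It is not the code of the backtranslation tool: the function name is hypothetical, the table covers only four amino acids, and the frequencies are invented for the example rather than taken from any real expression host.

```python
import random

# Illustrative codon frequencies (made-up numbers, NOT a real host table).
CODON_FREQ = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.7, "AAG": 0.3},
    "E": {"GAA": 0.6, "GAG": 0.4},
    "L": {"CTG": 0.5, "CTC": 0.1, "CTT": 0.1, "CTA": 0.05, "TTA": 0.15, "TTG": 0.1},
}

def backtranslate(protein, table, rng=None):
    """Translate a protein sequence back into DNA.

    Without rng, the most frequent codon is chosen for each amino acid.
    With rng (a random.Random instance), codons are sampled in proportion
    to their frequency, so the result follows the table's codon bias.
    """
    dna = []
    for aa in protein:
        codons = table[aa]
        if rng is None:
            dna.append(max(codons, key=codons.get))
        else:
            dna.append(rng.choices(list(codons), weights=list(codons.values()))[0])
    return "".join(dna)

print(backtranslate("MKEL", CODON_FREQ))
print(backtranslate("MKEL", CODON_FREQ, random.Random(1)))
```

Always picking the most frequent codon is the simplest strategy. Sampling by frequency keeps the overall codon distribution closer to that of the host.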

However, the backtranslation tool is rather simplistic. It accepts a list of DNA motifs to be avoided (such as specific restriction sites), but apart from that it doesn’t control other parameters. Therefore, we have created a dedicated software package called Leto. Leto uses a genetic algorithm to iteratively optimize a given DNA or protein sequence. It takes into consideration a wide range of optimization parameters which have an impact on the expression yield.

For instance, Leto can remove potential splice sites, avoid mRNA-destabilizing motifs, adjust the GC ratio, reduce mRNA secondary structure, and, of course, adapt the codon frequencies. An optimization using Leto is therefore very likely to improve the expression yield significantly.

Leto is a standalone application running under Windows, Linux and Mac OS. Also, we offer custom Leto optimizations of your gene sequences as part of the gene synthesis service.

PHPEclipse causes problem with Display view

Recently my Eclipse started to misbehave. The ‘Display’ view (where snippets of code can be executed during debugging) wasn’t coming up; instead an exception was showing:

Error
Wed Jan 19 15:17:33 CET 2011
Unable to create view ID org.eclipse.jdt.debug.ui.DisplayView: An unexpected exception was thrown.

java.lang.NullPointerException
at org.eclipse.jdt.internal.ui.JavaPlugin.getTemplateContextRegistry(JavaPlugin.java:802)
at org.eclipse.jdt.internal.debug.ui.contentassist.JavaDebugContentAssistProcessor.<init>(JavaDebugContentAssistProcessor.java:54)
at org.eclipse.jdt.internal.debug.ui.display.DisplayViewerConfiguration.getContentAssistantProcessor(DisplayViewerConfiguration.java:55)
at org.eclipse.jdt.internal.debug.ui.display.DisplayViewerConfiguration.getContentAssistant(DisplayViewerConfiguration.java:65)
at org.eclipse.jface.text.source.SourceViewer.configure(SourceViewer.java:452)
at org.eclipse.jdt.internal.debug.ui.JDISourceViewer.configure(JDISourceViewer.java:304)
at org.eclipse.jdt.internal.debug.ui.display.DisplayView.createPartControl(DisplayView.java:168)
at org.eclipse.ui.internal.ViewReference.createPartHelper(ViewReference.java:375)
at org.eclipse.ui.internal.ViewReference.createPart(ViewReference.java:229)
at org.eclipse.ui.internal.WorkbenchPartReference.getPart(WorkbenchPartReference.java:595)
at org.eclipse.ui.internal.WorkbenchPage$ActivationList.setActive(WorkbenchPage.java:4218)
at org.eclipse.ui.internal.WorkbenchPage$18.runWithException(WorkbenchPage.java:3277)
at org.eclipse.ui.internal.StartupThreading$StartupRunnable.run(StartupThreading.java:31)
at org.eclipse.swt.widgets.RunnableLock.run(RunnableLock.java:35)
at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Synchronizer.java:134)
at org.eclipse.swt.widgets.Display.runAsyncMessages(Display.java:4041)
at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3660)
at org.eclipse.ui.application.WorkbenchAdvisor.openWindows(WorkbenchAdvisor.java:803)
at org.eclipse.ui.internal.Workbench$31.runWithException(Workbench.java:1566)
at org.eclipse.ui.internal.StartupThreading$StartupRunnable.run(StartupThreading.java:31)
at org.eclipse.swt.widgets.RunnableLock.run(RunnableLock.java:35)
at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Synchronizer.java:134)
at org.eclipse.swt.widgets.Display.runAsyncMessages(Display.java:4041)
at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3660)
at org.eclipse.ui.internal.Workbench.runUI(Workbench.java:2537)
at org.eclipse.ui.internal.Workbench.access$4(Workbench.java:2427)
at org.eclipse.ui.internal.Workbench$7.run(Workbench.java:670)
at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:332)
at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:663)
at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:149)
at org.eclipse.ui.internal.ide.application.IDEApplication.start(IDEApplication.java:115)
at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:196)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:110)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:79)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:369)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:179)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:619)
at org.eclipse.equinox.launcher.Main.basicRun(Main.java:574)
at org.eclipse.equinox.launcher.Main.run(Main.java:1407)

After some searching on the web, I found out that the recently installed PHPEclipse was the culprit. It seems PHPEclipse interferes with starting AJDT, which in turn prevents the Display view from working.

Since I rarely work with PHP anyway, I just uninstalled PHPEclipse.

Entelechon at SynBioSafe workshop

Entelechon’s bioinformatics expert Markus Fischer will attend the SynBioSafe workshop on ‘Safety and Ethical Aspects of Synthetic Biology’, which takes place in Vienna on November 15, 2008. The workshop aims at enhancing safety in the growing field of synthetic biology.

Codon optimization tool

This codon optimization tool allows you to translate a protein or amino acid sequence back into a DNA sequence. It uses a preferred codon frequency table, thus adapting the resulting gene to the codon preference of a particular expression host.

If you are interested in protein engineering or antibody maturation, try our codon-precision randomized mutant libraries.

You can enter sequence motifs that will be included in or excluded from the resulting optimal sequence. Moreover, we have added automatic optimization functionality for the codon usage table and have assembled a detailed online help.

The codon optimization tool has been cited in several publications. If you want to cite the tool, please see this page for details.

The backtranslation tool contains a detailed help function. Under each tab, you will find a “Help” button. By clicking on the button, you will open a window which contains a description of the currently selected tab.

Gene synthesis

If you need help synthesizing the resulting optimized gene sequence, we offer a very competitive and efficient gene synthesis service. Our gene optimization and synthesis team would be happy to assist you with the design of the gene sequence prior to custom synthesis. In addition, we provide a cloning service and a service for the synthesis of codon-precision randomized gene libraries.

Possible reasons why the tool is not working

Your browser does not support Java: Some operating systems and browsers do not support Java, on which the backtranslation tool is based, by default. For example, Windows XP does not always come with Java support on board. For some versions of Netscape, Java support is optional as well. In such a case, you should download the Java plugin. To do so, please visit this website: www.java.com

Java support is switched off in your browser, or a virus protection program has deactivated the execution of Java applets. We hereby explicitly emphasize that we do not think that Java applications such as the backtranslation tool pose a security risk. If Java support is deactivated or its execution is prevented, you have to change the respective settings. Please consult your system administrator on how to do this.

Your system is behind a firewall, which prevents the download of Java applets. In this case, only your system administrator can reconfigure the firewall to allow access to Java applets. Most firewalls allow access to Java applets by default.

Gene optimization

When expressing a gene heterologously, i.e. in an organism other than the source organism, it is advisable to optimize the gene sequence for maximum expression yields. With highly efficient and affordable gene synthesis readily available, the effort of gene optimization and subsequent synthesis of the resulting gene sequence usually pays off very well in terms of significantly increased expression yields.

Both the gene and the amino acid sequence can be optimized. Due to the degeneracy of the genetic code – i.e. the fact that most amino acids are encoded by more than one codon – it is possible to alter, and in fact, optimize, the DNA sequence without changing the encoded amino acid sequence. Therefore, these two levels of optimization – the DNA and protein – need to be discussed separately:

Optimization of the protein sequence

The protein sequence encoded by a gene is usually well adapted to the purpose of the protein in the wildtype source organism. This does not necessarily mean that it works in an optimal way in the intended expression system. Therefore, it may be useful or even necessary to alter the protein sequence strategically.

One of the most important optimizations is the removal, alteration or addition of export signals. Some proteins include export signals which target them into specific cellular compartments or into the extracellular medium. In heterologous expression, these signals may interfere with the target host. On the other hand, an export signal compatible with the target host may lead to a deliberate export into the medium, which may simplify the purification of the protein drastically.

Another aspect of the protein sequence is its solubility and propensity to fold correctly. The expression system may have properties that lead to incorrect folding or alter the solubility of the protein. For instance, a lot of proteins require the presence of chaperone proteins for correct folding, and these chaperones may be absent in the expression host.

In order to account for these differences, it may be helpful to create a fusion protein: the fused additional protein domain may help with the folding of the protein or may change its solubility. Note that maximum solubility is not always the desired goal. Especially in bacterial expression systems, it might be easier to drive the expression product into inclusion bodies and purify the protein from there. If the target protein needs to be folded correctly, this of course requires a refolding step, the development of which may be the most costly part of the whole expression project.

Optimization of the DNA sequence

In addition to changes to the protein sequence, there are a number of optimizations that can be performed on the DNA level, without changing the protein sequence, by conservative codon exchanges (i.e. replacing one codon by another codon encoding the same amino acid).

The parameters that can be optimized are:

  • Codon usage
  • Local GC content
  • Absence of splice sites
  • mRNA secondary structure
  • Polyadenylation signals
  • Secondary open reading frames in addition to the intended ORF
  • Repetitive sequence motifs
  • Codon tandem repeats
  • Sequence motifs that destabilize the resulting mRNA
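Several of these parameters are easy to measure directly on a DNA string. The following Python sketch shows three such measurements; the function names, the window size and the example sequences are assumptions made for illustration and do not reflect how Leto computes its scores.

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)

def local_gc_content(seq, window=30):
    """GC fraction of every sliding window; one value if the sequence is short."""
    if len(seq) <= window:
        return [gc_content(seq)]
    return [gc_content(seq[i:i + window]) for i in range(len(seq) - window + 1)]

def count_motifs(seq, motifs):
    """Total number of (possibly overlapping) occurrences of unwanted motifs."""
    total = 0
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            total += 1
            start = seq.find(motif, start + 1)
    return total

def codon_tandem_repeats(seq):
    """Number of codons that are immediately followed by an identical codon."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    return sum(a == b for a, b in zip(codons, codons[1:]))

orf = "ATGCTGCTGAAAGAATTCGAA"
print(gc_content(orf), count_motifs(orf, ["GAATTC"]), codon_tandem_repeats(orf))
```

Each function yields one score per candidate sequence, and these scores pull in different directions: removing a motif may, for example, worsen the codon usage.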

This relatively large number of optimization parameters means that we deal with a multi-objective optimization. Such an optimization cannot simply maximize a single ‘fitness score’, but has to take into consideration a number of individual scores that cannot be readily integrated into an all-encompassing fitness function.

A multi-objective optimization must carefully balance all parameters and avoid favoring one or two parameters at the expense of the others. This is usually done in a so-called Pareto optimization, which considers two solutions as equal (or ‘non-dominated’ in Pareto terminology) if each of them is better than the other in at least one parameter.
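The dominance relation can be written down compactly. In this minimal Python sketch, each solution is represented by a tuple of scores where lower is better; the representation and the function names are assumptions made for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b in every score
    and strictly better in at least one (lower is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Solutions that are not dominated by any other solution."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]

# Scores could be, for example, (codon usage penalty, number of unwanted motifs).
candidates = [(1, 5), (2, 2), (5, 1), (3, 3), (2, 6)]
print(pareto_front(candidates))
```

Here (3, 3) is dominated by (2, 2) and (2, 6) by (1, 5), while the three remaining candidates are mutually non-dominated: each is better than the others in one score.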

The Pareto optimization can ideally be performed using a genetic algorithm. Such an algorithm creates a number of alternative ‘solutions’ – in this case sequences – that are derived from a set of parent solutions by mutagenesis and crossing over. The solutions are then evaluated and the best ones are selected.
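A toy version of such an algorithm might look as follows in Python. This is a strongly simplified sketch and not Leto's implementation: the codon table covers only five amino acids, all names and parameters are invented, and selection uses a single stand-in score (deviation from 50% GC content) where a real optimizer would apply a Pareto selection over several scores.

```python
import random

# Synonymous codons of the standard genetic code (subset for brevity).
SYNONYMS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
    "L": ["CTG", "CTC", "CTT", "CTA", "TTA", "TTG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
}
CODON_TO_AA = {codon: aa for aa, codons in SYNONYMS.items() for codon in codons}

def mutate(codons, rng):
    """Mutagenesis: conservative codon exchange at one random position."""
    child = list(codons)
    pos = rng.randrange(len(child))
    child[pos] = rng.choice(SYNONYMS[CODON_TO_AA[child[pos]]])
    return child

def crossover(a, b, rng):
    """Crossing over at a codon boundary (needs at least two codons).
    Both parents encode the same protein, so the child does as well."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def gc_deviation(codons, target=0.5):
    """Stand-in fitness score: distance of the GC content from the target."""
    seq = "".join(codons)
    return abs(sum(base in "GC" for base in seq) / len(seq) - target)

def evolve(codons, generations=50, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [list(codons)] + [mutate(codons, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(population, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        # Keep the best of parents and children, so the best score never gets worse.
        population = sorted(population + children, key=gc_deviation)[:pop_size]
    return population[0]

start = ["ATG", "AAA", "GAA", "TTA", "TCT", "AAA"]
best = evolve(start)
print("".join(start), gc_deviation(start))
print("".join(best), gc_deviation(best))
```

Because both mutation and crossover only exchange synonymous codons, every sequence in the population still encodes the original protein.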

It is important that the whole sequence is considered during this selection step. It is not possible to optimize one part of a sequence after the other, in independent steps. This would reduce the complexity of the optimization problem drastically, but would introduce an artificial asymmetry which would create invalid – i.e. less than optimal – solutions.

Therefore, the optimization of a gene sequence is a rather complex and difficult matter, both conceptually and from a computational perspective. The optimization needs to look at the largest possible number of solutions, and therefore requires much computation time. It’s not unusual that an optimization of a reasonably long DNA sequence takes several hours.

Entelechon has created a dedicated software package that implements such a gene optimization, using a genetic algorithm. This package, Leto, performs a Pareto optimization, taking into account the above-mentioned optimization parameters.

Additional optimization aspects

In addition to the optimization parameters discussed above, changes to untranslated parts of a sequence can affect the expression yield. Most notably, the selection of a suitable promoter sequence can have a huge impact. Most commercial expression vectors include very efficient and well-characterized promoters. If such a commercial option is not available, the search for an efficient promoter may be as important as the actual gene optimization. At Entelechon, if there is no suitable promoter available for a given expression system, we use a dedicated bioinformatics approach to retrieve well-expressed target genes from public databases and extract optimal consensus promoters.

Introns are another aspect of untranslated regions. In most cases, it is desirable to remove introns, since they increase the cost of gene synthesis and make the handling of a gene sequence more complicated. However, in some expression systems, introns from genes of the target host may increase the expression efficiency, sometimes drastically.

Blosum62

Blosum62 is a substitution matrix for pairwise protein sequence alignments. You will encounter Blosum62 in a number of bioinformatics applications that align protein sequences or analyze the homology between sequences. Most notably, Blosum62 is used by BLAST by default for protein homology detection.

When comparing two sequences and looking at their homology, it is important to have a metric for how closely related two sequence symbols (amino acids in the case of proteins) are. Among the 20 standard amino acids, some are more closely related than others when it comes to their physicochemical properties. For example, there are hydrophobic and hydrophilic amino acids, and the hydrophobic ones are more closely related to each other than to the hydrophilic ones.

If two sequences are evolutionarily related, it is plausible to assume that any amino acid change has a high probability of being conservative, i.e. of replacing one amino acid by a closely related, similar one.

Therefore, when assigning a match or mismatch score to a pair of amino acids, we need a table where we can look up the score for any pair of two amino acids. That score should reflect the similarity of both amino acids.

This is what Blosum62 does: It contains similarity scores for all pairs of amino acids, assigning higher (better) scores to similar amino acids. You can find the Blosum62 matrix for example at Expasy.
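The table lookup can be sketched in Python as follows. To keep the example short, only the Blosum62 entries for five amino acids (D, E, I, L, V) are included; the function names are made up, and the scoring covers only ungapped alignments of equal length.

```python
# Blosum62 scores for a small subset of amino acids (keys are sorted pairs).
BLOSUM62_SUBSET = {
    ("D", "D"): 6, ("D", "E"): 2, ("D", "I"): -3, ("D", "L"): -4, ("D", "V"): -3,
    ("E", "E"): 5, ("E", "I"): -3, ("E", "L"): -3, ("E", "V"): -2,
    ("I", "I"): 4, ("I", "L"): 2, ("I", "V"): 3,
    ("L", "L"): 4, ("L", "V"): 1,
    ("V", "V"): 4,
}

def pair_score(a, b):
    """Look up the score of an amino acid pair; the matrix is symmetric."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]

def alignment_score(seq1, seq2):
    """Sum of pair scores for an ungapped alignment of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))

print(alignment_score("LID", "IVE"))  # conservative exchanges only
print(alignment_score("LIV", "DDE"))  # hydrophobic against acidic residues
```

Exchanges within the hydrophobic group (L, I, V) or within the acidic group (D, E) score positive, while exchanges between the two groups are penalized.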

Blosum62 has been created by a rational process: A large sample set of homologous protein sequences was aligned and the substitutions analyzed. In the analysis, blocks that showed a good alignment were used to calculate summed and averaged scores. Hence the name ‘Blosum’, which stems from ‘BLOck SUMs’. ‘62’ means that members of a homology block that shared at least 62% identity with any other member of the block were averaged.

Ensembl

Ensembl is a collaborative effort of the EMBL-EBI and the Sanger Institute to create automatically annotated eukaryotic genomes. It provides an excellent data source on genes, transcripts and their translation products for a large number of eukaryotes. Each datum is extensively annotated and cross-referenced with related data.

Ensembl can be accessed via a sophisticated, yet relatively easy to use web interface. However, the true value of Ensembl comes from the fact that it has an open read-access SQL interface. Therefore, you can easily use any SQL-savvy client for complex queries. In addition, you can download complete genomic datasets as SQL dumps and import them into an SQL database such as MySQL or Postgres.

The one drawback of Ensembl is its rather convoluted and complex (and not perfectly documented) entity relationship model.

Biojava

Biojava is an open-source project that creates a framework for biology-oriented Java-based applications. You will find the project at this home page. The framework is relatively easy to integrate into Java projects, as it has few dependencies on other libraries.

The project uses Ant as its build tool and is therefore somewhat difficult to integrate into a Maven-based workflow.

The two major disadvantages of Biojava are bulk and version problems. The package is relatively large and adds a significant size overhead to small bioinformatics projects. In addition, the various release iterations of Biojava are not entirely compatible, often introducing hard-to-find bugs and discrepancies in dependent projects. Also, a significant part of the functionality has been moved into the BiojavaX project.

Biojava provides excellent implementations for a number of boilerplate bioinformatics problems such as alignments. Therefore, it is ideally suited for a rapid start on a complex problem. However, for limited or very specialized problems, it may be worthwhile to write one’s own implementations of the standard algorithms. This removes yet another dependency, reduces the risk of version conflicts, and makes it easier to control the source code.

RNA secondary structure

The path from a DNA template to a protein leads through transcription and translation. During transcription, an RNA molecule is derived from the DNA template.

In eukaryotes, internal, non-coding parts, so-called introns, are removed by a process called splicing. Finally, a poly-A tail is appended to the 3′ end of the mRNA, and a cap structure is added to the 5′ end.

Once these processes are finished in the cell’s nucleus, the RNA, which is then called messenger RNA or mRNA, is exported into the cytoplasm. The mRNA is single-stranded. Its free bases are therefore capable of forming bonds between adenosine and uridine, guanosine and cytidine, and uridine and guanosine residues, thus leading to a complex, non-linear structure.

Note that virtually all RNA single strands form such secondary structures. Some RNA molecules have evolved to maintain a very specific structure, such as tRNA. Others have a more or less purposeless structure.

Knowledge about the secondary structure can serve two purposes: First, it can help to infer or understand the function of an RNA molecule. Second, it can indicate whether an mRNA will cause problems during translation, since extended secondary structures can interfere with the translation process at the ribosome.

Typical structures

RNA forms a number of typical structures. The most prominent is the so-called hairpin: Two stretches of complementary nucleotides form a base-paired double helix, ending in a small loop of free, unpaired nucleotides.

The hairpin stem is the double-helical part, the hairpin loop is the circular ‘end’. A stem may also contain unpaired nucleotides, a so-called ‘bulge’.

Predicting RNA secondary structure

Predictions of RNA secondary structures are based on a model of the base interactions. Theoretically, each potential interaction between any two nucleotides must be considered. To accelerate the prediction, certain heuristics about base interactions are applied, such as steric limitations of hairpin loops. In addition, the thermodynamic stability of the different base pairings is usually taken into account, in order to get a realistic model of secondary structure formation.
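The simplest form of such a prediction can be sketched as a dynamic programming scheme that maximizes the number of base pairs (a Nussinov-style algorithm). Note that this Python sketch is a deliberate simplification: it ignores thermodynamics entirely and only counts pairs, and the minimum loop size of three unpaired nucleotides is an assumed stand-in for the steric limitation of hairpin loops.

```python
# Allowed base pairs: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq, where paired bases must
    enclose at least min_loop nucleotides. Pseudoknots are not considered."""
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] holds the best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]

print(max_base_pairs("GGGAAACCC"))  # a hairpin: three G-C pairs around an AAA loop
```

The run time grows with the third power of the sequence length, which illustrates why heuristics are needed for long RNA molecules.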

For performance reasons, prediction algorithms usually exclude pseudoknots. These are structures where two parts of the RNA form a hairpin stem, and another two parts form a second hairpin stem, with one of them located in between the first two parts and the other outside.

Pseudoknot of an RNA molecule: Two hairpin stems are 'interlocked'.
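In terms of sequence positions, a pseudoknot means that two base pairs cross: for pairs (i, j) and (k, l), the positions are ordered i < k < j < l. A small Python check makes this explicit; the representation of a structure as a list of index pairs is an assumption made for this sketch.

```python
def has_pseudoknot(pairs):
    """True if any two base pairs cross, i.e. i < k < j < l
    for pairs (i, j) and (k, l)."""
    ordered = sorted((min(a, b), max(a, b)) for a, b in pairs)
    for idx, (i, j) in enumerate(ordered):
        for k, l in ordered[idx + 1:]:
            if i < k < j < l:
                return True
    return False

print(has_pseudoknot([(0, 10), (1, 9)]))  # nested pairs of one hairpin stem
print(has_pseudoknot([(0, 5), (3, 8)]))   # interlocked stems
```

Nested and side-by-side pairs never cross, which is what allows prediction algorithms to split a sequence into independent subproblems.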

Note that the prediction of any RNA secondary structure is just an approximation, using a number of simplifying assumptions. In addition, experimental conditions for RNA folding such as ion concentration or temperature are difficult to control to a degree where absolute reproducibility is possible. Depending on the length and nucleotide sequence of an RNA molecule, the prediction may be close to a structure found in nature or very far off.