FASTA file format

FASTA (pronounced ‘fast a’) is a text-based file format for DNA and protein sequences. Because it is plain text, a FASTA file can be read and written with a simple text editor, or with a word processor as long as the file is saved as plain text.

Description of the format

Each FASTA file may contain one or more sequences. Each sequence begins with a description line, identified by a ‘>’ (greater-than) character at the beginning of the line. The rest of the line may contain arbitrary characters and is interpreted as the description of the sequence. Note that there are some subformats of FASTA which contain data fields in the description line. These fields are separated by a ‘|’ (pipe) character.
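As a rough illustration of the pipe-separated subformats, the sketch below splits a description line into its fields. The function name and the sample header are invented for this example; what each field means depends on the program or database that wrote the file.

```python
def split_description(line: str) -> list[str]:
    """Split a FASTA description line into its pipe-separated fields."""
    if not line.startswith(">"):
        raise ValueError("a description line must start with '>'")
    # Drop the leading '>' and the line ending, then split on '|'.
    return line[1:].rstrip("\r\n").split("|")


# Hypothetical header; the identifiers are made up for illustration.
fields = split_description(">db|ABC123|example protein\n")
print(fields)  # ['db', 'ABC123', 'example protein']
```

A description line without any pipe characters simply comes back as a single field.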

The following lines contain the sequence in plain text format. For DNA, valid characters are the IUPAC one-letter nucleotide codes (A, C, G and T, plus the ambiguity or ‘wobble’ codes such as B, D and R) and ‘-’ (minus) for gaps. For proteins, valid characters are the IUPAC one-letter amino acid codes, ‘-’ (minus) for gaps and ‘*’ (asterisk) for translation stops.
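The character rules above can be checked with a few lines of code. The sketch below is one possible approach, not a reference implementation: the two code tables are assumptions of this example (the four bases plus the eleven IUPAC ambiguity codes, and the 20 standard amino acids plus B, Z and X), and some programs accept more or fewer letters.

```python
# Assumed tables: A, C, G, T plus the IUPAC ambiguity codes.
DNA_CODES = set("ACGT" "RYSWKMBDHVN")
# Assumed tables: the 20 standard amino acids plus B, Z and X.
PROTEIN_CODES = set("ACDEFGHIKLMNPQRSTVWY" "BZX")


def invalid_characters(sequence: str, protein: bool = False) -> set[str]:
    """Return the characters of `sequence` that the format does not allow."""
    allowed = set(PROTEIN_CODES if protein else DNA_CODES)
    allowed.add("-")  # gap character, valid for both sequence types
    if protein:
        allowed.add("*")  # translation stop, proteins only
    # Compare in upper case because both cases are allowed.
    return set(sequence.upper()) - allowed


print(invalid_characters("ACGTNacgt-"))        # set()
print(invalid_characters("MKV*", protein=True))  # set()
```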

Each line, including the description line, should be at most 80 characters long. However, many programs read and write FASTA files with longer lines, so software that processes FASTA files should accept longer lines as well.
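When writing FASTA, the line-length recommendation is easy to follow by cutting the sequence into fixed-width slices. The following is a minimal sketch; `format_record` and its `width` parameter are names chosen for this example. Note that the description line cannot be wrapped, so keeping it short is up to the caller.

```python
def format_record(description: str, sequence: str, width: int = 80) -> str:
    """Return one FASTA record with the sequence wrapped at `width` characters."""
    lines = [">" + description]
    # Slice the sequence into consecutive pieces of at most `width` characters.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


print(format_record("sequence 1", "ACGT" * 50), end="")
```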

After the first sequence, additional sequences may follow, indicated by additional description lines. Thus, a typical FASTA file looks like this:

>sequence 1
AGCCACATTGACACGGAGA
ACCCCACATTTATAGAGGA
ACCAGAG
>sequence 2
ACCACAGATTGAGTTAGAC
CAGTTAATGAGAACACCAC

Note that both upper and lower case characters are allowed. However, some programs interpret lower and/or upper case characters in a special way. For instance, there is an option in BLAST to filter (ignore) lower case characters in FASTA files.

Advantages

The file format is text-based and easy to understand. It’s easy to create standard-compliant FASTA files manually. A plain-text file that contains only a sequence can be turned into a FASTA file, and back again, simply by adding or removing the description line.

It is easy to write parsers for FASTA files and even easier to write software that creates FASTA files.
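To give an idea of how little code a reader needs, here is a minimal parser sketch. It assumes the plain format described above and makes a few choices of its own: blank lines are skipped, and sequence data before the first description line is treated as an error. The name `read_fasta` is just a label for this example.

```python
from typing import Iterable, Iterator


def read_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (description, sequence) pairs from the lines of a FASTA file."""
    description = None
    chunks: list[str] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:]
            chunks = []
        elif line:
            if description is None:
                raise ValueError("sequence data before the first description line")
            chunks.append(line)
    # The last record has no following '>' line, so emit it here.
    if description is not None:
        yield description, "".join(chunks)


example = """>sequence 1
AGCCACATTGACACGGAGA
ACCCCACATTTATAGAGGA
ACCAGAG
>sequence 2
ACCACAGATTGAGTTAGAC
CAGTTAATGAGAACACCAC
"""
for description, sequence in read_fasta(example.splitlines()):
    print(description, sequence)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly instead of a list of strings.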

The ability to store more than one sequence in a FASTA file facilitates the distribution of sequence libraries. Therefore, not surprisingly, FASTA is a format of choice for the transfer of whole genome sequences and similar large bulk data.

The simplicity of the format makes processing of large amounts of data straightforward. In particular, software can process FASTA files in small chunks, since the format does not introduce dependencies between different parts of a file.
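As an illustration of chunk-wise processing, the sketch below copies only those records whose description line contains a keyword. It holds a single line in memory at a time, so the size of the input does not matter. The function name and the keyword match are assumptions made for this example.

```python
from typing import Iterable, Iterator


def filter_records(lines: Iterable[str], keyword: str) -> Iterator[str]:
    """Yield the lines of every record whose description contains `keyword`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Each description line decides the fate of the record it starts.
            keep = keyword in line
        if keep:
            yield line


lines = [">sequence 1\n", "AGCC\n", ">sequence 2\n", "ACCA\n", "CAGT\n"]
print("".join(filter_records(lines, "sequence 2")), end="")
```

The same function works on an open file object, and its output can be written straight to another file.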

FASTA files can be easily merged, simply by appending the content of one to that of another file.
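One detail deserves care when appending files: if a file does not end with a line break, the first description line of the next file would be glued onto its last sequence line. The sketch below guards against that; `merge_fasta` and its parameters are names invented for this example.

```python
def merge_fasta(input_paths: list[str], output_path: str) -> None:
    """Concatenate several FASTA files into one output file."""
    with open(output_path, "w") as out:
        for path in input_paths:
            last = "\n"  # an empty input file needs no extra line break
            with open(path) as handle:
                for line in handle:
                    out.write(line)
                    last = line
            # Make sure the next file's '>' starts on a line of its own.
            if not last.endswith("\n"):
                out.write("\n")
```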

Disadvantages

The FASTA format does not provide a standardized way to encode metadata for sequences, such as accession numbers, descriptions or, in particular, annotations of parts of a sequence. People have worked around this limitation by introducing pseudoformats for the description lines. However, these formats are neither well defined nor well documented, and they are not consistent across different software packages.

It is difficult to see whether a FASTA file is intact or has been damaged. For instance, a shortened FASTA file usually looks perfectly valid.
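Since the format itself offers no integrity check, one workaround outside of FASTA is to record a checksum when the file is created and to compare it after transfer. The sketch below uses SHA-256 from the Python standard library; the function name is made up for this example, and how the reference checksum is distributed is left open.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in blocks so that large files do not have to fit in memory.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

A shortened copy of the file yields a different checksum than the original, even though it may still look like a valid FASTA file.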

There are many poor implementations of software that reads or writes FASTA files, which leads to files that are not standard-compliant or to errors when reading a FASTA file.
