I'd like to learn the differences between three common formats: FASTA, FASTQ and SAM. How are they different? Are there any benefits of using one over another?

Based on the Wikipedia pages, I can't tell the differences between them.

Let’s start with what they have in common: All three formats store

  1. sequence data, and
  2. sequence metadata.

Furthermore, all three formats are text-based.

However, beyond that all three formats are different and serve different purposes.

Let’s start with the simplest format:


FASTA stores a variable number of sequence records, and for each record it stores the sequence itself, and a sequence ID. Each record starts with a header line whose first character is >, followed by the sequence ID. The next lines of a record contain the actual sequence.

The Wikipedia article gives several examples for peptide sequences, but since FASTQ and SAM are used exclusively (?) for nucleotide sequences, here’s a nucleotide example:

>Mus_musculus_tRNA-Ala-AGC-1-1 (chr13.trna34-AlaAGC)
>Mus_musculus_tRNA-Ala-AGC-10-1 (chr13.trna457-AlaAGC)

The ID can be in any arbitrary format, although several conventions exist.

In the context of nucleotide sequences, FASTA is mostly used to store reference data, that is, data extracted from a curated database. The example above is adapted from GtRNAdb (a database of tRNA sequences).
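To make the record structure concrete, here is a minimal Python sketch of a FASTA reader. The function name `read_fasta` and the two toy records are made up for illustration; in practice you would more likely rely on an established library.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA stream."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input (made up, not taken from any database).
example = io.StringIO(">seq1 made-up description\nACGT\nTTGA\n>seq2\nGGCC\n")
records = list(read_fasta(example))
print(records)
```

Note that everything after `>` is returned as one string; splitting the ID from a free-text description is left to whatever convention the file follows.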


FASTQ was conceived to solve a specific problem arising during sequencing: due to how different sequencing technologies work, the confidence in each base call (that is, the estimated probability of having correctly identified a given nucleotide) varies. This is expressed in the Phred quality score. FASTA had no standardised way of encoding this. By contrast, a FASTQ record contains one quality score for each nucleotide of its sequence.
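For reference, the Phred score Q is defined from the estimated probability p that a base call is wrong (so p is one minus the confidence mentioned above): Q = -10 · log10(p). A small sketch of the conversion in both directions; the function names are my own.

```python
import math


def phred_from_error_probability(p: float) -> float:
    """Phred quality score for an estimated error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def error_probability_from_phred(q: float) -> float:
    """Estimated error probability for a Phred quality score q."""
    return 10.0 ** (-q / 10.0)


# A score of 30 corresponds to an estimated 1 in 1000 chance of a wrong call.
print(phred_from_error_probability(0.001))
print(error_probability_from_phred(20))
```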

A FASTQ record has the following format:

  1. A line starting with @, containing the sequence ID.
  2. One or more lines that contain the sequence.
  3. A line starting with the character +, followed either by nothing or by a repetition of the sequence ID.
  4. One or more lines that contain the quality scores.

Here’s an example of a FASTQ file with two records:


FASTQ files are mostly used to store short-read data from high-throughput sequencing experiments. The sequence and quality scores are usually put into a single line each, and indeed many tools assume that each record in a FASTQ file is exactly four lines long, even though this isn’t guaranteed.
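Here is a rough Python sketch of a reader that does not rely on the four-line assumption. It collects sequence lines up to the `+` line, then reads quality lines until it has as many quality characters as bases. The counting matters because `@` and `+` are themselves valid quality characters, so a quality line can start with either. The function name `read_fastq` and the toy records are made up for illustration.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (record_id, sequence, quality) triples from a FASTQ stream."""
    lines = (raw.rstrip("\r\n") for raw in handle)
    for line in lines:
        if not line:
            continue  # tolerate blank lines between records
        if not line.startswith("@"):
            raise ValueError(f"expected '@' header line, got {line!r}")
        record_id = line[1:]

        # Sequence lines run until the '+' separator line.
        seq_chunks: List[str] = []
        for line in lines:
            if line.startswith("+"):
                break
            seq_chunks.append(line)
        else:
            raise ValueError(f"record {record_id!r} has no '+' separator")
        sequence = "".join(seq_chunks)

        # Quality lines run until their total length matches the sequence.
        qual_chunks: List[str] = []
        remaining = len(sequence)
        while remaining > 0:
            qual_line = next(lines, None)
            if qual_line is None:
                raise ValueError(f"record {record_id!r} is truncated")
            qual_chunks.append(qual_line)
            remaining -= len(qual_line)
        if remaining < 0:
            raise ValueError(f"record {record_id!r} has too many quality values")
        yield record_id, sequence, "".join(qual_chunks)


# Toy input: the first record is wrapped, and one quality line starts with '@'.
example = io.StringIO(
    "@read1 made-up\nACGT\nTTGA\n+\nIIII\n@@II\n"
    "@read2\nGG\n+read2\n!~\n"
)
records = list(read_fastq(example))
print(records)
```

A parser that simply reads the file four lines at a time would misinterpret the first record here.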

As with FASTA, the format of the sequence ID isn’t standardised, but different producers of FASTQ use fixed notations that follow strict conventions.


SAM files are so complex that a complete description [PDF] takes 15 pages. So here’s the short version.

The original purpose of SAM files is to store mapping information for sequences from high-throughput sequencing. As a consequence, a SAM record needs to store more than just the sequence and its quality; it also needs to store information about where and how a sequence maps onto the reference.

Unlike the previous formats, SAM is tab-delimited, and each record, consisting of 11 mandatory fields plus any number of optional fields, fills exactly one line. Here’s an example (tabs replaced by fixed-width spacing):

r001  99  chr1  7 30  17M         =  37  39  TTAGATAAAGGATACTG   IIIIIIIIIIIIIIIII
r002  0   chrX  9 30  3S6M1P1I4M  *  0   0   AAAAGATAAGGATA      IIIIIIIIII6IBI    NM:i:1

For a description of the individual fields, refer to the documentation. The relevant bit is this: SAM can express exactly the same information as FASTQ, plus, as mentioned, the mapping information. However, SAM is also used to store read data without mapping information.
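As a quick orientation, here is a minimal Python sketch that splits one alignment line into the 11 mandatory fields (named as in the SAM specification) plus the optional `TAG:TYPE:VALUE` fields. The `SamRecord` type and the `parse_sam_line` function are illustrative names of my own, not part of any library.

```python
from typing import Dict, NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # per-base qualities, encoded as in FASTQ
    tags: Dict[str, Tuple[str, str]]  # optional fields: tag -> (type, value)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one tab-delimited SAM alignment line (not a header line)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    tags = {}
    for item in fields[11:]:
        tag, type_code, value = item.split(":", 2)
        tags[tag] = (type_code, value)
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# The second example record from above, with real tabs.
line = "\t".join([
    "r002", "0", "chrX", "9", "30", "3S6M1P1I4M", "*", "0", "0",
    "AAAAGATAAGGATA", "IIIIIIIIII6IBI", "NM:i:1",
])
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, record.tags)

# FLAG is a bit field; for example, bit 0x4 marks an unmapped read.
print(bool(record.flag & 0x4))
```

The name, sequence and quality fields are exactly what a FASTQ record holds; everything else is mapping information.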

In addition to sequence records, SAM files can also contain a header, which stores information about the reference that the sequences were mapped to, and the tool used to create the SAM file. Header lines precede the sequence records, and each of them starts with @.
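A small sketch of how the header could be read, assuming the layout from the SAM specification: a two-letter record type after the `@` (such as `@HD`, `@SQ` or `@PG`), followed by tab-separated `KEY:VALUE` pairs, with `@CO` lines holding free-text comments. The function name and the toy header values (including the program name) are made up.

```python
from typing import Dict, Iterable, List


def parse_sam_header(lines: Iterable[str]) -> Dict[str, List[Dict[str, str]]]:
    """Collect header lines, grouped by their two-letter record type."""
    header: Dict[str, List[Dict[str, str]]] = {}
    for line in lines:
        if not line.startswith("@"):
            break  # first alignment record: the header is over
        record_type, *fields = line.rstrip("\r\n").split("\t")
        if record_type == "@CO":
            entry = {"text": "\t".join(fields)}  # free-text comment line
        else:
            entry = dict(field.split(":", 1) for field in fields)
        header.setdefault(record_type[1:], []).append(entry)
    return header


# Toy header followed by one alignment line (all values are made up).
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:45",
    "@PG\tID:aligner\tPN:hypothetical-aligner",
    "r001\t99\tchr1\t7\t30\t17M\t=\t37\t39\tTTAGATAAAGGATACTG\t*",
]
header = parse_sam_header(sam_lines)
print(header["SQ"])
```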

SAM itself is almost never used as a storage format; instead, files are stored in BAM format, which is a compact, gzipped, binary representation of SAM. It stores the same information, just more efficiently, and, in conjunction with a search index, it allows fast retrieval of individual records from the middle of the file (= fast random access). BAM files are also much more compact than compressed FASTQ or FASTA files.

The above implies a hierarchy in what the formats can store: FASTA ⊂ FASTQ ⊂ SAM.

In a typical high-throughput analysis workflow, you will encounter all three file types:

  1. FASTA to store the reference genome/transcriptome that the sequence fragments will be mapped to.
  2. FASTQ to store the sequence fragments before mapping.
  3. SAM/BAM to store the sequence fragments after mapping.
    Why is there a '+' sign in FASTQ format?

    @charlesdarwin I have no idea. The line with the plus sign is completely redundant. The original developers of the FASTQ format probably intended it as a redundancy to simplify error checking (= to see if the record was complete) but it fails at that. In hindsight it shouldn't have been included. Unfortunately we're stuck with it for now.

    @KonradRudolph as far as I know fastq is a combination of fasta and qual files, see also https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/ This explains the header of the quality part. It, however, doesn't make sense we're stuck with it...

In a nutshell,

The FASTA file format is a text format for representing nucleotide or protein sequences and was first described by Pearson and Lipman (Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448).

FASTQ is another sequence file format that extends the FASTA format with the ability to store the sequence quality. The quality scores are usually encoded as ASCII characters, each of which corresponds to a Phred score.

Both FASTA and FASTQ are common sequence representation formats and have emerged as key data interchange formats for molecular biology and bioinformatics.

SAM is a format for representing sequence alignment information from a read aligner. It represents sequence information with respect to a given reference sequence. The information is stored in a series of tab-delimited ASCII columns. The full SAM format specification is available at http://samtools.sourceforge.net/SAM1.pdf

FASTA (officially) just stores the name of a sequence and the sequence itself; unofficially, people also add comment fields after the name of the sequence. FASTQ was invented to store both the sequence and its associated quality values (e.g. from sequencing instruments). SAM was invented to store alignments of (small) sequences (e.g. generated from sequencing), with associated quality values and some further data, onto larger sequences called reference sequences, the latter being anything from a tiny virus sequence to ultra-large plant sequences.

FASTA and FASTQ are both file formats that can contain sequencing reads, while SAM files contain these reads aligned to a reference sequence. In other words, FASTA and FASTQ are the "raw data" of sequencing, while SAM is the product of aligning the sequencing reads to a reference sequence (refseq).

A FASTA file contains a read name followed by the sequence. An example of one of these reads for RNASeq might be:

>Flow cell number: lane number: chip coordinates etc.

The FASTQ version of this read will have two more lines: one + as a placeholder, and then a line of quality scores for the base calls. The qualities are given as characters, with '!' being the lowest and '~' being the highest, in increasing ASCII value. It would look something like this:

@Flow cell number: lane number: chip coordinates etc.
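As a rough illustration of how the quality characters map to numbers, here is a short Python sketch. It assumes the common "Phred+33" encoding, where the score is the ASCII code minus 33, which matches the '!' to '~' range described above; some older files use a different offset. The function name is my own.

```python
from typing import List


def decode_phred33(quality: str) -> List[int]:
    """Convert a FASTQ quality string to Phred scores, assuming offset 33."""
    return [ord(char) - 33 for char in quality]


# '!' is ASCII 33 (score 0), 'I' is ASCII 73 (score 40), '~' is ASCII 126 (score 93).
scores = decode_phred33("!I~")
print(scores)
```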

A SAM file has many fields for each alignment, and its header lines begin with the @ character. Each alignment contains 11 mandatory fields and various optional ones. You can find the spec file here: https://samtools.github.io/hts-specs/SAMv1.pdf

Often you'll see BAM files, which are just compressed binary versions of SAM files. You can view these alignment files using various tools, such as SAMtools, IGV or the UCSC Genome Browser.

As to the benefits, FASTA/FASTQ vs. SAM/BAM is comparing apples and oranges. I do a lot of RNASeq work, so generally we take the FASTQ files and align them to a refseq using an aligner such as STAR, which outputs SAM/BAM files. There's a lot you can do with just these alignment files when looking at expression, but usually I'll use a tool such as RSEM to "count" the reads from various genes to create an expression matrix, with samples as columns and genes as rows. Whether you get FASTQ or FASTA files just depends on your sequencing platform. I've never heard of anybody really using the quality scores.

    Careful, the FASTQ format description is wrong: a FASTQ record can span more than four lines; also, + isn't a placeholder, it's a separator between the sequence and the quality score, with an optional repetition of the record ID following it. Finally, the quality score string has to be the same length as the sequence.