fast.genomics

fast.genomics – compare 6,575 genera of bacteria & archaea

About fast.genomics

Fast.genomics includes representative genomes for 6,575 genera of Archaea and Bacteria. Genomes were classified using the Genome Taxonomy Database (GTDB). Only high-quality genomes are included. Potential chimeras were excluded using CheckM2. Where possible, genomes were taken from NCBI's RefSeq.

Fast.genomics uses mmseqs2 to find homologs for a protein sequence of interest. This usually takes a few seconds. To speed up the search, fast.genomics keeps the mmseqs2 index in memory and runs the alignment step in parallel. The protein of interest need not be in the fast.genomics database.
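For readers who want to run a comparable search locally, here is a minimal sketch, not the code fast.genomics itself runs. It builds an `mmseqs easy-search` command line and parses the default tab-delimited output. The file paths, the E-value cutoff, and the thread count are illustrative assumptions; the in-memory index that the site uses is not reproduced here.

```python
# Sketch only: assumes MMseqs2 is installed and that target_db is a FASTA file
# or an MMseqs2 database. Paths and default parameters are hypothetical.

# Column names for the default BLAST-tab style output of `mmseqs easy-search`.
M8_COLUMNS = [
    "query", "target", "identity", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]


def build_mmseqs_search_command(query_fasta, target_db, out_tsv, tmp_dir,
                                evalue=1e-3, threads=8):
    """Return the argument list for `mmseqs easy-search` (nothing is executed)."""
    return [
        "mmseqs", "easy-search",
        str(query_fasta), str(target_db), str(out_tsv), str(tmp_dir),
        "-e", str(evalue),          # report hits at or below this E-value
        "--threads", str(threads),  # number of CPU threads to use
    ]


def parse_m8(lines):
    """Parse tab-delimited hit lines into dicts, sorted by ascending E-value."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < len(M8_COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(M8_COLUMNS, fields))
        hit["evalue"] = float(hit["evalue"])
        hit["bits"] = float(hit["bits"])
        hits.append(hit)
    hits.sort(key=lambda h: h["evalue"])
    return hits
```

The command list can be passed to `subprocess.run(cmd, check=True)`, after which the output file can be read line by line with `parse_m8`.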

Once the homologs are identified, fast.genomics can quickly show:

- Gene neighborhoods
- Which taxa contain homologs
- The presence/absence of two proteins across taxa
- Which taxa have the two genes nearby

(These examples are for a putative 3-ketoglycoside lyase, ING2E5A_RS06865. This family of proteins was formerly known as DUF1080.)
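The last two comparisons can be sketched with plain set operations. This is an illustration under stated assumptions, not the site's implementation: the input tuples, the function names, and the 5,000 bp definition of "nearby" are all hypothetical, since the page does not say how proximity is defined.

```python
from collections import defaultdict


def presence_absence(genomes_with_a, genomes_with_b):
    """Split genomes into those with both proteins, only A, or only B."""
    a, b = set(genomes_with_a), set(genomes_with_b)
    return {"both": a & b, "only_a": a - b, "only_b": b - a}


def genomes_with_nearby_pairs(genes_a, genes_b, max_distance=5000):
    """Return genomes where a homolog of A lies near a homolog of B.

    Each gene is a (genome_id, scaffold_id, start, end) tuple.
    max_distance (in bp) is an assumed threshold, not taken from the site.
    """
    b_by_scaffold = defaultdict(list)
    for genome, scaffold, start, end in genes_b:
        b_by_scaffold[(genome, scaffold)].append((start, end))

    nearby = set()
    for genome, scaffold, start, end in genes_a:
        for b_start, b_end in b_by_scaffold.get((genome, scaffold), []):
            # Gap between the two intervals; 0 if they overlap.
            gap = max(b_start - end, start - b_end, 0)
            if gap <= max_distance:
                nearby.add(genome)
                break
    return nearby
```

Genes on different scaffolds are never counted as nearby in this sketch, which is a conservative choice for fragmented assemblies.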

A database for each order

Fast.genomics also includes a database for each order, with every species represented, and up to 10 genomes per species. The per-order database will often include many more close homologs than the top-level database (example). You can reach the per-order database from the taxon or genome pages. Also, most gene pages have a link to search for homologs within that genome's order.

To speed up searches for homologs within an order, fast.genomics uses a pre-computed clustering (from CD-HIT) of all of the proteins in that order. First, fast.genomics searches against clusters (using protein BLAST and E ≤ 0.001); then it compares the query to all members of those clusters (using lastal and E ≤ 0.001, with E-values rescaled).
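The two-stage idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `raw_evalue` is a hypothetical stand-in for the real aligner, database size is counted in sequences rather than residues, and the rescaling shown (multiplying by the ratio of full to searched database size) is one plausible reading of "E-values rescaled", not a confirmed description of what fast.genomics does.

```python
def rescale_evalue(evalue, searched_size, full_size):
    """Scale an E-value from a small search space up to the full database.

    Assumes E-values grow linearly with database size.
    """
    return evalue * full_size / searched_size


def two_stage_search(query, clusters, raw_evalue, max_evalue=1e-3):
    """Search cluster representatives first, then members of matching clusters.

    clusters:   dict mapping representative id -> list of member ids
                (the representative is included in its own member list).
    raw_evalue: hypothetical callable (query, seq_id, db_size) -> E-value
                computed as if the database held db_size sequences.
    Returns a dict mapping member id -> rescaled E-value.
    """
    full_size = sum(len(members) for members in clusters.values())

    # Stage 1: compare the query to one representative per cluster.
    matching = [
        rep for rep in clusters
        if raw_evalue(query, rep, len(clusters)) <= max_evalue
    ]

    # Stage 2: compare the query to every member of the matching clusters.
    candidates = [member for rep in matching for member in clusters[rep]]
    hits = {}
    for member in candidates:
        e_small = raw_evalue(query, member, len(candidates))
        e_full = rescale_evalue(e_small, len(candidates), full_size)
        if e_full <= max_evalue:
            hits[member] = e_full
    return hits
```

Without the rescaling step, a hit scored against only a handful of candidate sequences would look more significant than the same alignment scored against the whole per-order database.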

Statistics for the main database of diverse Bacteria and Archaea

Phyla	131
Classes	293
Orders	786
Families	1,683
Genomes	6,575
Protein sequences	22,373,411
Genes	23,215,514
Database date	May 17, 2024
GTDB version	09-RS220 (Apr 2024)

Further information

Fast.genomics was described in detail in "A fast comparative genome browser for diverse bacteria and archaea".
Changes since the original publication:
- We now use CheckM2 (instead of GUNC and CheckM) to identify low-quality genomes.
- We now require a coding density of at least 60%.
For advice on using fast.genomics and related tools, see "Interactive tools for functional annotation of bacterial genomes".
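As a rough illustration of the 60% coding-density requirement mentioned above, the sketch below computes the fraction of genome bases covered by at least one coding sequence. This is an assumed definition (overlapping genes are counted once; coordinates are 1-based and inclusive); the exact calculation used for fast.genomics may differ, and the function names are hypothetical.

```python
from collections import defaultdict


def coding_density(total_length, cds):
    """Fraction of bases covered by at least one CDS.

    total_length: summed length of all scaffolds, in bp.
    cds: iterable of (scaffold_id, start, end), 1-based inclusive coordinates.
    """
    by_scaffold = defaultdict(list)
    for scaffold, start, end in cds:
        by_scaffold[scaffold].append((start, end))

    covered = 0
    for intervals in by_scaffold.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:
                # Overlapping or directly adjacent: extend the current block.
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
    return covered / total_length


def passes_coding_density(total_length, cds, minimum=0.60):
    """True if the genome meets the minimum coding density."""
    return coding_density(total_length, cds) >= minimum
```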

Downloads for the main database

Genomes (tab-delimited)
Protein sequences (fasta format, gzipped)
SQLite3 database (see schema)
Source code (github repository)
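Once downloaded, the files can be inspected with the Python standard library. The sketch below lists the tables in a SQLite3 file without assuming anything about the schema, and iterates over a gzipped FASTA file. The file names in the usage note are placeholders, not the actual download names.

```python
import gzip
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite3 database file."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [row[0] for row in rows]


def iter_fasta_gz(path):
    """Yield (header, sequence) pairs from a gzipped FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

For example, `list_tables("downloaded.db")` shows which tables to look up in the schema, and `sum(1 for _ in iter_fasta_gz("proteins.faa.gz"))` counts the protein sequences in the download.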

by Morgan Price, Arkin group
Lawrence Berkeley National Laboratory