fast.genomics: compare 6,575 genera of bacteria & archaea

 

Fast.genomics has been updated to GTDB release 09-RS220 (from April 2024)

Enter an identifier, a protein sequence, or a genus name and a protein description
examples: ING2E5A_RS06865, P0A884, 3osdA, Escherichia thymidylate synthase

Or search for a taxon:
use % for wild cards
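
As a rough illustration of the % wildcard, the sketch below assumes it behaves like the SQL LIKE wildcard, matching any run of characters (including none). The helper names `like_to_regex` and `match_taxa`, the case-insensitive matching, and the sample genus list are assumptions made for this example, not a description of how the site implements the search.

```python
import re
from typing import List


def like_to_regex(pattern: str):
    """Translate a pattern with % wildcards into a compiled regex.

    Assumption: % matches any run of characters, as in SQL LIKE.
    Assumption: matching ignores case.
    """
    # Escape each literal piece so regex metacharacters are taken literally,
    # then join the pieces with ".*" where the % wildcards were.
    pieces = [re.escape(piece) for piece in pattern.split("%")]
    return re.compile(".*".join(pieces), re.IGNORECASE)


def match_taxa(pattern: str, names: List[str]) -> List[str]:
    """Return the names that match the whole pattern, in input order."""
    regex = like_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]


# Hypothetical list of genus names, for illustration only.
genera = ["Escherichia", "Enterobacter", "Bacteroides", "Bacillus"]
prefix_hits = match_taxa("Bac%", genera)
infix_hits = match_taxa("%bacter%", genera)
print(prefix_hits)
print(infix_hits)
```

Under these assumptions, a trailing % gives a prefix search, and a % on both sides gives a substring search.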

About fast.genomics

Fast.genomics includes representative genomes for 6,575 genera of Archaea and Bacteria. These were classified using the Genome Taxonomy Database (GTDB). Only high-quality genomes are included. Potential chimeras were excluded using CheckM2. Where possible, genomes were taken from NCBI's RefSeq.

Fast.genomics uses mmseqs2 to find homologs for a protein sequence of interest. This usually takes a few seconds. To speed up the search, fast.genomics keeps the mmseqs2 index in memory and runs the alignment step in parallel. The protein of interest need not be in the fast.genomics database.
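
For readers who want to try a comparable search locally, here is a minimal sketch that builds an `mmseqs easy-search` command line in Python. It does not reproduce the site's setup (the in-memory index and parallel alignment step are not shown). The file names, the E-value cutoff, and the thread count are placeholders chosen for this example, not the settings fast.genomics uses.

```python
import shlex
from typing import List


def easy_search_command(
    query_fasta: str,
    target_db: str,
    result_file: str,
    tmp_dir: str,
    evalue: float = 1e-3,   # illustrative cutoff, not the site's setting
    threads: int = 4,       # illustrative thread count
) -> List[str]:
    """Build an argument list for `mmseqs easy-search`.

    Positional arguments are: query, target, result file, temporary directory.
    """
    return [
        "mmseqs", "easy-search",
        query_fasta, target_db, result_file, tmp_dir,
        "-e", str(evalue),
        "--threads", str(threads),
    ]


# Hypothetical file names, for illustration only.
cmd = easy_search_command("query.faa", "proteins.faa", "hits.m8", "tmp")
print(shlex.join(cmd))

# To actually run the search (requires mmseqs2 to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```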

Once the homologs are identified, fast.genomics can quickly show:

(These examples are for a putative 3-ketoglycoside lyase, ING2E5A_RS06865. This family of proteins was formerly known as DUF1080.)

A database for each order

Fast.genomics also includes a database for each order, with every species represented, and up to 10 genomes per species. The per-order database will often include many more close homologs than the top-level database. You can reach the per-order database from the taxon or genome pages. Also, most gene pages have a link to search for homologs within that genome's order.

To speed up searches for homologs within an order, fast.genomics uses a pre-computed clustering (from CD-HIT) of all of the proteins in that order. First, fast.genomics searches against clusters (using protein BLAST and E ≤ 0.001); then it compares the query to all members of those clusters (using lastal and E ≤ 0.001, with E-values rescaled).
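
The two-stage idea can be sketched in a few lines of Python. This is a toy model only: a shared k-mer score (Jaccard index) stands in for both protein BLAST and lastal, a similarity threshold stands in for the E-value cutoffs, and all names and sequences are invented for the example.

```python
from typing import Dict, List, Set, Tuple


def kmer_set(seq: str, k: int = 3) -> Set[str]:
    """All overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Toy stand-in for an alignment score: Jaccard index of k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def two_stage_search(
    query: str,
    representatives: Dict[str, str],   # cluster id -> representative protein id
    members: Dict[str, List[str]],     # cluster id -> all protein ids in cluster
    sequences: Dict[str, str],         # protein id -> sequence
    threshold: float = 0.3,            # stand-in for the E-value cutoff
) -> List[Tuple[str, float]]:
    # Stage 1: compare the query only to one representative per cluster.
    hit_clusters = [
        cid for cid, rep in representatives.items()
        if similarity(query, sequences[rep]) >= threshold
    ]
    # Stage 2: compare the query to every member of the clusters that hit.
    hits = []
    for cid in hit_clusters:
        for pid in members[cid]:
            score = similarity(query, sequences[pid])
            if score >= threshold:
                hits.append((pid, score))
    # Best hits first; ties broken by protein id for a stable order.
    return sorted(hits, key=lambda h: (-h[1], h[0]))


# Invented sequences and clusters, for illustration only.
sequences = {
    "p1": "MKTAYIAKQRQISFVKSHFSRQ",
    "p2": "MKTAYIAKQRQISFVKSHFSRA",
    "p3": "GGGGSGGGGSGGGGSGGGGS",
    "p4": "GGGGSGGGGSGGGGTGGGGS",
}
representatives = {"c1": "p1", "c2": "p3"}
members = {"c1": ["p1", "p2"], "c2": ["p3", "p4"]}

hits = two_stage_search("MKTAYIAKQRQISFVKSHFSRQ", representatives, members, sequences)
print(hits)
```

The saving comes from stage 1: clusters whose representative does not resemble the query are skipped entirely, so their members are never compared.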

Statistics for the main database of diverse Bacteria and Archaea

Phyla: 131
Classes: 293
Orders: 786
Families: 1,683
Genomes: 6,575
Protein sequences: 22,373,411
Genes: 23,215,514
Database date: May 17, 2024
GTDB version: 09-RS220 (Apr 2024)

Further information

Downloads for the main database

by Morgan Price, Arkin group
Lawrence Berkeley National Laboratory