Fast.genomics includes representative genomes for 6,575 genera of Archaea and Bacteria. These were classified using the Genome Taxonomy Database (GTDB). Only high-quality genomes are included. Potential chimeras were excluded using CheckM2. Where possible, genomes were taken from NCBI's RefSeq.
Fast.genomics uses mmseqs2 to find homologs for a protein sequence of interest. This usually takes a few seconds. To speed up the search, fast.genomics keeps the mmseqs2 index in memory and runs the alignment step in parallel. The protein of interest need not be in the fast.genomics database.
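As a rough illustration of this kind of search (not fast.genomics's actual invocation), the Python sketch below builds an mmseqs2 `easy-search` command line. The file names, thread count, and E-value cutoff are placeholders chosen for the example. `--db-load-mode 2` asks mmseqs2 to memory-map a precomputed index, which is one way to benefit from an index that is already resident in memory, and `--threads` parallelizes the search.

```python
import shutil
import subprocess


def build_search_command(query_fasta, target_db, out_tsv, tmp_dir,
                         threads=8, max_evalue=1e-3):
    """Build an mmseqs2 easy-search command as an argument list.

    All paths and defaults here are illustrative placeholders.
    """
    return [
        "mmseqs", "easy-search",
        query_fasta, target_db, out_tsv, tmp_dir,
        "--threads", str(threads),   # run the search in parallel
        "--db-load-mode", "2",       # memory-map the precomputed index
        "-e", str(max_evalue),       # report hits at or below this E-value
    ]


def run_search(query_fasta, target_db, out_tsv, tmp_dir, **kwargs):
    """Run the search; requires the mmseqs binary to be on PATH."""
    if shutil.which("mmseqs") is None:
        raise RuntimeError("mmseqs not found on PATH")
    cmd = build_search_command(query_fasta, target_db, out_tsv, tmp_dir, **kwargs)
    subprocess.run(cmd, check=True)
```

Because the query is supplied as a FASTA file, it can be any protein sequence, whether or not it is already in the target database.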
Once the homologs are identified, fast.genomics can quickly show:
(These examples are for a putative 3-ketoglycoside lyase, ING2E5A_RS06865. This family of proteins was formerly known as DUF1080.)
Fast.genomics also includes a database for each order, with every species represented and up to 10 genomes per species. The per-order database will often include many more close homologs than the top-level database. You can reach the per-order database from the taxon or genome pages. Also, most gene pages have a link to search for homologs within that genome's order.
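The "up to 10 genomes per species" rule can be pictured with a small Python sketch. How fast.genomics actually chooses which genomes to keep is not stated here; this example simply keeps the first ones encountered, which is an arbitrary choice, and the function name `select_genomes` is hypothetical.

```python
from collections import defaultdict


def select_genomes(genomes, max_per_species=10):
    """Keep at most max_per_species genomes for each species.

    genomes: iterable of (genome_id, species) pairs.
    Returns a dict mapping species -> list of kept genome ids.
    Assumption: the first genomes seen are kept; the real criterion may differ.
    """
    kept = defaultdict(list)
    for genome_id, species in genomes:
        if len(kept[species]) < max_per_species:
            kept[species].append(genome_id)
    return dict(kept)
```

Every species with at least one genome appears in the result, so each species stays represented even when it has a single genome.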
To speed up searches for homologs within an order, fast.genomics uses a pre-computed clustering (from CD-HIT) of all of the proteins in that order. First, fast.genomics searches against clusters (using protein BLAST and E ≤ 0.001); then it compares the query to all members of those clusters (using lastal and E ≤ 0.001, with E-values rescaled).
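The two-stage strategy can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the fast.genomics implementation: `search` is a hypothetical callable standing in for the alignment tool, `clusters` maps each representative to all members of its cluster (including the representative), and the rescaling assumes that E-values scale linearly with database size, measured here in number of sequences for simplicity. The exact rescaling fast.genomics applies is not described in the text.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (protein id, E-value)


def rescale_evalue(evalue: float, searched_size: int, reference_size: int) -> float:
    """Rescale an E-value as if the search had covered reference_size sequences."""
    return evalue * reference_size / searched_size


def two_stage_search(query: str,
                     clusters: Dict[str, List[str]],
                     search: Callable[[str, List[str]], List[Hit]],
                     reference_size: int,
                     max_evalue: float = 1e-3) -> List[Hit]:
    # Stage 1: compare the query to cluster representatives only.
    representatives = list(clusters)
    matched = [rep for rep, e in search(query, representatives) if e <= max_evalue]

    # Stage 2: compare the query to every member of the matching clusters.
    candidates = [member for rep in matched for member in clusters[rep]]
    if not candidates:
        return []

    hits = []
    for protein_id, e in search(query, candidates):
        # The second search saw a small database, so rescale before filtering.
        rescaled = rescale_evalue(e, len(candidates), reference_size)
        if rescaled <= max_evalue:
            hits.append((protein_id, rescaled))
    return sorted(hits, key=lambda hit: hit[1])
```

The benefit of this design is that the expensive comparison against individual proteins is limited to clusters whose representative already resembles the query.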
| Statistic | Value |
|---|---|
| Phyla | 131 |
| Classes | 293 |
| Orders | 786 |
| Families | 1,683 |
| Genomes | 6,575 |
| Protein sequences | 22,373,411 |
| Genes | 23,215,514 |
| Database date | May 17, 2024 |
| GTDB version | 09-RS220 (Apr 2024) |
Lawrence Berkeley National Laboratory