...

Data: NCBI taxonomy classification
Archive size: 2.5 Gb
Unpacked data size: 31 Gb
Description: A set of taxonomy data files from NCBI. These data should be present for any type of NGS classification analysis.
Data source: The original data were downloaded from the NCBI FTP (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/).

Data: NCBI RefSeq bacterial genomes
Archive size: 130 Gb
Unpacked data size: 132 Gb
Description:

The data can be used to build a database for CLARK-l (the light version of CLARK), CLARK, or Kraken.

Because UGENE integrates a modified version of CLARK/CLARK-l, *.gz archives can be provided as input for building the database. In particular, "CLARK-l DB: RefSeq bacterial+viral genomes" (see below) was generated using the archived data.

Also, keep in mind that changing some parameters of the "Classify Sequences with CLARK" element may cause the reference database to be rebuilt. The reference data should be present in this case.

Building a Kraken database from *.gz archives is not supported: each *.gz file must be unpacked first, so even more disk space is required.

Data source: The original data were downloaded from the NCBI FTP (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/bacteria*.genomic.fna.gz).

Data: NCBI RefSeq viral genomes
Archive size: 77 Mb
Unpacked data size: 77 Mb
Description:

Similar to "NCBI RefSeq bacterial genomes", although the data size is rather small.

The reference data are included in the "CLARK-l DB: RefSeq bacterial+viral genomes" and "CLARK-l DB: RefSeq viral genomes" databases (see below).

Data source: The original data were downloaded from the NCBI FTP (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral*.genomic.fna.gz).

Data: NCBI RefSeq GRCh38 human genome
Archive size: 837 Mb
Unpacked data size: 838 Mb
Description:

Similar to "NCBI RefSeq bacterial genomes".

The data are not included in any database, but are provided in case one would like to use them when building a custom database.

Data source: The original data were downloaded from the NCBI FTP (ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_*/hs_ref_GRC*chr*.fa.gz).

Data: Kraken DB: MiniKraken 4Gb database
Archive size: 2.5 Gb
Unpacked data size: 4.3 Gb

Data: CLARK-l DB: RefSeq bacterial+viral genomes
Archive size: 7.4 Gb
Unpacked data size: 11 Gb

Data: CLARK-l DB: RefSeq viral genomes
Archive size: 16 Mb
Unpacked data size: 72 Mb

Data: DIAMOND DB: UniRef50
Archive size: 5.2 Gb
Unpacked data size: 13 Gb

Data: DIAMOND DB: UniRef90
Archive size: 13 Gb
Unpacked data size: 34 Gb

Total
Archive size: 161 Gb
Unpacked data size: 226 Gb
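
The totals above can be rechecked, or recomputed for a subset of the data, with a few lines of Python. This is only a sketch: the sizes are copied from the table above, the helper names (`to_gb`, `total_gb`, `free_gb`) are made up for illustration, and it assumes that 1 Gb equals 1000 Mb and 10**9 bytes, which the page does not state.

```python
import re
import shutil

# Sizes copied from the table above: (archive size, unpacked data size).
SIZES = {
    "NCBI taxonomy classification": ("2.5 Gb", "31 Gb"),
    "NCBI RefSeq bacterial genomes": ("130 Gb", "132 Gb"),
    "NCBI RefSeq viral genomes": ("77 Mb", "77 Mb"),
    "NCBI RefSeq GRCh38 human genome": ("837 Mb", "838 Mb"),
    "Kraken DB: MiniKraken 4Gb database": ("2.5 Gb", "4.3 Gb"),
    "CLARK-l DB: RefSeq bacterial+viral genomes": ("7.4 Gb", "11 Gb"),
    "CLARK-l DB: RefSeq viral genomes": ("16 Mb", "72 Mb"),
    "DIAMOND DB: UniRef50": ("5.2 Gb", "13 Gb"),
    "DIAMOND DB: UniRef90": ("13 Gb", "34 Gb"),
}


def to_gb(size):
    """Convert a string such as '77 Mb' or '2.5 Gb' to a number of Gb.

    Assumption: 1 Gb = 1000 Mb (the table does not state the convention).
    """
    match = re.fullmatch(r"\s*([\d.]+)\s*(Mb|Gb)\s*", size)
    if match is None:
        raise ValueError(f"unrecognized size: {size!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value / 1000 if unit == "Mb" else value


def total_gb(names, column):
    """Sum one column (0 = archive, 1 = unpacked) over the chosen data sets."""
    return sum(to_gb(SIZES[name][column]) for name in names)


def free_gb(path="."):
    """Free space on the filesystem that holds `path`, in Gb (10**9 bytes assumed)."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    print(f"archives: {total_gb(SIZES, 0):.1f} Gb")
    print(f"unpacked: {total_gb(SIZES, 1):.1f} Gb")
    print(f"free here: {free_gb():.1f} Gb")
```

With the values from the table, the two sums come to roughly 161.5 Gb and 226.3 Gb, in line with the Total row. Passing a shorter list of names to `total_gb` gives the space needed for just those data sets.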
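
Since a Kraken database cannot be built from *.gz archives, the downloaded genome files have to be unpacked beforehand. The following is a minimal sketch of that step using only the Python standard library; the function name `unpack_gz_files` and the directory arguments are hypothetical, and this is not a procedure taken from UGENE itself.

```python
import gzip
import shutil
from pathlib import Path


def unpack_gz_files(src_dir, dst_dir):
    """Decompress every *.gz file in src_dir into dst_dir and return the new paths."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    unpacked = []
    for archive in sorted(src.glob("*.gz")):
        # Drop the trailing ".gz" to get the name of the unpacked file.
        target = dst / archive.name[: -len(".gz")]
        # Copy in chunks so that large genome files are never held in memory.
        with gzip.open(archive, "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        unpacked.append(target)
    return unpacked
```

The original archives are left in place, so the disk must hold both the archived and the unpacked copies until the archives are deleted.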

...