Classify Sequences with CLARK

CLARK (CLAssifier based on Reduced K-mers) is a tool for supervised sequence classification based on discriminative k-mers.

UGENE provides a GUI for the CLARK and CLARK-l variants of the CLARK framework, which assign metagenomic reads to known genomes.

Parameters in GUI

 

Parameter | Description | Default value
Input data | To classify single-end (SE) reads or scaffolds obtained by de novo read assembly, set this parameter to "SE reads or scaffolds". To classify paired-end (PE) reads, set it to "PE reads". | SE reads or scaffolds
Classification tool | Use CLARK-l ("l" for light) on workstations with limited memory; it provides precise classification on small metagenomes. It works with a sparse, or "light", database (up to 4 GB of RAM) while still producing accurate results quickly. | CLARK-l
Database | A path to the folder with the CLARK database files (-D). The "targets.txt" file is expected to be in this folder; it is passed to the "classify_metagenome.sh" script from the CLARK package via the -T parameter. |
Minimum k-mer frequency | Minimum frequency (number of occurrences) of the discriminative k-mers (-t). For example, for 1 (or 2), the program discards any discriminative k-mer that appears only once (or less than twice). | 0
Mode | Sets the execution mode (-m): "Full" returns detailed results, confidence scores, and other statistics; "Default" returns a results summary with the best trade-off between classification speed, accuracy, and RAM usage; "Express" returns a results summary at the highest possible speed. | Default
Gap | The "gap", or number of non-overlapping k-mers to pass over when creating the database (-g). Increase the value to reduce RAM usage. Note that this degrades sensitivity. | 4
Load database into memory | Requests loading the database file as a memory-mapped file (--ldm). This option shortens the loading time but requires a significant additional amount of RAM. It also allows the database to be loaded by multiple threads (see also the "Number of threads" parameter). | False
Number of threads | Use multiple threads for the classification and, when the "Load database into memory" option is enabled, for loading the database into RAM (-n). | 8
Output file | Specify the output file name. | auto
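
As a rough illustration of how the GUI parameters relate to the command-line flags named in the descriptions, the following Python sketch assembles an option list. It is not UGENE's actual implementation: the function name clark_options, its keyword names, and the numeric codes used for the -m flag are assumptions made for this example, and only the flags mentioned in the table are emitted.

```python
import posixpath
from typing import List

# Assumption: numeric codes passed to -m for each GUI mode name.
MODE_CODES = {"Full": 0, "Default": 1, "Express": 2}


def clark_options(database_dir: str,
                  min_kmer_freq: int = 0,
                  mode: str = "Default",
                  gap: int = 4,
                  preload: bool = False,
                  threads: int = 8) -> List[str]:
    """Build a list of CLARK options from GUI-style settings (hypothetical helper)."""
    if mode not in MODE_CODES:
        raise ValueError(f"unknown mode: {mode!r}")
    # "targets.txt" is expected to sit inside the database folder.
    targets = posixpath.join(database_dir, "targets.txt")
    options = [
        "-D", database_dir,
        "-T", targets,
        "-t", str(min_kmer_freq),
        "-m", str(MODE_CODES[mode]),
        "-g", str(gap),
        "-n", str(threads),
    ]
    if preload:
        # Memory-mapped loading of the database file.
        options.append("--ldm")
    return options


if __name__ == "__main__":
    print(" ".join(clark_options("/data/clark_db", preload=True)))
```

The default arguments mirror the default values in the table, so calling the helper with only a database folder reproduces the default GUI configuration.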

Parameters in Workflow File

Type: clark-classify

Parameter | Parameter in the GUI | Type
sequencing-reads | Input data | string
tool-variant | Classification tool | number
database | Database | string
k-min-freq | Minimum k-mer frequency | number
mode | Mode | bool
gap | Gap | number
preload | Load database into memory | bool
threads | Number of threads | number
output-url | Output file | string
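
When settings are collected under their GUI labels, they can be renamed to the workflow-file parameter names with a simple lookup. The sketch below is illustrative only: the dictionary GUI_TO_WORKFLOW and the function to_workflow_names are hypothetical names, and the code does not read or write an actual workflow file.

```python
from typing import Any, Dict

# GUI label -> workflow-file parameter name, as listed for clark-classify.
GUI_TO_WORKFLOW = {
    "Input data": "sequencing-reads",
    "Classification tool": "tool-variant",
    "Database": "database",
    "Minimum k-mer frequency": "k-min-freq",
    "Mode": "mode",
    "Gap": "gap",
    "Load database into memory": "preload",
    "Number of threads": "threads",
    "Output file": "output-url",
}


def to_workflow_names(gui_settings: Dict[str, Any]) -> Dict[str, Any]:
    """Rename GUI-labelled settings to workflow-file parameter names."""
    unknown = [label for label in gui_settings if label not in GUI_TO_WORKFLOW]
    if unknown:
        raise KeyError(f"unknown GUI parameter(s): {unknown}")
    return {GUI_TO_WORKFLOW[label]: value for label, value in gui_settings.items()}


if __name__ == "__main__":
    print(to_workflow_names({"Gap": 4, "Number of threads": 8}))
```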

Input/Output Ports

The element has 1 input port:

Name in GUI: Input sequences

URL(s) to FASTQ or FASTA file(s) should be provided. For SE reads or scaffolds, use the "Input URL 1" slot only.

For PE reads, pass the "left" reads to "Input URL 1" and the "right" reads to "Input URL 2". See also the "Input data" parameter of the element.

Name in Workflow File: in

Slots:

Slot in GUI | Slot in Workflow File | Type
Input URL 1 | url | string
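
The slot rule for SE and PE input can be sketched in Python as follows. The function assign_input_slots is a hypothetical helper written for this example and the file names are placeholders; it only illustrates which slot receives which file.

```python
from typing import Dict, List


def assign_input_slots(input_data: str, urls: List[str]) -> Dict[str, str]:
    """Map input files to GUI slot names according to the "Input data" setting."""
    if input_data == "SE reads or scaffolds":
        if len(urls) != 1:
            raise ValueError("SE reads or scaffolds need exactly one file")
        # Only the first slot is used for single-end reads or scaffolds.
        return {"Input URL 1": urls[0]}
    if input_data == "PE reads":
        if len(urls) != 2:
            raise ValueError("PE reads need a left and a right file")
        left, right = urls
        return {"Input URL 1": left, "Input URL 2": right}
    raise ValueError(f"unknown input data setting: {input_data!r}")


if __name__ == "__main__":
    print(assign_input_slots("PE reads", ["reads_1.fastq", "reads_2.fastq"]))
```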

The element has 1 output port:

Name in GUI: CLARK Classification

A map of sequence names with the associated taxonomy IDs, classified by CLARK.

Name in Workflow File: out

Slots:

Slot in GUI | Slot in Workflow File | Type
Taxonomy classification data | tax-data | tax-classification
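
To illustrate how a map of sequence names to taxonomy IDs might be consumed downstream, the sketch below counts the classified sequences per taxonomy ID. It assumes the map is available as a plain Python dictionary; the internal representation of the tax-classification type is not described here, and the function name and the example IDs are arbitrary illustrative values.

```python
from collections import Counter
from typing import Dict


def reads_per_taxon(classification: Dict[str, int]) -> Counter:
    """Count how many sequences were assigned to each taxonomy ID."""
    return Counter(classification.values())


if __name__ == "__main__":
    # Made-up example data: sequence name -> taxonomy ID.
    example = {"read_1": 562, "read_2": 562, "read_3": 1280}
    for tax_id, count in reads_per_taxon(example).most_common():
        print(tax_id, count)
```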