NAME

galah cluster - Cluster genome FASTA files by average nucleotide identity (version 0.4.0)

SYNOPSIS

galah cluster <GENOME_INPUTS> <OUTPUT_ARGUMENTS>

DESCRIPTION

This cluster mode dereplicates genomes, choosing a subset of the input genomes as representatives. Required inputs are (1) a genome definition, and (2) an output format definition.
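As an illustration of combining one genome definition with one output definition, the sketch below assembles an argument list from flags documented on this page. The helper name `build_galah_cluster_command` and the paths `genomes` and `clusters.tsv` are hypothetical; the command is only built and printed, not executed.

```python
import shlex


def build_galah_cluster_command(genome_dir, cluster_tsv, extension="fna", ani=95, threads=1):
    """Return an argv list for `galah cluster` (hypothetical helper)."""
    return [
        "galah", "cluster",
        # Genome definition: a directory of FASTA files plus their extension.
        "--genome-fasta-directory", str(genome_dir),
        "--genome-fasta-extension", extension,
        # Clustering and general parameters.
        "--ani", str(ani),
        "--threads", str(threads),
        # Output definition: a file of representative<TAB>member lines.
        "--output-cluster-definition", str(cluster_tsv),
    ]


argv = build_galah_cluster_command("genomes", "clusters.tsv", threads=4)
print(shlex.join(argv))
```

If galah is installed and on the PATH, the list could be passed to `subprocess.run(argv, check=True)`.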

The source code for this program can be found at https://github.com/wwood/galah or https://github.com/wwood/coverm

GENOME INPUT

-f, --genome-fasta-files PATH ..

Path(s) to FASTA files of each genome, e.g. pathA/genome1.fna pathB/genome2.fa.

-d, --genome-fasta-directory PATH

Directory containing FASTA files of each genome.

-x, --genome-fasta-extension EXT

File extension of genomes in the directory specified with -d/--genome-fasta-directory. [default: fna]

--genome-fasta-list PATH

File containing FASTA file paths, one per line.
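A list file for --genome-fasta-list could be prepared as in the following sketch. The helper `write_genome_list` and the demonstration file names are hypothetical, and matching files by a `*.<extension>` glob is an assumption of this example rather than a description of how galah itself scans directories.

```python
import tempfile
from pathlib import Path


def write_genome_list(genome_dir, list_path, extension="fna"):
    """Write the paths of *.<extension> files in genome_dir, one per line."""
    paths = sorted(Path(genome_dir).glob(f"*.{extension}"))
    Path(list_path).write_text("".join(f"{path}\n" for path in paths))
    return paths


# Demonstration on a throwaway directory containing two genomes and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("genome1.fna", "genome2.fna", "notes.txt"):
        (Path(tmp) / name).write_text(">contig1\nACGT\n")
    list_file = Path(tmp) / "genomes.txt"
    write_genome_list(tmp, list_file)
    listed_lines = list_file.read_text().splitlines()
    print(listed_lines)
```

The resulting file has one FASTA path per line, which is the layout --genome-fasta-list expects.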

FILTERING PARAMETERS

--checkm2-quality-report PATH

CheckM version 2 quality_report.tsv (i.e. the quality_report.tsv in the output directory of checkm2 predict ..) for defining genome quality, which is used both to filter genomes and to rank them during clustering.

--checkm-tab-table PATH

CheckM tab table (i.e. the output of checkm .. --tab_table -f PATH ..). The information contained is used like --checkm2-quality-report.

--genome-info PATH

dRep style genome info table for defining quality. The information contained is used like --checkm2-quality-report.

--min-completeness FLOAT

Ignore genomes with less completeness than this percentage. [default: not set]

--max-contamination FLOAT

Ignore genomes with more contamination than this percentage. [default: not set]
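The two thresholds above can be read as the following predicate. This is a minimal sketch of the documented wording ("less completeness than", "more contamination than"), not galah's actual implementation; the function name and the example genome values are made up for illustration.

```python
def passes_quality_filter(completeness, contamination,
                          min_completeness=None, max_contamination=None):
    """Return True if a genome would be kept under the given thresholds.

    A threshold of None mirrors the documented default of "not set".
    """
    if min_completeness is not None and completeness < min_completeness:
        return False
    if max_contamination is not None and contamination > max_contamination:
        return False
    return True


# Hypothetical (completeness %, contamination %) values per genome.
genomes = {
    "genome1.fna": (98.2, 1.1),
    "genome2.fna": (71.5, 0.4),
    "genome3.fna": (95.0, 12.3),
}
kept = [
    name
    for name, (completeness, contamination) in genomes.items()
    if passes_quality_filter(completeness, contamination,
                             min_completeness=90, max_contamination=5)
]
print(kept)
```

Under this reading, a genome sitting exactly on a threshold is kept.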

CLUSTERING PARAMETERS

--ani FLOAT

Overall ANI level to dereplicate at with FastANI. [default: 95]

--min-aligned-fraction FLOAT

Minimum aligned fraction of two genomes for clustering. [default: 15]

--fragment-length FLOAT

Length of fragment used in FastANI calculation (i.e. --fragLen). [default: 3000]

--quality-formula FORMULA

Scoring function for genome quality. [default: Parks2020_reduced] One of:

formula                        description
Parks2020_reduced (default)    A quality formula described in Parks et al. 2020 https://doi.org/10.1038/s41587-020-0501-8 (Supplementary Table 19) but only including those scoring criteria that can be calculated from the sequence without homology searching: completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000
completeness-4contamination    completeness-4*contamination
completeness-5contamination    completeness-5*contamination
dRep                           completeness-5*contamination+contamination*(strain_heterogeneity/100)+0.5*log10(N50)

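The four --quality-formula expressions can be transcribed directly into code. The sketch below copies the formulas as written on this page; the function names and the example genome statistics are hypothetical, and this is not galah's own implementation.

```python
import math


def parks2020_reduced(completeness, contamination, num_contigs, num_ambiguous_bases):
    return (completeness - 5 * contamination
            - 5 * num_contigs / 100
            - 5 * num_ambiguous_bases / 100000)


def completeness_4contamination(completeness, contamination):
    return completeness - 4 * contamination


def completeness_5contamination(completeness, contamination):
    return completeness - 5 * contamination


def drep(completeness, contamination, strain_heterogeneity, n50):
    return (completeness - 5 * contamination
            + contamination * (strain_heterogeneity / 100)
            + 0.5 * math.log10(n50))


# Hypothetical genome: 95% complete, 2% contaminated, 100 contigs,
# no ambiguous bases, 50% strain heterogeneity, N50 of 100,000 bp.
scores = {
    "Parks2020_reduced": parks2020_reduced(95, 2, 100, 0),
    "completeness-4contamination": completeness_4contamination(95, 2),
    "completeness-5contamination": completeness_5contamination(95, 2),
    "dRep": drep(95, 2, 50, 100_000),
}
for name, score in scores.items():
    print(f"{name}\t{score:.2f}")
```

For this example the Parks2020_reduced score works out to 95 - 10 - 5 - 0 = 80.
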
--precluster-ani FLOAT

Require at least this dashing-derived ANI for preclustering and to avoid FastANI on distant lineages within preclusters. [default: 90]

--precluster-method NAME

Method of calculating rough ANI for dereplication. 'dashing' for HyperLogLog, 'finch' for finch MinHash, 'skani' for Skani. [default: skani]

--cluster-method NAME

Method of calculating ANI. 'fastani' for FastANI, 'skani' for Skani. [default: skani]

OUTPUT

--output-cluster-definition PATH

Output a file of representative<TAB>member lines.
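A file of representative<TAB>member lines can be loaded into a mapping from each representative to its members, as in this sketch. The function name and the sample content are hypothetical; in particular, the sample assumes each representative is also listed as a member of its own cluster.

```python
import io
from collections import defaultdict


def read_cluster_definition(handle):
    """Parse representative<TAB>member lines into {representative: [members]}."""
    clusters = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        representative, member = line.split("\t", 1)
        clusters[representative].append(member)
    return dict(clusters)


# Made-up sample content standing in for a real output file.
example = io.StringIO("a.fna\ta.fna\na.fna\tb.fna\nc.fna\tc.fna\n")
clusters = read_cluster_definition(example)
for representative, members in clusters.items():
    print(representative, len(members))
```

For a real file, the same function could be called on `open(path)` in place of the `io.StringIO` sample.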

--output-representative-fasta-directory PATH

Symlink representative genomes into this directory.

--output-representative-fasta-directory-copy PATH

Copy representative genomes into this directory.

--output-representative-list PATH

Print a newline-separated list of paths to representatives into this file.

GENERAL PARAMETERS

-t, --threads INT

Number of threads. [default: 1]

-v, --verbose

Print extra debugging information.

-q, --quiet

Unless there is an error, do not print log messages.

-h, --help

Output a short usage message.

--full-help

Output a full help message and display in 'man'.

--full-help-roff

Output a full help message in raw ROFF format for conversion to other formats.

EXIT STATUS

0

Successful program execution.

1

Unsuccessful program execution.

101

The program panicked.
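A wrapper script might translate these exit statuses into messages, as sketched below. The function name and message strings are hypothetical; only the three codes listed above are taken from this page.

```python
def describe_exit_status(code):
    """Map a galah exit status to a short description (hypothetical helper)."""
    meanings = {
        0: "successful program execution",
        1: "unsuccessful program execution",
        101: "the program panicked",
    }
    # Any other value is not documented on this page.
    return meanings.get(code, f"undocumented exit status {code}")


for code in (0, 1, 101, 2):
    print(code, describe_exit_status(code))
```

The `code` value would typically come from the `returncode` attribute of the object returned by `subprocess.run`.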

AUTHOR

Ben J. Woodcroft, Centre for Microbiome Research, Queensland University of Technology <benjwoodcroft near gmail.com>