NAME
galah cluster - Cluster genome FASTA files by average nucleotide identity (version 0.4.2)
SYNOPSIS
galah cluster <GENOME_INPUTS> <OUTPUT_ARGUMENTS>
DESCRIPTION
This cluster mode dereplicates genomes, choosing a subset of the input genomes as representatives. Required inputs are (1) a genome definition, and (2) an output format definition.
The source code for this program can be found at https://github.com/wwood/galah or https://github.com/wwood/coverm
GENOME INPUT
- -f, --genome-fasta-files PATH ..
Path(s) to FASTA files of each genome, e.g. pathA/genome1.fna pathB/genome2.fa.
- -d, --genome-fasta-directory PATH
Directory containing FASTA files of each genome.
- -x, --genome-fasta-extension EXT
File extension of genomes in the directory specified with -d/--genome-fasta-directory. [default: fna]
- --genome-fasta-list PATH
File containing FASTA file paths, one per line.
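As an illustration of the list format accepted by --genome-fasta-list (one FASTA path per line), the following Python sketch collects genomes from a directory by extension and writes such a file. The function name write_genome_list and the output file name are hypothetical and are not part of galah.

```python
from pathlib import Path


def write_genome_list(directory, list_path, extension="fna"):
    """Write one FASTA path per line, the layout described for --genome-fasta-list.

    Hypothetical helper for illustration only; it is not part of galah.
    """
    # Sort so that the order of the list is reproducible between runs.
    genomes = sorted(Path(directory).glob(f"*.{extension}"))
    Path(list_path).write_text("".join(f"{path}\n" for path in genomes))
    return genomes
```

The resulting file could then be supplied as --genome-fasta-list PATH, together with one of the output arguments described under OUTPUT.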
FILTERING PARAMETERS
- --checkm2-quality-report PATH
CheckM version 2 quality_report.tsv (i.e. the quality_report.tsv in the output directory of checkm2 predict ..) for defining genome quality, which is used both for filtering and to rank genomes during clustering.
- --checkm-tab-table PATH
CheckM tab table (i.e. the output of checkm .. --tab_table -f PATH ..). The information contained is used like --checkm2-quality-report.
- --genome-info PATH
dRep style genome info table for defining quality. The information contained is used like --checkm2-quality-report.
- --min-completeness FLOAT
Ignore genomes with less completeness than this percentage. [default: not set]
- --max-contamination FLOAT
Ignore genomes with more contamination than this percentage. [default: not set]
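The effect of the two thresholds can be sketched in Python as follows. This is an illustration of the filtering rule as worded above, not galah's own code. The column names Name, Completeness and Contamination are an assumption about the layout of a CheckM2 quality_report.tsv, and passing_genomes is a hypothetical helper.

```python
import csv
import io


def passing_genomes(report_text, min_completeness=None, max_contamination=None):
    """Return the names of genomes that survive the two quality thresholds.

    Assumes tab-separated text with 'Name', 'Completeness' and 'Contamination'
    columns (an assumption about the CheckM2 report layout).
    """
    kept = []
    for row in csv.DictReader(io.StringIO(report_text), delimiter="\t"):
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        # A threshold of None mirrors the "[default: not set]" behaviour.
        if min_completeness is not None and completeness < min_completeness:
            continue  # less completeness than the threshold: ignored
        if max_contamination is not None and contamination > max_contamination:
            continue  # more contamination than the threshold: ignored
        kept.append(row["Name"])
    return kept
```

With neither threshold set, every genome in the report is kept, which matches the documented defaults.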
CLUSTERING PARAMETERS
- --ani FLOAT
Overall ANI level to dereplicate at with FastANI. [default: 95]
- --min-aligned-fraction FLOAT
Min aligned fraction of two genomes for clustering. [default: 15]
- --fragment-length FLOAT
Length of fragment used in FastANI calculation (i.e. --fragLen). [default: 3000]
- --quality-formula FORMULA
Scoring function for genome quality [default: Parks2020_reduced]. One of:
formula | description |
---|---|
Parks2020_reduced | (default) A quality formula described in Parks et al. 2020 https://doi.org/10.1038/s41587-020-0501-8 (Supplementary Table 19) but only including those scoring criteria that can be calculated from the sequence without homology searching: completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000 |
completeness-4contamination | completeness-4*contamination |
completeness-5contamination | completeness-5*contamination |
dRep | completeness-5*contamination+contamination*(strain_heterogeneity/100)+0.5*log10(N50) |
- --precluster-ani FLOAT
Require at least this dashing-derived ANI for preclustering and to avoid FastANI on distant lineages within preclusters. [default: 90]
- --precluster-method NAME
Method of calculating rough ANI for dereplication. 'dashing' for HyperLogLog, 'finch' for finch MinHash, 'skani' for Skani. [default: skani]
- --cluster-method NAME
Method of calculating ANI. 'fastani' for FastANI, 'skani' for Skani. [default: skani]
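The scoring functions listed for --quality-formula can be written out directly. The Python sketch below transcribes the four formulas from the table above so that scores for candidate genomes can be compared by hand. The function name quality_score and its keyword arguments are hypothetical, and the sketch is not taken from galah's source.

```python
import math


def quality_score(formula, completeness, contamination, num_contigs=0,
                  num_ambiguous_bases=0, strain_heterogeneity=0.0, n50=1):
    """Evaluate one of the --quality-formula scoring functions.

    Hypothetical helper; the arithmetic is copied from the formula table.
    """
    if formula == "Parks2020_reduced":
        return (completeness - 5 * contamination
                - 5 * num_contigs / 100
                - 5 * num_ambiguous_bases / 100000)
    if formula == "completeness-4contamination":
        return completeness - 4 * contamination
    if formula == "completeness-5contamination":
        return completeness - 5 * contamination
    if formula == "dRep":
        return (completeness - 5 * contamination
                + contamination * (strain_heterogeneity / 100)
                + 0.5 * math.log10(n50))
    raise ValueError(f"unknown quality formula: {formula}")
```

For example, a genome with 95% completeness, 1% contamination, 50 contigs and 1000 ambiguous bases scores 95 - 5 - 2.5 - 0.05 = 87.45 under Parks2020_reduced.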
OUTPUT
- --output-cluster-definition PATH
Output a file of representative<TAB>member lines.
- --output-representative-fasta-directory PATH
Symlink representative genomes into this directory.
- --output-representative-fasta-directory-copy PATH
Copy representative genomes into this directory.
- --output-representative-list PATH
Print newline separated list of paths to representatives into this file.
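The file written by --output-cluster-definition holds one representative<TAB>member pair per line, so it can be regrouped into clusters with a few lines of Python. The sketch below assumes only that two-column layout. The helper name read_clusters is hypothetical.

```python
def read_clusters(text):
    """Group members under their representative.

    Expects lines of the form 'representative<TAB>member'.
    Hypothetical helper for illustration; it is not part of galah.
    """
    clusters = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        representative, member = line.split("\t")
        clusters.setdefault(representative, []).append(member)
    return clusters
```

The keys of the returned dictionary are the representatives, in the order in which they first appear in the file.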
GENERAL PARAMETERS
- -t, --threads INT
Number of threads. [default: 1]
- -v, --verbose
Print extra debugging information.
- -q, --quiet
Unless there is an error, do not print log messages.
- -h, --help
Output a short usage message.
- --full-help
Output a full help message and display in 'man'.
- --full-help-roff
Output a full help message in raw ROFF format for conversion to other formats.
EXIT STATUS
- 0
Successful program execution.
- 1
Unsuccessful program execution.
- 101
The program panicked.
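A calling script can branch on these exit statuses. The Python sketch below maps the three documented codes to their meanings and shows one possible way to invoke the program through subprocess. The names describe_exit_status and run_galah_cluster are hypothetical, and the invocation assumes a galah executable is available on PATH.

```python
import subprocess

# Exit statuses as documented above.
EXIT_MEANINGS = {
    0: "Successful program execution.",
    1: "Unsuccessful program execution.",
    101: "The program panicked.",
}


def describe_exit_status(code):
    """Return the documented meaning of an exit status, if there is one."""
    return EXIT_MEANINGS.get(code, f"Undocumented exit status {code}.")


def run_galah_cluster(arguments):
    """Run 'galah cluster' with the given argument list and describe the result.

    Assumes 'galah' is on PATH; 'arguments' is a list of strings such as
    ["--genome-fasta-list", "genomes.txt", "--output-cluster-definition", "clusters.tsv"].
    """
    completed = subprocess.run(["galah", "cluster", *arguments])
    return completed.returncode, describe_exit_status(completed.returncode)
```

Treating any code outside the documented set as undocumented, rather than as success or failure, is a choice made for this sketch.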