NAME

coverm genome - Calculate read coverage per-genome (version 0.7.0)

SYNOPSIS

coverm genome <GENOME_DESCRIPTION> <MAPPING_INPUT> ..

DESCRIPTION

coverm genome calculates the coverage of a set of reads on a set of genomes.

This process can be undertaken in several ways, for instance by specifying BAM files or raw reads as input, defining genomes in different input formats, dereplicating genomes before mapping, using different mapping programs, thresholding read alignments, using different methods of calculating coverage and printing the calculated coverage in various formats.

The source code for CoverM is available at https://github.com/wwood/CoverM

READ MAPPING PARAMETERS

-1 PATH ..: Forward FASTA/Q file(s) for mapping. These may be gzipped or not.

-2 PATH ..: Reverse FASTA/Q file(s) for mapping. These may be gzipped or not.

-c, --coupled PATH ..: One or more pairs of forward and reverse possibly gzipped FASTA/Q files for mapping in order <sample1_R1.fq.gz> <sample1_R2.fq.gz> <sample2_R1.fq.gz> <sample2_R2.fq.gz> ..
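
The ordering that --coupled expects (forward then reverse, sample by sample) is easy to get wrong when there are many samples. The following is a minimal sketch of building the argument list programmatically; the helper name coupled_args, the file names and the genomes/ directory are hypothetical and chosen only for illustration.

```python
import shlex


def coupled_args(samples):
    """Flatten (forward, reverse) pairs into the order --coupled expects."""
    args = ["--coupled"]
    for forward, reverse in samples:
        args.extend([forward, reverse])
    return args


# Hypothetical file names, for illustration only.
pairs = [
    ("sample1_R1.fq.gz", "sample1_R2.fq.gz"),
    ("sample2_R1.fq.gz", "sample2_R2.fq.gz"),
]
command = ["coverm", "genome", *coupled_args(pairs),
           "--genome-fasta-directory", "genomes/"]

# Print the command line rather than running it.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run unchanged, which avoids shell quoting problems with unusual file names.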

--interleaved PATH ..: Interleaved FASTA/Q file(s) for mapping. These may be gzipped or not.

--single PATH ..: Unpaired FASTA/Q file(s) for mapping. These may be gzipped or not.

-b, --bam-files PATH: Path to BAM file(s). These must be reference sorted (e.g. with samtools sort) unless --sharded is specified, in which case they must be read name sorted (e.g. with samtools sort -n). When specified, no read mapping is performed.

GENOME DEFINITION

-f, --genome-fasta-files PATH ..: Path(s) to FASTA files of each genome e.g. pathA/genome1.fna pathB/genome2.fa.

-d, --genome-fasta-directory PATH: Directory containing FASTA files of each genome.

-x, --genome-fasta-extension EXT: File extension of genomes in the directory specified with -d/--genome-fasta-directory. [default: fna]

--genome-fasta-list PATH: File containing FASTA file paths, one per line.

-r, --reference PATH: FASTA file of contigs e.g. concatenated genomes or metagenome assembly, or minimap2 index (with --minimap2-reference-is-index), strobealign index (with --strobealign-use-index), or BWA index stem (with -p bwa-mem/bwa-mem2). If multiple reference FASTA files are provided and --sharded is specified, then reads will be mapped to each reference separately as sharded BAMs. NOTE: If genomic FASTA files are specified elsewhere (e.g. with --genome-fasta-files or --genome-fasta-directory), then --reference is not needed, as a reference FASTA file can be derived by concatenating the input genomes. In these situations, --reference can optionally be specified if an alternate reference sequence set is desired.

-s, --separator CHARACTER: This character separates genome names from contig names in the reference file. Requires --reference. [default: unspecified]

--single-genome: All contigs are from the same genome. Requires --reference. [default: not set]

--genome-definition FILE: File containing list of genome_name<tab>contig lines to define the genome of each contig. Requires --reference. [default: not set]
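
The two ways of assigning contigs to genomes with --reference (a separator character, or a genome definition file) carry the same information. As a rough illustration, the sketch below turns separator-style contig names into genome_name<tab>contig lines. It assumes that the genome name is everything before the first occurrence of the separator; the contig names and the '~' separator are hypothetical.

```python
def genome_definition_lines(contig_names, separator):
    """Return genome_name<tab>contig lines derived from contig names.

    Assumption: the genome name is the text before the first separator.
    """
    lines = []
    for contig in contig_names:
        genome, found, _rest = contig.partition(separator)
        if not found:
            raise ValueError(f"separator {separator!r} not found in {contig!r}")
        lines.append(f"{genome}\t{contig}")
    return lines


# Hypothetical contig names, for illustration only.
contigs = ["genome1~contig1", "genome1~contig2", "genome2~contig1"]
for line in genome_definition_lines(contigs, "~"):
    print(line)
```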

--use-full-contig-names: Specify that the input BAM files have been generated with mapping software that includes the full name of each contig in the reference definition (i.e. characters after the space), so when reading in genomes, record contig names as such.

DEREPLICATION / GENOME CLUSTERING

--dereplicate: Dereplicate genomes by average nucleotide identity (ANI), choosing one genome to represent all genomes within a small distance. A fast preclustering step (see --dereplication-precluster-method) is followed by a final ANI calculation (see --dereplication-cluster-method). When this flag is used, dereplication occurs transparently through the Galah method (https://github.com/wwood/galah). [default: not set]

--checkm2-quality-report PATH: CheckM version 2 quality_report.tsv (i.e. the quality_report.tsv in the output directory output of checkm2 predict ..) for defining genome quality, which is used both for filtering and to rank genomes during clustering.

--checkm-tab-table PATH: CheckM tab table (i.e. the output of checkm .. --tab_table -f PATH ..). The information contained is used like --checkm2-quality-report.

--genome-info PATH: dRep style genome info table for defining quality. The information contained is used like --checkm2-quality-report.

--min-completeness FLOAT: Ignore genomes with less completeness than this percentage. [default: not set]

--max-contamination FLOAT: Ignore genomes with more contamination than this percentage. [default: not set]

--dereplication-ani FLOAT: Overall ANI level to dereplicate at with FastANI. [default: 95]

--dereplication-aligned-fraction FLOAT: Min aligned fraction of two genomes for clustering. [default: 15]

--dereplication-fragment-length FLOAT: Length of fragment used in FastANI calculation (i.e. --fragLen). [default: 3000]

--dereplication-quality-formula FORMULA

Scoring function for genome quality [default: `Parks2020_reduced`]. One of:
formula	description
`Parks2020_reduced`	(default) A quality formula described in Parks et al. 2020 https://doi.org/10.1038/s41587-020-0501-8 (Supplementary Table 19) but only including those scoring criteria that can be calculated from the sequence without homology searching: `completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000`
`completeness-4contamination`	`completeness-4*contamination`
`completeness-5contamination`	`completeness-5*contamination`
`dRep`	`completeness-5*contamination+contamination*(strain_heterogeneity/100)+0.5*log10(N50)`
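
To make the arithmetic of the scoring formulas concrete, the sketch below evaluates them as plain Python functions. The function names and the input values are hypothetical; this illustrates the formulas in the table and is not a copy of the dereplication code.

```python
import math


def parks2020_reduced(completeness, contamination, num_contigs, num_ambiguous_bases):
    """completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000"""
    return (completeness
            - 5 * contamination
            - 5 * num_contigs / 100
            - 5 * num_ambiguous_bases / 100000)


def completeness_4contamination(completeness, contamination):
    return completeness - 4 * contamination


def completeness_5contamination(completeness, contamination):
    return completeness - 5 * contamination


def drep(completeness, contamination, strain_heterogeneity, n50):
    """completeness-5*contamination+contamination*(strain_heterogeneity/100)+0.5*log10(N50)"""
    return (completeness
            - 5 * contamination
            + contamination * (strain_heterogeneity / 100)
            + 0.5 * math.log10(n50))


# Hypothetical genome: 95% complete, 2% contaminated, 100 contigs, no ambiguous bases.
score = parks2020_reduced(95, 2, num_contigs=100, num_ambiguous_bases=0)
print(score)
```

Under every formula a higher score indicates a better genome, so contamination is penalised several times more heavily than completeness is rewarded.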

--dereplication-prethreshold-ani FLOAT: Require at least this dashing-derived ANI for preclustering and to avoid FastANI on distant lineages within preclusters. [default: 90]

--dereplication-precluster-method NAME: Method of calculating rough ANI for dereplication. 'dashing' for HyperLogLog, 'finch' for finch MinHash, 'skani' for Skani. [default: skani]

--dereplication-cluster-method NAME: Method of calculating ANI. 'fastani' for FastANI, 'skani' for Skani. [default: skani]

--dereplication-output-cluster-definition PATH: Output a file of representative<TAB>member lines.
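
A file of representative<TAB>member lines is straightforward to load for downstream analysis. The sketch below groups members under their representative; the function name and the example paths are hypothetical, and whether a representative is also listed as a member of its own cluster is not assumed either way.

```python
from collections import defaultdict


def parse_cluster_definition(lines):
    """Group members by representative from representative<TAB>member lines."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        representative, member = line.split("\t")
        clusters[representative].append(member)
    return dict(clusters)


# Hypothetical file content, for illustration only.
example = [
    "genomes/a.fna\tgenomes/a.fna\n",
    "genomes/a.fna\tgenomes/b.fna\n",
    "genomes/c.fna\tgenomes/c.fna\n",
]
clusters = parse_cluster_definition(example)
for representative, members in clusters.items():
    print(representative, len(members))
```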

--dereplication-output-representative-fasta-directory PATH: Symlink representative genomes into this directory.

--dereplication-output-representative-fasta-directory-copy PATH: Copy representative genomes into this directory.

--dereplication-output-representative-list PATH: Print newline separated list of paths to representatives into this file.

SHARDING

--sharded: If -b/--bam-files was used: Input BAM files are read-sorted alignments of a set of reads mapped to multiple reference contig sets. Choose the best hit for each read pair. Otherwise if mapping was carried out: Map reads to each reference, choosing the best hit for each pair. [default: not set]

--exclude-genomes-from-deshard: Ignore genomes whose name appears in this newline-separated file when combining shards. [default: not set]

MAPPING ALGORITHM OPTIONS

-p, --mapper NAME

Underlying mapping software used [default: `minimap2-sr`]. One of:
name	description
`minimap2-sr`	minimap2 with '`-x sr`' option
`bwa-mem`	bwa mem using default parameters
`bwa-mem2`	bwa-mem2 using default parameters
`minimap2-ont`	minimap2 with '`-x map-ont`' option
`minimap2-pb`	minimap2 with '`-x map-pb`' option
`minimap2-hifi`	minimap2 with '`-x map-hifi`' option
`minimap2-no-preset`	minimap2 with no '`-x`' option

--minimap2-params PARAMS: Extra parameters to provide to minimap2, both indexing command (if used) and for mapping. Note that usage of this parameter has security implications if untrusted input is specified. '-a' is always specified to minimap2. [default: none]

--minimap2-reference-is-index: Treat reference as a minimap2 database, not as a FASTA file. [default: not set]

--bwa-params PARAMS: Extra parameters to provide to BWA or BWA-MEM2. Note that usage of this parameter has security implications if untrusted input is specified. [default: none]

--strobealign-params PARAMS: Extra parameters to provide to strobealign. Note that usage of this parameter has security implications if untrusted input is specified. [default: none]

--strobealign-use-index: Use a pregenerated index (one that has been created with 'strobealign --create-index'). The --reference option should be specified as the original FASTA file i.e. 'ref.fna' not 'ref.fna.r100.sti' [default: not set]

ALIGNMENT THRESHOLDING

--min-read-aligned-length INT: Exclude reads with smaller numbers of aligned bases. [default: 0]

--min-read-percent-identity FLOAT: Exclude reads by overall percent identity e.g. 95 for 95%. [default: 0]

--min-read-aligned-percent FLOAT: Exclude reads by percent aligned bases e.g. 95 means 95% of the read's bases must be aligned. [default: 0]

--min-read-aligned-length-pair INT: Exclude pairs with smaller numbers of aligned bases. Implies --proper-pairs-only. [default: 0]

--min-read-percent-identity-pair FLOAT: Exclude pairs by overall percent identity e.g. 95 for 95%. Implies --proper-pairs-only. [default: 0]

--min-read-aligned-percent-pair FLOAT: Exclude pairs by percent aligned bases e.g. 95 means 95% of the pair's bases must be aligned. Implies --proper-pairs-only. [default: 0]

--proper-pairs-only: Require reads to be mapped as proper pairs. [default: not set]

--exclude-supplementary: Exclude supplementary alignments. [default: not set]

--include-secondary: Include secondary alignments. [default: not set]
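
The three per-read thresholds can be pictured as a simple filter. The sketch below is an illustration only: the ReadAlignment fields are hypothetical summary numbers, and it assumes that percent identity is matching bases divided by aligned bases and that percent aligned is aligned bases divided by read length. The exact definitions used by CoverM are not stated in this document.

```python
from dataclasses import dataclass


@dataclass
class ReadAlignment:
    """Hypothetical per-read summary, not a CoverM data structure."""
    read_length: int
    aligned_bases: int
    matching_bases: int


def passes_thresholds(aln, min_aligned_length=0,
                      min_percent_identity=0.0, min_aligned_percent=0.0):
    """Return True if the alignment survives all three thresholds."""
    if aln.aligned_bases == 0 or aln.read_length == 0:
        return False
    if aln.aligned_bases < min_aligned_length:
        return False
    # Assumed definitions, see the lead-in above.
    percent_identity = 100 * aln.matching_bases / aln.aligned_bases
    aligned_percent = 100 * aln.aligned_bases / aln.read_length
    return (percent_identity >= min_percent_identity
            and aligned_percent >= min_aligned_percent)


read = ReadAlignment(read_length=150, aligned_bases=140, matching_bases=133)
print(passes_thresholds(read, min_percent_identity=95))
print(passes_thresholds(read, min_aligned_percent=95))
```

With the default thresholds of 0, every aligned read passes, which matches the documented defaults.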

COVERAGE CALCULATION OPTIONS

-m, --methods METHOD

Method(s) for calculating coverage [default: `relative_abundance`]. A more thorough description of the different methods is available at https://github.com/wwood/CoverM#calculation-methods but briefly:
method	description
`relative_abundance`	(default) Percentage relative abundance of each genome, and the unmapped read percentage
`mean`	Average number of aligned reads overlapping each position on the genome
`trimmed_mean`	Average number of aligned reads overlapping each position after removing the most deeply and shallowly covered positions. See `--trim-min`/`--trim-max` to adjust.
`coverage_histogram`	Histogram of coverage depths
`covered_bases`	Number of bases covered by 1 or more reads
`variance`	Variance of coverage depths
`length`	Length of each genome in base pairs
`count`	Number of reads aligned to each genome. Note that supplementary alignments are not counted.
`reads_per_base`	Number of reads aligned divided by the length of the genome
`anir`	Average BLAST-like identity of mapped reads
`rpkm`	Reads mapped per kilobase of genome, per million mapped reads
`tpm`	Transcripts Per Million as described in Li et al 2010 https://doi.org/10.1093/bioinformatics/btp692
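
Several of the methods above reduce to simple arithmetic over per-position depths and read counts. The sketch below computes a few of them for a single genome from the one-line descriptions in the table. The function name and inputs are hypothetical, and it deliberately ignores --contig-end-exclusion and --min-covered-fraction, so its numbers are not expected to match CoverM output exactly.

```python
def summarise_coverage(depths, reads_aligned, total_mapped_reads):
    """Compute a handful of the listed methods for one genome.

    depths: read depth at each position of the genome (hypothetical input).
    reads_aligned: number of reads aligned to this genome.
    total_mapped_reads: number of reads mapped across all genomes.
    """
    length = len(depths)
    return {
        "length": length,
        "mean": sum(depths) / length,
        "covered_bases": sum(1 for depth in depths if depth >= 1),
        "count": reads_aligned,
        "reads_per_base": reads_aligned / length,
        # Reads per kilobase of genome, per million mapped reads.
        "rpkm": reads_aligned / (length / 1_000) / (total_mapped_reads / 1_000_000),
    }


stats = summarise_coverage([0, 2, 4, 2], reads_aligned=10,
                           total_mapped_reads=1_000_000)
for method, value in stats.items():
    print(method, value)
```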

--min-covered-fraction FRACTION: Genomes with a smaller fraction of covered bases than this are reported as having zero coverage. [default: 10]

--contig-end-exclusion INT: Exclude bases at the ends of reference sequences from calculation [default: 75]

--trim-min FRACTION: Remove this smallest fraction of positions when calculating trimmed_mean [default: 5]

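--trim-max FRACTION: Remove positions above this fraction when calculating trimmed_mean, so that with the defaults the most deeply covered 5% of positions are discarded. [default: 95]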
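
The effect of --trim-min and --trim-max is easiest to see on a small example. The sketch below is one plausible reading of a trimmed mean, not CoverM's implementation: positions are sorted by depth, those below the trim-min percentile and above the trim-max percentile are dropped, and the remainder is averaged. The function name and the rounding of cut points are assumptions.

```python
def trimmed_mean(depths, trim_min_percent=5, trim_max_percent=95):
    """Mean depth after dropping the shallowest and deepest positions."""
    ordered = sorted(depths)
    n = len(ordered)
    # Assumption: cut points are rounded down to whole positions.
    low = int(n * trim_min_percent / 100)
    high = int(n * trim_max_percent / 100)
    kept = ordered[low:high]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# 20 positions: one uncovered, one with a spurious pile-up of reads.
depths = [0] + [10] * 18 + [1000]
plain_mean = sum(depths) / len(depths)
print(plain_mean, trimmed_mean(depths))
```

In this example the single high-depth position dominates the plain mean, while the trimmed mean reflects the depth of the bulk of the genome.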

OUTPUT

-o, --output-file FILE: Output coverage values to this file, or '-' for STDOUT. [default: output to STDOUT]

--output-format FORMAT: Shape of output: 'sparse' for long format, 'dense' for species-by-site. [default: dense]
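
The relationship between the two output shapes can be sketched as a reshaping step. The example below converts a hypothetical dense table (one row per genome, one column per sample) into long-format rows; the column names, their order and the values are made up for illustration and are not taken from real CoverM output.

```python
def dense_to_sparse(header, rows):
    """Turn a genome-by-sample table into (sample, genome, value) rows."""
    samples = header[1:]  # first column holds the genome name
    sparse = []
    for row in rows:
        genome, values = row[0], row[1:]
        for sample, value in zip(samples, values):
            sparse.append((sample, genome, value))
    return sparse


# Hypothetical dense table, for illustration only.
header = ["Genome", "sample1", "sample2"]
rows = [
    ["genome1", 60.0, 10.0],
    ["genome2", 30.0, 0.0],
]
for record in dense_to_sparse(header, rows):
    print(*record, sep="\t")
```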

--no-zeros: Omit printing of genomes that have zero coverage. [default: not set]

--bam-file-cache-directory DIRECTORY: Output BAM files generated during alignment to this directory. The directory may or may not exist. Note that BAM files in this directory contain all mappings, including those that later are excluded by alignment thresholding (e.g. --min-read-percent-identity) or genome-wise thresholding (e.g. --min-covered-fraction). [default: not used]

--discard-unmapped: Exclude unmapped reads from cached BAM files. [default: not set]

GENERAL OPTIONS

-t, --threads INT: Number of threads for mapping, sorting and reading. [default: 1]

-h, --help: Output a short usage message. [default: not set]

--full-help: Output a full help message and display in 'man'. [default: not set]

--full-help-roff: Output a full help message in raw ROFF format for conversion to other formats. [default: not set]

-v, --verbose: Print extra debugging information. [default: not set]

-q, --quiet: Unless there is an error, do not print log messages. [default: not set]

FREQUENTLY ASKED QUESTIONS (FAQ)

Can the temporary directory used be changed? CoverM makes use of the system temporary directory (often /tmp) to store intermediate files. This can cause problems if the amount of storage available there is small or used by many programs. To fix this, set the TMPDIR environment variable. For example, to use the current directory: TMPDIR=. coverm genome <etc>
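
When CoverM is launched from a script rather than an interactive shell, the same TMPDIR setting can be passed through the child process environment. This is a minimal sketch; the helper name env_with_tmpdir and the scratch path are hypothetical.

```python
import os


def env_with_tmpdir(tmpdir, base_env=None):
    """Return a copy of the environment with TMPDIR set to tmpdir."""
    env = dict(os.environ if base_env is None else base_env)
    env["TMPDIR"] = tmpdir
    return env


# Hypothetical scratch location, for illustration only.
env = env_with_tmpdir("/scratch/coverm_tmp")
print(env["TMPDIR"])

# The environment can then be handed to the child process, e.g.:
#   subprocess.run(["coverm", "genome", ...], env=env, check=True)
```

Copying the environment rather than modifying os.environ keeps the change local to the CoverM invocation.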

For thresholding arguments e.g. --dereplication-ani and --min-read-percent-identity, should a percentage (e.g. 97%) or fraction (e.g. 0.97) be specified? Either is fine. CoverM determines which is being used by whether the value is less than or greater than 1.
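
The percentage-or-fraction rule can be sketched as a small normalisation step. The function below is an illustration, not CoverM's code. The text above does not say how a value of exactly 1 is handled, so treating it as a fraction (that is, 100%) is an assumption.

```python
def as_fraction(value):
    """Interpret values above 1 as percentages and others as fractions.

    Assumption: a value of exactly 1 is treated as the fraction 1.0.
    """
    if value < 0:
        raise ValueError("threshold must not be negative")
    return value / 100 if value > 1 else value


for raw in (97, 0.97, 95, 0):
    print(raw, as_fraction(raw))
```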

EXIT STATUS

0: Successful program execution.

1: Unsuccessful program execution.

101: The program panicked.
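
A wrapper script can use these exit statuses to report failures clearly. The sketch below maps the documented codes to messages; the function name is hypothetical, and any other code is reported as undocumented rather than guessed at.

```python
def describe_exit_status(code):
    """Translate a coverm exit status into a human-readable message."""
    meanings = {
        0: "Successful program execution.",
        1: "Unsuccessful program execution.",
        101: "The program panicked.",
    }
    return meanings.get(code, f"Undocumented exit status: {code}")


for code in (0, 1, 101, 2):
    print(code, describe_exit_status(code))

# Typical use with a finished process (result is a CompletedProcess):
#   print(describe_exit_status(result.returncode))
```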

EXAMPLES

Map paired reads to 2 genomes, and output relative abundances to output.tsv:

$ coverm genome --coupled read1.fastq.gz read2.fastq.gz --genome-fasta-files genome1.fna genome2.fna -o output.tsv

Calculate coverage of genomes defined as .fna files in genomes_directory/ from a sorted BAM file:

$ coverm genome --bam-files my.bam --genome-fasta-directory genomes_directory/

Dereplicate genomes at 99% ANI before mapping unpaired reads:

$ coverm genome --genome-fasta-directory genomes/ --dereplicate --dereplication-ani 99 --single single_reads.fq.gz

AUTHOR

Ben J. Woodcroft, Centre for Microbiome Research, School of Biomedical Sciences, Faculty of Health, Queensland University of Technology <benjwoodcroft near gmail.com>