NAME

coverm genome - Calculate read coverage per-genome (version 0.7.0)

SYNOPSIS

coverm genome <GENOME_DESCRIPTION> <MAPPING_INPUT> ..

DESCRIPTION

coverm genome calculates the coverage of a set of reads on a set of genomes.

This process can be undertaken in several ways, for instance by specifying BAM files or raw reads as input, defining genomes in different input formats, dereplicating genomes before mapping, using different mapping programs, thresholding read alignments, using different methods of calculating coverage and printing the calculated coverage in various formats.

The source code for CoverM is available at https://github.com/wwood/CoverM

READ MAPPING PARAMETERS

-1 PATH ..

Forward FASTA/Q file(s) for mapping. These may be gzipped or not.

-2 PATH ..

Reverse FASTA/Q file(s) for mapping. These may be gzipped or not.

-c, --coupled PATH ..

One or more pairs of forward and reverse possibly gzipped FASTA/Q files for mapping in order <sample1_R1.fq.gz> <sample1_R2.fq.gz> <sample2_R1.fq.gz> <sample2_R2.fq.gz> ..
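The ordering expected by --coupled (forward then reverse, sample by sample) can be sketched as below. This is only an illustration: the helper name coupled_args and the file names are hypothetical, not part of CoverM.

```python
from typing import List, Tuple


def coupled_args(pairs: List[Tuple[str, str]]) -> List[str]:
    """Flatten (forward, reverse) pairs into the order given after --coupled."""
    args = ["--coupled"]
    for forward, reverse in pairs:
        # Each sample contributes its forward file, then its reverse file.
        args.extend([forward, reverse])
    return args


# Hypothetical file names, for illustration only.
print(coupled_args([
    ("sample1_R1.fq.gz", "sample1_R2.fq.gz"),
    ("sample2_R1.fq.gz", "sample2_R2.fq.gz"),
]))
```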

--interleaved PATH ..

Interleaved FASTA/Q file(s) for mapping. These may be gzipped or not.

--single PATH ..

Unpaired FASTA/Q file(s) for mapping. These may be gzipped or not.

-b, --bam-files PATH

Path to BAM file(s). These must be reference sorted (e.g. with samtools sort) unless --sharded is specified, in which case they must be read name sorted (e.g. with samtools sort -n). When specified, no read mapping is performed.

GENOME DEFINITION

-f, --genome-fasta-files PATH ..

Path(s) to FASTA files of each genome e.g. pathA/genome1.fna pathB/genome2.fa.

-d, --genome-fasta-directory PATH

Directory containing FASTA files of each genome.

-x, --genome-fasta-extension EXT

File extension of genomes in the directory specified with -d/--genome-fasta-directory. [default: fna]

--genome-fasta-list PATH

File containing FASTA file paths, one per line.

-r, --reference PATH

FASTA file of contigs e.g. concatenated genomes or metagenome assembly, or minimap2 index (with --minimap2-reference-is-index), strobealign index (with --strobealign-use-index), or BWA index stem (with -p bwa-mem/bwa-mem2). If multiple reference FASTA files are provided and --sharded is specified, then reads will be mapped to references separately as sharded BAMs. NOTE: If genomic FASTA files are specified elsewhere (e.g. with --genome-fasta-files or --genome-fasta-directory), then --reference is not needed as a reference FASTA file can be derived by concatenating input genomes. In these situations, --reference can be optionally specified if an alternate reference sequence set is desired.

-s, --separator CHARACTER

This character separates genome names from contig names in the reference file. Requires --reference. [default: unspecified]
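How a separator character relates contig names to genome names can be sketched as follows. This assumes the genome name is everything before the first occurrence of the separator; the function name genome_of and the contig names are hypothetical.

```python
def genome_of(contig_name: str, separator: str) -> str:
    """Return the genome name prefix of a contig name.

    Assumption: the genome name is the text before the first separator.
    """
    genome, found, _contig = contig_name.partition(separator)
    if not found:
        raise ValueError(f"separator {separator!r} not found in {contig_name!r}")
    return genome


# Hypothetical contig names using '~' as the separator.
for name in ["genome1~contig_1", "genome1~contig_2", "genome2~contig_1"]:
    print(name, "->", genome_of(name, "~"))
```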

--single-genome

All contigs are from the same genome. Requires --reference. [default: not set]

--genome-definition FILE

File containing list of genome_name<tab>contig lines to define the genome of each contig. Requires --reference. [default: not set]
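A file in the genome_name<tab>contig layout described above could be written with a short script such as this one. The function name and the genome and contig names are hypothetical, chosen only for illustration.

```python
import io
from typing import Dict, Iterable, TextIO


def write_genome_definition(genome_to_contigs: Dict[str, Iterable[str]], out: TextIO) -> None:
    """Write one genome_name<TAB>contig line per contig."""
    for genome, contigs in genome_to_contigs.items():
        for contig in contigs:
            out.write(f"{genome}\t{contig}\n")


# Hypothetical genomes and contigs; a real run would write to a file on disk.
buffer = io.StringIO()
write_genome_definition({"genome1": ["contig_1", "contig_2"], "genome2": ["contig_3"]}, buffer)
print(buffer.getvalue(), end="")
```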

--use-full-contig-names

Specify that the input BAM files have been generated with mapping software that includes the full name of each contig in the reference definition (i.e. characters after the space), so when reading in genomes, record contig names as such.

DEREPLICATION / GENOME CLUSTERING

--dereplicate

Do genome dereplication via average nucleotide identity (ANI): choose a genome to represent all within a small distance, using a fast preclustering step (see --dereplication-precluster-method) followed by a final ANI calculation (see --dereplication-cluster-method). When this flag is used, dereplication occurs transparently through the Galah method (https://github.com/wwood/galah). [default: not set]

--checkm2-quality-report PATH

CheckM version 2 quality_report.tsv (i.e. the quality_report.tsv in the output directory output of checkm2 predict ..) for defining genome quality, which is used both for filtering and to rank genomes during clustering.

--checkm-tab-table PATH

CheckM tab table (i.e. the output of checkm .. --tab_table -f PATH ..). The information contained is used like --checkm2-quality-report.

--genome-info PATH

dRep style genome info table for defining quality. The information contained is used like --checkm2-quality-report.

--min-completeness FLOAT

Ignore genomes with less completeness than this percentage. [default: not set]

--max-contamination FLOAT

Ignore genomes with more contamination than this percentage. [default: not set]

--dereplication-ani FLOAT

Overall ANI level to dereplicate at, using the method chosen with --dereplication-cluster-method. [default: 95]

--dereplication-aligned-fraction FLOAT

Minimum aligned fraction of two genomes for clustering. [default: 15]

--dereplication-fragment-length FLOAT

Length of fragment used in FastANI calculation (i.e. --fragLen). [default: 3000]

--dereplication-quality-formula FORMULA

Scoring function for genome quality [default: Parks2020_reduced]. One of:

formula                        description
Parks2020_reduced (default)    A quality formula described in Parks et al. 2020 https://doi.org/10.1038/s41587-020-0501-8 (Supplementary Table 19) but only including those scoring criteria that can be calculated from the sequence without homology searching: completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000
completeness-4contamination    completeness-4*contamination
completeness-5contamination    completeness-5*contamination
dRep                           completeness-5*contamination+contamination*(strain_heterogeneity/100)+0.5*log10(N50)
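The quality formulas listed for --dereplication-quality-formula can be written out directly. The sketch below transcribes them as plain functions; the function and argument names are illustrative, and the example genome statistics are made up.

```python
import math


def parks2020_reduced(completeness, contamination, num_contigs, num_ambiguous_bases):
    # completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000
    return (completeness - 5 * contamination
            - 5 * num_contigs / 100
            - 5 * num_ambiguous_bases / 100000)


def completeness_4contamination(completeness, contamination):
    return completeness - 4 * contamination


def completeness_5contamination(completeness, contamination):
    return completeness - 5 * contamination


def drep(completeness, contamination, strain_heterogeneity, n50):
    # completeness-5*contamination+contamination*(strain_heterogeneity/100)+0.5*log10(N50)
    return (completeness - 5 * contamination
            + contamination * (strain_heterogeneity / 100)
            + 0.5 * math.log10(n50))


# Made-up genome statistics, for illustration only.
print(parks2020_reduced(95, 2, 100, 0))
print(drep(90, 2, 50, 10000))
```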

--dereplication-prethreshold-ani FLOAT

Require at least this precluster-derived ANI (see --dereplication-precluster-method) for preclustering and to avoid the final ANI calculation on distant lineages within preclusters. [default: 90]

--dereplication-precluster-method NAME

Method of calculating rough ANI for dereplication. 'dashing' for HyperLogLog, 'finch' for finch MinHash, 'skani' for Skani. [default: skani]

--dereplication-cluster-method NAME

Method of calculating ANI. 'fastani' for FastANI, 'skani' for Skani. [default: skani]

--dereplication-output-cluster-definition PATH

Output a file of representative<TAB>member lines.
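A file of representative<TAB>member lines can be read back into a mapping from each representative to its members. The sketch below is one way to do that; the function name and sample lines are hypothetical, and whether a representative is also listed as its own member is an assumption of the sample data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def read_clusters(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group member genomes under their representative genome."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        representative, member = line.split("\t")
        clusters[representative].append(member)
    return dict(clusters)


# Hypothetical cluster definition lines.
sample = ["g1.fna\tg1.fna\n", "g1.fna\tg2.fna\n", "g3.fna\tg3.fna\n"]
for representative, members in read_clusters(sample).items():
    print(representative, members)
```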

--dereplication-output-representative-fasta-directory PATH

Symlink representative genomes into this directory.

--dereplication-output-representative-fasta-directory-copy PATH

Copy representative genomes into this directory.

--dereplication-output-representative-list PATH

Print newline separated list of paths to representatives into this file.

SHARDING

--sharded

If -b/--bam-files was used: Input BAM files are read-sorted alignments of a set of reads mapped to multiple reference contig sets. Choose the best hit for each read pair. Otherwise if mapping was carried out: Map reads to each reference, choosing the best hit for each pair. [default: not set]

--exclude-genomes-from-deshard

Ignore genomes whose name appears in this newline-separated file when combining shards. [default: not set]

MAPPING ALGORITHM OPTIONS

-p, --mapper NAME

Underlying mapping software used [default: minimap2-sr]. One of:

name                  description
minimap2-sr           minimap2 with '-x sr' option
bwa-mem               bwa mem using default parameters
bwa-mem2              bwa-mem2 using default parameters
minimap2-ont          minimap2 with '-x map-ont' option
minimap2-pb           minimap2 with '-x map-pb' option
minimap2-hifi         minimap2 with '-x map-hifi' option
minimap2-no-preset    minimap2 with no '-x' option

--minimap2-params PARAMS

Extra parameters to provide to minimap2, both indexing command (if used) and for mapping. Note that usage of this parameter has security implications if untrusted input is specified. '-a' is always specified to minimap2. [default: none]

--minimap2-reference-is-index

Treat reference as a minimap2 database, not as a FASTA file. [default: not set]

--bwa-params PARAMS

Extra parameters to provide to BWA or BWA-MEM2. Note that usage of this parameter has security implications if untrusted input is specified. [default: none]

--strobealign-params PARAMS

Extra parameters to provide to strobealign. Note that usage of this parameter has security implications if untrusted input is specified. [default: none]

--strobealign-use-index

Use a pregenerated index (one that has been created with 'strobealign --create-index'). The --reference option should be specified as the original FASTA file i.e. 'ref.fna' not 'ref.fna.r100.sti' [default: not set]

ALIGNMENT THRESHOLDING

--min-read-aligned-length INT

Exclude reads with smaller numbers of aligned bases. [default: 0]

--min-read-percent-identity FLOAT

Exclude reads by overall percent identity e.g. 95 for 95%. [default: 0]

--min-read-aligned-percent FLOAT

Exclude reads by percent aligned bases e.g. 95 means 95% of the read's bases must be aligned. [default: 0]
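The three per-read thresholds above can be illustrated with a small filter. This is a sketch under stated assumptions: the ReadAlignment record and its fields are hypothetical, and percent identity is taken here as aligned bases minus edit distance, divided by aligned bases, which may not match CoverM's exact calculation.

```python
from dataclasses import dataclass


@dataclass
class ReadAlignment:
    read_length: int    # total bases in the read
    aligned_bases: int  # bases of the read that are aligned
    edit_distance: int  # mismatches plus indels (assumed, e.g. from the SAM NM tag)


def passes_thresholds(aln: ReadAlignment,
                      min_aligned_length: int = 0,
                      min_percent_identity: float = 0.0,
                      min_aligned_percent: float = 0.0) -> bool:
    """Return True if the alignment meets all three thresholds (percentages in 0-100)."""
    if aln.aligned_bases < min_aligned_length:
        return False
    if aln.aligned_bases == 0 or aln.read_length == 0:
        # Nothing aligned: passes only if no percentage threshold is set.
        return min_percent_identity <= 0 and min_aligned_percent <= 0
    identity = 100.0 * (aln.aligned_bases - aln.edit_distance) / aln.aligned_bases
    aligned_percent = 100.0 * aln.aligned_bases / aln.read_length
    return identity >= min_percent_identity and aligned_percent >= min_aligned_percent


# Made-up alignment: 150 bp read, 140 bases aligned, 7 edits.
example = ReadAlignment(read_length=150, aligned_bases=140, edit_distance=7)
print(passes_thresholds(example, min_aligned_length=100, min_percent_identity=90))
```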

--min-read-aligned-length-pair INT

Exclude pairs with smaller numbers of aligned bases. Implies --proper-pairs-only. [default: 0]

--min-read-percent-identity-pair FLOAT

Exclude pairs by overall percent identity e.g. 95 for 95%. Implies --proper-pairs-only. [default: 0]

--min-read-aligned-percent-pair FLOAT

Exclude pairs by percent aligned bases e.g. 95 means 95% of the pair's bases must be aligned. Implies --proper-pairs-only. [default: 0]

--proper-pairs-only

Require reads to be mapped as proper pairs. [default: not set]

--exclude-supplementary

Exclude supplementary alignments. [default: not set]

--include-secondary

Include secondary alignments. [default: not set]

COVERAGE CALCULATION OPTIONS

-m, --methods METHOD

Method(s) for calculating coverage [default: relative_abundance]. A more thorough description of the different methods is available at https://github.com/wwood/CoverM#calculation-methods but briefly:

method                          description
relative_abundance (default)    Percentage relative abundance of each genome, and the unmapped read percentage
mean                            Average number of aligned reads overlapping each position on the genome
trimmed_mean                    Average number of aligned reads overlapping each position after removing the most deeply and shallowly covered positions. See --trim-min/--trim-max to adjust.
coverage_histogram              Histogram of coverage depths
covered_bases                   Number of bases covered by 1 or more reads
variance                        Variance of coverage depths
length                          Length of each genome in base pairs
count                           Number of reads aligned to each genome. Note that supplementary alignments are not counted.
reads_per_base                  Number of reads aligned divided by the length of the genome
rpkm                            Reads mapped per kilobase of genome, per million mapped reads
tpm                             Transcripts Per Million as described in Li et al 2010 https://doi.org/10.1093/bioinformatics/btp692
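The count-based methods (reads_per_base, rpkm and tpm) follow from their one-line descriptions. The sketch below applies the usual textbook definitions to genomes; the function names and example numbers are illustrative, and CoverM's own implementation may differ in details such as which reads are counted.

```python
from typing import Dict


def reads_per_base(read_count: int, genome_length: int) -> float:
    """Number of reads aligned divided by the length of the genome."""
    return read_count / genome_length


def rpkm(read_count: int, genome_length: int, total_mapped_reads: int) -> float:
    """Reads mapped per kilobase of genome, per million mapped reads."""
    return read_count / (genome_length / 1_000) / (total_mapped_reads / 1_000_000)


def tpm(read_counts: Dict[str, int], genome_lengths: Dict[str, int]) -> Dict[str, float]:
    """Length-normalised read rates rescaled so that they sum to one million."""
    rates = {g: read_counts[g] / genome_lengths[g] for g in read_counts}
    total = sum(rates.values())
    if total == 0:
        return {g: 0.0 for g in rates}
    return {g: 1_000_000 * rate / total for g, rate in rates.items()}


# Made-up counts and lengths, for illustration only.
print(reads_per_base(500, 2_000_000))
print(rpkm(500, 2_000_000, 1_000_000))
print(tpm({"genome1": 100, "genome2": 300}, {"genome1": 1000, "genome2": 1000}))
```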

--min-covered-fraction FRACTION

Genomes with a smaller fraction of covered bases than this are reported as having zero coverage. [default: 10]
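The effect of --min-covered-fraction can be sketched as a simple cut-off applied after coverage is calculated. The function name and numbers are hypothetical, and the threshold is treated here as a percentage, in line with the default of 10.

```python
def apply_min_covered_fraction(coverage: float,
                               covered_bases: int,
                               genome_length: int,
                               min_covered_percent: float = 10.0) -> float:
    """Report zero coverage for genomes whose covered fraction is below the threshold."""
    covered_percent = 100.0 * covered_bases / genome_length
    return coverage if covered_percent >= min_covered_percent else 0.0


# Made-up genome: 1 Mbp long, mean coverage 3.2.
print(apply_min_covered_fraction(3.2, covered_bases=50_000, genome_length=1_000_000))
print(apply_min_covered_fraction(3.2, covered_bases=200_000, genome_length=1_000_000))
```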

--contig-end-exclusion INT

Exclude bases at the ends of reference sequences from calculation [default: 75]
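Excluding contig ends amounts to ignoring a fixed number of positions at each end of a reference sequence before summarising depth. A minimal sketch follows, assuming per-position depths are available as a list. The function name is hypothetical, and returning zero for sequences too short to have any interior is an assumption rather than documented behaviour.

```python
from typing import List


def mean_excluding_ends(depths: List[int], end_exclusion: int = 75) -> float:
    """Mean depth over positions that are not within end_exclusion of either end."""
    if len(depths) <= 2 * end_exclusion:
        return 0.0  # assumption: no interior positions left to average
    inner = depths[end_exclusion:len(depths) - end_exclusion]
    return sum(inner) / len(inner)


# Made-up contig of 200 positions: uncovered ends, depth 10 in the middle.
depths = [0] * 75 + [10] * 50 + [0] * 75
print(mean_excluding_ends(depths))
```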

--trim-min FRACTION

Remove this smallest fraction of positions when calculating trimmed_mean [default: 5]

--trim-max FRACTION

Maximum fraction for trimmed_mean calculations [default: 95]
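A trimmed mean of this kind can be sketched by sorting per-position depths and discarding the lowest and highest tails. The code below assumes --trim-min and --trim-max act as lower and upper quantile bounds (given here as fractions) and that cut points are rounded down; CoverM's exact rounding may differ, and the function name is hypothetical.

```python
from typing import List


def trimmed_mean(depths: List[int], trim_min: float = 0.05, trim_max: float = 0.95) -> float:
    """Mean of depths after dropping the shallowest and deepest positions."""
    ordered = sorted(depths)
    n = len(ordered)
    low = int(n * trim_min)   # number of shallowest positions to drop
    high = int(n * trim_max)  # index past the last position kept
    kept = ordered[low:high]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Made-up depths: one uncovered position and one extreme outlier among 20.
depths = [0] + [10] * 18 + [1000]
print(trimmed_mean(depths))
```

With the sample data the plain mean is pulled up by the single outlier, while the trimmed mean stays at the depth shared by the bulk of positions.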

OUTPUT

-o, --output-file FILE

Output coverage values to this file, or '-' for STDOUT. [default: output to STDOUT]

--output-format FORMAT

Shape of output: 'sparse' for long format, 'dense' for species-by-site. [default: dense]

--no-zeros

Omit printing of genomes that have zero coverage. [default: not set]

--bam-file-cache-directory DIRECTORY

Output BAM files generated during alignment to this directory. The directory may or may not exist. Note that BAM files in this directory contain all mappings, including those that later are excluded by alignment thresholding (e.g. --min-read-percent-identity) or genome-wise thresholding (e.g. --min-covered-fraction). [default: not used]

--discard-unmapped

Exclude unmapped reads from cached BAM files. [default: not set]

GENERAL OPTIONS

-t, --threads INT

Number of threads for mapping, sorting and reading. [default: 1]

-h, --help

Output a short usage message. [default: not set]

--full-help

Output a full help message and display in 'man'. [default: not set]

--full-help-roff

Output a full help message in raw ROFF format for conversion to other formats. [default: not set]

-v, --verbose

Print extra debugging information. [default: not set]

-q, --quiet

Unless there is an error, do not print log messages. [default: not set]

FREQUENTLY ASKED QUESTIONS (FAQ)

Can the temporary directory used be changed?

CoverM makes use of the system temporary directory (often /tmp) to store intermediate files. This can cause problems if the amount of storage available there is small or used by many programs. To fix, set the TMPDIR environment variable e.g. to set it to use the current directory:

$ TMPDIR=. coverm genome <etc>
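When CoverM is launched from a script rather than a shell, the same TMPDIR setting can be passed through the child process environment. A minimal sketch, in which the helper name env_with_tmpdir is hypothetical and the CoverM invocation itself is left as a comment:

```python
import os
from typing import Dict, Optional


def env_with_tmpdir(tmpdir: str, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with TMPDIR pointing at tmpdir."""
    env = dict(os.environ if base_env is None else base_env)
    env["TMPDIR"] = tmpdir
    return env


env = env_with_tmpdir(".")
print(env["TMPDIR"])
# The mapping could then be handed to a child process, for example via the
# env argument of subprocess.run when starting coverm.
```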

For thresholding arguments e.g. --dereplication-ani and --min-read-percent-identity, should a percentage (e.g. 97%) or fraction (e.g. 0.97) be specified?

Either is fine. CoverM determines which is being used by checking whether the value is less than or greater than 1.
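The percentage-or-fraction rule can be sketched as a small normalising function. The treatment of a value of exactly 1 (taken here as 1%) and the rejection of values outside 0 to 100 are assumptions of this sketch; the function name is hypothetical.

```python
def as_fraction(value: float) -> float:
    """Interpret a threshold given either as a fraction (<1) or a percentage (>=1)."""
    if 0.0 <= value < 1.0:
        return value  # already a fraction, e.g. 0.97
    if 1.0 <= value <= 100.0:
        return value / 100.0  # a percentage, e.g. 97 (assumption: exactly 1 means 1%)
    raise ValueError(f"threshold out of range: {value}")


print(as_fraction(0.97))
print(as_fraction(97))
```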

EXIT STATUS

0

Successful program execution.

1

Unsuccessful program execution.

101

The program panicked.

EXAMPLES

Map paired reads to 2 genomes, and output relative abundances to output.tsv

$ coverm genome --coupled read1.fastq.gz read2.fastq.gz --genome-fasta-files genome1.fna genome2.fna -o output.tsv

Calculate coverage of genomes defined as .fna files in genomes_directory/ from a sorted BAM file

$ coverm genome --bam-files my.bam --genome-fasta-directory genomes_directory/

Dereplicate genomes at 99% ANI before mapping unpaired reads

$ coverm genome --genome-fasta-directory genomes/ --dereplicate --dereplication-ani 99 --single single_reads.fq.gz

AUTHOR

Ben J. Woodcroft, Centre for Microbiome Research, School of Biomedical Sciences, Faculty of Health, Queensland University of Technology <benjwoodcroft near gmail.com>