NAME
coverm genome - Calculate read coverage per-genome (version 0.7.0)
SYNOPSIS
coverm genome <GENOME_DESCRIPTION> <MAPPING_INPUT> ..
DESCRIPTION
coverm genome calculates the coverage of a set of reads on a set of genomes.
This process can be undertaken in several ways, for instance by specifying BAM files or raw reads as input, defining genomes in different input formats, dereplicating genomes before mapping, using different mapping programs, thresholding read alignments, using different methods of calculating coverage and printing the calculated coverage in various formats.
The source code for CoverM is available at https://github.com/wwood/CoverM
READ MAPPING PARAMETERS
- -1 PATH ..
Forward FASTA/Q file(s) for mapping. These may be gzipped or not.
- -2 PATH ..
Reverse FASTA/Q file(s) for mapping. These may be gzipped or not.
- -c, --coupled PATH ..
One or more pairs of forward and reverse FASTA/Q files (optionally gzipped) for mapping, given in the order <sample1_R1.fq.gz> <sample1_R2.fq.gz> <sample2_R1.fq.gz> <sample2_R2.fq.gz> ..
- --interleaved PATH ..
Interleaved FASTA/Q file(s) for mapping. These may be gzipped or not.
- --single PATH ..
Unpaired FASTA/Q file(s) for mapping. These may be gzipped or not.
- -b, --bam-files PATH
Path to BAM file(s). These must be reference sorted (e.g. with samtools sort) unless --sharded is specified, in which case they must be read name sorted (e.g. with samtools sort -n). When specified, no read mapping algorithm is undertaken.
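The file order expected by -c/--coupled can be assembled with a short shell loop. The following is a minimal sketch, not part of CoverM: the function name coupled_args and the naming scheme <sample>_R1.fq.gz / <sample>_R2.fq.gz are assumptions made for illustration, and paths are assumed to contain no whitespace.

```shell
#!/usr/bin/env bash
# Print forward/reverse read files in the order --coupled expects:
#   sample1_R1 sample1_R2 sample2_R1 sample2_R2 ..
# Assumes the hypothetical naming scheme <sample>_R1.fq.gz / <sample>_R2.fq.gz.
coupled_args() {
    local dir="$1" r1 r2
    for r1 in "$dir"/*_R1.fq.gz; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        r2="${r1%_R1.fq.gz}_R2.fq.gz"     # derive the mate's file name
        if [ -e "$r2" ]; then
            printf '%s\n%s\n' "$r1" "$r2"
        else
            echo "warning: no mate found for $r1, skipping" >&2
        fi
    done
}

# Possible usage, if coverm is installed and reads/ holds the paired files:
#   coverm genome --coupled $(coupled_args reads) --genome-fasta-directory genomes/
```

Forward files without a matching reverse file are skipped with a warning rather than passed on, since an odd number of files would break the pairing.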
GENOME DEFINITION
- -f, --genome-fasta-files PATH ..
Path(s) to FASTA files of each genome e.g. pathA/genome1.fna pathB/genome2.fa.
- -d, --genome-fasta-directory PATH
Directory containing FASTA files of each genome.
- -x, --genome-fasta-extension EXT
File extension of genomes in the directory specified with -d/--genome-fasta-directory. [default: fna]
- --genome-fasta-list PATH
File containing FASTA file paths, one per line.
- -r, --reference PATH
FASTA file of contigs e.g. concatenated genomes or metagenome assembly, or minimap2 index (with --minimap2-reference-is-index), strobealign index (with --strobealign-use-index), or BWA index stem (with -p bwa-mem/bwa-mem2). If multiple reference FASTA files are provided and --sharded is specified, then reads will be mapped to references separately as sharded BAMs. NOTE: If genomic FASTA files are specified elsewhere (e.g. with --genome-fasta-files or --genome-fasta-directory), then --reference is not needed as a reference FASTA file can be derived by concatenating input genomes. In these situations, --reference can be optionally specified if an alternate reference sequence set is desired.
- -s, --separator CHARACTER
This character separates genome names from contig names in the reference file. Requires --reference. [default: unspecified]
- --single-genome
All contigs are from the same genome. Requires --reference. [default: not set]
- --genome-definition FILE
File containing list of genome_name<tab>contig lines to define the genome of each contig. Requires --reference. [default: not set]
- --use-full-contig-names
Specify that the input BAM files have been generated with mapping software that includes the full name of each contig in the reference definition (i.e. characters after the space), so when reading in genomes, record contig names as such.
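A file in the genome_name<tab>contig layout used by --genome-definition can be generated from per-genome FASTA files. This is a minimal sketch under stated assumptions, not CoverM's own tooling: the function name genome_definition is hypothetical, the genome name is taken to be the file's base name without its final extension, the contig name is taken to be the header text up to the first whitespace, and the input files are assumed to be uncompressed.

```shell
#!/usr/bin/env bash
# Emit genome_name<TAB>contig lines for each FASTA file given.
# Assumptions (not from the CoverM documentation):
#   - genome name = file base name without its final extension
#   - contig name = header text up to the first whitespace
#   - input files are plain (not gzipped) FASTA
genome_definition() {
    local fasta genome
    for fasta in "$@"; do
        genome=$(basename "$fasta")
        genome="${genome%.*}"
        # On header lines, strip the leading '>' from the first field
        awk -v g="$genome" '/^>/ { sub(/^>/, "", $1); print g "\t" $1 }' "$fasta"
    done
}

# Possible usage:
#   genome_definition genomes/*.fna > genome_definition.tsv
```

The resulting file could then be passed together with --reference, as the option requires.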
DEREPLICATION / GENOME CLUSTERING
- --dereplicate
Dereplicate genomes by average nucleotide identity (ANI), choosing one genome to represent all those within a small distance of it, using Dashing for preclustering and FastANI for final ANI calculation. When this flag is used, dereplication occurs transparently through the Galah method (https://github.com/wwood/galah). [default: not set]
- --checkm2-quality-report PATH
CheckM version 2 quality_report.tsv (i.e. the quality_report.tsv in the output directory of checkm2 predict ..) for defining genome quality, which is used both for filtering and to rank genomes during clustering.
- --checkm-tab-table PATH
CheckM tab table (i.e. the output of checkm .. --tab_table -f PATH ..). The information contained is used like --checkm2-quality-report.
- --genome-info PATH
dRep style genome info table for defining quality. The information contained is used like --checkm2-quality-report.
- --min-completeness FLOAT
Ignore genomes with less completeness than this percentage. [default: not set]
- --max-contamination FLOAT
Ignore genomes with more contamination than this percentage. [default: not set]
- --dereplication-ani FLOAT
Overall ANI level to dereplicate at with FastANI. [default: 95]
- --dereplication-aligned-fraction FLOAT
Min aligned fraction of two genomes for clustering. [default: 15]
- --dereplication-fragment-length FLOAT
Length of fragment used in FastANI calculation (i.e. --fragLen). [default: 3000]
- --dereplication-quality-formula FORMULA
formula | description |
---|---|
Parks2020_reduced | (default) A quality formula described in Parks et al. 2020 https://doi.org/10.1038/s41587-020-0501-8 (Supplementary Table 19) but only including those scoring criteria that can be calculated from the sequence without homology searching: completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000 |
completeness-4contamination | completeness-4*contamination |
completeness-5contamination | completeness-5*contamination |
dRep | completeness-5*contamination+contamination*(strain_heterogeneity/100)+0.5*log10(N50) |
- --dereplication-prethreshold-ani FLOAT
Require at least this dashing-derived ANI for preclustering and to avoid FastANI on distant lineages within preclusters. [default: 90]
- --dereplication-precluster-method NAME
Method of calculating rough ANI for dereplication. 'dashing' for HyperLogLog, 'finch' for finch MinHash, 'skani' for Skani. [default: skani]
- --dereplication-cluster-method NAME
Method of calculating ANI. 'fastani' for FastANI, 'skani' for Skani. [default: skani]
- --dereplication-output-cluster-definition PATH
Output a file of representative<TAB>member lines.
- --dereplication-output-representative-fasta-directory PATH
Symlink representative genomes into this directory.
- --dereplication-output-representative-fasta-directory-copy PATH
Copy representative genomes into this directory.
- --dereplication-output-representative-list PATH
Print newline separated list of paths to representatives into this file.
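The Parks2020_reduced formula listed under --dereplication-quality-formula is plain arithmetic on four per-genome values, so its effect on ranking can be checked by hand. The sketch below only evaluates the formula as written in this page; the function name parks2020_reduced and the example input values are made up for illustration, and it is not a description of how Galah computes or applies the score internally.

```shell
#!/usr/bin/env bash
# Evaluate the Parks2020_reduced quality formula as written in this page:
#   completeness - 5*contamination - 5*num_contigs/100 - 5*num_ambiguous_bases/100000
# Arguments: completeness(%) contamination(%) num_contigs num_ambiguous_bases
parks2020_reduced() {
    awk -v comp="$1" -v cont="$2" -v nc="$3" -v na="$4" 'BEGIN {
        printf "%.2f\n", comp - 5*cont - 5*nc/100 - 5*na/100000
    }'
}

# Hypothetical genome: 98.5% complete, 1.2% contaminated,
# 40 contigs, 2000 ambiguous bases.
parks2020_reduced 98.5 1.2 40 2000   # prints 90.40
```

Under this formula each percentage point of contamination costs as much as five points of completeness, which is why a slightly less complete but cleaner genome can score higher.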
MAPPING ALGORITHM OPTIONS
- -p, --mapper NAME
name | description |
---|---|
minimap2-sr | minimap2 with '-x sr' option |
bwa-mem | bwa mem using default parameters |
bwa-mem2 | bwa-mem2 using default parameters |
minimap2-ont | minimap2 with '-x map-ont' option |
minimap2-pb | minimap2 with '-x map-pb' option |
minimap2-hifi | minimap2 with '-x map-hifi' option |
minimap2-no-preset | minimap2 with no '-x' option |
- --minimap2-params PARAMS
Extra parameters to provide to minimap2, both indexing command (if used) and for mapping. Note that usage of this parameter has security implications if untrusted input is specified. '-a' is always specified to minimap2. [default: none]
- --minimap2-reference-is-index
Treat reference as a minimap2 database, not as a FASTA file. [default: not set]
- --bwa-params PARAMS
Extra parameters to provide to BWA or BWA-MEM2. Note that usage of this parameter has security implications if untrusted input is specified. [default: none]
- --strobealign-params PARAMS
Extra parameters to provide to strobealign. Note that usage of this parameter has security implications if untrusted input is specified. [default: none]
- --strobealign-use-index
Use a pregenerated index (one that has been created with 'strobealign --create-index'). The --reference option should be specified as the original FASTA file i.e. 'ref.fna' not 'ref.fna.r100.sti' [default: not set]
ALIGNMENT THRESHOLDING
- --min-read-aligned-length INT
Exclude reads with smaller numbers of aligned bases. [default: 0]
- --min-read-percent-identity FLOAT
Exclude reads by overall percent identity e.g. 95 for 95%. [default: 0]
- --min-read-aligned-percent FLOAT
Exclude reads by percent aligned bases e.g. 95 means 95% of the read's bases must be aligned. [default: 0]
- --min-read-aligned-length-pair INT
Exclude pairs with smaller numbers of aligned bases. Implies --proper-pairs-only. [default: 0]
- --min-read-percent-identity-pair FLOAT
Exclude pairs by overall percent identity e.g. 95 for 95%. Implies --proper-pairs-only. [default: 0]
- --min-read-aligned-percent-pair FLOAT
Exclude pairs by percent aligned bases e.g. 95 means 95% of the pair's bases must be aligned. Implies --proper-pairs-only. [default: 0]
- --proper-pairs-only
Require reads to be mapped as proper pairs. [default: not set]
- --exclude-supplementary
Exclude supplementary alignments. [default: not set]
- --include-secondary
Include secondary alignments. [default: not set]
COVERAGE CALCULATION OPTIONS
- -m, --methods METHOD
method | description |
---|---|
relative_abundance | (default) Percentage relative abundance of each genome, and the unmapped read percentage |
mean | Average number of aligned reads overlapping each position on the genome |
trimmed_mean | Average number of aligned reads overlapping each position after removing the most deeply and shallowly covered positions. See --trim-min/--trim-max to adjust. |
coverage_histogram | Histogram of coverage depths |
covered_bases | Number of bases covered by 1 or more reads |
variance | Variance of coverage depths |
length | Length of each genome in base pairs |
count | Number of reads aligned to each genome. Note that supplementary alignments are not counted. |
reads_per_base | Number of reads aligned divided by the length of the genome |
rpkm | Reads mapped per kilobase of genome, per million mapped reads |
tpm | Transcripts Per Million as described in Li et al 2010 https://doi.org/10.1093/bioinformatics/btp692 |
- --min-covered-fraction FRACTION
Genomes with fewer covered bases than this are reported as having zero coverage. [default: 10]
- --contig-end-exclusion INT
Exclude bases at the ends of reference sequences from calculation. [default: 75]
- --trim-min FRACTION
Remove this smallest fraction of positions when calculating trimmed_mean. [default: 5]
- --trim-max FRACTION
Maximum fraction for trimmed_mean calculations. [default: 95]
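The trimmed_mean method and its --trim-min/--trim-max settings can be illustrated with a small calculation over per-position depths. This is a sketch of the general idea only, assuming one depth value per line on standard input. The function name trimmed_mean is hypothetical, and the way boundary positions are rounded here is an assumption that may differ from CoverM's implementation.

```shell
#!/usr/bin/env bash
# Trimmed mean of per-position depths read from stdin (one value per line).
# Arguments: trim_min trim_max, as percentages (defaults 5 and 95,
# matching the defaults listed in this page).
# Assumption: boundary indices are rounded down; CoverM may round differently.
trimmed_mean() {
    local trim_min="${1:-5}" trim_max="${2:-95}"
    sort -n | awk -v lo="$trim_min" -v hi="$trim_max" '
        { depth[NR] = $1 }
        END {
            first = int(NR * lo / 100) + 1   # skip the shallowest lo% of positions
            last  = int(NR * hi / 100)       # drop positions above the hi-th percentile
            if (last < first) { print "NA"; exit }
            for (i = first; i <= last; i++) sum += depth[i]
            printf "%.3f\n", sum / (last - first + 1)
        }'
}

# 20 positions: depths 1..18 plus one uncovered position (0)
# and one very deeply covered position (1000).
printf '%s\n' 1000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 0 | trimmed_mean
# prints 9.500
```

With the two extreme positions removed the result is 9.5, whereas the untrimmed mean of the same 20 values is 58.55, which shows why trimming makes the estimate robust to a few anomalous positions.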
OUTPUT
- -o, --output-file FILE
Output coverage values to this file, or '-' for STDOUT. [default: output to STDOUT]
- --output-format FORMAT
Shape of output: 'sparse' for long format, 'dense' for species-by-site. [default: dense]
- --no-zeros
Omit printing of genomes that have zero coverage. [default: not set]
- --bam-file-cache-directory DIRECTORY
Output BAM files generated during alignment to this directory. The directory may or may not exist. Note that BAM files in this directory contain all mappings, including those that later are excluded by alignment thresholding (e.g. --min-read-percent-identity) or genome-wise thresholding (e.g. --min-covered-fraction). [default: not used]
- --discard-unmapped
Exclude unmapped reads from cached BAM files. [default: not set]
GENERAL OPTIONS
- -t, --threads INT
Number of threads for mapping, sorting and reading. [default: 1]
- -h, --help
Output a short usage message. [default: not set]
- --full-help
Output a full help message and display in 'man'. [default: not set]
- --full-help-roff
Output a full help message in raw ROFF format for conversion to other formats. [default: not set]
- -v, --verbose
Print extra debugging information. [default: not set]
- -q, --quiet
Unless there is an error, do not print log messages. [default: not set]
FREQUENTLY ASKED QUESTIONS (FAQ)
Can the temporary directory used be changed? CoverM makes use of the system temporary directory (often /tmp) to store intermediate files. This can cause problems if the amount of storage available there is small or used by many programs. To fix, set the TMPDIR environment variable e.g. to set it to use the current directory: TMPDIR=. coverm genome <etc>
For thresholding arguments e.g. --dereplication-ani and --min-read-percent-identity, should a percentage (e.g. 97%) or fraction (e.g. 0.97) be specified? Either is fine. CoverM determines which is being used by checking whether the value is less than or greater than 1.
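The percentage-or-fraction rule described above can be pictured as a small normalisation step. The following is an illustrative sketch rather than CoverM's code: the function name as_fraction is hypothetical, and treating a value of exactly 1 as a fraction is an assumption, since this page does not say how that case is handled.

```shell
#!/usr/bin/env bash
# Normalise a threshold to a fraction between 0 and 1.
# Assumption: values greater than 1 are percentages; values of 1 or
# less are already fractions (the exact handling of 1 is not documented here).
as_fraction() {
    awk -v v="$1" 'BEGIN {
        if (v > 1) v = v / 100
        printf "%g\n", v
    }'
}

as_fraction 97     # prints 0.97
as_fraction 0.97   # prints 0.97
```

One consequence of such a rule is that very small percentages cannot be expressed as percentages: a value like 0.5 would be read as 50%, not 0.5%.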
EXIT STATUS
- 0
Successful program execution.
- 1
Unsuccessful program execution.
- 101
The program panicked.
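In a wrapper script, the exit statuses listed above can be turned into readable messages. This is a minimal sketch; the function name describe_exit and the message wording are illustrative and not part of CoverM.

```shell
#!/usr/bin/env bash
# Map an exit status from the EXIT STATUS list to a short message.
describe_exit() {
    case "$1" in
        0)   echo "successful program execution" ;;
        1)   echo "unsuccessful program execution" ;;
        101) echo "the program panicked" ;;
        *)   echo "exit status $1 is not listed in the documentation" ;;
    esac
}

# Possible usage, if coverm is installed:
#   coverm genome --bam-files my.bam --genome-fasta-directory genomes_directory/
#   describe_exit "$?"
```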
EXAMPLES
- Map paired reads to 2 genomes, and output relative abundances to output.tsv
$ coverm genome --coupled read1.fastq.gz read2.fastq.gz --genome-fasta-files genome1.fna genome2.fna -o output.tsv
- Calculate coverage of genomes defined as .fna files in genomes_directory/ from a sorted BAM file
$ coverm genome --bam-files my.bam --genome-fasta-directory genomes_directory/
- Dereplicate genomes at 99% ANI before mapping unpaired reads
$ coverm genome --genome-fasta-directory genomes/ --dereplicate --dereplication-ani 99 --single single_reads.fq.gz