FAQ

Can you target the 16S rRNA gene instead of the default set of single copy marker genes with SingleM?

Yes. By default, SingleM builds OTU tables from protein-coding genes rather than 16S because this generally gives more strain-level resolution due to redundancy in the genetic code. If you are really keen on using 16S, you can use SingleM with a 16S SingleM package (spkg). There is a repository of auxiliary packages at https://github.com/wwood/singlem_extra_packages, which includes a 16S package suitable for this purpose. The taxonomic resolution won't be as high, and there are issues around copy number variation, but 16S could be useful for various reasons, e.g. linking to an amplicon study or using the GreenGenes taxonomy. For now, no 16S spkg is installed by default, so you have to use the --singlem-packages flag in pipe mode, pointing it at a separately downloaded package. Searching for 16S reads is also much slower than searching for protein-encoding reads.

How should SingleM be run on multiple samples?

There are two ways. Multiple input files can be given directly to the singlem pipe subcommand, separated by spaces. Alternatively, singlem pipe can be run on each sample and the resulting OTU tables combined using singlem summarise. The results should be identical, though there are some performance trade-offs. For large numbers of metagenomes (>100) it is probably preferable to run each sample individually or in smaller groups.

Note that the performance of a single pipe run on many genomes improved drastically in version 0.17.0, and it is now sensible to run up to 10,000 genomes at a time.
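As a rough illustration of the two approaches, the Python sketch below only builds the command lines and prints them; it does not run SingleM. The file names are hypothetical, and the flag names (--forward, --otu-table, --input-otu-tables, --output-otu-table) are assumptions that should be checked against singlem pipe --help and singlem summarise --help for your installed version.

```python
import shlex


def pipe_command(forward_reads, otu_table):
    """Build a 'singlem pipe' command for one or more read files.

    Flag names are assumptions; verify them with 'singlem pipe --help'.
    """
    return ["singlem", "pipe", "--forward", *forward_reads, "--otu-table", otu_table]


def summarise_command(otu_tables, combined_table):
    """Build a 'singlem summarise' command that merges per-sample OTU tables.

    Flag names are assumptions; verify them with 'singlem summarise --help'.
    """
    return [
        "singlem", "summarise",
        "--input-otu-tables", *otu_tables,
        "--output-otu-table", combined_table,
    ]


# Hypothetical sample files.
samples = ["sample1.fq.gz", "sample2.fq.gz"]

# Approach 1: all samples in a single pipe invocation.
print(shlex.join(pipe_command(samples, "all_samples.otu_table.tsv")))

# Approach 2: one pipe invocation per sample, then combine the OTU tables.
per_sample_tables = []
for reads in samples:
    table = reads.replace(".fq.gz", ".otu_table.tsv")
    per_sample_tables.append(table)
    print(shlex.join(pipe_command([reads], table)))
print(shlex.join(summarise_command(per_sample_tables, "combined.otu_table.tsv")))
```

Printing the commands rather than executing them keeps the sketch independent of any particular SingleM installation; the printed lines can be pasted into a shell or a job scheduler once the flags have been confirmed.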

What is the difference between the num_hits and coverage columns in the OTU table and taxonomic profiles generated by the pipe mode?

num_hits is the number of reads from the sample found in that OTU. The coverage is the expected coverage of a genome with that OTU sequence, i.e. the average number of bases covering each position in a genome after read mapping. It is calculated from num_hits. In particular, num_hits corresponds to the 'kmer coverage' used by genome assembly programs, and so coverage is calculated according to the following formula, adapted from the one given in the Velvet assembler's manual:

coverage = num_hits * L / (L - k + 1)

Where L is the length of a read and k is the length of the OTU sequence including inserts and gaps (usually 60 bp).
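As a worked example of the formula above, the following Python sketch computes coverage from num_hits. The function name and the example numbers (10 hits, 150 bp reads) are illustrative assumptions, not part of SingleM; the default k of 60 follows the "usually 60 bp" noted above.

```python
def coverage_from_hits(num_hits, read_length, otu_length=60):
    """Apply coverage = num_hits * L / (L - k + 1).

    read_length is L and otu_length is k (60 bp by default).
    """
    window_positions = read_length - otu_length + 1
    if window_positions <= 0:
        raise ValueError("read_length must be at least otu_length")
    return num_hits * read_length / window_positions


# 10 reads of length 150 bp hitting a 60 bp OTU window:
# 10 * 150 / (150 - 60 + 1) = 1500 / 91
print(round(coverage_from_hits(10, 150), 2))  # prints 16.48
```

Because L / (L - k + 1) is greater than 1 whenever k > 1, the coverage is always larger than num_hits, and the gap grows as reads get shorter relative to the OTU window.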
