Streaming histogram sketching for rapid microbiome analytics.
Rowe WP, Carrieri AP, Alcon-Giner C, Caim S, Shaw A, Sim K, Kroll JS, Hall LJ, Pyzer-Knapp EO, Winn MD
Microbiome. 03 2019. doi: 10.1186/s40168-019-0653-2
COMMENT: The volume of microbiome sequence data is growing rapidly, in part as a result of large-scale sequencing initiatives such as the Human Microbiome Project, the Earth Microbiome Project and the Global Ocean Survey. Analysing these data quickly and effectively is the main bottleneck in workflows, and additional de novo microbiome analysis methods are needed. This paper presents a novel method, together with several practical examples, for rapid microbiome analytics using streaming histogram sketching.
Here we present a data sketching method for clustering, indexing and classifying microbiome sequencing data. We also describe and demonstrate our software, Histosketching Using Little K-mers (HULK), which is a user-friendly and efficient implementation of the method.
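The idea of summarising a read stream in a small, fixed amount of memory can be illustrated with a minimal sketch. This is not HULK's implementation: the function names, the CRC32 hash, the k-mer length and the number of bins are all assumptions chosen for illustration. Each k-mer is hashed into one of a fixed number of bins, so the histogram never grows regardless of how many reads are streamed through it.

```python
import zlib
from typing import Iterable, Iterator, List


def kmers(read: str, k: int) -> Iterator[str]:
    """Yield each k-mer in a read, skipping any that contain non-ACGT bases."""
    read = read.upper()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def stream_histogram(reads: Iterable[str], k: int = 7, bins: int = 64) -> List[int]:
    """Build a fixed-size k-mer count histogram from a stream of reads.

    Hypothetical helper: k-mers are assigned to bins with CRC32 (an
    illustrative choice), so memory use depends only on `bins`.
    """
    hist = [0] * bins
    for read in reads:  # reads are consumed one at a time, never stored
        for kmer in kmers(read, k):
            hist[zlib.crc32(kmer.encode("ascii")) % bins] += 1
    return hist


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTNACGT"]
    print(stream_histogram(reads, k=4, bins=16))
```

Because the reads are consumed through an iterator, the same function can be fed from a file handle or a network stream without loading the whole dataset.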
We show that our method accurately clusters microbiome samples by sample type, and we demonstrate the utility of histosketches for creating and searching microbiome sequence databases.
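Searching a database of sketches amounts to ranking stored sketches by their similarity to a query. The following is a minimal sketch of that idea, assuming each sample is represented by a fixed-length vector of non-negative counts and using weighted Jaccard similarity as the comparison. The function names and the example database are hypothetical, and the metric is an illustrative choice rather than a statement of what HULK uses.

```python
from typing import Dict, List, Sequence, Tuple


def weighted_jaccard(a: Sequence[float], b: Sequence[float]) -> float:
    """Weighted Jaccard similarity: sum of element-wise minima over sum of maxima."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same length")
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    # Two all-zero sketches are treated as identical.
    return numerator / denominator if denominator else 1.0


def search(query: Sequence[float],
           database: Dict[str, Sequence[float]],
           top_n: int = 1) -> List[Tuple[float, str]]:
    """Return the top_n (similarity, sample name) pairs, most similar first."""
    scored = [(weighted_jaccard(query, sketch), name)
              for name, sketch in database.items()]
    scored.sort(reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Hypothetical database of three sample sketches.
    db = {"gut": [4, 0, 2, 1], "soil": [0, 5, 1, 0], "ocean": [1, 1, 1, 1]}
    print(search([3, 0, 2, 1], db, top_n=2))
```

Only the sketches need to be exchanged for such a query, so the underlying reads can stay on the user's machine.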
The results presented here evaluate our implementation of histosketching for rapid microbiome comparisons, in terms of both the accuracy of the tool and its potential applications. All analyses can be re-run using the analysis workbooks (https://github.com/will-rowe/hulk/tree/master/paper/analysis-notebooks).
In terms of the advantages of HULK over other de novo analysis methods (e.g. k-mer spectra dissimilarity analysis), we have shown that computing histogram sketches from complete metagenomic datasets is 16 times faster than computing the full k-mer spectra and 17 times faster than computing the MinHash sketch.
We have shown that microbiome samples can be histosketched on a laptop with a few cores and a small, fixed amount of memory. To take full advantage of this performance, histosketching needs to move beyond command line interfaces. To this end, we have begun working on a WebAssembly (WASM) port of HULK to enable client-side sketching (WASM is available from Go version 1.11), so that users can histosketch their own microbiome data and compare just the sketches against online databases. This keeps their microbiome data private while enabling quick and easy microbiome analytics.
We show that histosketches are suitable features for training ML classifiers and can accurately classify microbiome samples according to antibiotic treatment history in at-risk preterm infant populations.
Histosketching generates compact representations of microbiomes from data streams, facilitating sample indexing, similarity-search queries, clustering, and the application of machine learning methods to analyse microbiome samples in the context of the global microbiome corpus.
Note: ML, machine learning