Ultrafast search of all deposited bacterial and viral genomic data.
Bradley P, den Bakker HC, Rocha EPC, McVean G, Iqbal Z
Nat Biotechnol. Feb 2019. doi: 10.1038/s41587-018-0010-1
COMMENT: This article describes a new method called BIGSI for storing and searching large nucleotide sequence databases. The new method is based on the use of Sequence Bloom Trees that were previously used for the detection of specific transcripts in RNA-sequencing (RNA-Seq) data. BIGSI (BItsliced Genomic Signature Index) adapts Sequence Bloom Tree methods to the high diversity of k-tuples that exists in the microbial genomes sequences adding an alternative type of k-mer index that use a fixed-length binary ‘signature’ for representing each sequence. It reduces in several orders of magnitude the storage needs although increases false-positive rate in the detection of sequences in the indexed database. False positive rate can be reduced augmenting the size of the bit-vector and the number of hashing functions used to generate the binary encoding, always considering that the storage requirements and the access cost would be higher in that case.
The authors apply the method to index bacterial and viral whole-genome sequence datasets to search for antibiotic resistance genes.
Here, we use BIGSI to index the entire bacterial and viral WGS content in the ENA as of December 2016 and apply it to solve basic microbial surveillance problems. We also make a demonstration search publicly available at http://bigsi.io.
Schematically these are the steps of BIGSI method:
- Each input sequence is converted to a non-redundant list of overlapping k-mers present in the input sequence
- A fixed set of x hash functions is applied to each k-mer giving a tuple of x numbers (with values from 0 to n-1, being n the size of the bit vector) for each non-redundant sequence k-mer.
- Each sequence is stored as a fixed-length bit-vector, as a column in a matrix (a column for each sequence) and a row for each position of the bit-vector. Each bit-vector position is set to 1 if that number is included in any of the tuples of any of the k-mers of that sequence, and set to 0 if not.
- To query the BIGSI for a k-mer the hash functions are applied to the query k-mer, returning the rows to be checked. All sequences (columns) that have 1 in all of those rows contain the query k-mer.
We developed a data structure suitable for storing microbial genomic data, called BIGSI. We use the generic term ‘dataset’ to refer to either an assembled genome or unassembled sequence read files from clonal or non-clonal samples. BIGSI combines a k-mer index with constraints on sequence queries, described below. A Bloom filter is a data structure that stores data (here, k-mers) in a bit-vector (array of zeroes and ones) and answers set-membership queries (“is this k-mer contained in the set?”) probabilistically. The false-negative rate is zero, and the false-positive rate is controlled by two parameters (size of bit-vector and the number of (hash) functions used to generate the binary encoding), creating a trade-off between false-positive rate and compression.
BIGSI encodes data as a matrix in which each column is a Bloom filter of the k-mers in a dataset. Querying for presence of a k-mer involves applying the Bloom filter hash functions to this k-mer, which each return an integer, and taking the rows of the matrix (bitslices) indexed by these integers; datasets (columns) containing the k-mer have a 1 in all of those rows. Fast (O(1)) access to the rows is achieved using a hash mapping from row index to the corresponding bit-vector.
Incorporating a new dataset simply requires the addition of a new column, without needing to rebuild the index. However, because appending a new column to a BIGSI requires modifying every key in the index, this is an expensive operation. Our solution is to batch new inserts, building a new index per batch, and then merging indexes
Exponentially increasing amounts of unprocessed bacterial and viral genomic sequence data are stored in the global archives. The ability to query these data for sequence search terms would facilitate both basic research and applications such as realtime genomic epidemiology and surveillance. However, this is not possible with current methods. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web search to produce a searchable data structure named BItsliced Genomic Signature Index (BIGSI). We indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using four orders of magnitude less storage than previous methods. We applied our BIGSI search function to rapidly find resistance genes MCR-1, MCR-2, and MCR-3, determine the host-range of 2,827 plasmids, and quantify antibiotic resistance in archived datasets. Our index can grow incrementally as new (unprocessed or assembled) sequence datasets are deposited and can scale to millions of datasets.