Building the foundations: Apache Lucene Accelerated with the NVIDIA cuVS 25.06 Release

by Ishan Chattopadhyaya, Corey Nolet & Vivek Narang on 02 Jul 2025

GPU-accelerated Lucene

Apache Lucene is a fast search engine with capabilities ranging from traditional TF-IDF and BM25-based search to intelligent semantic search. When it comes to semantic search, one of the core pillars is the ability to perform nearest-neighbor searches in a high-dimensional vector space using a family of techniques that has come to be known as vector search. These high-dimensional vectors are typically created by running content such as images, video, and text through deep learning models, which convert it into representations known as vector embeddings.

Apache Lucene offers vector search through the popular Hierarchical Navigable Small World (HNSW) graph algorithm. Like other methods for approximate nearest-neighbor search, HNSW first builds a data structure over a set of vectors, called an index, to enable fast approximate lookup of the nearest neighbors. HNSW is the industry standard for vector search on CPUs. However, as data volumes grow ever larger, index construction becomes progressively slower. The impact is especially pronounced for high-dimensional embedding vectors: indexing them is far more expensive than building Lucene's textual indexes (simple inverted indexes) or its structures optimized for lower-dimensional data, such as the BKD trees used for geospatial data.
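For context, this is what HNSW-backed vector search looks like to a Lucene application. The minimal sketch below uses the standard Lucene 9.x+ API; the field name and example vectors are ours:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneHnswDemo {
  public static void main(String[] args) throws Exception {
    try (Directory dir = new ByteBuffersDirectory()) {
      // Index a handful of documents, each carrying an embedding vector.
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
        float[][] embeddings = { {0.9f, 0.1f}, {0.1f, 0.9f}, {0.7f, 0.3f} };
        for (float[] embedding : embeddings) {
          Document doc = new Document();
          doc.add(new KnnFloatVectorField("embedding", embedding,
              VectorSimilarityFunction.COSINE));
          writer.addDocument(doc);
        }
      }
      // Approximate nearest-neighbor search over the HNSW graph.
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(
            new KnnFloatVectorQuery("embedding", new float[] {1.0f, 0.0f}, 2), 2);
        for (ScoreDoc hit : hits.scoreDocs) {
          System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
      }
    }
  }
}
```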

With the increased availability of highly parallel computation devices like GPUs, there are now opportunities to build vector search graphs much faster and more efficiently. The NVIDIA cuVS library contains state-of-the-art algorithms for vector search and data clustering on the GPU that are much faster than HNSW on the CPU for both building and searching vector indexes.

The value of GPU acceleration

The volume of unstructured data has been growing exponentially since 2017, creating mounting challenges for databases that need to support vector search indexes at scale. What's more, since vector search indexes are approximate, they are more similar to predictive machine learning models than to traditional database indexes. An untuned traditional database index returns the same results, only slower; an untuned vector search index can return outright garbage.

GPUs are extremely efficient at both building and tuning vector search indexes at scale and can reduce build times from days to minutes. Further, indexes built on the GPU with cuVS can be searched directly on the GPU or converted to HNSW indexes for search on the CPU. For example, a RAG application that serves vector search from the CPU can still take advantage of much faster index builds on the GPU, significantly improving data readiness time, and then deploy the indexes to the CPU for search, since in typical RAG workflows vector search latency is small compared with LLM inference time. In some cases, building indexes on GPUs can also be substantially more cost-efficient than using CPUs.

GPUs are also extremely efficient at high-throughput search, particularly when request volumes are high and queries can be batched, as is often the case in offline machine learning workflows. For example, a recommender system might pre-populate recommendations by batching thousands of search requests as an offline task. In such cases the GPU easily outperforms the CPU, again saving time and often reducing costs.

An open-source collaboration between SearchScale and NVIDIA

NVIDIA developed CAGRA, a state-of-the-art graph-based search algorithm designed to leverage GPUs for massively parallel, high-dimensional vector search. Previously, CAGRA had APIs for C and C++, but not for Java. SearchScale initially explored the integration through Java's JNI interface, but later adopted the modern Project Panama-based Java-to-C interoperability available in JDK 21 and later. This work led us to create the Java bindings for cuVS.
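To give a flavor of the bindings, here is a minimal sketch of building and searching a CAGRA index from Java. It follows the builder style of the cuvs-java module, but the exact class and method names shown here (CuVSResources, CagraIndex, CagraIndexParams, CagraQuery, SearchResults) should be checked against the released API, as details may differ:

```java
// Names below follow the cuvs-java module's builder style; verify against
// the released cuVS Java API before relying on exact signatures.
import com.nvidia.cuvs.CagraIndex;
import com.nvidia.cuvs.CagraIndexParams;
import com.nvidia.cuvs.CagraQuery;
import com.nvidia.cuvs.CuVSResources;
import com.nvidia.cuvs.SearchResults;

public class CagraDemo {
  public static void main(String[] args) throws Throwable {
    float[][] dataset = { {0.9f, 0.1f}, {0.1f, 0.9f}, {0.7f, 0.3f} };
    float[][] queries = { {1.0f, 0.0f} };

    // CuVSResources manages the native GPU resources used by cuVS.
    try (CuVSResources resources = CuVSResources.create()) {
      // Build a CAGRA graph index on the GPU.
      CagraIndex index = CagraIndex.newBuilder(resources)
          .withDataset(dataset)
          .withIndexParams(new CagraIndexParams.Builder().build())
          .build();

      // Search on the GPU for the top-2 neighbors of each query vector.
      CagraQuery query = new CagraQuery.Builder()
          .withQueryVectors(queries)
          .withTopK(2)
          .build();
      SearchResults results = index.search(query);
      System.out.println(results.getResults());

      index.destroyIndex();
    }
  }
}
```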

We are collaborating with NVIDIA to drive the development of cuVS and the cuVS Java APIs for the benefit of Apache Lucene and Apache Solr. This collaboration involves regular technical deep dives, discussions, and brainstorming sessions between engineers at SearchScale and NVIDIA. These sessions aim to identify opportunities to enhance cuVS's existing feature set so that it better suits large-scale production deployments of Apache Lucene. We also maintain a working relationship with the Apache Solr community, which stands to benefit greatly from these capabilities.

cuVS follows a two-month release cycle using the CalVer scheme. The June release, 25.06, added several important improvements that benefit the Lucene connector:

  • CAGRA prefiltering: This feature allows the result set to be filtered before the main search query runs on a CAGRA index. It is a building block for hybrid search, where other queries can narrow the result set down to a fraction of the entire dataset.
  • Merge CAGRA indexes natively: Users of cuVS Java can now merge CAGRA indexes directly through the underlying merge API, without shuttling data back and forth between the Java and C layers. This is beneficial for Lucene, where segment merges happen frequently.
  • Efficient off-heap dataset copying: Datasets are now copied into memory outside the Java heap, which improves memory management and minimizes pressure on the Java garbage collector, leading to better performance and stability during Lucene indexing (a minimal sketch of the mechanism follows this list).
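The sketch below illustrates the off-heap copying mechanism itself, using only the Project Panama FFM API that is final in JDK 22. It shows the idea, not the actual cuvs-java internals:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class OffHeapCopyDemo {
  public static void main(String[] args) {
    float[][] dataset = { {0.9f, 0.1f}, {0.1f, 0.9f}, {0.7f, 0.3f} };
    int rows = dataset.length;
    int cols = dataset[0].length;

    // A confined arena ties the lifetime of the native memory to this scope.
    try (Arena arena = Arena.ofConfined()) {
      // One contiguous row-major buffer outside the Java heap; the GC never
      // moves or scans it, so a native library can safely hold its address.
      MemorySegment buffer = arena.allocate((long) rows * cols * Float.BYTES,
          ValueLayout.JAVA_FLOAT.byteAlignment());
      for (int row = 0; row < rows; row++) {
        MemorySegment.copy(MemorySegment.ofArray(dataset[row]), 0,
            buffer, (long) row * cols * Float.BYTES, (long) cols * Float.BYTES);
      }
      // Read one element back to show the copy worked (row 1, column 0).
      System.out.println(buffer.getAtIndex(ValueLayout.JAVA_FLOAT, 2)); // 0.1
    }
  }
}
```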

The Road Ahead

What we have built so far is just the beginning. Going forward, we will tackle deeper issues that affect very large-scale production systems.

  • CAGRA serialization in Lucene’s HNSW format: Currently, a cuVS CAGRA index can be serialized into hnswlib’s HNSW format and then searched on machines that don’t have a GPU. However, Lucene has its own format for HNSW graphs, so we are working on converting the CAGRA graph into a Lucene HNSW graph. This will make indexing seamless, whether it is done on a CPU or a GPU.
  • Tiered indexing: This is an indexing abstraction that contains either a brute-force index or a CAGRA index (or both, depending on the number of documents in a segment and how many merges the segment undergoes). It is a way to combine the best of a brute-force index (fast to construct, expensive to search) and a CAGRA index (slower to construct, faster to search); a rough sketch of the idea follows this list.
  • Quantization support: Currently, vectors are supported as 32-bit floating-point values. Support for 16-bit and 8-bit vectors would fit more data points into the same memory, resulting in more compact indexes; for example, 10 million 768-dimensional float32 vectors take roughly 30 GB, which fp16 would halve and int8 would quarter.
  • Multi-GPU support: When the dataset doesn’t fit on a single GPU but the machine has multiple GPUs, the indexing and querying workloads can be shared across a pool of GPUs. This would also benefit multi-tenant search deployments that need GPU-level data and processing isolation.
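As a rough illustration of the tiered-indexing idea, the sketch below routes small segments to an exact brute-force index and leaves a placeholder where a CAGRA-backed index would go. Every class name and the threshold here are invented for illustration and are not part of Lucene or cuVS:

```java
// Hypothetical sketch of tiered indexing; VectorIndex, BruteForceIndex and
// the threshold are invented placeholders, not real Lucene or cuVS classes.
import java.util.Arrays;
import java.util.Comparator;

interface VectorIndex {
  int[] search(float[] query, int topK);
}

// Exact brute-force search: trivial to build, but O(n) work per query.
final class BruteForceIndex implements VectorIndex {
  private final float[][] vectors;

  BruteForceIndex(float[][] vectors) { this.vectors = vectors; }

  @Override
  public int[] search(float[] query, int topK) {
    Integer[] ids = new Integer[vectors.length];
    for (int i = 0; i < ids.length; i++) ids[i] = i;
    Arrays.sort(ids, Comparator.comparingDouble(i -> distance(vectors[i], query)));
    return Arrays.stream(ids).limit(topK).mapToInt(Integer::intValue).toArray();
  }

  private static double distance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; sum += d * d; }
    return sum;
  }
}

public class TieredIndexSketch {
  private static final int CAGRA_THRESHOLD = 10_000; // assumed cut-over point

  static VectorIndex open(float[][] segmentVectors) {
    if (segmentVectors.length < CAGRA_THRESHOLD) {
      // Small segment: brute force builds instantly and searches fast enough.
      return new BruteForceIndex(segmentVectors);
    }
    // Large segment: in a real implementation this branch would build a CAGRA
    // index on the GPU (slower to construct, much faster to search).
    return new BruteForceIndex(segmentVectors); // placeholder for a CAGRA-backed index
  }

  public static void main(String[] args) {
    float[][] vectors = { {0f, 0f}, {1f, 1f}, {2f, 2f} };
    VectorIndex index = open(vectors);
    System.out.println(Arrays.toString(index.search(new float[] {0.9f, 1.1f}, 2)));
  }
}
```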
