Not All HNSW Indices Are Made Equally | by Noam Schwartz | Jul, 2024


Photo by Robin Jonathan Deutsch on Unsplash

Rebuilding an HNSW index is one of the most resource-intensive aspects of using HNSW in production workloads. Unlike traditional databases, where data deletions can be handled by simply deleting a row in a table, using HNSW in a vector database often requires a complete rebuild to maintain optimal performance and accuracy.

Why is Rebuilding Necessary?

Because of its layered graph structure, HNSW is not inherently designed for dynamic datasets that change frequently. Adding new data or deleting existing data is essential for maintaining updated data, especially for use cases like RAG, which aims to improve search relevence.

Most databases work on a concept called “hard” and “soft” deletes. Hard deletes permanently remove data, while soft deletes flag data as ‘to-be-deleted’ and remove it later. The issue with soft deletes is that the to-be-deleted data still uses significant memory until it is permanently removed. This is particularly problematic in vector databases that use HNSW, where memory consumption is already a significant issue.

HNSW creates a graph where nodes (vectors) are connected based on their proximity in the vector space, and traversing on an HNSW graph is done like a skip-list. In order to support that, the layers of the graph are designed so that some layers have very few nodes. When vectors are deleted, especially those on layers that have very few nodes that serve as critical connectors in the graph, the whole HNSW structure can become fragmented. This fragmentation may lead to nodes (or layers) that are disconnected from the main graph, which require rebuilding of the entire graph, or at the very least will result in a degradation in the efficiency of searches.

HNSW then uses a soft-delete technique, which marks vectors for deletion but does not immediately remove them. This approach lowers the expense of frequent complete rebuilds, although periodic reconstruction is still needed to maintain the graph’s optimal state.



Source link

Be the first to comment

Leave a Reply

Your email address will not be published.


*