Scaling Vector Search with MongoDB Atlas Quantization & Voyage AI Embeddings
Key Takeaways
Vector quantization fundamentals:
A technique that compresses high-dimensional embeddings from 32-bit floats to lower precision formats (scalar/int8 or binary/1-bit), enabling significant performance gains while maintaining semantic search capabilities
Performance vs. precision trade-offs:
Binary quantization provides maximum speed (80% faster queries) with minimal resources; scalar quantization offers balanced performance and accuracy; float32 maintains highest fidelity at significant resource cost
Resource optimization:
Vector quantization can reduce RAM usage by up to 24x (binary) or 3.75x (scalar); storage footprint decreases by 38% using BSON binary format
Scaling benefits:
Performance advantages multiply at scale; most significant for vector databases exceeding 1M embeddings
Semantic preservation:
Quantization-aware models like Voyage AI's retain high representation capacity even after compression
Search quality control:
Binary quantization may require rescoring for maximum accuracy; scalar quantization typically maintains 90%+ retention of float32 results
Implementation ease:
MongoDB Atlas's automatic quantization is enabled declaratively in the vector index definition and requires minimal code changes
As vector databases scale into the millions of embeddings, the computational and memory requirements of high-dimensional vector operations become critical bottlenecks in production AI systems. Without effective scaling strategies, organizations face:
Infrastructure costs that grow exponentially with data volume
Unacceptable query latency that degrades user experience and limits real-time applications
Restricted deployment options, particularly on edge devices or in resource-constrained environments
Diminished competitive advantage as AI capabilities become limited by technical constraints and bottlenecks rather than use case innovation
This technical guide demonstrates advanced techniques for optimizing vector search operations through precision-controlled quantization—transforming resource-intensive 32-bit float embeddings into performance-optimized representations while preserving semantic fidelity. By leveraging MongoDB Atlas Vector Search's automatic quantization capabilities with Voyage AI's quantization-aware embedding models, we'll implement systematic optimization strategies that dramatically reduce both computational overhead and memory footprint.
This guide provides an empirical analysis of the critical performance metrics:
Retrieval latency benchmarking:
Quantitative comparison of search performance across binary, scalar, and float32 precision levels with controlled evaluation of HNSW (hierarchical navigable small world) graph exploration parameters and k-retrieval variations.
Representational capacity retention:
Precise measurement of semantic information preservation through direct comparison of quantized vector search results against full-fidelity retrieval, with particular attention to retention curves across varying retrieval depths.
We'll present implementation strategies and evaluation methodologies for vector quantization that simultaneously optimize for both computational efficiency and semantic fidelity—enabling you to make evidence-based architectural decisions for production-scale AI retrieval systems handling millions of embeddings. The techniques demonstrated here are directly applicable to enterprise-grade RAG architectures, recommendation engines, and semantic search applications where millisecond-level latency improvements and dramatic RAM reduction translate to significant infrastructure cost savings.
The full end-to-end implementation for automatic vector quantization and the other operations involved in RAG/agent pipelines can be found in our GitHub repository.
Auto-quantization of Voyage AI embeddings with MongoDB
Our approach addresses the complete optimization cycle for vector search operations, covering:
Generating embeddings with quantization-aware models
Implementing automatic vector quantization in MongoDB Atlas
Creating and configuring specialized vector search indices
Measuring and comparing latency across different quantization strategies
Quantifying representational capacity retention
Analyzing performance trade-offs between binary, scalar, and float32 implementations
Making evidence-based architectural decisions for production AI retrieval systems
Figure 1.
Vector quantization architecture with MongoDB Atlas and Voyage AI.
Using text data as an example, we convert documents into numerical vector embeddings that capture semantic relationships. MongoDB then indexes and stores these embeddings for efficient similarity searches. By comparing queries run against float32, int8, and binary embeddings, you can gauge the trade-offs between precision and performance and better understand which quantization strategy best suits large-scale, high-throughput workloads.
One key takeaway from this article is that representational capacity retention is highly dependent on the embedding model used. With quantization-aware models like Voyage AI’s voyage-3-large at appropriate dimensionality (1024 dimensions), our tests demonstrate that we can achieve 95%+ recall retention at reasonable numCandidates values. This means organizations can significantly reduce memory and computational requirements while preserving semantic search quality, provided they select embedding models specifically designed to maintain their representation capacity after quantization.
For more information on why vector quantization is crucial for AI workloads, refer to this blog post.
Dataset information
Our quantization evaluation framework leverages two complementary datasets designed specifically to benchmark semantic search performance across different precision levels.
Primary Dataset (Wikipedia-22-12-en-voyage-embed):
Contains approximately 300,000 Wikipedia article fragments with pre-generated 1024-dimensional embeddings from Voyage AI’s voyage-3-large model. This dataset serves as a diverse vector corpus for testing vector quantization effects in semantic search. Throughout this tutorial, we'll use the primary dataset to demonstrate the technical implementation of quantization.
Embedding generation with Voyage AI
For generating new embeddings for AI Search applications, we use Voyage AI's voyage-3-large model, which is designed to be quantization-aware. The model generates 1024-dimensional vectors and has been trained to maintain its semantic properties even after quantization, making it ideal for our AI retrieval optimization strategy. For more information on how MongoDB and Voyage AI work together for optimal retrieval, see our previous article, Rethinking Information Retrieval with MongoDB and Voyage AI.
import voyageai

# Initialize the Voyage AI client
client = voyageai.Client()

def get_embedding(text, task_prefix="document"):
    """
    Generate embeddings using the voyage-3-large model for AI Retrieval.

    Parameters:
        text (str): The input text to be embedded.
        task_prefix (str): The Voyage input type for the text ("document" or "query").

    Returns:
        list: The embedding vector (1024 dimensions).
    """
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    # Call the Voyage API to generate the embedding
    result = client.embed([text], model="voyage-3-large", input_type=task_prefix)

    # Return the first embedding from the result
    return result.embeddings[0]
Converting embeddings to BSON BinData format
A critical optimization step is converting embeddings to MongoDB's BSON BinData format, which significantly reduces storage and memory requirements. The BinData vector format provides several key advantages:
Reduces disk space by approximately 3x compared to arrays
Enables more efficient indexing with alternate types (int8, binary)
Reduces RAM usage by 3.75x for scalar and 24x for binary quantization
from bson.binary import Binary, BinaryVectorDtype

def generate_bson_vector(array, data_type):
    return Binary.from_vector(array, BinaryVectorDtype(data_type))

# Convert embeddings to BSON BinData vector format
wikipedia_data_df["embedding"] = wikipedia_data_df["embedding"].apply(
    lambda x: generate_bson_vector(x, BinaryVectorDtype.FLOAT32)
)
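With the embeddings converted to BinData, the DataFrame records need to be loaded into the Atlas collection used throughout the rest of the tutorial. A minimal sketch follows, assuming a standard PyMongo connection; the connection string, database name, and collection name are illustrative placeholders rather than the exact values used in the repository:

import os
from pymongo import MongoClient

# Connect to Atlas (assumes the connection string is exported as MONGO_URI)
mongo_client = MongoClient(os.environ["MONGO_URI"])
wiki_data_collection = mongo_client["vector_quantization_demo"]["wikipedia_voyage_embed"]

# Load the converted documents; start from an empty collection for repeatable runs
wiki_data_collection.delete_many({})
wiki_data_collection.insert_many(wikipedia_data_df.to_dict("records"))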
Vector index creation with different quantization strategies
The cornerstone of our performance optimization framework lies in creating specialized vector indices with different quantization strategies. This process leverages MongoDB both for general-purpose database functionality and, more specifically, for its high-performance vector search capability of efficiently handling million-scale embedding collections.
This implementation step shows how to set up MongoDB's vector search capabilities with automatic quantization, covering two primary quantization strategies: scalar (int8) and binary. Alongside the two quantized indices, an unquantized float32 ANN index is created so that retrieval latency and recall can be measured and compared across precision levels, including the full-fidelity vector representation.
MongoDB Atlas Vector Search uses HNSW (hierarchical navigable small world), a graph-based indexing algorithm that organizes vectors into a hierarchy of layers. In this structure, vector data points within a layer are contextually similar; higher layers are sparse, while lower layers are denser and contain more vector data points.
The code snippet below implements the quantization strategies in parallel, enabling a systematic evaluation of the latency, memory usage, and representational capacity trade-offs across the precision spectrum and supporting data-driven decisions about the optimal approach for specific application requirements.
MongoDB Atlas automatic quantization is activated entirely through the vector index definition. By including the "quantization" attribute and setting its value to either "scalar" or "binary", you enable automatic compression of your embeddings at index creation time. This declarative approach means no separate preprocessing of vectors is required—MongoDB handles the precision reduction transparently while maintaining the original embeddings for potential rescoring operations.
from pymongo.operations import SearchIndexModel

def setup_vector_search_index(collection, index_definition, index_name="vector_index"):
    """Setup a vector search index with the specified configuration"""
    ...

# 1. Scalar Quantized Index (int8)
vector_index_definition_scalar_quantized = {
    "fields": [{
        "type": "vector",
        "path": "embedding",
        "quantization": "scalar",  # Uses int8 quantization
        "numDimensions": 1024,
        "similarity": "cosine",
    }]
}

# 2. Binary Quantized Index (1-bit)
vector_index_definition_binary_quantized = {
    "fields": [{
        "type": "vector",
        "path": "embedding",
        "quantization": "binary",  # Uses binary (1-bit) quantization
        "numDimensions": 1024,
        "similarity": "cosine",
    }]
}

# 3. Float32 ANN Index (no quantization)
vector_index_definition_float32_ann = {
    "fields": [{
        "type": "vector",
        "path": "embedding",
        "numDimensions": 1024,
        "similarity": "cosine",
    }]
}

# Create the indices
setup_vector_search_index(
    wiki_data_collection,
    vector_index_definition_scalar_quantized,
    "vector_index_scalar_quantized"
)
setup_vector_search_index(
    wiki_data_collection,
    vector_index_definition_binary_quantized,
    "vector_index_binary_quantized"
)
setup_vector_search_index(
    wiki_data_collection,
    vector_index_definition_float32_ann,
    "vector_index_float32_ann"
)
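The body of setup_vector_search_index is elided above. As a reference point, a minimal sketch of what such a helper might look like with PyMongo's search index API is shown below; this is an assumption about the implementation, not the exact code in the repository (which may, for example, also poll list_search_indexes() until the new index reports queryable):

def setup_vector_search_index(collection, index_definition, index_name="vector_index"):
    """Create an Atlas Vector Search index with the specified configuration (sketch)."""
    search_index_model = SearchIndexModel(
        definition=index_definition,
        name=index_name,
        type="vectorSearch",  # an Atlas Vector Search index, not a regular Atlas Search index
    )
    created_name = collection.create_search_index(model=search_index_model)
    print(f"Index '{created_name}' is being created...")
    return created_name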
Implementing vector search functionality
Vector search serves as the computational foundation of modern generative AI systems. While LLMs provide reasoning and generation capabilities, vector search delivers the contextual knowledge necessary for grounding these capabilities in relevant information.
This semantic retrieval operation forms the backbone of RAG architectures that power enterprise-grade AI applications, such as knowledge-intensive chatbots and domain-specific assistants. In more advanced implementations, vector search enables agentic RAG systems where autonomous agents dynamically determine what information to retrieve, when to retrieve it, and how to incorporate it into complex reasoning chains.
The implementation below provides the technical overview that transforms raw embedding vectors into intelligent search components that move beyond lexical matching to true semantic understanding.
Our implementation below supports both approximate nearest neighbor (ANN) search and exact nearest neighbor (ENN) search through the use_full_precision parameter:
Approximate nearest neighbor (ANN) search:
When use_full_precision = False, the system performs an approximate search using:
The specified quantized index (binary or scalar)
The HNSW graph navigation algorithm
A controlled exploration breadth via numCandidates
This approach sacrifices perfect accuracy for dramatic performance gains, particularly at scale. The HNSW algorithm enables sub-linear time complexity by intelligently sampling the vector space, making it possible to search billions of vectors in milliseconds instead of seconds. When combined with quantization, ANN delivers order-of-magnitude improvements in both speed and memory efficiency.
Exact nearest neighbor (ENN) search:
When use_full_precision = True, the system performs exact search using:
The original float32 embeddings (regardless of the index specified)
An exhaustive comparison approach
The exact = True directive to bypass approximation techniques
ENN guarantees finding the mathematically optimal nearest neighbors by computing distances between the query vector and every single vector in the database. This brute-force approach provides perfect recall but scales linearly with collection size, becoming prohibitively expensive as vector counts increase beyond millions.
We include both search modes for several critical reasons:
Establishing ground truth:
ENN provides the "perfect" baseline against which we measure the quality degradation of approximation techniques. The representational retention metrics discussed later directly compare ANN results against this ENN ground truth.
Varying application requirements:
Not all AI applications prioritize the same metrics. Time-sensitive applications (real-time customer service) might favor ANN's speed, while high-stakes applications (legal document analysis) might require ENN's accuracy.
def custom_vector_search(
    user_query,
    collection,
    embedding_path,
    vector_search_index_name="vector_index",
    top_k=5,
    num_candidates=25,
    use_full_precision=False,
):
    """
    Perform vector search with configurable precision and parameters for AI Search applications.
    """
    # Generate embedding for the query
    query_embedding = get_embedding(user_query, task_prefix="query")

    # Define the vector search stage
    vector_search_stage = {
        "$vectorSearch": {
            "index": vector_search_index_name,
            "queryVector": query_embedding,
            "path": embedding_path,
            "limit": top_k,
        }
    }

    # Configure search precision approach
    if not use_full_precision:
        # For approximate nearest neighbor (ANN) search
        vector_search_stage["$vectorSearch"]["numCandidates"] = num_candidates
    else:
        # For exact nearest neighbor (ENN) search
        vector_search_stage["$vectorSearch"]["exact"] = True

    # Project only needed fields
    project_stage = {
        "$project": {
            "_id": 0,
            "title": 1,
            "text": 1,
            "wiki_id": 1,
            "url": 1,
            "score": {"$meta": "vectorSearchScore"},
        }
    }

    # Build and execute the pipeline
    pipeline = [vector_search_stage, project_stage]
    ...

    # Execute the query
    results = list(collection.aggregate(pipeline))

    return {"results": results, "execution_time_ms": execution_time_ms}
Measuring the retrieval latency of various quantized vectors
In production AI retrieval systems, query latency directly impacts user experience, operational costs, and system throughput capacity. Vector search operations typically constitute the primary performance bottleneck in RAG architectures, making latency optimization a critical engineering priority.
Sub-100ms response times are often necessary for interactive and mission-critical applications, while batch processing systems may tolerate higher latencies but require consistent predictability for resource planning.
Our latency measurement methodology employs a systematic, parameterized approach that models real-world query patterns while isolating the performance characteristics of different quantization strategies. This parameterized benchmarking enables us to:
Construct detailed latency profiles across varying retrieval depths
Identify performance inflection points where quantization benefits become significant
Map the scaling curves of different precision levels as the data volume increases
Determine optimal configuration parameters for specific throughput targets
def measure_latency_with_varying_topk(
    user_query,
    collection,
    vector_search_index_name,
    use_full_precision=False,
    top_k_values=[5, 10, 50, 100],
    num_candidates_values=[25, 50, 100, 200, 500, 1000, 2000],
):
    """
    Measure search latency across different configurations.
    """
    results_data = []

    for top_k in top_k_values:
        for num_candidates in num_candidates_values:
            # Skip invalid configurations
            if num_candidates < top_k:
                continue

            # Get precision type from index name
            precision_name = vector_search_index_name.split("vector_index")[1]
            precision_name = precision_name.replace("quantized", "").capitalize()
            if use_full_precision:
                precision_name = "_float32_ENN"

            # Perform search and measure latency
            vector_search_results = custom_vector_search(
                user_query=user_query,
                collection=collection,
                embedding_path="embedding",
                vector_search_index_name=vector_search_index_name,
                top_k=top_k,
                num_candidates=num_candidates,
                use_full_precision=use_full_precision,
            )
            latency_ms = vector_search_results["execution_time_ms"]

            # Store results
            results_data.append({
                "precision": precision_name,
                "top_k": top_k,
                "num_candidates": num_candidates,
                "latency_ms": latency_ms,
            })
            print(f"Top-K: {top_k}, NumCandidates: {num_candidates}, "
                  f"Latency: {latency_ms} ms, Precision: {precision_name}")

    return results_data
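To build the latency profiles analyzed below, the benchmark can be run once per index configuration, plus a float32 ENN baseline. The query string and loop below are illustrative assumptions; the notebook in the repository may drive the benchmark differently:

test_query = "How do airplanes stay in the air?"  # illustrative test query

latency_results = []
for index_name, full_precision in [
    ("vector_index_scalar_quantized", False),
    ("vector_index_binary_quantized", False),
    ("vector_index_float32_ann", False),
    ("vector_index_float32_ann", True),  # exact (ENN) baseline over float32 vectors
]:
    latency_results.extend(
        measure_latency_with_varying_topk(
            user_query=test_query,
            collection=wiki_data_collection,
            vector_search_index_name=index_name,
            use_full_precision=full_precision,
        )
    )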
Latency results analysis
Our systematic benchmarking reveals dramatic performance differences between quantization strategies across different retrieval scenarios. The visualizations below capture these differences for top-k=10 and top-k=100 configurations.
Figure 2.
Search latency vs the number of candidates for top-k=10.
Figure 3.
Search latency vs the number of candidates for top-k=100.
Several critical patterns emerge from these latency profiles:
Quantization delivers order-of-magnitude performance gains:
The float32_ENN approach (purple line) demonstrates latency measurements an order of magnitude higher than any quantized approach. At top-k=10, ENN latency starts at ~1600ms and never drops below 500ms, while quantized approaches maintain sub-100ms performance until extremely high candidate counts. This performance gap widens further as data volume scales.
Scalar quantization offers the best performance profile:
Somewhat surprisingly, scalar quantization (orange line) consistently outperforms both binary quantization and float32 ANN across most configurations. This is particularly evident at higher num_candidates values, where scalar quantization maintains near-flat latency scaling. This suggests scalar quantization achieves an optimal balance in the memory-computation trade-off for HNSW traversal.
Binary quantization shows linear latency scaling:
While binary quantization (red line) starts with excellent performance, its latency increases more steeply as num_candidates grows, eventually exceeding scalar quantization at very high exploration depths. This suggests that while binary vectors require less memory, their distance computation savings are partially offset by the need for more complex traversal patterns in the HNSW graph and rescoring.
All quantization methods maintain interactive-grade performance:
Even with 10,000 candidate explorations and top-k=100, all quantized approaches maintain sub-200ms latency, well within interactive application requirements. This demonstrates that quantization enables order-of-magnitude increases in exploration depth without sacrificing user experience, allowing for dramatic recall improvements while maintaining acceptable latency.
These empirical results validate our theoretical understanding of quantization benefits and provide concrete guidance for production deployment: scalar quantization offers the best general-purpose performance profile, while binary quantization excels in memory-constrained environments with moderate exploration requirements.
In the images below, we employ logarithmic scaling for both axes in our latency analysis because search performance data typically spans multiple orders of magnitude.
When comparing different precision types (scalar, binary, float32_ann) across varying numbers of candidates, the latency values can range from milliseconds to seconds, while candidate counts may vary from hundreds to millions.
Linear plots would compress smaller values and make it difficult to observe performance trends across the full range (as we see above). Logarithmic scaling transforms exponential relationships into linear ones, making it easier to identify proportional changes, compare relative performance improvements, and detect patterns that would otherwise be obscured.
This visualization approach is particularly valuable for understanding how each precision type scales with increasing workload and for identifying the optimal operating ranges where certain methods outperform others (as shown below).
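For reference, a log-log latency curve like the ones below can be generated from the benchmark records with matplotlib. This sketch assumes the latency_results list collected in the usage example above; it is not the exact plotting code behind the figures:

import pandas as pd
import matplotlib.pyplot as plt

results_df = pd.DataFrame(latency_results)

# One latency curve per precision type for a fixed top_k, on logarithmic axes
fig, ax = plt.subplots()
for precision, group in results_df[results_df["top_k"] == 10].groupby("precision"):
    group = group.sort_values("num_candidates")
    ax.plot(group["num_candidates"], group["latency_ms"], marker="o", label=precision)
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("numCandidates (log scale)")
ax.set_ylabel("Latency in ms (log scale)")
ax.set_title("Search latency vs number of candidates (top-k=10)")
ax.legend()
plt.show()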
Figure 4.
Search latency vs the number of candidates (log scale) for top-k=10.
Figure 5.
Search latency vs the number of candidates (log scale) for top-k=100.
The performance characteristics observed in the logarithmic plots above directly reflect the architectural differences inherent in binary quantization's two-stage retrieval process. Binary quantization employs a coarse-to-fine search strategy: an initial fast retrieval phase using low-precision binary representations, followed by a refinement phase that rescores the top-k candidates using full-precision vectors to restore accuracy.
This dual-phase approach creates a fundamental performance trade-off that manifests differently across varying candidate pool sizes. For smaller candidate sets, the computational savings from binary operations during the initial retrieval phase can offset the rescoring overhead, making binary quantization competitive with other methods. However, as the candidate pool expands, the rescoring phase—which must compute full-precision similarity scores for an increasing number of retrieved candidates—begins to dominate the total latency profile.
Measuring representational capacity retention
While latency optimization is critical for operational efficiency, the primary concern for AI applications remains semantic accuracy. Vector quantization introduces a fundamental trade-off: computational efficiency versus representational capacity. Even the most performant quantization approach is useless if it fails to maintain the semantic relationships encoded in the original embeddings.
To quantify this critical quality dimension, we developed a systematic methodology for measuring representational capacity retention—the degree to which quantized vectors preserve the same nearest-neighbor relationships as their full-precision counterparts. This approach provides an objective, reproducible framework for evaluating semantic fidelity across different quantization strategies.
def measure_representational_capacity_retention_against_float_enn(
    ground_truth_collection,
    collection,
    quantized_index_name,
    top_k_values,
    num_candidates_values,
    num_queries_to_test=1,
):
    """
    Compare quantized search results against full-precision baseline.

    For each test query:
    1. Perform baseline search with float32 exact search
    2. Perform same search with quantized vectors
    3. Calculate retention as % of baseline results found in quantized results
    """
    retention_results = {"per_query_retention": {}}
    overall_retention = {}

    # Initialize tracking structures
    for top_k in top_k_values:
        overall_retention[top_k] = {}
        for num_candidates in num_candidates_values:
            if num_candidates < top_k:
                continue
            overall_retention[top_k][num_candidates] = []

    # Get precision type
    precision_name = quantized_index_name.split("vector_index")[1]
    precision_name = precision_name.replace("quantized", "").capitalize()

    # Load test queries from ground truth annotations
    ground_truth_annotations = list(
        ground_truth_collection.find().limit(num_queries_to_test)
    )

    # For each annotation, test all its questions
    for annotation in ground_truth_annotations:
        ground_truth_wiki_id = annotation["wiki_id"]
        ...

    # Calculate average retention for each configuration
    avg_overall_retention = {}
    for top_k, cand_dict in overall_retention.items():
        avg_overall_retention[top_k] = {}
        for num_candidates, retentions in cand_dict.items():
            if retentions:
                avg = sum(retentions) / len(retentions)
            else:
                avg = 0
            avg_overall_retention[top_k][num_candidates] = avg

    retention_results["average_retention"] = avg_overall_retention
    return retention_results
Our methodology takes a rigorous approach to retention measurement:
Establishing ground truth:
We use float32 exact nearest neighbor (ENN) search as the baseline "perfect" result set, acknowledging that these are the mathematically optimal neighbors.
Controlled comparison:
For each query in our annotation dataset, we perform parallel searches using different quantization strategies, carefully controlling for top-k and num_candidates parameters.
Retention calculation:
We compute retention as the ratio of overlapping results between the quantized search and the ENN baseline: |quantized_results ∩ baseline_results| / |baseline_results|.
Statistical aggregation:
We average retention scores across multiple queries to account for query-specific variations and produce robust, generalizable metrics.
This approach provides a direct, quantitative measure of how much semantic fidelity is preserved after quantization. A retention score of 1.0 indicates that the quantized search returns exactly the same results as the full-precision search, while lower scores indicate divergence.
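The per-query comparison elided inside the retention function above reduces to the set-overlap calculation in step 3. Below is a minimal sketch of that inner step, assuming query_text is drawn from the annotation's questions (a hypothetical field) and that results are matched on the projected wiki_id field:

# Ground truth: float32 exact nearest neighbor (ENN) search
baseline_results = custom_vector_search(
    user_query=query_text,
    collection=collection,
    embedding_path="embedding",
    vector_search_index_name="vector_index_float32_ann",
    top_k=top_k,
    num_candidates=num_candidates,
    use_full_precision=True,  # exact search over the original embeddings
)["results"]

# Candidate: the same query against the quantized index (ANN)
quantized_results = custom_vector_search(
    user_query=query_text,
    collection=collection,
    embedding_path="embedding",
    vector_search_index_name=quantized_index_name,
    top_k=top_k,
    num_candidates=num_candidates,
    use_full_precision=False,
)["results"]

# Retention = |quantized_results ∩ baseline_results| / |baseline_results|
baseline_ids = {doc["wiki_id"] for doc in baseline_results}
quantized_ids = {doc["wiki_id"] for doc in quantized_results}
retention = len(quantized_ids & baseline_ids) / len(baseline_ids) if baseline_ids else 0.0
overall_retention[top_k][num_candidates].append(retention)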
Representational capacity results analysis
The findings from the representational capacity retention evaluation provide empirical validation that properly implemented quantization—particularly scalar quantization—can maintain semantic fidelity while dramatically reducing computational and memory requirements.
Note that in the chart below, the scalar curve (yellow) exactly matches the float32_ann performance (blue)—so much so that the blue line is completely hidden beneath the yellow.
Figure 6.
Retention score vs the number of candidates for top-k=10.
Figure 7.
Retention score vs the number of candidates for top-k=50.
Figure 8.
Retention score vs the number of candidates for top-k=100.
Scalar quantization achieves near-perfect retention:
The scalar quantization approach (orange line) demonstrates extraordinary representational capacity preservation, achieving 98-100% retention across nearly all configurations. At top-k=10, it reaches perfect 1.0 retention with just 100 candidates, effectively matching full-precision ENN results while using 4x less memory. This remarkable performance validates the effectiveness of int8 quantization when implemented with MongoDB's automatic quantization.
Binary quantization shows retention-exploration trade-off:
Binary quantization (red line) exhibits a clear correlation between exploration depth and retention quality. At top-k=10, it starts at ~91% retention with minimal candidates but improves to 98% at 500 candidates. The effect is more pronounced at higher top-k values (50 and 100), where initial retention drops to ~74% but recovers substantially with increased exploration. This suggests that binary quantization's information loss can be effectively mitigated by exploring more of the vector space.
Retention dynamics change with retrieval depth:
As top-k increases from 10 to 100, the retention patterns become more differentiated between quantization strategies. This reflects the increasing challenge of maintaining accurate rankings as more results are requested. While scalar quantization remains relatively stable across different top-k values, binary quantization shows more sensitivity, indicating it's better suited for targeted retrieval scenarios (low top-k) than for broad exploration.
Exploration depth compensates for precision loss:
A fascinating pattern emerges across all quantization methods: increased num_candidates consistently improves retention. This demonstrates that reduced precision can be effectively counterbalanced by broader exploration of the vector space. For example, binary quantization at 500 candidates achieves better retention than scalar quantization at 25 candidates, despite using 32x less memory per vector.
Float32 ANN vs. scalar quantization:
The float32 ANN approach (blue line) shows virtually identical retention to scalar quantization at higher top-k values, while consuming 4x more memory. This suggests scalar quantization represents an optimal balance point, offering full-precision quality with significantly reduced resource requirements.
Conclusion
This guide has demonstrated the powerful impact of vector quantization in optimizing vector search operations through MongoDB Atlas Vector Search and its automatic quantization feature, using Voyage AI embeddings. These findings provide empirical validation that properly implemented quantization—particularly scalar quantization—can maintain semantic fidelity while dramatically reducing computational and memory requirements.
The near-perfect retention of scalar quantization should alleviate concerns about quality degradation, while binary quantization's retention profile suggests it's suitable for applications with higher performance demands that can tolerate slight quality trade-offs or compensate with increased exploration depth.
Binary quantization achieves optimal latency and resource efficiency, particularly valuable for high-scale deployments where speed is critical.
Scalar quantization provides an effective balance between performance and precision, suitable for most production applications.
Float32 maintains maximum accuracy but incurs significant performance and memory costs.
Figure 9.
Performance and memory usage metrics for binary quantization, scalar quantization, and float32 implementation.
Based on the image above, our implementation demonstrated substantial efficiency gains:
Binary Quantized Index achieves the most compact disk footprint of the two quantized indices at 407.66MB, representing approximately 4KB per document. This compression comes from representing high-dimensional vectors as binary bits, dramatically reducing storage requirements while maintaining retrieval capability.
Float32 ANN Index requires 394.73MB of disk space, slightly less than binary due to optimized index structures, but demands the full storage footprint be loaded into memory for optimal performance.
Scalar Quantized Index shows the largest storage requirement at 492.83MB (approximately 5KB per document), suggesting this method maintains higher precision than binary while still applying compression techniques, resulting in a middle-ground approach between full precision and extreme quantization.
The most striking difference lies in memory requirements. Binary quantization demonstrates a 23:1 memory efficiency ratio, requiring only 16.99MB in RAM versus the 394.73MB needed by float32_ann. Scalar quantization provides a 3:1 memory optimization, requiring 131.42MB compared to float32_ann's full memory footprint.
For production AI retrieval implementations, general guidance is as follows:
Use scalar quantization for general use cases requiring a good balance of speed and accuracy.
Use binary quantization for large-scale applications (1M+ vectors) where speed is critical.
Use float32 only for applications where maximum precision and accuracy are paramount.
Vector quantization becomes particularly valuable for databases exceeding 1M vectors, where it enables significant scalability improvements without compromising retrieval accuracy. When combined with MongoDB Atlas Search Nodes, this approach effectively addresses both cost and performance constraints in advanced vector search applications.
Boost your MongoDB skills today through our Atlas Learning Hub.
Head over to our quick start guide to get started with Atlas Vector Search.
June 10, 2025