Scalable Vector Search for AI Apps with Milvus and Databricks

Sathish Gangichetty
6 min read · Oct 12, 2022


Multi-modal embeddings are all the rage these days. See here, here, and here. Everyone wants a piece of them because they give you a way to convert unstructured data into representations that capture the semantic nature of unstructured assets across image, text, audio, video, and more. These representations are vectors that can be used for a variety of purposes in use cases that require models for image similarity, deduplication, anomaly detection, text similarity, audio classification, video understanding, and so on. To top that off, you don’t have to be a data scientist with deep ML expertise to build these systems, nor do you need large amounts of data to start leveraging them.

This is all fine until you get to actual “hands on the keyboard” work for production, where multiple questions about compute and storage infrastructure need to be tackled. In this blog, we’ll address three of the most important ones, so that teams can materialize AI products quickly:

  1. How do I upgrade models that generate embeddings quickly as state-of-the-art changes? And how should I upgrade those embeddings?
  2. Once generated, how do I get a high performance production-grade vector search done across billions/trillions of embeddings?
  3. Finally, how can I quickly perform a probabilistic search for batch or on-demand lookups?

Right out of the gate, we know a regular data warehouse cannot support this. To truly work with unstructured data at scale, you need the ability not just to save and load these data, but also to transform, process, and build models against them. More importantly, all of this needs to happen at an optimal price-performance point, with clean integration into your orchestration system. The bottom line is that it’s just not possible to do this with a warehouse: it’s like trying to fit a square peg through a round hole. Can you? Maybe. But should you? Definitely not.

Also, these use cases built on unstructured data aren’t going to disappear overnight. In fact, for the uninitiated, analysts estimate that over 80% of data generated by 2025 will be unstructured. Look no further than our day-to-day lives: the IMs, emails, and messages we exchange, the pictures and videos we consume on social media platforms, and so on. Getting a handle on this topic is therefore vital to the success of any organization. It means we need to look past the data warehouse, because we need the capabilities of both a data lake and a data warehouse. This is where the Databricks lakehouse platform can help: by providing solid tooling to define unstructured ETL pipelines via Databricks Delta Live Tables and Databricks Workflows, it massively simplifies the dev and deploy cycles of these pipelines.

If you’re just getting started and you need to know how to generate embeddings at scale using Databricks, please check out this article by my colleague. It shows a pattern that allows you to automatically scale the process using Databricks managed compute. That aside, how else does this help? To understand that, let’s dissect each of the questions we raised earlier.

1. How do I upgrade models that generate embeddings quickly as state-of-the-art changes? And how should I upgrade those embeddings?

This is where the open source package towhee helps. All you need is a Python-based environment to tap into the models shipped as operators inside this package. As the state of the art for a model changes, you can simply select the “operator” you need and upgrade your pipeline to reflect it. This becomes a monumental cost saving when we look at how fast AI/ML research moves these days. As an example, look at the exponential growth in the number of articles published on arXiv.

credit: https://arxiv.org/pdf/2210.00881.pdf

This directly translates to leapfrogging capabilities being pushed to open source very quickly, and when this happens, you need a way to capitalize on that. Databricks Delta Live Tables + Towhee + Pandas UDF gives you an opportunity to do exactly that. See here for more on how to accomplish this. This is perfect because it even does the work of generating embeddings from source data in a distributed manner. All you do is define a pipeline in a declarative manner.
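To make this concrete, here is a minimal sketch of what such a towhee pipeline could look like for an image use case. The helper name `build_image_pipeline` is hypothetical, and the `resnet50` timm operator is just one illustrative model choice among many:

```python
def build_image_pipeline(model_name: str = "resnet50"):
    """Build a towhee pipeline: decode an image path, embed it with a timm model."""
    # Lazy import: towhee pulls in model weights on first use.
    from towhee import ops, pipe

    # `model_name` is the only thing to change when the state of the art moves on.
    return (
        pipe.input("path")
        .map("path", "img", ops.image_decode())
        .map("img", "vec", ops.image_embedding.timm(model_name=model_name))
        .output("vec")
    )
```

Because the model is just an argument, swapping in next year’s architecture is a one-line change to the pipeline definition rather than a rewrite.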

2. Once generated, how do I go from an index based search that FAISS and HNSW enable to something that can support high performance production grade vector search across billions of embeddings?

This is where Milvus, an open source vector database, can help. Built on top of popular vector search libraries including FAISS, Annoy, and HNSW, Milvus was designed for similarity search on vector datasets containing millions, billions, or even trillions of vectors. Milvus supports the key features we have come to expect from a database, exposed through a CRUD API for operations against the data it stores: sharding, partitioning, replication, disaster recovery, load balancing, query optimization, and more. The specifics of how Milvus organizes data internally and the services it relies on to be operational are all covered here.

3. Finally, how can I quickly perform a probabilistic search for batch or on-demand lookups?

Before we address this, we need to understand how to insert embedding data from Delta tables into Milvus. For the sake of this blog post, let’s assume you’re picking up where this blog post left off, i.e., you already have embeddings in a Delta table. Databricks provides all the services needed to perform the embedding inserts either on demand or on a schedule via Databricks Workflows. To begin, we can simply run %pip install pymilvus within the notebook scope to install the Python SDK used to interact with Milvus. After this, import the package and connect to Milvus like so:

Connect to Milvus

This instantiates an active connection to Milvus. Next, we can create a Milvus collection using the data stored in our Databricks SQL table, by defining a few variables that describe and identify the structure of the collection in Milvus.

Set Up Variables
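A sketch of those variables, assuming a 512-dimensional image-embedding model and hypothetical column names in the Delta table:

```python
# Hypothetical names -- adjust to match your Delta table and embedding model.
COLLECTION_NAME = "image_embeddings"  # collection to create in Milvus
EMBEDDING_DIM = 512                   # dimensionality of the upstream embeddings
ID_FIELD = "id"                       # primary key column in the Delta table
PATH_FIELD = "path"                   # source file path, useful when showing results
VECTOR_FIELD = "embedding"            # column holding the embedding vector
```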

Next, you can define a simple function like so to help create the collection.

function to create a collection

After this, we can create the collection and insert the data into it.

Insert into the collection
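A hedged sketch of the insert step, assuming the built-in `spark` session available in any Databricks notebook and a hypothetical Delta table with `id`, `path`, and `embedding` columns:

```python
def insert_from_delta(collection, table: str = "hive_metastore.default.embeddings") -> int:
    """Read embeddings from a Delta table and bulk-insert them into Milvus."""
    # `spark` is the SparkSession that Databricks notebooks provide; the table
    # name and its column names are assumptions for illustration.
    pdf = spark.table(table).select("id", "path", "embedding").toPandas()
    collection.insert([
        pdf["id"].tolist(),
        pdf["path"].tolist(),
        pdf["embedding"].tolist(),
    ])
    collection.flush()  # seal the segments so the new rows become searchable
    return len(pdf)
```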

Next, we define an index to speed up lookups against the field that holds the embedding vector. You get a wide range of options for setting up your index. For the sake of simplicity, we’re just going to go with the L2 (euclidean) distance similarity metric. Here is the list of supported similarity metrics and index types you can maintain on your vectors.

Attach an index
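One way to sketch this step, using IVF_FLAT as an illustrative index type (any of the supported types would work) together with the L2 metric:

```python
# Illustrative index settings: L2 distance with a basic IVF_FLAT index.
INDEX_PARAMS = {
    "metric_type": "L2",        # euclidean distance, as chosen above
    "index_type": "IVF_FLAT",   # a simple cluster-based ANN index; many others exist
    "params": {"nlist": 1024},  # number of coarse clusters to partition vectors into
}

def attach_index(collection, field: str = "embedding") -> None:
    """Build the index on the vector field, then load the collection for search."""
    collection.create_index(field_name=field, index_params=INDEX_PARAMS)
    collection.load()  # bring the indexed collection into memory so it is queryable
```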

Now that we have the embeddings ready for lookup, we can query them on demand. Below is a simple ipywidget that allows us to select files in a directory. Selecting a file in turn triggers a quick similarity search against our collection in Milvus to fetch the top 2 matches. Super simple and quick! All from inside Databricks!

A small ipywidgets app that allows us to cycle through images and find similar images
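A sketch of the two pieces involved: a search helper and the widget wiring. The `nprobe` value and the dropdown design are illustrative choices, and `on_select` is whatever callback you want to fire per selection:

```python
def search_similar(collection, query_vec, top_k: int = 2):
    """Return (path, distance) pairs for the top-k nearest neighbours."""
    results = collection.search(
        data=[query_vec],
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 16}},  # nprobe: clusters probed
        limit=top_k,
        output_fields=["path"],
    )
    return [(hit.entity.get("path"), hit.distance) for hit in results[0]]

def make_picker(image_paths, on_select):
    """Build a dropdown that fires the search whenever a file is chosen."""
    import ipywidgets as widgets  # imported lazily; available in Databricks notebooks
    picker = widgets.Dropdown(options=image_paths, description="Image:")
    picker.observe(lambda change: on_select(change["new"]), names="value")
    return picker
```

In the notebook, `on_select` would embed the chosen file with the same model used at ingest time and pass the vector to `search_similar`.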

That’s it! Let’s end with a quick demo of our result.


Sathish Gangichetty

I’m someone with a deep passion for human-centered AI. A lifelong student. Currently working @ Databricks.