Turbocharge Deep-Discovery with LLMs for Videos and Podcasts

Make your video search intelligent using Databricks and LangChain

Sathish Gangichetty
8 min read · Apr 18, 2023

Background: I often have a long list of things I want to learn — as do most people. As a parent with a toddler, time is incredibly valuable to me, so I am always looking for ways to improve my learning efficiency. The recent innovations in LLMs — including the tooling, models, and platforms that enable them — presented me with an opportunity: could I build my own personal assistant to help me learn faster? Certainly, like everyone else, I’ve entertained the idea of using tools like ChatGPT to help with my learning. It is helpful to an extent, but it is also largely constrained by what I bring to it. Let me explain.

I have noticed that most GPT-like models struggle to give me high-quality, pointed responses without specific contextual information. As the questions become increasingly narrow, such as when seeking specific knowledge, responses from these models turn generic and become less helpful. This is not to discredit the usefulness of such services in general, but rather to highlight their limitations during a deep-discovery phase, when you want to explore a topic in depth after first encountering it superficially. The expectation that the user will provide the right context for such models often creates an unnecessary burden, especially when the user is not sure where to begin.

This is problematic because the quality of the response and the quality of the context go hand in hand. Generating a high-quality response from a generative model therefore requires the ability to provide the right context as the first step. Equally important, however, is the ability to locate and extract the information that makes up that context. This is where most systems falter: they rely on simple text-based document formats instead of searching comprehensively through rich, information-dense media such as videos and audio. It is worth challenging these untested assumptions, which have shaped collective thinking about how A.I. applications get built, by exploring rich media sources as a way to produce high-quality responses.

Concept: As far as our specific project is concerned, we’ll build a real-time semantic search application based on YouTube videos published on the Databricks YouTube channel. Why Databricks? Simple. It’s something I can continue to use beyond this demo/blog. I want to build something useful for myself. Since I work @ Databricks, it goes without saying that this will be useful for me & my co-workers. Also, like many others, my general learning strategy is to rely on multiple modes of learning, usually starting with videos. Given this, the Databricks YouTube channel is a perfect fit.

Additionally, pointed search systems like the one we will create offer significant improvements in the perceptible quality of search results compared to watered-down generic search systems. Generic systems are often plagued with non-specific, clickbaity content that drains our time and energy without providing much precision. In contrast, the Databricks channel has highly curated content published by smart people who undoubtedly spend a considerable amount of time ensuring quality. This alone is a huge advantage for us, among other benefits. If we can ensure the quality of the content we consume by “automatically reading through” everything beforehand, we save ourselves a great deal of time and energy. That is exactly what we’ll set out to achieve.

Planning: Before we do anything else, we need a plan of attack for going from idea to execution. Here’s a flowchart where I’ve laid out every step to show exactly that. We’ll go from left to right, of course.

A General Path For Unlocking Search Context In Videos or Podcasts

Tooling: We’ll use Databricks to learn more about Databricks! 😎 We’ll also use the winners of the A.I.-fad-of-the-month tooling: LangChain & Chroma. We’ll then host this solution using the Databricks Model Serving offering, so we can use it on demand from any API-friendly client.

Getting it done: With that, let’s see how we get these items done. To extract the links from the YouTube channel, we’ll use the scrapetube library. When we inspect the response object (by video), we can see a simple way to pick up the source URLs for the videos on the channel we’re interested in (Databricks).

Highlighted in yellow: the source info for the video

By utilizing this information, we can efficiently collect the links for all videos uploaded within the last three years. Given the rapid pace of innovation, this timeframe ensures that any contextual information we collect remains meaningful and relevant to the current state of the product, and that our project stays accurate in its representation of the product over time.
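For reference, here’s a minimal sketch of how the raw channel listing (vlinks) and the relative-time helper used in the snippet below could be wired up. The channel URL and the helper’s internals are illustrative assumptions on my part, not the exact implementation.

import re
import scrapetube
from datetime import datetime, timedelta

# Raw video metadata for the channel (assumes scrapetube accepts a channel URL)
vlinks = list(scrapetube.get_channel(channel_url="https://www.youtube.com/@Databricks"))

def get_date_from_relative_time(relative_time: str) -> str:
    """Roughly convert YouTube's relative timestamps (e.g. '2 years ago') to 'YYYY-MM-DD'."""
    units = {"minute": 1 / 1440, "hour": 1 / 24, "day": 1, "week": 7, "month": 30, "year": 365}
    match = re.search(r"(\d+)\s+(minute|hour|day|week|month|year)", relative_time)
    days = float(match.group(1)) * units[match.group(2)] if match else 0.0
    return (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d")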

# Get all links for videos uploaded in the last 3 yrs
hyperlinks2video = [
    "www.youtube.com"
    + video["navigationEndpoint"]["commandMetadata"]["webCommandMetadata"]["url"]
    for video in vlinks
    if datetime.strptime(
        get_date_from_relative_time(video["publishedTimeText"]["simpleText"]),
        "%Y-%m-%d",
    )
    > datetime.now() - timedelta(days=365 * 3)
]

Having completed this initial step, we can now obtain the links for the relevant videos in the associated YouTube channel. Moving forward, we need to process these links and extract the transcribed scripts for each video. Fortunately, we can use the YoutubeLoader that comes with LangChain to accomplish this. We also need to make sure the resulting documents are split into smaller, more manageable chunks rather than unnecessarily long documents, which we do with a text splitter.

import time
from langchain.document_loaders import YoutubeLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def _fetch_and_produce_docs(link):
    "func takes a link and spits out n transcribed docs based on the splitter"
    print(f"INFO: Processing {link}")
    doc = YoutubeLoader.from_youtube_url(link, add_video_info=True).load_and_split(splitter)
    docs.extend(doc)  # load_and_split returns a list of chunks, so extend keeps docs flat

def retry_function(docfunc, num_retries, sleep_time):
    "func applies retries and sleeps for a little while before trying again"
    for i in range(num_retries):
        try:
            docfunc()
            return  # success, stop retrying
        except Exception:
            print(f"Retry failed {i+1} time(s). Retrying in {sleep_time} seconds...")
            time.sleep(sleep_time)
    raise ValueError(f"All {num_retries} retries failed")

docs = []
reprocess_url_list = []
splitter = RecursiveCharacterTextSplitter(chunk_size=3000, length_function=len)
# Iteratively process all the links and keep a tally of any failures
for link in hyperlinks2video:
    try:
        retry_function(lambda: _fetch_and_produce_docs(link), 3, 5)
    except Exception:
        reprocess_url_list.append(link)
        print(f"INFO: failed for link - {link}")

Next, we will generate embeddings from the documents extracted in the previous step. For this, we will use the HuggingFaceEmbeddings class, which pulls a base sentence-embedding model from the HuggingFace model hub, and we will use Chroma (which integrates with LangChain) to persist the data to disk. While we have used the base model in this case, there are other pre-trained models that may be better suited to particular analyses, including models from OpenAI or Cohere, or even ones that you have trained yourself.

from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

persist_dir = "./.chroma/"  # the same path is later packaged as a model artifact
db = Chroma.from_documents(
    docs,  # the flat list of transcript chunks gathered above
    HuggingFaceEmbeddings(),
    persist_directory=persist_dir,
)
db.persist()
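HuggingFaceEmbeddings defaults to a general-purpose sentence-transformers base model, and swapping in something else is a one-line change. A minimal sketch for illustration (the model choices below are examples, not what was benchmarked here):

from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings

# A smaller, faster sentence-transformers model (example choice)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Or a hosted embedding API instead (requires an OpenAI key in the environment)
# embeddings = OpenAIEmbeddings()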

That’s it! We can now use these embeddings to perform context-based semantic search as and when we need it. Shown below is an example search for the term “DLT best practices”, and sure enough, we get back high-quality results.

Example Search: DLT Best Practices. Uses In-Video Context
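Programmatically, the same lookup is just a similarity_search call against the persisted store. A minimal sketch (the query string and k value are illustrative):

# Pull the most relevant transcript chunks for a query
results = db.similarity_search("DLT best practices", k=4)
for doc in results:
    # With add_video_info=True, each chunk carries the video title and source in its metadata
    print(doc.metadata.get("title"), "->", doc.metadata.get("source"))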

Finally, to serve the model as an endpoint using Databricks’ Model Serving capability, we have to register an MLflow model to the Model Registry. We can quickly do this by packaging it as a custom PyFunc model and saving it as an MLflow model. The snippet of code that does this is shown below. The model can then be moved into the Model Registry either manually or programmatically.

import json
import cloudpickle
import langchain
import chromadb
import mlflow
import mlflow.pyfunc
import pandas as pd
import numpy as np
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from sys import version_info

# save artifacts in this path as model artifacts
artifacts = {
    "chroma_path": "./.chroma/",
}

# Generate env info to save with the model
PYTHON_VERSION = "{major}.{minor}.{micro}".format(
    major=version_info.major, minor=version_info.minor, micro=version_info.micro
)
conda_env = {
    "channels": ["defaults"],
    "dependencies": [
        "python={}".format(PYTHON_VERSION),
        "pip",
        {
            "pip": [
                "mlflow=={}".format(mlflow.__version__),
                "cloudpickle=={}".format(cloudpickle.__version__),
                "chromadb=={}".format(chromadb.__version__),
                "langchain=={}".format(langchain.__version__),
                "tiktoken==0.3.3",
            ],
        },
    ],
    "name": "yt_search_rec_model",
}

# custom python function model for sim search
class VideoSemSearch(mlflow.pyfunc.PythonModel):
    def __init__(self):
        self.embed_db = None

    def load_context(self, context):
        # Re-hydrate the persisted Chroma store from the packaged artifact path
        self.embed_db = Chroma(
            persist_directory=context.artifacts["chroma_path"],
            embedding_function=HuggingFaceEmbeddings(),
        )
        return self.embed_db

    def predict(self, context, model_input):
        if self.embed_db is None:
            self.embed_db = self.load_context(context)
        query = model_input[0]  # the query string comes in as the first element/column
        res = self.embed_db.similarity_search(query)
        return json.dumps([res[i].dict() for i in range(len(res))])

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        artifacts=artifacts,
        python_model=VideoSemSearch(),
        conda_env=conda_env,
    )
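If we’d rather push the logged model into the Model Registry programmatically instead of through the UI, it’s a one-liner against the run we just logged. A minimal sketch, assuming a registry name of yt_video_sem_search:

# Register the logged model under an (assumed) registry name
run_id = mlflow.last_active_run().info.run_id
mlflow.register_model(model_uri=f"runs:/{run_id}/model", name="yt_video_sem_search")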

Registering a model to the Model Registry is an essential step as this enables it to be served using Databricks Model Serving. This process makes the model readily available via an API endpoint that is highly scalable, easily manageable, and highly available. Key model metrics such as latency, queries per second (QPS), and error rates are readily available via the Databricks Serving endpoint page for easy tracking and monitoring. With the completion of this process, the model is now fully operational and can push video recommendations to upstream clients. Shown below is an example request made to the model endpoint using a REST API client called Insomnia.

Sample Request to the deployed endpoint
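If you’d rather hit the endpoint from code than from a REST client like Insomnia, a rough sketch with Python’s requests looks like the following. The workspace URL, endpoint name, and request body are placeholders, and the exact payload shape depends on the scoring protocol and model signature you logged.

import os
import requests

# Placeholder workspace URL and serving endpoint name
url = "https://<your-workspace>.cloud.databricks.com/serving-endpoints/yt-video-search/invocations"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# One plausible body shape; predict() above reads the query from the first element of the input
payload = {"inputs": ["DLT best practices"]}

resp = requests.post(url, headers=headers, json=payload)
print(resp.json())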

That’s it! Our idea was rapidly executed with the help of the Databricks platform’s end-to-end tooling and the open-source flexibility of Python. As more tools, such as LangChain and LlamaIndex, continue to unlock useful LLM applications for organizations and developers globally, it is critical to question traditional assumptions about what the right input modality to begin with might be, or which input modality carries the best-quality information, for a given project.

Regarding this particular use case, we have reached a natural stopping point. While extending this concept further and feeding the results into another LLM could have produced a generative example, it would have been little more than a vanity effort, yielding little real value. Instead, we have focused on creating a useful tool that we will build upon in the coming days. Until then, take good care, and I hope you found this piece useful. Later!

P.S.: The code shown in this blog post is here.

To connect with me, please reach out to me on LinkedIn
