Unlocking In-Video Search: How LLMs and VectorDBs Transform Video Platforms
This post delves deeply into how you can implement in-video search to enhance the search functionality and overall experience of your video sharing platform.
Pre-requisites
Curiosity to Learn
Assume you own a video sharing platform that is doing very well, but in a recent customer survey you discovered that you are losing money because customers cannot find what they want quickly enough. Customers also feel disconnected from the content.
Feedback from Customers:
I wanted to search for a video with a car blast because I was trying to remember a movie that had a car blast scene. I don’t remember the movie’s name, and it took me at least half an hour to find it.
I want to watch a movie but don’t know what to watch. I want the platform to be interactive enough that when I hover over a movie, it shows me a clip from that movie which suits my interest.
In this post, I'll explore how integrating Large Language Models (LLMs) can significantly enhance our video sharing platform. LLMs hold the key to elevating both user experience and organizational innovation, paving the way for next-level improvements and groundbreaking advancements. Stay tuned to discover how these powerful tools can revolutionize content discovery and engagement on our platform. For those wondering “What is Innovation?”, I highly suggest reading this LinkedIn article from Richard Devaul: Innovation isn’t what you think it is
Let's explore how we can address our customer challenges using this technology.
To simplify, here are the core issues we're tackling:
Inefficient Query Mechanisms in Search Technology: Enhancing the effectiveness of how users search for content.
Platform Engagement: Improving the level of interaction and connection users feel with the content.
Inefficient Query Mechanisms in Search Technology
Let’s try to understand:
Why is this an issue?
How can it be solved?
Why is this an issue?
Inefficient query mechanisms can lead to several problems:
Content Bias: Users may encounter content that is skewed based on metadata rather than relevance.
Increased Effort: Users may need to put in 1.25 times more effort to find the content they want.
Reduced Engagement: A decline in Daily Active Users (DAU) as users become frustrated with the search experience.
How can it be solved?
We will split the solution into two components, as shown in Fig. 2.
Understanding Media
With the current mechanism, the platform does not have sufficient information about the media. The platform is only aware of the metadata shared by publishers, as shown in Fig 3.
We aim to enhance the platform's ability to understand the underlying content of media to significantly improve customer experience. This will involve integrating Large Language Models (LLMs) and adding new components to our system architecture.
Here's the overall approach (as illustrated in Fig 4):
Segment the Video into Scenes: Divide the video into distinct scenes.
Extract Frames from Scenes: Sample frames from each scene for detailed analysis.
Generate Embeddings from Frames: Process these frames to create content embeddings.
Index in Storage: Store the embeddings in a searchable index for efficient retrieval.
At a high level, we will split the video into scenes and then sample frames from each scene. One thing to be careful about is how many frames we sample per scene, since this decision involves a trade-off between processing cost and accuracy: sampling five frames per scene instead of one, for example, roughly quintuples the processing cost but gives the model a much richer view of each scene.
Cost of Processing ∝ Accuracy
These sampled frames act as the basic unit for understanding the data. They will serve as input to the LLM, producing a richer representation of the content that can be used to understand the video.
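Below is a minimal sketch of the segmentation and sampling steps. It assumes the PySceneDetect and OpenCV libraries purely for illustration; both the library choice and FRAMES_PER_SCENE are assumptions, with the latter directly controlling the cost/accuracy trade-off discussed above.

```python
# Sketch: split a video into scenes, then sample a few frames per scene.
# Assumes the `scenedetect` and `opencv-python` packages; library choice is
# illustrative, not prescriptive.
import cv2
from scenedetect import detect, ContentDetector

FRAMES_PER_SCENE = 3  # tune this: more frames = higher cost, better coverage

def sample_frames(video_path: str):
    # Detect scene boundaries (list of (start, end) timecodes).
    scenes = detect(video_path, ContentDetector())
    cap = cv2.VideoCapture(video_path)
    samples = []  # (scene_id, frame_image) pairs
    for scene_id, (start, end) in enumerate(scenes):
        first, last = start.get_frames(), end.get_frames()
        step = max(1, (last - first) // FRAMES_PER_SCENE)
        for frame_idx in range(first, last, step)[:FRAMES_PER_SCENE]:
            cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
            ok, frame = cap.read()
            if ok:
                samples.append((scene_id, frame))
    cap.release()
    return samples
```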
How can LLMs help here?
LLMs are trained on vast amounts of text data and use deep learning techniques, particularly neural networks, to learn the complexities and nuances of language.
There are many different LLMs available, each trained for a specific purpose; Claude and Llama, to name a few. In our case, we are looking for a model that can understand both text and images. These kinds of models are called multimodal, as they can take different forms of input.
We will be utilizing the CLIP model, developed by OpenAI in 2021. This advanced model is designed to understand both images and text within a unified embedding space, making it highly effective for analyzing and correlating visual and textual information.
How CLIP Works
Dual Encoder Model:
Image Encoder: Typically a Vision Transformer (ViT) or a Convolutional Neural Network (CNN) that processes images and generates image embeddings.
Text Encoder: Usually a Transformer-based model that processes text and generates text embeddings.
Shared Embedding Space:
Both the image and text encoders are trained to map their respective inputs into a shared embedding space.
In this shared space, similar images and texts (e.g., an image of a cat and the text "a photo of a cat") are located close to each other, while dissimilar ones are far apart.
Contrastive Learning Objective:
During training, CLIP uses a contrastive learning objective. It learns by associating each image with its corresponding textual description (positive pairs) and distinguishing it from other descriptions (negative pairs).
The loss function encourages the model to minimize the distance between embeddings of positive pairs and maximize the distance between embeddings of negative pairs.
CLIP leverages contrastive learning to align visual and textual representations, improving its ability to distinguish between relevant and irrelevant matches.
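To make the shared embedding space concrete, here is a minimal sketch of encoding a sampled frame and two candidate descriptions with CLIP via the Hugging Face transformers library and comparing them with cosine similarity; the checkpoint name, file name, and texts are illustrative assumptions.

```python
# Sketch: embed an image and candidate texts with CLIP and compare them in the
# shared embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("sampled_frame.jpg")          # a frame sampled from a scene
texts = ["a car explosion", "a quiet dinner scene"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Normalize and compute cosine similarity: higher = closer in the shared space.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T
print(similarity)  # the "car explosion" text should score higher for a blast frame
```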
Contrastive Learning
The primary goal of contrastive learning is to learn an embedding space where similar samples are closer to each other and dissimilar samples are farther apart. This approach is especially useful for tasks where the model needs to understand the relationships between different types of data, such as images and text. Netflix has also written in its tech blog about using contrastive learning to improve search. More details are available here
We can improve the CLIP model by fine-tuning it on {image, text} pairs.
In the joint embedding space, similar data points are mapped to nearby vectors and dissimilar data points to distant vectors.
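As a rough illustration, here is what one fine-tuning step on {image, text} pairs could look like with CLIP's symmetric contrastive objective; the batch format, learning rate, and the assumption that batches come from a pre-processed DataLoader are placeholders, not a production recipe.

```python
# Sketch: one fine-tuning step on a batch of {image, text} pairs using CLIP's
# symmetric contrastive loss. The batch is assumed to hold pre-processed
# pixel_values / input_ids / attention_mask tensors from a hypothetical loader.
import torch
import torch.nn.functional as F
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(batch):
    img = model.get_image_features(pixel_values=batch["pixel_values"])
    txt = model.get_text_features(input_ids=batch["input_ids"],
                                  attention_mask=batch["attention_mask"])
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)

    logits = img @ txt.T * model.logit_scale.exp()   # pairwise similarities
    labels = torch.arange(logits.size(0))            # matching pairs lie on the diagonal
    # Symmetric loss: image-to-text and text-to-image directions.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```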
We have solved our first problem, understanding media, by using LLMs to represent data such as images and text in a common space.
Improving Search
Improving search functionality involves more than just implementing Large Language Models (LLMs). It also requires an efficient database to manage the data we need to store. Imagine a high-dimensional space (often pictured in 3D) where vectors are positioned based on their properties.
Vector databases are ideal for this purpose. Many cloud providers offer services to handle such data representations. I'll keep it abstract here, so you can explore options beyond specific cloud providers and enhance the system as needed.
Our approach is straightforward, but there is room for optimization to boost system performance.
As shown in Fig 6, once we have fine-tuned our embedding model, we'll deploy it with an endpoint. We’ll store the embeddings generated by the CLIP model in our VectorDB. This setup addresses the challenge of storing complex vector representations of images and text in a unified space, ensuring that our system can efficiently manage queries at scale. Without this infrastructure, handling queries effectively on a large scale would be unmanageable.
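To keep the storage layer concrete while staying provider-agnostic, here is a small sketch using FAISS as a stand-in for whatever VectorDB you choose; the metadata list is a simplified assumption for the mapping store discussed below.

```python
# Sketch: index scene embeddings for similarity search. FAISS stands in for a
# managed VectorDB; `scene_metadata` plays the role of the metadata store.
import faiss
import numpy as np

DIM = 512  # CLIP ViT-B/32 embedding size

index = faiss.IndexFlatIP(DIM)   # inner product == cosine on normalized vectors
scene_metadata = []              # position i in the index -> scene metadata

def index_scene(scene_id: str, video_id: str, embedding: np.ndarray):
    vec = embedding.astype("float32").reshape(1, DIM)
    faiss.normalize_L2(vec)
    index.add(vec)
    scene_metadata.append({"scene_id": scene_id, "video_id": video_id})
```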
Let's look at the complete picture (a code sketch follows the steps below):
Query Transformation: The incoming query is first converted into embeddings.
Query to Vector DB: This embedding is then used to query the VectorDB. The results returned are embeddings along with their associated metadata.
Reverse Mapping: Depending on the implementation, the returned embeddings are reverse-mapped to their corresponding scene IDs.
Scene to Video ID: Each scene ID is linked to a video ID.
Return Results: The final results are compiled, including all metadata as outlined in Fig 3, and are returned to the user.
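Putting these steps together, a rough sketch of the query path could look like the following; embed_text is a hypothetical helper wrapping CLIP's text encoder, and index and scene_metadata are reused from the indexing sketch above.

```python
# Sketch of the query path: text query -> embedding -> VectorDB lookup ->
# scene/video mapping -> results. `embed_text` is a hypothetical helper around
# CLIP's text encoder; `index` and `scene_metadata` come from the indexing sketch.
import faiss

def search(query: str, top_k: int = 5):
    # 1. Query transformation: text -> embedding.
    q = embed_text(query).astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)

    # 2. Query the vector index.
    scores, ids = index.search(q, top_k)

    results = []
    for score, idx in zip(scores[0], ids[0]):
        meta = scene_metadata[idx]          # 3. Reverse-map embedding -> scene ID
        results.append({
            "video_id": meta["video_id"],   # 4. Scene ID -> video ID
            "scene_id": meta["scene_id"],
            "score": float(score),          # 5. Returned alongside the metadata
        })
    return results
```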
We can integrate another database to handle these mappings as depicted in Fig 7. I’ll leave this part open for exploration, as it offers flexibility for implementation.
Additionally, to maintain system performance without impacting users, you can run offline jobs or use a queue mechanism to index new media content in the VectorDB using the embedding model. This ensures that the database is updated efficiently while keeping user experience smooth.
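One way to wire this up is a queue-driven background worker, sketched below; the queue client (receive/ack) is hypothetical, and sample_frames, embed_image, and index_scene refer back to the earlier sketches (embed_image would wrap CLIP's image encoder).

```python
# Sketch: a background worker that indexes newly published media off the
# user-facing path. `queue` is a hypothetical message-queue client.
def indexing_worker(queue):
    while True:
        job = queue.receive()            # blocks until a new video is published
        if job is None:
            continue
        video_id, video_path = job["video_id"], job["video_path"]
        for scene_id, frame in sample_frames(video_path):
            embedding = embed_image(frame)
            index_scene(f"{video_id}:{scene_id}", video_id, embedding)
        queue.ack(job)                   # mark the job done so it is not retried
```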
Let's evaluate if we've addressed the issue of inefficient query mechanisms in search technology.
With our platform's new setup, we now offer advanced search capabilities that consider both metadata and internal media details, providing more relevant and accurate results. Congratulations!
This is just a glimpse of what can be achieved with cloud technologies, system re-architecture, and model fine-tuning. The potential for further innovation and improvement is immense.
I hope you found today’s insights valuable. Stay tuned for our next post, where we’ll delve into Platform Engagement in detail.
If you enjoyed this post, please show your support by sharing and giving it a heart!
Check out this Medium article from Netflix Engineering, where they explore similar advancements and efforts in this direction.
Follow me on LinkedIn for more tech bytes, deep dives and random content related to Software Engineering :D🚀