What You'll Learn
- Create an advanced question-answering system that interacts with multimodal data, including video.
- Understand the concept of multimodal semantic space and its significance in AI.
- Differentiate between traditional RAG and multimodal RAG systems, focusing on model integration complexities.
About This Course
This course, developed in collaboration with Intel, guides you in building an interactive system for querying and understanding video content through
multimodal AI. You'll learn to implement a multimodal RAG system that uses multimodal embedding models to embed images and captions in a shared
semantic space, and to leverage that setup for retrieval from text prompts.
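To make "embedding images and captions in a semantic space" concrete, here is a minimal sketch using the BridgeTower model introduced later in the course. It assumes the Hugging Face transformers library and the public BridgeTower/bridgetower-large-itm-mlm-itc checkpoint; the frame path and caption are illustrative.

```python
# Minimal sketch: joint image-caption embedding with BridgeTower.
# Assumes the Hugging Face checkpoint below; file names are illustrative.
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

checkpoint = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(checkpoint)
model = BridgeTowerForContrastiveLearning.from_pretrained(checkpoint)

image = Image.open("frame_0001.jpg")        # one frame extracted from a video
caption = "A presenter points at a slide."  # its caption or transcript snippet

inputs = processor(images=image, text=caption, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# cross_embeds fuses the visual and textual streams into a single vector;
# this joint embedding is what the RAG system indexes for retrieval.
joint_embedding = outputs.cross_embeds.squeeze(0)
print(joint_embedding.shape)
```

Because the model is trained contrastively, its text, image, and cross-modal embeddings share one space, which is what later lets a text-only query be compared against stored image-caption vectors.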
Key Technologies and Concepts
- Multimodal Embedding Models: BridgeTower for creating joint embeddings of image-caption pairs.
- Video Processing: Whisper model for transcription, LVLMs for captioning (see the preprocessing sketch after this list).
- Vector Stores: LanceDB for efficient storage and retrieval of high-dimensional vectors.
- Retrieval Systems: LangChain for building a retrieval pipeline.
- Large Vision Language Models (LVLMs): LLaVA 1.5 for advanced visual-textual understanding.
- APIs and Cloud Infrastructure: PredictionGuard APIs, Intel Gaudi AI accelerators, Intel Developer Cloud.
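The video-processing item above can be sketched with two widely used open-source tools. This assumes the openai-whisper and opencv-python packages; the file name and one-frame-per-second sampling rate are illustrative choices, not the course's exact settings.

```python
# Minimal sketch: transcribe a video with Whisper and sample frames with OpenCV.
# Assumes `pip install openai-whisper opencv-python`; paths are illustrative.
import cv2
import whisper

# 1) Transcription: Whisper returns timestamped segments that can later be
#    aligned with the extracted frames.
asr = whisper.load_model("small")
result = asr.transcribe("lecture.mp4")
for seg in result["segments"]:
    print(f"[{seg['start']:6.1f}s - {seg['end']:6.1f}s] {seg['text']}")

# 2) Frame extraction: keep roughly one frame per second for captioning
#    and embedding.
video = cv2.VideoCapture("lecture.mp4")
step = max(int(video.get(cv2.CAP_PROP_FPS)), 1)  # frames between saved samples
frame_idx = saved = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    if frame_idx % step == 0:
        cv2.imwrite(f"frame_{saved:04d}.jpg", frame)
        saved += 1
    frame_idx += 1
video.release()
```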
Hands-on Project
Throughout the course, you’ll build a complete multimodal RAG system that:
- Processes and embeds video content (frames, transcripts, and captions).
- Stores multimodal data in a vector database.
- Retrieves relevant video segments based on text queries (storage and retrieval are sketched after this list).
- Generates contextual responses using LVLMs.
- Maintains multi-turn conversations about video content.
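The storage and retrieval steps above can be sketched with LanceDB's Python API. Here embed_pair and embed_text are hypothetical stand-ins for the BridgeTower calls shown earlier, and the table name and metadata fields are illustrative rather than the course's exact schema.

```python
# Minimal sketch: index joint embeddings in LanceDB and query them with text.
# embed_pair / embed_text are hypothetical stand-ins for BridgeTower calls;
# the schema is illustrative.
import lancedb

db = lancedb.connect("./video_index")

# One record per frame-caption pair; the metadata recovers the video segment.
records = [
    {
        "vector": embed_pair("frame_0001.jpg", "A presenter points at a slide."),
        "frame_path": "frame_0001.jpg",
        "caption": "A presenter points at a slide.",
        "start_sec": 12.0,
    },
    # ...one record per extracted frame
]
table = db.create_table("video_segments", data=records)

# Embed the user's question into the same space, then nearest-neighbour search.
query = embed_text("Where does the speaker explain attention?")
for hit in table.search(query).limit(3).to_list():
    print(hit["start_sec"], hit["caption"])
```

The retrieved records (frame path, caption, timestamp) become the context an LVLM reasons over when generating its answer.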
Course Outline
- Introduction: Overview of multimodal RAG systems and interactive video chat capabilities.
- Interactive Demo and Multimodal RAG System Architecture: Introduction to the system architecture with a Gradio app demo.
- Multimodal Embeddings: Understanding and creating joint embeddings with the BridgeTower model.
- Preprocessing Videos for Multimodal RAG: Extracting frames, generating transcripts with Whisper, and captioning using LVLMs.
- Multimodal Retrieval from Vector Stores: Implementing retrieval using LanceDB and LangChain.
- Large Vision-Language Models (LVLMs): Understanding LVLM architecture and implementing visual question answering (see the sketch after this outline).
- Multimodal RAG with Multimodal LangChain: Building a multimodal RAG pipeline using LangChain.
- Conclusion: Summary and future directions.
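As a taste of the LVLM module, here is a minimal visual question-answering sketch with LLaVA 1.5. The course itself serves models through PredictionGuard APIs on Intel Gaudi accelerators; the local Hugging Face checkpoint and prompt format below are assumptions for illustration only.

```python
# Minimal sketch: visual question answering with LLaVA 1.5 via transformers.
# The checkpoint and prompt template are assumptions, not the course's API.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("frame_0001.jpg")  # a frame returned by retrieval
prompt = "USER: <image>\nWhat is the presenter demonstrating here? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

In the course's pipeline, the same question plus the retrieved frame and caption are sent to the LVLM, which is what enables multi-turn conversations grounded in the video.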
Who Should Join?
This course is for anyone with intermediate Python programming knowledge, familiarity with machine learning concepts and deep learning
frameworks, and a basic understanding of natural language processing and computer vision.