# 11 - Multimodal Learning

This module covers vision-language models, image generation, and audio models.

## Module Structure

```
11-multimodal-learning/
├── 01-vision-language/         # Vision-Language
├── 02-image-generation/        # Image Generation
└── 03-audio-models/            # Audio Models
```

## Core Content

### 01 - Vision-Language Models

| Model | Function | Application |
|-------|----------|-------------|
| CLIP  | Contrastive learning | Zero-shot classification |
| BLIP  | Image-text understanding | Image captioning, VQA |
| LLaVA | Multimodal dialogue | Visual QA |
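
CLIP performs zero-shot classification by embedding the image and each candidate text prompt into a shared space, then taking a softmax over their cosine similarities. A minimal NumPy sketch of that scoring step (the embeddings below are random stand-ins for real CLIP encoder outputs, and the temperature value is illustrative):

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities, CLIP-style zero-shot scoring."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (txt @ img) / temperature       # cosine similarity, scaled
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

# Random stand-ins for encoder outputs (real CLIP embeddings are e.g. 512-d).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(3, 512))        # one row per candidate label
probs = zero_shot_scores(image_emb, text_embs)
```

In a real pipeline the rows of `text_embs` would come from encoding prompts like "a photo of a cat", and the highest-probability row is the predicted label.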

### 02 - Image Generation

| Technology | Description |
|------------|-------------|
| VAE        | Variational autoencoder |
| Diffusion  | Denoising diffusion models |
| Stable Diffusion | Latent-space diffusion |
| ControlNet | Conditional control |
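
Denoising diffusion models train a network to undo a fixed forward process that gradually adds Gaussian noise. Because the forward process has a closed form, any step x_t can be sampled directly from x_0. A sketch of that forward step with a linear beta schedule (the schedule values follow common DDPM defaults but are assumptions here):

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = sqrt(abar_t)*x0 + sqrt(1 - abar_t)*eps."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Linear beta schedule; alpha_bar_t = prod_{s<=t} (1 - beta_s).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))                 # toy "image"
xt, eps = forward_diffuse(x0, t=T - 1, alpha_bar=alpha_bar, rng=rng)
# At t = T-1, alpha_bar is tiny: x_t is essentially pure noise.
```

Training then minimizes the error between `eps` and the network's noise prediction from `xt`; sampling runs the learned reverse process from pure noise back to an image.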

### 03 - Audio Models

| Model    | Function |
|----------|----------|
| Whisper  | Multilingual ASR |
| Tacotron | Text-to-speech |
| HiFi-GAN | Vocoder |
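
All three models operate on spectrogram representations rather than raw waveforms: Whisper's encoder consumes a log-mel spectrogram of 16 kHz audio. A minimal NumPy sketch of the STFT-and-log step of such a frontend (frame and hop sizes follow Whisper's 25 ms / 10 ms convention; the mel filterbank is omitted for brevity):

```python
import numpy as np

def stft_frames(signal, n_fft=400, hop=160):
    """Magnitude spectrogram from Hann-windowed frames (n_fft=400, hop=160
    at 16 kHz corresponds to 25 ms windows with a 10 ms hop)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))   # (n_frames, n_fft//2 + 1)

# One second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = stft_frames(np.sin(2 * np.pi * 440 * t))
log_spec = np.log10(np.maximum(spec, 1e-10))     # log compression, as in log-mel
```

With 40 Hz per FFT bin (16000 / 400), the 440 Hz tone shows up as a peak in bin 11 of every frame; a real frontend would then pool these bins through a mel filterbank before feeding the encoder.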

## Learning Path

CLIP → BLIP/LLaVA → VAE → Diffusion → Stable Diffusion → Whisper → TTS

Released under the MIT License.