# 11 - Multimodal Learning
This module covers vision-language models, image generation, and audio models.
## Module Structure
```
11-multimodal-learning/
├── 01-vision-language/    # Vision-language models
├── 02-image-generation/   # Image generation
└── 03-audio-models/       # Audio models
```

## Core Content
### 01 - Vision-Language Models
| Model | Function | Application |
|---|---|---|
| CLIP | Contrastive learning | Zero-shot classification |
| BLIP | Image-text understanding | Image captioning, VQA |
| LLaVA | Multimodal dialogue | Visual QA |
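CLIP's zero-shot classification works by embedding the image and one text prompt per class (e.g. "a photo of a cat") into a shared space, then taking a softmax over cosine similarities. A minimal NumPy sketch of that scoring step, using toy 4-d vectors in place of CLIP's real embeddings:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Score an image embedding against one text embedding per class."""
    # L2-normalize so dot products become cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature
    # Softmax over classes yields a probability per prompt
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy embeddings standing in for CLIP's text encoder outputs
cat_text = np.array([1.0, 0.0, 0.0, 0.0])   # "a photo of a cat"
dog_text = np.array([0.0, 1.0, 0.0, 0.0])   # "a photo of a dog"
image = np.array([0.9, 0.1, 0.0, 0.0])      # image embedding, closer to "cat"

probs = zero_shot_classify(image, np.stack([cat_text, dog_text]))
print(probs)  # highest probability on the "cat" class
```

The temperature (0.07 here) matches the learned logit scale commonly cited for CLIP; in practice you would obtain the embeddings from a pretrained model rather than construct them by hand.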
### 02 - Image Generation
| Technology | Description |
|---|---|
| VAE | Variational autoencoder |
| Diffusion | Denoising diffusion models (DDPM): iteratively denoise Gaussian noise |
| Stable Diffusion | Diffusion run in a VAE latent space for efficiency |
| ControlNet | Conditional control of diffusion (edges, pose, depth) |
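Diffusion models rely on a closed-form forward process: given a noise schedule β₁…β_T, a noisy sample at any step t is x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε with ᾱ_t = ∏(1−β_s). A NumPy sketch (the linear schedule values follow the DDPM paper; the vector x₀ is a stand-in for an image or latent):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    alpha_bar = np.cumprod(1.0 - betas)[t]     # cumulative signal retention
    eps = rng.standard_normal(x0.shape)        # fresh Gaussian noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # linear schedule, T = 1000
x0 = rng.standard_normal(64)           # stand-in for an image/latent vector

x_mid = forward_diffuse(x0, 500, betas, rng)
x_end = forward_diffuse(x0, 999, betas, rng)
alpha_bar_final = np.cumprod(1.0 - betas)[999]
print(alpha_bar_final)  # near zero: x_T is almost pure noise
```

Training teaches a network to predict ε from x_t; sampling then reverses the process step by step. Stable Diffusion applies the same machinery to VAE latents instead of pixels.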
### 03 - Audio Models
| Model | Function |
|---|---|
| Whisper | Multilingual ASR |
| Tacotron | Text-to-spectrogram (TTS acoustic model) |
| HiFi-GAN | Vocoder |
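These audio models operate on mel spectrograms rather than raw waveforms: Whisper's encoder consumes log-mel features, and Tacotron predicts mel frames that a vocoder like HiFi-GAN turns into audio. The mel scale itself is a simple perceptual frequency warp; a sketch of the standard (HTK-style) conversion:

```python
import math

def hz_to_mel(f):
    """HTK mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The scale is calibrated so that 1000 Hz lands near 1000 mel
print(round(hz_to_mel(1000.0)))  # 1000
```

A mel filterbank spaces triangular filters evenly on this scale, so low frequencies (where hearing is more discriminating) get finer resolution than high ones.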
## Learning Path
CLIP → BLIP/LLaVA → VAE → Diffusion → Stable Diffusion → Whisper → TTS