Efficiently Serving LLMs - Syllabus
Introduction
Overview of the challenges of serving LLM applications efficiently
Course objectives and foundational concepts
Text Generation
Understanding how LLMs generate text token by token
Code examples demonstrating text generation processes
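To make the token-by-token loop concrete, here is a minimal greedy-decoding sketch. It assumes the Hugging Face transformers library and the public gpt2 checkpoint; both are illustrative choices, not course requirements.

```python
# Minimal sketch of autoregressive (token-by-token) greedy decoding,
# assuming the transformers library and the public gpt2 checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):
        logits = model(input_ids).logits                 # (batch, seq_len, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # append token

print(tokenizer.decode(input_ids[0]))
```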
Batching
Implementing batching techniques to serve multiple users
Trade-offs between per-request latency and overall throughput as user load grows
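As an illustration of the throughput side of that trade-off, the sketch below pads several users' prompts to a common length and decodes all of them in a single call; larger batches raise throughput, but every request waits for the whole batch to finish. It again assumes transformers and gpt2.

```python
# Minimal sketch of batched generation: several prompts are left-padded to
# the same length and decoded together, one forward pass per step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["The weather today is", "My favorite book is", "In machine learning,"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

# One generate() call serves all three requests at once.
outputs = model.generate(**batch, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```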
Continuous Batching
Optimizing GPU utilization through continuous batching
Code examples for efficient batch processing
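Continuous batching is easier to see as a control loop than as model code. The toy sketch below uses a stubbed decode_step (a placeholder, not a real model call) to show the scheduling idea: finished sequences leave the batch immediately and queued requests join between decode steps, so slots are never held idle waiting for the longest request.

```python
# Toy sketch of the continuous-batching control loop. decode_step is a
# hypothetical stub; real servers run one batched forward pass per step
# around the model's KV cache.
from collections import deque
import random

EOS = "<eos>"
MAX_BATCH = 4

def decode_step(request):
    """Stub: produce one token for a request (20% chance of finishing)."""
    return EOS if random.random() < 0.2 else "tok"

waiting = deque(f"req-{i}" for i in range(10))   # queued requests
running = {}                                      # request -> generated tokens

while waiting or running:
    # Admit new requests up to the batch limit between decode steps.
    while waiting and len(running) < MAX_BATCH:
        running[waiting.popleft()] = []

    # One decode step for every running request.
    for req in list(running):
        token = decode_step(req)
        running[req].append(token)
        if token == EOS:                          # retire finished requests
            print(req, "finished after", len(running[req]), "tokens")
            del running[req]
```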
Quantization
Reducing memory footprint and latency by quantizing model weights
Examples of quantization and its impact on latency
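A worked example of the underlying idea, using only PyTorch: 8-bit affine (zero-point) quantization maps a float tensor to uint8 via a scale and zero point, cutting memory roughly 4x versus float32 at the cost of a small reconstruction error, which is what surfaces as quality loss.

```python
# Minimal sketch of 8-bit affine (zero-point) quantization of a weight
# tensor in plain PyTorch: quantize to uint8, dequantize for compute.
import torch

w = torch.randn(4, 4)

# Derive scale and zero point from the tensor's value range.
scale = (w.max() - w.min()) / 255
zero_point = (-w.min() / scale).round()

w_q = torch.clamp((w / scale).round() + zero_point, 0, 255).to(torch.uint8)
w_dq = (w_q.float() - zero_point) * scale   # dequantized approximation

print("max abs error:", (w - w_dq).abs().max().item())
```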
Low-Rank Adaptation (LoRA)
Introduction to LoRA for efficient model serving
Using LoRA adapters to fine-tune and serve many task-specific variants of one base model
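The sketch below shows the core LoRA construction in plain PyTorch: the frozen base weight is augmented with a trainable low-rank update B @ A, so fine-tuning touches only r * (d_in + d_out) parameters instead of d_in * d_out. The names and hyperparameters (r, alpha) are illustrative.

```python
# Minimal sketch of a LoRA-adapted linear layer in plain PyTorch.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)               # freeze the base layer
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # Frozen base projection plus the trainable low-rank adapter path.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
print(layer(torch.randn(2, 768)).shape)   # torch.Size([2, 768])
```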
Multi-LoRA Inference
Serving multiple LoRA adapters to various users
Batching techniques for scalable multi-LoRA inference
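One common way to batch across adapters is to gather each request's A and B matrices by index, so the whole batch shares a single pass through the frozen base layer while each row takes its own low-rank path. The toy sketch below illustrates that gathered formulation; the shapes and adapter IDs are made up for the example.

```python
# Toy sketch of multi-LoRA batched inference: one base matmul for the
# whole batch, per-request adapters selected by indexing stacked weights.
import torch
import torch.nn as nn

d, r, n_adapters = 768, 8, 3
base = nn.Linear(d, d)
A = torch.randn(n_adapters, r, d) * 0.01   # stacked adapter A matrices
B = torch.randn(n_adapters, d, r) * 0.01   # stacked adapter B matrices

x = torch.randn(4, d)                       # 4 requests in one batch
adapter_ids = torch.tensor([0, 2, 1, 0])    # adapter chosen per request

out = base(x)                               # shared base-layer pass
Ax = torch.einsum("bd,brd->br", x, A[adapter_ids])        # (batch, r)
out = out + torch.einsum("br,bdr->bd", Ax, B[adapter_ids])
print(out.shape)                            # torch.Size([4, 768])
```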
LoRAX
Using Predibase's LoRAX framework for LLM inference
Real-world application of optimization techniques
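Assuming a LoRAX server is already running locally, a request can be routed to a specific adapter with the lorax-client Python package, as sketched below; the endpoint URL and adapter ID are placeholders, not values from the course.

```python
# Hedged sketch of querying a running LoRAX server (pip install lorax-client).
# Passing adapter_id routes the request through one of many LoRA adapters
# hot-loaded onto the same base model.
from lorax import Client

client = Client("http://127.0.0.1:8080")    # assumes a local LoRAX deployment

response = client.generate(
    "What is the capital of France?",
    adapter_id="my-org/customer-support-adapter",   # hypothetical adapter
    max_new_tokens=32,
)
print(response.generated_text)
```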
Conclusion
Recap of efficient serving techniques for LLMs
Best practices and future directions