QElight - Quality Education
Preprocessing Unstructured Data for LLM Applications - Syllabus
Introduction
Introduction to preprocessing unstructured data for LLM applications
Course objectives and key concepts
Overview of LLM Data Preprocessing
Understanding data preprocessing for retrieval-augmented generation (RAG)
Identifying diverse unstructured data sources
Normalizing the Content
Extracting and normalizing data from various document formats
Converting content to a common JSON format
Metadata Extraction and Chunking
Enriching content with metadata for better search results
Chunking content for improved retrieval
Preprocessing PDFs and Images
Applying layout detection and vision transformers
Extracting data from complex PDF and image structures
Extracting Tables
Techniques for handling and extracting tabular data
Transforming tables for use in LLM applications
Build Your Own RAG Bot
Building a RAG bot to handle diverse document types
Ingesting documents such as PDFs, PowerPoint presentations, and Markdown files
Conclusion
Recap of preprocessing techniques and course highlights
Next steps for implementing data preprocessing in LLM applications
Appendix - Tips and Help
Additional resources and troubleshooting tips
Code examples for common challenges
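As one illustration of the kind of example this appendix refers to, the sketch below ties together three syllabus topics: a common JSON format, metadata enrichment, and chunking. It is a minimal sketch under stated assumptions, not the course's own code. The element schema (`type`, `text`, `metadata.source`), the type names `Title` and `NarrativeText`, and the function `chunk_by_title` are all hypothetical names chosen for illustration.

```python
import json


def chunk_by_title(elements, max_chars=500):
    """Group normalized elements into chunks.

    Assumed (hypothetical) element schema:
        {"type": "Title" | "NarrativeText", "text": str,
         "metadata": {"source": str}}

    A new chunk starts at every Title element, or when appending the
    next element would push the current chunk past max_chars.
    """
    chunks = []
    section = None  # most recent title seen, stored as chunk metadata
    for el in elements:
        if el["type"] == "Title":
            section = el["text"]
            start_new = True
        else:
            start_new = (
                not chunks
                or len(chunks[-1]["text"]) + 1 + len(el["text"]) > max_chars
            )
        if start_new:
            chunks.append({
                "text": el["text"],
                "metadata": {
                    "section": section,
                    "source": el["metadata"]["source"],
                },
            })
        else:
            # Same section and still under the size limit: extend the chunk.
            chunks[-1]["text"] += "\n" + el["text"]
    return chunks


if __name__ == "__main__":
    # Hand-written sample elements standing in for extracted document content.
    sample = [
        {"type": "Title", "text": "Overview",
         "metadata": {"source": "a.pdf"}},
        {"type": "NarrativeText", "text": "RAG retrieves context.",
         "metadata": {"source": "a.pdf"}},
        {"type": "Title", "text": "Tables",
         "metadata": {"source": "a.pdf"}},
        {"type": "NarrativeText", "text": "Tables need care.",
         "metadata": {"source": "a.pdf"}},
    ]
    print(json.dumps(chunk_by_title(sample), indent=2))
```

Keeping the section title in each chunk's metadata lets a retriever filter or rank by section as well as by text similarity. The `max_chars` limit is a simple character count here; a token-based limit would be a reasonable alternative.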