Preprocessing Unstructured Data for LLM Applications

Syllabus

Introduction
- Introduction to preprocessing unstructured data for LLM applications
- Course objectives and key concepts
Overview of LLM Data Preprocessing
- Understanding data preprocessing for retrieval-augmented generation (RAG)
- Identifying diverse unstructured data sources
Normalizing the Content
- Extracting and normalizing data from various document formats
- Converting content to a common JSON format
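
As a rough illustration of the normalization idea above, the sketch below converts a small Markdown string into a list of typed elements and serializes them to JSON. The `Element` record, its field names, and the `normalize_markdown` helper are hypothetical and chosen for this example only; they are not taken from the course materials or from any particular library.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Element:
    """Hypothetical common record shared by every source format."""
    type: str                                   # e.g. "Title" or "NarrativeText"
    text: str
    metadata: dict = field(default_factory=dict)


def normalize_markdown(source: str, filename: str) -> list[Element]:
    """Turn Markdown lines into elements; lines starting with '#' become titles."""
    elements = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith("#"):
            text = stripped.lstrip("#").strip()
            elements.append(Element("Title", text, {"filename": filename}))
        else:
            elements.append(Element("NarrativeText", stripped, {"filename": filename}))
    return elements


def to_json(elements: list[Element]) -> str:
    """Serialize the elements to one JSON array, whatever their source format."""
    return json.dumps([asdict(e) for e in elements], indent=2)


doc = "# Quarterly Report\n\nRevenue grew in the third quarter."
elements = normalize_markdown(doc, "report.md")
print(to_json(elements))
```

A normalizer for another format (HTML, PDF, PowerPoint) would presumably emit the same `Element` records, so everything downstream only has to understand one shape.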
Metadata Extraction and Chunking
- Enriching content with metadata for better search results
- Chunking content for improved retrieval
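
One possible way to combine the two topics above is to chunk at section titles and carry the section name along as metadata. The sketch below assumes elements are plain dictionaries with `type` and `text` keys; the `chunk_by_title` function, the `section` metadata key, and the `max_chars` limit are illustrative assumptions, not a prescribed method.

```python
def chunk_by_title(elements, max_chars=200):
    """Group elements into chunks, starting a new chunk at each title
    or when adding an element would exceed max_chars."""
    chunks = []
    current = None
    for el in elements:
        starts_section = el["type"] == "Title"
        too_long = (
            current is not None
            and len(current["text"]) + len(el["text"]) + 1 > max_chars
        )
        if current is None or starts_section or too_long:
            if current is not None:
                chunks.append(current)
            if starts_section:
                section = el["text"]
            elif current is not None:
                section = current["metadata"]["section"]  # carry the section forward
            else:
                section = None
            current = {"text": el["text"], "metadata": {"section": section}}
        else:
            current["text"] += "\n" + el["text"]
    if current is not None:
        chunks.append(current)
    return chunks


elements = [
    {"type": "Title", "text": "Setup"},
    {"type": "NarrativeText", "text": "Install the package."},
    {"type": "Title", "text": "Usage"},
    {"type": "NarrativeText", "text": "Run the script."},
]
chunks = chunk_by_title(elements)
for chunk in chunks:
    print(chunk["metadata"]["section"], "->", repr(chunk["text"]))
```

Keeping the section name on every chunk lets a retriever filter or rank by section even when a long section is split across several chunks.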
Preprocessing PDFs and Images
- Applying layout detection and vision transformers
- Extracting data from complex PDF and image structures
Extracting Tables
- Techniques for handling and extracting tabular data
- Transforming tables for use in LLM applications
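
To make the table transformation topic concrete, the sketch below shows two representations an application might pass to an LLM once a table has been extracted: an HTML string that preserves the grid, and one sentence per row for embedding. It assumes the table is already available as a header list plus rows; the function names are hypothetical, and the extraction step itself is not shown.

```python
from html import escape


def table_to_html(header, rows):
    """Render the table as compact HTML so row and column structure survives."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def table_to_sentences(header, rows):
    """Flatten each row into a 'column: value' sentence for embedding."""
    return ["; ".join(f"{h}: {c}" for h, c in zip(header, row)) for row in rows]


header = ["Region", "Revenue"]
rows = [["EMEA", 120], ["APAC", 95]]
print(table_to_html(header, rows))
print(table_to_sentences(header, rows))
```

The HTML form tends to suit prompts where the model must reason over the whole table, while the per-row sentences are easier to embed and retrieve individually.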
Build Your Own RAG Bot
- Building a RAG bot to handle diverse document types
- Ingesting documents such as PDFs, PowerPoint presentations, and Markdown files
Conclusion
- Recap of preprocessing techniques and course highlights
- Next steps for implementing data preprocessing in LLM applications
Appendix - Tips and Help
- Additional resources and troubleshooting tips
- Code examples for common challenges