Towards Synthetic Data Augmentation for Lecture Slide Understanding
SynSlideGen is an LLM-powered synthetic data generation pipeline that creates realistic, annotated lecture slides. Designed to support tasks such as slide element detection and retrieval, it leverages structured text to generate diverse layouts and semantically coherent content. SynSlideGen addresses the scarcity of annotated educational data through scalable automation.
Work under submission. Paper and code will be released soon!
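Since the code is not yet public, the snippet below is only a minimal, illustrative sketch of what a single LLM-guided generation step could look like: prompt an LLM for a structured slide specification, then derive element-level annotations from it. The OpenAI-style chat API, the prompt, the JSON schema, and the fixed layout are all assumptions for illustration, not the actual SynSlideGen implementation.

```python
# Illustrative sketch only: prompt an LLM for a slide spec and derive annotations.
# All names, prompts, and the layout below are hypothetical, not the released pipeline.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    'Generate one lecture slide on the topic "{topic}" as JSON with keys: '
    '"title" (string) and "bullets" (list of 3-5 short strings).'
)

def generate_slide_spec(topic: str) -> dict:
    """Ask the LLM for a structured slide specification."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(topic=topic)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

def annotate(spec: dict) -> list[dict]:
    """Derive element-level annotations from the spec using a fixed, hypothetical layout."""
    boxes = [{"label": "title", "text": spec["title"], "bbox": [40, 30, 880, 90]}]
    y = 150
    for bullet in spec["bullets"]:
        boxes.append({"label": "bullet", "text": bullet, "bbox": [60, y, 840, 40]})
        y += 60
    return boxes

if __name__ == "__main__":
    spec = generate_slide_spec("Gradient descent")
    print(json.dumps(annotate(spec), indent=2))
```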
Abstract
Lecture slide element detection and retrieval, key tasks in lecture slide understanding, have gained significant attention in the multi-modal research community. However, annotating large volumes of lecture slides for supervised training is labor-intensive and domain-specific. To address this, we propose SynLecSlideGen, a large language model (LLM)-guided synthetic lecture slide generation pipeline that produces high-quality, coherent slides closely resembling real lecture slides; we refer to the resulting dataset as SynSlide. We also create an evaluation benchmark, RealSlide, by manually annotating 1,050 real slides curated from lecture presentation decks. To evaluate the effectiveness of the SynSlide dataset, we perform few-shot transfer learning on real slides using models pre-trained on our synthetically generated slides. Experimental results show that few-shot transfer learning outperforms training only on the real dataset, especially in low-resource settings, demonstrating that synthetic slides can be a valuable pre-training resource in real-world scenarios where labeled data is scarce.
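As a rough illustration of the few-shot transfer recipe described above (pre-train a detector on SynSlide, then fine-tune on a handful of annotated RealSlide examples), the sketch below uses a torchvision Faster R-CNN as the detector. The detector choice, hyperparameters, `RealSlideDataset`, and checkpoint name are assumptions for illustration; the actual models and training setup will ship with the code release.

```python
# Sketch of few-shot transfer: start from weights pre-trained on synthetic slides,
# then fine-tune on only k_shot annotated real slides. Illustrative, not released code.
import torch
from torch.utils.data import DataLoader, Subset
from torchvision.models.detection import fasterrcnn_resnet50_fpn

def finetune_few_shot(model, real_dataset, k_shot=16, epochs=10, lr=1e-4, device="cuda"):
    """Fine-tune a detector pre-trained on synthetic slides using k_shot real slides."""
    few_shot_set = Subset(real_dataset, range(k_shot))  # assumes the dataset is pre-shuffled
    loader = DataLoader(few_shot_set, batch_size=2, shuffle=True,
                        collate_fn=lambda batch: tuple(zip(*batch)))
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, targets in loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            loss_dict = model(images, targets)  # detection losses in training mode
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

# Usage (hypothetical checkpoint and dataset names):
# model = fasterrcnn_resnet50_fpn(num_classes=NUM_SLIDE_ELEMENT_CLASSES)
# model.load_state_dict(torch.load("synslide_pretrained.pth"))
# model = finetune_few_shot(model, RealSlideDataset("RealSlide/"), k_shot=16)
```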