Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task
- authored by
- Judith Stanja, Anett Hoppe, Sarah Dannemann, Johannes Krugel
- Abstract
Synthetic data generation is one way to mitigate data scarcity. We investigate the generation of synthetic text data by prompting a pre-trained Large Language Model (LLM). The prompt design is based on reconstructive analyses of real student texts from biology education. Prompts were designed to generate positive and negative samples of intentional explanation patterns for the evolutionary adaptation of whales. We propose a mixed-methods approach for evaluating the dataset: investigating statistical commonalities and differences between synthetic and real data, and assessing frame-related aspects and correctness via an annotation study. Our preliminary findings show that the ranges of text lengths and numbers of sentences are similar for synthetic and real data. Results for the similarity and lexical complexity of the texts are mixed. The range of vocabulary sizes is similar in both datasets. We find that it is possible to generate data with indicators for the intentional patterns, though we also get false samples. Generating positive samples worked better than generating negative samples. Because of generation errors, the synthetic data must be cleaned before it can be used as training data. The inter-annotator agreement in the annotation study was high. The study revealed crucial differences in frame annotations for correct positive and negative samples. We identify open questions and further steps for future research.
- Organisation(s)
-
L3S Research Centre
Leibniz School of Education
Didactics of Electrical Engineering and Computer Science Section
- Type
- Conference abstract
- Pages
- 57
- No. of pages
- 1
- Publication date
- 23.08.2024
- Publication status
- Published
- Peer reviewed
- Yes
- Electronic version(s)
-
https://www.earli.org/assets/images/2024SIG6-7Conference_BookAbstract_Corrected.pdf (Access: Open)