Master's Thesis Proposal
Kwangmin Cho
(Advisor: Prof. Dimitri Mavris)
"Improving LLM Performance in Aerospace NER Task:
A Study on Data Augmentation and Fine-tuning Strategy"
Thursday, February 27
10:00 a.m.
Weber Space and Technology Building (SST II), Collaborative Visualization Environment (CoVE)
Abstract
As digital transformation progresses across sectors, systems engineering is transitioning from document-based practices to Model-Based Systems Engineering (MBSE). This shift is anticipated to improve traceability, streamline verification and validation processes, and enable better integration across system components. In alignment with this transition, there is a growing need for Named Entity Recognition (NER) methods capable of extracting machine-readable entities from requirements written in natural language (NL). NER locates spans of text and classifies them into target entity types. Among the various approaches to NER, fine-tuning Large Language Models (LLMs) has shown significant promise due to rapid advances in their capabilities.
However, fine-tuning LLMs for domain-specific tasks presents significant challenges, particularly in low-resource domains where open-source data is scarce, and in labor-intensive pre-processing tasks such as NER, which requires every token in the training data to be paired with corresponding entity labels. Aerospace requirements exemplify both challenges: their confidential nature restricts data availability, and NER tasks demand not only extensive annotation efforts but also expert-level knowledge. Consequently, the NER task for aerospace requirements engineering remains underexplored compared to other NLP and fine-tuning applications.
To address the challenges of low-resource domains and labor-intensive pre-processing, this study proposes a domain-entity adaptive data augmentation strategy aimed at improving the performance of fine-tuned LLMs without requiring extensive manual labeling efforts. This strategy employs Synonym Replacement (SR) and Label-wise Token Replacement (LwTR) adaptively, based on a detailed analysis of domain-specific entity characteristics. These characteristics are identified by evaluating entity-wise performance across varying replacement rates and augmentation methods. By tailoring the augmentation strategy to account for the desired levels of variability and method preferences for each entity type, this study explores optimal combinations of replacement rates and augmentation methods. The proposed approach seeks to enhance the overall performance of fine-tuned LLMs for NER tasks in aerospace domains, addressing key challenges in data scarcity and annotation costs, while contributing to advancements in requirements engineering.
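As a rough illustration of the Label-wise Token Replacement idea described above, the following Python sketch replaces each token, with an entity-specific probability, by another token that carries the same label elsewhere in the training data. It is a minimal sketch under stated assumptions, not the implementation used in this study: the function names (build_label_vocab, lwtr), the BIO-style labels, the entity types SYS and VAL, the example requirements, and the replacement rates are all hypothetical. Synonym Replacement is omitted because it depends on an external thesaurus.

```python
import random
from collections import defaultdict

def build_label_vocab(sentences):
    """Collect, for each label, the tokens that carry it in the training data."""
    vocab = defaultdict(list)
    for sentence in sentences:
        for token, label in sentence:
            # Duplicates are kept so that sampling follows token frequency.
            vocab[label].append(token)
    return vocab

def lwtr(sentence, vocab, rates, default_rate=0.0, rng=None):
    """Label-wise token replacement with a separate rate per entity type.

    sentence: list of (token, label) pairs, labels in BIO style (assumed).
    rates:    dict mapping an entity type (e.g. "SYS") to a replacement rate.
    """
    rng = rng or random.Random()
    augmented = []
    for token, label in sentence:
        entity = label.split("-", 1)[-1]  # "B-SYS" -> "SYS", "O" -> "O"
        rate = rates.get(entity, default_rate)
        candidates = vocab.get(label, [])
        if candidates and rng.random() < rate:
            # Swap in a token seen with the same label; the label is unchanged.
            token = rng.choice(candidates)
        augmented.append((token, label))
    return augmented

# Hypothetical toy data, invented for illustration only.
train = [
    [("The", "O"), ("engine", "B-SYS"), ("shall", "O"),
     ("deliver", "O"), ("300", "B-VAL"), ("kN", "I-VAL")],
    [("The", "O"), ("actuator", "B-SYS"), ("shall", "O"),
     ("withstand", "O"), ("150", "B-VAL"), ("psi", "I-VAL")],
]
vocab = build_label_vocab(train)

# Assumed per-entity rates: vary system names often, values less often.
rates = {"SYS": 0.5, "VAL": 0.2}
new_sentence = lwtr(train[0], vocab, rates, rng=random.Random(0))
print(new_sentence)
```

Because the label sequence is preserved, each augmented sentence can be added to the training set without further annotation. Keying the rate by entity type is one simple way to express the per-entity variability that the abstract describes.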
Committee
- Prof. Dimitri Mavris – School of Aerospace Engineering (advisor)
- Dr. Olivia Fischer – School of Aerospace Engineering
- Dr. Woongje Sung – School of Aerospace Engineering