Improving Soft Skill Extraction via Data Augmentation and Embedding Manipulation

Soft skills (SS) are important for Human Resource Management when recruiting suitable candidates for a job. Nowadays, enterprises aim to automatically extract such information from documents, curricula vitae (CVs) and job descriptions, to speed up their recruitment process. State-of-the-art Large Language Models (LLMs) have been successful in Natural Language Processing (NLP) tasks when fine-tuned on domain-specific data. However, annotated data for this task is very limited and costly to obtain, since it requires domain experts. Moreover, SS consist of complex, long entities which are difficult to extract given few annotated examples. As a consequence, the performance of LLMs on soft skill detection still needs improvement before being used in a real-world context. In this paper, we introduce a data augmentation-based entity extraction approach which shows promising performance when entities are long (i.e., more than three tokens). Moreover, we explore the performance of pre-trained LLMs in generating synthetic data for training. The pre-trained models are used to generate contextual augmentations of the baseline dataset. We further analyse the embeddings generated by these models in aiding the extraction of entities. We develop an Embedding Manipulation (EM) approach to further improve the performance of baseline models. We evaluated our approach on the only publicly available dataset for soft skills (SKILLSPAN), and on three entity extraction datasets (GUM, WNUT-2017 and CoNLL-2003). Empirical evidence shows that the proposed approach yields a 6.52% increase in F1 over the baseline model for soft skills.


INTRODUCTION
In recent years, there has been a paradigm shift toward online job posting and recruitment portals. With the help of these platforms, candidates can effortlessly upload their data and documents, such as resumes and curricula vitae (CVs), for the chosen vacancies. These systems have made the job application process smoother for candidates, but they have also made the screening process time-consuming and labor-intensive for recruiters: for a single job advertisement, the Human Resources (HR) department may receive a huge number of applications. Machine learning tools supporting recruiters could potentially save a significant amount of HR resources [2,24,29]. Specifically, automatic information extraction from text data could greatly speed up a recruiter's job [19]. In this context, data extraction primarily focuses on recognizing applicants' personal data, work experience, and education. Soft Skills (SS) are also among the constructs that recruiters try to assess when screening candidates against a job profile. The job profile consists of the features (e.g., education, background, hard and soft skills, previously covered positions, ...) that an ideal candidate should possess for the respective position.
However, extracting soft skills from text data can be tedious, as SS do not have any distinctive definition or rule. To help recruiters, a tool for automatic soft skill extraction from CVs is needed. The current state of the art in soft skill extraction is insufficient to cover this need. Moreover, it is difficult to improve on the current state of the art using conventional techniques due to the scarcity of available datasets, with the notable exception of the SKILLSPAN dataset [31].
SS can consist of a single token (communication, management, ...), multiple tokens (team player, critical thinker, ...), or even a sentence (manage and develop a competitive service product portfolio strategy ...). Recently, researchers have tackled the task of extracting SS from resumes and job postings as a classification task at the sentence level [28] and at the token level [31]. However, the quality of the output of these models is severely limited by the scarcity of annotated data. Moreover, due to the complex nature of human language, the interpretation of a word largely depends on its context. For instance, head in "head of management" contributes toward a SS, while it does not in "head of human body".
State-of-the-art Large Language Models (LLMs) have been shown to be successful in understanding word meaning in context. These models can be fine-tuned to domain-specific tasks, yet doing so requires an annotated dataset. However, manually annotating large corpora at the token level is time-consuming and tedious. Additionally, SS detection requires the knowledge of domain experts to produce quality annotations.
In this scenario, data augmentation (DA) [15] is a viable approach to generate synthetic data from a limited amount of gold annotations. However, current DA approaches perform well only when the entities in the dataset are short; the quality of data generated via DA degrades as entities become longer.
In this paper, we address the problem of SS extraction, especially when entities are long, i.e., more than three tokens (more details in Section 3.4). Starting from baseline DA techniques, we develop new DA approaches based on LLMs to improve the results. Moreover, we introduce the Embedding Manipulation (EM) approach to further improve the performance of LLMs.

RELATED WORKS
Since resumes usually follow a standard structure, keyword-matching algorithms can be leveraged to search for and extract specific data in the relevant sections, such as personal details and experience [1]. Challenges arise when detecting hard and soft skills due to their ambiguity and the need for annotated text and domain experts.
The skill extraction task was framed as a binary classification problem by [21], who implemented a phrase-matching-based approach to differentiate SS phrases from other text. After comparing different neural network models such as CNNs, LSTMs, and a Hierarchical Attention Model, they found that an input representation with tagged skills in combination with an LSTM achieved the best performance.
The SkillNER system [7] is a tool based on a support vector machine model. The system was trained on a collection of 5000 scientific papers, and it utilizes a classification scheme for SS from the O*NET database. The system is divided into two parts: clue extraction and skill extraction. Clue extraction involves identifying patterns that suggest the presence of a specific SS; these clues are then used to identify relevant sentences in the corpus.
In the skill extraction stage, these sentences are labeled, and the resulting data is used to train a support vector machine and an MLP. However, the evaluation of the system has shown that there is room for improvement.
The models proposed by [28] use BERT [5] word embedding representations in combination with POS (Part of Speech) tags and DEP (Dependency Parsing) tags. These features were used to train various machine learning classifiers, which were then evaluated on publicly available datasets. Results showed that these techniques improved accuracy compared to traditional methods, although the limited size of the datasets hampered further progress.
Zhang et al. [31] released a novel dataset for skill extraction on English job postings called SKILLSPAN, while also outlining the annotation guidelines created by domain experts to annotate hard and soft skills. Additionally, this research introduces two BERT models (jobBERT and jobSpanBERT) that are optimized with continuous pre-training on the job-posting domain and multi-task learning techniques. Experimental results obtained with these models show that single-task and multi-task learning can improve performance significantly over non-adapted counterparts. The authors point out the need to enrich the taxonomy with unseen skills, an issue they addressed using weak supervision in a subsequent work [32].
Finally, Imane et al. [12] provide a systematic review and classification of skill extraction techniques.

DATA AUGMENTATION
Data Augmentation (DA) is a well-known technique in machine learning to automatically generate more training data without an extensive annotation exercise [15]. DA has recently become popular in Natural Language Processing (NLP) due to the availability of large language models and increased interest in low-resource domains. There are various DA techniques in NLP, including random swap [27], random insertion [27], word deletion [27], back-translation, and text generation (see [8] for a survey on the topic).
Most of the techniques described above are only suitable when the entity to extract is short, i.e., one or two tokens. The current approaches do not show promising results for longer entities (more than three tokens). Also, such techniques can only be used for text classification tasks where the annotation is at the sentence level. These tasks are annotated via a binary or multi-label schema. For instance, in a binary classification task, each sentence is annotated as either Positive or Negative; hence, preserving the gold annotation is straightforward. Such tasks can easily leverage the aforementioned DA techniques, since token-label correspondence is unnecessary. However, problems arise for fine-grained analysis tasks such as Named Entity Recognition (NER). In the NER task, each token of a document or sentence is tagged, and any manipulation of the input sequence might misalign the corresponding labels. Therefore, preserving the gold labels becomes a critical requirement in NER problems. More details are given in Section 3.5. In the remainder of this section, descriptions and limitations of some of these techniques are provided.

Word Replacement
Various methods of word replacement have been proposed in the past. The approach proposed in [27] replaces words with one of their synonyms (WordNet [18]) or performs random word insertion, swap, or deletion. An alternative solution, using word replacement based on context predicted by a bi-directional LSTM-RNN language model, has been suggested in [14]. Another approach, presented in [9], replaces a randomly chosen word in a sequence with a soft word, i.e., a probabilistic distribution over the vocabulary of a language model; the authors leveraged the Transformer architecture [26] for the language model. However, these techniques are only suitable for short entities (fewer than three tokens). They are also limited in annotation-sensitive tasks such as token classification due to the token-label misalignment problem, as explained in Section 3.5.
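As an illustration, word replacement in the style of [27] can be sketched as follows; the synonym table is a toy stand-in for a WordNet lookup, and all names are illustrative:

```python
import random

# Toy stand-in for a WordNet synonym lookup (illustrative only).
SYNONYMS = {
    "strong": ["excellent", "solid"],
    "communication": ["interpersonal"],
    "skills": ["abilities", "competencies"],
}

def synonym_replace(tokens, n_replacements=1, seed=0):
    """Swap up to n_replacements tokens for a synonym. The sequence
    length is unchanged, so sentence-level labels are trivially
    preserved; token-level labels survive only because we substitute
    in place rather than insert or delete."""
    rng = random.Random(seed)
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n_replacements]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

tokens = ["strong", "communication", "skills", "required"]
augmented = synonym_replace(tokens, n_replacements=2)
```

Note that a replacement chosen without regard to context (e.g., swapping a token inside a multi-token SS) can still change the meaning of a long entity, which is precisely the failure mode discussed above.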

Back-Translation
Back-translation is a popular approach in NLP where a sequence is translated into another language and then translated back into the original language [22]. This approach preserves the overall semantics of the original sentence but does not guarantee preservation of token-label correspondence in token-level tasks.

Masked Language Modelling
There have been a few efforts to address the token-label misalignment problem. For instance, Ding et al. [6] proposed treating DA as a conditional generation task, generating new sentences while preserving the original targets and labels. Their approach relies upon linearized labeled sequences: during linearization, the entity labels are explicitly inserted into the sequence. This approach is controllable and allows for more diversified sentence generation. Zhou et al. [33] suggest the use of Masked Entity Language Modelling (MELM) as a DA framework for low-resource NER, which addresses the token-label misalignment issue by injecting NER labels explicitly into a sentence [6]. This enables the fine-tuned MELM to predict masked entity tokens while explicitly conditioning on their labels [3]. Such techniques solve the token-label misalignment problem by injecting the label information explicitly into the model. However, their performance degrades when entities are longer. Also, these methods require post-processing to remove noisy samples from the augmented data.

Entity Length
In this paper, we consider entities with a length of less than three tokens as short entities, and entities with a length greater than three tokens as long entities. In this research, we find that the length of the entities affects the data augmentation process. The current state-of-the-art approaches are viable for the token classification task when entities are short; unfortunately, for longer entities the literature is limited [23]. Moreover, most datasets for token classification tasks consist of named entities, such as the names of persons, organizations, or locations, which usually consist of one or two tokens. However, entities can be much longer for niche tasks such as SS extraction.
To assess our approaches, in addition to a SS dataset, we selected one dataset with entity length statistics similar to the SS dataset, and two other datasets representing the usual entity extraction tasks, where the average length is shorter. Experimental evaluation shows that traditional DA performs well with short entities but not with long entities; conversely, our proposed DA approaches perform well with long entities while remaining comparable to traditional DA on short ones. In Table 1, we report the average entity length of the datasets used in our experimental assessment. On average, a SS consists of 5 tokens, whereas entities in the GUM dataset consist of 3 tokens. The average entity length in the WNUT-2017 and CoNLL-2003 datasets is 1.5 tokens.
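The average entity length in Table 1 can be computed directly from BIO tags; a minimal, dataset-agnostic sketch (the label names are illustrative):

```python
def avg_entity_length(bio_sequences):
    """Average entity span length (in tokens) over BIO-tagged sequences.
    A span starts at a B- tag and extends over the following I- tags."""
    lengths = []
    for tags in bio_sequences:
        current = 0
        for tag in tags:
            if tag.startswith("B-"):
                if current:
                    lengths.append(current)
                current = 1
            elif tag.startswith("I-") and current:
                current += 1
            else:
                if current:
                    lengths.append(current)
                current = 0
        if current:
            lengths.append(current)
    return sum(lengths) / len(lengths) if lengths else 0.0

# One long SS-style entity (5 tokens) and one short entity (1 token).
data = [
    ["O", "B-SKILL", "I-SKILL", "I-SKILL", "I-SKILL", "I-SKILL", "O"],
    ["B-SKILL", "O", "O"],
]
average = avg_entity_length(data)  # → 3.0
```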

Token-Label Misalignment
In a token-level classification task, each token in a document is assigned a corresponding label, as shown in Figure 1. One of the hard constraints for DA on such a task is to preserve the token-label correspondence in the output. Token-label misalignment is a critical problem limiting the use of DA techniques. For instance, in back-translation or text generation, the augmented output sequence is not guaranteed to match the input sequence length. Using the back-translation augmentation from [17], the generated output sequence below consists of 8 tokens, whereas the input sequence consists of 12 tokens:

Original Text: Support and mentor other engineers through code reviews and pair programming sessions.
Augmented Text: Support mentor other through code reviews programming sessions.
The above example limits us to DA techniques where token-label correspondence is retained. Therefore, we explore contextual augmentation, as explained in Section 4.
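The misalignment can be made concrete with a toy stand-in for back-translation; the drop set below merely reproduces the example above (a real system such as [17] would use actual translation models):

```python
def back_translate_stub(tokens):
    """Toy stand-in for back-translation: it drops words, mimicking
    the lossy behaviour shown in the example above."""
    dropped = {"and", "engineers", "pair"}
    return [t for t in tokens if t.lower() not in dropped]

original = ("Support and mentor other engineers through code reviews "
            "and pair programming sessions").split()
labels = ["B-SKILL"] + ["I-SKILL"] * (len(original) - 1)

augmented = back_translate_stub(original)
# 12 tokens in, 8 tokens out: the gold labels can no longer be
# mapped one-to-one onto the augmented sequence.
assert len(original) == 12 and len(augmented) == 8
assert len(labels) != len(augmented)
```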

OUR METHODS
The proposed SS extraction workflows are shown in Figure 2. We propose contextual and keyword DA via a pre-trained LLM [26]. The LLM is used to generate synthetic data. Moreover, using Embedding Manipulation (EM), we better exploit the information in the annotated dataset. As shown in Figure 1, the input sentence is composed of two parts, referred to as context (highlighted in light red) and keywords (highlighted in light green). Here, a keyword is defined as a SS in the sentence. As can be seen from Figure 1, the gold standard data is annotated using BIO (Beginning-Inside-Outside) tags.
Notice that the SS labeled by domain experts consist of unique tokens that largely depend on the context. One specific context is not enough to learn the representation of such soft skills: the LLM requires more examples to discriminate between different contexts. To address this problem, we propose two augmentation techniques, namely context augmentation and keyword augmentation; to further enhance performance, we use Embedding Manipulation.

Context Augmentation
As shown in Figure 2a, an input sentence of the training set is divided into keywords (light green) and context (light red), according to the gold labels provided by the annotators. In contextual augmentation, we leverage the pre-trained contextual embeddings of the language model. Given an input sentence x containing context tokens x_c and keyword tokens x_k, we mask x_c with [MASK], and the model's task is to predict x_c given x_k. The new substitutes for the tokens x_c are sampled from the probability distribution over the vocabulary of the language model. We choose the top 5 predicted tokens to generate 5 different augmentations of the same sentence x. We thus generate similar sentences with the pre-trained model, constraining it to replace only tokens of the input sentence. The original token-label correspondence is maintained, since we perform augmentation by substituting tokens instead of randomly inserting or deleting them.
The generated augmented sentence is shown at the top of Figure 2a, where the pre-trained model predicts the masked tokens. In this way, we generate more contexts given a single set of keywords x_k. Additionally, the sequence length of each augmentation is the same as that of the input sequence, so the token-label correspondence is preserved.
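A minimal sketch of the substitution logic follows. `toy_fill_mask` is an illustrative stand-in for the pre-trained masked language model, which in the real pipeline proposes the top-5 substitutes for each masked position:

```python
def augment_context(tokens, labels, fill_mask, top_k=5):
    """Mask every context token (label 'O') and let a fill-mask model
    propose substitutes; keyword tokens (B-/I-) are kept verbatim.
    Returns top_k augmented sentences, all of the original length,
    so the gold labels carry over unchanged."""
    augmented = []
    for k in range(top_k):
        new_tokens = []
        for tok, lab in zip(tokens, labels):
            if lab == "O":
                # rank-k prediction for the masked position
                new_tokens.append(fill_mask(tok, rank=k))
            else:
                new_tokens.append(tok)
        augmented.append(new_tokens)
    return augmented

def toy_fill_mask(token, rank):
    """Illustrative stand-in for a masked-LM prediction head."""
    vocab = [token, token.upper(), token.capitalize(), token + "s", token]
    return vocab[rank % len(vocab)]

tokens = ["We", "value", "strong", "team", "player", "mentality"]
labels = ["O", "O", "O", "B-SKILL", "I-SKILL", "I-SKILL"]
augmentations = augment_context(tokens, labels, toy_fill_mask, top_k=3)
```

Keyword augmentation (Section 4.2) is the symmetric operation: the same loop with the mask applied to B-/I- positions instead of O positions.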

Keyword Augmentation
In keyword augmentation, using the same approach as in Section 4.1, given an input sentence x containing context tokens x_c and keyword tokens x_k, we mask x_k with [MASK], and the model's task is to predict x_k given x_c. The new substitutes for the tokens x_k are sampled as in Section 4.1. This setting allows us to generate more sentences with different keywords given one specific context x_c. Keyword augmentation is shown in Figure 2b.

Embedding Manipulation
In this paper, we propose a simple yet effective approach to exploit the embeddings of a pre-trained language model. Given an annotated dataset, the gold label for each token is known beforehand. For instance, the SKILLSPAN dataset is annotated with soft skills using BIO tagging. With this information, we can extract all the soft skills (where a single soft skill is referred to as s) in the training set. Each s consists of a different number of tokens |s|, and all the extracted s are padded to the maximum soft-skill length L in the training set. The padded s is then passed through an LLM [26] to generate the sequence of embedding vectors e_1, ..., e_L corresponding to each token of s. All the generated embedding vectors are then averaged to produce a single embedding vector e_s representing s. Likewise, e_s is calculated for each s in the training set. Finally, all the calculated vectors e_s are averaged to produce a single global embedding vector e_g, which contains the average representation in vector space of all the soft skills in the training set. During fine-tuning of the LLM, given a sentence x with n input tokens {t_i | i = 1, ..., n}, we obtain the embedding representation e_i of each token. We average the pre-calculated e_g with each e_i, obtaining the resulting vector for each token as e'_i = (e_i + e_g)/2 (see Figure 3). The resulting embedding vectors e'_i, i = 1, ..., n, are then passed through the classifier for the final prediction. Since e_g contains the average representation of entities, averaging each e_i with e_g raises the score of the tokens whose embeddings are closer to e_g. This drives the LLM toward the entities of interest in the dataset.
Likewise, during testing we use the EM module (refer to Figure 3). As we do not have gold annotations at test time, we use the same e_g calculated from the training set. For each sentence x with n input tokens {t_i | i = 1, ..., n}, we pass all the token embeddings e_1, ..., e_n through the EM module, obtaining e'_i, i = 1, ..., n (as explained above), which are then passed through the classifier for the final predictions. This setting increases the score of the tokens whose embeddings are close to e_g in the embedding space.
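The EM computation reduces to two averaging steps, sketched below with NumPy (dimensions and data are illustrative):

```python
import numpy as np

def global_entity_embedding(entity_embeddings):
    """Compute e_g: the mean over entities of each entity's mean
    token embedding, taken over all gold entities in the training set.
    entity_embeddings: list of (n_tokens_i, d) arrays, one per entity."""
    per_entity = [e.mean(axis=0) for e in entity_embeddings]
    return np.mean(per_entity, axis=0)

def embedding_manipulation(token_embeddings, e_g):
    """EM step applied at fine-tuning and test time: average each
    token embedding with e_g before the classifier head."""
    return (token_embeddings + e_g) / 2.0

rng = np.random.default_rng(0)
d = 8
# Two gold entities of different lengths (5 and 2 tokens).
entities = [rng.normal(size=(5, d)), rng.normal(size=(2, d))]
e_g = global_entity_embedding(entities)

sentence = rng.normal(size=(6, d))  # 6 token embeddings
manipulated = embedding_manipulation(sentence, e_g)
```

Because the operation is a plain element-wise average, tokens whose embeddings already lie near e_g move the least, which is how the classifier's attention is nudged toward entity-like tokens.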

DATASETS AND EXPERIMENTAL SETUP
This section describes our empirical assessment of the proposed DA techniques. Due to the scarcity of SS datasets, in addition to SKILLSPAN, we tested our approach on three datasets, i.e., GUM, WNUT-2017, and CoNLL-2003. For CoNLL-2003, we used a subset of comparable size to the other two datasets to "simulate" a training regime involving a relatively small number of labeled examples. For all the datasets used in this work, we compare our DA approach against the baselines obtained using BERT and RoBERTa. For SKILLSPAN, we also report the performance of jobBERT, jobSpanBERT [31], and GPT-4 [20].

Datasets
In the following, we provide details of the SKILLSPAN dataset and the three additional entity extraction datasets used in our experimental assessment. Table 1 presents the statistics of these datasets. As explained in Section 4.1, when applying our two DA techniques, the training set of each dataset reported in Table 1 is augmented with N additional samples per sentence via an LLM, increasing its size by N+1 times.
The SKILLSPAN dataset, described in [31], is collected from job postings and labeled by domain experts. The authors of the dataset divided the data into three categories: BIG, HOUSE, and TECH. The BIG dataset has not been released, whereas the HOUSE and TECH datasets are publicly available. The HOUSE dataset contains various categories of job ads from 2012 to 2020, whereas the TECH dataset is restricted to technical job postings. In this paper, the HOUSE and TECH datasets are merged to increase the amount of baseline data.
The GUM dataset [30] is an open-source dataset of distinct annotated texts of twelve different types. GUM comprises diverse text sources, such as interviews, news stories, academic writing, biographies, Wikipedia articles, and political speeches. The dataset contains nine different entity types.
The WNUT-2017 dataset [4] focuses on identifying unusual, previously unseen entities in the context of emerging discussions. It evaluates the ability to detect and classify novel, emerging, singleton named entities in noisy text.
The CoNLL-2003 dataset [25] is a token classification dataset with 4 classes of entities extracted from a news corpus. To better simulate low-resource settings, we chose a subset of CoNLL-2003 that best matches the token statistics of the other two datasets.

To evaluate GPT-4 on SS extraction, we experimented with the following prompts: (1) Extract a list of tokens corresponding to soft skills in the given input sentence. (2) Extract soft skills from the input sentence. Return tokens corresponding to soft skills in a list. (3) Extract a list of skills from the input sentence.
The prompts were engineered to cast the output of the model as a token classification task. The best prompt was chosen as the one returning the best F1 score on the test samples. The proposed prompting approach reduces the post-processing of the output generated by the GPT-4 model and also allows us to make better comparisons with the other approaches presented in Table 3. The results obtained with the best prompt are presented in Table 3, and some examples can be seen in Table 5.
Hyperparameter Tuning and Model Selection
To determine the optimal setting for the augmentation rate N, we conducted a grid search over the range [1, 2, 3, 4, 5]. We used the pre-trained baseline models and generated the augmented data following the methods described in Section 4. The augmented data was used to fine-tune the model, and its performance on the validation set was recorded. The best model was chosen based on the best F1 score on the validation set. The number of augmentations for each dataset is presented in Table 2. It can be noted that for the datasets with longer entities (SKILLSPAN and GUM) the performance degrades after 3 rounds of augmentation, whereas for the datasets with short entities, i.e., WNUT-17 and CoNLL-2003, a similar trend is observed after 2 rounds of augmentation.
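The model-selection loop amounts to a one-dimensional grid search over the augmentation rate N; a sketch, where `train_and_eval` stands in for fine-tuning plus validation (the toy scores mimic the peak-then-degrade trend reported for the long-entity datasets):

```python
def select_augmentation_rate(rates, train_and_eval):
    """Grid search over the augmentation rate N: fine-tune on data
    augmented N times and keep the setting with the best validation F1."""
    best_rate, best_f1 = None, float("-inf")
    for n in rates:
        f1 = train_and_eval(n)  # fine-tune + validate, returns F1
        if f1 > best_f1:
            best_rate, best_f1 = n, f1
    return best_rate, best_f1

# Illustrative validation F1 per rate; peaks at N = 3, then degrades.
toy_scores = {0: 51.0, 1: 52.1, 2: 53.0, 3: 54.5, 4: 53.2, 5: 52.8}
best_rate, best_f1 = select_augmentation_rate(range(6), toy_scores.get)
```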

RESULTS
As already stated, due to the scarcity of datasets for SS extraction, we first assessed the DA techniques proposed in Section 4 on three entity extraction tasks. For these three datasets, we considered two LLMs, BERT [5] and RoBERTa [16]. For improving the performance of SS extraction from job postings and CVs (the SKILLSPAN dataset), we also considered jobBERT [31] and jobSpanBERT [31]. The performance of the GPT-4 model [20] was also evaluated for SS extraction, using the methodology described in Section 5.2.

Soft Skills
To show the effectiveness of the proposed DA approaches, we perform an extensive comparison with traditional DA approaches, such as word deletion, synonym replacement, word swap, and spelling augmentation [27], and with recently introduced DA approaches for token classification tasks (MELM [33] and DAGA [6]). From Table 3, we observe that the proposed DA approaches (keyword and context augmentation) achieve a higher F1 score than their baseline counterparts. The improvement is significant for all the models used in the experimental assessment. The empirical evidence shows that context augmentation outperforms keyword augmentation. We hypothesize that this increase in performance arises because, when entities are long, the LLM requires more contextually diverse examples to learn better representations of entities. From Table 3, we notice that the previous state-of-the-art models for SS extraction, jobBERT and jobSpanBERT, perform better than the BERT model. However, the RoBERTa model achieves the highest performance in the SS extraction task overall.
Moreover, we notice that EM plays a significant role in further enhancing the performance of LLMs. To test whether EM leads to a higher F1 score without assuming normality and homoscedasticity, we resorted to the Wilcoxon signed-rank test [11]. It gives W = 224 and p-value = 0.017, revealing statistical evidence that using EM improves the performance over baseline models. The RoBERTa model with EM, fine-tuned on the contextually augmented dataset, achieves the highest F1 score of 54.46, a 6.52% improvement over its baseline counterpart. We also evaluated our approach against the off-the-shelf GPT-4 model using the prompting approach described in Section 5.2. The GPT-4 model achieves 48.01 absolute F1, which is 6.41% less than our proposed approach. The performance of the proposed DA techniques on the WNUT-2017 entity extraction dataset is reported in Table 4. We observe that contextual augmentation achieves better performance than the baseline. However, the performance of contextual DA is marginally lower than that of the baseline augmentation approaches. The RoBERTa model fine-tuned on augmentations generated by swapping random words, combined with EM, achieves the highest F1 score of 55.31, a 1.2% improvement over contextual DA.
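The significance check can be reproduced with `scipy.stats.wilcoxon`; the paired F1 scores below are illustrative placeholders, not the paper's actual per-run results (which yield W = 224 and p = 0.017):

```python
from scipy.stats import wilcoxon

# Paired validation F1 scores for the same runs without / with EM
# (toy numbers for illustration only).
f1_baseline = [51.2, 52.0, 50.8, 53.1, 51.9, 52.4, 50.5, 51.7]
f1_with_em  = [52.3, 52.9, 51.5, 54.4, 52.1, 53.6, 51.0, 52.8]

# One-sided test: does EM yield systematically higher F1?
stat, p_value = wilcoxon(f1_with_em, f1_baseline, alternative="greater")
```

The Wilcoxon test is appropriate here because the paired F1 differences need not be normally distributed, which is exactly the assumption the paper avoids.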
On the other entity extraction dataset (CoNLL-2003), augmentation is not as effective as for GUM and SKILLSPAN, and we notice a similar trend as in the WNUT-17 dataset. From Table 4, we observe that all augmentation techniques show only marginal improvement over the baseline model. The RoBERTa model fine-tuned on a dataset augmented with word deletion, combined with EM, gains 0.42% over the baseline.
We remark that data augmentation is only marginally effective on WNUT-17 and, in some cases, degrades performance on CoNLL-2003. However, the performance gain is significant for the SKILLSPAN and GUM datasets.
It can be observed that, for both datasets, the model/augmentation technique that reaches the best F1 score on the validation set also returns the best results on the test set. This supports the conclusion that the assessment on the NER datasets can be considered positive overall.

INFERENCE
Table 5 shows the inference on the test set. We compare the predictions of the RoBERTa model fine-tuned on the baseline and augmented datasets against the gold labels. Both models correctly predict Example 1. In Examples 2 and 3, the baseline model predicts some extra tokens not annotated in the gold labels, whereas the augmented model predicts the tokens correctly. Likewise, in Example 4, both models fail to predict the tokens "NET services and APIs ..."; however, the second part of the sentence is predicted correctly by the augmented model, showing superior performance over the baseline. Example 5 reveals an interesting outcome: the input phrase "supervision and other relevant management functions" would generally be considered a soft skill. However, due to human error, it is not labeled as such in the gold dataset. The augmented model is still able to mark it as a SS. The GPT-4 model extracts a few tokens corresponding to SS in Examples 1, 2, and 5, whereas it fails to identify any SS in Examples 3 and 4.

CONCLUSIONS AND FUTURE WORK
We have presented a simple yet effective approach for improving the performance of Large Language Models when entities are long and gold annotations are limited. We demonstrated our approach on four different token classification datasets. Using DA techniques, we generated synthetic data that improved the performance of the baseline models. We also exploited the gold-annotated datasets to extract additional information via EM. We demonstrated that the proposed approach can effectively improve SS extraction from job descriptions and CVs. This approach can help speed up the recruitment process by automatically extracting information about soft skills from candidate documents, without requiring expensive and time-consuming manual annotations. However, more research is needed to further improve these models' performance and to test their effectiveness in a real-world scenario. Additionally, in future work, we plan to investigate how to use these models fairly and ethically, to ensure that they do not perpetuate existing biases in the recruitment process.

Figure 2 :
Figure 2: Pipeline of the proposed Data Augmentation flow. The tokens comprising a SS are referred to as keywords (marked in light green), whereas the rest of the tokens are referred to as context (marked in light red). A pre-trained Language Model is used to replace either the context or the SS tokens. (a) refers to Contextual Augmentation, (b) to Keyword Augmentation.

Figure 3 :
Figure 3: Proposed methodology for fine-tuning an LLM via Embedding Manipulation (EM). t_i are the input tokens to the pre-trained LM. e_i refers to the embedding representation of each token generated by the pre-trained LM. The embedding e_g (highlighted in orange) is the average representation of entities, calculated from the entities extracted from the training set.

Table 1 :
Statistics of the datasets used in this work. The average entity length refers to the average number of tokens per entity.

Table 2 :
Number of generated augmented sentences N and F1 score on the validation set for each dataset and each considered model/augmentation technique. The baseline value is for N = 0, i.e., no augmentation. The best performance for each model/data augmentation is in bold.

Table 5 :
Comparison of soft skill predictions on examples randomly sampled from the test set. We choose the best-performing model from Table 3, i.e., RoBERTa. The Baseline Model column shows the predictions of the RoBERTa model fine-tuned on the baseline dataset; the GPT-4 column shows the predictions of GPT-4 on the test examples. The Data-Augmented Model column shows the results of the RoBERTa model fine-tuned on the augmented dataset with EM. Highlighted text indicates the gold labels in the first column and the corresponding predictions by the models.