The Computational Method for Supporting Thai VerbNet Construction

VerbNet is a lexical resource for verbs with many applications in natural language processing tasks, especially ones that require information about both the syntactic behavior and the semantics of verbs. This article presents an attempt to construct the first version of a Thai VerbNet corpus via data enrichment of an existing lexical resource. The corpus contains annotation at both the syntactic and semantic levels: verbs are tagged with frames within the verb class hierarchy, and their arguments are labeled with semantic roles. We discuss the technical aspects of the construction process of Thai VerbNet and survey different semantic role labeling methods that could make this process fully automatic. We also investigate the linguistic aspects of the computed verb classes, and the results show their potential for assisting semantic classification and analysis. At the current stage, we have built a verb class hierarchy consisting of 28 verb classes derived from 112 unique concept frames over 490 unique verbs using our association rule learning method on Thai verbs.


INTRODUCTION
Lexical resources have played a key role in the rapid advancement of many natural language processing (NLP) tasks over the last few decades. For example, WordNet [19, 44] is a lexical database of semantic relations (e.g., synonyms), which provides information about words for performing semantic analysis on texts, strings, and documents [18]. FrameNet [5, 38] is a lexical database organized in terms of semantic frames [37], which is valuable information for semantic role labeling (SRL), used in applications such as information extraction and sentiment analysis. The most notable resource is VerbNet [14, 34, 49], a verb lexicon that is crucial for supporting NLP tasks related to semantic interpretation.
These semantically annotated lexical resources help advance the development of the natural language understanding field [6], especially VerbNet, since verbs play a crucial role in determining sentence patterns. VerbNet provides a framework for representing semantic predicate-argument structure, which gives the semantic representation of a sentence. VerbNet also consists of semantically coherent verb classes, which are useful for capturing generalizations and inferring information about verb behavior, rather than describing individual words. This property is helpful for handling data sparseness problems in NLP tasks. Therefore, VerbNet is useful in computational lexicography, information extraction, question answering, and verb sense disambiguation [9, 10, 16, 29].
Despite these benefits, the number and size of lexical resources for the Thai language are limited: the Thai WordNet of [52] has around 30,000 words in Thai synsets, the Thai FrameNet of [33] contains 1,339 frames and 22,708 lexical units, and the Thai concept frame of [50] covers 112 verbs. Since the manual development of lexical resources is a complicated, time-consuming, and error-prone process, researchers have proposed computational approaches for VerbNet construction. For example, [35] enriched and combined existing lexical resources to build a VerbNet for Vietnamese, and [41] exploited well-established resources via cross-lingual translatability to create VerbNets for many languages, such as Polish, Croatian, Mandarin, Japanese, Italian, and Finnish.
One popular computational approach for creating a lexical resource in a new language is translating the lexical resources of well-established languages, such as English VerbNet, or other resource-rich languages into languages with scarce resources. By exploiting well-developed and abundant lexical resources, this approach has succeeded for some languages [41]. However, it has not worked well for Thai even for simpler lexical resources, such as WordNet and FrameNet, so researchers have had to either abandon the untranslated entities [52] or adopt a hybrid approach [33]. Hence, our method takes a different route and exploits the existing lexical resources via data enrichment in order to develop a more complex lexical resource like VerbNet.
In this work, we present a novel computational method for the construction of Thai VerbNet, which extends the Thai concept frame construction in [50] by enriching the Thai concept frame with verb classes. Since the construction of the Thai concept frame presented in [50] involves a few manual steps, such as SRL, we survey the state-of-the-art methods for the SRL problem, e.g., Transformer-based and recurrent neural network (RNN)-based methods, in order to illustrate that the construction of the lexical resource can be made automatic. Then, each verb in the Thai concept frame is enriched with a verb class and hierarchy, which is computed by a rule-based machine learning method, i.e., association rule learning. We also investigate the characteristics of each verb class and show that the members of a verb class share semantic and syntactic behaviors to some degree. The main contribution of this work is threefold: (a) introducing the first Thai VerbNet corpus, to the best of our knowledge, (b) developing a novel method for verb classification based on association rule learning, and (c) providing a benchmark for the SRL problem on the Thai VerbNet corpus. The verb class hierarchy, consisting of 28 verb classes derived from 112 unique concept frames by our association rule learning method on Thai verbs, is available at https://sgulyano.github.io/semanthaibank/.
The rest of the article is organized as follows. Section 2 discusses the verb classification and SRL problems that are required for the construction of VerbNet. Section 3 presents the novel method for verb classification based on association rule learning and the configuration of benchmark tests for SRL. In Section 4, the results of the state-of-the-art methods on the SRL problem are presented, while Section 5 discusses the verb classification results and the relationships between verbs within the same class. Finally, in Section 6, we present conclusions and directions for future work.

RELATED WORK
VerbNet [34, 49] organizes verbs into verb classes, whose components include the class hierarchy, members, frames, and thematic roles, which can be further optionally characterized by selectional restrictions [14]. Finding the verb class hierarchy and members can be seen as the verb classification problem, while thematic roles are determined through the SRL problem. Frames can be computed following the method introduced by [50].

Verb Classification
Previous work in [21] suggested that a verb's syntactic behavior (i.e., its argument realization) and lexical semantics are connected. Consequently, verbs that share common semantic features can be grouped based on the syntactic expression of their verbal arguments. Verb classification is intended to organize verbs into groups based on semantic relations and similar syntactic behavior. While it is laborious to classify verbs manually, previous studies showed that it is possible to acquire verb classes automatically using a constructionist framework [11] and cluster analysis [35]. In contrast, our work introduces a novel verb classification method based on association rule learning.

Clustering.
Clustering is an unsupervised learning technique for grouping similar entities together without knowledge of labels. The work in [35] adopted a hierarchical clustering algorithm to automatically identify verb classes. A clustering algorithm requires the definition of the data representation and the similarity measure. The authors used word embedding vectors, generated by the Word2Vec [43] and BERT [46] algorithms, to represent each verb and used the cosine distance as the similarity measure.
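As an illustration of this clustering approach, the sketch below groups a handful of verbs by average-linkage hierarchical clustering under cosine distance. The verb vectors are small hypothetical stand-ins for Word2Vec or BERT embeddings, not the actual representations used in [35]:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical stand-ins for verb embeddings from Word2Vec/BERT.
verbs = ["plant", "grow", "sell", "buy"]
vecs = np.array([
    [0.9, 0.1, 0.0],  # plant
    [0.8, 0.2, 0.1],  # grow
    [0.1, 0.9, 0.2],  # sell
    [0.0, 1.0, 0.1],  # buy
])

# Average-linkage hierarchical clustering under cosine distance.
Z = linkage(vecs, method="average", metric="cosine")
labels = fcluster(Z, t=2, criterion="maxclust")

# Collect verbs by cluster label to obtain candidate verb classes.
classes = {}
for verb, lab in zip(verbs, labels):
    classes.setdefault(lab, []).append(verb)
print(classes)
```

With real embeddings, the cut criterion (here `maxclust=2`) would be tuned to the desired number of classes.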

Association Rule Learning.
Association rule learning is a rule-based machine learning method for discovering frequent patterns between variables in a database. The task can be seen as two subtasks: frequent itemset identification and association rule mining. The database consists of a list of transactions, where each transaction is defined by an ID and the corresponding set of items of interest, called the itemset. The first subtask is to find frequent itemsets, which are sets of items whose support, the number of transactions in which the set of items appears, exceeds a user-specified threshold. The latter subtask is to find "strong" association rules. A rule is expressed as a link between two itemsets, X and Y. The confidence of the rule is the conditional probability that a transaction contains the itemset Y given that it includes the itemset X. A rule is "strong" if its confidence exceeds the user-specified threshold [55]. For verb classification, a syntactic frame, which is a combination of semantic roles with syntactic patterns, can be seen as an item, so a group of syntactic frames is an itemset. A verb or a collection of verbs can then be seen as a transaction, whose itemset is all possible syntactic frames of the verb(s). Hence, finding the "strong" rules can capture the associated semantic roles and syntactic patterns, which is one possible way to categorize verb classes.
The Apriori algorithm, introduced by [3], is a popular method for finding frequent itemsets and association rules. It uses a breadth-first (level-wise) search approach and prunes the search space for efficiency by exploiting the observation that no superset of an infrequent itemset can be a frequent itemset. The Apriori algorithm is adopted in our work due to its popularity and simplicity; however, other methods for finding frequent itemsets and association rules, such as the Eclat and FP-Growth algorithms, can be used as well [55].
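The two subtasks can be sketched in a few lines of plain Python. The transactions below are hypothetical verbs represented by toy syntactic-frame labels, and the implementation is a minimal level-wise Apriori, not a production miner:

```python
from itertools import combinations

def apriori(transactions, min_support=0.5):
    """Level-wise frequent-itemset search with Apriori pruning: a
    (k+1)-candidate survives only if all of its k-subsets were frequent."""
    n = len(transactions)
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    frequent = {}
    while level:
        support = {c: sum(c <= t for t in transactions) / n for c in level}
        kept = {c: s for c, s in support.items() if s >= min_support}
        frequent.update(kept)
        # Join surviving k-itemsets into (k+1)-candidates and prune.
        level = list({a | b for a, b in combinations(kept, 2)
                      if len(a | b) == len(a) + 1
                      and all(frozenset(sub) in kept
                              for sub in combinations(a | b, len(a)))})
    return frequent

def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: P(Y in t | X in t)."""
    have_x = [t for t in transactions if x <= t]
    return sum(y <= t for t in have_x) / len(have_x)

# Hypothetical verbs as transactions over toy syntactic-frame items.
transactions = [frozenset(t) for t in [
    {"Agent-V-Object", "Agent-V-Object-Location"},
    {"Agent-V-Object", "Agent-V-Object-Time"},
    {"Agent-V-Object"},
    {"Agent-V-Location"},
]]
frequent = apriori(transactions)
```

Here "Agent-V-Object" appears in three of the four transactions, so it is the only itemset whose support reaches the 0.5 threshold.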

Token Classification
Token classification is a natural language understanding task in which a label is assigned to some tokens in a text. Some popular problems in this category are named entity recognition (NER), part-of-speech (PoS) tagging, and SRL.
SRL plays a crucial role in sentence-level semantic analysis as it recovers the predicate-argument structure of a sentence, determining "who" did "what" to "whom", "when", and "where." Hence, the task of SRL is to indicate what semantic relations hold among a predicate and other sentence constituents, where these relations are drawn from a pre-specified list of possible semantic roles [42].
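For illustration, an SRL-annotated sentence can be represented as role-labeled tokens. The example below is a hypothetical English gloss, with "unknown" marking constituents outside the predicate-argument structure:

```python
# Hypothetical English gloss of an SRL-annotated sentence; "unknown"
# marks constituents outside the predicate-argument structure.
sentence = [
    ("farmers", "Agent"),
    ("plant", "V"),
    ("rice", "Object"),
    ("in", "unknown"),
    ("paddies", "Location"),
]

def predicate_arguments(tagged):
    """Return the predicate and a role -> token map of its arguments."""
    predicate = next(tok for tok, role in tagged if role == "V")
    args = {role: tok for tok, role in tagged if role not in ("V", "unknown")}
    return predicate, args

pred, args = predicate_arguments(sentence)
```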
Previous methods for SRL can be grouped into three main categories, including traditional machine learning methods, RNN-based methods, and transformer-based methods.

Traditional Methods.
Many conventional machine learning techniques have been applied to the SRL problem, such as k-nearest neighbors (kNN) [4], graphical models [53], and especially the support vector machine (SVM) [45]. SVM, introduced by [15], has achieved good performance in many classification tasks, e.g., NER [7] and POS tagging [20]. The main advantages of SVM are its high generalization performance, independent of the dimensionality of the feature vectors, achieved by considering only support vectors, and its support for combining multiple features through the kernel trick.

RNN-based Methods.
Recurrent neural networks (RNNs) [24] and their variants, such as long short-term memory (LSTM) [26] and the gated recurrent unit (GRU) [12], have shown great success in modeling sequential data. Previous work from [51] adopted the bidirectional long short-term memory with conditional random field (Bi-LSTM-CRF) architecture [31], which utilizes both word and character representations, and achieved outstanding results on the Thai NER task. Bi-LSTM-CRF exploits the Bi-LSTM for capturing information in a sentence and the conditional random field (CRF) for sequence labeling, which also takes into account the correlations between neighboring labels [40]. This type of deep learning architecture is designed especially for sequential data, as it can capture long-range dependencies between predicate and argument [25].

Transformer-based Methods.
Recent transformer-based methods usually involve unsupervised language model pre-training on a large dataset, resulting in a pre-trained model that can be applied to downstream tasks with little modification to the architecture and produce state-of-the-art performance [17]. For example, the authors of [39] introduced a language model named WangchanBERTa, based on the RoBERTa-base architecture [36] modified for the Thai language. WangchanBERTa was benchmarked against other state-of-the-art methods and gave comparable or superior results on many downstream tasks, such as sentiment analysis, topic classification, NER, and POS tagging. The advantage of this type of method is that it can be applied to low-resource tasks, since only a small number of parameters need to be learned from scratch.

Thai Concept Frame
Our method is based on the construction of the Thai concept frame in [50], which was computed for the agricultural domain. The source data comprised 5,784 Thai sentences, from which 962 case frames were constructed. This framework collects the relationships of words surrounding verbs and then summarizes them into a semantic frame, called a concept frame, capturing the relationships of words in various possible agricultural contexts. The construction of the Thai concept frame introduced in [50] involves six steps: (1) automatic word segmentation, (2) automatic POS annotation, (3) automatic syntactic parsing labeling, (4) manual SRL, (5) manual semantic concept labeling, and (6) automatic capture and organization (Figure 1). The two manual steps, SRL (step 4) and semantic concept labeling (step 5), are the crucial steps that introduce syntactic and semantic behaviors, respectively, to the concept frame. Therefore, we surveyed automatic SRL methods as a possible automated solution, while the semantic concepts used in [50] can be derived from WordNet [52]. For semantic concept labeling, the authors of [50] determine the semantic concept of a word from WordNet, so one possible solution is to use the Thai synset [52] to find the word's synonyms and assign the word the same semantic concept as a synonym with the same verb argument structure. If no match is found, the semantic concept may need to be determined manually following the process in [50].

Fig. 1. The overview of our method. Our contribution is the introduction of the verb classification step (step 7, blue box) to the construction process of the Thai concept frame (steps 1-6) [50]. Previous work [50] has already automated most of the steps, except for SRL (step 4), semantic concept labeling (step 5), and verb classification (step 7). We also survey automatic SRL methods and suggest an automatic process for semantic concept labeling.
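The suggested synonym-based labeling step could be sketched as follows. The miniature synset and concept tables are hypothetical stand-ins for the Thai WordNet [52] and the already-labeled concept-frame data:

```python
# Hypothetical stand-ins for the Thai WordNet synsets [52] and the
# already-labeled concept-frame entries (verb, argument structure).
SYNSET = {"grow": {"cultivate"}, "cultivate": {"grow"}}
KNOWN = {("cultivate", "Agent-V-Object"): "develop"}

def infer_concept(verb, arg_structure, synset=SYNSET, known=KNOWN):
    """Assign a verb the semantic concept of a synonym that shares the
    same verb-argument structure; return None to signal that manual
    labeling (as in the original pipeline) is still required."""
    if (verb, arg_structure) in known:
        return known[(verb, arg_structure)]
    for syn in synset.get(verb, ()):
        if (syn, arg_structure) in known:
            return known[(syn, arg_structure)]
    return None
```

Here "grow" inherits the concept of its synonym "cultivate" because both take the same argument structure; a verb with no matching synonym falls back to manual labeling.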

METHOD
The method that we propose to build a Thai VerbNet, an equivalent resource to VerbNet for the Thai language, is based on data enrichment of an existing frame semantics lexical resource, i.e., the Thai concept frame. Our work extends the construction process of the Thai concept frame in [50] by adding a verb classification step (Figure 1). The authors of [50] already provide an automatic process for most of the steps, except for SRL, semantic concept labeling, and the extra step, verb classification. Hence, we survey automatic methods for SRL and provide a benchmark for SRL in the Thai language, while we use only the semantic concepts given in the dataset. The additional information needed for Thai VerbNet construction is the class hierarchy and members, which can be obtained by solving the verb classification problem. The last component of VerbNet, the optional selectional restrictions on verb arguments, is left for future work, as these restrictions only provide constraints on specific word combinations and there are no effective automatic solutions available for the Thai language.

Semantic Role Classification
The thematic roles, or semantic roles, used in this study follow the works of [23] and [32] and consist of 12 cases: Accompaniment, Agent, Beneficiary, Experiencer, Goal, Instrument, Location, Manner, Measure, Object, Source, and Time (details in Table 1). We compare three techniques, SVM, Bi-LSTM-CRF, and WangchanBERTa, which represent the traditional, RNN-based, and transformer-based methods, respectively. The configurations used in the experiments are described below. SVM with the radial basis function kernel and the one-vs-one (OVO) classification strategy is used in all experiments, and the feature vectors are the embedding vectors of the current and previous tokens. The word embedding language model deployed in this study is a pretrained version of Thai2Fit [48], a ULMFiT language model [27] trained on the Thai Wikipedia dump with 60,005 embeddings; the dimension of the word embeddings is 400. The implementation uses the scikit-learn library [47], and the hyperparameters, such as the regularization parameter, are set to their defaults.
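A minimal sketch of this SVM configuration is shown below. The sentence, role labels, and random embedding table are hypothetical stand-ins for the Thai data and the 400-dimensional Thai2Fit vectors:

```python
import numpy as np
from sklearn.svm import SVC

DIM = 16  # stand-in for the 400-dimensional Thai2Fit embeddings
rng = np.random.default_rng(0)

# Hypothetical sentence, role labels, and embedding table.
tokens = ["farmer", "plants", "rice", "in", "field"]
roles = ["Agent", "V", "Object", "unknown", "Location"]
emb = {w: rng.normal(size=DIM) for w in tokens}

def features(toks):
    """Concatenate the current token's embedding with the previous
    token's embedding (zeros at sentence start), as configured above."""
    zero = np.zeros(DIM)
    return np.array([np.concatenate([emb[t], emb[toks[i - 1]] if i else zero])
                     for i, t in enumerate(toks)])

# RBF kernel with the one-vs-one multiclass strategy, as in the study.
clf = SVC(kernel="rbf", decision_function_shape="ovo")
clf.fit(features(tokens), roles)
pred = clf.predict(features(tokens))
```

In the real setup, each training example is one token of an annotated sentence and the labels come from the 12-role inventory of Table 1.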
The Bi-LSTM-CRF used in this study has the same architecture as the one used for NER described in [51]. Bi-LSTM-CRF uses the word embeddings from the pretrained Thai2Fit model as the word-level representation, while the character-level representation is computed using a Bi-LSTM with 32 LSTM units, an embedding dimension of 32, and a recurrent dropout of 0.5. Then, both word and character representations are concatenated and fed to a Bi-LSTM with 256 LSTM units and a recurrent dropout of 0.5. Adam is used as the optimization method with a batch size of 32. The model was implemented using the Keras/TensorFlow library [1, 13].
WangchanBERTa uses SentencePiece [30] as the subword tokenizer and adopts the RoBERTa-base architecture [36] pretrained on the assorted Thai texts described in [39]. The model is then fine-tuned for the downstream task of semantic role classification. In this study, Adam was used for optimization with a learning rate of 2e-5, 4 epochs, and a batch size of 4. The model is developed using the Hugging Face Transformers library [54].
[Algorithm 1 (fragment): find the frequent itemsets using the Apriori algorithm; C[i] ← i for every root node i ∈ L; for all i ∈ L in BFS order, if i is a left child node, then C[i] ← a new subclass of C[parent(i)], otherwise C[i] ← C[parent(i)].]

Verb Class Hierarchy Computation
Since verb classification is intended to organize verbs into groups with shared semantic and syntactic behaviors, our method classifies verbs within the same concept frame based on their syntactic frames. We represent verbs in the same concept frames by their case grammar [22], i.e., the combinations of semantic roles required by the verbs. This representation is preferred because it can capture both syntactic and semantic behavior, as both concept frames and syntactic frames have semantic properties [28], while the structure of semantic roles provides the syntactic realization. Assuming that verbal argument structures in terms of semantic roles are reliable indicators of verb meaning and behavior for the Thai language, we can treat the verb classification problem as grouping verbs based on frequent patterns of argument structures.
To set up association rule learning for mining frequent patterns, a concept frame is seen as a transaction, whose itemset is the argument structures associated with its verbs. An item is represented by a sequence of semantic roles following the verbal arguments. The goal is to create a verb class hierarchy that takes into account both syntactic and semantic information. With this problem formulation, the frequent itemsets act as potential verb classes, and the association rules provide the class hierarchy.
Our method consists of three main steps, described in Algorithm 1. First, it organizes the concept frames into a binary tree based on the frequent and common argument structures, which can be computed using the Apriori algorithm [2]. Each internal node of the binary tree splits the data into two groups based on whether it contains the specific item(s): the left child contains the item(s) and the right child does not. Second, the binary tree is pruned into a forest, a collection of disjoint trees, to obtain the verb classes. The user-defined maximum depth of the trees is set empirically to avoid under-pruning, which makes verb classes lose their specificity, while over-pruning removes the shared characteristics between classes. In our experiments, a support threshold of 0.2 and a maximum depth of 3 gave the best results based on our observations, but the maximum depth is expected to change as the number of samples increases. Finally, a class name is assigned to every node and to every concept frame and verb within the node. A node and its right child share the same name, while the left child creates a new subclass, since it introduces new argument structure(s) to the class and thus qualifies as a new subclass.
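A condensed sketch of the tree-building step is given below. The concept frames and argument structures are hypothetical, and splitting on a single most frequent item is a simplification of the frequent-itemset splits used in Algorithm 1:

```python
from collections import Counter

def build_tree(frames, depth=0, max_depth=3, min_support=0.2):
    """Split concept frames on their currently most frequent argument
    structure: the left child takes the frames containing it (a new
    subclass), the right child the rest (same class as the parent)."""
    counts = Counter(a for structs in frames.values() for a in structs)
    node = {"members": sorted(frames)}
    if depth >= max_depth or not counts:
        return node
    item, freq = counts.most_common(1)[0]
    if freq / len(frames) < min_support:
        return node  # no sufficiently frequent structure: leaf class
    node["split"] = item
    left = {f: s - {item} for f, s in frames.items() if item in s}
    right = {f: s for f, s in frames.items() if item not in s}
    if left:
        node["left"] = build_tree(left, depth + 1, max_depth, min_support)
    if right:
        node["right"] = build_tree(right, depth + 1, max_depth, min_support)
    return node

# Hypothetical concept frames mapped to their argument structures.
frames = {
    "grow":  {"Agent-V-Object", "Agent-V-Object-Location"},
    "plant": {"Agent-V-Object", "Agent-V-Object-Time"},
    "sell":  {"Agent-V-Object-Beneficiary"},
    "live":  {"Agent-V-Location"},
}
tree = build_tree(frames)
```

Naming then proceeds top-down: a right child keeps the parent's class name, while each left child appends a new subclass number, mirroring the prose above.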

RESULTS
In this section, we describe the results of the three SRL techniques we surveyed (SVM, Bi-LSTM-CRF, and WangchanBERTa) on a collection of unseen test sentences. SVM represents the traditional machine learning techniques that were popular for NLP tasks, while modern NLP techniques usually revolve around deep learning, i.e., RNNs and transformers. Bi-LSTM-CRF is a state-of-the-art RNN-based technique, while WangchanBERTa is a state-of-the-art transformer-based technique.
The dataset used in this experiment is a part of the Thai concept frame dataset [50] that is publicly available at http://tcf.human.ku.ac.th/. It contains 2,238 sentences with 490 unique verbs. The sentences in the dataset are partitioned into training and test groups using two different strategies: the hold-out method and out-of-vocabulary (OOV) analysis. In the hold-out method, we randomly partitioned the sentences in an 80:20 ratio into training and test datasets, respectively. In OOV analysis, to imitate the effect of OOV on unseen verbs, all unique verbs are partitioned in an 80:20 ratio, the latter part is treated as OOV words, and the model is trained only on the sentences that contain verbs from the former group. In each sentence, only the arguments of the verb predicate are labeled with a semantic role, while other constituents are labeled as "unknown" and are not used for the evaluation. The predicate-argument structure is already provided in the dataset and is used as the ground truth in our study, while the arguments are defined as the tokenized words. Although there are 12 possible cases of semantic roles (Table 1), the sample sentences have no words that belong to the Source and Goal cases. This is because the sample sentences are collected from news and research about agriculture, so most of the samples are declarative sentences that do not involve motion events. Table 2 provides counts of the number of sentences, tokens, unique verbs, arguments, and semantic role labels in the training and test datasets for both data-splitting strategies.
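The OOV splitting strategy can be sketched as follows, assuming each sentence exposes its predicate verb; the data here are hypothetical placeholders:

```python
import random

def oov_split(sentences, verb_of, test_frac=0.2, seed=0):
    """Hold out test_frac of the unique verbs, then assign every
    sentence containing a held-out verb to the test set, so all test
    predicates are out-of-vocabulary for the trained model."""
    verbs = sorted({verb_of(s) for s in sentences})
    random.Random(seed).shuffle(verbs)
    cut = int(len(verbs) * (1 - test_frac))
    train_verbs = set(verbs[:cut])
    train = [s for s in sentences if verb_of(s) in train_verbs]
    test = [s for s in sentences if verb_of(s) not in train_verbs]
    return train, test

# Hypothetical (verb, sentence) pairs.
data = [("plant", "s1"), ("plant", "s2"), ("grow", "s3"),
        ("sell", "s4"), ("live", "s5")]
train, test = oov_split(data, verb_of=lambda s: s[0])
```

Splitting over verbs rather than sentences guarantees that no test predicate ever appears in the training data.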
Predicted results are evaluated with respect to precision, recall, and the F1 measure. Precision (p) is the proportion of predicted semantic role labels that are correct. Recall (r) is the proportion of gold semantic role labels that are predicted correctly. Finally, the F1 measure computes the harmonic mean of precision and recall, F1 = 2pr/(p + r), and it is the main measure for performance comparison between systems.
Tables 3 and 4 present the overall results and the results for each semantic role obtained by the three techniques on the hold-out and OOV test sets, respectively. The results with the best F1 score are shown in bold. The overall results are computed using micro-averaged (Micro Avg), macro-averaged (Macro Avg), and weighted-averaged (Wt. Avg) scores. Micro-averaged scores are the global average scores over all samples. Macro-averaged scores are the mean of all per-class scores. Weighted-averaged scores are the mean of all per-class scores weighted by class size.
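The three averaging schemes can be sketched in plain Python over flat lists of gold and predicted role labels (the labels below are hypothetical):

```python
from collections import Counter

def prf_report(gold, pred):
    """Per-label precision/recall/F1 plus micro-, macro-, and
    support-weighted F1 over flat lists of role labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    def scores(t, false_pos, false_neg):
        prec = t / (t + false_pos) if t + false_pos else 0.0
        rec = t / (t + false_neg) if t + false_neg else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return prec, rec, f1

    labels = sorted(set(gold) | set(pred))
    per_class = {l: scores(tp[l], fp[l], fn[l]) for l in labels}
    # Micro: pool all decisions; macro: average classes equally;
    # weighted: average classes by their gold support.
    micro = scores(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro_f1 = sum(s[2] for s in per_class.values()) / len(labels)
    support = Counter(gold)
    weighted_f1 = sum(per_class[l][2] * support[l] for l in labels) / len(gold)
    return per_class, micro, macro_f1, weighted_f1

gold = ["Object", "Object", "Agent", "Agent", "Time"]
pred = ["Object", "Agent", "Agent", "Agent", "Time"]
per_class, micro, macro_f1, weighted_f1 = prf_report(gold, pred)
```

The gap between micro and macro scores on imbalanced role inventories is exactly the effect discussed for the small classes below.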
As can be observed, the transformer-based method, WangchanBERTa, gives the best overall performance and performs best on most of the semantic roles, while SVM gives the worst results. This observation aligns with the recent progress in NLP, where the Transformer has rapidly become the dominant architecture [54] due to its superior performance over other deep learning architectures and traditional machine learning methods. Bi-LSTM-CRF seems to work well on semantic roles that usually appear next to or close to related words, like Accompaniment and Beneficiary; however, the transformer-based method still outperforms the other methods. It can also be noticed that the results of these methods are rather low on the semantic roles with a low number of instances, e.g., Beneficiary, Instrument, Accompaniment, Manner, and Experiencer. For WangchanBERTa, precision and recall differ greatly (ranging between 15% and 70%) on these classes. If the number of instances is low enough (i.e., the Beneficiary case), these small classes are ignored. Therefore, WangchanBERTa usually performs worse in terms of the F1 measure on these classes.
On the other hand, both the SVM and WangchanBERTa methods usually obtain high scores on classes with a high number of samples, such as Object, Location, Measure, and Time, but this is not true for Bi-LSTM-CRF: it performs best on the Accompaniment case in both data-splitting strategies while performing much worse on the other cases, which suggests overfitting to a specific class.
In OOV analysis, the results show no significant differences in the patterns. WangchanBERTa still gives the best overall results and performs best on most semantic roles. The Bi-LSTM-CRF method is affected the most by OOV words, as its scores drop the most among the three methods. The micro-averaged and weighted-averaged F1 scores of WangchanBERTa and SVM are only slightly affected by OOV words, decreasing by a few percentage points. However, the macro-averaged scores are heavily affected, since the effect of small classes is amplified, especially on the Beneficiary, Instrument, and Manner cases, where the F1 measures of all three methods are lower than 8%.

DISCUSSION
Our verb classification method based on association rule learning is used for grouping 112 unique frames of 490 unique verbs from the Thai concept frame dataset [50], which contains 184 unique argument structures in terms of semantic roles, into 28 verb classes and 40 subclasses. Since the dataset used in this study contains sentences about agricultural information from news and academic articles, most sample sentences are statements stating facts about agriculture. Therefore, the verb classification in our study is designed for informative sentences in the agricultural domain. Since, to the best of our knowledge, this work is the first of its kind, there are no previous works for comparison. A Thai VerbNet has never been built before, so there is no ground truth for quantitative evaluation either. Thus, only a qualitative evaluation of our novel method is discussed in terms of linguistic aspects.
Examples of computed verb classes are illustrated in Table 5. The first column indicates the class/subclass name. The naming pattern starts with the frame name, followed by the class number; if there are subclass or sub-subclass numbers, they are appended at the end, with all components separated by dashes. The second column shows the frame members of the verb classes or subclasses, where one possible English translation of the Thai frame name is given in parentheses. The last column displays the common argument structures of the verb classes or subclasses as sequences of verbs and their arguments described by their semantic roles. The qualitative evaluation is based on the syntactic and semantic coherence of verbs within the same classes/subclasses; no inter-class relationships are considered. Syntactic coherence is observed through the similarity of argument arrangement, while semantic coherence is inspected through the ability to use a broader context, like a theme or topic, to relate the verb members. An advantage of using frequent common argument structures for verb classification is that subclasses are guaranteed to contain the verb argument structures of the parent class. This property gives solid syntactic coherence over the verb class hierarchy, compared to the clustering method [46], which gives rather inconsistent results depending on the choice of clustering technique and similarity measure.
It can be observed that the verbs within the same computed classes and subclasses either share some kind of semantic characteristic or fall under the same concept. Some verb classes contain member verbs that share a direct similarity in semantic aspects. For example, the " -22" verb class contains the following frames: / / (learn), / / (grow), and /plìan/ (change). The meanings of all these verbs are directly related to the idea of development and transition. On the contrary, the verb members of some verb classes may not have direct links between them, but they can be seen as coming from the same theme. The verb class named " -11" contains the following frames: /thɛɛn/ (replace), /prɛɛrûup/ (transform), /khǎay/ (sell), /duu/ (look), and / / (protect). Their meanings are not directly related, unlike in the previous example, but all these frames are related to the concept of trading.
At the same time, it is noticeable that the members of different verb subclasses have some distinctions in semantic aspects. An example is the " -23-1" and " -23-2" verb subclasses. The " -23-1" verb subclass contains /hàn/ (cut), /plùuk/ (plant), and /phôn/ (spray), while the " -23-2" verb subclass contains /khâw/ (enter) and /yùu/ (live). The two verb subclasses are connected to two different concepts, planting and dwelling, respectively. Nonetheless, these two verb subclasses fall under the same verb class, and all these frames can be linked to the concept of farming.
The limitation of the method is that the verbs of some computed classes and/or subclasses are somewhat semantically incoherent. The issue may arise from the relatively small size of the corpus used for the experiments and from the shallow semantic representations derived from the SRL structure. First, the number of sample sentences in the corpus used in this study is small and all the samples are from the agricultural domain, so increasing the number of samples and including other domains may give a better verb classification and more semantic coherence. Second, the current features for representing concept frames rely solely on argument structure for syntactic features, while semantic features are captured only indirectly through the concept frame. This representation captures the syntactic behavior very well but not the semantic behavior, despite the fact that concept frames capture both syntactic and semantic aspects. As a result, it leads to incorrect or opaque classification/clustering results. If richer and more expressive semantic representations like [8] were adopted, they might enable in-depth linguistic semantic analysis and build verb classes with more semantic coherence.

CONCLUSION AND FUTURE WORK
In this article, we have presented the results of a preliminary investigation aiming at constructing the first Thai VerbNet via data enrichment of an existing frame semantics lexical resource.This involves automating the manual SRL step within the existing Thai concept frame construction and introducing the novel association rule learning method for verb classification to organize verbs into Thai VerbNet based on both semantic and syntactic properties.
The verb classification problem is formulated as association rule learning with verb argument structures as items. The main advantage is that it is an interpretable computational method, compared to previous methods such as clustering algorithms. Moreover, the verb members of the computed verb classes possess syntactic coherence and share some semantic properties.
Multiple SRL methods were surveyed and evaluated on the available Thai lexical resource. The deep learning techniques, especially the transformer-based method WangchanBERTa, were found to be the most effective, even in the presence of OOV words.
The presented methodology establishes a method for automatic Thai VerbNet construction. Additional work is required: selectional restrictions must be included in future work, and the evaluation was carried out on a relatively small corpus compared to similar lexical resources in other languages. Moreover, richer and more expressive semantic representations are needed to better capture the semantic aspects of verbs for building the verb hierarchy. We hope this work inspires other scientists to pursue this goal together in order to advance NLP research in the Thai language and other low-resource languages.


Table 1.
Definitions of Semantic Roles Used in This Study, from https://glossary.sil.org/bibliography

Case | Description
Accompaniment | A thing that participates in close association with an agent, causer, or affected in an event.
Instrument | A thing that an agent uses to implement an event. It is the stimulus or immediate physical cause of an event.
Location | A constituent that identifies the location or spatial orientation of a state or action, but it does not imply motion to, from, or across the location.
Manner | A constituent that notes how the action, experience, or process of an event is carried out.
Measure | A constituent that notes the quantification of an event.
Object | A constituent that is usually the object of the verb in a sentence (also known as Patient).
Source | A referent which is the place of origin, the entity from which a physical sensation emanates, or the original owner in a transfer.
Time | A constituent that notes the temporal placement of an event.

Table 2 .
Counts on the Datasets in Both Splitting Strategies

Table 3 .
Precision, Recall, and F1-Score of the Three Techniques Over Hold-Out Test Set

Table 5 .
Example of Computed Verb Classes