WOT-Class: Weakly Supervised Open-world Text Classification

State-of-the-art weakly supervised text classification methods, while significantly reducing the required human supervision, still require the supervision to cover all the classes of interest. This is rarely satisfied in practice, where humans explore new, large corpora without a complete picture. In this paper, we work on a novel yet important problem of weakly supervised open-world text classification, where supervision is needed only for a few examples from a few known classes and the machine should handle both known and unknown classes at test time. General open-world classification has been studied mostly in image classification; however, existing methods typically assume the availability of sufficient known-class supervision and strong prior knowledge of the unknown classes (e.g., their number and/or data distribution). We propose a novel framework, WOT-Class, that lifts these strong assumptions. Specifically, it follows an iterative process of (a) clustering text into new classes, (b) mining and ranking indicative words for each class, and (c) merging redundant classes by using the overlapping indicative words as a bridge. Extensive experiments on 7 popular text classification datasets demonstrate that WOT-Class consistently outperforms strong baselines by a large margin, attaining 23.33% greater average absolute macro-F1 than existing approaches across all datasets. Such competent accuracy illuminates the practical potential of further reducing human effort for text classification.


INTRODUCTION
Weakly supervised text classification methods, including zero-shot prompting, can build competent classifiers from raw texts by only asking humans to provide (1) a few examples per class [8,16] or (2) class names [5,17,25]. All these methods, however, require that the human-provided known classes cover all the classes of interest, which can be very difficult to satisfy, especially in the dynamic and ever-changing real world. For example, a human expert could be exploring a new, large corpus without a complete picture of it.
In this paper, we work on a novel yet important problem of weakly supervised open-world text classification, as shown in Figure 1. Specifically, the human is only asked to provide a few examples for every known class; the machine is tasked to dive into the raw texts, discover possible unknown classes, and classify all the raw texts into the corresponding classes, both known and unknown. The open-world setting relaxes the all-class requirement, further reducing the human effort needed for weakly supervised text classification. We argue that this problem is feasible because one can expect the unknown classes to follow a similar taste to the known classes, i.e., all classes should follow certain underlying high-level semantic meanings and share the same granularity level. For example, if the known classes are "Awesome" and "Good", one would expect to see classes like "Terrible" and "Bad"; in Figure 1, "Politics" is a reasonable choice for an unknown class.
Open-world classification [6,10,21,23,27] has been studied, mostly in image classification; however, existing methods typically assume the availability of sufficient known-class supervision and strong prior knowledge of the unknown classes (e.g., their number and/or data distribution). Text classification has its uniqueness: text is composed of words, some of which reflect the semantics of the classes. We propose a novel, practical framework WOT-Class 1, which lifts those strong assumptions of existing methods. Figure 2 illustrates the general idea of WOT-Class. It leverages the class-words in text to iteratively refine the clustering of text and the ranking of class-words. Specifically, we first overestimate the number of classes and construct initial clusters of documents based on the names of the known classes. Then, we employ an iterative process to refine these clusters. We first select a set of candidate class-words based on statistics and semantics. We then learn a classifier to provide each cluster with a ranking of class-words based on the limited known-class supervision. When there is redundancy among clusters, their high-ranked class-words will overlap, in which case we know at least one cluster is redundant. The refined set of class-words helps re-cluster the documents, and we repeat this process until the number of classes no longer decreases.
We conduct our experiments by fixing the most infrequent half of the classes as unseen, which emphasizes the imbalanced and emerging nature of real-world scenarios. Our extensive experiments on 7 popular text classification datasets show the strong performance of WOT-Class. By leveraging merely a few known-class examples and the names of the known classes, WOT-Class gains a 23.33% greater average absolute macro-F1 than the current best method across all datasets. Even when our predicted number of classes is given to the baselines as an extra input, WOT-Class still achieves a 21.53% higher average absolute macro-F1. While precisely discovering unseen classes identical to the ground truth remains challenging, our method provides predictions closest to the actual classes more stably than existing approaches. Moreover, since WOT-Class provides class-words for each discovered unknown class, it should require only a reasonable amount of human effort to map the discovered classes to something similar to the ground truth. Finally, WOT-Class is less sensitive to class imbalances, making it more suitable for real-world scenarios.

PRELIMINARIES
In this section, we formally define the problem of weakly supervised open-world text classification. We then brief on CGExpan and X-Class, two building blocks that we use in our method.

Problem Formulation. In an open-world setting, there exists a not-fully-known set of classes C that follows the same hyper-concept, and a set of documents D, each uniquely assigned to one class. A weakly supervised open-world model can observe partial information of C. In this work, we assume that partial information is given as a labeled few-shot dataset D_L = {(x_i, y_i)}_{i=1}^{n}, y_i ∈ C_s, where C_s ⊂ C is the known subset of classes and n is rather small (e.g., a ten-shot dataset would mean n = 10 · |C_s|). The goal of the model is to classify the remainder of the dataset, D_U = D \ D_L, where the labels in C_u = C \ C_s are completely unknown to the model. We emphasize that, different from extremely weakly supervised or zero-shot prompting-based text classifiers, the names of the unknown classes are also not given to the model.

CGExpan. Entity set expansion aims to expand a set of seed keywords (e.g., United States, China) to new keywords (e.g., Japan) following the same hyper-concept (i.e., Country). This is exactly the technique that helps us discover potential class-words that are highly suggestive of the hidden class names. In our method, we employ CGExpan [30], one of the current state-of-the-art methods for set expansion. CGExpan selects automatically generated hyper-concept words by probing a pre-trained language model (e.g., BERT), and further ranks all possible words guided by the selected hyper-concepts. However, a common problem of such set expansion methods is that they typically yield duplicated and semantically shifted entities, even at the top of the ranked list. In our work, we utilize CGExpan to find words semantically related to the user-given class names as candidates for the class-words. Our method resolves this imperfect set of candidates by ranking
them based on a scoring metric learned from the few-shot supervision.

X-Class. X-Class is an extremely weakly supervised text classification method that works with only the names of the classes [25]. It proposes a framework that learns representations of classes and text, and utilizes clustering methods, such as a Gaussian Mixture Model [19], to assign text to classes. While X-Class showed promising performance in closed-world classification settings with minimal supervision, it does not work in open-world settings. Our method reduces open-world text classification to closed-world classification through iterative refinement of class-words, so we employ X-Class as a strong-performing (and efficient) closed-world text classifier.

Static Representation. For each unique word w in the input corpus, we obtain its static representation s_w by averaging BERT's contextualized representations of all its occurrences. That is, s_w = (1/|O_w|) Σ_{w' ∈ O_w} t_{w'}, where O_w is the set of occurrences of word w in the corpus and t_{w'} is the contextualized representation of occurrence w'. Static representations are useful in determining the similarity and relatedness of two words.
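Assuming the contextualized vectors have already been extracted (e.g., from BERT's final layer), the averaging step behind static representations can be sketched as follows; the function name is ours, not part of the paper's released code:

```python
from collections import defaultdict
import numpy as np

def static_representations(tokens, contextual_vectors):
    """Average the contextualized vectors of every occurrence of each word.

    tokens: list of words, one entry per token position in the corpus.
    contextual_vectors: array of shape (len(tokens), dim), e.g. BERT outputs.
    Returns a dict mapping each unique word w to its static vector s_w.
    """
    buckets = defaultdict(list)
    for word, vec in zip(tokens, contextual_vectors):
        buckets[word].append(vec)
    # s_w is the mean over all contextualized occurrences of w
    return {word: np.mean(vecs, axis=0) for word, vecs in buckets.items()}
```

In practice one would run BERT over every document and feed the per-token hidden states in; the sketch only shows the aggregation.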

OUR WOT-CLASS METHOD
In this section, we introduce our WOT-Class framework. To accommodate unknown classes, a common approach [6] is to first overestimate the number of classes and then reduce it after observing the data. We follow this approach and integrate it with class-words, with the goal of reducing the problem to an existing weakly supervised text classification (WS-TC) problem: solutions exist to classify text in a closed-world setting when class-words are given for each class and all classes are known.
In simple words, WOT-Class first proposes a set of high-potential words from which class-words can be found, and a list of class-words for an overestimated number of classes. At each iteration, we start with clusters of documents obtained by WS-TC methods on the proposed class-words, and rerank the high-potential words to find a new list of class-words for each class. During this, classes with similar class-words are removed, and new clusters of documents are obtained by WS-TC methods. The iteration stops when no class removal happens. We summarize our iterative refinement framework in Algorithm 1, and Figure 3 shows an overview of its sub-components.
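The outer loop just described can be sketched in miniature. This is a toy stand-in: cluster contents are plain word sets, the function name `refine` is ours, and the overlap test replaces the full pipeline, which would re-cluster documents with a WS-TC method after each removal round:

```python
def refine(clusters, coherence):
    """Toy sketch of WOT-Class's outer loop: repeatedly drop one of any
    two clusters whose class-word lists overlap, keeping the one with
    higher coherence, until no removal happens.

    clusters: dict mapping cluster id -> set of class-words.
    coherence: dict mapping cluster id -> coherence score.
    """
    clusters = dict(clusters)
    changed = True
    while changed:                      # iterate until no cluster is removed
        changed = False
        ids = list(clusters)
        for i in ids:
            for j in ids:
                if i >= j or i not in clusters or j not in clusters:
                    continue
                if clusters[i] & clusters[j]:       # overlapping class-words
                    # remove the less coherent of the redundant pair
                    drop = i if coherence[i] < coherence[j] else j
                    del clusters[drop]
                    changed = True
    return clusters
```

The real framework additionally reranks class-words and refreshes document representations between rounds; this sketch isolates only the termination logic.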

Algorithm 1 WOT-Class's Iterative Refinement Framework
Input: clusters C, document representations R
1: while there are still redundant clusters do
2:   for each cluster c do
3:     Find class-indicative words W
4:     Train MLP and rank W
5:     Select possible names S_c from W
6:     Compute cluster coherence Φ_c (Eq. 1)
7:   end for
8:   for each pair S_i, S_j with S_i ∩ S_j ≠ ∅ do
9:     Remove the less coherent of clusters i and j
10:  end for
11:  Update R, C based on S
12: end while

Initial Overestimation and Clustering
In WOT-Class, we take the approach of making an initial overestimation of the number of classes, and then refining the class-words and clusters of text. We fix a number N (N = 100 in all experiments) and ask CGExpan (see Section 2) to suggest N − |C_s| words similar to the given |C_s| class names. We consider these a rough approximation of the classes in the corpus, and employ X-Class (see Section 2) to construct the initial clusters of text, which may contain many duplicate clusters and clusters of different granularity. From here on, our iterative method alternates between two processes. In the first process, we obtain a list of class-words for each cluster and remove duplicate clusters; in the second process, we simply apply X-Class to refine the clusters based on the new class-words. We mainly elaborate on the first process.

Cluster → Class-words
In the first process, we start with the clusters of text, the words initially suggested by CGExpan, the class-words from the last iteration (for the first iteration, these are the CGExpan-proposed words), and the few-shot supervision, and we aim to reduce the number of clusters and assign class-words to each cluster.

Proposing Potential Class-words. Class-words are words that are related to and highly indicative of the class. Words from CGExpan satisfy the relatedness requirement, but we also wish to find words that are rather exclusive to one cluster (or a few, because of noise in the clusters). The indicativeness of a word can be expressed by how statistically prominent it is in its cluster of text compared to other clusters. Such statistical measures are well researched in information retrieval, the representative one being the tf-idf [18] score. We apply a more recent measure that has been used in text classification [14] to find statistically representative words within cluster c. The measure is a product of three terms, where tf_c(w) denotes the number of occurrences of word w in documents belonging to cluster c, tf(w) the total frequency of word w, and n_c the number of documents in cluster c: the first term tells how indicative a word is of the cluster, the second term measures how frequent the word is, and the third is a normalization based on the inverse document frequency. We take the top statistically representative words for each cluster and merge them with the words from CGExpan as the set of potential class-words.

Class-word Ranking. We used CGExpan and statistical representativeness to approximately retrieve a list of potential class-words; we now rank them precisely by learning a metric that defines the similarity between a cluster and a word.
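Before the ranking details, the word-proposal step above can be sketched with a plain tf-idf-style score that has the same three ingredients (cluster indicativeness, frequency, and idf normalization). This is a stand-in, not the exact measure of [14], and the function name is ours:

```python
import math
from collections import Counter

def indicative_words(cluster_docs, all_docs, top_k=5):
    """Tf-idf-style stand-in for statistical representativeness: score each
    word in the cluster by (share of its occurrences inside the cluster)
    * (log word frequency) * (inverse document frequency)."""
    tf_cluster = Counter(w for doc in cluster_docs for w in doc)
    tf_all = Counter(w for doc in all_docs for w in doc)
    df = Counter(w for doc in all_docs for w in set(doc))
    n_docs = len(all_docs)
    scores = {
        w: (tf_cluster[w] / tf_all[w])       # indicativeness to the cluster
           * math.log(1 + tf_cluster[w])     # overall frequency term
           * math.log(n_docs / df[w])        # idf normalization
        for w in tf_cluster
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Any score of this shape rewards words that are frequent overall yet concentrated in one cluster, which is the property the proposal step needs.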
Specifically, we construct features for a potential class-word with respect to a cluster as the mean and variance of the Euclidean distance and of the cosine similarity between the static representation of the class-word and those of a large list of m = 50 statistically representative words for the cluster 4. Since we have a few labeled examples, they serve as virtual clusters, and we treat the features of their class names with respect to the corresponding virtual clusters as positive signals of closeness. The negative signals are derived through an intuitive heuristic: for each known class name, we take the most dissimilar word (i.e., the word with the furthest static representation) from the set of potential class-words. With these two signals, we train a Multilayer Perceptron (MLP) on the features, with a logistic regression objective, to predict the signal. We then assign a score f(w, c) to each cluster c and each word w from the high-potential words.
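The feature construction can be sketched as follows; `pair_features` is a hypothetical name, and in the full system these four numbers would be fed to the small MLP trained on the positive and negative signals:

```python
import numpy as np

def pair_features(word_vec, rep_word_vecs):
    """Features describing how close a candidate class-word is to a
    cluster's statistically representative words: mean and variance of
    the Euclidean distance and of the cosine similarity (4 numbers)."""
    rep = np.asarray(rep_word_vecs)
    dists = np.linalg.norm(rep - word_vec, axis=1)
    cos = rep @ word_vec / (
        np.linalg.norm(rep, axis=1) * np.linalg.norm(word_vec))
    return np.array([dists.mean(), dists.var(), cos.mean(), cos.var()])
```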
We also propose a post-processing step to remove generic words from the ranking. Following previous work [12], we design a penalty coefficient p(w, c) for each candidate class name w in cluster c based on inter-class statistics:

p(w, c) = log( M{r_1(w), . . . , r_k(w)} / r_c(w) ),

where r_c(w) is the absolute rank of w in cluster c based on the MLP's prediction, M{•} denotes the median value of a set, and k is the number of clusters in the current iteration.

4 A larger list of statistically representative words is employed for each cluster to detect closeness with potential class-words; when proposing class-words, we only select the top statistically representative words in each cluster.
The main idea of this formula is to penalize generic words (e.g., "life", which may rank high in most clusters) from being selected as class-words. The numerator reflects how the word behaves across all clusters, while the denominator reflects how it behaves in the specific cluster. For a generic word, the median rank will be very close to the rank in any specific cluster. Note that we allow one word to be the class name of several clusters because of the initial overestimation, but if a word ranks high in more than half of the clusters, it is considered a generic word that must be filtered. Such penalization and normalization are similar to the idf term in information retrieval, so we follow that design and divide the two values and take the logarithm. Like idf, this penalty coefficient lowers the chance of selecting a generic word but does not harm proper words.
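Assuming the penalty takes the log-ratio form described above (a form we reconstruct from the text: median rank across clusters over rank in the specific cluster), a minimal sketch is:

```python
import math
from statistics import median

def generic_word_penalty(ranks, cluster):
    """Penalty for generic words, reconstructed from the description:
    log of (median rank of the word across all clusters) over (its rank
    in this cluster). A word that ranks high (small rank number) almost
    everywhere has a median close to its own rank, so the penalty
    approaches log(1) = 0 and the word is effectively filtered out.

    ranks: dict mapping cluster id -> absolute rank of the word there.
    """
    return math.log(median(ranks.values()) / ranks[cluster])
```

A genuinely class-specific word ranks high in one cluster and low elsewhere, so its median rank is large relative to its cluster rank and the penalty leaves it near the top.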
The final indicativeness ranking is based on the product of the two scores, ind(w, c) = f(w, c) · p(w, c).

Removal of Clusters. We finally discuss how we remove clusters based on the class-words found. In simple terms, we remove clusters whose class-word lists have non-empty intersections. When a cluster is removed, all its documents are discarded until they are reassigned to new clusters in the second, text-clustering process.
Precisely, we pick the t highest-ranked class-words of each cluster to compare, where t is the number of iterations of the removal process so far, a simple way to inject the prior that cluster quality improves after each iteration. We noticed that certain clusters might not have enough good class-words, so we introduce a cutoff threshold δ such that we do not pick words whose ratio of indicativeness score to the highest indicativeness score in the cluster is low. Then, when two lists of class-words overlap, we retain one cluster (or, in other words, one list of class-words) and remove the other. We remove the cluster with the lower coherence Φ, which measures the closeness of the text in the cluster and is computed from the list of text representations R belonging to the cluster (provided by X-Class).
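The top-word selection with the cutoff threshold δ can be sketched as follows; the function name and data layout are ours:

```python
def top_class_words(ranked_words, scores, t, delta=0.7):
    """Pick up to t highest-ranked class-words for a cluster, dropping
    any word whose indicativeness falls below delta times the best
    indicativeness score in the cluster."""
    best = scores[ranked_words[0]]                 # highest score in cluster
    return [w for w in ranked_words[:t] if scores[w] >= delta * best]
```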
We also rerank the class-words after removing the duplicate clusters, and continue until no clusters require removal.

Iterative Framework and Final Classifier
The whole iterative framework simply applies the first process (class-word suggestion and cluster deduplication) and the second process (class-word-based text clustering) in turn, until the first process no longer removes clusters.
After exiting the iterative loop and obtaining a stable set of clusters, we follow the common approach in weakly supervised text classification [14,16,17,25] and train a final text classifier on the pseudo-labels assigned to each text. This usually improves performance, as the fine-tuned classifier mitigates some of the noise in the pseudo-labels.
Sentiment analysis is also popular in text classification. However, most sentiment analysis settings explored with weak supervision are coarse-grained [17,25], with too few classes (e.g., positive and negative), which is not practical for open-world class detection.

Compared Methods
We compare our method with three open-world classification methods: Rankstats+, ORCA, and GCD.
Rankstats [10] (aka AutoNovel) is the first approach crafted for open-world classification without relying on any human annotation of the unseen classes. It tackles the task by jointly learning to transfer knowledge from labeled to unlabeled data with ranking statistics. Since its original setting requires the labeled and unlabeled classes to be disjoint, we follow the process in the GCD paper [23] to adapt it to our setting and name the result Rankstats+. ORCA [6] is a general method for open-world semi-supervised classification, which further reduces the supervision on the seen classes. It utilizes an uncertainty-adaptive margin to reduce the learning gap between seen and unseen classes. GCD [23] is also semi-supervised; it utilizes contrastive representation learning and clustering to directly provide class labels, and improves Rankstats' method of estimating the number of unseen classes. To adapt these three methods to the text domain, we use BERT on top of their frameworks to obtain document representations as training features. And since their self-supervised learning components are grounded in visual data, we use SimCSE [9] in the text domain instead.
We also propose two other baselines. BERT is known to capture the domain information of a document well [1,24]. We thus design BERT+GMM, which fits a Gaussian Mixture Model over all classes on the CLS token representations obtained after fine-tuning on the labeled dataset. We also propose BERT+SVM, which first utilizes a Support Vector Machine [11] to find all outlier documents based on the CLS token representations, then classifies documents belonging to seen classes with a BERT classifier and clusters the outlier documents.

Experimental Settings
Table 2: Evaluations of compared methods and WOT-Class. The overall mean micro-/macro-F1 scores over three runs are reported. We also report performances for seen and unseen classes separately in Table 5.

For the basic experiments, we split the classes into half seen and half unseen. We set the most infrequent half of the classes (i.e., the classes with fewer documents) as unseen, which emphasizes the imbalanced and emerging nature of real-world scenarios. Among the seen classes, we give 10-shot supervision, i.e., 10 labeled documents for each seen class, with the rest unlabeled (Figure 4). For WOT-Class and all compared methods in our experiments, we utilize the pre-trained bert-base-uncased model provided in Huggingface's Transformers library [26]. All experiments are conducted using a 32-core processor and a single NVIDIA RTX A6000 GPU. For WOT-Class, the default hyper-parameter settings are N = 100, m = 50, and δ = 0.7. The analysis of hyper-parameter sensitivity is presented in Section 4.4.
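The split protocol above can be sketched as follows; the function name and return format are illustrative, not taken from the paper's code:

```python
def make_open_world_split(class_counts, shots=10):
    """Split classes per the paper's protocol: the most infrequent half
    of the classes is unseen, and each seen class contributes `shots`
    labeled documents.

    class_counts: dict mapping class name -> number of documents.
    Returns (seen, unseen, labeled_budget), where labeled_budget is the
    total number of labeled documents used as supervision.
    """
    ordered = sorted(class_counts, key=class_counts.get, reverse=True)
    n_seen = len(ordered) // 2          # most frequent half is seen
    seen, unseen = ordered[:n_seen], ordered[n_seen:]
    return seen, unseen, shots * len(seen)
```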
Since all compared methods require the total number of classes as input, we evaluate them in two ways.

• Our Estimation (OE): To ensure the comparison is fair to the baseline methods, we also run all the baselines with our predicted number of classes (over 3 runs).
• Baselines' Estimation: Rankstats+, ORCA, and GCD can also start from an overestimated number of initial classes and produce their own predictions of the number of classes, so we additionally test these three methods starting from N = 100 classes, the same as our method. Since further experiments show their predictions do not work well on most of our datasets, we do not test the other baselines with their estimations.

Evaluation. Several evaluation criteria, such as accuracy or NMI-based clustering metrics, have been used in previous work [6,10,23]. However, they were proposed for balanced settings and are biased toward the popular classes when the classes are imbalanced (i.e., misclassifying infrequent unseen classes incurs little penalty). Since we argue that in open-world classification the new, emerging classes are naturally the minority classes, these metrics are not suitable.
Therefore, we propose a new evaluation criterion based on the F1 score to better demonstrate results. Since the final number of classes produced by a method may not equal the ground truth, a mapping from the predicted clusters to the actual classes is required. Given the clusters of documents provided by a method and the ground-truth classes of the documents, we first perform a maximum-weight bipartite matching between the method-provided clusters and the ground-truth classes, where the edge weights are the numbers of overlapping documents between clusters and classes. The matched clusters are assigned to the corresponding classes; this step guarantees that every class receives some predictions. Each remaining cluster is then assigned to the class with which it exhibits the maximal overlap. Formally, consider a matrix M, where M_{i,j} denotes the number of texts in cluster i that belong to class j; each remaining cluster i receives the assignment A_i = argmax_j M_{i,j}. After assigning all clusters to classes, the F1 score is computed instance-wise on the texts as the evaluation criterion for classification performance.
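A minimal sketch of this cluster-to-class assignment, using brute-force enumeration in place of a proper bipartite-matching solver (adequate only for small class counts; the names are ours):

```python
from itertools import permutations

def assign_clusters(M):
    """Map method clusters to ground-truth classes for evaluation.
    M[i][j] = number of texts in cluster i that belong to class j.
    First find the maximum-weight one-to-one matching between clusters
    and classes (brute force; requires len(M) >= len(M[0])), so every
    class receives at least one cluster; each remaining cluster then
    goes to the class with which it overlaps most."""
    n_clusters, n_classes = len(M), len(M[0])
    best, best_match = -1, None
    # one-to-one part: try every choice of clusters covering the classes
    for perm in permutations(range(n_clusters), n_classes):
        w = sum(M[perm[j]][j] for j in range(n_classes))
        if w > best:
            best, best_match = w, perm
    assignment = {best_match[j]: j for j in range(n_classes)}
    for i in range(n_clusters):
        if i not in assignment:        # leftover cluster: argmax overlap
            assignment[i] = max(range(n_classes), key=lambda j: M[i][j])
    return assignment
```

For realistic numbers of classes, a polynomial-time solver (e.g., the Hungarian algorithm) would replace the permutation loop; the assignment logic is otherwise the same.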

Experimental Results
WOT-Class Performance. We assess the weakly supervised open-world performance of WOT-Class against the baselines. Table 2 contains the overall comparisons; Tables 5 and 6 further provide performance on seen and unseen classes. Specifically, WOT-Class outperforms BERT+GMM and BERT+SVM across all 7 datasets for both seen and unseen classes, even though they are given the matching number of classes as input. This strengthens the need for our iterative refinement process, since merely applying few-shot tuning does not perform as well. Moreover, WOT-Class performs noticeably better than the general methods Rankstats+, ORCA, and GCD under the same circumstances. Even when the matching number of classes is given to them as input, WOT-Class consistently outperforms them on all datasets, except for the seen part of DBpedia.

Imbalance Tolerance. As generic solutions, ORCA and GCD with our predicted number of classes trail WOT-Class only slightly on the balanced DBpedia dataset, while underperforming far more severely on the other, imbalanced datasets. To gain further insight, we conduct experiments on the imbalance tolerance of WOT-Class and these compared methods. As shown in Table 4, we construct three imbalanced DBpedia datasets with different degrees of imbalance. This is achieved by reducing the number of samples in each class by a linearly increasing ratio Δ. For example, when Δ = 4%, the classes keep 100%, 96%, 92%, . . . of their original documents. We choose the ordering of classes randomly but keep it fixed across the Low, Medium, and High experiments.

Table 7: Predictions of the number of classes. The average and standard deviation over three runs are reported. The average offset refers to the average of the absolute discrepancies between each prediction and the ground-truth value.
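The per-class retention ratios used to build these imbalanced splits can be computed as follows (a trivial sketch; the function name is ours):

```python
def retained_fractions(n_classes, delta):
    """Per-class fraction of documents kept when building the imbalanced
    splits: the i-th class keeps 100% - i * delta of its documents, so
    retention decreases linearly across the class ordering."""
    return [1.0 - i * delta for i in range(n_classes)]
```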

By design, the classes with a larger number of documents are the seen classes. Figure 5 and Table 3 show the results of WOT-Class and the compared methods on the constructed DBpedia datasets. ORCA and GCD are sensitive to imbalanced classes, especially on the unseen part of the data. Even when reporting the Pareto-optimal results over their own predictions and our estimation, their overall performance still drops by 11.31% and 8.03% respectively as the data distribution becomes more imbalanced, while our method experiences a relative drop of only 5.52%. This experiment shows that WOT-Class is more robust to the imbalanced classes common in real-world text datasets (e.g., the imbalance ratio of NYT, collected from NYT News, is 16.65).

Prediction of the Number of Classes. WOT-Class starts with an initial guess of the number of classes and removes redundant ones iteratively; the number of remaining classes is its prediction of the total number of classes. As shown in Table 7, in most cases WOT-Class's final predicted number of classes is around 2 to 4 times larger than the ground truth, which is affordable for human inspection. The estimation also turns out to be reasonable: as shown in Table 8, WOT-Class overestimates because its predicted classes are fine-grained versions of the ground-truth classes. For example, DBpedia's artist class can be split into two classes related to painting and performing, respectively. Moreover, our class-words are highly related to (or even identical to) the ground-truth class names and are human-understandable. So, based on our predicted classes with class-words, users can simply find the underlying sub-classes and decide whether to merge them.
As baselines, Rankstats+, ORCA, and GCD can also start with an overestimated number of classes and produce a prediction of the classes. However, given the same initial number of classes, Rankstats+ struggles to discover any classes beyond the seen ones, and ORCA hardly eliminates enough classes to give the user an intuitive understanding of the dataset's text distribution. GCD provides more reasonable predictions, but compared to our approach its predictions still deviate substantially from the ground truth and are much more unstable. This indicates that these methods' ability to estimate the number of classes in the few-shot setting is not reliable.

Hyper-parameter Sensitivity. WOT-Class has three hyper-parameters, N, m, and δ, whose default values are given in Sec. 4.3. To further explore the stability and robustness of WOT-Class, we conduct a hyper-parameter sensitivity study on three datasets with different imbalance rates (20News, NYT-Small, and Yahoo) to study how fluctuations in the hyper-parameters influence the performance of our method. The experiment is performed using a fixed random seed (42) for reproducibility. We report the overall macro-F1 scores for 5 distinct values of each hyper-parameter. As illustrated in Figure 6, the performance fluctuations remain within reasonable margins, generally under 5%. Our method therefore does not require fine-tuning these hyper-parameters.

RELATED WORK
Open-world Learning. Traditional open-world recognition methods [3,4,20] aim to incrementally extend the set of seen classes with new unseen classes. However, these methods require human involvement to label the new classes.
Recently, Rankstats [10] first defined the open-world classification task with no supervision on unseen classes in the image domain and proposed a three-stage method. The model begins with self-supervised pre-training on all data to learn low-level representations. It is then trained with full supervision on the labeled data to glean higher-level semantic insights. Finally, joint learning with ranking statistics transfers knowledge from labeled to unlabeled data. However, Rankstats still requires full supervision on the seen classes to reach high performance.
Following that, ORCA [6] and GCD [23] defined open-world semi-supervised classification and proposed general solutions that further improved the framework of Rankstats and alleviated the burden of manual annotation. However, these methods' performance is not robust enough for the few-shot task and the imbalanced data distributions of the text domain. In contrast, our work is applicable to infrequent classes and exploits the fact that the input consists of words, some of which are class-indicative.

Extremely Weak Supervision in NLP. Aharoni and Goldberg [1] showed that the average of BERT token representations can preserve a document's domain information. X-Class [25] followed this idea to propose the extremely weak supervision setting, where text classification relies only on the name of the class as supervision. However, such methods cannot transfer to open-world classification naively, as they cannot detect unseen classes. Our method leverages such extremely weak supervision methods as a subroutine to help cluster documents. Importantly, we note that they cannot be applied straightforwardly, as they are also sensitive to noise and to overly similar classes. We show that our general idea of using class-words can further help an extremely weak supervision method obtain stable performance.

Joint Clustering with Downstream Tasks. In some sense, our method partly leverages an idea called joint clustering: some recent works [2,7] in the image domain achieved high performance by jointly performing clustering and image classification. Their main idea is to utilize clustering to extract the hidden information in image representations and generate pseudo-labels, which in turn provide supervision for classification training and ultimately guide the co-improvement of representations and clustering. The crucial difference, however, is that their methods already know the predefined classes and depend heavily on strong assumptions (e.g., that all classes share the same size) to obtain excellent performance. Conversely, WOT-Class utilizes the general idea of joint clustering in an open-world setting where the classes may be too fine-grained and noisy. We address these unique challenges via the class-words we propose and show that our methodology can not only estimate a precise number of classes but also tolerate imbalanced data distributions.

CONCLUSIONS AND FUTURE WORK
In this paper, we introduce the challenging yet promising task of weakly supervised open-world text classification. We identify the key challenges and unique opportunities of this task and propose WOT-Class, which achieves quite decent performance with minimal human effort. Specifically, WOT-Class starts with an overestimated number of classes and constructs an iterative refinement framework that jointly performs class-word ranking and document clustering, leading to iterative mutual enhancement. Consequently, WOT-Class can progressively extract the most informative classes and assemble similar documents, resulting in an effective and stable open-world classification system, as validated by comprehensive experiments. In the future, we envision that open-world text classification can be conducted with even less manual annotation, for example, by only requiring user-provided hyper-concepts (e.g., Topics, Locations) or custom instructions. This will further reduce the cost of classification systems and extend their applicability.
In summary, this paper presents an initial exploration of open-world text classification, including problem formulation, methodology, and empirical results. We hope this work can inspire more research in open-world learning for NLP. As an emerging field, open-world classification demands more algorithms, datasets, and evaluation metrics to truly unleash its potential.

Figure 1:
Figure 1: Weakly Supervised Open-World Text Classification. We aim to cluster text in a corpus where only a few classes have few-shot supervision and known class names.

Figure 2 :
Figure 2: An overview of the WOT-Class framework. Given the corpus and a partial set of labels, we first estimate document representations and construct the initial clusters. Then, we perform iterative cluster refinement to remove redundant clusters. At the end of each iteration, we update the document representations and re-cluster them.

Figure 4 :
Figure 4: Schematic diagram of the corpus split. Only 10 samples in the popular classes are provided as training labels.

Figure 5 :
Figure 5: Overall performance of compared methods and WOT-Class on different imbalance degrees.

Figure 6 :
Figure 6: Hyper-parameter sensitivity study on 20News, NYT-Small, and Yahoo. The overall macro-F1 scores using a fixed random seed are reported.

Our contributions are as follows.
• We introduce the novel yet important problem of weakly supervised open-world text classification.
• We propose a novel, practical framework WOT-Class that jointly performs document clustering and class-word ranking and can discover and merge unknown classes.
• Extensive experiments demonstrate that WOT-Class outperforms the previous methods in various manners; its competent accuracy also illuminates the practical potential of further reducing human effort for text classification.

Reproducibility. We release the code and datasets on Github 2.

1 WOT-Class means "What class? Let's discover!"

Table 1:
An overview of our datasets. The imbalance factor refers to the ratio of sample sizes between the most frequent class and the least frequent one in the dataset.

Table 3:
Performance of compared methods and WOT-Class on different imbalance degrees. We report macro-F1 scores to more effectively demonstrate the results under an imbalanced data distribution. For all compared methods, we report their Pareto-optimal results starting with 100 classes and with our estimation.

Table 4 :
An overview of the imbalanced DBpedia datasets with 14 classes.

Table 5 :
Performance on seen classes. The mean micro-/macro-F1 scores over three runs are reported.

Table 8 :
Examples of the class-words. We use '[]' to separate class-words belonging to different clusters in WOT-Class.