A Single Vector Is Not Enough: Taxonomy Expansion via Box Embeddings

Taxonomies, which organize knowledge hierarchically, support various practical web applications such as product navigation in online shopping and user profile tagging on social platforms. Given the continued and rapid emergence of new entities, maintaining a comprehensive taxonomy in a timely manner through human annotation is prohibitively expensive. Therefore, expanding a taxonomy automatically with new entities is essential. Most existing methods for expanding taxonomies encode entities into vector embeddings (i.e., single points). However, we argue that vectors are insufficient to model the "is-a" hierarchy in a taxonomy (an asymmetrical relation), because two points can only represent pairwise similarity (a symmetrical relation). To address this, we propose to project taxonomy entities into boxes (i.e., hyperrectangles). Two boxes can be "contained", "disjoint" or "intersecting", thus naturally representing an asymmetrical taxonomic hierarchy. Building on box embeddings, we propose a novel model, BoxTaxo, for taxonomy expansion. The core of BoxTaxo is to learn boxes for entities that capture their child-parent hierarchies. To achieve this, BoxTaxo optimizes the box embeddings from a joint view of geometry and probability. BoxTaxo also offers an easy and natural way for inference: examine whether the box of a given new entity is fully enclosed inside the box of a candidate parent from the existing taxonomy. Extensive experiments on two benchmarks demonstrate the effectiveness of BoxTaxo compared to vector-based models.


Introduction
A taxonomy is a schema of hierarchical classification, which organizes conceptual entities into a tree-like structure according to their semantics. Taxonomies have been widely adopted to support various web services because they index and organize knowledge effectively. For example, Amazon has a product taxonomy to facilitate online shopping [Mao et al., 2020], and Pinterest uses a taxonomy to enhance content understanding and recommendation [Gonçalves et al., 2019; Manzoor et al., 2020]. Many taxonomies were initially curated by domain experts; however, due to the constant and rapid growth of new concepts, automatically expanding existing taxonomies with these new entities is necessary to avoid their obsolescence. For consistency with the existing literature, we follow [Shen et al., 2020; Yu et al., 2020] and refer to a child as a query and a parent as an anchor. These terms are used interchangeably throughout this paper.
Existing approaches for taxonomy expansion focus on capturing the child-parent hierarchies. Early efforts learn the hierarchies by exploiting the semantic relatedness between two entities. The semantics can be represented by lexical patterns [Snow et al., 2004; Hearst, 1992] or, later, the more powerful distributional word embeddings [Mikolov et al., 2013; Chang et al., 2018; Fu et al., 2014]. Beyond semantics, recent works have further explicitly modeled the tree structure of the taxonomy. They use various structural summaries, including paths [Yu et al., 2020; Liu et al., 2021; Jiang et al., 2022] and local graphs [Wang et al., 2021; Shen et al., 2020; Mao et al., 2020], as additional signals to enhance the learning of child-parent hierarchies. Hierarchies also align well with the geometric properties of hyperbolic space [Ganea et al., 2018; Nickel and Kiela, 2017]; accordingly, several works [Aly et al., 2019; Ma et al., 2021] model the child-parent relations by learning hyperbolic representations.
The core methodology of most aforementioned approaches is to learn vector embeddings for the entities in a taxonomy. The child-parent relation is then inferred by computing the relatedness of a pair of entities based on their vector embeddings. However, vector embeddings, i.e., points in a geometric space, can only represent pairwise similarity, which is a symmetrical relation (similarity is usually measured by the distance, either Euclidean or geodesic, between two points). The taxonomic child-parent hierarchies, on the contrary, are naturally asymmetrical. Therefore, vector-based embeddings are not sufficient to represent the hierarchies in a taxonomy, limiting their effectiveness in taxonomy expansion.
To overcome this insufficiency, we propose to use boxes instead of vectors to represent the entities in the taxonomy. A box is an axis-aligned hyperrectangle in a geometric space, which can be characterized by two points. Unlike a single point, a box covers a geometric region, which enables it to represent more complicated asymmetrical pairwise relations such as "enclose", "disjoint" and "intersect". Specifically, a child box is entirely enclosed inside its parent box (e.g., "Graph Neural Network" and "Machine Learning"). Two entities are fully separated if they are not in a child-parent hierarchy (e.g., "Programming Language" and "Machine Learning"). The boxes of two entities overlap if they share common children in the taxonomy (e.g., "Computer Vision" and "Machine Learning").
Despite the natural and intuitive representation of taxonomic hierarchies, box embeddings for taxonomy expansion still face three main challenges. First, limited taxonomy annotation is available for new entities, making it difficult to learn accurate boxes and infer their positions in the taxonomy in a supervised manner. Second, most existing box embedding approaches optimize boxes by capturing probabilistic properties, which have proven difficult to train in practice [Li et al., 2018; Vilnis et al., 2018]. The reason is that box pairs which should "enclose" or "intersect", but are wrongly disjoint during training, will never be corrected, because the gradients of the probabilistic loss function are zero in this case. [Li et al., 2018; Dasgupta et al., 2020] mitigate this issue by representing the edges of boxes as probabilistic density distributions, i.e., making the boxes "soft". However, such "soft" boxes lose the intuitive interpretability of normal "hard" boxes. Third, unlike reasoning within an existing structure, taxonomy expansion requires learning boxes for new entities. Therefore, a desired model should be generalizable, i.e., able to generate box embeddings for new entities that are compatible with the existing taxonomy.
In this paper, we propose BOXTAXO, a self-supervised model that expands a taxonomy with box embeddings. With self-supervised learning, our model does not require annotated labels but creates training samples from the existing taxonomy. Specifically, each ⟨child, parent⟩ pair in the existing taxonomy is treated as a positive sample. The entities that are not ancestors of a child are collected as its negative samples. To optimize the box embeddings, we propose a joint loss function that guides the boxes to capture the taxonomic hierarchies from both the geometric view and the probabilistic view. The joint-view loss function avoids the missing-gradient issue mentioned above while keeping the boxes intuitive and interpretable to humans. The box embeddings are encoded via a pre-trained language model to ensure generalizability to new entities. At inference time, box embeddings offer an easy and natural way to find an appropriate anchor for a query: checking whether the box of a candidate parent fully contains the box of the query. We implement this from the probabilistic view in BOXTAXO.
Our main contributions are summarized as follows: 1) We propose to use box embeddings for taxonomy expansion, which can accurately represent the hierarchies in a taxonomy. 2) We develop a self-supervised model that optimizes the box embeddings through joint learning of geometry and probability. 3) We conduct an extensive set of experiments on two real-world taxonomies. Experimental results demonstrate the effectiveness of BOXTAXO compared to vector-based representations. We also provide various ablation studies and analyses to understand how BOXTAXO works.

Box Projection
The nodes in a taxonomy are conceptual entities. To represent each entity as a box in the latent space, BOXTAXO uses a two-stage projection process. Specifically, an entity is first encoded from natural language into a numeric embedding by an entity encoder, and then converted into a box (an axis-aligned hyperrectangle) by a box projector. We now introduce these two components in detail.
Entity Encoder. Pre-trained language models (PTMs) have achieved promising results in many natural language tasks [Qiu et al., 2020]. Encouraged by their success, we use PTMs to encode the entities into embeddings. Without loss of generality, we use Bert [Devlin et al., 2019] as the entity encoder in this paper. Formally, for the i-th entity e_i, Bert converts it into a k-dimensional representation n_i ∈ R^k:

n_i = \mathrm{Bert}(e_i).   (1)

The entities in a taxonomy are usually curated and thus may have definition sentences. Therefore, for such a definition sentence s_i, we concatenate it with its entity e_i and build the input of Bert as "[CLS] e_i, s_i [SEP]". We then use the output embedding of "[CLS]" in the final Bert layer as the representation n_i of entity e_i. The representation n_i encodes the contextual semantics of the entity. Note that other pre-trained language models, such as Roberta [Liu et al., 2019] and ELECTRA [Clark et al., 2020], can flexibly replace Bert as the encoder.

Box Projector. We then project the entity representation n_i into box embeddings. A box can be defined by two points (i.e., vectors). We therefore characterize a box by its center point c_i ∈ R^d and its offset vector o_i ∈ R^d, where d is the dimension of the box embeddings. Note that c_i and o_i are just two vector embeddings. To represent an entity e_i as a box embedding b_i, we project the entity representation n_i into the center c_i and the offset o_i separately. Specifically, because this projection is only a dimension transformation between two embeddings, we simply use two multilayer perceptrons (MLPs) as the projectors:

c_i = \mathrm{MLP}_c(n_i), \quad o_i = \mathrm{MLP}_o(n_i),   (2)

where MLP_c and MLP_o are the projection layers for the center c_i and the offset o_i, respectively. To ensure the learned box b_i is a valid hyperrectangle, we further apply an exponential operator to the offset o_i, so that every dimension of o_i is guaranteed to be larger than 0.
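To make the projection concrete, below is a minimal PyTorch sketch of the two-stage pipeline (entity encoder plus box projector), assuming Hugging Face transformers. The class name, MLP sizes and box dimension are illustrative choices of ours, not prescribed by the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BoxProjector(nn.Module):
    """Encode an entity (plus optional definition) into a box (center, offset)."""

    def __init__(self, encoder_name: str = "bert-base-uncased", box_dim: int = 64):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        k = self.encoder.config.hidden_size  # k-dimensional entity embedding n_i
        # Two MLPs map n_i to the box center c_i and the (pre-exp) offset o_i.
        self.mlp_center = nn.Sequential(nn.Linear(k, k), nn.ReLU(), nn.Linear(k, box_dim))
        self.mlp_offset = nn.Sequential(nn.Linear(k, k), nn.ReLU(), nn.Linear(k, box_dim))

    def forward(self, entity: str, definition: str = ""):
        # The tokenizer adds [CLS]/[SEP], yielding the "[CLS] e_i, s_i [SEP]" input.
        text = f"{entity}, {definition}" if definition else entity
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
        n_i = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] embedding
        center = self.mlp_center(n_i)
        # exp(...) guarantees every offset dimension is strictly positive.
        offset = torch.exp(self.mlp_offset(n_i))
        return center, offset  # the box spans [center - offset, center + offset]
```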

Box Training
We now seek to optimize the box embeddings such that they accurately represent the taxonomic hierarchies, i.e., the child-parent relations. Because each ⟨child, parent⟩ pair in the taxonomy is a natural "label", we propose to fine-tune the entity encoder and box projector in a self-supervised manner. Specifically, we utilize all the immediate ⟨child, parent⟩ pairs in the taxonomy as positive samples. Negative samples have been demonstrated to be crucial in optimizing box embeddings [Lees et al., 2020]. Therefore, for each child node in such a pair, we collect its "siblings", "uncles" and "cousins" as negative samples against the child-parent relations. Compared to vectors, box embeddings are more powerful in representing child-parent relations. We show how boxes achieve this advantage from two views: the geometric view and the probabilistic view. Accordingly, we design two training objective functions, the geometric loss and the probabilistic loss, to jointly optimize the box embeddings.
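As a concrete illustration, here is a minimal Python sketch of this self-supervised sample construction. For brevity it follows the broader non-ancestor rule stated in the introduction and assumes a single-parent taxonomy given as a child-to-parent map; the helper names are ours.

```python
from typing import Dict, List, Tuple

def build_samples(parent_of: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Return (child, candidate_parent, label) triples: 1 = positive, 0 = negative."""
    def ancestors(node: str) -> set:
        out = set()
        while node in parent_of:
            node = parent_of[node]
            out.add(node)
        return out

    samples = []
    nodes = set(parent_of) | set(parent_of.values())
    for child, parent in parent_of.items():
        samples.append((child, parent, 1))  # every taxonomy edge is a positive pair
        # Non-ancestors (siblings, uncles, cousins, ...) act as negative parents.
        for other in nodes - ancestors(child) - {child}:
            samples.append((child, other, 0))
    return samples

taxonomy = {"gnn": "machine learning", "cnn": "machine learning",
            "machine learning": "computer science", "compilers": "computer science"}
print(build_samples(taxonomy)[:4])
```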
Geometric View. We first show how the child-parent relation can be represented with box embeddings in geometric language. A box with d-dimensional center and offset vectors is a d-dimensional hyperrectangle in Euclidean space. A ⟨child, parent⟩ pair can be semantically interpreted as "child is-a parent" or "child is-one-of parent" [Hearst, 1992]. Therefore, we let the child hyperrectangle be fully enclosed by the parent hyperrectangle, indicating that the child entity is one kind of the parent. Formally, for a d-dimensional child box b_c = (c_c, o_c), since o_c is regularized to be positive, we denote by l_c = c_c − o_c and r_c = c_c + o_c the minimum and maximum corner points of the hyperrectangle, respectively. Similarly, for the parent box b_p = (c_p, o_p), denote by l_p = c_p − o_p and r_p = c_p + o_p the minimum and maximum corner points. Then the "enclose" relation requires:

l_p^i \le l_c^i, \quad r_c^i \le r_p^i, \quad \forall i \in \{1, 2, \ldots, d\},   (3)
where i denotes the i-th dimension of the embeddings. We derive a loss function L_g^+ to ensure the boxes satisfy this geometric enclose relation (Eq. (3)) for each positive pair ⟨e_c, e_p⟩, formalized as:

L_g^+ = \frac{1}{d} \sum_{i=1}^{d} \left[ \max(l_p^i - l_c^i + \delta, 0) + \max(r_c^i - r_p^i + \delta, 0) \right],   (4)

where δ is a hyper-parameter shared across all d dimensions that controls the geometric margin between the child and parent boxes.
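A minimal PyTorch sketch of this positive geometric loss, following Eq. (4); the margin value is an illustrative assumption.

```python
import torch

def geometric_positive_loss(c_child, o_child, c_parent, o_parent, delta=0.05):
    l_c, r_c = c_child - o_child, c_child + o_child      # child corner points
    l_p, r_p = c_parent - o_parent, c_parent + o_parent  # parent corner points
    # Penalize every dimension where the child sticks out of the parent (plus margin).
    left = torch.clamp(l_p - l_c + delta, min=0.0)
    right = torch.clamp(r_c - r_p + delta, min=0.0)
    return (left + right).mean()  # average over the d dimensions
```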
Oppositely, for a negative pair ⟨child, parent′⟩, denoted by ⟨e_c, e_p′⟩, the child hyperrectangle should be disjoint from the negative parent hyperrectangle. We implement this "disjoint" relation by enforcing the intersection between the child box and the negative parent box to be empty. Formally, for such a box pair ⟨b_c, b_p′⟩, their intersection b_z = b_c ∩ b_p′ is given by:

l_z = \max(l_c, l_{p'}), \quad r_z = \min(r_c, r_{p'}),   (5)

where max and min are taken element-wise. An empty intersection, i.e., b_z = ∅, essentially means every dimension of the intersection b_z is less than or equal to 0. Based on this property, we derive a loss function L_g^- to minimize the offset o_z of the intersection, formalized as:

L_g^- = \frac{1}{d} \sum_{i=1}^{d} \max(o_z^i - \epsilon, 0),   (6)

where ϵ is a hyper-parameter to adjust the margin of intersection. If ϵ > 0, we allow some intersection between two boxes, and when ϵ ≤ 0, we force the two boxes to be separated. Note that the offset can be derived by o_z = (r_z − l_z)/2.
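Similarly, a minimal PyTorch sketch of the box intersection (Eq. (5)) and the negative geometric loss (Eq. (6)); the ϵ value is an illustrative assumption.

```python
import torch

def intersection_corners(l_a, r_a, l_b, r_b):
    # The intersection's min corner is the element-wise max of the min corners,
    # and its max corner is the element-wise min of the max corners.
    return torch.maximum(l_a, l_b), torch.minimum(r_a, r_b)

def geometric_negative_loss(c_child, o_child, c_neg, o_neg, epsilon=0.0):
    l_c, r_c = c_child - o_child, c_child + o_child
    l_n, r_n = c_neg - o_neg, c_neg + o_neg
    l_z, r_z = intersection_corners(l_c, r_c, l_n, r_n)
    o_z = (r_z - l_z) / 2  # offset of the intersection box
    # Push every dimension of the intersection offset below epsilon.
    return torch.clamp(o_z - epsilon, min=0.0).mean()
```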
Probabilistic View. We now introduce how the child-parent relation is represented by box embeddings from a probabilistic perspective. We first define the taxonomic probability:

Definition 1 (Taxonomic Probability). The taxonomic probability P(e_y | e_x) is the likelihood of the event "from a given entity e_x, another entity e_y can be reached along a given 1-length edge".

For a ⟨child, parent⟩ pair ⟨e_c, e_p⟩ in the taxonomy, the taxonomic probability P(e_p | e_c) = 1, because given a child, its exact parent can always be retrieved along the edge connecting them. If a child has multiple parents, we define the taxonomic probability as 1 for all parents. Similarly, for a negative pair ⟨child, parent′⟩, denoted by ⟨e_c, e_p′⟩, since the negative parent cannot be directly reached from the child, the taxonomic probability P(e_p′ | e_c) = 0. Desired box embeddings should satisfy these conditions of taxonomic probability for both positive and negative pairs, so that they can accurately represent the child-parent hierarchies in the taxonomy.

Similar to using diagrams of sets to describe probabilities (i.e., Venn diagrams [Venn, 1880]), box embeddings provide a natural graphical way to calculate the taxonomic probability. Following [Vilnis et al., 2018; Li et al., 2018; Onoe et al., 2021], we use the volume of the intersection between the child box and the parent box, divided by the volume of the child box, to represent the taxonomic probability P(e_p | e_c), formalized as:

P(e_p | e_c) = \frac{\mathrm{Vol}(b_c \cap b_p)}{\mathrm{Vol}(b_c)},   (7)

where Vol(·) is the volume of a box. On this basis, we propose a probabilistic loss function for each positive child-parent pair ⟨e_c, e_p⟩, denoted by L_p^+:

L_p^+ = \big(1 - P(e_p | e_c)\big)^2,   (8)

and also a probabilistic loss function for each negative pair ⟨e_c, e_p′⟩, denoted by L_p^-:

L_p^- = \big(P(e_{p'} | e_c)\big)^2.   (9)

Box Regularization. In both the geometric and probabilistic views, we design loss functions that minimize the intersection of two negative box embeddings, i.e., the negative geometric loss L_g^- and the negative probabilistic loss L_p^-. However, if a box is near zero in all its embedding dimensions, or its volume is close to zero, these two losses can also be minimized. In this case, the learned box embeddings are meaningless and can hardly represent the taxonomic hierarchies. To avoid this "cheating" during training, we require that box embeddings cannot be too small in all dimensions. For the box embedding b_e of entity e, we implement this constraint by regularizing the offset o_e with a regularization loss L_r:

L_r = \frac{1}{d} \sum_{i=1}^{d} \max(\phi - o_e^i, 0),   (10)

where ϕ controls the minimum length of boxes in each dimension.

Joint Loss. Finally, we combine the geometric losses, the probabilistic losses and the regularization loss to jointly train the model. Formally, the final loss function is:

L = \alpha (L_g^+ + L_g^-) + \beta (L_p^+ + L_p^-) + \gamma L_r,   (11)

where α, β and γ are hyper-parameters that control the contribution of each single loss function.
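To tie the pieces together, here is a minimal PyTorch sketch of the probabilistic losses, the box regularizer and the joint objective (Eqs. (7)-(11)). The squared form of the probabilistic losses and the default loss weights are our assumptions for illustration.

```python
import torch

def box_volume(l, r):
    # Product of side lengths, clamped so an empty box has volume 0.
    return torch.clamp(r - l, min=0.0).prod(dim=-1)

def taxonomic_probability(l_c, r_c, l_p, r_p):
    # Eq. (7): P(e_p | e_c) = Vol(b_c ∩ b_p) / Vol(b_c).
    l_z, r_z = torch.maximum(l_c, l_p), torch.minimum(r_c, r_p)
    return box_volume(l_z, r_z) / box_volume(l_c, r_c)

def probabilistic_losses(l_c, r_c, l_p, r_p, l_n, r_n):
    loss_pos = (1.0 - taxonomic_probability(l_c, r_c, l_p, r_p)) ** 2  # Eq. (8)
    loss_neg = taxonomic_probability(l_c, r_c, l_n, r_n) ** 2          # Eq. (9)
    return loss_pos, loss_neg

def box_regularization(offset, phi=0.1):
    # Eq. (10): penalize any box side shorter than phi so boxes cannot collapse.
    return torch.clamp(phi - offset, min=0.0).mean()

def joint_loss(l_g_pos, l_g_neg, l_p_pos, l_p_neg, l_r,
               alpha=1.0, beta=1.0, gamma=0.1):
    # Eq. (11): weighted sum of geometric, probabilistic and regularization terms.
    return alpha * (l_g_pos + l_g_neg) + beta * (l_p_pos + l_p_neg) + gamma * l_r
```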

Inference with Box
During inference, our goal is to find an appropriate parent entity, i.e., an anchor, from the taxonomy for a given new query.
In contrast to vector embeddings, which measure the distance between two points, box embeddings offer a more intuitive and natural way to determine whether a candidate parent is suitable: checking to what extent the box of the anchor encloses the box of the query. We implement this idea in a probabilistic way, as shown in Fig. 1 (b). Specifically, for a query e_q, we first project it into a box b_q and then compare it with the box b_a of each candidate anchor e_a. Formally, we rank the candidates by their taxonomic probabilities P(e_a | e_q). A higher P(e_a | e_q) indicates that anchor e_a is more likely to be an appropriate parent for the query e_q. In some cases, the taxonomic probability values of many anchors could be the same. For example, if the query box is enclosed by the box of a leaf anchor node, then it is also enclosed by all ancestors (up to the root) of this leaf anchor, i.e., P(e_a | e_q) = 1 for the leaf anchor and all its ancestors. In this case, we return the leaf anchor as the predicted parent, since it is the finest-grained and describes the query most precisely. We note this leaf anchor should have the smallest volume, because it is enclosed by all its ancestors. Therefore, for candidate anchors with the same taxonomic probability, we perform a second ranking according to the volume of their boxes, so that finer-grained anchors are placed higher.
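A minimal Python sketch of this two-stage ranking (taxonomic probability first, then box volume as a tie-breaker); the toy boxes and helper names are illustrative, not from the paper.

```python
import torch

def rank_anchors(query_box, anchor_boxes):
    l_q, r_q = query_box
    scores = []
    for name, (l_a, r_a) in anchor_boxes.items():
        inter = torch.clamp(torch.minimum(r_q, r_a) - torch.maximum(l_q, l_a), min=0.0)
        prob = inter.prod().item() / torch.clamp(r_q - l_q, min=0.0).prod().item()
        volume = (r_a - l_a).prod().item()
        scores.append((name, prob, volume))
    # Higher probability first; among equal probabilities, smaller volume first.
    return sorted(scores, key=lambda t: (-round(t[1], 6), t[2]))

query = (torch.tensor([0.2, 0.2]), torch.tensor([0.4, 0.4]))
anchors = {"machine learning": (torch.tensor([0.1, 0.1]), torch.tensor([0.5, 0.5])),
           "computer science": (torch.tensor([0.0, 0.0]), torch.tensor([1.0, 1.0]))}
print(rank_anchors(query, anchors))  # both enclose the query; the finer one ranks first
```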

Experiments
We compare BOXTAXO with vector-based embedding baselines for taxonomy expansion and report the results in Table 1. We include two lines of baselines: 1) Because BOXTAXO only models the simple ⟨child, parent⟩ pairs during training, we first compare BOXTAXO with vector-based counterparts that also focus on such pairs, i.e., TAXI, HypeNet and Bert+MLP. BOXTAXO outperforms them with significant gains, indicating the effectiveness of box embeddings against vectors for taxonomy expansion. 2) We also compare BOXTAXO with vector-based baselines that use advanced structural summaries, including local graphs (TaxoExpan) and paths (STEAM). Despite not explicitly modeling such structural signals, BOXTAXO still achieves a clear improvement over TaxoExpan and shows comparable results with STEAM. We are encouraged by these results, as they show the potential of combining box embeddings with advanced structural signals to further boost taxonomy expansion.

Figure 1: The overview of BOXTAXO. The entities in the taxonomy are first projected to boxes based on Bert. (a) Training: the box embeddings are optimized from a joint view of geometry and probability, in order to accurately represent the taxonomic hierarchies. (b) Inference: check whether a query's box is enclosed by the candidate anchor's box in a probabilistic way. Note that the boxes shown in this figure are 2D, but they can live in higher-dimensional spaces, i.e., be hyperrectangles.

Table 1: Results of BOXTAXO on taxonomy expansion compared to vector-based methods. We use the same experimental setting as [Yu et al., 2020], and the baseline results are from [Yu et al., 2020]. We report the averages of ten runs of BOXTAXO. The best results are in boldface, and the second-best results are underlined. "N/A" indicates that MRR is not applicable to TAXI.