Towards Discrete Object Representations in Vision Transformers with Tensor Products

In this work, we explore the use of Tensor Product Representations (TPRs) in a Vision Transformer model to form image representations that can later be used for symbolic manipulation in a neurosymbolic model. We propose the Tensor Product Vision Transformer (TP-ViT), an enhancement of a Vision Transformer that incorporates TPRs, a representation methodology that binds filler and role vectors to form object representations. TP-ViT is the first application of TPRs to visual input, and we report qualitative and quantitative results showing that the use of TPRs allows for the formation of more targeted and diverse object representations than a standard Vision Transformer.


1 INTRODUCTION
In the field of computer vision, connectionist models are now capable of performing tasks such as image classification, captioning, and segmentation at near-human performance levels. This is possible because these highly parametric models are especially effective at inductively approximating the function required for a given task. In line with their inductive mode of learning, these models only require a sufficiently large training dataset, along with test data sampled from the same distribution as the training data. However, this distributional match cannot always be guaranteed in practical applications, leading to poor real-world performance in many cases.
While various advancements have been made to deep vision architectures since AlexNet's [8] breakthrough in 2012, the most widely used methods have all worked by improving a model's inductive efficiency and accuracy. However, the fundamental limitations of induction still remain, with today's leading models still unable to perform well on out-of-distribution data [18].
An existing class of methodologies that allows for increased reasoning capabilities is neurosymbolic models, where classical symbolic methods are used in conjunction with modern connectionist networks. Operating on the principle of deduction, symbolic models work by encoding the relationships between symbols (objects) and were largely used in the expert systems that emerged in the early days of Artificial Intelligence. However, their reliance on human-coded rules severely limits their scope. Neurosymbolic models aim to address this by having the model learn these symbolic relationships from data.
The flow of information in neurosymbolic models resembles that of the binding problem, an idea first introduced in neuroscience. In the context of deep learning, the binding problem poses the question of how a model can learn to 1) segregate, 2) represent, and 3) perform composition on unstructured input [2]. In more detail, segregation is the process of segmenting input data into semantically useful attributes. For vision, this could be the segregation of an image into the multiple objects within a scene. Next, these segregated attributes need to be represented in a common format to form symbol-like entities that are suitable for composition. This common format needs to encode the grouping information from the segregation process while retaining the encoded features of each object. Finally, these objects undergo composition, where the semantic relationships between objects are modeled.
This work contributes towards the segregation and representation aspects of the binding problem, where we explore the use of Tensor Product Representations (TPRs) as a way to encode object representations in Vision Transformers (ViT) [7]. Concretely, we propose a novel TPR module that is adapted for use on a Vision Transformer. We refer to our proposed model as TP-ViT and include analysis showing that it is capable of forming more targeted and diverse representations than a standard ViT (ViT-B/16 proposed in [7]).

2 RELATED WORK
2.1 Object Representations in Deep Learning
In line with the binding problem introduced in Section 1, unstructured input first needs to be segregated into discrete objects. Next, as part of the representation process, grouping information for these objects needs to be encoded using an appropriate methodology. These methodologies can be classified into four categories, as defined in [2]: slot-based methods, augmentation-based methods, attractor dynamics, and TPRs. Due to their compatibility with contemporary deep learning architectures, we have decided to explore the use of TPRs. As such, we will briefly discuss the first three categories here and elaborate on TPRs in the next subsection.
Slot-based methods utilize multiple representational slots, where the slots can either be instance-based (one object per slot), sequential (objects bound sequentially), spatial (slots defined by region), or categorical (multiple objects of the same category in a slot). Examples of spatial slots include SPACE [9] and SCALOR [6], and an example of categorical slots is Hinton's capsule networks [3].
Augmentation-based methods use a shared set of features among all objects and append/augment these features with grouping information. Specific examples of this include temporal codes and complex-valued codes. Temporal codes work in spiking neural networks, where the firing rate of spiking neurons encodes separation. Complex-valued codes instead utilize complex-valued neurons to encode this grouping information, as seen in the complex-valued Boltzmann Machine in [12].
Finally, attractor dynamics involve the generation of attractor states in the feature space, where each attractor represents a stable interpretation of a given input. While attractor networks have received relatively little attention due to the difficulty of training them, a few notable works exist, including [11] and [5].

2.2 Tensor Product Representations
TPRs were first introduced by Smolensky in 1990 [15] and involve the use of role and filler vectors to bind features to object representations. Intuitively, the filler vectors contain the encoded features of an object, while the role vectors encode the grouping information. The representation is then formed by "binding" the role and filler vectors through an outer product. TPRs allow for the formation of multiple object representations through the use of multiple filler and role vectors, where the use of a distributed role vector allows for the encoding of structural information or uncertainty regarding the segregation of objects.
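To make the binding operation concrete, the snippet below is a toy sketch (our own illustration, not code from any of the cited works): two filler vectors are bound to two orthonormal role vectors via outer products, and a filler is recovered by unbinding with its role.

```python
import torch

d_f, d_r = 6, 4                        # toy filler and role dimensions (hypothetical)

fillers = torch.randn(2, d_f)          # encoded features of two objects
roles = torch.eye(d_r)[:2]             # orthonormal role vectors for the two objects

# Binding: the sum of outer products f_i (x) r_i yields one distributed representation.
tpr = torch.einsum('if,ir->fr', fillers, roles)   # shape (d_f, d_r)

# Unbinding: multiplying the TPR by a role vector (its own dual here, since the
# roles are orthonormal) recovers the corresponding filler.
recovered = tpr @ roles[0]
print(torch.allclose(recovered, fillers[0], atol=1e-5))   # True
```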
While there has been limited work on TPRs within the context of deep learning, a couple of notable works exist. First, [4] incorporated TPRs within a recurrent-neural-network-based architecture to perform captioning and part-of-speech tagging. Their method was able to learn the grammatical structure of sentences in an unsupervised fashion by learning to form a role-unbinding vector. TPRs were also used in another language model in the form of the TP-Transformer [14]. Here, the TP-Transformer was applied to an NLP-based question-answering task, specifically the Mathematics dataset [13]. They propose a Tensor Product Multi-Head Attention (TPMHA) block that stacks a binding operation on top of the standard multi-head attention [16] operation. They observed that distinct roles were formed for different features in the input (e.g. separate clusters for numerators and denominators). The ability of the models in [4] and [14] to form semantically meaningful roles for NLP-based input motivated us to explore the use of TPRs for other modalities. In the following section, we detail our proposed Tensor Product Vision Transformer (TP-ViT), which applies TPRs to visual input.

3 METHODOLOGY
As discussed in Section 2.2, previous works have taken different approaches to incorporating TPRs into deep learning models. Our work is inspired by the TP-Transformer [14], but we enhance their approach in three ways: 1) we adapt TPRs for use on the visual modality, 2) we isolate the TPR operations from the transformer encoder so that they function as a self-contained module, allowing for greater flexibility, and 3) we utilize the full tensor product instead of the Hadamard product that was used as an approximation (to reduce computational complexity) in [14]. To the best of our knowledge, this is the first use of TPRs in the visual modality.
Our proposed model TP-ViT consists of a TPR module stacked on top of a standard Vision Transformer (ViT-B/16 variant). The TPR module is shown in Figure 1 and consists of two main blocks: the TPR block and the inference block. The functions of these blocks are described in further detail in the following subsections.

3.1 TPR Block
The function of the TPR block is to receive encoded tokens from the ViT as input and output a TPR. To do this, it first applies functions $f_F$ and $f_R$ to transform the tokens into filler and role vectors respectively. We use multi-head self-attention as $f_F$ and a linear projection as $f_R$, allowing both transforms to be learned (Eqs. 2 & 3).
In Figure 1, we begin by passing image patches $P$ through a ViT to obtain encoded tokens $X$ (Eq. 1):

$X = \mathrm{ViT}(P), \quad X \in \mathbb{R}^{n \times d}$   (1)

where $n$ is the number of input tokens and $d$ is the ViT's embedding dimension.
In the TPR block, we learn functions $f_F$ and $f_R$ to form the filler and role vectors respectively (Eqs. 2 & 3):

$F = f_F(X) = \mathrm{MHSA}(X), \quad F \in \mathbb{R}^{n \times d_f}$   (2)

where $d_f$ is the filler dimension and is set equal to $d$, and MHSA is the standard multi-head self-attention proposed in [16].

$R = f_R(X), \quad R \in \mathbb{R}^{n \times d_r}$   (3)

where $d_r$ is the role dimension and is left as a hyperparameter. The filler and role vectors are then bound through an outer product taken across the filler and role dimensions, with element-wise multiplication performed across the shared token dimension, producing the three-dimensional tensor product representation, TPR (Eq. 4):

$\mathrm{TPR}_{i,j,k} = F_{i,j} \, R_{i,k}, \quad \mathrm{TPR} \in \mathbb{R}^{n \times d_f \times d_r}$   (4)
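A minimal PyTorch sketch of the TPR block follows; the module and attribute names (`TPRBlock`, `filler_attn`, `role_proj`) are our own illustration rather than the released implementation, and Eq. 4 is implemented as the token-wise outer product described above.

```python
import torch
import torch.nn as nn

class TPRBlock(nn.Module):
    """Maps encoded ViT tokens X (n x d) to a 3-D tensor product representation."""
    def __init__(self, d: int = 768, d_r: int = 12, num_heads: int = 12):
        super().__init__()
        # f_F: multi-head self-attention produces the filler vectors (d_f = d).
        self.filler_attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        # f_R: a linear projection produces the role vectors (d_r is a hyperparameter).
        self.role_proj = nn.Linear(d, d_r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d) encoded tokens from the ViT encoder
        fillers, _ = self.filler_attn(x, x, x)     # (batch, n, d_f)
        roles = self.role_proj(x)                  # (batch, n, d_r)
        # Bind each token's filler and role vectors with an outer product,
        # producing a TPR of shape (batch, n, d_f, d_r).
        return torch.einsum('bnf,bnr->bnfr', fillers, roles)
```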

3.2 Inference Block
The 3-dimensional TPR contains a distributed representation of the input image, but for some tasks its dimensionality needs to be reduced before it can be used effectively downstream. For our particular classification training objective, we created the inference block to reduce the TPR's dimensions to $n \times d$.
To achieve this, the inference block utilizes multi-head attention, with the TPR acting as the query (q) and the ViT output tokens acting as the key (k) and value (v) (Eq. 5). Here, we utilize one attention head per role in the TPR, the intuition being that the varying information contained within each role will result in diverse queries of the encoded tokens:

$O = \mathrm{MultiHead}(q = \mathrm{TPR}, \; k = X, \; v = X), \quad O \in \mathbb{R}^{n \times d}$   (5)

where the $k$-th attention head uses the slice of the TPR corresponding to role $k$ as its query.
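One possible realization of this block is sketched below (an illustration under our reading of Eq. 5, with one head per role and head dimension $d/d_r$; the projection layers and names are our assumptions, not the paper's released code).

```python
import torch
import torch.nn as nn

class InferenceBlock(nn.Module):
    """Reduces a (n x d x d_r) TPR to an (n x d) output via cross-attention,
    using one attention head per role."""
    def __init__(self, d: int = 768, d_r: int = 12):
        super().__init__()
        assert d % d_r == 0, "assumes the embedding dim splits evenly across roles"
        self.d, self.d_r, self.d_head = d, d_r, d // d_r
        self.q_proj = nn.Linear(d, self.d_head)   # applied to each role's TPR slice
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.out_proj = nn.Linear(d, d)

    def forward(self, tpr: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # tpr: (batch, n, d, d_r) from the TPR block; x: (batch, n, d) ViT tokens
        b, n, _, _ = tpr.shape
        # One query per role: slice the TPR along the role axis.
        q = self.q_proj(tpr.permute(0, 3, 1, 2))                        # (b, d_r, n, d_head)
        k = self.k_proj(x).view(b, n, self.d_r, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, n, self.d_r, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, self.d)          # concatenate heads
        return self.out_proj(out)                                       # (b, n, d)
```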

3.3 Experimental Configuration
We train TP-ViT for image classification on the ImageNet [1] dataset at a resolution of 224×224. We begin training with the ViT encoder weights frozen, using a batch size of 1024 and the AdamW optimizer [10] at an initial learning rate of $10^{-3}$, decreasing by factors of 10 to a minimum of $10^{-7}$ when the validation loss plateaus.
We then unfreeze the encoder weights and train with a batch size of 256, using the same optimizer and stepping the learning rate down from $10^{-4}$ to $10^{-7}$ in the same manner as above. Standard image augmentation techniques (such as brightness and contrast modulation) are applied for both training stages. Since the TPR module allows the number of roles to be selected as a hyperparameter, we train models with 4, 8, 12, and 16 roles following the procedure above, and present results in the next section. Code used for training is available at (anonymized for blind review).
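For illustration, the two training stages can be set up roughly as follows; this is a sketch, where the `model.vit` attribute and any plateau-scheduler settings beyond the stated factor-of-10 decay and learning-rate endpoints are our assumptions.

```python
import torch

def configure_stage(model, freeze_encoder: bool, lr: float, min_lr: float = 1e-7):
    """Set up one of the two training stages described above.

    `model.vit` is assumed to hold the pretrained ViT-B/16 encoder; the exact
    attribute name is illustrative.
    """
    for p in model.vit.parameters():
        p.requires_grad = not freeze_encoder
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    # Decrease the learning rate by a factor of 10 whenever the validation loss
    # plateaus, down to min_lr.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, min_lr=min_lr)
    return optimizer, scheduler

# Stage 1: encoder frozen, batch size 1024, lr 1e-3 -> 1e-7
#   optimizer, scheduler = configure_stage(model, freeze_encoder=True, lr=1e-3)
# Stage 2: encoder unfrozen, batch size 256, lr 1e-4 -> 1e-7
#   optimizer, scheduler = configure_stage(model, freeze_encoder=False, lr=1e-4)
# After each validation pass, call scheduler.step(val_loss).
```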

4 RESULTS & DISCUSSION
4.1 Visualization of Attention Maps
To determine whether the proposed TPR module forms meaningful representations, we visualize the attention maps formed in the inference block. Since each attention head here uses a different role as the query (see Section 3.2), we would intuitively expect each attention map to capture different features. In addition, this visualization method allows us to compare against the standard ViT's attention maps as a baseline. The attention maps for a single sample image for both TP-ViT and the standard ViT are shown in Figure 2. Since the standard ViT-B/16 uses 12 attention heads, we use the 12-role variant of TP-ViT to provide a fair point of comparison. From visual inspection, we observe that TP-ViT produces 9 distinct types of attention maps with focused attention on various parts of the image, such as the people (including separate attention maps targeting the head and torso), the crosswalk, the car, and the overall background. In contrast, the standard ViT produces redundant attention maps, where most of the maps highlight similar areas, with little targeted attention on objects in the scene.
Figure 3 shows further qualitative results, where we select the 4 most distinct attention maps from TP-ViT and the standard ViT for multiple sample images. Here, each row represents the results for one sample image. These four sample images were chosen as they are representative of the module's performance in general.
The additional results in Figure 3 further demonstrate TP-ViT's ability to produce significantly more targeted/focused attention maps relative to the more generalized attention maps of the standard ViT. TP-ViT's maps are also more diverse between attention heads, where each head often singles out a particular object. This diversity was one of the stated goals behind the use of multiple heads in the original Transformer model [16]. However, recent work has since shown that the standard multi-head self-attention architecture often produces redundant feature maps [17]. Our results shown in Figures 2 and 3 support this claim.
Empirically, we find that the models with 8 or 12 roles provide a good compromise between sufficient expressive power and low redundancy, as the model with 4 roles often does not attend to all objects of interest, while the model with 16 roles shows some redundancy. This can be observed in Figure 4, which compares the attention maps produced by all variants of TP-ViT for a single sample image. However, note that the ideal number of roles depends largely on the complexity of the input data, as well as the requirements of downstream tasks, such as whether the representations are used for reasoning.

4.2 Quantitative Performance & Effect of Roles
Here, we provide a quantitative measure of the difference in attention map diversity between TP-ViT and the standard ViT. We identified the Intersection over Union (IoU) between attention maps as an appropriate metric to quantify attention map diversity. To measure IoU between attention maps, we first select a binarization threshold and binarize the attention values across all tokens in the image, such that attention values are either 0 or 1. We then calculate the mean IoU between all pairs of attention maps produced for an image, and report an average value over all images in the ImageNet validation set. To provide more illustrative results, we include IoU values for binarization thresholds between 0 and 1 (step size 0.1).
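A sketch of this metric is shown below (our own re-implementation of the procedure described above; how the attention values are normalized before thresholding is an assumption).

```python
import torch

def mean_pairwise_iou(attn_maps: torch.Tensor, threshold: float) -> float:
    """Mean IoU over all pairs of attention maps for one image.

    attn_maps: (num_heads, num_tokens) attention values; a lower mean IoU
    indicates more diverse (less overlapping) attention maps.
    """
    binary = (attn_maps >= threshold).float()        # binarize each map
    inter = binary @ binary.t()                      # pairwise intersection counts
    areas = binary.sum(dim=-1)
    union = areas[:, None] + areas[None, :] - inter  # pairwise union counts
    iou = inter / union.clamp(min=1e-8)
    h = binary.shape[0]
    off_diag = ~torch.eye(h, dtype=torch.bool)       # exclude self-pairs
    return iou[off_diag].mean().item()

# Example: 12 heads over 196 patch tokens, sweeping the binarization threshold.
maps = torch.rand(12, 196)
for t in [i / 10 for i in range(11)]:
    print(f"threshold {t:.1f}: mean IoU {mean_pairwise_iou(maps, t):.3f}")
```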
These results are shown in Figure 5, where we see that all 4 variants of TP-ViT show significantly lower average IoU between attention maps across all binarization thresholds, indicating higher diversity between the attended objects in each attention map. In particular, at a binarization threshold of 0.1, TP-ViT shows a 72% reduction in IoU between attention maps. These quantitative results strongly support our qualitative analysis in Section 4.1.
Finally, we include results for downstream performance in Table 1. While improved classification accuracy is not a primary objective behind the development of TP-ViT, it achieves comparable or marginally higher accuracy than the standard ViT-B/16. We also observe that tuning the number of roles does not significantly affect classification performance. However, as discussed in Section 4.1, setting the number of roles to 8 or 12 results in a good balance of role diversity, which will be significant in future work where the roles are used downstream for composition/reasoning.

5 CONCLUSION
In this work, we proposed TP-ViT, which consists of a novel TPR module stacked on top of a ViT. Through analysis of its attention maps, we have qualitatively and quantitatively shown that TP-ViT is able to form more targeted and diverse representations of objects compared to the standard ViT, representing a significant step towards the future development of neurosymbolic models. Our work builds upon the NLP-based TP-Transformer [14] and is the first work to apply TPRs to the visual domain. We utilize the full Tensor Product Representation instead of the Hadamard product approximation used in [14]. Finally, our proposed module also isolates the Tensor Product operation from the encoder and introduces a self-contained structure. This allows our module to achieve greater flexibility and compatibility with various types of pretrained encoders.

Figure 1: Proposed TP-ViT architecture with the novel TPR modules in orange (best viewed in colour).

Figure 2: Comparison of TP-ViT attention maps (12 roles) vs baseline ViT-B/16 for a single sample image.

Figure 3: Comparison of TP-ViT attention maps vs baseline ViT-B/16. The 4 most distinct attention maps produced by each model are shown.

Figure 4: Comparison of attention maps between all variants of TP-ViT.

Table 1: Classification accuracy on the ImageNet validation set, with ViT-B/16 as the baseline.