Cloth Interactive Transformer for Virtual Try-On

2D image-based virtual try-on has attracted increasing interest from the multimedia and computer vision communities due to its enormous commercial value. Nevertheless, most existing image-based virtual try-on approaches directly combine the person-identity representation and the in-shop clothing items without taking their mutual correlations into consideration. Moreover, these methods are commonly built on pure convolutional neural network (CNN) architectures, which struggle to capture long-range correlations among the input pixels. As a result, they generally produce inconsistent results. To alleviate these issues, we propose a novel two-stage cloth interactive transformer (CIT) method for the virtual try-on task. In the first stage, we design a CIT matching block that aims to precisely capture the long-range correlations between the cloth-agnostic person information and the in-shop cloth information. Consequently, the warped in-shop clothing items look more natural in appearance. In the second stage, we put forth a CIT reasoning block for establishing global mutual interactive dependencies among the person representation, the warped clothing item, and the corresponding warped cloth mask. Based on these mutual dependencies, the final try-on results are more realistic. Extensive experiments on a public fashion dataset show that the proposed CIT attains competitive virtual try-on performance.


Introduction
Virtual try-on (VTON), derived from fashion editing [63,35,60,7], aims to transfer a desired in-shop clothing item onto a customer's body. If properly resolved, VTON will provide a time- and energy-saving shopping experience in everyday life. In practice, VTON has already been deployed in some big-brand clothing stores and e-commerce shopping applications owing to its convenience [38,30,23,17].
However, most of the existing methods are built on 3D model-based pipelines [1,29,21,20,48] and follow the conventions of traditional computer graphics. Despite their detailed results, these methods require considerable labor, a significant time investment, and complex data acquisition such as multi-view videos or 3D scans [40], which impede their widespread application. Alternatively, methods based on conditional generative adversarial networks (GANs), such as image-to-image translation or other image generation approaches [8,12], have recently made some positive progress. However, obvious artifacts remain in the generated results. To enhance the results of 2D image-based VTON methods, the classic two-stage pipeline VITON [22] was proposed: the first stage warps the in-shop clothing item to a desired deformation style, and the second stage aligns the warped cloth to the body shape of a given customer. While the visual results look better than those of previous methods, there is still a significant gap between the overall visual quality and plausible generation. Many approaches following this pipeline, e.g., CP-VTON [55], ACGPN [59], and CP-VTON+ [37], were proposed with improved performance. However, these methods are limited to plain textures or simple-style clothes. Their performance suffers when dealing with complicated cases such as rich textures or complex patterns. To address this issue, Xu et al. [58] introduced an intermediate operation that takes the transformation of the target person's image into consideration. But the improvement in visual performance is achieved at the cost of a more complex network architecture, which makes training such a model time-consuming. In addition, we noticed that most previous methods rarely pay enough attention to the correlations between two crucial inputs, i.e., the cloth-agnostic person information and the in-shop cloth information. Hence, some mismatches inevitably occur in the warped in-shop clothes, which degrades the quality of the final try-on results.
Moreover, for VTON, it is essential for a model to learn where to sample pixels in the cloth image and where to reallocate them in the human body region. Hence, modeling long-range dependencies is essential for achieving realistic try-on results. However, most previous methods rely on pure convolutional neural networks (CNNs) for modeling long-range dependencies. Since pure CNN-based methods struggle to establish long-range dependencies due to the local nature of convolutional kernels, the final try-on performance is degraded.
Based on these observations, we assume that it would be advantageous to model the cloth-agnostic person information guided by the corresponding in-shop clothing information and vice versa. In addition, a better way of capturing long-range dependencies is also essential. To this end, building on the classic two-stage pipeline of VITON [22], we propose a novel Cloth Interactive Transformer (CIT) method to address the aforementioned limitations. The overall architecture of the proposed CIT is depicted in Figure 1.
In the first stage (i.e., the geometric matching stage), we design a CIT matching block that models long-range relations between the person and clothing representations interactively. Concurrently, a valuable correlation map is generated to boost the performance of the thin-plate spline (TPS) transform [4]. Unlike traditional hand-crafted shape context matching strategies [36,34,46], which are only suitable for a certain feature type, the proposed CIT matching block (Block-I) has learnable features and can model long-range correlations via cross-attention transformer encoders. As a result, the warped cloth becomes more natural and fits a wearer's pose and shape more accurately. In the second stage (i.e., the try-on stage), unlike previous methods [55,37] that treat the warped in-shop clothing item and its corresponding mask as a single input, we propose a novel CIT reasoning block (Block-II) that takes three distinct inputs, i.e., the cloth-agnostic person representation, the warped clothing item, and the mask of the warped clothing item. Through the CIT reasoning block, a more precise correlation among these three inputs can be established, and this correlation is further utilized to strengthen the mask composition process. In addition, this correlation also serves as an attention map to activate the rendered person image, making the final results clearer and more realistic.
More specifically, in the CIT matching block, our primary objective is to improve the ability to model the person feature by encoding the target in-shop cloth feature. Since the in-shop clothing item is non-rigid, it is difficult to directly learn the matching relationship only from the clothing item. Hence, we resort to the correlation between the person and the target in-shop clothing item. With the help of this learned correlation, the person-related feature can

Related Work
Virtual Try-On (VTON), as one of the most popular tasks within the fashion area, has been widely studied by the research community due to its practical potential [15,10,2,18,31,16]. Conventionally, this task was realized by computer graphics techniques, which build 3D models and render the output images via the precise control of geometric transformations or physical constraints [14,5,6,20,61,47]. By using 3D measurements or representations, these methods can generate promising results for VTON, but the additional requirements such as 3D scanning equipment, computation resources, and heavy labor are not negligible.
Compared to 3D-based methods, methods based on 2D generative adversarial networks (GANs) are more applicable to online shopping scenarios. Jetchev and Bergmann [26] proposed a conditional GAN to swap fashion articles with only 2D images. Another interesting GAN-based method, SWAPGAN [32], solved the VTON task in an end-to-end manner, but it utilizes 3 generators, and the balance among all generators is hard to control. Gu et al. [19] proposed a GAN-based image transformation strategy that can automatically learn the mapping from a combination of pose and text to a target fashion image. However, this method did not consider pose variations, and it also required paired images of both the in-shop clothes and the wearer during inference. Hence, its applicability in practical scenarios is limited.
Unlike the previous 3D-based or GAN-based methods, VITON [22] tackled this problem with a coarse-to-fine architecture, which first computed a shape context [3] thin-plate spline (TPS) transformation [4] for warping an in-shop clothing item onto the target person and then blended the warped clothing item onto a given person. Note that the TPS transformation is a commonly used method of transforming a source image into a target image. It works with a predefined set of constraints that define how the source image should be warped into the target image. The constraints are based on a set of surface points on the source image and their corresponding points on the target image. The warping algorithm then uses these point pairs to calculate a set of transformations that map the source image to the target image. The transformations include scaling, rotation, translation, and shearing. The result is a warped version of the source image that accurately reflects the target image. In the VTON task, this warping is used to deform the in-shop clothing item toward the body shape of a given person. However, VITON [22] utilized hand-crafted shape-context features for conducting the TPS transformation, which is not only time-consuming but also not robust when facing new samples. As an improvement, CP-VTON [55] and CP-VTON+ [37] adopted the learnable TPS transformation method proposed in [43] via a convolutional geometric matcher. Although such a differentiable TPS transformation establishes the correlation between the person and the in-shop clothing features, and the generated try-on results are better, there are still obvious artifacts when facing heavy occlusions, rich textures, or large transformations. ACGPN [59] was proposed to tackle these issues. Compared to CP-VTON, ACGPN used a semantic generation module to generate a semantic alignment of the spatial layout. It also introduced a second-order difference constraint based on TPS. Though the performance is improved, the problem remains similar to that of previous methods [22,55,37] because these methods do not consider the global long-range interactive correlations between the person representation and the in-shop clothing item. Recently, Chopra et al. [9] proposed to solve the VTON task with a gated appearance flow. Though better results are achieved, the need to model 3D geometric priors makes the overall procedure more complex.
To alleviate these problems, we propose a two-stage Cloth Interactive Transformer (CIT) method for the virtual try-on task. In particular, the proposed CIT can capture the long-range dependencies well in both stages. As a result, our method generates sharper and more realistic try-on images.
Long-Range Dependence Modeling. Although CNN-based structures have shown excellent representation ability in various vision tasks such as classification and segmentation, long-range dependencies are still hard to establish due to the limited receptive fields of the convolution kernels. For example, a convolutional kernel usually focuses on local neighbors (e.g., 3×3 or 5×5), while long-range relations would require the response at a position to be a weighted sum of the features at all other positions. This limitation raises huge challenges for many applications where long-range relationships are needed.
To overcome this limitation, the attention mechanism [50,51,54] has been widely used in vision tasks with CNN architectures, though it was initially designed for natural language processing tasks. In addition, non-local neural networks [56] were designed based on the self-attention mechanism, allowing the model to capture long-distance dependencies in the feature maps. However, this approach suffers from high memory and computation costs. The authors of [45] proposed an attention gate model to increase the sensitivity of a base model. Multilayer perceptrons (MLPs) have also been proposed for modeling long-range relations, but they may heavily affect efficiency [42,52]. Moreover, the Transformer [54] was first introduced for neural machine translation because it can model long-range dependencies in sequence-to-sequence tasks and capture the relations between arbitrary positions in a given sequence. Unlike previous CNN-based methods, Transformers are built solely on self-attention operations, which are strong at modeling the global context. After Transformers demonstrated their overwhelming power on a broad range of language tasks (e.g., text classification, machine translation, or question answering [28,11,39,49,41]), Transformer-based frameworks have recently also shown their effectiveness on various vision tasks [33]. In particular, the vision Transformer (ViT) [13,53] splits the image into patches and models the correlation between these patches as sequences; the core self-attention module of ViT is then stacked for modeling the long-range dependencies.
To this end, in this paper, we also utilize a vision Transformer for handling the long-range dependencies among the cloth-agnostic person representation, the in-shop clothing item (for both the original cloth and the warped cloth), and the corresponding mask of the warped clothing item in a novel cross-modal manner.

Cloth Interactive Transformer
In this section, we first give an overall introduction and the necessary notations of the proposed Cloth Interactive Transformer (CIT) method for virtual try-on in Section 3.1. Then we introduce the core modules (i.e., the Interactive Transformer I and II) of CIT in Section 3.2. Based on the Interactive Transformer I and II, we show the details of the CIT matching block (Block-I) and the CIT reasoning block (Block-II) in Section 3.3 and Section 3.4, respectively. Finally, the optimization objectives of the proposed CIT for both stages are described in detail in Section 3.5.

Overview and Notations
For the 2D image-based VTON task, the target in-shop clothing item is different from the source clothing item that is worn by a given person. Specifically, given a person image $I \in \mathbb{R}^{3\times h\times w}$ and an in-shop clothing image $c \in \mathbb{R}^{3\times h\times w}$, our goal is to generate the image $I_o \in \mathbb{R}^{3\times h\times w}$ in which the person in $I$ wears the cloth $c$. Hence, what we need to do first is to reduce the side effects of the source clothes, such as color, texture, or shape. Meanwhile, it is also necessary to preserve as much information about the given person as possible, including the person's face, hair, body shape, and pose. To this end, we adopt the same pipeline as [22] to obtain the person representation $p$ from $I$. It contains three components: the 18-channel feature maps for the human pose, the 1-channel feature map for the body shape, and the 3-channel RGB image. Note that the RGB image contains only the reserved regions of a person (i.e., face, hair, and lower parts of the person) for maintaining the identity of this person.
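As a minimal sketch of how the three components of $p$ fit together, the snippet below stacks them along the channel axis. Channel-wise concatenation, the function name `build_person_representation`, and the zero-filled placeholder arrays are assumptions made for illustration; only the channel counts (18, 1, 3) and the 256×192 resolution come from the text.

```python
import numpy as np

def build_person_representation(pose_maps, body_shape, reserved_rgb):
    """Stack pose (18 ch), body shape (1 ch), and reserved RGB (3 ch).

    Concatenating along the channel axis is an assumption of this sketch.
    """
    assert pose_maps.shape[0] == 18
    assert body_shape.shape[0] == 1
    assert reserved_rgb.shape[0] == 3
    return np.concatenate([pose_maps, body_shape, reserved_rgb], axis=0)

h, w = 256, 192  # input resolution used in the experiments
p = build_person_representation(
    np.zeros((18, h, w), dtype=np.float32),  # placeholder pose heatmaps
    np.zeros((1, h, w), dtype=np.float32),   # placeholder body-shape map
    np.zeros((3, h, w), dtype=np.float32),   # placeholder reserved regions
)
print(p.shape)  # (22, 256, 192)
```

Under this assumption, the cloth-agnostic person representation has 18 + 1 + 3 = 22 channels.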
The proposed CIT follows a two-stage pipeline (see the geometric matching stage and the try-on stage in Figure 1), which is also adopted by CP-VTON [55] and CP-VTON+ [37]. In particular, the first stage takes as input the cloth-agnostic person representation $p$ and an in-shop clothing item $c$ to produce a warped cloth $\hat{c}$ and a warped mask $\hat{c}_m$ based on the given person's pose and shape. The second stage uses the warped cloth $\hat{c}$ and the corresponding warped mask $\hat{c}_m$ together with the person representation $p$ to generate the final person image wearing the in-shop cloth. In the first geometric matching stage, we propose a CIT matching block (Block-I, see the upper part of Figure 2 for details), which takes the person feature $X_p$ and the in-shop cloth feature $X_c$ as inputs. $X_p$ and $X_c$ are generated by two similar feature extractors from $p$ and $c$, respectively (see the first geometric matching stage in Figure 1). The block then produces a correlation feature $X_{out-I}$, followed by a down-sampling layer for regressing the parameter $\theta$. Note that $\theta$ is used for warping the original in-shop clothing $c$ to the target on-body style $\hat{c}$ via an interpolation method, the thin-plate spline (TPS) warping module [43]. Specifically, given two images with corresponding control points in different positions, these control points can be aligned from one image (i.e., the in-shop clothing item) to the other (i.e., the corresponding human body region) with the thin-plate spline interpolation operation in a geometry estimation manner (i.e., local descriptor extraction, descriptor matching, and transformation-related parameter estimation) [22].
In addition, the TPS operation we adopt in this paper is the same as the one from [43] used in CP-VTON [55]. It utilizes differentiable modules to conduct the transformation from $c$ to $\hat{c}$ by mimicking the geometry estimation procedure in a learnable manner. Meanwhile, the corresponding mask $\hat{c}_m$ of $\hat{c}$ is also produced based on $\theta$ via the TPS warping operation. In the second stage, we utilize the warped cloth $\hat{c}$ and the warped mask $\hat{c}_m$ together with the person representation $p$ as inputs of the CIT reasoning block (Block-II, see the bottom part of Figure 2 for details). The output $X_{out-II}$ of the CIT reasoning block is used to guide the final mask composition for generating more realistic try-on results.
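To make the control-point alignment concrete, the following is a minimal NumPy sketch of classical thin-plate spline interpolation. It solves for the spline from given point correspondences, whereas in CIT the parameter $\theta$ is regressed by a network, so this is an illustration of the TPS idea rather than the learnable module of [43]. The function names, the 5×5 control grid, and the random target offsets are all assumptions of the sketch (reading the 50 regressed values as the coordinates of 25 control points is likewise an assumption).

```python
import numpy as np

def tps_kernel(r2):
    """Radial basis U(r) = r^2 log(r^2), with U(0) = 0."""
    safe = np.where(r2 == 0, 1.0, r2)  # avoid log(0); log(1) = 0
    return r2 * np.log(safe)

def fit_tps(src, dst):
    """Solve for TPS parameters mapping control points src -> dst."""
    n = src.shape[0]
    d2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    P = np.hstack([np.ones((n, 1)), src])      # affine part: [1, x, y]
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = tps_kernel(d2)
    L[:n, n:] = P
    L[n:, :n] = P.T
    rhs = np.vstack([dst, np.zeros((3, 2))])
    return np.linalg.solve(L, rhs)             # (n + 3, 2)

def apply_tps(params, src, pts):
    """Warp arbitrary points with the fitted spline."""
    d2 = np.sum((pts[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    P = np.hstack([np.ones((pts.shape[0], 1)), pts])
    return tps_kernel(d2) @ params[:-3] + P @ params[-3:]

# Hypothetical 5x5 control grid on the cloth and perturbed target positions.
xs = np.linspace(-1.0, 1.0, 5)
src = np.array([(x, y) for y in xs for x in xs])
rng = np.random.default_rng(0)
dst = src + 0.05 * rng.normal(size=src.shape)

params = fit_tps(src, dst)
warped = apply_tps(params, src, src)
print(np.allclose(warped, dst, atol=1e-6))  # control points land on targets
```

The spline interpolates the control points exactly and deforms the points in between smoothly, which is the property the warping stage relies on.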

Interactive Transformer
By leveraging the self-attention mechanism, the Transformer is capable of modeling long-range dependencies. Given this inherent ability, we propose the Interactive Transformer for exploring the correlation between the person and the clothing item in the VTON task. There are two types of Interactive Transformers in the proposed CIT. The first version, i.e., Interactive Transformer I, is employed in the first geometric matching stage. The second version, i.e., Interactive Transformer II, is utilized in the second try-on stage. They are based on basic Transformer encoders and cross-modal Transformer encoders, and their details are depicted in Figure 2.
In a standard Transformer encoder, a positional embedding is first added to the input feature, as described in [54]. This helps preserve the initial spatial relations of the input. After the positional embedding, the input feature is projected into queries $Q_m$, keys $K_m$, and values $V_m$ by a linear layer. Subsequently, the output of the self-attention layer is computed as
$$\mathrm{Attention}(Q_m, K_m, V_m) = \mathrm{softmax}\left(\frac{Q_m K_m^{\top}}{\sqrt{d}}\right) V_m, \tag{1}$$
where $d$ is the dimension of the queries and keys. This self-attention mechanism is usually employed for only one type of input data. However, in the two-stage VTON task, to accurately capture a precise match between the person information and the cloth information, there are several pairs of correlations we cannot overlook. Notably, in the geometric matching stage, we need to consider the correlation between the cloth-agnostic person representation $p$ and the in-shop clothing item $c$, since such a correlation is indispensable for producing a reasonable warped cloth $\hat{c}$. In the second try-on stage, there are three types of inputs, i.e., $p$, $\hat{c}$, and $\hat{c}_m$. Proficiently modeling the long-range connection between each pair of them (i.e., $p$ and $\hat{c}$, $p$ and $\hat{c}_m$, as well as $\hat{c}$ and $\hat{c}_m$) is also crucial, since a well-captured correlation usually yields a good match between the person's body and the in-shop cloth.
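The scaled dot-product self-attention described above can be sketched in a few lines of NumPy. This is a single-head illustration that omits the positional embedding, multi-head splitting, and the feed-forward sublayer of a full encoder; the projection matrices `w_q`, `w_k`, `w_v` and the toy sizes are assumptions of the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over one sequence x of shape (S, D)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # linear projections
    d = q.shape[-1]                           # query/key dimension
    weights = softmax(q @ k.T / np.sqrt(d))   # (S, S): all-pairs relations
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, dim = 6, 8
x = rng.normal(size=(seq_len, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)       # (6, 8)
print(weights.shape)   # (6, 6)
```

Each row of `weights` sums to one, so every output position is a weighted sum over all positions of the sequence, which is what gives the encoder its long-range reach.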
Based on this observation, instead of using only the self-attention layer in a Transformer encoder for processing a single-modal input, we propose a cross-modal Transformer encoder based on a cross-attention mechanism. Note that we treat each kind of input as a single-modal input since each of them provides a specific type of information. For example, $p$ is for person identity, $c$ and $\hat{c}$ correspond to the texture, and $\hat{c}_m$ is related to the shape information. The cross-attention is computed as follows:
$$\mathrm{Attention}(Q_{m_1}, K_{m_2}, V_{m_2}) = \mathrm{softmax}\left(\frac{Q_{m_1} K_{m_2}^{\top}}{\sqrt{d}}\right) V_{m_2}, \tag{2}$$
where $d$ is the dimension of the queries and keys, and we adopt the first input (i.e., the person representation $p$) as the queries $Q_{m_1}$ and the second input (i.e., the in-shop clothing item $c$) as the keys $K_{m_2}$ and values $V_{m_2}$. In this cross-interactive manner, each kind of input keeps updating its sequence via the external information from the multi-head cross-attention module. As a result, one modality is transformed into a different set of key/value pairs to interact with another modality.
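The cross-attention just described differs from self-attention only in where the keys and values come from. The single-head NumPy sketch below takes queries from one modality (e.g., person tokens) and keys/values from another (e.g., cloth tokens). The function name, projection matrices, and token counts are hypothetical and chosen only to show the shapes.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x_query, x_context, w_q, w_k, w_v):
    """Queries from modality m1, keys and values from modality m2."""
    q = x_query @ w_q                          # (S1, D)
    k = x_context @ w_k                        # (S2, D)
    v = x_context @ w_v                        # (S2, D)
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))    # (S1, S2)
    return weights @ v, weights

rng = np.random.default_rng(0)
dim = 8
x_person = rng.normal(size=(6, dim))    # hypothetical person tokens
x_cloth = rng.normal(size=(10, dim))    # hypothetical cloth tokens
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out, weights = cross_attention(x_person, x_cloth, w_q, w_k, w_v)
print(out.shape)      # (6, 8): one updated vector per person token
print(weights.shape)  # (6, 10): each person token attends to all cloth tokens
```

The output keeps the sequence length of the query modality while its content is drawn entirely from the other modality's values.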
Then the cross-modal Transformer encoder is used for modeling the cross-modal long-range correlations between $F'_p$ and $F'_c$.
The Interactive Transformer II is designed mainly for exploring the correlations between every two inputs among the three total inputs (i.e., $p$, $\hat{c}$, and $\hat{c}_m$). In particular, we adopt 3 regular Transformer encoders and 6 cross-modal Transformer encoders for constructing the Interactive Transformer II. Note that for better illustration, we depict $X_p$, $X_{\hat{c}}$, and $X_{\hat{c}_m}$ and their corresponding information flows in yellow, green, and blue, respectively.
Within the Interactive Transformer II, there are three input features, i.e., $X_p$, $X_{\hat{c}}$, and $X_{\hat{c}_m}$. Each of them works as the Query element within its own branch while working as the Key and Value elements in the other two branches. As a detailed example, we take the feature $X_p$ (depicted in yellow in Figure 2), which comes from the person representation. Once we get $X_p$ after the 1D convolutional layer that is outside the red dashed box, there are two pathways for $X_p$ to pass through. The first one is to directly let it meet two cross-modal Transformer encoders (i.e., the green-border cross-modal Transformer encoder between $X'_{\hat{c}}$ and $X_p$, as well as the blue-border cross-modal Transformer encoder between $X'_{\hat{c}_m}$ and $X_p$). The other is to let $X_p$ pass through a regular Transformer encoder to produce the updated feature $X'_p$. Note that here $(X'_{\hat{c}} \to X_p)$ within the green-border cross-modal Transformer encoder means we utilize $X_p$ as Query and $X'_{\hat{c}}$ as Key and Value, while $(X'_{\hat{c}_m} \to X_p)$ within the blue-border cross-modal Transformer encoder indicates we use $X_p$ as Query and $X'_{\hat{c}_m}$ as Key and Value. $X'_{\hat{c}}$ and $X'_{\hat{c}_m}$ are the updated features from $X_{\hat{c}}$ and $X_{\hat{c}_m}$ after their corresponding regular Transformer encoders. These procedures constitute the first (yellow) branch. Similarly, we obtain the output $X^{cross}_{\hat{c}}$ of the middle green branch and the output $X^{cross}_{\hat{c}_m}$ of the bottom blue branch. Finally, the overall output of the Interactive Transformer II is obtained from the outputs of these three branches.
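The wiring of the three branches can be sketched as follows. This is a structural illustration only: `attend` is a parameter-free stand-in for a full encoder (no learned projections, normalization, or feed-forward layers), concatenating the two cross-modal outputs within a branch is an assumption, and the final fusion of the three branches is not shown. What the sketch does reflect from the text is that each input is updated by one regular encoder and queried against the updated features of the other two, giving 3 regular and 6 cross-modal encoders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, context):
    """Parameter-free attention used as a stand-in for an encoder."""
    d = query.shape[-1]
    return softmax(query @ context.T / np.sqrt(d)) @ context

def interactive_transformer_2(x_p, x_c_hat, x_c_hat_m):
    inputs = {"p": x_p, "c_hat": x_c_hat, "c_hat_m": x_c_hat_m}
    # Three regular encoders: each input attends to itself -> X'.
    updated = {name: attend(x, x) for name, x in inputs.items()}
    branches, n_cross = {}, 0
    for name, x in inputs.items():
        outs = []
        for other, x_upd in updated.items():
            if other != name:
                # (X'_other -> X_name): X_name is Query, X'_other is Key/Value.
                outs.append(attend(x, x_upd))
                n_cross += 1
        # Assumption: the two cross-modal outputs are concatenated.
        branches[name] = np.concatenate(outs, axis=-1)
    return branches, len(updated), n_cross

rng = np.random.default_rng(0)
S, d = 12, 8  # hypothetical sequence length and feature size
branches, n_regular, n_cross = interactive_transformer_2(
    rng.normal(size=(S, d)), rng.normal(size=(S, d)), rng.normal(size=(S, d))
)
print(n_regular, n_cross)   # 3 6
print(branches["p"].shape)  # (12, 16)
```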

CIT Matching Block
Based on our Interactive Transformer I, we propose the CIT matching block (Block-I) to boost the performance of the TPS transformation by strengthening the long-range correlation between $X_p \in \mathbb{R}^{B\times C\times H\times W}$ and $X_c \in \mathbb{R}^{B\times C\times H\times W}$. Here $B$, $C$, $H$, and $W$ indicate the batch size, the channel number, and the height and width of a given feature. To utilize the Transformer encoder for modeling long-range dependencies, we first adjust the dimensions of $X_p$ and $X_c$ from $(B, C, H, W)$ to $(B, C, S)$, forming $X'_p$ and $X'_c$. Note that $S = H \times W$. Besides, a 1D convolutional layer is also adopted to ensure that each element of each input sequence is sufficiently aware of its neighboring elements. Once we get $F_p$ and $F_c$ after the convolutional layers, the proposed Interactive Transformer I is applied to $F_p$ and $F_c$ for capturing the long-range correlation between the person-related and the in-shop cloth-related features. As a result, we get the output $X^1_{cross}$ of the Interactive Transformer I within the proposed CIT matching block. These procedures are depicted in Figure 2 with detailed annotations.
Instead of directly adding this long-range relation to the features $X_p$ or $X_c$, we strengthen each of them with a global strengthened attention $X_{att}$, applied through element-wise multiplication (denoted ×); the subscript (.) indicates that both features $X_p$ and $X_c$ follow the same form, yielding $X^{global}_p$ and $X^{global}_c$. Note that $X_{att}$ is produced from $X^1_{cross}$ by a linear projection and a sigmoid activation. Through this operation, the element position relation of each input is activated by the sigmoid activation function. In particular, when it is applied to the input feature as attention, both the position information of each element within each input and the correlation between the two inputs can be kept in a balanced manner. Then a matrix multiplication between $X^{global}_p$ and $X^{global}_c$ is conducted. The output $X_{out-I}$ of the proposed CIT matching block is finally obtained after a reshape operation and represents the improved correlation between the person and clothing features. Here, $X^{global}_p$ and $X^{global}_c$ have the same dimension $(B, C, S)$, and the output $X_{out-I}$ has dimension $(B, S, H, W)$.
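The shape bookkeeping of the matching block can be traced with a small NumPy sketch. Several parts are assumptions: the Interactive Transformer I output is replaced by a random placeholder of shape $(B, C, S)$, the linear projection acts over the channel axis, the gate is applied as a plain element-wise product, and the order of the two operands in the matrix multiplication is chosen arbitrarily. Only the reshape to $(B, C, S)$, the sigmoid gate, the matrix multiplication, and the final $(B, S, H, W)$ shape are taken from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

B, C, H, W = 2, 8, 16, 12   # illustrative sizes
S = H * W
rng = np.random.default_rng(0)

x_p = rng.normal(size=(B, C, H, W))   # person feature
x_c = rng.normal(size=(B, C, H, W))   # in-shop cloth feature

# (B, C, H, W) -> (B, C, S)
x_p_seq = x_p.reshape(B, C, S)
x_c_seq = x_c.reshape(B, C, S)

# Placeholder for the Interactive Transformer I output (assumed shape).
x_cross = rng.normal(size=(B, C, S))

# X_att: linear projection (hypothetical weights) followed by a sigmoid.
w_proj = rng.normal(size=(C, C))
x_att = sigmoid(np.einsum("oc,bcs->bos", w_proj, x_cross))

# Assumed gating form: element-wise multiplication with X_att.
x_p_global = x_p_seq * x_att
x_c_global = x_c_seq * x_att

# Matrix multiplication over channels gives all-pairs position correlations.
corr = np.einsum("bcs,bct->bst", x_p_global, x_c_global)   # (B, S, S)
x_out_1 = corr.reshape(B, S, H, W)                          # S * S = S * H * W
print(x_out_1.shape)  # (2, 192, 16, 12)
```

The reshape is valid because the second $S$ axis of the $(B, S, S)$ correlation factors back into $H \times W$.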

CIT Reasoning Block
Previous methods, CP-VTON [55] and CP-VTON+ [37], first concatenate the person information $p$, the warped cloth information $\hat{c}$, and the warped clothing mask $\hat{c}_m$. The concatenated input is then sent directly to a single U-Net model to generate a composition mask $M_o$ as well as a rendered person image $I_R$. However, such a rough concatenation may lead to coarse information matching, and consequently, it is difficult to achieve a well-matched final try-on result.
To this end, we propose the CIT reasoning block (Block-II) depicted in Figure 2, aiming to model these more complicated correlations among $p$, $\hat{c}$, and $\hat{c}_m$. First, we apply the patch embedding operation [13] to all three inputs. Then each of them goes through a 1D convolutional layer to model the relation of each element with its neighboring elements. After that, we get $X_p$, $X_{\hat{c}}$, and $X_{\hat{c}_m}$. To capture the complicated long-range correlations among these features, we apply the proposed Interactive Transformer II to $X_p$, $X_{\hat{c}}$, and $X_{\hat{c}_m}$. Then the output $X_{out-II}$ of the Interactive Transformer II is utilized to guide the final mask composition for better generation; a sigmoid activation function is used in this step.
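For orientation, the snippet below sketches the baseline mask composition used by CP-VTON-style try-on modules, in which the composition mask blends the warped cloth with the rendered person image. It does not reproduce how $X_{out-II}$ modifies this step in CIT; the function name and the toy arrays are assumptions of the sketch.

```python
import numpy as np

def compose(mask, warped_cloth, rendered_person):
    """Baseline composition: mask * cloth + (1 - mask) * rendered person."""
    return mask * warped_cloth + (1.0 - mask) * rendered_person

h, w = 4, 3
warped_cloth = np.full((3, h, w), 0.8)      # stand-in for the warped cloth
rendered_person = np.full((3, h, w), 0.2)   # stand-in for I_R
mask = np.zeros((1, h, w))                  # stand-in for M_o in [0, 1]
mask[:, :2, :] = 1.0                        # cloth region: top two rows

result = compose(mask, warped_cloth, rendered_person)
print(result[0])  # top rows take cloth pixels, the rest the rendered person
```

Where the mask is close to one the output copies the warped cloth, so a more accurate mask directly preserves more of the original texture.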

Optimization Objectives
The first stage of CIT is trained with sample triplets $(p, c, c_m)$, while the second stage is trained with $(p, \hat{c}, \hat{c}_m)$. In the first matching stage, we adopt the same optimization objective as CP-VTON+ [37], which combines two terms: $L_1$, the pixel-wise L1 loss between the warped result $\hat{c}$ and the ground truth $c_t$, and $L_{reg}$, the grid regularization loss, which is defined on $G_x$ and $G_y$, the grid coordinates of the generated images along the x and y directions.
In the second stage, the optimization objective (Eq. 12) consists of three terms. The first term aims to minimize the discrepancy between the output $I_o$ and the ground truth $I_{GT}$. The second term, the VGG perceptual loss [27], is widely used in image generation tasks. It is an alternative to pixel-wise losses and attempts to be closer to human perceptual similarity. The VGG loss is based on the ReLU activation layers of the pre-trained 19-layer VGG network. In its formulation, $W_{i,j}$ and $H_{i,j}$ describe the dimensions of the respective feature maps within the VGG network, and $\phi_{i,j}$ indicates the feature map obtained by the j-th convolution before the i-th max pooling layer within the VGG19 network. The third term encourages the composition mask $M_o$ to select the most suitable warped clothing mask $c_{tm}$ as much as possible.

Experiments
Datasets. We conduct all experiments on the dataset collected by Han et al. [22], which was also used in VITON [22], CP-VTON [55], and CP-VTON+ [37]. Note that due to copyright issues, we only use the reorganized version, as previous works [55,37] did. It contains around 19,000 image pairs of front-view women and top clothing items. Specifically, there are 16,253 cleaned pairs, which are split into a training set and a validation set with 14,221 and 2,032 pairs, respectively. In the training set, the target cloth and the cloth worn by the wearer are the same. In the test stage, however, there are two kinds of settings. The first one is the same as the training setting, where the target clothing item and the clothing item worn by the wearer are the same (we refer to this case as the retry-on setting because it is as if the wearer takes off the cloth and then tries it on again; hence, ground truth is available for this case). In the other setting, the target clothing item is different from the one worn by the wearer (we refer to this case as the try-on setting).
Evaluation Metrics. To evaluate the performance of our method, we first adopt the Jaccard Score (JS) [25] for the retry-on case (i.e., with ground truth) in the first stage. We also follow [37,59,24] and use the Structural Similarity (SSIM) [57], Learned Perceptual Image Patch Similarity (LPIPS) [62], Peak Signal-to-Noise Ratio (PSNR), Frechet Inception Distance (FID), and Kernel Inception Distance (KID) metrics in the second stage. Note that we adopt the original human image with the original clothing item as the reference image for SSIM and LPIPS (for LPIPS, the lower, the better), and the parsed segmentation area of the current upper clothing is used as the reference for calculating the JS score. For the try-on case (no ground truth), we evaluate the performance of our method and other state-of-the-art methods by the Inception Score (IS) [44].
Implementation Details. For the geometric matching stage, we build two feature extractors (see the down-sampling layers shown in Figure 1), each with four 2-strided down-sampling convolutional layers followed by two 1-strided ones (filter numbers: 64, 128, 256, 512, 512), to generate $X_p$ and $X_c$. The two extractors differ only in their input channels. The regression network before the TPS warping operation contains two 2-strided convolutional layers, two 1-strided ones (filter numbers: 512, 256, 128, 64), and one FC layer with output size 50. For the try-on stage, the U-Net consists of six 2-strided down-sampling convolutional layers (filter numbers: 64, 128, 256, 512, 512, 512) and six up-sampling layers (filter numbers: 512, 512, 256, 128, 64, 4). Each convolutional layer is followed by one InstanceNorm layer and a Leaky ReLU with the slope set to 0.2. Note that we stack 3 CIT encoders in both the matching and reasoning blocks.
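As a quick sanity check on the feature extractor, the arithmetic below follows the spatial size of a 256×192 input through four 2-strided and two 1-strided convolutions. The kernel sizes and paddings (4/1 for stride 2, 3/1 for stride 1) are assumptions of this sketch, since the text only specifies strides and filter numbers.

```python
def conv_out(size, kernel, stride, padding):
    """Standard output-size formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

h, w = 256, 192  # input resolution used for training

# Four stride-2 layers (assumed kernel 4, padding 1): each halves the size.
for _ in range(4):
    h, w = conv_out(h, 4, 2, 1), conv_out(w, 4, 2, 1)

# Two stride-1 layers (assumed kernel 3, padding 1): size is preserved.
for _ in range(2):
    h, w = conv_out(h, 3, 1, 1), conv_out(w, 3, 1, 1)

print(h, w, h * w)  # 16 12 192
```

Under these assumptions the extracted features are 16×12, so the flattened sequence fed to the transformer encoders would have length S = 192.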
Our training settings are similar to those of CP-VTON and CP-VTON+. Both stages are trained for 200K steps with batch size 4. For the Adam optimizer, β1 and β2 are set to 0.5 and 0.999, respectively. The learning rate is first fixed at 0.0001 for the first 100K steps and then linearly decayed to zero over the remaining steps. All input images are resized to 256×192, and the output images have the same resolution. The source code and trained models are available at https://github.com/Amazingren/CIT.
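The learning-rate schedule described above can be written as a small function. The function name and the exact decay rule (a straight line reaching zero at step 200K) are one plausible reading of "linearly decayed to zero" and should be treated as an assumption.

```python
def learning_rate(step, base_lr=1e-4, keep_steps=100_000, decay_steps=100_000):
    """Constant for the first keep_steps, then linear decay to zero."""
    if step < keep_steps:
        return base_lr
    frac = (step - keep_steps) / decay_steps
    return base_lr * max(0.0, 1.0 - frac)

for step in (0, 100_000, 150_000, 200_000):
    print(step, learning_rate(step))
```

Halfway through the decay phase (step 150K) the rate is half of the base value, and it reaches zero at the final step.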

Qualitative Comparisons
To validate the performance of the proposed CIT for virtual try-on, we first present the visualization results of both stages, including the warped clothing items and the final try-on person images.
Comparison of Warping Results. We visualize the warped clothes for both the retry-on and try-on settings in Figure 3. Note that our visualization method is similar to that of CP-VTON [55] and CP-VTON+ [37]. For ACGPN [59], we directly adopt its officially released code, which produces gray-background results. Figure 3 shows that the proposed CIT generates sharper and more realistic warped clothing items than other methods such as CP-VTON, CP-VTON+, and ACGPN. This is most visible in texture-rich cases, such as clothes with line stripes (see the last row of the same-pair case and the second row of the different-pair case) or logos (see the second row of the same-pair case and the third row of the different-pair case). We mark the obvious artifacts of the other methods with red dashed boxes in Figure 3.
Comparison of Try-On Results. Figure 4 shows the try-on results. We can see that the proposed CIT outperforms the other methods. Specifically, the proposed CIT keeps the original clothing texture and pattern as much as possible, and the final results are more realistic and natural. Compared to our method, other approaches display many artifacts, for example, the irregular logo pattern (the first row), the over-warped cloth texture (the third row), and implausible results for unique or complicated-style clothes (the last row). We also mark these artifacts with red dashed boxes in Figure 4. In addition, more qualitative try-on results of the proposed method can be found in Figure 5.

Quantitative Evaluation
To further evaluate the performance of our CIT, we adopt seven evaluation metrics, i.e., JS, SSIM, LPIPS, PSNR, IS, FID, and KID, for numerical comparison. JS (the Jaccard score) evaluates the quality of the warped mask in the first geometric matching stage with same-pair test samples. It is equivalent to the IoU metric used in CP-VTON+ but is more convenient to implement. Note that we take the cloth mask of the person as the reference image. The other metrics evaluate the performance of the second try-on stage.
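As a minimal sketch of the shape metric discussed above, the Jaccard score of two binary masks is the size of their intersection divided by the size of their union. The function name `jaccard_score`, the nested-list mask format, and the toy masks are assumptions made for illustration; this is not the evaluation script used for the reported numbers.

```python
def jaccard_score(pred_mask, ref_mask):
    """Jaccard score (intersection over union) of two equally sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, ref_row in zip(pred_mask, ref_mask):
        for p, r in zip(pred_row, ref_row):
            p, r = bool(p), bool(r)
            intersection += p and r  # pixel is foreground in both masks
            union += p or r          # pixel is foreground in at least one mask
    # Two empty masks agree perfectly; returning 1.0 here is an assumed convention.
    return intersection / union if union else 1.0


# Toy 2x3 masks: the warped cloth mask versus the cloth mask of the person.
warped = [[0, 1, 1],
          [0, 1, 1]]
reference = [[0, 0, 1],
             [0, 1, 1]]
print(jaccard_score(warped, reference))  # 0.75 (3 shared pixels out of 4 in the union)
```

Because the score only counts mask pixels, two warps with very different texture quality can receive the same value, which is the limitation the text points out.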
The results of JS are shown in Table 2. Although our CIT does not achieve the best JS score, our visual results in Figure 3 are the most reasonable ones. We believe that, with the help of the proposed Interactive Transformer I in the CIT matching block, our method learns more reasonable texture transformation patterns, and that this strongly texture-focused transformation may affect the shape alignment. The JS score, in contrast, only measures the shape alignment between the ground-truth mask and the warped clothing mask. Consequently, it cannot fully reveal the overall quality of the final generated human images. For instance, although Table 2 shows that CP-VTON+ has the best JS score of 0.812, which is higher than ours (0.800), the qualitative results show that our method is superior to CP-VTON+. Shape-only evaluation metrics, i.e., JS or IoU, therefore do not always indicate a better overall visual result. We also conduct ablation experiments to support this conclusion (see the comparison between B3 and B4 in Table 3). In addition, in terms of the realism of the generated images, the proposed CIT achieves the best KID and the second-best results, which confirms the effectiveness of our method.
For the retry-on setting, we adopt SSIM, PSNR, and LPIPS to evaluate the performance. The numerical results are shown in Table 2. The proposed CIT achieves the best results on SSIM and LPIPS among the compared methods. For the try-on setting, we use IS for evaluation. The results in Table 2 show that our CIT achieves an IS score of 3.060, only slightly lower than the 3.074 of CP-VTON+. We think the most likely reason is that IS measures the quality of the generated images at the feature level, based on image diversity and clarity. Hence, it may ignore some pixel-level properties.
Overall, although we do not obtain the best quantitative scores on the JS and IS metrics, our proposed CIT generates sharper and more realistic try-on images than the other methods. For ACGPN, we also test the performance using the corresponding official checkpoints. However, because the test set of ACGPN differs from that of the other methods, we only present its visual results in Figure 3 and Figure 4.
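Of the paired retry-on metrics, PSNR is the simplest to state: it is a log-scaled ratio between the squared peak pixel value and the mean squared error of the two images. The sketch below assumes 8-bit grayscale images stored as nested lists and uses a hypothetical function name `psnr`; it illustrates the standard formula rather than the evaluation code behind Table 2.

```python
import math


def psnr(image_a, image_b, max_value=255.0):
    """PSNR in dB between two equally sized grayscale images (nested lists)."""
    squared_error = 0.0
    count = 0
    for row_a, row_b in zip(image_a, image_b):
        for a, b in zip(row_a, row_b):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images: the error is zero
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 example: a generated image compared with its ground truth.
ground_truth = [[10, 20], [30, 40]]
generated = [[12, 18], [33, 40]]
print(round(psnr(ground_truth, generated), 2))
```

A higher value means a smaller pixel-wise error, which is why PSNR is only meaningful in the paired setting where a ground-truth image exists.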

Ablation Study and Discussion
To validate the effectiveness of each part of the proposed CIT, we conduct four ablation experiments (i.e., B1, B2, B3, and B4 in Table 3). CP-VTON+ [37] is adopted as the baseline (B0). B1 uses only the proposed CIT matching block in the first geometric matching stage and keeps the second stage the same as B0; B2 uses only the proposed CIT reasoning block in the second try-on stage and keeps the first stage the same as B0; B3 is the final version adopted in this paper, which contains both the proposed CIT matching and reasoning blocks; B4 is built on B3 but adds an extra L1 loss term that imposes a stricter constraint between the generated warped clothing mask ĉ_m and the ground-truth cloth mask c_tm of the given person (depicted with red dashed lines in Figure 1). This setting is designed to support the conclusion that a higher JS (or IoU) score does not indicate better visual results.
Comparing B0 [37] and B1 in Table 3, CP-VTON+ achieves a better JS score, but the qualitative results in Figure 3 indicate that B1 generates more reasonable and natural warped clothes (for both the retry-on and try-on settings). In other words, the proposed CIT matching block captures more texture-related latent patterns with the help of the proposed Interactive Transformer I. The comparison between B0 and B2 shows that the numerical results, i.e., SSIM and IS, improve when we apply the proposed CIT reasoning block to the warped results from B0. This demonstrates that the proposed CIT reasoning block is more effective in generating the try-on results. B3, the combination of B1 and B2, is the final version of the proposed CIT. It produces both more natural warped clothing items and more realistic try-on results. Hence, we think that for the two-stage 2D image-based VTON task, the JS or IoU metric focuses on only one aspect (i.e., shape) of the overall quality, and the final try-on results are not always better when JS or IoU scores are higher.
To support the above conclusion, we design B4 as a supplementary experiment based on B3. In Table 3, B4 obtains nearly all the best numerical results except the IS score. However, the visual comparison in Figure 6 shows that its virtual try-on results are far from satisfactory compared to B3. In addition, we conduct a user study in which 15 users answer the same questions as in Table 1 on 50 randomly selected image sets: (Q1) Which image is the most photo-realistic? (Q2) Which image better preserves the details of the target clothing? The user study results in Table 3 show that B3 generates more realistic images and preserves more details of the clothing items than B4. We also mark the regions with obvious artifacts in Figure 6.

Failure Cases and Analysis
Although our CIT can generate impressive try-on images, three kinds of common failure cases remain. We visualize them in Figure 7 with a comparison to both CP-VTON and CP-VTON+.
The first case (see the 1st row of Figure 7) arises when the difference between the clothing item in the reference image and the in-shop cloth is too large. Consequently, the mask of the person cannot match the target in-shop clothing item well. The second failure case comes from the self-occlusion problem, which leads to blurry, ambiguity-prone generated images (see the 2nd row of Figure 7). The third failure case (see the 3rd row of Figure 7) derives from the drastic difference between the pose of the person and the sides of the in-shop cloth. This also leads to ambiguous results. In the first two cases, the main reason may be that the input data lack information about whether a region of the human body should be covered with cloth or not. We propose to organize the input data further to remedy this issue, for example by using more accurate segmentation maps or adopting more fine-grained human annotations. For the last case, a 2D image-based method cannot completely capture such a complicated relationship between a person and a clothing item. We think that taking 3D input data, such as body meshes and 3D clothing items, into consideration may alleviate this problem.

Conclusion
In this paper, we propose a novel two-stage Cloth Interactive Transformer (CIT) method for the 2D image-based virtual try-on task. In the first stage, we introduce an interactive Transformer matching block, which accurately models the global long-range correlations when warping a cloth through the thin-plate spline transformation. Consequently, the warped clothing item is more realistic in terms of texture. We also present a Transformer-based reasoning block in the second stage for modeling the mutual interactive relations, which further improves the rendering process and results in more realistic try-on images. Extensive quantitative and qualitative comparisons validate that the proposed CIT achieves competitive performance.

Figure 1: The overall architecture of the proposed CIT for virtual try-on. The upper part is the Geometric Matching stage for warping the in-shop clothing items, while the bottom part is the Try-On stage for synthesizing the final try-on image of the person.

Figure 2: The key components of the proposed Cloth Interactive Transformer (CIT) for virtual try-on. The upper area on the left is the CIT Matching Block (Block-I), while the bottom area on the left indicates the CIT Reasoning Block (Block-II). On the right, the normal Transformer encoder and the proposed cross-modal Transformer encoder are shown in detail.

Interactive Transformer I is shown in the red dashed box in the upper area of Figure 2. It consists of two regular Transformer encoders (depicted in gray) and two cross-modal Transformer encoders (depicted in light blue) that are directly applied to feature maps. We use selfTrans(·) and crossTrans(·) to denote the operators of these two kinds of Transformer encoders. We are given two input features F_p and F_c with dimension (C, B, S), which are reshaped from the input features X_p and X_c with original dimension (B, C, H, W). Here B, C, H, and W denote the batch size, the number of channels, the height, and the width of the input features X_p and X_c, and S = H × W denotes the spatial dimension. Each of them first goes through its corresponding N-layer regular Transformer encoder. After that, we get the processed features F'_p and F'_c as follows:
Here, crossTrans(X'_p, X'_c) indicates that we use X'_c as the keys and values and X'_p as the queries. Conversely, crossTrans(X'_c, X'_p) indicates that the keys and values come from X'_p and the queries come from X'_c. After concatenating the outputs of the two cross-modal Transformer encoders, we get the output X^1_cross of the Interactive Transformer I, which strengthens the correlation matching ability.
Interactive Transformer II is shown in the red dashed box in the bottom area of Figure 2. Similar to Interactive Transformer I, Interactive Transformer II is also constructed by combining regular Transformer encoders and cross-modal Transformer encoders.
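The query and key/value roles of the two cross-modal encoders can be illustrated with a stripped-down attention sketch. It is a simplification under stated assumptions: a single head, no learned projections, no feed-forward or normalization layers, and concatenation along the channel axis (the text does not state the axis). The names `cross_attention`, `person_tokens`, and `cloth_tokens` are hypothetical and the values are toy data.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    queries:     list of token vectors (each a list of C floats)
    keys_values: list of token vectors used as both keys and values
    Returns one C-dimensional output vector per query token.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query token to every key token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, keys_values))
                        for c in range(dim)])
    return outputs


# Toy token sequences: S = 2 spatial positions, C = 2 channels.
person_tokens = [[1.0, 0.0], [0.0, 1.0]]
cloth_tokens = [[0.5, 0.5], [1.0, -1.0]]

person_attends_cloth = cross_attention(person_tokens, cloth_tokens)  # queries: person
cloth_attends_person = cross_attention(cloth_tokens, person_tokens)  # queries: cloth

# Concatenate the two outputs token by token (channel axis is an assumption).
fused = [p + c for p, c in zip(person_attends_cloth, cloth_attends_person)]
print(len(fused), len(fused[0]))  # 2 4
```

Swapping the two arguments is the only difference between the two directions, which mirrors how crossTrans(X'_p, X'_c) and crossTrans(X'_c, X'_p) differ in the description above.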

Figure 3: Qualitative comparisons of the warped clothes produced by the proposed CIT-based geometric matching stage. The left is for the retry-on setting (i.e., with the same cloth) while the right is for the try-on setting (i.e., with different clothing items).

Figure 4: Qualitative comparisons of different state-of-the-art methods.

Figure 5: More qualitative results of the proposed method.

Table 1: User study comparison on two questions. Q1 denotes 'Which image is the most photo-realistic?', and Q2 denotes 'Which image preserves the details of the in-shop clothing item the most?' in the user study.

Table 2: Quantitative comparison in terms of the JS, SSIM, LPIPS, PSNR, IS, FID, and KID evaluation metrics. For JS, SSIM, PSNR, and IS, higher is better, while for LPIPS, FID, and KID, lower is better. IS, FID, and KID are used to evaluate the unpaired try-on setting, while the rest are all for the paired retry-on setting. The best results are in bold and the second-best results are underlined.

We also evaluate the proposed CIT and other methods via a user study. We randomly select 120 sets of reference and target clothing images from the test dataset. Given the reference images and the target in-shop clothing items, 30 users are asked to choose the best outputs among our model and the baselines (i.e., CP-VTON, CP-VTON+, and ACGPN) according to two questions: (Q1) Which image is the most photo-realistic? (Q2) Which image better preserves the details of the target clothing? As shown in Table 1, the proposed CIT achieves significantly better results than the other methods, which further demonstrates that our model generates more realistic images. CIT also better preserves the details of the clothing items than the other methods.

Table 3: Ablation studies of the proposed CIT for virtual try-on.
The B4 setting is designed to support the conclusion that a higher JS (or IoU) score does not indicate more pleasing visual results, since JS ignores the texture or pattern aspect of the overall quality. The overall matching loss of B4 can be summarized as follows: