research-article
Open Access

Harmonious Multi-branch Network for Person Re-identification with Harder Triplet Loss

Published: 04 March 2022

Abstract

Recently, advances in person re-identification (Re-ID) have benefited from the popular multi-branch network. However, performing feature learning in a single branch with a uniform partition is likely to separate meaningful local regions, and the correlation among different branches is not well established. In this article, we propose a novel harmonious multi-branch network (HMBN) to relieve these intra-branch and inter-branch problems harmoniously. HMBN is a multi-branch network with various stripes on different branches to learn coarse-to-fine pedestrian information. We first replace the uniform partition with a horizontal overlapped partition to cover meaningful local regions between adjacent stripes in a single branch. We then incorporate a novel attention module to make all branches interact by modeling spatial contextual dependencies across branches. Finally, to train the HMBN more effectively, a harder triplet loss is introduced to optimize triplets in a harder manner. Extensive experiments on three benchmark datasets (DukeMTMC-reID, CUHK03, and Market-1501) demonstrate the superiority of our proposed HMBN over state-of-the-art methods.

1 INTRODUCTION

Person re-identification (Re-ID) aims to retrieve a person of interest across non-overlapping camera views in a large image gallery with a given probe. Re-ID is a popular computer vision task because of its great potential in video surveillance applications. Recently, deep learning methods have pushed the performance of Re-ID to a new level. However, many challenges, such as pose variations, illumination variations, view angle variations, and occlusions, make Re-ID non-trivial.

To relieve these issues, many part-based methods [22, 24, 44, 50] with multiple branches have been proposed to learn local features, and they have achieved promising results. Specifically, these methods combine information at different granularities and learn coarse-to-fine representations in multi-branch networks. Although they achieve state-of-the-art performance, they still suffer from intra-branch and inter-branch problems, that is, the problems of feature learning in a single branch and correlation among different branches.

Feature learning in a single branch. In a single branch, some part-based methods conduct pre-defined horizontal or vertical partitions on feature maps to extract fine-grained information for local feature learning based on the assumption that images are well aligned [5, 9, 41, 45, 57, 58]. Part-based Convolutional Baseline (PCB) [41] achieves competitive results compared with state-of-the-art methods by partitioning feature maps into 6 horizontal stripes. In PCB, as the number of stripes increases, retrieval accuracy improves at first but drops dramatically in the end. The over-increased number of stripes helps to learn fine-grained information but compromises the representational capability in meaningful local regions. We argue that the uniform partition is not optimal, which separates important semantic regions, as shown in Figure 1.

Fig. 1. Illustration of intra-branch and inter-branch problems. For the problem of feature learning in a single branch, Branch N employs a uniform partition with 6 stripes. The head is divided into two stripes, which diminishes the representational capability in head regions. For the problem of correlation among different branches, the multi-branch network shares lower layers to learn strongly correlated features and performs independent feature learning in higher layers for different branches. However, strong relations between branches vanish after the split.

Correlation among different branches. As is shown in Figure 1, a multi-branch network shares lower layers and extracts distinct information at higher layers for different branches. The sharing learning scheme builds branch interaction in lower layers by extracting the same low-level features (e.g., edges, lines) for each branch. In this manner, the strongly correlated information in the low-level layer is exploited. However, the interaction among branches is neglected in higher layers of the network after the split.

Triplet loss [12] is a popular loss function in part-based methods with multiple branches because of its strong capability to optimize the similarity among samples. Triplet loss aims at reducing intra-class variation and enlarging inter-class variation. However, there is still room for improvement in how triplets are optimized.

Optimizing triplets. A triplet contains one anchor, one positive, and one negative. Given an anchor, mining the hard positive and hard negative is an essential part of learning with triplet loss. Schroff et al. [33] select all anchor-positive pairs and pick hard negatives by semi-hard negative mining. Hermans et al. [12] propose to choose the hardest positives and hardest negatives within a mini-batch. However, triplets selected by the hardest positive and hardest negative mining are still not hard enough for models to discriminate without up-weighting the anchor-to-positive distance or down-weighting the anchor-to-negative distance. As a result, it is difficult to further reduce intra-class variation and further enlarge inter-class variation.

In this article, we propose a novel model, a harmonious multi-branch network (HMBN) with harder triplet loss (HTP), to tackle these problems. The HMBN jointly learns pedestrian representations in multi-granularity with three branches called S1B, S2B, and S3B. HMBN adopts S1B to learn global features and applies S2B and S3B to capture fine-grained information. Within a single branch, instead of performing a uniform partition, we design a pooling strategy called horizontal overlapped pooling (HOP) to conduct a horizontal overlapped partition on feature maps and cover meaningful local regions between adjacent stripes. Furthermore, to learn interactive features among branches, we incorporate the inter-branch attention module (IBAM), which involves three inter-branch attention submodules (IBASMs). The IBAM enables our HMBN to refine features by aggregating spatial contextual information from different branches in higher layers. In this manner, interaction among branches is preserved in higher layers of the HMBN. In addition, a novel harder triplet loss (HTP) is introduced to optimize intra-class and inter-class similarities more effectively by optimizing triplets in a harder manner. HTP up-weights the anchor-to-positive distance and down-weights the anchor-to-negative distance by a polynomial mapping function, and it penalizes more heavily the cases in which the anchor-to-positive distance is not substantially smaller than the anchor-to-negative distance. In this process, HTP further reduces intra-class variation and further enlarges inter-class variation.

To sum up, our main contributions are as follows:

  • We propose a novel harmonious multi-branch network (HMBN) to learn discriminative pedestrian information by handling intra-branch and inter-branch problems harmoniously.

  • We design a new pooling strategy named horizontal overlapped pooling (HOP) that helps to keep the balance between learning fine-grained information and extracting features in meaningful local regions.

  • We incorporate a compound attention module called the inter-branch attention module (IBAM) into the HMBN to learn interactive representations for each branch. To the best of our knowledge, this is the first module that builds strong relations among different branches in higher layers for Re-ID.

  • We introduce a generalized triplet loss termed harder triplet loss (HTP) to optimize triplets in a harder manner, which is more effective than traditional triplet loss.

  • Extensive experiments on three datasets show that the HMBN outperforms state-of-the-art methods. In addition, ablation studies verify that HOP, IBAM, and HTP all contribute to an accuracy gain.

This article is an extended version of our preliminary conference work [43]. This extended journal version makes four modifications. (1) We explicitly introduce a multi-branch architecture for robust intra-branch and inter-branch feature learning. In our conference version, we mainly introduced two independent components (HOP and IBAM) but ignored the elaborate system in its entirety. The whole system (HMBN) is also a contribution of our work, as it alleviates intra-branch and inter-branch problems harmoniously. (2) Each component is discussed in more detail; for example, an additional comparison between the original uniform partition and our proposed horizontal overlapped partition is presented. (3) We propose an HTP to optimize triplets more effectively. (4) More comprehensive experiments with parameter analysis and visualizations are conducted. Specifically, we add ablation experiments to validate the HTP and an additional ablation study on the CUHK03-Labeled and CUHK03-Detected datasets. In addition, each component is verified more thoroughly than in the conference version; for example, HOP is ablated in both single-branch and multi-branch networks.

2 RELATED WORKS

In this section, we discuss recent related works in terms of part-based Re-ID, attention-based Re-ID, and metric learning–based Re-ID.

2.1 Part-Based Re-ID

Convolutional neural networks (CNNs) were first applied to image classification [11, 16, 17, 19, 37, 42]. Thus, it is popular to treat the training process of Re-ID as an image classification task and extract global pedestrian representations. However, global features alone are insufficient for capturing fine-grained cues, so many researchers aggregate global and local features. These methods can be divided into two categories according to the number of branches. The first category is single-branch methods [41, 45, 68]. For example, Varior et al. [45] crop images into several horizontal stripes and process the stripes sequentially with long short-term memory (LSTM) [13] cells. In this manner, contextual information among image regions is leveraged to enhance the representational capability of local features. Zhou et al. [68] design a novel OSNet, which is composed of stacked convolutional streams with different receptive field sizes for extracting multi-scale features. The second category is multi-branch methods. Multi-branch methods are superior to single-branch methods in aggregating branch-specific information, that is, fine-grained cues [22, 50], human parsing results [18, 38], pose estimation [30, 39, 54], keypoint estimation [54], and semantic attributes [24, 44]. For example, in the SPReID framework, Kalayeh et al. [18] utilize a parsing model to generate probability maps associated with 5 predefined human parts and extract robust local features. Sarfraz et al. [32] incorporate 14 main body joint keypoints to model pedestrian information.

However, none of them perform feature map partition with overlap. In contrast, HOP is designed to cover meaningful body regions while extracting fine-grained information.

2.2 Attention-Based Re-ID

Attention mechanisms have proven effective in many tasks, for example, image classification [16, 48, 53, 60], video classification [25, 29], image captioning [6, 55], and generative adversarial networks (GANs) [27, 59]. They are also efficient and effective for Re-ID because they dynamically focus on salient regions. Some methods employ spatially based attention [14, 23, 34, 38] for feature refinement. For example, Li et al. [23] introduce the HA-CNN to learn soft pixel attention and hard regional attention jointly for robust feature extraction. Channel-based attention is also explored in many works [47, 54]. The attention mechanism is also applicable to frame or feature sequences [3, 15, 20, 36]. Si et al. [36] propose a framework termed DuATM, which is composed of a dual attention mechanism to learn context-aware information by modeling intra-sequence and inter-sequence dependencies.

These works apply an attention mechanism within a single branch to focus on certain patterns. In contrast, our proposed IBAM helps to generate refined representations by aggregating information from all branches.

2.3 Metric Learning–Based Re-ID

Metric learning aims to learn a similarity or a mapping function to minimize the intra-class variation while maximizing the inter-class variation. Triplet loss is a widely used loss function that treats the Re-ID task as a ranking task and optimizes the similarity among anchor, positive, and negative samples. Various methods [12, 35, 40, 52] have been proposed to select hard triplets for discriminative learning. Batch hard triplet loss [12] selects the hardest positives and hardest negatives for robust Re-ID learning. In addition, many methods [61, 69, 70] have been proposed to improve gradient backpropagation. Zhou et al. [69] introduce the center point of the positive pair to model all of the pairwise relationships.

In contrast to previous variations of triplet loss, HTP is a generalized triplet loss to optimize triplets in a harder manner dynamically.

3 HARMONIOUS MULTI-BRANCH NETWORK (HMBN)

In this section, we first describe the overall architecture of the HMBN. Then, the coarse-to-fine structure and horizontal overlapped pooling (HOP) are discussed, followed by a novel attention module named the inter-branch attention module (IBAM). Next, an improved triplet loss called harder triplet loss (HTP) is presented. Finally, we discuss the relations between the proposed modules and some existing methods.

3.1 Overall Architecture

As shown in Figure 2, the HMBN is a multi-branch network that includes a base module and three independent branches. ResNet-50 [11] is used as our feature extraction backbone. The base module consists of the layers before conv4_2 and generates shared low-level visual features for each branch. The three branches are taken directly from the layers after conv4_1 and are named stripe 1 branch (S1B), stripe 2 branch (S2B), and stripe 3 branch (S3B) according to their number of stripes. S1B performs the Re-ID task at the global level, while S2B and S3B both perform feature learning at the global level and part level. In S2B and S3B, we remove the last spatial down-sampling operation to enrich the granularity. As a result, feature tensors \( \boldsymbol {T}_1 \), \( \boldsymbol {T}_2 \), and \( \boldsymbol {T}_3 \), the outputs of conv5 from S1B, S2B, and S3B, respectively, have different spatial sizes. To integrate multi-branch features, we inject the IBAM into the higher layers of the HMBN to exploit complementary information across branches.

Fig. 2. The overall architecture of the proposed HMBN. The HMBN contains a base module and three independent branches: S1B, S2B, and S3B. IBAM is injected in the higher layer of the network for modeling interactive information in different branches. HMBN learns global features by applying GMP on three branches and extracts local features by employing HOP on S2B and S3B. The whole network is trained with classification loss and HTP. GMP is short for global max pooling.

With global max pooling (GMP), the HMBN generates a global feature representation \( \boldsymbol {g}_i (i=1,2,3) \) for each branch. A parameter-shared 1×1 convolution layer, followed by a batch normalization layer and a ReLU layer, reduces the 2048-dim \( \boldsymbol {g}_i(i=1,2,3) \) to a 256-dim global feature \( \boldsymbol {u}_i(i=1,2,3) \).

With our proposed HOP, HMBN partitions \( \boldsymbol {T}_i(i=2,3) \) into 2 and 3 horizontal stripes in S2B and S3B, respectively, and pools these stripes to generate column feature vectors \( \boldsymbol {p}^n_m \), where \( m \) and \( n \) denote the \( m \)-th stripe in the stripe \( n \) branch. The dimension of \( \boldsymbol {p}^n_m \) is also reduced to 256 by the 1×1 convolution layer to obtain a dimension-reduced local feature \( \boldsymbol {v}^n_m \).

3.2 Coarse-to-Fine Structure

The HMBN is a multi-branch network for coarse-to-fine feature learning. S1B is designed to learn the coarse-grained information (i.e., global information), and S2B and S3B are incorporated to learn fine-grained information with different granularities. We compare the activations of the last convolutional feature maps from B + S1B, B + S2B, and B + S3B in Figure 3. B + S1B is short for a model that includes the base module and S1B, and so forth. B + S1B mainly focuses on the most discriminative regions (e.g., shoulder, shoes). As the number of stripes increases, more detailed regions can be observed. Regions marked by a red ellipse are ignored by B + S1B but are observed by B + S2B and B + S3B. Regions marked by a yellow ellipse are noticed only by B + S3B.

Fig. 3. Visualization results of activations in three coarse-to-fine networks.

3.3 Horizontal Overlapped Pooling (HOP)

Given a feature map \( \boldsymbol {F}\in \mathbb {R}^{C\times {H}\times {W}} \), HOP is illustrated in Figure 4. It has two parameters, \( l \) and \( k \): \( l \) is the total height of the overlapped areas in one stripe and \( k \) is the number of stripes. When \( k = 1 \), HOP degrades into GMP and retains the global information. When \( k > 1 \), HOP learns fine-grained information. Thus, in HMBN, we set \( k = 2 \) in S2B and \( k = 3 \) in S3B.

Fig. 4. Horizontal overlapped pooling (HOP) in a general form. GMP is short for global max pooling. The size of the overlapped portion is \( C\times {h}\times {W} \) . \( l \) is the total height of overlapped areas in one stripe. \( k \) is the number of partitions.

First, we perform a uniform partition on the feature map \( \boldsymbol {F} \) horizontally. With the aim of devoting equal attention to each stripe, the parts on the top and bottom are extended in one direction, while the others are extended in two directions, so that all stripes keep the same spatial size. An overlapped portion is a smaller 3D tensor whose size is \( C\times {h}\times {W} \), where \( h \) refers to its height. In this case, \( l = 2h \), so \( l \) must be an even number. Finally, each horizontal stripe is pooled by GMP to generate a part-level vector.
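
The partition described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `horizontal_overlapped_pooling` is hypothetical, \( H \) is assumed to be divisible by \( k \), and the exact geometry is an interpretation of the text in which edge stripes grow by \( l \) rows toward the interior and interior stripes grow by \( h = l/2 \) rows on each side, so that every stripe has height \( H/k + l \).

```python
import numpy as np


def horizontal_overlapped_pooling(feature_map, k, l):
    """Max-pool a (C, H, W) feature map over k overlapping horizontal stripes.

    Returns an array of shape (k, C), one part-level vector per stripe.
    Assumed geometry (an interpretation, not confirmed by the paper):
    every stripe has height H / k + l; the top and bottom stripes are
    extended by l rows toward the interior, and the other stripes are
    extended by h = l / 2 rows on both sides.
    """
    _, height, _ = feature_map.shape
    if k == 1:
        # A single stripe covers the whole map, so HOP reduces to GMP.
        return feature_map.max(axis=(1, 2))[np.newaxis, :]
    if l < 0 or l % 2 != 0:
        raise ValueError("l must be a non-negative even number")
    if height % k != 0:
        raise ValueError("this sketch assumes H is divisible by k")
    base = height // k  # stripe height of the uniform partition
    if base + l > height:
        raise ValueError("overlap is too large for this feature map")
    h = l // 2
    parts = []
    for i in range(k):
        if i == 0:
            top, bottom = 0, base + l
        elif i == k - 1:
            top, bottom = height - base - l, height
        else:
            top, bottom = i * base - h, (i + 1) * base + h
        # Global max pooling over the rows and columns of this stripe.
        parts.append(feature_map[:, top:bottom, :].max(axis=(1, 2)))
    return np.stack(parts)
```

Calling the function with `l=0` gives the uniform partition, which makes it easy to compare the two schemes on the same feature map.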

To highlight the difference between a uniform partition and a horizontal overlapped partition, we visualize the region covered by each stripe in Figure 5 when the number of stripes is 6. The uniform partition diminishes the representational capability by splitting the head into 2 stripes. With the horizontal overlapped partition, information for the head region is well preserved.

Fig. 5. Comparison between a uniform partition and horizontal overlapped partition when the number of stripes is 6. (a) Uniform partition. (b) Horizontal overlapped partition.

3.4 Inter-Branch Attention Module (IBAM)

Features extracted from different branches together help to boost the feature representational capability. In order to make branches interact, an IBAM is applied, as shown in Figure 2. Features from paired branches are fed into an inter-branch attention submodule (IBASM), which outputs paired refined features. The HMBN has three branches, which form \( C_3^2=3 \) combinations when we choose paired branches. As a result, each branch is selected twice and has two refined outputs that build interaction between various branches. The mean of two refined outputs is used to update the original feature, which is represented by the mean operation in Figure 2.
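
The pairing-and-averaging logic of the IBAM can be sketched as follows. This is an illustrative sketch only: `inter_branch_attention_module` and `refine_pair` are hypothetical names, and `refine_pair` is a stand-in callable for an IBASM that takes two features and returns their two refined versions. The paper uses one IBASM per pair; a single callable is shared here for brevity.

```python
from itertools import combinations

import numpy as np


def inter_branch_attention_module(features, refine_pair):
    """Refine every branch feature with every other branch, then average.

    features: list of equally shaped arrays, one per branch.
    refine_pair: hypothetical callable (a, b) -> (refined_a, refined_b)
        standing in for one inter-branch attention submodule (IBASM).
    """
    refined = [[] for _ in features]
    # Three branches give C(3, 2) = 3 unordered pairs.
    for i, j in combinations(range(len(features)), 2):
        out_i, out_j = refine_pair(features[i], features[j])
        refined[i].append(out_i)
        refined[j].append(out_j)
    # Each branch appears in two pairs; the mean of its two refined
    # outputs becomes the updated feature of that branch.
    return [np.mean(outputs, axis=0) for outputs in refined]
```

With three branches, each entry of `refined` holds exactly two arrays before the mean is taken, which matches the "selected twice" statement above.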

Figure 6 depicts the detailed structure of IBASM. Given two feature maps \( \boldsymbol {A}\in \mathbb {R}^{C\times {H}\times {W}} \) and \( \boldsymbol {B}\in \mathbb {R}^{C\times {H}\times {W}} \) from different branches, a 1×1 convolution layer is employed to generate four new feature maps \( \boldsymbol {X} \), \( \boldsymbol {Y} \), \( \boldsymbol {M} \), and \( \boldsymbol {N} \), where \( {\boldsymbol {X}, \boldsymbol {Y},\boldsymbol {M}, \boldsymbol {N}}\in \mathbb {R}^{\frac{C}{8}\times {H}\times {W}} \). These four feature maps are reshaped to \( \mathbb {R}^{\frac{C}{8}\times {L}} \), where \( L=H\times {W} \) is the number of feature locations. Pixel-wise similarity in the spatial domain is calculated by matrix multiplication between the transposed \( \boldsymbol {X} \) and \( \boldsymbol {N} \). It is then normalized to obtain the spatial attention map \( \boldsymbol {S}\in \mathbb {R}^{L\times {L}} \), as shown here: (1) \( \begin{equation} S_{i,j}=\frac{\exp {(m_{i,j})}}{\sum _{i=1}^L \exp {(m_{i,j})}}, \quad m_{i,j}=\boldsymbol {X}^T_i{\boldsymbol {N}}_j, \end{equation} \) where \( \boldsymbol {X}_{i} \) and \( \boldsymbol {N}_{j} \) denote the \( i^{th} \) and \( j^{th} \) spatial features of \( \boldsymbol {X} \) and \( \boldsymbol {N} \), respectively.

Fig. 6. The inter-branch attention submodule (IBASM). “ \( \oplus \) ” denotes element-wise sum; “ \( \otimes \) ” denotes matrix multiplication.

To calculate the output \( \boldsymbol {C} \), the HMBN first predicts \( \boldsymbol {A} \) with attention map \( \boldsymbol {S} \) and information from input \( \boldsymbol {B} \). The prediction, which is the result of matrix multiplication between transposed \( \boldsymbol {S} \) and \( \boldsymbol {M} \), is reshaped to \( \mathbb {R}^{C\times {H}\times {W}} \). Then, the HMBN performs an element-wise sum between the weighted prediction and the original \( \boldsymbol {A} \). The output \( \boldsymbol {C} \) is defined as (2) \( \begin{equation} {\boldsymbol {C}}_j=\gamma _1\sum _{i=1}^L{{S^T}_{i,j}\boldsymbol {M}_i}+ {\boldsymbol {A}}_j, \end{equation} \) where \( \gamma _1 \) is a learnable weight that is initialized as 0. The output \( \boldsymbol {D} \) is defined as (3) \( \begin{equation} {\boldsymbol {D}}_j=\gamma _2\sum _{i=1}^L{{S}_{i,j}\boldsymbol {Y}_i}+ {\boldsymbol {B}}_j. \end{equation} \) In this manner, the refined \( \boldsymbol {A} \), which is denoted as \( \boldsymbol {C} \), contains reciprocal information from \( \boldsymbol {B} \). The refined \( \boldsymbol {B} \), which is denoted as \( \boldsymbol {D} \), contains reciprocal information from \( \boldsymbol {A} \).
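
Equations (1) to (3) can be traced in a small NumPy sketch. It rests on several assumptions that the text does not confirm: 1×1 convolutions are written as plain matrices applied at every location, \( \boldsymbol {X} \) and \( \boldsymbol {Y} \) are taken to come from \( \boldsymbol {A} \) while \( \boldsymbol {M} \) and \( \boldsymbol {N} \) come from \( \boldsymbol {B} \), and the value maps \( \boldsymbol {M} \) and \( \boldsymbol {Y} \) keep \( C \) channels (rather than \( C/8 \)) so that the element-wise sum with \( \boldsymbol {A} \) and \( \boldsymbol {B} \) is well defined. All function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def inter_branch_attention_submodule(A, B, Wx, Wn, Wm, Wy,
                                     gamma1=0.0, gamma2=0.0):
    """Sketch of Eqs. (1)-(3) for two (C, H, W) feature maps.

    Wx, Wn: (C // 8, C) projections producing X (from A) and N (from B).
    Wm, Wy: (C, C) projections producing M (from B) and Y (from A);
        keeping C channels here is an assumption made so that the
        residual sums below are shape-compatible.
    """
    channels, height, width = A.shape
    L = height * width
    a = A.reshape(channels, L)
    b = B.reshape(channels, L)
    X, Y = Wx @ a, Wy @ a
    N, M = Wn @ b, Wm @ b
    m = X.T @ N              # (L, L), m[i, j] = X_i^T N_j
    S = softmax(m, axis=0)   # Eq. (1): normalise over i for each j
    # Eq. (2): C_j = gamma1 * sum_i S^T[i, j] * M_i + A_j
    C_out = gamma1 * (M @ S.T) + a
    # Eq. (3): D_j = gamma2 * sum_i S[i, j] * Y_i + B_j
    D_out = gamma2 * (Y @ S) + b
    return (C_out.reshape(channels, height, width),
            D_out.reshape(channels, height, width),
            S)
```

Because the weights start at zero, the submodule initially returns its inputs unchanged and only gradually mixes in information from the other branch as the weights are learned.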

The IBAM can be plugged into two positions in the HMBN: the output of conv4 and the output of conv5. Since the inputs of the IBAM need to have the same size, we apply a modification, removing the last spatial down-sampling operation in S1B when injecting the IBAM on the output of conv5. We find that adding the IBAM at the output of conv4 brings more performance improvement, because keeping the down-sample operation in S1B will produce complementary features when the down-sampling operation is removed in S2B and S3B. For this reason, the IBAM is placed on the output layer of conv4.

It is worth noting that the proposed IBAM is pluggable and can be injected into any existing multi-branch network because IBASM does not change the size of the feature map.

Armed with our proposed IBAM, spatial contextual dependencies across branches are well established and the interactive information in multi-granularity is utilized in higher layers.

3.5 Harder Triplet Loss (HTP)

In this subsection, we revisit the traditional triplet loss and discuss its drawbacks in optimizing triplets. We then propose the harder triplet loss (HTP) to address these deficiencies.

Normally, triplet loss is trained on a set of triplet units \( \lbrace (\boldsymbol {x}, \boldsymbol {x}^+, \boldsymbol {x}^-)\rbrace \), in which \( (\boldsymbol {x}, \boldsymbol {x}^+) \) represents a positive pair from the same pedestrian and a negative pair \( (\boldsymbol {x}, \boldsymbol {x}^-) \) represents images from different pedestrians. Given one triplet \( (\boldsymbol {x}, \boldsymbol {x}^+, \boldsymbol {x}^-) \), triplet loss is formulated as (4) \( \begin{equation} \begin{aligned}L_{tri}(f(\boldsymbol {x}), f(\boldsymbol {x}^+), f(\boldsymbol {x}^-)) &= \left[m+ d_{a,p} - d_{a,n}\right]_+, \\ d_{a,p} &= d\left(f(\boldsymbol {x}),f(\boldsymbol {x}^+) \right),\\ d_{a,n} &= d\left(f(\boldsymbol {x}),f(\boldsymbol {x}^-) \right), \end{aligned} \end{equation} \) where \( m \) is the margin parameter, \( d_{a,p} \) and \( d_{a,n} \) are short for anchor-to-positive distance and anchor-to-negative distance, \( d(\cdot) \) is the Euclidean distance, \( \left[\cdot \right]_+ \) denotes \( max(\cdot ,0) \), and \( f(\boldsymbol {x}) \), \( f(\boldsymbol {x}^+) \), and \( f(\boldsymbol {x}^-) \) are features of sample \( \boldsymbol {x} \), \( \boldsymbol {x}^+ \), and \( \boldsymbol {x}^- \), respectively.

The core idea of triplet loss is to optimize the similarity in triplets so that \( d_{a,n} \) is larger than \( d_{a,p} \) by at least the margin \( m \). However, the optimization can still be improved, as shown in Figure 7. To compare traditional triplet loss and HTP, we analyze the empirical distribution of the relative distance of a positive pair and negative pair, which is defined as \( [m+ d_{a,p}-d_{a,n} ]_+ \), from a converged HMBN. The number of samples is displayed on a log axis because easy samples are extremely numerous. With our proposed HTP, the majority of samples move left in Figure 7(b) compared with Figure 7(a), indicating that HTP helps to further reduce intra-class variation and further enlarge inter-class variation.

Fig. 7. Empirical distribution of relative distance of a positive pair and negative pair from a converged HMBN model trained with traditional triplet loss and HTP on the DukeMTMC-reID dataset.

For optimizing triplets in a harder manner, HTP penalizes large \( d_{a,p} \) and small \( d_{a,n} \) with polynomial mapping function, defined as follows: (5) \( \begin{equation} \begin{aligned} \widetilde{d}_{a,p}=\left(d_{a,p}+1\right)^{(1+\alpha)}-1, \end{aligned} \end{equation} \) (6) \( \begin{equation} \begin{aligned} \widetilde{d}_{a,n}=\left(d_{a,n}+1\right)^{(1-\alpha)}-1, \end{aligned} \end{equation} \) where \( \alpha \) is the scale factor. We update \( d_{a,p} \) and \( d_{a,n} \) with \( \widetilde{d}_{a,p} \) and \( \widetilde{d}_{a,n} \). The polynomial mapping function is visualized for several values of \( \alpha \) in Figure 8. When \( \alpha =0 \), \( \widetilde{d}_{a,p}=d_{a,p} \), and \( \widetilde{d}_{a,n}=d_{a,n} \). The larger \( \alpha \) is, the more penalty \( d_{a,p} \) and \( d_{a,n} \) get.

Fig. 8. Illustration of the polynomial mapping function. (a) Polynomial mapping function on \( d_{a,p} \). (b) Polynomial mapping function on \( d_{a,n} \).

Based on the polynomial mapping function, the HTP is defined as follows: (7) \( \begin{equation} \begin{aligned}L_{HTP}(f(\boldsymbol {x}), f(\boldsymbol {x}^+), f(\boldsymbol {x}^-))=\left[m+ \widetilde{d}_{a,p} - \widetilde{d}_{a,n}\right]_+. \end{aligned} \end{equation} \)
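
Equations (5) to (7) translate almost directly into NumPy. The sketch below is illustrative only: the function names are hypothetical, and the margin and \( \alpha \) values used when calling it are placeholders rather than the settings used in the experiments.

```python
import numpy as np


def polynomial_mapping(d_ap, d_an, alpha):
    """Eqs. (5)-(6): up-weight d_ap and down-weight d_an."""
    d_ap_tilde = (d_ap + 1.0) ** (1.0 + alpha) - 1.0
    d_an_tilde = (d_an + 1.0) ** (1.0 - alpha) - 1.0
    return d_ap_tilde, d_an_tilde


def harder_triplet_loss(anchor, positive, negative, margin, alpha):
    """Eq. (7) for a batch of triplets; inputs have shape (batch, dim).

    Returns one loss value per triplet. With alpha = 0 the mapping is
    the identity and the result equals the traditional triplet loss.
    """
    d_ap = np.linalg.norm(anchor - positive, axis=-1)  # Euclidean distance
    d_an = np.linalg.norm(anchor - negative, axis=-1)
    d_ap_tilde, d_an_tilde = polynomial_mapping(d_ap, d_an, alpha)
    return np.maximum(margin + d_ap_tilde - d_an_tilde, 0.0)
```

As a quick numerical check of the mapping, a distance of 1 with \( \alpha = 0.2 \) becomes \( 2^{1.2}-1 \approx 1.30 \) on the positive side and \( 2^{0.8}-1 \approx 0.74 \) on the negative side, so a triplet that previously satisfied the margin can become active again.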

As shown in Figure 2, global features \( \boldsymbol {u}_i(i=1,2,3) \) are trained with HTP and classification loss. Specifically, HTP on global features can be formulated as (8) \( \begin{equation} \begin{aligned}L_{HTP}^{g} = \sum _{i=1}^{N_g}\left(\frac{1}{N_t}\sum _{j=1}^{N_t}L_{HTP}\left({\boldsymbol {u}_i}^{(j)},{\boldsymbol {u}_i}^{(j+)},{\boldsymbol {u}_i}^{(j-)}\right) \right), \end{aligned} \end{equation} \) where \( N_g \) and \( N_t \) are the numbers of global features and sampled triplets, and \( {\boldsymbol {u}_i}^{(j)} \), \( {\boldsymbol {u}_i}^{(j+)} \), and \( {\boldsymbol {u}_i}^{(j-)} \) are the features \( {\boldsymbol {u}_i} \) extracted from the anchor, positive, and negative samples in the \( j \)-th triplet, respectively. Classification loss on global features can be formulated as (9) \( \begin{equation} L_{cls}^{g} = \sum _{i=1}^{N_g}\left(-\frac{1}{N}\sum _{j=1}^{N}\log \frac{\exp \left(((\boldsymbol {W}^i)_{y_j})^T{\boldsymbol {u}_i}\right)}{\sum _{k=1}^{C}\exp \left(((\boldsymbol {W}^i)_{k})^T{\boldsymbol {u}_i}\right)}\right), \end{equation} \) where \( N \) and \( C \) are the numbers of input images and identities, and \( y_j \) is the ground-truth identity of the \( j \)-th input image. \( (\boldsymbol {W}^i)_{k} \) is the \( k \)-th column of the fully connected layer whose input is \( \boldsymbol {u}_i \). Local features \( \boldsymbol {v}^n_m \) are trained only with classification loss. Classification loss on local features is formulated as (10) \( \begin{equation} L_{cls}^{l} = \sum _{n=2}^{N_b} \sum _{m=1}^{n} \left(-\frac{1}{N}\sum _{j=1}^{N}\log \frac{\exp \left(((\boldsymbol {W}_m^n)_{y_j})^T{\boldsymbol {v}_m^n}\right)}{\sum _{k=1}^{C}\exp \left(((\boldsymbol {W}_m^n)_{k})^T{\boldsymbol {v}_m^n}\right)}\right), \end{equation} \) where \( N_b \) is the number of branches and \( (\boldsymbol {W}_m^n)_{k} \) is the \( k \)-th column of the fully connected layer whose input is \( \boldsymbol {v}^n_m \).
The final loss is defined as follows: (11) \( \begin{equation} L=\frac{1}{N_{htp}}L_{HTP}^{g} + \lambda \frac{1}{N_{cls}}\left(L_{cls}^g + L_{cls}^l\right), \end{equation} \) where \( N_{htp} \) and \( N_{cls} \) are the numbers of features trained with HTP and classification loss and \( \lambda \) is the weight of classification loss. Specifically, we set \( \lambda \) to 2 in the following experiments.
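
The way Equation (11) combines the individual terms can be sketched as follows. This is a hedged illustration, not the training code: `cross_entropy` and `total_loss` are hypothetical helper names, each list entry is assumed to be the already averaged loss of one feature, and the default `lam=2.0` follows the value of \( \lambda \) stated above.

```python
import numpy as np


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for logits (N, C) and integer labels (N,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def total_loss(htp_global, cls_global, cls_local, lam=2.0):
    """Eq. (11) from per-feature scalar losses.

    htp_global: HTP loss of each global feature (summing to L_HTP^g).
    cls_global: classification loss of each global feature (L_cls^g).
    cls_local: classification loss of each local feature (L_cls^l).
    """
    n_htp = len(htp_global)                  # features trained with HTP
    n_cls = len(cls_global) + len(cls_local)  # features trained with cls loss
    htp_term = sum(htp_global) / n_htp
    cls_term = (sum(cls_global) + sum(cls_local)) / n_cls
    return htp_term + lam * cls_term
```

Reading Section 3.1 as three global features plus 2 + 3 local features, the HMBN would use \( N_{htp} = 3 \) and \( N_{cls} = 8 \) in this formula.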

3.6 Discussions

This subsection contains a brief discussion of the proposed modules and some similar existing methods that emphasizes the difference between them. However, our proposed modules and the compared existing methods are designed with different purposes, which means that they can hardly be compared in a fair experimental setting.

Relations between HOP and OBM. The overlapping blocks model (OBM) [4] uses a multiple overlapping blocks structure to pool features from overlapping regions. The OBM requires horizontal partitions at multiple scales. In contrast, HOP operates on a single scale, which makes it lightweight in training because it needs relatively few fully connected layers.

Relations between IBASM and non-local block. In some ways, the IBASM can be regarded as a variation of the non-local block [51]. The IBASM differs from the non-local block as follows: (1) The IBASM takes two input features while the non-local block takes one input feature. The IBASM performs non-local operations on two features. This modification helps the model refine one input feature with the consideration of the other input feature. (2) The IBASM produces two output features corresponding to two refined input features containing reciprocal information from each other. “Encoder-decoder attention” layers [46] and pair-wise non-local operation [10] both take two input features to compute non-local operations and produce one output feature corresponding to one refined input feature containing the reciprocal information from the other input feature. The IBASM is the first module to build relations between two branches for Re-ID.

Relations between IBAM and PS-MCNN. The IBAM has some similarities with the partially shared multi-task convolutional neural network (PS-MCNN) [1] because both are designed to make branches interact. However, our IBAM is different from a PS-MCNN in three aspects. (1) An IBAM aims to build relations among different branches with various granularities while a PS-MCNN focuses on building relations among different branches with various attribute groups. (2) An IBAM builds interactions among all branches by modeling the relations of paired branches while a PS-MCNN introduces a new Shared Network (SNet) to learn shared information for all branches. In addition, IBAM considers spatial information in the process of interaction, which is ignored by a PS-MCNN. (3) An IBAM is a module that can be easily embedded into any multi-branch network architecture, while a PS-MCNN is a network designed for building interactions among different branches with various attribute groups specifically. Our IBAM is more general than a PS-MCNN.

4 EXPERIMENTS

In this section, we first describe three datasets and evaluation protocols in our experiments. Then, implementation details are introduced. Next, we compare the retrieval accuracy of the HMBN with state-of-the-art methods on these three datasets. Finally, we carry out ablation studies on DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected to verify the effectiveness of each component. Parameter analysis and visualization are also included.

4.1 Datasets and Evaluation Protocols

We conduct experiments on three popular Re-ID datasets: DukeMTMC-reID [31, 65], CUHK03 [21], and Market-1501 [63]. Dataset statistics are shown in Table 1. \( {\bf DukeMTMC-reID} \) is a subset of the DukeMTMC dataset [31] for Re-ID and is one of the largest datasets for the Re-ID task. It contains 36,411 images of 1,812 pedestrians captured by 8 high-resolution cameras. A total of 702 pedestrians with 16,522 images are randomly selected as the training set. The other 702 pedestrians are included in the testing set, in which 2,228 images and 17,661 images are included in the query set and gallery set, respectively. The remaining 408 pedestrians are distractors. \( {\bf CUHK03} \) is a relatively small dataset compared with DukeMTMC-reID. It has 1,467 pedestrians with 14,097 images captured by 6 cameras on the CUHK campus. Both manually annotated bounding boxes and bounding boxes detected by the Deformable Part Model (DPM) detector [8] are provided, denoted as CUHK03-Labeled and CUHK03-Detected, respectively. In this article, we use both settings. \( {\bf Market-1501} \) is another large dataset, collected with the DPM detector [8] from 6 cameras. The whole dataset is separated into a training set with 12,936 images of 751 pedestrians and a testing set including 3,368 query images of 750 pedestrians and 15,913 gallery images of 751 pedestrians.

| | DukeMTMC-reID | CUHK03 | Market-1501 |
|---|---|---|---|
| Train (IDs/Images) | 702/16,522 | 767/7,365 | 751/12,936 |
| Gallery (IDs/Images) | 1,110/17,661 | 700/5,332 | 751/15,913 |
| Query (IDs/Images) | 702/2,228 | 700/1,400 | 750/3,368 |
| Cameras | 8 | 6 | 6 |

Table 1. Dataset Statistics

We report Cumulative Matching Characteristics (CMCs) at rank-1, and the mean Average Precision (mAP) with the single-shot setting on all candidate datasets.
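As an illustration of the two reported metrics, the sketch below computes rank-1 accuracy and mAP from per-query ranking lists. It assumes that each gallery list is already sorted by ascending distance to the query and that any gallery filtering required by a dataset's protocol has been applied beforehand. The function names and the toy data are hypothetical.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches[i] is True if the gallery item at rank i+1 is correct."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0


def evaluate(ranked_lists):
    """Return (rank-1 accuracy, mAP) over all queries."""
    num_queries = len(ranked_lists)
    rank1 = sum(1 for r in ranked_lists if r and r[0]) / num_queries
    mean_ap = sum(average_precision(r) for r in ranked_lists) / num_queries
    return rank1, mean_ap


# Toy example: two queries, four gallery images each.
ranked_lists = [
    [True, False, True, False],   # correct matches at ranks 1 and 3
    [False, True, False, False],  # correct match at rank 2
]
rank1, mean_ap = evaluate(ranked_lists)
```

Rank-1 only checks the top-ranked gallery image, whereas mAP rewards ranking every correct match early, which is why the two metrics can order methods differently.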

4.2 Implementation Details

The proposed HMBN is implemented using the PyTorch [28] framework on a single NVIDIA GTX 1080 Ti GPU. The weights of ResNet-50 [11] pretrained on ImageNet [7] are adopted to initialize the parameters of the HMBN.

In the training phase, we resize the input images to \( 384\times {128} \). They are then augmented by random horizontal flipping, normalization, and random erasing [67]. Training runs for 500 epochs in total. The initial learning rate is set to 2e-4 and decays to 2e-5 after 320 epochs and to 2e-6 after 380 epochs. An Adam optimizer with a weight decay of 5e-4 is used to update the weight parameters. The batch size is set to 32, with 4 images per identity. The margin in the HTP is set to 1.2 in the following experiments. It should be emphasized that the current experimental setting differs slightly from the conference version in two ways. (1) Different batch sizes: the batch size is set to 32 in this journal version, whereas it is 16 in the conference version. (2) Different machines: the experiments for the conference and journal versions were conducted on two independent machines with different hardware and software (CPUs, GPUs, and operating systems). Because of these differences in experimental settings, the results reported in this article differ slightly from those of the conference version.
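The schedule and batch composition described above can be sketched in plain Python as follows. This is an illustration rather than the authors' training code. It assumes 0-based epoch indexing with the decay taking effect at epochs 320 and 380, and it assumes that identities with fewer than 4 images are sampled with replacement. The names `learning_rate` and `sample_batch` are hypothetical.

```python
import random


def learning_rate(epoch, base_lr=2e-4, milestones=(320, 380), gamma=0.1):
    """Step schedule: 2e-4, then 2e-5 from epoch 320, then 2e-6 from epoch 380."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def sample_batch(images_by_id, batch_size=32, images_per_id=4, rng=None):
    """Draw batch_size // images_per_id identities, then images_per_id images of each."""
    rng = rng or random.Random()
    num_ids = batch_size // images_per_id  # 32 // 4 = 8 identities per batch
    chosen_ids = rng.sample(sorted(images_by_id), num_ids)
    batch = []
    for pid in chosen_ids:
        pool = images_by_id[pid]
        if len(pool) >= images_per_id:
            picks = rng.sample(pool, images_per_id)
        else:
            # Assumption: sample with replacement when an identity has too few images.
            picks = [rng.choice(pool) for _ in range(images_per_id)]
        batch.extend((pid, img) for img in picks)
    return batch


# Toy dataset: 10 identities with 5 image names each.
toy = {pid: [f"id{pid}_img{i}" for i in range(5)] for pid in range(10)}
batch = sample_batch(toy, rng=random.Random(0))
```

Grouping several images of each identity in a batch guarantees that every anchor has positives and negatives available for the triplet term.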

In the testing phase, the input images are resized to \( 384\times {128} \) and augmented only by normalization. All dimension-reduced (256-dim) global and local features are concatenated as the final embedding vector of a pedestrian.

4.3 Comparison with State-of-the-Art Methods

More than 10 existing state-of-the-art methods are compared with our proposed HMBN on DukeMTMC-reID, CUHK03, and Market-1501 datasets in Table 2, Table 3, and Table 4, respectively. We separate these compared methods into three groups: single-branch methods (S), multi-branch methods (M), and attention-based methods regardless of the number of branches (A).

| Group | Methods | Rank-1 | mAP |
|---|---|---|---|
| S | MLFN [2] (CVPR2018) | 81.00 | 62.80 |
| S | PCB+RPP [41] (ECCV2018) | 83.30 | 69.20 |
| S | HPM [9] (AAAI2019) | 86.60 | 74.30 |
| S | OSNet [68] (ICCV2019) | 88.60 | 73.50 |
| S | BoT [26] (CVPRW2019) | 86.40 | 76.40 |
| M | PSE [32] (CVPR2018) | 79.80 | 62.00 |
| M | HA-CNN [23] (CVPR2018) | 80.50 | 63.80 |
| M | C\( A^3 \)Net [24] (ACM MM2018) | 84.60 | 70.20 |
| M | CAMA [56] (CVPR2019) | 85.80 | 72.90 |
| M | HOReID [49] (CVPR2020) | 86.90 | 75.60 |
| M | MGN [50] (ACM MM2018) | 88.70 | 78.40 |
| M | PISNet [62] (ECCV2020) | 88.80 | 78.70 |
| A | DuATM [36] (CVPR2018) | 81.82 | 64.58 |
| A | Mancs [47] (ECCV2018) | 84.90 | 71.80 |
| A | AANet-50 [44] (CVPR2019) | 86.42 | 72.56 |
| A | CASN [64] (CVPR2019) | 87.70 | 73.70 |
| | HMBN | 89.86 | 79.68 |
| | HMBN (RK) | 92.19 | 90.44 |

Table 2. Comparison of HMBN with State-of-the-Art Methods for DukeMTMC-reID Dataset

  • The top results are in bold. “RK” means re-ranking.

Table 3. Comparison of HMBN with State-of-the-Art Methods for CUHK03-Labeled and CUHK03-Detected Datasets

Table 4. Comparison of HMBN with State-of-the-art Methods for Market-1501 Dataset

\( {\bf DukeMTMC-reID.} \) Our proposed HMBN achieves the best result of a Rank-1 accuracy of 89.86% and a mAP of 79.68% on the DukeMTMC-reID dataset. We should emphasize the following. (1) The gaps between HMBN and single-branch methods (MLFN [2], PCB+RPP [41], HPM [9], OSNet [68], and BoT [26]) demonstrate the effectiveness of the multi-branch structure, for example, the HMBN surpasses the BoT by Rank-1/mAP = 3.46%/3.28%. (2) Multi-branch methods (PSE [32], HA-CNN [23], C\( A^3 \)Net [24], CAMA [56], HOReID [49], MGN [50], and PISNet [62]) integrate complementary information (e.g., pose estimation, human parsing results, attribute information) into final pedestrian representations, for example, HOReID [49] aligns local features with key-points estimation. Without injecting prior knowledge, such as attributes or poses, the HMBN exceeds the MGN and achieves the best results in this group, by 1.16% in Rank-1 accuracy and 1.28% in mAP. We argue that these methods ignore the interaction among branches in the multi-branch network. On the contrary, the HMBN builds interaction among branches in higher layers of the network by injecting the IBAM. (3) The HMBN outperforms the CASN [64], which achieves the top result in attention-based methods (DuATM [36], Mancs [47], AANet-50 [44], CASN [64]), by 2.16% in Rank-1 and 5.98% in mAP. Instead of modeling intra-branch contextual dependency in attention-based methods, our designed IBAM builds inter-branch dependency. With the help of re-ranking [66], we achieve a higher result of 92.19% in Rank-1 accuracy and 90.44% in mAP.

Figure 9 shows the top-10 ranking results with four query images on the DukeMTMC-reID dataset. Given a query image, the HMBN can retrieve the correct pedestrian under severe visual recognition problems such as view angle variations, illumination variations, and occlusion.

Fig. 9. Example results of our HMBN on the DukeMTMC-reID dataset. Given a query image, the top-10 ranking list is presented. Correct and incorrect matches are highlighted green and red, respectively.

\( {\bf CUHK03.} \) As shown in Table 3, the HMBN is the clear winner on both CUHK03-Labeled and CUHK03-Detected. It achieves the top result of Rank-1 accuracy 78.07% and mAP 75.63% on CUHK03-Labeled, and Rank-1 accuracy 75.43% and mAP 73.05% on CUHK03-Detected. The HMBN outperforms the CASN [64] by 4.37% in Rank-1 accuracy and 7.63% in mAP on CUHK03-Labeled, and by 3.93% in Rank-1 accuracy and 8.65% in mAP on CUHK03-Detected, achieving the best result among all compared methods.

\( {\bf Market-1501.} \) As illustrated in Table 4, our proposed HMBN achieves competitive results of 94.86% Rank-1 accuracy and 87.45% mAP on Market-1501. Although the MGN and PISNet outperform the HMBN by 0.84% and 0.74% in Rank-1, respectively, the HMBN clearly exceeds all existing methods in terms of mAP (87.45%).

4.4 Ablation Studies

To further verify the effectiveness of each component in the HMBN, we present an ablation analysis on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets. Parameter analysis and visualization are performed on the DukeMTMC-reID and CUHK03-Labeled datasets.

\( {\bf Multi-branch Structure.} \) The effectiveness of the uniform partition and the multi-branch structure is shown in Table 5 by comparing models 1 to 6. B + S1B denotes a model consisting of a base module and S1B; B + S1B + S2B (\( l \) = 0) denotes a multi-branch model composed of a base module, S1B, and S2B; and so forth. For the uniform partition in a single branch, the number of horizontal stripes controls the granularity of the local features. Performance improves as the number of stripes increases. However, as the number of stripes increases further, the improvement becomes marginal while the number of model parameters grows: for example, B + S2B (\( l \) = 0) outperforms B + S1B in mAP by 7.93%, but B + S3B (\( l \) = 0) outperforms B + S2B (\( l \) = 0) in mAP by only 1.98% on DukeMTMC-reID. For the multi-branch structure, a multi-branch model is better than each of its component branches: for example, B + S1B + S2B (\( l \) = 0) + S3B (\( l \) = 0) beats B + S1B, B + S2B (\( l \) = 0), B + S3B (\( l \) = 0), and B + S1B + S2B (\( l \) = 0) in Rank-1/mAP by 6.19%/13.74%, 2.92%/5.81%, 2.02%/3.83%, and 0.76%/2.62%, respectively, on DukeMTMC-reID. As the number of branches increases further, performance continues to improve, but only marginally: B + S1B + S3B (\( l \) = 0) outperforms B + S1B in mAP by 11.78%, but B + S1B + S2B (\( l \) = 0) + S3B (\( l \) = 0) outperforms B + S1B + S2B (\( l \) = 0) in mAP by only 2.62% on DukeMTMC-reID. To balance high retrieval accuracy against a low parameter count, we recommend using B + S1B + S2B (\( l \) = 0) + S3B (\( l \) = 0) as the architecture for verifying the effectiveness of HOP, IBAM, and HTP.

Table 5. Ablation Studies of HMBN on DukeMTMC-reID, CUHK03-Labeled, CUHK03-Detected Datasets

\( {\bf Effectiveness of HOP.} \) Figure 10 shows how the Rank-1 accuracy and mAP change with the parameter \( l \) in HOP, where \( l \) is the total height of the overlapped areas in one stripe from one branch. To simplify the notation, \( l \) of HOP in S2B is denoted as \( l_2 \) and \( l \) of HOP in S3B is denoted as \( l_3 \). In the model with one branch S2B (B + S2B), as illustrated in Figure 10(a), the performance rises at first as \( l_2 \) increases, indicating that HOP is essential. However, the accuracy does not keep growing with \( l_2 \). In the model with one branch S3B (B + S3B), as illustrated in Figure 10(b), the accuracy shows the same trend as \( l_3 \) increases. An overly large \( l \) helps to cover meaningful local regions between adjacent parts but weakens the learning of fine-grained cues. A proper \( l \) strikes a good balance between learning fine-grained information and extracting features in meaningful local regions. As shown in Table 5, \( (2, 0) \), \( (2, 0) \), and \( (2, 2) \) are recommended for \( (l_2, l_3) \) on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets, respectively.

Fig. 10. Parameter analysis for \( l \) in S2B and S3B. (a) Rank-1 and mAP changes with \( l_2 \) on DukeMTMC-reID. (b) Rank-1 and mAP changes with \( l_3 \) on DukeMTMC-reID. (c) Rank-1 and mAP changes with \( l_2 \) on CUHK03-Labeled. (d) Rank-1 and mAP changes with \( l_3 \) on CUHK03-Labeled.

\( {\bf Effectiveness of IBAM.} \) The comparison of models 7 and 8 in Table 5 shows the effectiveness of the IBAM. Empirically, the IBAM improves Rank-1/mAP by 0.45%/0.42%, 1.79%/1.26%, and 0.93%/1.73% on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets, respectively (model 7 vs. model 8). To verify whether the IBAM builds interaction among branches, we visualize the activations of the last convolutional feature maps from the three branches on DukeMTMC-reID and CUHK03-Labeled in Figure 11. Two observations can be made. (1) S1B performs the Re-ID task at the global level and is therefore likely to ignore detailed information. The IBAM helps S2B and S3B interact with S1B, so that S1B learns global information while taking local information into account. Red ellipses mark detailed information that is ignored by S1B in the HMBN without the IBAM but captured by S1B in the HMBN with the IBAM. (2) S2B and S3B perform the Re-ID task at the part level and therefore fail to learn consecutive local regions. With the IBAM injected, S2B and S3B attend to larger local areas. Yellow ellipses mark consecutive local regions that are ignored by S2B or S3B in the HMBN without the IBAM but captured by S2B or S3B in the HMBN with the IBAM.

Fig. 11. Visualization results of activations from three branches in the HMBN. The top two and bottom two input images come from DukeMTMC-reID and CUHK03-Labeled, respectively. For each input image, the activations in the first and second rows are produced by the HMBN without and with the IBAM, respectively.

\( {\bf Effectiveness of Harder Triplet Loss.} \) The HMBN is trained with classification loss and HTP. Traditional triplet loss can be seen as a special case of HTP with \( \alpha =0 \). As shown in Table 5, the HTP outperforms traditional triplet loss in Rank-1/mAP by 0.41%/0.04%, 1.78%/2.08%, and 3.14%/2.26% on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets, respectively (model 8 vs. model 9). Parameter analysis for \( \alpha \) in HTP is illustrated in Figure 12. Parameter \( \alpha \) controls the scale factor of the anchor-to-positive distance and the anchor-to-negative distance. We can see that the performance of the HMBN is sensitive to \( \alpha \), and the HMBN achieves a higher retrieval accuracy when \( \alpha =0.01 \).
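For reference, the \( \alpha =0 \) baseline mentioned above can be sketched as the standard margin-based triplet loss on a single triplet. The \( \alpha \)-dependent scaling that defines HTP is given earlier in the article and is deliberately not reproduced here. The Euclidean distance, the function names, and the toy vectors are assumptions made for illustration; the default margin of 1.2 is the value stated in the implementation details.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.2):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_ap = euclidean(anchor, positive)  # anchor-to-positive distance
    d_an = euclidean(anchor, negative)  # anchor-to-negative distance
    return max(0.0, d_ap - d_an + margin)


anchor = [0.0, 0.0]
# Hard triplet: the positive is far away (distance 5), the negative is close (distance 1).
hard = triplet_loss(anchor, [3.0, 4.0], [0.0, 1.0])
# Easy triplet: the positive is close (0.5), the negative is far (10), so the loss is zero.
easy = triplet_loss(anchor, [0.0, 0.5], [6.0, 8.0])
```

Only triplets that violate the margin contribute to the gradient, which is why how triplets are selected and weighted matters for training.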

Fig. 12. Parameter analysis for \( \alpha \) in the HMBN on the DukeMTMC-reID and CUHK03-Labeled datasets.

5 CONCLUSIONS

In this article, we focus on widely used multi-branch methods with different stripes and propose a harmonious multi-branch network for Re-ID with HTP. Unlike previous methods that design more branches or more stripes for extracting coarse-to-fine pyramid representations, we analyze how to improve feature learning in a single branch and how to build interaction among different branches. For feature learning in a single branch, we design the HOP to enhance representational capability in meaningful local regions while extracting fine-grained information. For the interaction among different branches, we incorporate the IBAM to refine the representation within a single branch by integrating information from other branches. In addition, we analyze the deficiencies of the commonly applied triplet loss and propose a generalized triplet loss, namely, HTP. Our HTP optimizes triplets in a harder manner, further reducing intra-class variations and enlarging inter-class variations. Each component is verified thoroughly in extensive ablation experiments. In addition, the HMBN achieves superior performance compared with state-of-the-art Re-ID methods. In the future, we will explore our idea of harmonious multi-branch learning in more computer vision tasks, such as image retrieval.

REFERENCES

  1. [1] Cao Jiajiong, Li Yingming, and Zhang Zhongfei. 2018. Partially shared multi-task convolutional neural network with local constraint for face attribute learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 4290–4299.
  2. [2] Chang Xiaobin, Hospedales Timothy M., and Xiang Tao. 2018. Multi-level factorisation net for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 2109–2118.
  3. [3] Chen Guangyi, Lu Jiwen, Yang Ming, and Zhou Jie. 2019. Spatial-temporal attention-aware learning for video-based person re-identification. IEEE Transactions on Image Processing 28, 9 (2019), 4192–4205.
  4. [4] Chen Yipeng, Zhao Cairong, and Sun Tianli. 2019. Single image based metric learning via overlapping blocks model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. IEEE, Long Beach, USA, 0–0.
  5. [5] Cheng De, Gong Yihong, Zhou Sanping, Wang Jinjun, and Zheng Nanning. 2016. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Las Vegas, USA, 1335–1344.
  6. [6] Cornia Marcella, Baraldi Lorenzo, and Cucchiara Rita. 2019. Show, control and tell: A framework for generating controllable and grounded captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 8307–8316.
  7. [7] Deng Jia, Dong Wei, Socher Richard, Li Li-Jia, Li Kai, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Miami, USA, 248–255.
  8. [8] Felzenszwalb Pedro, McAllester David, and Ramanan Deva. 2008. A discriminatively trained, multiscale, deformable part model. In 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Anchorage, USA, 1–8.
  9. [9] Fu Yang, Wei Yunchao, Zhou Yuqian, Shi Honghui, Huang Gao, Wang Xinchao, Yao Zhiqiang, and Huang Thomas. 2019. Horizontal pyramid matching for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. AAAI Press, Honolulu, USA, 8295–8302.
  10. [10] Fu Zhihang, Chen Yaowu, Yong Hongwei, Jiang Rongxin, Zhang Lei, and Hua Xian-Sheng. 2019. Foreground gating and background refining network for surveillance object detection. IEEE Transactions on Image Processing 28, 12 (2019), 6077–6090.
  11. [11] He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Las Vegas, USA, 770–778.
  12. [12] Hermans Alexander, Beyer Lucas, and Leibe Bastian. 2017. In defense of the triplet loss for person re-identification. arXiv:1703.07737
  13. [13] Hochreiter Sepp and Schmidhuber Jürgen. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
  14. [14] Hou Ruibing, Ma Bingpeng, Chang Hong, Gu Xinqian, Shan Shiguang, and Chen Xilin. 2019. Interaction-and-aggregation network for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 9317–9326.
  15. [15] Hou Ruibing, Ma Bingpeng, Chang Hong, Gu Xinqian, Shan Shiguang, and Chen Xilin. 2019. VRSTC: Occlusion-free video person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 7183–7192.
  16. [16] Hu Jie, Shen Li, and Sun Gang. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 7132–7141.
  17. [17] Huang Gao, Liu Zhuang, Maaten Laurens Van Der, and Weinberger Kilian Q. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Honolulu, USA, 4700–4708.
  18. [18] Kalayeh Mahdi M., Basaran Emrah, Gökmen Muhittin, Kamasak Mustafa E., and Shah Mubarak. 2018. Human semantic parsing for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 1062–1071.
  19. [19] Krizhevsky Alex, Sutskever Ilya, and Hinton Geoffrey E. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 25. MIT Press, Stateline, USA, 1097–1105.
  20. [20] Li Shuang, Bak Slawomir, Carr Peter, and Wang Xiaogang. 2018. Diversity regularized spatiotemporal attention for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 369–378.
  21. [21] Li Wei, Zhao Rui, Xiao Tong, and Wang Xiaogang. 2014. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Columbus, USA, 152–159.
  22. [22] Li Wei, Zhu Xiatian, and Gong Shaogang. 2017. Person re-identification by deep joint learning of multi-loss classification. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, San Francisco, USA, 2194–2200.
  23. [23] Li Wei, Zhu Xiatian, and Gong Shaogang. 2018. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 2285–2294.
  24. [24] Liu Jiawei, Zha Zheng-Jun, Xie Hongtao, Xiong Zhiwei, and Zhang Yongdong. 2018. C\( A^3 \)Net: Contextual-attentional attribute-appearance network for person re-identification. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM, Seoul, South Korea, 737–745.
  25. [25] Long Xiang, Gan Chuang, Melo Gerard De, Wu Jiajun, Liu Xiao, and Wen Shilei. 2018. Attention clusters: Purely attention based local feature integration for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Munich, Germany, 7834–7843.
  26. [26] Luo Hao, Gu Youzhi, Liao Xingyu, Lai Shenqi, and Jiang Wei. 2019. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. IEEE, Virtual Seattle, USA, 0–0.
  27. [27] Ma Shuang, Fu Jianlong, Chen Chang Wen, and Mei Tao. 2018. Da-gan: Instance-level image translation by deep attention generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 5657–5666.
  28. [28] Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, Alban Desmaison, Andreas Kopf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv:1912.01703
  29. [29] Peng Yuxin, Zhao Yunzhen, and Zhang Junchao. 2018. Two-stream collaborative learning with spatial-temporal attention for video classification. IEEE Transactions on Circuits and Systems for Video Technology 29, 3 (2018), 773–786.
  30. [30] Qian Xuelin, Fu Yanwei, Xiang Tao, Wang Wenxuan, Qiu Jie, Wu Yang, Jiang Yu-Gang, and Xue Xiangyang. 2018. Pose-normalized image generation for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, Munich, Germany, 650–667.
  31. [31] Ristani Ergys, Solera Francesco, Zou Roger, Cucchiara Rita, and Tomasi Carlo. 2016. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision. Springer, Amsterdam, the Netherlands, 17–35.
  32. [32] Sarfraz M. Saquib, Schumann Arne, Eberle Andreas, and Stiefelhagen Rainer. 2018. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 420–429.
  33. [33] Schroff Florian, Kalenichenko Dmitry, and Philbin James. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Boston, USA, 815–823.
  34. [34] Shen Yantao, Xiao Tong, Li Hongsheng, Yi Shuai, and Wang Xiaogang. 2018. End-to-end deep Kronecker-product matching for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 6886–6895.
  35. [35] Shi Hailin, Yang Yang, Zhu Xiangyu, Liao Shengcai, Lei Zhen, Zheng Weishi, and Li Stan Z. 2016. Embedding deep metric for person re-identification: A study against large variations. In European Conference on Computer Vision. Springer, Amsterdam, the Netherlands, 732–748.
  36. [36] Si Jianlou, Zhang Honggang, Li Chun-Guang, Kuen Jason, Kong Xiangfei, Kot Alex C., and Wang Gang. 2018. Dual attention matching network for context-aware feature sequence based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 5363–5372.
  37. [37] Simonyan Karen and Zisserman Andrew. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
  38. [38] Song Chunfeng, Huang Yan, Ouyang Wanli, and Wang Liang. 2018. Mask-guided contrastive attention model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 1179–1188.
  39. [39] Su Chi, Li Jianing, Zhang Shiliang, Xing Junliang, Gao Wen, and Tian Qi. 2017. Pose-driven deep convolutional model for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Venice, Italy, 3960–3969.
  40. [40] Suh Yumin, Wang Jingdong, Tang Siyu, Mei Tao, and Lee Kyoung Mu. 2018. Part-aligned bilinear representations for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer, Munich, Germany, 402–419.
  41. [41] Sun Yifan, Zheng Liang, Yang Yi, Tian Qi, and Wang Shengjin. 2018. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer, Munich, Germany, 480–496.
  42. [42] Szegedy Christian, Liu Wei, Jia Yangqing, Sermanet Pierre, Reed Scott, Anguelov Dragomir, Erhan Dumitru, Vanhoucke Vincent, and Rabinovich Andrew. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Boston, USA, 1–9.
  43. [43] Tang Zengming and Huang Jun. 2020. Branch interaction network for person re-identification. In Proceedings of the Asian Conference on Computer Vision (ACCV’20). Springer, Virtual Kyoto, Japan, 322–337.
  44. [44] Tay Chiat-Pin, Roy Sharmili, and Yap Kim-Hui. 2019. AANet: Attribute attention network for person re-identifications. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 7134–7143.
  45. [45] Varior Rahul Rama, Shuai Bing, Lu Jiwen, Xu Dong, and Wang Gang. 2016. A Siamese long short-term memory architecture for human re-identification. In European Conference on Computer Vision. Springer, Amsterdam, the Netherlands, 135–153.
  46. [46] Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Łukasz, and Polosukhin Illia. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. MIT Press, Long Beach, USA, 5998–6008.
  47. [47] Wang Cheng, Zhang Qian, Huang Chang, Liu Wenyu, and Wang Xinggang. 2018. Mancs: A multi-task attentional network with curriculum sampling for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer, Munich, Germany, 365–381.
  48. [48] Wang Fei, Jiang Mengqing, Qian Chen, Yang Shuo, Li Cheng, Zhang Honggang, Wang Xiaogang, and Tang Xiaoou. 2017. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Honolulu, USA, 3156–3164.
  49. [49] Wang Guan’an, Yang Shuo, Liu Huanyu, Wang Zhicheng, Yang Yang, Wang Shuliang, Yu Gang, Zhou Erjin, and Sun Jian. 2020. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Virtual Nashville, USA, 6449–6458.
  50. [50] Wang Guanshuo, Yuan Yufeng, Chen Xiong, Li Jiwei, and Zhou Xi. 2018. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia. ACM, Seoul, South Korea, 274–282.
  51. [51] Wang Xiaolong, Girshick Ross, Gupta Abhinav, and He Kaiming. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 7794–7803.
  52. [52] Wang Yan, Wang Lequn, You Yurong, Zou Xu, Chen Vincent, Li Serena, Huang Gao, Hariharan Bharath, and Weinberger Kilian Q. 2018. Resource aware person re-identification across multiple resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 8042–8051.
  53. [53] Woo Sanghyun, Park Jongchan, Lee Joon-Young, and Kweon In So. 2018. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer, Munich, Germany, 3–19.
  54. [54] Xu Jing, Zhao Rui, Zhu Feng, Wang Huaming, and Ouyang Wanli. 2018. Attention-aware compositional network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 2119–2128.
  55. [55] Xu Kelvin, Ba Jimmy, Kiros Ryan, Cho Kyunghyun, Courville Aaron, Salakhudinov Ruslan, Zemel Rich, and Bengio Yoshua. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. PMLR, Lille, France, 2048–2057.
  56. [56] Yang Wenjie, Huang Houjing, Zhang Zhang, Chen Xiaotang, Huang Kaiqi, and Zhang Shu. 2019. Towards rich feature discovery with class activation maps augmentation for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 1389–1398.
  57. [57] Yao Hantao, Zhang Shiliang, Zhang Dongming, Zhang Yongdong, Li Jintao, Wang Yu, and Tian Qi. 2017. Large-scale person re-identification as retrieval. In IEEE International Conference on Multimedia and Expo (ICME’17). IEEE, Hong Kong, China, 1440–1445.
  58. [58] Yi Dong, Lei Zhen, Liao Shengcai, and Li Stan Z. 2014. Deep metric learning for person re-identification. In 22nd International Conference on Pattern Recognition. IEEE, Stockholm, Sweden, 34–39.
  59. [59] Zhang Han, Goodfellow Ian, Metaxas Dimitris, and Odena Augustus. 2019. Self-attention generative adversarial networks. In International Conference on Machine Learning. PMLR, Long Beach, USA, 7354–7363.
  60. [60] Zhang Hang, Wu Chongruo, Zhang Zhongyue, Zhu Yi, Lin Haibin, Zhang Zhi, Sun Yue, He Tong, Mueller Jonas, Manmatha R., Li Mu, and Smola Alexander. 2020. ResNeSt: Split-attention networks. arXiv:2004.08955
  61. [61] Zhang Shun, Huang Jia-Bin, Lim Jongwoo, Gong Yihong, Wang Jinjun, Ahuja Narendra, and Yang Ming-Hsuan. 2020. Tracking persons-of-interest via unsupervised representation adaptation. International Journal of Computer Vision 128, 1 (2020), 96–120.
  62. [62] Zhao Shizhen, Gao Changxin, Zhang Jun, Cheng Hao, Han Chuchu, Jiang Xinyang, Guo Xiaowei, Zheng Wei-Shi, Sang Nong, and Sun Xing. 2020. Do not disturb me: Person re-identification under the interference of other pedestrians. In European Conference on Computer Vision. Springer, Virtual Glasgow, UK, 647663.Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. [63] Zheng Liang, Shen Liyue, Tian Lu, Wang Shengjin, Wang Jingdong, and Tian Qi. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Santiago, Chile, 11161124.Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. [64] Zheng Meng, Karanam Srikrishna, Wu Ziyan, and Radke Richard J.. 2019. Re-identification with consistent attentive Siamese networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 57355744.Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. [65] Zheng Zhedong, Zheng Liang, and Yang Yi. 2017. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Venice, Italy, 37543762.Google ScholarGoogle ScholarCross RefCross Ref
  66. [66] Zhong Zhun, Zheng Liang, Cao Donglin, and Li Shaozi. 2017. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Honolulu, USA, 13181327.Google ScholarGoogle ScholarCross RefCross Ref
  67. [67] Zhong Zhun, Zheng Liang, Kang Guoliang, Li Shaozi, and Yang Yi. 2020. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI Press, New York, USA, 1300113008.Google ScholarGoogle ScholarCross RefCross Ref
  68. [68] Zhou Kaiyang, Yang Yongxin, Cavallaro Andrea, and Xiang Tao. 2019. Omni-scale feature learning for person re-identification. arXiv:1905.00953Google ScholarGoogle Scholar
  69. [69] Zhou Sanping, Wang Fei, Huang Zeyi, and Wang Jinjun. 2019. Discriminative feature learning with consistent attention regularization for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, Long Beach, USA, 80408049.Google ScholarGoogle ScholarCross RefCross Ref
  70. [70] Zhou Sanping, Wang Jinjun, Wang Jiayun, Gong Yihong, and Zheng Nanning. 2017. Point to set similarity based deep feature learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Honolulu, USA, 37413750.Google ScholarGoogle ScholarCross RefCross Ref


        • Published in

          ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 4
          November 2022
          497 pages
          ISSN: 1551-6857
          EISSN: 1551-6865
          DOI: 10.1145/3514185
          • Editor: Abdulmotaleb El Saddik


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 4 March 2022
          • Accepted: 1 November 2021
          • Received: 1 June 2021

          Qualifiers

          • research-article
          • Refereed
