Learning a Robust Model with Pseudo Boundaries for Noisy Temporal Action Localization

Temporal Action Localization (TAL) aims to locate the starting and ending times of actions and to recognize their categories in untrimmed videos. Significant progress has been made in developing deep models for TAL. The success of previous methods relies on large-scale training data with precise boundary annotations. However, fully accurate annotations are impractical to obtain because of the ambiguity of action boundaries and the crowd-sourced labeling process, which leads to a degradation in performance. In this work, we take the first step toward learning with inaccurate boundaries in TAL tasks. Motivated by the fact that inaccurate boundary annotations harm localization precision more than classification accuracy, we propose to use classification as a guidance signal to improve localization precision. Specifically, we introduce a pseudo-boundary generation and refinement method (PbGaR). PbGaR first treats each action segment as a bag of instances and selects the instances with more accurate boundaries for training. These boundaries are then refined via two strategies for higher quality. The proposed method significantly alleviates the performance degradation of TAL models under inaccurate boundaries. Extensive experiments on two popular datasets demonstrate the effectiveness of our method.


INTRODUCTION
Owing to its wide applications in surveillance, video retrieval [5] and video anomaly detection [23], Temporal Action Localization (TAL) has drawn much attention in the computer vision community. Remarkable progress has been made under the fully-supervised setting in recent years. Under this setting, the success of previous methods relies on large-scale video datasets such as ActivityNet-1.3 [1] and THUMOS14 [9] with precise boundary annotations. However, fully accurate annotations are impractical to obtain in professional fields, which limits scalability and practicability in real-world scenarios. In TAL, noisy annotations refer to inaccurate categories and inaccurate boundaries, and inaccurate boundaries are the more common of the two. In domains such as sports competitions, the start and the end of an action are strictly defined and difficult to label. One can recognize most action categories from key frames alone, but must browse through all frames to obtain the accurate boundaries of an action. Many annotators lack the required expertise, leading to inaccurate action boundary annotations. Besides, as video data keep growing, many datasets are annotated by crowd-sourcing or by volunteers within limited budgets and resources [30]. This results in low-quality annotations, which affect the training of existing models and ultimately degrade their performance. In view of these phenomena, handling inaccurate annotations, especially inaccurate boundaries, is a critical and pressing task.
There are two main types of frameworks in temporal action localization, i.e. two-stage methods and one-stage methods. Two-stage methods first generate candidate proposals, then recognize the categories of the proposals and further refine the predictions. Recently, one-stage methods have become the mainstream owing to their simplicity and efficiency. Such methods classify and localize actions simultaneously. Despite this progress in TAL, the quality of supervised learning models depends on the quality of the training datasets [15]. Inaccurate annotations can directly mislead models into learning or memorizing wrong relations, which limits their abilities and deteriorates their performance. To our knowledge, no prior work has considered the impact of noisy annotations on TAL model performance. In this work, we take a first step toward temporal action localization with noisy annotations. Considering the delicate requirements for boundary detection in TAL and the reality that boundary labeling is more error-prone, we focus on tackling inaccurate boundary annotations.
Motivated by weakly-supervised TAL [7,27] and object detection [14], we propose a method to mitigate the performance degradation of TAL models under low-quality boundary annotations. Inspired by the fact that classification precision suffers only slightly from inaccurately annotated boundaries compared with localization, we propose leveraging classification as a guidance signal for localization based on multiple instance learning (MIL) [3]. Specifically, every labeled action is treated as a bag of instances (i.e. a bag of action proposals associated with the same action segment). Our target is to select the most accurate instance from each bag to generate pseudo boundaries, which then replace the original inaccurate boundary annotations for training. Our method, called PbGaR, consists of two parts: a pseudo-boundary generation module and a pseudo-boundary refinement module. The former generates instances with more accurate pseudo boundaries based on MIL, and the latter further enhances the quality of the pseudo boundaries. The proposed method improves the robustness of existing TAL models when dealing with inaccurate boundary annotations. We construct noisy datasets based on two benchmarks, THUMOS14 and ActivityNet-1.3, and conduct experiments on them. Extensive experimental results demonstrate the effectiveness of the proposed method.
Our main contributions can be summarized as follows:
• This paper proposes a novel framework for Temporal Action Localization with inaccurate boundary annotations. To the best of our knowledge, this is the first attempt to deal with this setting.
• By carefully generating and refining more accurate pseudo boundaries for training, the proposed method considerably improves performance under different degrees of noise, thus boosting the robustness of existing TAL models.
• Extensive experiments on the public benchmarks ActivityNet-1.3 and THUMOS14 show that the proposed method achieves remarkable improvement.

RELATED WORK

Fully-Supervised TAL
Fully-supervised Temporal Action Localization is the setting in which the temporal boundaries and categories of action instances are available for training. There are mainly two kinds of frameworks in fully-supervised TAL, i.e. one-stage methods and two-stage methods. Two-stage methods first generate candidate proposals, then recognize the categories of the proposals and further refine the predictions. One-stage methods have recently become the mainstream for their simplicity and efficiency. ActionFormer [31], one of the representative methods, does not use proposals: it classifies every moment into action categories and simultaneously regresses the corresponding boundaries. Additionally, it introduced a Transformer-based [26] network to extract multiscale features, which significantly boosted its performance. TriDet [19] improved on the Transformer structure and proposed to model the relative probability distribution of boundaries, thus going a step further in localization accuracy. The above approaches assume that all of the training data in untrimmed videos are accurate and clean, which impedes their application to real scenarios. Our paper takes a more rigorous perspective and focuses on how TAL models can perform their best when faced with inaccurate annotations, especially inaccurate boundaries.

Weakly-Supervised TAL
Weakly-supervised Temporal Action Localization is a more resource-efficient setting that has become popular recently. During training, only video-level classification labels are available. UntrimmedNet [27] first introduced Multiple Instance Learning (MIL) [3] to this task. MIL assumes that all instances (i.e. frames in an untrimmed video) belong to a bag that is either positive or negative. In other words, treating a video as a bag of frames, an MIL-based method assigns the video-level labels to a set of instances (frames). Subsequently, many derivative works [7,10,11,17] followed the MIL-based framework and advanced the development of weakly-supervised TAL. The most recent method [17] replaced the segment-based MIL framework with a proposal-based one to tackle the inconsistent objectives between the training and testing stages. Notably, our work differs from weakly-supervised TAL in that we focus on training TAL models with frame-level annotations rather than with only video-level classification annotations. Although we also formulate TAL as an MIL problem, we regard each action in the video, rather than the whole video, as a bag. Besides, our bags can be constructed dynamically to better correct noisy boundaries.

Learning with noisy data
Work in the image domain, especially image classification and object detection, is closely related to video-understanding tasks. There has been a series of studies [8,12,15,28,30,33] on noisy data in image tasks. Some methods [18,21] design re-weighting strategies that adaptively assign different weights to noisy samples and clean samples. Another major line of work for minimizing the impact of corrupted labels is loss correction. Common methods along this line use a confusion matrix [25], design extra inference steps to correct corrupted labels [6,16,22], or replace hard labels with soft labels for unclear boundaries [4,8,29]. In addition to the two main directions mentioned above, Liu et al. [14] proposed to correct inaccurate annotations to help object detectors in an MIL-based framework. Most existing image tasks focus on noisy classification labels, but in the video domain, owing to the complexity and diversity of action instances, misclassification is considerably less common than mislocalization [15], and annotators are prone to producing inaccurate boundaries. In this paper, we are the first to address TAL with inaccurate boundaries. The uniqueness of localization in untrimmed videos makes this task more challenging and valuable.

METHOD
An untrimmed video $V$ can be represented by a set of features $X = \{x_1, x_2, \ldots, x_T\}$, where $T$ denotes the number of instances. Fully-supervised TAL consists of two sub-tasks, i.e. classification and localization. PbGaR consists of a pseudo-boundary generation module and a pseudo-boundary refinement module. The former utilizes classification as a guide to generate instances whose pseudo boundaries are more accurate than the noisy ground-truth ones, based on Multiple Instance Learning [3]. The latter further improves the quality of the pseudo boundaries by refining and extending action instances.

MIL-based Pseudo-boundary Generation
Preliminaries. A typical MIL-based method [17,27] in weakly-supervised TAL treats each video as a bag of instances (frames) and performs feature extraction on it. The extracted features are then used to calculate confidence scores that determine whether frames belong to an action or to the background. Formally, for an untrimmed video containing multiple action categories, video-level action labels $y \in \{0, 1\}^C$ are given, where $C$ is the number of action categories. To associate action categories with specific moments, each video is represented as a bag of instances. MIL uses the classification loss as the signal for choosing suitable instances.
Problem Formulation. In our method, unlike MIL in weakly-supervised TAL, we treat each segment (action or background) in the video as a bag rather than the entire video. A bag is labeled negative only if all instances in it are negative. Put differently, a bag is labeled positive once any of its instances is positive. After the most positive instance $a_i^*$ has been selected from a positive bag, the second step is to generate the final annotation $a_i$ by considering the initial action annotation $a_i^0$ as a complement. The final new action annotation $a_i$ is generated by Eq. 2 as a weighted combination of $a_i^*$ and $a_i^0$. Here $\varphi$ is a mapping function that adaptively assigns weights to $a_i^*$ and $a_i^0$. Since our goal is to generate pseudo boundaries of the highest possible quality, $\varphi(\cdot)$ needs to satisfy two conditions. First, because its input indicates the confidence of an instance, a larger confidence score should assign a correspondingly higher weight to $a_i^*$. Second, when the confidence score is very close to 1, $\varphi(\cdot)$ needs to balance the weight between $a_i^*$ and $a_i^0$ rather than sharply favoring $a_i^*$. Thus, a bounded exponential function (Eq. 3) is adopted to fulfill the two conditions above. We adopt a standard hinge loss (Eq. 4) to train the MIL-based pseudo-boundary generation module, where $Y_i \in \{1, -1\}$ is a label attached to each bag $B_i$ to indicate whether the bag contains any positive instance.
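The generation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact forms of Eq. 2 to Eq. 4 are not reproduced here, so the sketch assumes that the mapping function is `min(score ** alpha, beta)`, that the bag score is the maximum instance score, and that the defaults 0.5 and 0.75 play the roles of exponent and cap. The names `bounded_weight`, `fuse_annotation`, `bag_hinge_loss`, `alpha` and `beta` are hypothetical.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def bounded_weight(score: float, alpha: float = 0.5, beta: float = 0.75) -> float:
    """Assumed bounded exponential: grows with the score, capped at beta."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return min(score ** alpha, beta)


def fuse_annotation(selected: Segment, initial: Segment, score: float,
                    alpha: float = 0.5, beta: float = 0.75) -> Segment:
    """Blend the selected instance with the initial (noisy) annotation."""
    w = bounded_weight(score, alpha, beta)
    start = w * selected[0] + (1.0 - w) * initial[0]
    end = w * selected[1] + (1.0 - w) * initial[1]
    return start, end


def bag_hinge_loss(instance_scores: Sequence[float], bag_label: int) -> float:
    """Standard hinge loss on the bag score; bag_label is +1 or -1.

    Taking the bag score as the maximum instance score is an assumption.
    """
    bag_score = max(instance_scores)
    return max(0.0, 1.0 - bag_label * bag_score)
```

Under these assumptions the cap keeps part of the weight on the initial annotation even for very confident instances, which matches the second condition stated for the mapping function.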

Multi-step Pseudo-boundary Refinement
The action instances generated by the pseudo-boundary generation module serve as new annotations for training detectors. However, they are roughly generated from the original inaccurate ground-truth annotations and the predictions of the detector, so their quality cannot be guaranteed. Instances in the same bag have similar properties, i.e. their classification features and temporal locations are closely related to each other. Besides, our new pseudo boundary is a tradeoff between the initial annotation and the instance selected in the bag by Eq. 2. Accordingly, in this section we propose a multi-step refinement module that progressively enhances the quality of the pseudo boundaries via two strategies.
Since we construct the initial bag from the original ground truth and generate pseudo boundaries, a natural idea is to keep constructing new positive bags from these generated pseudo boundaries and to repeat the construction until a termination condition is reached. This inspires our first strategy, i.e. the bag reconstruction strategy. To achieve more efficient reconstruction, we improve the quality of the candidate instances in bags. As illustrated in Figure 2, for the $j$-th instance $b_i^j = (s_j, e_j, c_j)$ in bag $B_i$, features are sampled over the interval $\{s_j, e_j\}$ via interpolation and aggregated by a fully-connected layer. The boundaries of these instances are then refined and calibrated based on the features. Then we perform bag reconstruction. As shown in Figure 4, new bags are constructed iteratively. After $K$ iterations of construction, bag $B_i$ has a construction sequence $\{B_i^0, B_i^1, \ldots, B_i^K\}$. Note that negative bags are not involved in this strategy. Consequently, $B_i^k$ is used to optimize the generation module, and the loss in Eq. 4 is further extended to Eq. 5, $\mathcal{L}_g(\{B_i^k, Y_i\})$, where $k \in \{0, 1, \ldots, K\}$ only if bag $B_i$ is positive.
The second strategy, called memory bank, improves the pseudo-boundary quality in the generation process described in Eq. 2. After bag reconstruction, the pseudo-boundary generation module generates annotations with more accurate boundaries for each bag, and these annotations are reused across training epochs. Therefore, Eq. 2 evolves into Eq. 6.
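As a rough illustration of the memory bank strategy, which averages the annotation generated in the previous epoch with the initial annotation, the sketch below stores one annotation per bag and returns a blended prior for the next epoch. It is written under stated assumptions rather than taken from the paper: the class `MemoryBank`, the weight name `gamma`, and the choice to put the weight 0.2 on the previous-epoch annotation are all hypothetical.

```python
from typing import Dict, Hashable, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


class MemoryBank:
    """Keeps the latest pseudo annotation for each positive bag."""

    def __init__(self, gamma: float = 0.2) -> None:
        # gamma is an assumed weight on the previous-epoch annotation.
        self.gamma = gamma
        self._store: Dict[Hashable, Segment] = {}

    def prior(self, bag_id: Hashable, initial: Segment) -> Segment:
        """Weighted average of the stored annotation and the initial one."""
        previous = self._store.get(bag_id)
        if previous is None:
            # Nothing stored yet: fall back to the initial annotation.
            return initial
        g = self.gamma
        return (g * previous[0] + (1.0 - g) * initial[0],
                g * previous[1] + (1.0 - g) * initial[1])

    def update(self, bag_id: Hashable, annotation: Segment) -> None:
        """Store the annotation generated in the current epoch."""
        self._store[bag_id] = annotation
```

In a training loop, `prior` would be called before generating the pseudo boundary for a bag and `update` afterwards, so that each epoch starts from a localization prior informed by the previous one.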

PbGaR Training
Training. Our method focuses on improving the performance of TAL detectors trained on inaccurate data, and it is not limited to specific TAL detectors. In the training stage, as a bootstrap for bag construction, we first train the base detector (e.g. ActionFormer [31]) for a fixed number of epochs. The detector outputs action category probabilities and boundary proposals. Based on this output, instances are obtained to construct our initial bags. We adopt an IoU threshold to decide which instances are positive. After that, the most positive instances are selected to generate pseudo boundaries via Eq. 6. Then we apply the pseudo-boundary refinement module to obtain refined candidate instances with a more accurate estimate of the action location. The same process can be repeated with the memory bank for multiple steps until the quality of the instances converges, i.e. bag reconstruction. The total training loss of our method is given in Eq. 7. Specifically, the loss function has three terms. $\mathcal{L}_{\text{cls}}$ is for instance classification, and we adopt the focal loss [13] to train it. $\mathcal{L}_{\text{reg}}$ is a DIoU loss [32] for boundary regression. $\mathbb{1}(B_i)$ is an indicator function that denotes whether a bag $B_i$ is positive. $\mathcal{L}_g$ is for training the MIL-based pseudo-boundary generation process and is given in Eq. 5. $\lambda_{\text{reg}}$ and $\lambda_g$ are balance coefficients.
Table 1 fragment (Clean Model; the same three values are repeated for each of the four noise levels): ActionFormer [31] 70.9, 43.9, 66.8; TriDet [19] 72.7, 46.5, 68.5.
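The bag construction and selection steps of the training procedure might look like the sketch below. It assumes that detector proposals are given as `(start, end, score)` triples and that the IoU threshold is 0.5; the threshold value and the names `temporal_iou`, `build_positive_bag` and `select_most_positive` are illustrative assumptions, not values or identifiers from the paper.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]          # (start, end)
Proposal = Tuple[float, float, float]  # (start, end, class score)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1D temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def build_positive_bag(annotation: Segment, proposals: List[Proposal],
                       iou_threshold: float = 0.5) -> List[Proposal]:
    """Collect proposals that overlap the (possibly noisy) annotation enough."""
    return [p for p in proposals
            if temporal_iou(annotation, (p[0], p[1])) >= iou_threshold]


def select_most_positive(bag: List[Proposal]) -> Optional[Proposal]:
    """Pick the instance with the highest classification score, if any."""
    return max(bag, key=lambda p: p[2]) if bag else None
```

Selecting by classification score reflects the idea of using classification as the guidance signal for localization; a far-away proposal with a high score is excluded by the overlap test before selection.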

EXPERIMENTS

Settings
Datasets. Modern temporal action localization datasets are carefully annotated and contain few inaccurate boundary annotations. Therefore, to evaluate the performance of our proposed PbGaR method, we simulate noisy boundaries by perturbing the clean ones on two commonly used datasets, THUMOS14 [9] and ActivityNet-1.3 [1]. THUMOS14 comprises 412 videos, with 200 for training and 212 for validation, covering 20 action categories. ActivityNet-1.3 contains 20,000 videos covering 200 action categories. It is divided into three subsets: 50% for training, 25% for validation and the rest for testing.
Following [14], we simulate noisy action boundaries by perturbing clean ones. Specifically, let $(c, d)$ denote the center $c$ and duration $d$ of an action. We randomly shift and scale an action boundary as in Eq. 8, where $\Delta_c$ and $\Delta_d$ follow the uniform distribution $U(-n, n)$ and $n$ refers to the boundary noise level. We simulate boundary noise levels varying from 10% to 40% and apply Eq. 8 to every action boundary in the training data.
Implementation Details. Our method PbGaR is implemented on ActionFormer [31] and TriDet [19], two recent state-of-the-art TAL models. A pre-trained I3D [2] is used as the backbone. PbGaR is applied after 5 training epochs. We empirically set the two hyper-parameters of Eq. 3 to 0.5 and 0.75, respectively. The number of bag reconstruction iterations in the refinement module is set to 2. The loss weights $\lambda_{\text{reg}}$ and $\lambda_g$ are selected from {0.01, 0.1, 1} depending on the dataset and noise level. The memory bank is activated after 11 epochs, and the weighting hyper-parameter in Eq. 6 is set to 0.2. The remaining settings are kept unchanged.
Evaluation Metrics. We evaluate our method using the standard TAL metric, i.e. the mean Average Precision (mAP) at different temporal intersection over union (tIoU) thresholds, for all datasets. mAP measures the average precision across all action categories for a given tIoU threshold. We also report the average mAP over several tIoU thresholds.
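The boundary perturbation can be sketched as below. The exact form of Eq. 8 is assumed here: the center is shifted by a fraction of the duration and the duration is scaled by one plus a second random fraction, both drawn uniformly from the interval given by the noise level. The function name `perturb_segment` and its parameters are hypothetical, and clipping to the video extent is deliberately left out.

```python
import random
from typing import Optional, Tuple


def perturb_segment(start: float, end: float, noise_level: float,
                    rng: Optional[random.Random] = None) -> Tuple[float, float]:
    """Randomly shift and scale one action segment.

    noise_level is a fraction, e.g. 0.4 for 40% boundary noise.
    """
    rng = rng if rng is not None else random.Random()
    center = (start + end) / 2.0
    duration = end - start
    delta_c = rng.uniform(-noise_level, noise_level)  # relative center shift
    delta_d = rng.uniform(-noise_level, noise_level)  # relative duration change
    new_center = center + delta_c * duration
    new_duration = duration * (1.0 + delta_d)
    return new_center - new_duration / 2.0, new_center + new_duration / 2.0
```

For example, `perturb_segment(10.0, 20.0, 0.4, random.Random(0))` yields a segment whose duration stays within 6 to 14 seconds and whose center stays within 4 seconds of the original center.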
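To make the metric concrete, the following simplified sketch computes a non-interpolated average precision for one class at one tIoU threshold and then averages over classes and thresholds. Official benchmark toolkits differ in details such as interpolation and matching rules, so this is only an illustration; all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Segment = Tuple[float, float]            # (start, end)
Prediction = Tuple[float, float, float]  # (start, end, confidence)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1D temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds: List[Prediction], gts: List[Segment],
                      tiou_threshold: float) -> float:
    """Non-interpolated AP for a single class at a single tIoU threshold."""
    if not gts:
        return 0.0
    matched = [False] * len(gts)
    true_positives = 0
    precision_sum = 0.0
    ranked = sorted(preds, key=lambda p: p[2], reverse=True)
    for rank, (start, end, _score) in enumerate(ranked, start=1):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(gts):
            if matched[idx]:
                continue  # each ground-truth segment is matched at most once
            iou = temporal_iou((start, end), gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx >= 0 and best_iou >= tiou_threshold:
            matched[best_idx] = True
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / len(gts)


def average_map(per_class: Dict[str, Tuple[List[Prediction], List[Segment]]],
                thresholds: Sequence[float]) -> float:
    """Mean over thresholds of the per-threshold mAP (mean AP over classes)."""
    maps = []
    for thr in thresholds:
        aps = [average_precision(p, g, thr) for p, g in per_class.values()]
        maps.append(sum(aps) / len(aps))
    return sum(maps) / len(maps)
```

With thresholds such as 0.3, 0.4, 0.5, 0.6 and 0.7, `average_map` corresponds to the average mAP reported alongside the per-threshold values.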

Main Results
We compare our method with several state-of-the-art approaches [19,20,24,31] on THUMOS14 [9] and ActivityNet-1.3 [1]. We denote by Clean-Model and Noisy-Model the models trained on clean and noisy training data, respectively, with the default setting. Our intention here is to validate that our method is robust to noisy data and significantly mitigates the performance degradation of TAL models trained on inaccurate data.
Results on the THUMOS14 dataset. Table 1 shows the comparison results on the THUMOS14 test set. For the existing representative models [19,20,24,31] listed, we observe that inaccurate boundary annotations significantly deteriorate the detection performance of the vanilla model. Our approach, in contrast, demonstrates greater robustness.
Results on the ActivityNet-1.3 dataset. The comparison results on the ActivityNet-1.3 dataset are reported in Table 2. Our approach achieves considerable improvements over the vanilla model. For example, ActionFormer suffers an obvious performance drop under 40% noise, from 36.6% to 32.11%. With our PbGaR, it achieves a 2.02% improvement. Even at low noise levels, our method is still effective. For example, it improves the accuracy of ActionFormer at tIoU=0.95 under 20% noise. Our method also helps TriDet alleviate a 0.18% performance decline. However, the behavior differs across noise levels. At the 20% level, our method maintains stable performance at tIoU=0.95, while performance drops at the 40% level. We attribute this to a tradeoff our method makes between mAP at a specific tIoU and average mAP: it focuses more on segments whose boundary annotations are severely wrong. For instance, the improvement is obvious at tIoU=0.5 and tIoU=0.75 under 40% noise.

Ablation Study
We conduct ablation experiments on the THUMOS14 dataset to validate the effectiveness of the two modules in our method, pseudo-boundary generation (PBG) and pseudo-boundary refinement (PBR). We also analyze parameter sensitivity and the two strategies of PBR.
Analysis on main components. To investigate the effectiveness of the two components of PbGaR, we start from the vanilla ActionFormer trained on noisy data and then gradually add the two modules of our method. The results are shown in Table 3. The first row is the vanilla ActionFormer trained under different boundary noise levels. As we gradually add the PBG and PBR modules to training, both modules boost the performance under different noise levels of the training data. For the PBG module, training under our MIL formulation improves the mAP of ActionFormer across various boundary noise levels. For instance, the PBG module achieves 2.54% and 3.93% improvements under the 30% and 40% boundary noise levels. The second module, PBR, further enhances the quality of the pseudo boundaries, especially under high noise levels. We observe that the impact of PBR is minor under low boundary noise levels. This is likely because the action instances in bags are already of relatively high quality when the noise level is low. The results demonstrate that both modules contribute substantially to our method.
Ablation on the starting epoch of the PBG module. The starting epoch of our first module determines when pseudo boundaries are generated and thus affects their quality. We train ActionFormer with PBG for 35 epochs (including 5 warmup epochs) on the THUMOS14 dataset under 30% noise. We present results for the choice of the starting epoch of PBG in Table 4. We observe that our PBG module produces stable improvement and that the optimal value is obtained at epoch 6.
Analysis on two strategies of PBR. We validate the effectiveness of the two strategies in the pseudo-boundary refinement module: bag reconstruction (BR) and memory bank (MB). To this end, we add the PBG module to ActionFormer and use only one refinement strategy from PBR at a time. Experiments are conducted on THUMOS14 under the 30% noise level. As shown in Table 5, the first row is the result of adding our PBG module to ActionFormer. The remaining three rows demonstrate that either bag reconstruction or memory bank benefits the performance of TAL models trained on noisy data. This shows that both strategies improve the quality of the pseudo boundaries generated by the first module, and that their combination is the preferred option, as it better utilizes the capabilities of the TAL model.

CONCLUSION
In this paper, we focus on learning with inaccurate boundaries in the Temporal Action Localization task. By using classification as a guidance signal, we propose the PbGaR method to deal with the performance degradation of TAL models under noisy boundary annotations. PbGaR first generates more accurate pseudo boundaries for training models and then improves the quality of the pseudo boundaries via our refinement module. Extensive experiments on two benchmarks demonstrate that PbGaR works effectively with modern TAL detectors and obtains promising performance with inaccurate action boundary annotations.

Figure 1: Inaccurate boundary annotation illustration. TAL methods suffer from inaccurate annotations. Our method PbGaR generates a more precise boundary for training.

Figure 2: The overall framework of the proposed PbGaR. Features of the untrimmed video are fed into detectors. Based on the detector output, our pseudo-boundary generation module treats every inaccurate GT (orange blocks) as a bag of instances (green lines). We select the most positive instance (pink lines) to generate the pseudo boundary. A refinement module consisting of two strategies (i.e. bag reconstruction and memory bank) is applied to further enhance the quality of the pseudo boundaries (green blocks). The pseudo boundaries are then used for training detectors.

Figure 3: Pseudo-boundary generation module. As the action detectors output class confidence scores and boundary proposals, our first module utilizes the output to generate a more accurate pseudo boundary based on MIL.
Once an instance is positive, the bag is labeled positive. Formally, let $B_i$ denote the $i$-th bag in the video $V$, and let $b_i^j$ denote the $j$-th instance in bag $B_i$. We treat annotated action segments as positive bags $B_i^+$, whose instances are action proposals. Background segments are treated as negative bags $B_i^-$. The video is formulated as $V = \{B_0^+, B_1^+, \ldots, B_m^+, B_0^-, B_1^-, \ldots, B_n^-\}$. Our goal is to select the most positive instance $a_i^* = \{(s_i^*, e_i^*, c_i)\}$ in the positive bag $B_i$, with unchanged classification label $c_i$ but more precise boundaries $\{s_i^*, e_i^*\}$, i.e. generating a pseudo boundary. We then tag it as a new action annotation for model training. The process of generating $a_i^*$ is given in Eq. 1. The bounded exponential function in Eq. 3 has two hyper-parameters, one of which is constrained to $[0, 1]$.

Figure 4: Bag Reconstruction. New positive bags with generated pseudo ground truth are constructed until reaching the termination condition.
The annotations generated after bag reconstruction carry more accurate boundaries for each bag. These annotations are stored in the memory bank and further used with $a_i^0$ in Eq. 2 to provide a better localization prior in the next training epoch. Let $B_i$ denote the $i$-th bag and $a_i^{(t-1)}$ the annotation generated in the $(t-1)$-th epoch. In the $t$-th epoch, we perform a weighted average of $a_i^{(t-1)}$ and $a_i^0$.

Table 1: Comparison with state-of-the-art methods on the THUMOS14 test set under four boundary noise levels. The average mAPs are computed under the IoU thresholds [0.3:0.1:0.7]. Best results are in bold.

Table 2: Comparison with state-of-the-art methods on the ActivityNet-1.3 test set under two boundary noise levels. The average mAPs are computed under the IoU thresholds [0.5:0.05:0.95]. Best results are in bold.

Table 3: Analysis of the effectiveness of the two main components. Experiments are conducted on the THUMOS14 dataset.

Table 4: Ablation on the starting epoch of PBG. Experiments are conducted on the THUMOS14 dataset under 30% noise.

Table 5: Analysis on the two strategies of PBR. Experiments are conducted on the THUMOS14 dataset under 30% noise.