Joint Semantic Graph and Visual Image Retrieval Guided Video Copy Detection

In recent years, video copyright infringements such as secondary creation and heavy editing have emerged one after another, and pirated videos are no longer limited to simple copies or watermarked versions that are easy to identify. Existing traditional methods mainly decide whether a video is copied by comparing its similarity against videos in a database. However, for edited videos with large spatial and temporal differences, pixel-level comparison is prohibitively expensive and its robustness is hard to guarantee. Even the latest learning-based methods still rely on sufficient training data and generalize poorly. To address these issues, we propose a novel "coarse-to-fine" framework for accurate video copy detection. To improve detection efficiency over massive numbers of video frames, the coarse detection stage first constructs a semantic graph for every key frame of the source and target videos. The pixel-matching problem is thus transformed into graph-structure matching, which roughly identifies instances that may have been copied, significantly reducing the search range of subsequent video frames and increasing search speed. In the fine detection stage, we apply cross-self-attention to video-frame copy detection for the first time. Self-attention strengthens the contextual links between pixels in the key-frame space under the constraint of the semantic graph, while cross-attention extracts the features shared by the source video and the suspected copy to build global features, enabling similarity measurement between video frames in a high-dimensional space. Finally, based on the resulting global features, a hash-segmentation method performs an accurate and fast comparison, achieving fast and high-precision fine detection of video copies. Experimental results show that, compared with existing state-of-the-art methods, ours is extremely robust against all mainstream video tampering operations and achieves the highest detection efficiency even on massive video collections.


INTRODUCTION
With the popularity and explosive growth of Internet video applications, the cost of shooting, editing and uploading videos has gradually decreased. The cost of copying and redistributing video content is likewise low, so copyright infringement occurs frequently on video platforms and social media, and protecting video copyright has become urgent. To address this problem, researchers have proposed Video-level Copy Detection (VCD) and Partial Video Copy Detection (PVCD). Both aim to find suspected copies in a large-scale video database and to measure the similarity between two videos; the similarity information can effectively support the adjudication of infringement cases. Some copied videos have been carefully edited and significantly transformed, so they differ greatly from the original in both time and space. If pixel-level comparison were carried out frame by frame, the time cost would be enormous, which makes video copy detection a very challenging task. How to obtain accurate detection results while significantly reducing time cost has therefore become a hot topic. Existing video copy detection methods fall into two categories: video-level and image-level detection. Among video-level methods, the most basic way to decide whether two videos are exactly identical is the MD5 message-digest algorithm [1], which compares the MD5 values of the two files. This approach cannot detect tampered videos, even with the latest techniques [2]. Detection methods based on video titles, tags, descriptions and other metadata [3] [4] [5] require extensive manual effort to summarize the video. Although machine learning methods can assist with classification, they still need substantial manual annotation; moreover, because people understand content differently, the same video may receive different labels, making objective and consistent copy detection difficult.
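For reference, the MD5 baseline of [1] amounts to comparing file digests, which is why it breaks under any re-encoding or tampering. A minimal sketch in Python (file paths are placeholders):

import hashlib

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream the file through MD5 so large videos need not fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_exact_copy(path_a: str, path_b: str) -> bool:
    # Identical digests imply byte-identical files; a single changed pixel,
    # a container remux, or a re-encode yields a different digest.
    return md5_of_file(path_a) == md5_of_file(path_b)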
Current mainstream research on video-level copy detection combines multiple video features to infer copying behavior. For example, Franco et al. [6] used a hybrid network composed of a convolutional neural network and an autoencoder to extract features invariant to illumination variation, image deformation, horizontal mirroring and blur, obtaining good results on multiple datasets. However, accounting for all these factors makes the computation time unacceptable, and a large amount of time is needed beforehand to prepare samples and train the model; the complex network structure brings numerous parameters and weak generalization. Among image-level methods, Revaud et al. [7] proposed using a dense convolutional network to extract deep features of key frames and then analyzing them with a classification-correlation feature-fusion algorithm to improve detection accuracy; however, because the method was not trained on a corresponding video dataset, it misses some detections. Consequently, most current image-based video similarity methods hash image features into image or video fingerprints and then measure the similarity of these fingerprints. Jia et al. [8] proposed an optical-flow-based detection algorithm for copy-move forgery in image frames, but its computational complexity is too high and its efficiency too low. Methods based on visual image features mainly compare the shape, color, texture and other visual attributes of video frames; if a copy changes the color or shape of the original video, the frame features differ greatly, leading to errors. Hyeal et al. [9] proposed a method based on Spectral Hashing (SPH): after obtaining the hash sequence, they used the Hamming distance to compute the similarity between two videos. Such hash-based methods exploit only the relationships within each single feature, not the relationships among multiple features; because the relations between different feature representations are not mined, the retrieval results remain suboptimal.
To solve the above problems, we propose a method that balances retrieval efficiency and similarity-measurement accuracy. Our contributions can be summarized as follows:
1. An efficient coarse-to-fine video copy detection framework is proposed, and multiple datasets are extended. Most video copy detection methods rely on time-consuming, redundant pixel-to-pixel matching, while the global and semantic relationships between pixels are often ignored, resulting in errors. To address these issues, we propose a hierarchical coarse-to-fine detection framework that effectively improves the performance of video copy detection. In the coarse stage, instance-level semantic graphs are generated for each key frame, transforming the pixel-matching problem into graph-similarity search, which quickly eliminates dissimilar frames and shrinks the search area. Then the global relationships between pixels are extracted by a self-attention module with hash blocking to achieve both high efficiency and high detection accuracy. Notably, we propose a framework: the semantic segmentation, feature extraction and other modules are not limited to the components presented in this paper and can incorporate new methods to further improve performance. In addition, we extend several public datasets (CC_WEB_VIDEO, FIVR-200K) with multiple data augmentations for better training and evaluation. Experimental results show that the proposed framework achieves a higher recall rate with fewer computing resources and thus has better practical value.
2. A semantic-graph-guided coarse detection method for video copies is proposed. Most existing methods do not apply pixel-level semantic information to the video copy detection task; they can only describe simple spatial geometric relationships between pixels and struggle to describe complex frame content accurately. Worse, frame-by-frame, pixel-by-pixel matching seriously degrades detection efficiency. To solve this, we first segment the key frames at the instance level and generate a semantic relation graph for each key frame of a video pair. We abandon pixel-level search and transform the inter-frame similarity problem into graph search and graph-structure similarity comparison. Thanks to the linear time complexity of the search, this method can efficiently perform rough copy detection over massive video content, further narrowing the frame search range for the subsequent fine detection algorithm; this not only improves overall efficiency but also reduces the probability of false and missed detections.
3. A semantically constrained, global-context fine matching method for video copies is proposed. Most existing methods cannot accurately detect frames with similar content (similar geometric structure or semantics) but different ambient lighting, seasonal changes or viewpoints, resulting in missed detections. Inspired by image retrieval, we apply an encoder-decoder model to extract deep global features from video key frames under the constraint of the semantic graph. A semantically constrained cross-self-attention is designed to mine the contextual relationships between pixels and to measure similarity between video frames in a high-dimensional space, paying more attention to the similarity of video content (scenes). Compared with generic convolutional networks, ours retains more feature detail, achieves a higher recall rate, and realizes efficient, high-precision video copy detection.
In addition, we extensively experiment with and validate the effectiveness of the Joint Semantic Graph and Visual Image Retrieval Guided Video Copy Detection method on the VCDB [10], CC_WEB_VIDEO [11] and FIVR-200K [13] datasets. Compared with existing advanced methods such as SURF [13] and RBSIF-ICD [14], our method achieves the best performance in both the robustness of copy detection and the efficiency of large-scale video retrieval.

RELATED WORK
In this section, we first introduce spatial-domain video copy detection methods, divided into local-feature-based and global-feature-based methods according to how hash codes and feature vectors are extracted from key frames, and then introduce some datasets for copy video detection.

Methods Based on Local Features
Local-feature-based methods form a hash value by extracting local features, for example interest points of frame images. Local features resolve images more finely, but they struggle to perceive global changes in a frame and are less sensitive to global information. Neelima et al. [15] proposed a local feature descriptor based on the scale-invariant feature transform (SIFT), which is unchanged after the video frame sequence is scaled, translated or rotated. They use SIFT to extract invariant key points and the descriptors of each key frame selected from the frame sequence, cluster these feature values into several blocks around the cluster centers, and apply singular value decomposition (SVD) to each block. However, the SIFT descriptor only partially maintains its invariance when the video undergoes an affine transformation or illumination change, which is a critical weakness.
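To make this pipeline concrete, the following Python sketch illustrates the generic SIFT-cluster-SVD idea with OpenCV; the cluster count and the use of the leading singular value as a per-block summary are our illustrative assumptions, not Neelima et al.'s exact design:

import cv2
import numpy as np

def frame_sift_signature(frame_bgr: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    _, descriptors = sift.detectAndCompute(gray, None)  # (N, 128) or None
    if descriptors is None or len(descriptors) < n_clusters:
        return np.zeros(n_clusters)
    # Cluster descriptors, then keep the largest singular value of each
    # cluster block as a compact, ordering-invariant summary.
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1e-3)
    _, labels, _ = cv2.kmeans(descriptors.astype(np.float32), n_clusters,
                              None, criteria, 3, cv2.KMEANS_PP_CENTERS)
    sig = []
    for c in range(n_clusters):
        block = descriptors[labels.ravel() == c]
        sig.append(np.linalg.svd(block, compute_uv=False)[0] if len(block) else 0.0)
    return np.asarray(sig)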
Similar solutions were proposed by Massoud et al. [16] and Maani et al. [17]. Massoud used the difference of Gaussians (DoG), where the Gaussian kernel serves as a candidate kernel for the scale space; by tracking key points, the pixel position in the frame sequence can be found even after geometric rotation and scale change. In contrast, Maani et al.'s scheme uses an improved Harris detector [18] based on a generalized Radon transform, which is insensitive to rotation and scaling. For each interest point of a video frame, a local descriptor of the region is computed and similarity is given by a distance calculation that combines color intensity with geometric information in the frame. However, it is difficult for this approach to respond adequately to rotations and global changes.
Li et al. [19] proposed a method that extracts interest points from regions of interest (ROIs), called FREAK (Fast Retina Keypoint), which is robust to noise, rotation and scaling of video frame sequences. Zhang et al. [14] extracted local features from frame sequences with a method based on speeded-up robust features (SURF) and determined the repetition direction of interest points. The SURF method was also used by Ozbulak et al. [20], but it still cannot handle tampering with lighting.

Methods Based on Global Features
The global-feature approach was initially proposed by Lee et al. [21] with the centroid of gradient orientations (CGO) descriptor, which divides each video frame into an n × n grid of blocks and computes the CGO value of each block, yielding an n × n-dimensional feature vector. However, global features are insensitive to some common video tampering operations, such as displacement, rotation, cropping and other general geometric transformations, so there is room for improvement.
Most existing global-feature methods, such as [22], first partition the frame image into blocks and extract visual features of the frame by computing its grayscale, luminance and similar statistics. These methods are robust to common video transformations such as frame noise, filtering and recompression, but not very robust to displacement, cropping, geometric distortion, icon insertion, etc. Although Uchida et al. [23] improved the luminance-centroid scheme by proposing a method based on the quadrant of the luminance centroid, enhancing the pairwise independence between video clips by optimizing key-frame selection and comparing stable image features with an adaptive mask, it is still difficult to achieve good robustness against stronger local variations such as local cropping.
To improve robustness to rotation and flipping, Himer et al. [24] obtained global descriptors by combining, with weighted parameters, a binarized statistical image features (BSIF) local texture descriptor and a local color descriptor: the BSIF histogram is computed from all rings of each BSIF frame image, and the color histogram is computed frame by frame from the RGB value of each pixel. Himer et al. then gave an improved scheme, RBSIF-ICD, which introduces an invariant color descriptor (ICD) on top of BSIF in an attempt to build a global description robust to geometric attacks such as rotations and flips; however, experiments show that there is still room for improvement, and the method does not achieve good copy detection.
In addition, to further enhance the robustness of copy detection, some methods [25] [26], such as that of Hu et al., jointly use CNNs and RNNs, employing both local and global features; although they are highly robust, detection takes a long time. Some recent work has adopted attention mechanisms [27] [28], but it does not combine self-attention with cross-attention, and detection accuracy has not improved greatly.

Related Datasets
Due to the rapid growth of the video copy detection field, many excellent datasets have been proposed and applied to its study. We introduce some representative datasets below.

METHOD
Existing video tampering operations mainly include global noise, lighting (gamma) changes, flipping, displacement and rotation, cropping, affine transformation and partial occlusion. As Section 2 shows, local-feature-based similar-video detection is insensitive to changes in global noise, illumination, flipping, etc., while the global-feature approach is insensitive to displacement, cropping, geometric deformation and local occlusion. To solve these problems, we propose the Joint Semantic Graph and Visual Image Retrieval Guided Video Copy Detection method. It mainly comprises three parts: an efficient coarse-to-fine video copy detection framework, an efficient semantic-graph-guided coarse detection method, and a fine detection method based on semantic constraints and global deep features. We introduce each in turn below.

Efficient Coarse to Fine Video Copy Detection Framework
Most existing video copy detection methods scan the key-frame sequence frame by frame, extract and analyze features, and then compare these features against the database frame by frame. This compares videos that are almost certainly dissimilar, generating a large number of redundant operations at great time cost. For frame feature extraction, traditional methods focus on either local or global features, which always leaves blind spots for particular tampering operations. Extracting both local and global features strengthens robustness to some extent but further increases the computational cost, lowering practicality for very large video databases.
To solve these problems, we propose a coarse-to-fine video copy detection framework that improves detection performance through a hierarchical retrieval strategy. In the coarse stage, we use the semantic information of video frames, which traditional copy detection ignores: instance-level semantic segmentation builds a semantic graph of each frame image, the information is clustered, and efficient graph search quickly screens the video library to filter out instances that may have been copied, significantly reducing the frame search range. In the fine stage, global-context deep features based on an attention mechanism are extracted from the key frames, and hash-block comparison is performed on the self-attention output within the constrained space obtained in the previous stage, producing the several most similar videos for manual review. We describe the two stages in detail below.
Before coarse detection, the video must be segmented into scenes and key frames extracted. Shot boundaries are detected from the differences between frame-image histograms, and key frames are computed after segmenting the footage. The coarse stage then has two main steps. After obtaining the key frames, each is semantically segmented at the instance level to construct a semantic graph of the key-frame image, and clustering yields a preliminary search space. Finally, the semantic-graph information is compared with the data in the library to derive a search space under semantic-graph constraints, and the videos in this space are labeled with the similarity obtained during coarse detection.
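A minimal sketch of histogram-difference key-frame selection of the kind described above, using OpenCV; the histogram bins and the Bhattacharyya threshold are illustrative assumptions:

import cv2

def extract_key_frames(video_path: str, threshold: float = 0.5):
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        # A large histogram change marks a shot boundary; keep that frame
        # as the key frame of the new shot.
        if prev_hist is None or cv2.compareHist(
                prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            key_frames.append(frame)
        prev_hist = hist
    cap.release()
    return key_frames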
The fine detection stage is also divided into two steps. First, attention-based global deep features are extracted from the key frames to obtain the contextual global relationships between pixels. Then the deep features of the videos in the semantic-graph-constrained search space are compared quickly with the hash-chunking method, and, combined with the coarse-stage results, the several most similar videos are finally obtained.
By introducing instance-level semantic segmentation and knowledge-graph techniques, our framework is very sensitive to tampering such as panning, flipping, affine transformation, partial occlusion and geometric changes, while the global-context feature comparison of key frames is very sensitive to global changes such as global noise, video re-encoding and lighting modification; as a whole, the framework detects most video tampering robustly and accurately. Moreover, the modules in the framework are highly interchangeable (the key-frame detection module, the instance-level semantic segmentation module, etc.), so improving any single module directly improves overall performance. The structure of the coarse-to-fine video copy detection framework is shown in Figure 1.

Semantic Graph Guided Coarse Detection Method for Video Copy
Most existing methods extract and use only local or global image features, ignoring the rich semantic information that can summarize a picture, including the spatial and geometric relationships of its objects, and that can greatly improve robustness to changes in global information; even simple semantic information helps video classification considerably. Worse, frame-by-frame matching is too inefficient, whereas semantic-graph-guided coarse detection can significantly reduce the set of videos that need precise comparison. Our method eliminates a large amount of redundant computation; it is sensitive to local occlusion, affine transformations such as translation, flipping and zooming, geometric transformations, cropping and other tampering operations, and it can ignore global changes such as illumination and noise.
We use the MMDetection [33] algorithm library, derived from the code of the MMDet team, winner of the COCO 2018 object detection competition, and improved by SenseTime to offer top speed and performance in object detection.
Pixel-level instance segmentation yields the instance annotations of each key frame: the instance name, the pixel position of the instance center, and the size of the instance bounding box. From these we construct a fully connected graph whose nodes store the instance name and bounding-box size. We compute the relative distance between every pair of instance centers, take the arithmetic mean as the unit distance, and use the resulting relative distances as the edge weights of the fully connected graph. A schematic of the semantic graph is shown in Figure 3; using relative pixel distances makes the graph extremely robust to affine transformations.

Next, we cluster the nodes of the fully connected graph, take the labels of the largest few clusters as video labels, and screen the database for videos whose label matches reach a threshold as candidates for subsequent copy detection; after this initial screening, the candidate set is less than one thousandth of the original database. We then run Prim's algorithm [34] on the fully connected graph to obtain its minimum spanning tree, add the tree's nodes in order to a vector for matching search, and use the tree's total edge weight as the head node of the search vector. We first check whether the node information of the search vector is a subset of a frame sequence in the candidate set; if it is a proper subset, the matching candidate video is added directly to the subsequent search space with its similarity attached, which greatly improves robustness to cropping. Finally, videos whose instance information and weight information match up to the threshold are also added to the search space for fine detection.

Since semantic recognition can only detect trained targets, untrained targets will be recognized as some class in the training set, producing wrong semantic labels. This has little impact on copy detection: untrained targets of the same class will, with high probability, be recognized as the same class because their features are similar, so the video similarity computation is unaffected and the probability of missing copied videos does not increase.
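The following Python sketch illustrates the graph construction and MST step described above; the Instance record is a hypothetical stand-in for the instance segmentation output, networkx supplies Prim's algorithm, and at least two detected instances are assumed:

from dataclasses import dataclass
from itertools import combinations
import math
import networkx as nx

@dataclass
class Instance:
    name: str
    cx: float       # center pixel position
    cy: float
    box_w: float    # bounding-box size
    box_h: float

def semantic_graph(instances: list[Instance]) -> nx.Graph:
    g = nx.Graph()
    for i, inst in enumerate(instances):
        g.add_node(i, name=inst.name, box=(inst.box_w, inst.box_h))
    dists = {(i, j): math.dist((a.cx, a.cy), (b.cx, b.cy))
             for (i, a), (j, b) in combinations(enumerate(instances), 2)}
    unit = sum(dists.values()) / len(dists)  # arithmetic mean as unit distance
    for (i, j), d in dists.items():
        g.add_edge(i, j, weight=d / unit)    # relative distance as edge weight
    return g

def mst_search_vector(g: nx.Graph):
    mst = nx.minimum_spanning_tree(g, algorithm="prim")
    total = mst.size(weight="weight")        # weight sum heads the vector
    return [total] + [g.nodes[n]["name"] for n in nx.dfs_preorder_nodes(mst)]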

Semantic Constraint and Global Image Retrieval for Video Copy Fine Matching
After narrowing the search scope, we must precisely compare the suspected copy with the original videos in the search space to derive their degree of similarity. In image retrieval, recognizing the same scene under different conditions is a difficult problem: the same scene in different seasons or under different illumination often carries similar semantic information while differing greatly at the pixel level, which easily causes missed video frames. To improve the robustness of fine detection, we take inspiration from image retrieval and apply an encoder-decoder model to extract deep global features from the video key frames. To obtain dense global context information and establish pairwise dependencies between pixels, we design a semantically constrained cross-self-attention mechanism that mines the global contextual relationships between pixels. The network structure is shown in Figure 4. First, the RGB values of image pixels are fed into the encoder together with their corresponding semantic information; the output vectors then pass through the self-attention mechanism to generate the corresponding codes; finally the two streams are fused by cross-attention to generate cross-self-attention feature maps. This effectively handles scene and lighting transformations that are difficult for semantic recognition alone, allowing us to measure similarity between video frames in a high-dimensional space while paying more attention to the similarity of video content (scenes).
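A minimal PyTorch sketch of the cross-self-attention idea: self-attention first refines each video's own key-frame tokens, then cross-attention keeps what the two frames share. The dimensions and the fusion layout are our assumptions, not the exact network of Figure 4:

import torch
import torch.nn as nn

class CrossSelfAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, src_tokens: torch.Tensor, cand_tokens: torch.Tensor):
        # src_tokens / cand_tokens: (B, N, dim) encoder outputs per key frame,
        # assumed already fused with their semantic information upstream.
        src, _ = self.self_attn(src_tokens, src_tokens, src_tokens)
        cand, _ = self.self_attn(cand_tokens, cand_tokens, cand_tokens)
        # Cross-attention: query the candidate with the source so the output
        # emphasizes features the two frames have in common.
        fused, _ = self.cross_attn(src, cand, cand)
        return self.norm(fused).mean(dim=1)  # global feature for the pair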
When comparing a video with m key frames to a video with n key frames, the comparison must traverse m × n pairs, which makes efficient retrieval difficult. To solve this, we propose a fast video similarity detection method based on hash feature chunking. Among the m × n element pairs to be compared between two videos, we define two frames as similar when the Hamming distance between their feature vectors is within a threshold D. We cut the feature of a video frame into k equal blocks; by the pigeonhole principle, features whose Hamming distance is within D must agree exactly on at least k − D blocks, so blocks that are not exactly equal rule out similarity and allow frames that cannot match to be excluded. We take k − D of the cut blocks at random and build an inverted index on the keys they compose. The features of the videos in the database are sliced in the same way, k − D blocks are taken out, and the values formed by their combination are looked up in the inverted index: if a key exists, the complete features containing it may be similar, so the two complete features are compared and declared similar frames when their Hamming distance is less than D; if the key does not exist, the two complete features cannot be similar, and the value formed by the next feature block is compared. Regarding time complexity, for a fixed threshold D there exists a constant k at which the cost function attains its minimum, i.e., the time spent is minimal. The comparison algorithm based on hash chunking is given below (Algorithm 1). Through this method we easily obtain a sequence of video similarities, arranged from high to low. Because the final output videos have passed our detection, they are very robust to the various tampering operations, which greatly facilitates video auditing and minimizes the possibility of missed detection.
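The pigeonhole argument above can be sketched in a few lines of Python; modeling fingerprints as 64-bit integers is an illustrative assumption:

from collections import defaultdict
from itertools import combinations

def split_blocks(fingerprint: int, k: int, bits: int = 64):
    w = bits // k
    return tuple((fingerprint >> (i * w)) & ((1 << w) - 1) for i in range(k))

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def similar_pairs(db: list[int], queries: list[int], k: int = 8, d: int = 3):
    # If hamming(a, b) <= d, at least k - d blocks of a and b match exactly,
    # so indexing every (k - d)-block key never misses a true match.
    index = defaultdict(list)  # (positions, block values) -> db video ids
    for vid, fp in enumerate(db):
        blocks = split_blocks(fp, k)
        for pos in combinations(range(k), k - d):
            index[(pos, tuple(blocks[p] for p in pos))].append(vid)
    hits = set()
    for qid, fp in enumerate(queries):
        blocks = split_blocks(fp, k)
        for pos in combinations(range(k), k - d):
            for vid in index.get((pos, tuple(blocks[p] for p in pos)), []):
                if hamming(fp, db[vid]) <= d:  # verify the full fingerprint
                    hits.add((qid, vid))
    return hits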

EXPERIMENTS
To verify the effectiveness of the Joint Semantic Graph and Visual Image Retrieval Guided Video Copy Detection method, we conducted several experiments and tests on VCDB, CC_WEB_VIDEO and FIVR-200K. The experimental environment is a Windows 10 operating system with an Intel Core i7 CPU and an NVIDIA GTX 1060 GPU.

Experiment Preparation
We extracted 2,478 videos from the 24 classes of the CC_WEB_VIDEO dataset and subjected them to secondary processing with fifteen kinds of video tampering, including text overlay, shifting, flipping, local cropping and rotation, with some videos undergoing two or more tampering operations. We grouped these into 8 major categories: global noise, lighting change, flipping, displacement and rotation, cropping, affine transformation, partial occlusion, and mixed. We thereby extended the CC_WEB_VIDEO dataset to 53,340 near-duplicate videos, applied a similar approach to FIVR-200K to obtain an extended dataset of 58,000 videos, and unified the tagging of the video replicas.
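For illustration, a few of the tampering transforms used in this extension can be sketched with OpenCV as below; all parameters (shift fraction, rotation angle, noise level) are illustrative assumptions, not the exact augmentation recipe:

import cv2
import numpy as np

def tamper(frame: np.ndarray, kind: str) -> np.ndarray:
    # Assumes a reasonably sized BGR frame.
    h, w = frame.shape[:2]
    if kind == "flip":
        return cv2.flip(frame, 1)                       # horizontal flip
    if kind == "shift":
        m = np.float32([[1, 0, 0.1 * w], [0, 1, 0.1 * h]])
        return cv2.warpAffine(frame, m, (w, h))         # 10% displacement
    if kind == "rotate":
        m = cv2.getRotationMatrix2D((w / 2, h / 2), 15, 1.0)
        return cv2.warpAffine(frame, m, (w, h))
    if kind == "crop":
        return frame[h // 10: h - h // 10, w // 10: w - w // 10]
    if kind == "noise":
        noisy = frame.astype(np.float64) + np.random.normal(0, 10, frame.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)
    if kind == "text":
        out = frame.copy()
        cv2.putText(out, "sample", (w // 4, h // 2),
                    cv2.FONT_HERSHEY_SIMPLEX, 2.0, (255, 255, 255), 3)
        return out
    raise ValueError(kind)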

Copy video miss detection rate experiment in Semantic Graph Guided Coarse Detection
As the first stage of the framework, the search range produced by Semantic Graph Guided Coarse Detection must contain all target videos for the final videos to be recallable. We therefore extracted 1,000 sets of copied videos from each of CC_WEB_VIDEO and FIVR-200K and queried the video library with Semantic Graph Guided Coarse Detection. If all copies of the queried video lie within the resulting search range, the query is marked as a successful recall; otherwise it is a missed detection. We also report the percentage of video frames filtered out by coarse matching, computed as the ratio of the final search range to all videos. The results are shown in Table 2.
The results show that the miss rate of the Semantic Graph Guided Coarse Detection method is low: the method both reduces the search range and almost never misses copied videos. As for the filtered rate, CC_WEB_VIDEO has a low filtered rate because of its small size and high repetition, while the larger FIVR-200K database has a very high filtered rate.

Robustness Experiments under Multiple Tampering methods
To verify that our method handles most tampering operations, we experimented with all 8 major categories of video tampering. The compared methods include SIFT and SURF, based on local features, and RBSIF-ICD, based on global features. In addition, Hu [35] proposed a method based on joint spatio-temporal features, which we refer to as CNN-RNN. The experiments were run on the processed CC_WEB_VIDEO dataset; the results are shown in Table 3.
Here the precision rate is measured when the recall rate reaches 60%, and the recall rate is measured when the precision rate reaches 90%. As described earlier, local-feature-based retrieval is less robust to global noise and gamma changes, while the global-feature method is very robust to them but still has room for improvement on flipping, affine transformation and local occlusion. In contrast, our method is robust to all the above tampering operations and achieves 87.9% detection precision even on videos tampered with multiple operations combined. Our method therefore detects copied videos more effectively.

Search Efficiency Comparison Experiment
In a video copy detection system, detection time also matters: whether similar videos can be retrieved within a reasonable time determines whether an algorithm is usable. In this section we test on five video datasets: CC_WEB_VIDEO (12,790 videos), the expanded CC_WEB_VIDEO (53,340 videos), VCDB (100,000 videos), FIVR-200K (225,960 videos), and the set obtained by mixing all of the above (437,300 videos). We compare against the feature-histogram-based SPH, a deep neural network approach [36], and the NetVLAD and NeXtVLAD models. We tested several videos of around 30 s in length and averaged the results; Table 4 gives the effect of video length on search rate, and Table 5 gives the search times. The two tables show that our method has excellent time performance on large-scale datasets: the search time does not grow linearly with the dataset size (even when the database grows by tens of times, the search time increases only about twofold) and is closely related only to the length of the query video.

CONCLUSION
In this paper, we propose the Joint Semantic Graph and Visual Image Retrieval Guided Video Copy Detection method, a coarse-to-fine copy video detection framework. Experiments on several datasets show that the method is robust to various video tampering operations and searches large-scale video datasets significantly more efficiently. In addition, the modular design makes the framework's components flexible to use, and improving any single module improves overall performance. Our method does have limitations: the framework is sensitive to the accuracy of semantic segmentation, and overall detection accuracy suffers when segmentation fails or misrecognizes objects in the coarse detection phase. Experiments found the segmentation results unsatisfactory for abstract videos such as artworks and animations, lowering the final recall rate. In future work, we will therefore focus on handling videos that are difficult to segment semantically at a fine-grained level.

Figure 3 :
Figure 3: Semantic Graph in the Upper Right Corner of Figure 2.

Table 1: Comparisons of Different Video Copy Datasets

Wu et al. [11] provided a public dataset called CC_WEB_VIDEO, which contains 24 queryable items and consists of 12,790 videos. TRECVid 2008 [29] is a content-based dataset published by the National Institute of Standards and Technology that contains 200 h of TV shows with 2,000 query clips. VCDB, a commonly used dataset for video copy detection, contains 528 core videos collected from YouTube and MetaCafe and 9,236 copy clips. Li et al. [30] provided CTV, a multimodal dataset collected from Chinese TV channels containing a total of 96 hours of test videos. FIVR-200K is a large annotated dataset of 225,960 YouTube videos associated with major news events crawled from Wikipedia. Jiang et al. provided SVD [31], a dataset containing over 500,000 short videos and over 30,000 labeled near-duplicate videos. The latest dataset, VCSL [32], includes over 160,000 infringing video pairs and 280,000 infringing clips, covering a large range of video domains and durations. We summarize them in Table 1 for review.

Algorithm 1: Comparison Algorithm Based on Hash Chunking

Input: feature set F_a; feature set F_b; number of chunks k; Hamming distance threshold D
Output: video similarity S_ab
1: Hash F_a and F_b
2: Dice the hash features of both F_a and F_b into k blocks
3: n_min = min(len(F_a), len(F_b)), n_max = max(len(F_a), len(F_b))
4: I ← create inverted index for F_b
5: counter ← 0
6: for i = 0 to len(F_a) × k do
7:     f_a ← the complete feature F_a(i / k) corresponding to the feature block in F_a
8:     f_b ← the complete feature F_b(I(i)) containing the feature block in F_b, found via the inverted index
9:     if hamming(f_a, f_b) < D then
10:        counter ← counter + 1
11:    end if
12: end for
13: S_ab ← counter / n_min

Table 2 :
Missing Detection Rate and Video Filtered Rate of Semantic Graph Guided Coarse Detection

Table 3 :
Algorithm Robustness Testing under Multiple Tampering Methods

Table 4 :
Relationship Between Video Length and Search Time

Table 5 :
Search Time of the Algorithm on Each Dataset