Cover Song Identification Technologies: A Survey

Cover Song Identification (CSI) is a crucial task in the field of Music Information Retrieval (MIR) and holds significant importance for copyright infringement detection. With the widespread adoption of digital music and the online dissemination of musical works, music copyright violations have become increasingly prevalent and severe. Copyright holders and creators face the risk of their compositions being copied, adapted, distributed, or pirated without authorization, leading to misuse of their works, financial losses, and reputational damage. As a result, cover song identification technology has emerged as a vital tool for safeguarding music copyrights and preserving the legitimate rights of creators. Its significance for copyright infringement detection lies not only in protecting music copyrights but also in maintaining a healthy ecosystem for music rights; it further improves the user experience on music platforms by preventing unauthorized cover versions from infiltrating recommendation systems. In short, cover song identification technology provides robust support for the sustainable development of the music industry and the protection of creators' legitimate rights, and it plays an indispensable role in the long-term prosperity of the music sector. This article aims to provide a comprehensive exposition of cover song identification technologies.


INTRODUCTION
With the rapid advancement of digital technology and the widespread proliferation of the internet, the mode of music dissemination has undergone a revolutionary transformation. Music is no longer confined to physical records or radio broadcasts but can be effortlessly shared through online music platforms, social media, and video-sharing websites. While this convenience has brought tremendous opportunities for music promotion and sharing, it has also given rise to a pressing issue: music copyright infringement. Copyright infringement has become a significant challenge in the music industry. It not only infringes upon the rights of music creators but also has adverse effects on the entire music ecosystem. Particularly in the digital era, music can be easily replicated, disseminated, and modified, leading to a proliferation of potential copyright violations. In this context, cover song identification technology has emerged as a potent tool for distinguishing and identifying different versions of musical works.
The significance of cover song identification technology lies in its ability to protect the copyrights of original music compositions. It accomplishes this by analyzing various aspects of music, including sound characteristics, melodies, rhythms, and more, and can accurately recognize different versions of a piece, encompassing both original compositions and cover renditions. This recognition capability is paramount for copyright holders, enabling them to swiftly detect unauthorized cover song activity. When individuals or groups attempt to create music closely resembling an original song without obtaining explicit permission from the copyright owner, cover song identification technology assists in detecting such infringements and upholding the legitimacy of copyright. In the realm of copyright enforcement, the technology holds immense potential for safeguarding the interests of music creators: it aids in monitoring the music market, ensures that copyright holders receive their rightful compensation, and contributes to the vitality and sustainability of the music industry, encouraging more musicians to engage in creation knowing that their works are protected.
In conclusion, cover song identification technology not only plays a critical role in copyright protection but also contributes to the overall health of the music ecosystem. As digital music continues to evolve, this technology will remain pivotal in the music industry, ensuring that music creators receive the respect and rewards they deserve.
In the early days of cover song identification, the approach involved extracting manually designed music features [14] [15] [18] and then comparing the correlation or similarity between two pieces of music. However, due to the complexity of music and constraints on efficiency and performance, this technology proved challenging to apply at scale. With the advancement of technology, deep learning has demonstrated immense capability and astonishing performance in fields such as computer vision and natural language processing [11] [7] [6]. Researchers have accordingly shifted their focus in cover song identification from traditional methods to deep learning, achieving highly impressive results on larger datasets.
One of the more classical deep learning approaches in cover song identification treats the training process as a classification task [22] [23]. In this approach, the embedding extracted from the trained bottleneck layer is used as the cover song representation. Building upon this foundation, researchers have further optimized model training by incorporating loss functions such as triplet loss, center loss, PNL loss, and others; these losses are employed to reduce the feature distance between similar songs while increasing the feature distance between dissimilar ones [10]. In terms of model architecture, Yu et al. [22] proposed the Temporal Pyramid Pooling network (TPPNet), which extracts information at different scales and generates fixed-dimensional representations. In a subsequent study, CQTNet [23] was designed around the characteristics of the cover song task and trained with softmax classification. Ken et al. [10] introduced PiCKINet, which builds a model with Pitch Class Blocks to maintain key-invariant features of music. Inspired by person re-identification research [9], Du et al. introduced ByteCover [3] and ByteCover2 [1], which combine the ResNet-IBN50 [12] model architecture with BNNeck [9] and multiple loss functions, including softmax and triplet loss. Additionally, for shorter audio clips, they proposed ByteCover3, achieving state-of-the-art results. Hu et al. [8] proposed the LyraC-Net model, which combines multiple loss functions on top of a WideResNet, and demonstrated that training with large-scale data can yield excellent performance. These methods are highly classic, have achieved significant results on publicly available datasets such as SHS100K [20], Covers80 [4], Da-TACOS [21], and YouTube350 [16], and hold important academic research value.
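To make the metric learning objective concrete, the following is a minimal, illustrative PyTorch sketch of the triplet loss mentioned above; the margin value and embedding dimension are assumptions for exposition, not settings from any of the cited papers.

```python
# Illustrative triplet loss for cover song embeddings (a sketch, not a
# reproduction of any cited system, which combine this with other losses).
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style loss: pull a cover pair together, push a non-cover apart."""
    d_pos = F.pairwise_distance(anchor, positive)  # anchor vs. a cover of the same work
    d_neg = F.pairwise_distance(anchor, negative)  # anchor vs. an unrelated song
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with random 300-dim embeddings standing in for a trained encoder.
a, p, n = (torch.randn(8, 300) for _ in range(3))
print(triplet_loss(a, p, n))
```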

APPROACH
Yu et al. [22] employed Convolutional Neural Networks (CNNs) to learn crucial invariant features for cover song identification (CSI). Inspired by the successful application of pyramid pooling techniques in computer vision [5] [19], this approach was introduced to the domain of music analysis, incorporating Temporal Pyramid Pooling (TPP) to aggregate information at different scales and handle the temporal aspects of music segments. The technique divides music segments into subsegments at different time scales, extracts features from them individually, and then pools these features to comprehensively capture the temporal information of the music. This aids in enhancing the model's sensitivity to musical variations, consequently improving the accuracy of cover song identification. The innovation and contributions of this paper can be summarized in three key aspects. Firstly, it effectively demonstrates the potential of CNNs in music processing, similar to their application in computer vision for extracting translation-invariant features; this indicates that CNNs can be leveraged to learn critical invariant feature representations of music, providing new approaches for music information extraction and processing. Secondly, the introduction of TPP enables the extraction of music information at various scales, efficiently converting variable-length music sequences into fixed-length representations; this helps address temporal structure in music, making the model more adaptable and robust. Thirdly, a multi-length training scheme for data augmentation was developed to enable the model to better handle temporal variations in music, improving performance across various types of music data.
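A minimal sketch of the TPP idea follows: pool the time axis of a CNN feature map at several scales and concatenate the results, so inputs of any length yield a fixed-length vector. The pyramid levels and feature dimensions are illustrative assumptions, not the paper's exact configuration.

```python
# Temporal pyramid pooling sketch: variable-length (batch, channels, time)
# features become a fixed-length vector regardless of the time dimension.
import torch
import torch.nn.functional as F

def temporal_pyramid_pool(x, levels=(1, 2, 4, 8)):
    """x: (batch, channels, time) feature map from a CNN backbone."""
    pooled = []
    for n_bins in levels:
        # Adaptive pooling splits the time axis into n_bins segments of
        # roughly equal length and max-pools within each segment.
        pooled.append(F.adaptive_max_pool1d(x, n_bins))
    return torch.cat(pooled, dim=2).flatten(1)  # (batch, channels * sum(levels))

feats = torch.randn(4, 256, 173)           # features from a clip of arbitrary length
print(temporal_pyramid_pool(feats).shape)  # torch.Size([4, 3840]), fixed size
```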
In subsequent research [23], Yu and colleagues extended the model's receptive field using specialized convolutional kernels and dilated convolution techniques to better capture melodic structures and key features in music, marking a pioneering development in music information retrieval. Firstly, to address music-specific challenges such as key changes, tempo variations, and structural changes, the paper introduced specialized convolutional kernel sizes and incorporated dilated convolutions, operations that expand the receptive field. This allowed the model to capture long-range dependencies in music more effectively, contributing to improved performance in cover song identification. Secondly, by framing the problem as a classification task, the research team successfully trained a model to distinguish between different versions of a piece. To enhance robustness, they designed a training strategy that enabled the model to handle music inputs of varying rhythms and lengths. This approach demonstrated superior performance on multiple public datasets compared to state-of-the-art methods while maintaining lower time complexity.
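To illustrate the receptive-field effect (a generic sketch, not the paper's architecture), compare a plain and a dilated 1-D convolution: with the same number of parameters, dilation lets each output frame see a wider span of musical context.

```python
# Dilated convolution sketch: same kernel size and parameter count, but the
# dilated kernel spans (3 - 1) * 4 + 1 = 9 frames instead of 3.
import torch
import torch.nn as nn

plain   = nn.Conv1d(64, 64, kernel_size=3, dilation=1, padding=1)  # sees 3 frames
dilated = nn.Conv1d(64, 64, kernel_size=3, dilation=4, padding=4)  # sees 9 frames

x = torch.randn(1, 64, 400)              # (batch, channels, time)
print(plain(x).shape, dilated(x).shape)  # same output length, wider context
```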
Ken et al. [10] introduced the Pitch Class Key-Invariant Network (PiCKINet) for cover song identification. They utilized the more effective Constant-Q Transform (CQT) features, which are converted into an activation map with a pitch class dimension through a key-invariant multi-octave filter. What sets PiCKINet apart from other networks is its use of large multi-octave kernels that generate a latent representation maintaining the pitch class dimension consistently throughout the network via key-invariant convolutions. At the time of the study, PiCKINet demonstrated superior performance compared to other models using CQT features. Additionally, the authors proposed an extended model called PiCKINet+, which incorporates a center loss penalty, squeeze-and-excite units, and octave-swapping data augmentation; this model also achieved outstanding performance.
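As a concrete illustration, CQT features of the kind PiCKINet consumes can be computed with librosa. The file name, bin counts, and the naive octave fold below are assumptions for the example; PiCKINet itself learns the pitch-class mapping with key-invariant multi-octave convolutions rather than summing octaves directly.

```python
# Hedged sketch: a log-magnitude CQT and a naive pitch-class fold.
import librosa
import numpy as np

y, sr = librosa.load("song.wav", sr=22050)                 # hypothetical audio file
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512,
                         n_bins=84, bins_per_octave=12))   # 7 octaves x 12 bins
log_cqt = librosa.amplitude_to_db(cqt)                     # (84, frames) network input

# Naive fold of the 7 octaves onto 12 pitch classes, for intuition only.
pitch_class_map = cqt.reshape(7, 12, -1).sum(axis=0)       # (12, frames)
```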
The ByteCover method proposed by Du et al. [3] offers a significant advantage over other methods, with two notable improvements. The first draws inspiration from the method described in reference [12], which combines Instance Normalization (IN) and Batch Normalization (BN). In computer vision, this combination harnesses the ability of IN to learn features invariant to appearance changes, such as color, style, and virtual/real domain, along with the ability of BN to retain content-related information. In the context of Cover Song Identification (CSI), it is essential to preserve version-related information while designing features robust to variations in key, rhythm, timbre, genre, etc. Therefore, the authors introduce Instance Normalization into the CSI problem and combine it with Batch Normalization to construct IBN blocks, creating the ResNet-IBN model. Experimental results demonstrate that integrating IN and BN into ResNet50 significantly enhances CSI performance. The second improvement involves a change in the model's learning approach. Most existing studies treat CSI as either a classification problem or a metric learning problem; ByteCover instead learns features by jointly optimizing a classification loss and a triplet loss. Reference [17] has shown that classification loss emphasizes inter-class distinctiveness, while triplet loss emphasizes intra-class compactness, and numerous researchers have demonstrated through experiments that combining the two losses yields better results in tasks such as person re-identification [9], fine-grained visual recognition, and ego-motion action recognition [17]. The authors extensively investigate the use of BNNeck from reference [9] and, on this basis, merge the classification and triplet losses for CSI. With this multi-loss training, it becomes possible to learn feature representations that are both robust and discriminative.
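A minimal sketch of an IBN block follows, mirroring the common IBN-a design from [12]: Instance Normalization on half the channels (appearance invariance) and Batch Normalization on the other half (content preservation). Treat the split ratio and placement as assumptions rather than ByteCover's exact implementation.

```python
# IBN block sketch: split channels, normalize the two halves differently,
# then concatenate back into a single feature map.
import torch
import torch.nn as nn

class IBN(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.half = channels // 2
        self.IN = nn.InstanceNorm2d(self.half, affine=True)  # style-invariant half
        self.BN = nn.BatchNorm2d(channels - self.half)       # content-preserving half

    def forward(self, x):
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.IN(a), self.BN(b)], dim=1)

x = torch.randn(8, 64, 32, 32)
print(IBN(64)(x).shape)  # torch.Size([8, 64, 32, 32])
```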
ByteCover2 [1] is an upgraded version of ByteCover. The authors introduced a new module in the ByteCover2 architecture called PCA-FC, drawing inspiration from the classic PCA dimensionality reduction method. This module consists of a fully connected (FC) layer whose weights are initialized using Principal Component Analysis (PCA). PCA-FC not only reduces the dimensionality of the original audio embeddings but also provides a trainable component for fine-tuning the model, achieving better performance with smaller embedding sizes. ByteCover2 was evaluated on multiple CSI datasets, including SHS100K [20], Covers80 [4], and Da-TACOS [21]. Experimental results demonstrate that even with significantly smaller embedding sizes, ByteCover2 outperforms all competitors, including the previous ByteCover model. Specifically, with a dimension size of 1536, ByteCover2 achieved new state-of-the-art CSI performance on all datasets, with mean average precision (mAP) surpassing ByteCover by 2.8%, 2.2%, and 7.7%, respectively.
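The following is an illustrative sketch of initializing an FC layer with PCA components in the spirit of PCA-FC; the dimensions and the random stand-in for training embeddings are assumptions for the example.

```python
# PCA-FC sketch: fit PCA on a sample of embeddings, copy the components into
# an FC layer's weights, then fine-tune that layer end to end with the model.
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

in_dim, out_dim = 2048, 1536              # illustrative sizes (1536 as in the text)
sample = np.random.randn(4096, in_dim)    # stand-in for real training embeddings

pca = PCA(n_components=out_dim).fit(sample)
fc = nn.Linear(in_dim, out_dim, bias=False)
with torch.no_grad():
    # pca.components_ has shape (out_dim, in_dim), matching nn.Linear's weight.
    fc.weight.copy_(torch.from_numpy(pca.components_).float())

reduced = fc(torch.randn(4, in_dim))      # trainable dimensionality reduction
```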
With the surge in short video content, matching short music clips against complete tracks in a database has become a practical requirement, yet it remains an underexplored problem: the shorter the clip, the less reliable the extracted features. ByteCover3 [2] is introduced in this context as an extension of ByteCover and ByteCover2, addressing the recognition of short music queries against full-length audio databases. Unlike existing research that relies on global features, ByteCover3 learns a set of deep local features from each audio track and employs feature matching to recognize short queries against complete songs. The authors propose a novel loss function called Local Alignment Loss (LAL). Additionally, to improve the efficiency of feature matching, a two-stage retrieval pipeline is designed, consisting of an Approximate Nearest Neighbors (ANN) stage and a re-ranking stage. The authors conducted a series of comparative experiments on publicly available datasets, varying the feature dimensions and query lengths while ensuring time efficiency, and achieved excellent results. Through these improvements, identification performance on short queries can be significantly enhanced, giving the method far-reaching practical importance for short-form cover song identification.
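For intuition, the skeleton of such a two-stage pipeline might look as follows. This is an assumed structure, not ByteCover3's actual implementation: the coarse stage would use a real ANN index in practice, and the placeholder fine_score stands in for local-feature alignment.

```python
# Two-stage retrieval sketch: fast coarse shortlist, then expensive re-ranking.
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

db = l2norm(np.random.randn(10_000, 128))   # global embeddings of full songs
query = l2norm(np.random.randn(128))        # embedding of a short query clip

# Stage 1: coarse shortlist by cosine similarity (an ANN index in practice).
shortlist = np.argsort(db @ query)[::-1][:100]

# Stage 2: re-rank the shortlist with a finer matcher; this placeholder stands
# in for aligning the query's local features against each candidate.
def fine_score(cand_id):
    return db[cand_id] @ query

reranked = sorted(shortlist, key=fine_score, reverse=True)
```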
Hu et al. [8] introduced a new model called LyraC-Net, which learns robust representations of different versions of songs through joint training objectives. Specifically, the model combines loss functions from both classification and metric learning, including PNL (Positive and Negative Learning) and triplet loss; this joint training strategy aims to produce representations that are more robust and less prone to overfitting. Unlike traditional ResNet architectures, the model uses a Wide Residual Network (WideResNet) [24] as the backbone for the Cover Song Identification (CSI) task, which demonstrated strong performance in experiments. Additionally, they applied SpecAugment [13], a popular augmentation method from speech recognition, during training. They also collected approximately 670,000 music samples from the SecondHandSongs website as training data, labeled SHS600K-TRAIN; SHS100K-VAL and SHS100K-TEST were used as evaluation datasets and were not included in SHS600K-TRAIN. Meanwhile, a smaller set of 73,823 music samples from the SecondHandSongs website served as training data, named SHS100K-TRAIN*. On the same test data, the model trained on SHS100K-TRAIN* achieved mAP = 0.765, P@10 = 0.478, and MR1 = 48.32, while the model trained on SHS600K-TRAIN achieved mAP = 0.884, P@10 = 0.528, and MR1 = 32.50. The model trained on the expanded dataset thus showed substantial improvements on public test sets, a phenomenon observed not only with LyraC-Net but also with other models: during experimentation, the authors found that adding more training data significantly improved results.
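As an illustration of SpecAugment-style masking, torchaudio provides ready-made transforms for random frequency and time masks; the mask sizes and input shape below are assumptions for the example, not the paper's settings.

```python
# SpecAugment sketch: random frequency and time masking on a spectrogram-like
# input of shape (channel, frequency bins, frames).
import torch
import torchaudio.transforms as T

spec = torch.randn(1, 84, 400)               # e.g. a CQT feature map
augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=12),  # zero out up to 12 frequency bins
    T.TimeMasking(time_mask_param=40),       # zero out up to 40 frames
)
augmented = augment(spec)                    # same shape, randomly masked
```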

DATASET
To evaluate model performance, researchers commonly use publicly available benchmark datasets, which provide a standardized platform for comparing different algorithms or models. In this context there are four such benchmark datasets: SHS100K [20], Covers80 [4], Da-TACOS [21], and YouTube350 [16]. A detailed introduction to each follows. The SHS100K dataset was collected from the SecondHandSongs website and comprises 8,858 unique songs and 108,523 recordings, encompassing various cover versions. The dataset is divided into training, validation, and test sets in an 8:1:1 ratio: the training set is used for model training, the validation set for tuning and optimizing hyperparameters, and the test set for the final evaluation, yielding reliable estimates of how a model will perform on real-world data. In summary, SHS100K is a large music dataset widely used in research, particularly for music information retrieval and music recognition tasks. By adopting consistent configurations and dataset partitioning, researchers can more accurately compare the performance of different models or algorithms, driving advancements in the field.
Covers80 is a benchmark dataset widely used in the music field. Its value lies in providing a diverse collection of music samples to help researchers evaluate and improve models and algorithms for various music-related tasks. The dataset consists of 80 songs, each with two versions, for a total of 160 distinct audio recordings.
The Da-TACOS dataset is a music-related dataset consisting of 15,000 performances in total. Of these, 13,000 belong to 1,000 cliques (groups of versions of the same underlying work), with each clique containing 13 performances, showcasing a diverse range of musical styles and expressions; the remaining 2,000 performances do not belong to any clique. The dataset is designed to support cover song identification and related music similarity research, providing a rich and diverse collection of samples that allows in-depth exploration of issues in the field and further advancements in model performance for music-related tasks.
The YouTube350 dataset is a music-related dataset consisting of 50 songs spanning various genres, with each song having 7 versions, including 2 original versions and 5 cover versions, for a total of 350 audio recordings. It is designed to support music information retrieval research, providing a challenging experimental platform for evaluating how well models extract essential musical information from music videos. Because the dataset covers different music types and genres, it is suitable for improving the performance of music search engines, music recommendation systems, and other music-related applications.
During the retrieval phase, a cosine distance metric is used to estimate the similarity between two musical performances. Following the evaluation protocol of the MIREX Audio Cover Song Identification contest (https://www.music-ir.org/mirex/wiki/2020:Audio_Cover_Song_Identification), the reported evaluation metrics include mean average precision (mAP), precision at 10 (P@10), and the mean rank of the first correct identification (MR1).
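For concreteness, the three metrics can be computed from per-query ranked relevance judgments as in the sketch below, where each boolean array marks which ranked candidates are true covers of the query; the data layout is an assumption for the example.

```python
# Computing mAP, P@10, and MR1 from ranked retrieval results.
import numpy as np

def average_precision(relevant):
    """relevant: boolean array over the ranked candidate list for one query."""
    hits = np.flatnonzero(relevant)             # 0-based ranks of true covers
    if len(hits) == 0:
        return 0.0
    # Precision at each hit: k-th relevant item found at rank hits[k-1] + 1.
    precisions = (np.arange(len(hits)) + 1) / (hits + 1)
    return precisions.mean()

def evaluate(ranked_relevance):
    """ranked_relevance: list of boolean arrays, one per query."""
    mAP = np.mean([average_precision(r) for r in ranked_relevance])
    p10 = np.mean([r[:10].mean() for r in ranked_relevance])
    # MR1: 1-based rank of the first correct hit, averaged over queries.
    mr1 = np.mean([np.flatnonzero(r)[0] + 1
                   for r in ranked_relevance if r.any()])
    return mAP, p10, mr1
```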

CONCLUSION
In summary, cover song identification technology is an important and highly regarded research direction in the field of music information retrieval. Through the recognition and similarity analysis of musical works, these technologies provide powerful tools and solutions for the music industry, copyright protection, music recommendation, and more. From our review, we can draw the following conclusions and viewpoints.
Firstly, existing cover song identification technologies have made significant progress. Researchers have achieved high accuracy by employing feature extraction based on audio signals, machine learning algorithms, and deep learning models. These technologies offer effective tools for music creators, music platforms, and copyright organizations to manage and protect musical works.
Secondly, cover song identification technology also holds extensive prospects for music research. It can be applied in fields such as music style analysis, music recommendation, music education, and cultural studies. These technologies aid in a deeper understanding of the evolution and influence of music, providing robust tools for music research.
However, despite substantial progress, cover song identification technology still faces several challenges and limitations. For instance, efficiency concerns when handling large-scale music databases, the complexity of cross-cultural cover songs, and the fusion of multimodal data all require further research. Moreover, as music creation and distribution methods continue to evolve, new issues and challenges will arise.
Therefore, future research directions include, but are not limited to, improving the efficiency and accuracy of cover song identification technology, exploring methods for integrating multimodal data, and adapting to the ever-changing music industry and cultural environment. Additionally, interdisciplinary collaboration will play a crucial role in addressing complex cover song identification problems; combining expertise from fields such as computer science, musicology, and cultural studies has the potential to achieve even more exciting results.
In conclusion, cover song identification technology holds a significant position in the field of music information retrieval. It not only contributes to the management and protection of the music industry but also provides powerful tools for music research and cultural understanding. We look forward to future research continuously advancing this field, addressing new challenges, and bringing forth innovation and insights for the music community and academia.