Music4All-Onion — A Large-Scale Multi-Faceted Content-Centric Music Recommendation Dataset

When we appreciate a piece of music, it is most naturally because of its content, including rhythmic, tonal, and timbral elements as well as its lyrics and semantics. This suggests that the human affinity for music is inherently content-driven. Such information is, however, still frequently neglected by mainstream recommendation models based on collaborative filtering, which rely solely on user-item interactions to recommend items to users. A major reason for this neglect is the lack of standardized datasets that provide both collaborative and content information. The work at hand addresses this shortcoming by introducing Music4All-Onion, a large-scale, multi-modal music dataset. The dataset expands the Music4All dataset by including 26 additional audio, video, and metadata characteristics for 109,269 music pieces. In addition, it provides a set of 252,984,396 listening records of 119,140 users, extracted from the online music platform Last.fm, which allows leveraging user-item interactions as well. We organize the distinct item content features in an onion model according to their semantics, and perform a comprehensive examination of the impact of different layers of this model (e.g., audio features, user-generated content, and derivative content) on content-driven music recommendation, demonstrating how various content features influence the accuracy, novelty, and fairness of music recommendation systems. In summary, with Music4All-Onion, we seek to bridge the gap between collaborative filtering music recommender systems and content-centric music recommendation requirements.


INTRODUCTION AND MOTIVATION
With the spiraling increase of digital content available to users, and likewise of interaction data between users and content items, Recommender Systems (RSs) have become ubiquitous. Compared to other domains, Music Recommender Systems (MRSs) are characterized by a large item-set size and a high sparsity of user-item interactions, making them prone to issues such as the cold-start problem and popularity biases. These issues are often mitigated with Content-Based Recommenders (CBRs), which leverage item features, as opposed to Collaborative Filtering (CF), which relies entirely on user-item interaction data. Music consumption is also characterized by the fact that human music perception happens at different levels of semantics, and often involves not only the listened-to audio signal but also textual or visual input. Additionally, owing to developments in Music Information Retrieval (MIR), many techniques [4,12] allow the audio-based extraction of features characterizing music items at different semantic levels. Because of these aspects, MRSs are particularly apt for research on CBRs [7][8][9][10].
One big obstacle to the development of advanced content-based MRSs is the lack of comprehensive, standardized, and large-scale datasets providing features that characterize the items at different semantic levels. Another is understanding how feature semantics affect recommendation, which requires a categorization of features depending on their semantic charge. We address both points in this paper. First, we present Music4All-Onion, a dataset that enhances the established Music4All [29] and LFM-2b [30] datasets by including several additional item features. Second, we propose an onion model to organize item features according to their semantics, thus helping the interpretation of the impact of item features on the recommendation task. We benchmark these newly categorized features, in terms of accuracy and beyond-accuracy metrics, by comparing the performance of CBRs fueled by these features among each other and with pure CF models. Our analysis shows that content features improve recommendation accuracy with respect to pure CF, and that multi-modal CBRs, leveraging several features simultaneously, achieve the best performance. We also show that the optimal selection of content features depends on the objective of the MRS, e.g., maximizing accuracy, diversity, or fairness.
Our contribution is, therefore, three-fold: First, we introduce Music4All-Onion, a large-scale multi-faceted dataset for music recommendation. Second, we propose an onion model for categorizing features according to their semantics. Based on these two contributions, we show how multi-modality improves recommendation, and how different features can be leveraged to optimize for accuracy and beyond-accuracy metrics. We provide the Music4All-Onion dataset and the accompanying source code for the conducted experiments at http://www.cp.jku.at/datasets/Music4All-Onion.

RELATED RESOURCES
While there are many datasets in the fields of RSs and MIR, only a few combine multiple modalities; those that are publicly available differ vastly in terms of size and covered modalities (see Table 1).
The AcousticBrainz (AB) Genre dataset [3] provides audio features (via AcousticBrainz) and genre information for up to 1,935,991 songs. Genre labels are collected from four different sources and organized hierarchically into main genres and subgenres. The ALF-200k dataset [42] combines acoustic and lyrics features, resulting in 176 high-level (HL) features for each of the 226,747 included tracks. The Multimodal Music dataset (MuMu) [23,24] is based on the Million Song Dataset (MSD) [1] and the Amazon Reviews dataset [20] and encompasses 147,295 songs. It combines information on album purchases (recovered from Amazon) with information on individual tracks, providing audio features (extracted from AcousticBrainz), multi-label genre annotations, album reviews, the average rating per album, the selling rank, similar products, and the URL of the album cover image. The MusiClef dataset [31] contains 1,355 popular music songs and provides high- and low-level (LL) audio features (e.g., MFCCs and block-level features), genre and mood labels manually annotated by domain experts, Last.fm tags, and textual artist descriptions crawled from various websites. The MIREX mood dataset [25] is based on the mood tags used in the MIREX mood classification task [14]. Songs annotated with these tags are retrieved from AllMusic and extended with mood tags, lyrics, and MIDI data for each of the 193 songs. The University of Rochester Multimodal Music Performance (URMP) dataset [17] provides 44 multi-instrument classical music pieces; for each piece, it contains the audio of the individual instrument tracks, the musical score in MIDI format, the audio and video recordings of the assembled mixture, and frame- and note-level pitch annotations. The proposed Music4All-Onion dataset both interlinks and substantially extends the established Music4All [29] and LFM-2b [30] datasets: Music4All provides 109,269 songs together with high-level acoustic features (extracted from Spotify), genres, and Last.fm tags. In contrast to existing datasets, Music4All-Onion is large-scale and provides features extracted from audio, video, and metadata for 109,269 music tracks. In addition, it includes 252,984,396 listening records of 119,140 users of Last.fm. This combination of rich content features across multiple modalities and extensive collaborative information (listening records) makes it a unique resource for RS research.

ONION MODEL OF MUSIC FEATURES
We present an onion model (depicted in Figure 1), based on the categorization proposed by Deldjoo et al. [10], to organize music content features into layers that reflect a transition from highly objective features with a low semantic charge (the inner layers) to more subjective and semantically meaningful features (the outer layers). The innermost layer corresponds to features extracted from the raw audio signal, typically with traditional MIR signal-processing techniques. Features in the Embedded Metadata (EMD) layer comprise descriptive and technical metadata such as artist, track, and album name, or lyrics. Expert-Generated Content (EGC) refers to attributes assigned by, or filtered with information from, users with training or experience in the music domain, while User-Generated Content (UGC) encompasses information attached to items by general users. The Derivative Content (DC) layer refers to works created in relation to the original.

THE MUSIC4ALL-ONION DATASET
In this section, we describe the features provided by Music4All-Onion, categorized according to the onion model introduced in Section 3. These are summarized in Table 2. We also describe the additional set of listening events obtained by matching Music4All-Onion with the popular LFM-2b dataset [30].

Audio
Low-level (LL) and high-level (HL) features are extracted from the audio signal. We divide these features into short-term and block-level features, depending on the length of the audio sequence they are computed on. Short-term features are computed on short frames of the signal, e.g., MFCCs and the i-vectors derived from them. Block-level features [35] are computed on longer sequences (several seconds) of spectrograms, and then aggregated using percentiles. We compute the six features defined in [35], capturing spectral, harmonic, rhythmic, and tonal music characteristics.
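To make the block-level scheme concrete, below is a minimal sketch: a per-block descriptor is computed over fixed-length spectrogram blocks and summarized at track level with percentiles. The block length, hop size, and the spectral-flux descriptor are illustrative assumptions, not the exact parameters of the six features in [35].

```python
import numpy as np
import librosa

# Load the track and compute a magnitude spectrogram.
y, sr = librosa.load("track.mp3", sr=22050)        # hypothetical input file
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Cut the spectrogram into overlapping blocks spanning several seconds.
block_len, hop = 256, 128                          # illustrative assumptions
blocks = [S[:, i:i + block_len]
          for i in range(0, S.shape[1] - block_len + 1, hop)]

# Example per-block descriptor: average positive spectral flux.
def spectral_flux(block):
    diff = np.diff(block, axis=1).clip(min=0)
    return float(np.mean(np.sum(diff ** 2, axis=0)))

values = np.array([spectral_flux(b) for b in blocks])

# Aggregate the per-block values to track level using percentiles.
track_feature = np.percentile(values, [10, 25, 50, 75, 90])
```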

Embedded Metadata (EMD)
In addition to existing artist, album, and track names, we extract different representations of song lyrics.

Lyrics embeddings.
We provide two vector representations of the preprocessed lyrics: word2vec and tf-idf. The word2vec representation is obtained by first mapping each word to its 300-dimensional pre-trained word2vec embedding [21], and then averaging each component over the set of words. For tf-idf, tf is defined in terms of absolute word counts, while idf is defined as

idf(t) = log(N / df(t))    (1)

with N being the total number of tracks and df(t) the number of lyrics documents (one document per track) in which term t appears. The resulting tf-idf vectors are ℓ2-normalized.
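As an illustration, a minimal sketch of both lyrics representations, assuming gensim with the pre-trained Google News word2vec vectors and scikit-learn for tf-idf; note that scikit-learn's idf deviates from Equation (1) by an additive constant:

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.feature_extraction.text import TfidfVectorizer

# One (preprocessed) lyrics document per track; toy examples only.
lyrics = ["love shines bright tonight", "rain falls on the empty street"]

# word2vec: average the 300-dimensional embeddings of in-vocabulary words.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def track_embedding(text):
    vecs = [w2v[w] for w in text.split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

embeddings = [track_embedding(doc) for doc in lyrics]

# tf-idf with l2 normalization; with smooth_idf=False, scikit-learn uses
# ln(N/df) + 1, i.e., Equation (1) up to an additive constant.
vectorizer = TfidfVectorizer(norm="l2", smooth_idf=False)
tfidf = vectorizer.fit_transform(lyrics)
```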

Lyrics emotions.
Emotional content is represented by mapping words from the lyrics onto valence, arousal, and dominance values according to the extended Affective Norms for English Words lexicon [41]. Words not present in this lexicon are mapped using the National Research Council Canada (NRC) lexicon [22]. In addition, we compute the polarity compound measure according to the Valence Aware Dictionary for sEntiment Reasoning (VADER) [15] using NLTK [2]. The word-level values are aggregated over each track's lyrics in a bag-of-words fashion.
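A minimal sketch of the VADER compound computation via NLTK (the mapping onto the ANEW and NRC lexica is omitted here); the example lyrics line is hypothetical:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()
line = "the sun is shining and everything feels alright"  # hypothetical lyrics
# The compound score is a normalized polarity value in [-1, 1].
compound = sia.polarity_scores(line)["compound"]
```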

Expert-Generated Content (EGC)
The maintainers of the original Music4All dataset infer track genres by filtering the tags appearing on each track's Last.fm page against the genres defined on Every Noise at Once, and provide the resulting genre list per track. We convert these lists into tf-idf representations, excluding genres that are associated with only one track. The tf of a specific genre of a track is defined as one divided by the number of genres attached to the track; idf is defined as in Equation (1).
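A minimal sketch of this genre tf-idf computation, using a hypothetical track-to-genre mapping and idf as in Equation (1):

```python
import math
from collections import Counter

# Hypothetical mapping: track id -> genre list after the Every Noise at Once filtering.
genres_per_track = {"t1": ["rock", "indie"], "t2": ["rock", "pop"], "t3": ["pop"]}

# Document frequency per genre; genres attached to only one track are excluded.
df = Counter(g for genres in genres_per_track.values() for g in set(genres))
df = {g: c for g, c in df.items() if c > 1}
N = len(genres_per_track)

tfidf = {}
for track, genres in genres_per_track.items():
    # tf = 1 / number of genres attached to the track; idf = log(N / df).
    tfidf[track] = {g: (1 / len(genres)) * math.log(N / df[g])
                    for g in genres if g in df}
```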

User-Generated Content (UGC)
The Music4All dataset comes with lists of track tags crawled from the Last.fm website. These lists give no information on how often each tag was associated with a track. To fill this gap, we provide the tags retrieved with the Last.fm API, which attaches a weight (∈ {0, . . . , 100}) to each tag depending on the frequency of its occurrence for the track under consideration, i.e., how many users assigned the tag to the track. We convert the tags into a tf-idf representation by first filtering out tags with more than 50 characters (to remove tags consisting of sentences or extracts of the lyrics) and tags appearing in fewer than 5 tracks (to remove tags that are only meaningful to a very restricted subset of users). We then transform the tags of each track into a tf-idf representation, where tf is defined as the Last.fm tag weight divided by the sum of all tag weights of the track, and idf as in Equation (1).
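The tag tf-idf follows the same pattern, with the Last.fm weights taking the role of term counts; a minimal sketch with hypothetical data (with real data, far more tags survive the filtering):

```python
import math
from collections import Counter

# Hypothetical mapping: track id -> {tag: Last.fm weight in 0..100}.
tags_per_track = {"t1": {"rock": 100, "seen live": 34},
                  "t2": {"rock": 57, "mellow": 12}}

# Filter tags: at most 50 characters, occurring in at least 5 tracks.
df = Counter(t for tags in tags_per_track.values() for t in tags)
valid = {t for t, c in df.items() if c >= 5 and len(t) <= 50}
N = len(tags_per_track)

tfidf = {}
for track, tags in tags_per_track.items():
    kept = {t: w for t, w in tags.items() if t in valid}
    total = sum(kept.values()) or 1  # avoid division by zero
    # tf = weight / sum of the track's tag weights; idf = log(N / df).
    tfidf[track] = {t: (w / total) * math.log(N / df[t])
                    for t, w in kept.items()}
```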

Derivative Content (DC)
Since official music videos frequently feature additional artistic contributions, such as those of directors or occasionally actors, and since many YouTube videos do not correspond to the official ones but rather are covers or videos created by YouTube users, we consider videos of songs uploaded to YouTube to be DC. For 98,877 of the 109,269 tracks, YouTube videos are available; we download them and extract image frames at a rate of 1 Hz. Each frame is then converted into three different vector representations using pre-trained versions of VGG19 [38], Inception v3 [39], and ResNet [13], and aggregated to track level using maximum and mean.
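As an illustration of the frame-embedding step, a minimal sketch using torchvision's pre-trained VGG19 (the Inception v3 and ResNet variants are analogous); the file names and the choice of the 4096-dimensional penultimate layer are assumptions:

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Pre-trained VGG19; drop the final classification layer to obtain
# 4096-dimensional frame embeddings.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
vgg.classifier = vgg.classifier[:-1]
vgg.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

frames = ["frame_000.jpg", "frame_001.jpg"]  # frames extracted at 1 Hz
with torch.no_grad():
    embs = torch.stack([
        vgg(preprocess(Image.open(f).convert("RGB")).unsqueeze(0)).squeeze(0)
        for f in frames])

# Aggregate frame embeddings to track level with mean and maximum.
track_mean, track_max = embs.mean(dim=0), embs.max(dim=0).values
```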

Table 3: Performance of the recommenders in terms of accuracy and beyond-accuracy metrics, sorted in descending order of NDCG. For each metric, the best value is marked in bold, the second-best in italic, and the worst underlined.

BENCHMARKING
We showcase the impact of Music4All-Onion by comparing the performance of Bilateral Variational Autoencoders (BiVAEs) [40] that leverage features from different layers of the onion model (i-vectors with 256 components, tf-idf representations of lyrics, genres, and tags, and the VGG19 representation of videos) to learn the priors of the item VAE. This CBR is built on VAEs, which have proven successful for recommendation tasks [18,19,37,40]. Furthermore, we consider two CF models: BiVAE with Gaussian priors (i.e., not leveraging any item feature) and matrix factorization with Bayesian Personalized Ranking (BPR) [27], as well as a non-personalized algorithm recommending the overall most popular items to all users (MostPop). In addition to accuracy metrics (NDCG@10 and Recall@10), we include several beyond-accuracy metrics, defined in [32], which we briefly describe here. Item- and user-entropy measure how well relevant recommendations are spread over the set of items and users, respectively, with higher entropy values indicating fairer recommenders [6]. Coverage is the fraction of items in the catalog appearing at least once in the top-10 recommendations. Novelty measures how likely the recommender is to recommend unpopular items. The BiVAE models and BPR are trained using the Cornac library [28]; for the evaluation, we rely on our own implementation of the metrics, since Cornac neither provides beyond-accuracy metrics nor allows evaluating models that are not included in the library, a feature required for the optimization of and comparison with the aggregated model introduced below. The dimensionality of all latent representations is set to 10. For BiVAE, the encoders consist of a hidden layer with 20 nodes and tanh activation, and are trained for 100 epochs with a batch size of 128 and a learning rate of 0.001. The regularization hyperparameter of BPR is set to 0.01, and the model is trained for 200 iterations. To evaluate the impact of multi-modality on recommendation, we also consider a late fusion of all BiVAE-based models through a generalization of Borda-count rank aggregation that we name Truncated Weighted Borda (TWB) [5]: ranking points are assigned to the top-50 items of every individual model and weighted with ℓ1-normalized weights. The combination of weights is optimized for NDCG by performing a grid search with weights ∈ {0, 0.2, . . . , 1}.
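A minimal sketch of TWB under the setup above; the exact Borda point scheme (here, k − rank points for the item at a given rank) is an assumption:

```python
import numpy as np
from itertools import product

def twb(rankings, weights, k=50):
    """Truncated Weighted Borda: each model awards Borda points to its
    top-k items; points are combined with l1-normalized model weights."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # l1 normalization
    scores = {}
    for ranking, w in zip(rankings, weights):
        for rank, item in enumerate(ranking[:k]):
            scores[item] = scores.get(item, 0.0) + w * (k - rank)
    return sorted(scores, key=scores.get, reverse=True)

# Grid search over the six model weights in {0, 0.2, ..., 1}, skipping the
# all-zero combination; each candidate is evaluated on validation NDCG.
grid = [w for w in product(np.linspace(0, 1, 6), repeat=6) if sum(w) > 0]
```

Here, `rankings` would contain, for a given user, the top-50 lists of the six individual models (audio, lyrics, genres, tags, video, CF).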
The set of songs consists of the 79,072 items for which all features are available. The corresponding listening events of Music4All (4.2M) are binarized by assigning 1 to the (user, item) pairs with listening counts greater than or equal to 2, resulting in 707,284 positive user-item interactions. These are split into a train (60%), a validation (20%), and a test (20%) set. The weights for TWB are optimized on the validation set, and all reported results refer to the test set. From the results in Table 3, the following conclusions can be drawn. In terms of accuracy, the best values of NDCG and recall are achieved by the aggregated model, with a combination of weights (w_Aud, w_Lyr, w_Gen, w_Tag, w_Vid, w_CF) = (0.2, 0.2, 0, 0.2, 0.4, 0), indicating a positive impact of multi-modality on music recommendation. The results show a trade-off between accuracy and beyond-accuracy/fairness metrics, indicating that different modalities can be leveraged depending on the evaluation dimension to be optimized, which in turn may depend on the interests of the various stakeholders of the MRS. Optimizing consumer-side measures such as accuracy and user-fairness may result in the selection of TWB or video features, which negatively impacts provider-centric metrics (coverage, item-fairness). For a two-sided equitable ecosystem, the system designer may choose the weights accordingly.
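For reproducibility, a minimal sketch of the interaction binarization and splitting described at the beginning of this section; the column names and the random per-interaction split are assumptions:

```python
import pandas as pd

# Hypothetical listening-event table with columns [user, item, count].
events = pd.read_csv("listening_events.csv")

# Binarize: keep (user, item) pairs with at least 2 listening events.
positives = events[events["count"] >= 2][["user", "item"]]

# 60/20/20 train/validation/test split over the positive interactions.
shuffled = positives.sample(frac=1, random_state=42)
n = len(shuffled)
train = shuffled.iloc[:int(0.6 * n)]
valid = shuffled.iloc[int(0.6 * n):int(0.8 * n)]
test = shuffled.iloc[int(0.8 * n):]
```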

CONCLUSIONS
We introduce Music4All-Onion, a large-scale multi-modal dataset providing 26 content features for 109,269 songs. We also propose an onion model to organize these features according to their differing semantic charge. A set of experiments shows that content-based MRSs should leverage different features, depending on which objective is to be optimized, and that content-based music recommenders tend to outperform pure CF algorithms in terms of accuracy, with multi-modal variants achieving the best performance.