When the Music Stops: Tip-of-the-Tongue Retrieval for Music

We present a study of Tip-of-the-tongue (ToT) retrieval for music, where a searcher is trying to find an existing music entity but is unable to succeed because they cannot accurately recall important identifying information. ToT information needs are characterized by complexity, verbosity, uncertainty, and possible false memories. We make four contributions. (1) We collect a dataset, $ToT_{Music}$, of 2,278 information needs and ground-truth answers. (2) We introduce a schema for these information needs and show that they often involve multiple modalities encompassing several Music IR subtasks, such as lyric search, audio-based search, audio fingerprinting, and text search. (3) We underscore the difficulty of this task by benchmarking a standard text retrieval approach on this dataset. (4) We investigate the efficacy of query reformulations generated by a large language model (LLM) and show that they are not as effective as simply employing the entire information need as a query, leaving several open questions for future research.


INTRODUCTION
The Tip-of-the-tongue (ToT) retrieval task involves identifying a previously encountered item for which a searcher is unable to recall a reliable identifier. ToT information needs are characterized by verbosity, use of hedging language, and false memories, making retrieval challenging [1,4]. As a consequence, searchers resort to communities like r/TipOfMyTongue and WatZatSong, where they can post descriptions of items that they know exist but cannot find, relying on other users for help. Recent research on ToT information needs has explored how searchers pose these requests in specific domains like movies [1,4] or games [24]. Music-ToT, however, is under-explored despite being frequent: it represents 18% of all posts made in a five-year period in the r/TipOfMyTongue community (cf. §3.1). Our work is motivated by the need to understand how such requests are expressed in the music domain.
We examined the r/TipOfMyTongue community, focusing on requests looking for musical entities like albums, artists, or songs. We show that these requests often refer to multiple modalities (cf. §4) and thus encompass a broad set of retrieval tasks: audio fingerprinting, audio-as-a-query, lyric search, etc. In our work, we focus on song search. We create $ToT_{Music}$: the dataset consists of 2,278 solved information needs pertaining to a song, each of which is linked to the corresponding correct answer in the publicly available Wasabi Corpus [7]. Using $ToT_{Music}$, we develop a schema for Music-ToT information needs to reveal what information is contained in them (cf. §3.2). In addition, we are interested in the extent to which standard text retrieval approaches are able to deal with ToT queries. To this end, we benchmark a subset of $ToT_{Music}$ information needs on the Wasabi corpus, as well as on Spotify search. Across both settings, the low effectiveness of our evaluated retrieval methods, compared to non-ToT queries, underscores the necessity of novel methods to tackle this task. Lastly, we conduct a preliminary study on reformulating Music-ToT queries using GPT-3 [5]; we find that the task remains very challenging.

BACKGROUND
Tip-of-the-tongue (ToT) retrieval is related to known-item retrieval (KIR) or item re-finding [31]; however, ToT queries are typically issued only once, not multiple times, and, importantly, lack concrete identifiers, instead relying on verbose descriptions, frequently expressed uncertainty, and possible false memories [1,4,16,24]. Approaches for simulating such queries [3,13,26] may lack realistic phenomena like false memories [19,20], necessitating the collection of real-world data. Large-scale data is available for only one domain, movies [4]; smaller-scale datasets are available for games [24] and movies [1]. Hagen et al. [16] collect a corpus of general known-item queries, including music; however, their focus was on general known-item queries and false memories, and they did not report retrieval experiments. Our focus is on the music domain, examining the modalities employed by searchers and how they express Music-ToT queries. We build upon Arguello et al. [1] and Bhargav et al. [4], with key differences in (1) the domain (music), (2) the corpus size (millions of items instead of thousands), and (3) reformulation experiments utilizing an LLM. Music-ToT also relates to several research areas in Music IR (MIR).
Other modalities like videos may need to be handled as well, necessitating multi-modal or cross-modal (retrieving one modality using another) methods [33], e.g., retrieving audio using video [23,34]. Approaches to solving Music-ToT have to account for multiple modalities and free-form natural language, including noise, e.g., uncertainty [1] and/or false memories [1,24].

METHODOLOGY

3.1 Data Collection
Gathering Music-ToT posts. We gathered posts made across 2017-2021 in the r/TipOfMyTongue community, yielding 503,770 posts (after filtering out posts not marked Solved or Open), each containing two fields: title and description. We extracted text categories from the title, e.g., SONG from "[SONG] Slow dance song about the moon?". We manually identified a set of 11 overarching music-focused categories (e.g., Music Video, Band, Rap Music). We discarded the remaining non-music posts, resulting in 94,363 Music-ToT posts (60,870 solved and 33,493 unsolved). These posts form a large proportion (18.73%) of the 503K posts we started out with.
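The title-based category extraction described above can be sketched as follows; the regular expression and the category subset are illustrative assumptions rather than the exact implementation used in this work:

```python
import re

# Pull bracketed category tags such as "[SONG]" from a post title.
TAG_PATTERN = re.compile(r"\[([^\]]+)\]")

# Illustrative subset of the 11 music-focused categories (assumed names).
MUSIC_CATEGORIES = {"SONG", "MUSIC VIDEO", "BAND", "RAP MUSIC"}

def extract_categories(title: str) -> list[str]:
    """Return upper-cased bracketed tags found in a post title."""
    return [tag.strip().upper() for tag in TAG_PATTERN.findall(title)]

def is_music_post(title: str) -> bool:
    """A post is music-focused if any of its tags is a music category."""
    return any(tag in MUSIC_CATEGORIES for tag in extract_categories(title))
```

Posts whose tags fall outside the music categories would then be discarded, leaving only the Music-ToT subset.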

Extracting $ToT_{Music}$. We extracted answers from Solved posts following Bhargav et al. [4], retaining Solved posts that have a URL as an answer. If the URL points to a track on Spotify, obtaining the answer is trivial. Otherwise, the title portion of the markdown inline URL, formatted as [title](url) (with title often formatted as 'Artist - Song'), was used as a query to the Spotify search API. Since the API returns multiple results, we created a classifier with 31 features based on the scores of the retriever, the edit distances between the title and the artist name, song title, etc. We used the classifier to predict whether a title matches the track and artist, achieving 100% precision on a held-out set of 100 samples. Low-confidence candidates were filtered out. This left us with a set of 4,342 posts with Spotify tracks as answers. Lastly, we only retained those posts where the ISRC of the answer track is also present in the Wasabi Corpus [7]: a total of 2,278 posts. We call this collection $ToT_{Music}$.
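A minimal sketch of the answer-linking step is shown below. The `difflib` similarity ratio stands in for the edit-distance features, and the 'Artist - Song' parsing is an assumption based on the format described above; the actual classifier uses 31 features and retriever scores, which are omitted here:

```python
import re
from difflib import SequenceMatcher

# Matches markdown inline links like [title](url).
INLINE_LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def parse_inline_link(markdown: str):
    """Extract (title, url) from a markdown inline link, or None."""
    m = INLINE_LINK.search(markdown)
    return (m.group(1), m.group(2)) if m else None

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1] (stand-in for edit distance)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_features(link_title: str, candidate: dict) -> dict:
    """Features comparing an 'Artist - Song' link title to one API candidate."""
    parts = [p.strip() for p in link_title.split("-", 1)]
    artist_part = parts[0]
    song_part = parts[1] if len(parts) > 1 else parts[0]
    return {
        "artist_sim": similarity(artist_part, candidate["artist"]),
        "song_sim": similarity(song_part, candidate["title"]),
        "full_sim": similarity(link_title, f"{candidate['artist']} - {candidate['title']}"),
    }
```

Features like these would be fed to the classifier, and candidates below a confidence threshold filtered out.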
Gathering reformulations. We gathered reformulations for all posts in $ToT_{Music}$ by prompting GPT-3 [5] with the respective post description and a word count limit: <post description> Summarize the query above to <N> words, focusing on musical elements. We used N ∈ {10, 25, 50}. We also employed a prompt without a specific word limit: <post description> Shorten the query above, focusing on musical elements.
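The prompt construction can be sketched as follows; the helper names are hypothetical and the model invocation is omitted, only the prompt strings from the templates above are built:

```python
# Word limits used for the length-constrained reformulation prompts.
WORD_LIMITS = (10, 25, 50)

def limited_prompt(description: str, n: int) -> str:
    """Prompt asking for an N-word summary focused on musical elements."""
    return f"{description} Summarize the query above to {n} words, focusing on musical elements."

def unlimited_prompt(description: str) -> str:
    """Prompt asking for a shortened query without a specific word limit."""
    return f"{description} Shorten the query above, focusing on musical elements."

def all_prompts(description: str) -> list[str]:
    """All four reformulation prompts for one post description."""
    prompts = [limited_prompt(description, n) for n in WORD_LIMITS]
    prompts.append(unlimited_prompt(description))
    return prompts
```

Each post description thus yields four reformulations: three with word limits and one unconstrained.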

3.2 Music-ToT Schema
Our annotation process involved three steps. We first developed and then refined a schema to describe Music-ToT information needs; in the final step, we annotated 100 samples from $ToT_{Music}$.
Developing the schema in two steps. A preliminary study, conducted with one author (self-rated music expertise 7 out of 10) and two volunteers (music expertise 8/10 and 7/10, respectively), involved assigning one or more labels to 78 sentences from 25 randomly sampled posts from $ToT_{Music}$. We focused on developing new labels specific to Music-ToT, while also re-using labels from Arguello et al. [1]: specifically, the Context labels, pertaining to the context an item was encountered in (Temporal Context, Physical Medium, Cross Media, Contextual Witness, Physical Location, Concurrent Events), and the Other annotations (Previous Search, Social, Opinion, Emotion, Relative Comparison). The latter are generally applicable across ToT information needs. This preliminary study yielded 25 new music labels, in addition to 11 labels from prior work (6 × Context and 5 × Other). In the second step, the three authors of this paper (self-rated music expertise 7, 6, and 5, respectively) labeled 110 sentences (20 posts from $ToT_{Music}$) to validate the schema. Based on our results and discussions, we combined a few finer-grained categories with low support into more general categories; e.g., specific musical elements like Rhythm/Repetition, Melody, Tempo, etc., were combined into Composition, resulting in 28 labels in total.
Annotating. Lastly, in step 3, two authors employed the final schema to annotate 536 sentences corresponding to 100 posts. The resulting labels, their frequency, category, and inter-rater agreement (Cohen's κ [2,9]), along with their descriptions and an example, are presented in Table 1.
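Cohen's κ, reported for the agreement figures in Table 1, corrects observed agreement for chance agreement and can be computed as a minimal sketch:

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each annotator's
    marginal label distribution.
    """
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators assign a single identical label
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement yields κ = 1, while agreement at chance level yields κ = 0 (as for the low-support Song Quality / Type label in Table 1).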

DATA ANALYSIS
We first discuss Table 1, followed by a brief discussion of the modalities present in the whole collection of Music-ToT posts.
Annotation results. Among the music-focused annotations, Genre and Composition (a description of musical elements and how they fit together) are the two most frequent labels. These are followed by Music Video Description and either direct quotes (Lyric Quote) or a description of the lyrics (Story/Lyric Description), further highlighting the different information needs that have to be addressed, i.e., lyric search, text search, and multi-modal search. However, simply extracting Genre and metadata such as Time Period/Recency, Instrument, etc., may not be useful without considering the most frequent label, Uncertainty. Search systems would therefore have to handle these elements, as well as consider potential false memories. Furthermore, annotations like Social and Opinion are also fairly common in our data, but may have limited utility for retrieval [1], motivating reformulations (cf. §3.1). Searchers also express their queries in terms of other music entities in a Relative Comparison, and describe Previous Search attempts, explicitly ruling out certain candidates. References to other modalities like user-created clips (Recording) or existing media (Embedded Music) also pose a challenge. We now explore this challenge with a brief study of references to external content in the entire collection.
Cross-modal references. Music-ToT, like other ToT domains, contains cross-modal and media references [1], where a searcher refers to external content. We show here that Music-ToT posts in particular contain such references frequently. To this end, we gathered the websites that appear frequently in the collection. One author manually labeled these as one of two content types, with a small number of posts containing references to both types (1.1%). Therefore, Music-ToT information needs are inherently multi-modal. We characterize the remaining 57.7% of queries as descriptive queries, which include references to lyrics or story descriptions (cf. §3.2). In summary, Music-ToT information needs are characterized by uncertainty and multi-modality, requiring methods like text-based audio retrieval, content-based audio retrieval/fingerprinting, and multi- or cross-modal retrieval.

BENCHMARKS

5.1 Experimental Setup
Corpora. We run experiments on two corpora. The first is the Wasabi 2.0 Corpus [6,7]. It consists of 2M commercial songs from 77K artists and 200K albums. Crucially, (1) songs have the ISRC linked, enabling linking to data in Spotify; and (2) it is an open dataset, consisting of rich information that includes lyrics, extensive metadata, and music snippets. We index the Song Name, Artist Name, and Lyrics of all songs using Elasticsearch (BM25 with default parameters). The second corpus corresponds to the Spotify US catalog, consisting of hundreds of millions of tracks. The Spotify search system [18] utilizes multiple retrieval stages (including lexical and semantic search) and incorporates historic log data for retrieval purposes.
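As a rough illustration of the Wasabi retrieval setup, the following is a self-contained BM25 sketch using Elasticsearch's default parameters (k1 = 1.2, b = 0.75) over toy documents; it is a stand-in for, not the actual, Elasticsearch implementation:

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # Elasticsearch BM25 defaults

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def bm25_rank(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """Rank docs (id -> concatenated song name, artist name, lyrics) by BM25."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized.values()) / n
    df = Counter()  # document frequency per term
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (K1 + 1) / (tf[term] + K1 * (1 - B + B * len(tokens) / avgdl))
            score += idf * norm
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In the actual setup, the three fields are indexed separately in Elasticsearch; here they are flattened into one text field for brevity.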
Queries. We conducted experiments on the 1,256 posts (849 train, 191 validation, and 216 test) from $ToT_{Music}$ that contain no URLs in the post title or post text; we make this choice because, in the most extreme case, the entire post may contain just a URL, requiring audio-based search, while we focus on text-based methods. From each post, we create different queries and label them as follows: (1) Title: using the post title only; (2) Text: using the post text only; (3) Title+Text: using both; (4) Keywords: keywords extracted from the post; and (5) Reform N: the LLM reformulations with word limit N ∈ {10, 25, 50, ∞} (cf. §3.1).
Evaluation. We report Recall@K, equivalent to Success@K (i.e., one correct answer), for K ∈ {10, 100, 1000} on Wasabi. All reported results are on the test set. For Spotify search we describe the observed trends (due to the proprietary nature of the system).
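With a single correct answer per query, Recall@K reduces to Success@K, which can be sketched as:

```python
def success_at_k(ranked_ids: list[str], answer_id: str, k: int) -> int:
    """Success@K: 1 if the single correct answer appears in the top K, else 0."""
    return int(answer_id in ranked_ids[:k])

def mean_success_at_k(runs: list[tuple[list[str], str]], k: int) -> float:
    """Average Success@K over (ranking, answer) pairs; equals Recall@K here."""
    return sum(success_at_k(ranking, answer, k) for ranking, answer in runs) / len(runs)
```

The `runs` argument is a hypothetical representation of the test set: one ranked result list and one ground-truth track id per query.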

Results
Table 2 provides an overview of our Wasabi results.
Post parts as query. The low success across queries and corpora underscores the difficulty of the task. On Wasabi, Title queries are more effective than Text queries: increased verbosity leads to retrieval failure. However, the text may indeed contain data useful for retrieval, with comparable or higher effectiveness scores for Title+Text over Title at K ∈ {100, 1000}, motivating keyword extraction: crucial details might be present in the text, but including the entire query might harm effectiveness. Our keyword selection method, though, fails to outperform other queries except for Text on S@10. On Spotify search we observe a different trend: Title+Text is the most effective query, followed by Title.
LLM reformulations as query. Examining Table 2, reformulations have limited success compared to Title queries. Reform 25 and Reform 50 perform as well as Title on S@1000, with Reform ∞ outperforming it. While Keywords beats all but Reform 25 on S@10, it is outperformed by reformulations on S@100 and S@1000. On Spotify search, we find that reformulations fare worse than Title queries on S@10, but see limited success on S@100, with Reform 25 and Reform 50 achieving higher effectiveness. Most importantly, there is no single best reformulation on either index, with varying success across metrics. We thus conclude that, in our study, reformulations generated using a state-of-the-art LLM have only mixed success.

CONCLUSIONS
We explored Tip-of-the-Tongue retrieval for music. Of the 94K posts corresponding to Music-ToT information needs from an online community for ToT requests, we linked 2,278 posts to the corresponding answers in the Wasabi corpus, resulting in $ToT_{Music}$, thus enabling further research on this challenging task.
We iteratively developed and refined a Music-ToT schema that contains 28 fine-grained labels, as shown in Table 1. Labeling 100 posts using this schema, we showed that users express uncertainty frequently and almost as often refer to other modalities. We benchmarked a subset of 1.2K descriptive queries from $ToT_{Music}$, highlighting the difficulty of the task. Future work should leverage cross- and multi-modal retrieval as well as better approaches for query reformulation.

Table 1:
Annotation Schema: label, frequency of occurrence in 100 submissions / 536 sentences (F), annotator agreement (κ), and description of each label, along with an example. Recovered rows (excerpt):
- Conveys an opinion or judgment about some aspect of the music. Example: "I don't remember the lyrics or title, only that it was a kind of angsty teen 'I want to set the world on fire'"
- Describes other people involved in the listening experience. Example: "A few years back, a friend of mine showed me an . . ."
- Example: ". . . the name of the song was brief, one nordic word. On the cover there was also a cyan teal line going along the bottom with white text in it."
- Song Quality / Type (F: 4, κ: 0.00): Describes the type of music (original/cover, live/recorded) or the production quality (professional, amateur, etc.). Example: "Live Cover of All Along the Watchtower where . . ."
- Uncertainty (F: 162, κ: 0.79): Conveys uncertainty about information described. Example: "I don't know what genre the song was, it was fairly calming and I feel like it couldve been on tiktok but I don't really know."
- Example: ". . . I heard like in a billion videos 6 years ago."

Table 2:
Overview of retrieval experiments on Wasabi, using Elasticsearch (BM25).