Large Language Model Augmented Exercise Retrieval for Personalized Language Learning

We study the problem of zero-shot exercise retrieval in the context of online language learning, to give learners the ability to explicitly request personalized exercises via natural language. Using real-world data collected from language learners, we observe that vector similarity approaches poorly capture the relationship between exercise content and the language that learners use to express what they want to learn. This semantic gap between queries and content dramatically reduces the effectiveness of general-purpose retrieval models pretrained on large scale information retrieval datasets like MS MARCO. We leverage the generative capabilities of large language models to bridge the gap by synthesizing hypothetical exercises based on the learner's input, which are then used to search for relevant exercises. Our approach, which we call mHyER, overcomes three challenges: (1) lack of relevance labels for training, (2) unrestricted learner input content, and (3) low semantic similarity between input and retrieval candidates. mHyER outperforms several strong baselines on two novel benchmarks created from crowdsourced data and publicly available data.


INTRODUCTION
Modern personalized education systems typically leverage the power of machine learning models to estimate user skill levels [7] and adaptively serve exercises to learners [8,15,37]. Adaptivity, while a critical part of any personalized education system, is a passive form of personalization from the learner's point of view: While exercises are tailored to an estimate of the learner's skill level, this customization occurs behind the scenes, with no opportunity for learners to take initiative in shaping the learning process. In this paper, we study a complementary form of learner initiated personalization in the context of online language learning. In particular, learners are given the ability to explicitly request learning content from an education system, which returns relevant exercises from a fixed catalog for the learner to do.
Figure 1: Exercise retrieval for learner directed language learning and our proposed solution, multilingual Hypothetical Exercise Retriever (mHyER). At a high level, learners are allowed to provide any natural language input, and the goal is to retrieve exercises relevant to that input. Our method utilizes large language models to perform zero-shot retrieval.
This type of learner initiated personalization can be viewed as a form of self-directed learning, where learners take initiative over the learning process. Self-directed learning has been shown to increase learner performance across multiple topics [18,19,22,23], improve learner motivation [26], and create more cohesive learner environments [13]. Online language learning is a natural setting for self-directed learning, as people learn languages for very personal reasons: Some learn for fun, while others have specific goals, such as preparing for an international trip or developing language skills for business. Different reasons for learning lead to different needs for exercise content: Someone learning to write in a business setting may want extra practice with grammar or politeness, whereas someone learning for a vacation may want exercises about hotels or transportation. Beyond highly personalized motivations for learning, online language learners do not have immediate access to instructors who can plan learning material to target weaknesses. As such, there is an inherent need for online language learners to have some degree of self-direction in order to get the most out of their learning experience.
With the goal of allowing language learners to tailor an online learning experience to their own needs, we formalize the task of exercise retrieval for learner directed language learning and evaluate machine learning models for this task. The goal of this task (Figure 1) is to retrieve relevant exercises from a set of existing exercises based on a learner's input. In this setting, collecting relevance labels (i.e., pairs of learner inputs and relevant exercises) is particularly challenging, as learners will typically be presented with only a small number of exercises for any given input. As a result, we consider the zero-shot setting, where we do not have access to relevance labels for training. While many off-the-shelf models exist for text-based retrieval, we show that direct similarity search (i.e., retrieving exercises that are the most similar to the user input in the representation space) with these models suffers from a semantic similarity gap between how users describe their learning objectives and exercise content. To overcome this gap, we leverage structure inherent to exercises and the generative capabilities of large language models. Specifically, we make the following contributions.
• We propose the new task of exercise retrieval for learner directed language learning in Section 2 and discuss how learner inputs give rise to a fundamental challenge in this task.
• We present our zero-shot retrieval approach, mHyER, in Section 3, and illustrate how augmenting retrieval with LLMs helps overcome the pitfalls of direct similarity search.
• With no existing benchmarks for this task, we create two novel benchmarks in exercise retrieval with both crowdsourced data from learners of a popular language learning app and publicly available Tatoeba data. We evaluate our method against several strong dense retrieval baselines in Section 4 and empirically show that mHyER outperforms relevant baselines by a significant margin.

Related work
Exercise retrieval is naturally connected to the broad field of information retrieval, and in particular, dense retrieval [17,20]. Dense retrieval focuses on retrieving documents based on similarity measured in a learned representation space. Zero-shot retrieval, or retrieval without training on task-specific relevance information, is of particular relevance to our task. Such methods typically rely on a supervised pretraining stage [31,33,40], where models are trained on large scale retrieval datasets, such as MS MARCO [2]. However, such supervised pretraining ultimately depends on the existence of suitable labeled datasets, which are not always readily available [16]. The rise of large language models (LLMs) with strong zero/few-shot performance in new domains has resulted in a line of research integrating LLMs into the retrieval pipeline. Such approaches typically rely on some combination of specialized prompting and synthesizing retrieval datasets to retrain retrieval models [3,9,29,39]. Our approach takes particular inspiration from HyDE [11], which utilizes an LLM to synthesize a hypothetical document, which is then used with a pretrained encoder to retrieve documents via nearest neighbors.
A fundamental step in any retrieval method is the representation space used for similarity comparisons. For the task of exercise retrieval, we focus on learning sentence embeddings, where pretrained language models such as BERT [10] or RoBERTa [21] serve as strong foundations. Contrastive learning of sentence representations, which leverages techniques used in the image domain [4,14], has become especially popular. The goal of contrastive learning is to learn, in an unsupervised manner, a representation space where similar items ("positive pairs") are pulled close together while dissimilar items ("negative pairs") are pushed far apart. In the image domain, positive pairs are formed by applying data augmentation, such as cropping or rotating an image. Such techniques are not directly transferable to natural language, resulting in a long line of methods [5,6,12,16,36] studying contrastive sentence embeddings. Of particular interest to the language learning setting is multilingual contrastive learning [34], where positive pairs can be taken as the same sentence in two different languages. In all, mHyER can be viewed as a combination of multilingual contrastive learning [34] and HyDE [11].
Personalized education systems often gauge a learner's skill level via Knowledge Tracing [7] in order to tailor exercise difficulty level. As a result, a variety of contemporary machine learning methods [1,25,27,30,32,38] have been developed to track learner skill level from historical data. Such methods demonstrate strong empirical success and thus have been leveraged to adaptively recommend exercises to learners [15,37] or even generate new exercises based on skill level [8]. This adaptivity can be viewed as a complementary piece to the problem of exercise retrieval for learner directed language learning that we study in this paper: learner initiated personalization can leverage existing tools from adaptivity to ensure exercises are both relevant and at the right skill level. On the other hand, adaptive systems can benefit from explicit learner direction. For example, we can view the learner input of "past tense verbs" as the learner explicitly saying they are not comfortable with past tense verbs, and use this information in skill estimation.

PROBLEM SETUP
The goal of exercise retrieval for learner directed language learning is to retrieve relevant exercises for a learner given a text input from the learner describing what they want to learn. In particular, we assume that the learner is taking a language learning course, which consists of two languages: the "first language" (i.e., a language they already know) and the "second language" (i.e., the language they are learning), which we refer to as L1 and L2, respectively.¹ The learner completes exercises, which are drawn from a fixed set of $N$ exercises $\mathcal{E} = \{e_1, \ldots, e_N\}$ that are at an appropriate level for the learner. We can view this set of $N$ exercises as samples from an unknown exercise distribution, which captures characteristics (style, length, content, etc.) of exercises. For simplicity, we limit our attention to translation exercises, in which a learner translates an L1 sentence $e_i^{(\mathrm{L1})}$ to the L2, with one correct L2 answer $e_i^{(\mathrm{L2})}$ available as an example of a correct translation. The learner provides some input $q$, and our objective is to retrieve the $k$ (unique) exercises that are the most relevant based on input $q$ in a zero-shot manner. That is, without using any labeled relevance data for training, we want to retrieve $k$ unique exercises $e^\star_1, \ldots, e^\star_k$ that maximize the probability $p$ that an exercise is relevant conditioned on learner input $q$:
$$e^\star_1, \ldots, e^\star_k = \underset{\{e_1, \ldots, e_k\} \subseteq \mathcal{E}}{\arg\max} \; \sum_{j=1}^{k} p(e_j \text{ is relevant} \mid q). \qquad (1)$$

Learner inputs.
The core of the personalized experience in this problem setting is allowing learners to provide an input describing what they want to learn with no restrictions on input content, resulting in a large number of potential input types. For example:
• Topics: Learners can request exercises that teach vocabulary relevant to a particular topic. Inputs such as "words about animals" or "countries" are such examples.
• Grammar: Learners can request exercises teaching grammatical concepts, such as "non-present tenses" or "irregular plurals".
• Culture: Learners can request to review culture-specific aspects of language, such as idioms, slang, or region-specific quirks. For example, a learner learning Spanish may want to learn about "voseo", a region-specific grammatical concept in South America.
• Learning process: Learners can request exercises that help with particular parts of the process of learning a language, such as "words that are hard to spell" or "sentences for first-year students".
Learner inputs of these types result in what we call a referential similarity gap: Under modern text-based retrieval models, how learners express their learning objectives (i.e., the learner input $q$) is not considered similar to what it is referring to, i.e., the content of the exercises $e_i^{(\mathrm{L1})}$ and $e_i^{(\mathrm{L2})}$. We explore this gap in greater detail in Section 3.3.

METHOD
In this section, we present multilingual Hypothetical Exercise Retriever (mHyER), our zero-shot exercise retrieval framework, and show that it overcomes the pitfall of direct similarity search in learned representation spaces.
3.1 Baseline: direct search with similarity spaces.
The backbone of text-based retrieval is a vector space representation of text that reflects some notion of similarity between different pieces of text. Forming these representation spaces remains a core part of text-based retrieval, with early methods such as BM25 [28] forming representations via word frequency. Such methods struggle to generalize as their representation spaces are formed based on counting exact or near text matches. To improve generalization, contemporary methods for text-based retrieval typically train a model $f_\theta$ (parametrized by $\theta$) that maps natural language inputs (from the space of all text inputs $\mathcal{T}$) to some $d$-dimensional vector space: $f_\theta : \mathcal{T} \to \mathbb{R}^d$. Such models are referred to as encoders, and learn representations of text called embeddings. That is, if $t \in \mathcal{T}$ is some text, then $f_\theta(t)$ is its embedding representation. Because exercises are typically short sentences or sentence fragments, we focus on encoders specifically geared towards learning sentence embeddings in this work.
Harnessing the vast availability of text data, contemporary encoders are typically neural networks trained such that texts with similar content are more similar in the representation space under some measure, like cosine similarity. That is, if $t_1, t_2 \in \mathcal{T}$ are similar in content, then $\mathrm{sim}(f_\theta(t_1), f_\theta(t_2))$ is large (and positive). This similarity space suggests a natural approach for retrieving exercises: Pass each exercise $e_i$ through the model $f_\theta$ to obtain its embedding representation $f_\theta(e_i)$.² Then, when a learner provides an input $q$, pass $q$ through the model to obtain $f_\theta(q)$ and return the $k$ exercises with largest cosine similarity $\mathrm{sim}(f_\theta(e_i), f_\theta(q))$. As we see in Section 3.3, direct similarity search often retrieves sentences featuring "language about language", which are often irrelevant to the learner's input. This leads us to leverage the generative abilities of LLMs, as we discuss next.
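The direct search baseline described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the toy 2-dimensional vectors stand in for embeddings that a real encoder would produce.

```python
import numpy as np

def cosine_similarity_matrix(queries, items):
    """Cosine similarity between each row of `queries` and each row of `items`."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    x = items / np.linalg.norm(items, axis=1, keepdims=True)
    return q @ x.T

def direct_search(query_embedding, exercise_embeddings, k):
    """Return indices of the k exercises most similar to the query, best first."""
    sims = cosine_similarity_matrix(query_embedding[None, :], exercise_embeddings)[0]
    # Stable sort on negated scores: ties keep the original exercise order.
    return np.argsort(-sims, kind="stable")[:k]

# Toy embeddings (assumed values); in practice these come from the encoder.
exercise_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_embedding = np.array([2.0, 0.1])
top2 = direct_search(query_embedding, exercise_embeddings, k=2)
```

Because exercise embeddings do not depend on the learner input, they can be computed once and reused for every query.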

mHyER: augmenting direct search with generative capabilities.
If large quantities of relevance data were available, we could train a model to approximate the relevance probability in Equation 1 by learning a representation space where learner inputs and relevant exercises are considered similar and then performing direct search. However, input relevance data is unlikely to be available at the scale necessary to train such a model. Instead, we propose mHyER, visualized in Figure 2, which, after a multilingual contrastive training stage, retrieves exercises in a two-step manner. First, we sample a set of $n_h$ hypothetical exercises from the exercise distribution conditioned on the learner input. We call these sampled exercises our retrieval candidates. In principle, we do not have access to this exact distribution, but we can efficiently approximate sampling via an LLM. Second, we use the retrieval candidates to perform similarity search via $k$-nearest neighbors. mHyER is inspired by two complementary methods: the multilingual contrastive learning approach of [34], and the HyDE retrieval method of [11]. We now discuss both the training and retrieval stages in greater detail.
² We slightly abuse notation here and write $f_\theta(e_i)$ to mean either $f_\theta(e_i^{(\mathrm{L1})})$ or $f_\theta(e_i^{(\mathrm{L2})})$. The choice to compare against the representation of the L1 or L2 sentence is explored in Section 4.
Figure 3: TSNE visualization of exercise, learner input, and GPT-4-synthesized retrieval candidate representations in the representation space of a trained mBERT encoder (left). Learner inputs concentrate in the language about language region (top right), making direct similarity search sub-optimal. Retrieval candidates bridge the referential similarity gap between learner inputs and exercise text and are close in similarity to exercises that meet the learner's specifications (bottom right).
Stage 1: Learning a multilingual similarity space. While we operate in a setting where no explicit learner relevance data is provided, the multilingual nature of our exercises implies that a certain structure should exist in our representation space. Namely, the sentence $e_i^{(\mathrm{L1})}$ in L1 should be similar to its translation $e_i^{(\mathrm{L2})}$ in L2. To ensure this structure is reflected in our representation space, we take inspiration from [34] and utilize multilingual contrastive learning, an unsupervised approach that aims to learn a representation where similar items (called positive pairs) are closer together and dissimilar items (called negative pairs) are far apart. For exercise $e_i$, the contrastive loss $\mathcal{L}_i$ with a mini-batch of $N_b$ sentence pairs is
$$\mathcal{L}_i = -\log \frac{\exp\left(\mathrm{sim}\left(f_\theta(e_i^{(\mathrm{L1})}), f_\theta(e_i^{(\mathrm{L2})})\right) / \tau\right)}{\sum_{j=1}^{N_b} \exp\left(\mathrm{sim}\left(f_\theta(e_i^{(\mathrm{L1})}), f_\theta(e_j^{(\mathrm{L2})})\right) / \tau\right)},$$
where $\tau$ is the user-set temperature parameter and $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity:
$$\mathrm{sim}(u, v) = \frac{u^\top v}{\|u\| \, \|v\|}.$$
In this work, rather than train a sentence encoder from scratch, we follow the commonly accepted practice of initializing our encoder with existing BERT-based checkpoints and contrastively finetuning these checkpoints on exercise data.
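As a rough illustration of the Stage 1 objective, the sketch below computes a per-pair contrastive loss with in-batch negatives over precomputed L1 and L2 embeddings. It assumes the standard InfoNCE form with cosine similarity; the function name and the default temperature of 0.05 are illustrative assumptions, and no gradient step or encoder is included.

```python
import numpy as np

def contrastive_loss(l1_embeddings, l2_embeddings, temperature=0.05):
    """Per-pair contrastive loss with in-batch negatives.

    Row i of each array embeds the L1 / L2 side of exercise i, so the diagonal
    of the similarity matrix holds the positive pairs and every other entry in
    row i acts as a negative for exercise i.
    """
    a = l1_embeddings / np.linalg.norm(l1_embeddings, axis=1, keepdims=True)
    b = l2_embeddings / np.linalg.norm(l2_embeddings, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs)

# Toy batch (assumed values): perfectly aligned pairs vs. deliberately mismatched pairs.
l1 = np.eye(3)
aligned_loss = contrastive_loss(l1, l1)
mismatched_loss = contrastive_loss(l1, l1[[1, 2, 0]])
```

In this toy batch the aligned pairs incur a much smaller loss than the mismatched ones, which is the behavior the training stage relies on to pull translations together.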
Stage 2: Sampling retrieval candidates and exercise retrieval.
A core component of mHyER is sampling from the exercise distribution conditioned on the learner input. While we cannot sample directly from this distribution, we can approximate sampling with an LLM. In particular, we prompt the LLM with a fixed description of the exercise distribution and instruct the LLM to synthesize hypothetical exercises based on this description and on a learner's input. Crucially, we can synthesize exercises without requiring any labeled examples, i.e., we do not embed examples of inputs and relevant exercises in the prompt. At retrieval time, the LLM synthesizes $n_h$ hypothetical exercises, which we denote $\tilde{e}_1, \ldots, \tilde{e}_{n_h}$. We then encode these hypothetical exercises via $f_\theta$ to obtain $n_h$ vectors $f_\theta(\tilde{e}_1), \ldots, f_\theta(\tilde{e}_{n_h})$. Finally, we retrieve the $k$ exercises that have the highest similarity score compared to the average of the $n_h$ vectors: $\frac{1}{n_h} \sum_{j=1}^{n_h} f_\theta(\tilde{e}_j)$. We use GPT-4 [24] in this work, but in practice, any LLM of sufficient capacity can be used.
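The retrieval step of Stage 2 can be sketched as follows, assuming the hypothetical exercises have already been synthesized by an LLM and encoded. The function name and toy vectors are illustrative assumptions; the LLM prompt and the encoder are deliberately left out.

```python
import numpy as np

def retrieve_with_candidates(candidate_embeddings, exercise_embeddings, k):
    """Average the retrieval-candidate embeddings, then rank exercises by cosine similarity."""
    centroid = candidate_embeddings.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    normalized = exercise_embeddings / np.linalg.norm(
        exercise_embeddings, axis=1, keepdims=True
    )
    scores = normalized @ centroid
    # Stable sort on negated scores: ties keep the original exercise order.
    return np.argsort(-scores, kind="stable")[:k], scores

# Toy embeddings (assumed values) for two synthesized candidates and four exercises.
candidate_embeddings = np.array([[1.0, 0.2], [1.0, -0.2]])
exercise_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
top_idx, scores = retrieve_with_candidates(candidate_embeddings, exercise_embeddings, k=2)
```

The only difference from direct search is the query vector: the learner input itself is never embedded, and the averaged candidate embedding takes its place.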

Bridging the referential similarity gap with mHyER.
In an effort to better understand the task of retrieving exercises from learner inputs, we crowdsourced a small dataset of learner inputs from users of Duolingo, a popular language learning app. We then contrastively finetuned mBERT with roughly 40,000 real exercises from the app, spanning 4 different language courses. To get a sense of how contrastively learned similarity spaces reflect learner inputs and exercise text, we visualize our collected data, along with a subsample of the exercises, via TSNE in Figure 3. This visualization reveals a fundamental referential similarity gap between learner inputs and exercise text: How learners describe what they want to learn occupies a distinct part of the representation space, characterized by explicit use of words or phrases about language (e.g., "verbs", "past tense", "adjectives"). We refer to this region as the "language about language" region. As a result, direct similarity search yields exercises that similarly contain words explicitly about language. As shown in Figure 3, the input "past tense verbs" is most similar to exercises about language (e.g., "I explained the new words to him"). Figure 3 also highlights how synthesizing retrieval candidates helps bridge this referential similarity gap by "translating" the learner's input (which is typically expressed in "language about language") to a hypothetical in-distribution exercise whose content satisfies the learner input. We provide concrete examples of learner inputs and synthesized retrieval candidates in Table 1.

DATASETS AND EXPERIMENTAL RESULTS
In this section, we first give an overview of two novel datasets specifically for the task of learner directed language learning. We then compare mHyER against a variety of baselines on these datasets.

Datasets
Duolingo Relevance (DuoRD) Dataset. To evaluate our method, we collected a small scale dataset of 61 learner inputs from learners of Duolingo, a popular language learning app. For each input, we asked the learner to rate 15 exercises as relevant or irrelevant to their input, resulting in 915 total exercises rated.
Exercises were sourced from a pool of approximately 40,000 sentence pairs across four distinct courses. To ensure that the dataset was not skewed too heavily towards relevant or irrelevant responses, we utilized a tiered sampling approach. Using mHyER, we retrieved the top 555 exercises in terms of similarity. To form the set of 15 exercises for the learner to rate, we select the top 5 scoring exercises deterministically (Tier 1). From the next 50 highest scoring exercises, we select 5 exercises uniformly at random without replacement (Tier 2). We repeat this sampling once more, drawing 5 exercises from the remaining 500 exercises (Tier 3). We observe that 64% of exercises from Tier 1 were rated as relevant, 50% from Tier 2, and 34% from Tier 3, resulting in 49% of all exercises rated as relevant.
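The tiered sampling scheme above can be illustrated with a short sketch. It assumes a list of exercise identifiers already ranked by similarity, best first; the function name and the seed parameter are hypothetical conveniences, not part of the described procedure.

```python
import random

def sample_tiers(ranked_ids, seed=None):
    """Pick 15 exercises to rate from the top 555 ranked exercises.

    Tier 1: the top 5, taken deterministically.
    Tier 2: 5 drawn without replacement from ranks 6-55.
    Tier 3: 5 drawn without replacement from ranks 56-555.
    """
    rng = random.Random(seed)
    tier1 = list(ranked_ids[:5])
    tier2 = rng.sample(list(ranked_ids[5:55]), 5)
    tier3 = rng.sample(list(ranked_ids[55:555]), 5)
    return tier1, tier2, tier3

# Toy ranking (assumed): identifiers 0..554 already sorted by similarity.
tier1, tier2, tier3 = sample_tiers(list(range(555)), seed=0)
```

Sampling the lower tiers at random exposes raters to exercises the retriever scored less highly, which keeps the rated set from consisting almost entirely of likely-relevant items.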
Tatoeba Tags dataset. To test our method on a larger scale, we construct a retrieval dataset from Tatoeba, a public database of sentences and their translations. We begin by noting that when sentences are uploaded to Tatoeba, they are often tagged by grammatical concepts, language specific concepts, or topics. For example, the sentence "The brown bear is an omnivore" is tagged with "animals" and the sentence "That way I kill two birds with one stone" is tagged with "idiomatic expression". We treat each of these tags as a learner input, and deem an exercise relevant if it has been tagged accordingly. While per sentence tags are not necessarily exhaustive, they provide useful signal for evaluating retrieval approaches with typical retrieval metrics as well as binary classification metrics, as we discuss in Section 4.2. We form 3 benchmarks for evaluation, collectively referred to as the Tatoeba Tags dataset:
• English benchmark: only English sentences with 139 tags and 89,392 sentences.
• Spanish from English benchmark: Spanish-English sentence pairs with 114 tags in Spanish and 49,258 pairs.
• English from Spanish benchmark: Spanish-English sentence pairs with 108 tags in Spanish and 46,837 pairs.
To form the benchmarks, we collect all tags corresponding to the benchmark, filter out tags and sentences containing profanity, merge similar tags together, and then perform benchmark specific language and content processing. We then keep only the tags with more than 20 sentences and download the corresponding sentences. The benchmark specific processing is done to better align the benchmark with how learners would interact with real-world language learning courses. Specifically, we perform both language and content processing. For language processing, we translate all tags (which appear in a variety of languages) to the L1. This is done to emulate the learning process: we use tags as a stand-in for learner inputs, which are likely to be in the learner's L1. For content processing, we remove tags that do not make sense in the context of a particular learning direction. For example, a Spanish speaker learning English would not input "voseo" (a Spanish grammatical concept), nor would an English speaker learning Spanish input "British English".
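A simplified sketch of the tag processing is given below. It covers only tag merging, tag removal, and the minimum-count filter; the function name, the `merge_map` and `blocked_tags` arguments, and the toy data are hypothetical, and sentence-level profanity filtering and tag translation are omitted.

```python
from collections import defaultdict

def build_tag_index(tagged_sentences, merge_map=None, blocked_tags=(), min_sentences=21):
    """Map each surviving tag to the set of sentences carrying it.

    `merge_map` folds similar tags into one canonical tag, `blocked_tags` lists
    tags to discard, and `min_sentences=21` keeps tags with more than 20 sentences.
    """
    merge_map = merge_map or {}
    blocked = set(blocked_tags)
    index = defaultdict(set)
    for sentence, tags in tagged_sentences:
        for tag in tags:
            tag = merge_map.get(tag, tag)
            if tag not in blocked:
                index[tag].add(sentence)
    return {tag: sents for tag, sents in index.items() if len(sents) >= min_sentences}

# Toy data (assumed), with a lowered threshold so the example stays small.
data = [("s1", ["animals"]), ("s2", ["animal"]), ("s3", ["idiom"]), ("s4", ["voseo"])]
index = build_tag_index(
    data, merge_map={"animal": "animals"}, blocked_tags=["voseo"], min_sentences=2
)
```

The resulting mapping doubles as the relevance labels: a sentence counts as relevant to a tag exactly when it appears in that tag's set.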

Evaluation procedure and metrics
For the DuoRD dataset, we treat the 915 exercises that have been rated for some learner input as the exercise set. Because not every one of the 915 exercises was assigned a relevance rating for every learner input, we cannot use typical information retrieval metrics such as Recall or Precision. As a result, we treat evaluation on this dataset as a binary classification problem, where the goal is to predict whether an exercise is relevant or irrelevant. To evaluate methods, we use area under the receiver operating characteristic curve (AUC) and accuracy. To compute AUC, we assign each rated exercise a score equal to its similarity to the retrieval candidate. We then aggregate relevance labels and scores across all learner inputs to define the ROC curve. To compute accuracy, we compute the scores as in AUC, and set a threshold such that any exercise scoring above the threshold is deemed relevant and any exercise scoring below it is deemed irrelevant. Because the similarity score ranges between -1 and 1, we set the threshold by sweeping over [−1, 1) in increments of 0.1. We then report the highest accuracy among all thresholds in the sweep.
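Both metrics can be computed directly from pooled labels and similarity scores, as in the sketch below. The function names and toy values are illustrative assumptions; AUC is computed through its pairwise-ranking interpretation rather than by tracing the ROC curve, and scores exactly at a threshold are treated as irrelevant.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as the probability that a relevant item outscores an irrelevant one (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def best_threshold_accuracy(labels, scores, step=0.1):
    """Highest accuracy over thresholds swept across [-1, 1) in increments of `step`."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.arange(-1.0, 1.0, step)
    accuracies = [np.mean((scores > t) == labels) for t in thresholds]
    return float(max(accuracies))

# Toy pooled ratings (assumed values): 1 = relevant, 0 = irrelevant.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, -0.2]
auc = roc_auc(labels, scores)
acc = best_threshold_accuracy(labels, scores)
```

Reporting the best accuracy over the sweep gives every method its most favorable threshold, so the comparison reflects ranking quality rather than score calibration.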
For the Tatoeba Tags dataset, because we have a notion of relevance, as indicated by the presence of a tag, we utilize Precision@$k$, which is a common metric in information retrieval that reports the fraction of the $k$ retrieved exercises that are relevant. To compute Precision@$k$, we retrieve $k$ sentences per learner input (i.e., tag) and record the fraction of the $k$ retrieved sentences tagged with the learner input tag. Because the tagging of Tatoeba sentences is not exhaustive, the absolute values of reported Precision@$k$ may be low, but relative performance still indicates how methods would perform if tagging were comprehensive. In light of this, we again follow the evaluation approach of the DuoRD dataset and also report AUC.
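For concreteness, Precision@$k$ for a single tag can be computed as in this small sketch. The function name and identifiers are hypothetical, and the retrieved list is assumed to be ordered best first.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that carry the query tag."""
    top_k = list(retrieved_ids)[:k]
    relevant = set(relevant_ids)
    return sum(1 for item in top_k if item in relevant) / k

# Toy example (assumed): only sentence 1 among the top 3 is tagged with the input tag.
p_at_3 = precision_at_k([3, 1, 7, 9], relevant_ids={1, 9, 4}, k=3)
```

Note that the denominator is always $k$, so a relevant sentence that was never tagged lowers the score, which is why absolute values can understate performance.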
When performing evaluation in both datasets, we can retrieve exercises in two distinct ways. We can synthesize retrieval candidates in the L1 and perform similarity search on the L1 exercise texts. Alternatively, we can synthesize retrieval candidates in the L2 and perform similarity search on the L2 exercise texts (example translations). As a result, we report AUC, accuracy, and Precision@$k$ in both the L1 and L2 setting.
Table 1: Examples of collected learner inputs and retrieval candidates synthesized based on the learner input via GPT-4. For a variety of input types, GPT-4 is able to bridge the referential similarity gap by synthesizing text that closely resembles real exercise text while incorporating the concept that the learner wants to learn.

| Input | Synthesized retrieval candidates |
| --- | --- |
| Past tense | They went to the concert last night. Did you finish your project on time? We didn't have any coffee this morning. She cooked a delicious meal for us. He had never seen such a beautiful sunset. Were they able to solve the problem? |
| Future tense | She will be moving to France next year. I won't attend the party tonight. When will you finish the project? They'll be studying for the exam tomorrow. In five years, I'll have my own business. We're going to plant a garden this summer. |
| Present progressive verbs | Are you studying for the test? She's preparing dinner for tonight. They're practicing their dance routine. He's not listening to the lecture. I'm writing a letter to my friend. The cat is chasing its tail. |

Baselines
For both the DuoRD dataset and Tatoeba Tags dataset, we evaluate mHyER against direct similarity search using BERT and mBERT [10], as well as the following BERT-based models: Contriever, mContriever [16], and SimCSE [12]. In particular, we use the BERT base (110 million parameters) variant of each of the above methods. These methods achieve strong unsupervised performance in a variety of retrieval and semantic text similarity tasks. BERT and mBERT were trained in a self-supervised manner by using masked language modeling and next sentence prediction objectives, with the only difference being the training data (only English for BERT and a multilingual corpus for mBERT). Contriever and mContriever propose two new approaches in contrastively tuning BERT: (1) utilizing an inverse-cloze task and independent cropping as means of forming positive pairs and (2) utilizing a momentum encoder as described in [16] to ensure better representation of negative items. Contriever is initialized with BERT and trained on English CCNet [35] and Wikipedia data, whereas mContriever is initialized with mBERT and trained on multiple languages in CCNet. We also consider supervised variants of Contriever and mContriever, which are finetuned on MS MARCO [2], a large scale retrieval dataset. SimCSE uses dropout to create synthetic positive pairs for the contrastive loss by passing the same sentence through the encoder with different random dropout masks. Starting with BERT, SimCSE is trained on Wikipedia data.

Direct similarity search vs. mHyER: A qualitative case study
Before we present our full experimental results, we first present examples of inputs and retrieved sentences on the English benchmark of the Tatoeba Tags dataset. To qualitatively gauge the difference between direct similarity search and mHyER, we provide examples of retrieved exercises for a small number of inputs in Table 2. We present the top three retrieved exercises measured in terms of similarity score for both direct similarity search and mHyER, using mBERT finetuned on Tatoeba data as our similarity space. The input "Subject verb agreement" highlights the "language about language" phenomenon: Instead of retrieving exercises containing correct subject verb agreement, direct similarity search retrieves exercises in the "language about language" part of the similarity space. These exercises contain words such as "words" and "verb". On the other hand, mHyER is capable of bridging the gap between input and exercises, retrieving exercises that focus on ensuring sentences with plural objects have the right verb form. The "Preference" input illustrates an example of a nebulous input, as the learner wants exercises that have to do with expressing preferences. However, direct similarity search returns exercises explicitly about "choice", whereas mHyER retrieves exercises that have learners practice expressing preference in more natural settings. The last two inputs, "Cooking" and "Sports", illustrate instances where direct similarity search yields exercises that too literally match the input.
Aside from retrieval quality, we observe that retrieved results from direct similarity search also suffer from sentence length bias. In contrastively learned similarity spaces, it has been empirically observed that the length of a sentence is implicitly encoded in the representation of a sentence, meaning sentences of a similar length are more likely to be considered similar [36]. The retrieved exercises from direct similarity search shown in Table 2 clearly exhibit this bias whereas those retrieved via mHyER exhibit a higher variation in length. We confirm that this phenomenon holds for all inputs in the Tatoeba Tags English benchmark by retrieving the top 3 exercises across all 139 tags with both direct similarity search and mHyER. For each exercise, we record its length (measured in number of characters). As shown in Figure 4, exercises retrieved with mHyER are longer on average, aligning remarkably well with the global sentence length distribution. On the other hand, direct similarity search yields sentences that are notably shorter on average. This empirical observation highlights that generating in-distribution retrieval candidates allows us to retrieve sentences of varied length that track well with our set of exercises.
Table 2: Examples of exercises retrieved with direct similarity search and mHyER for the same input on the Tatoeba Tags English Benchmark. Direct similarity search is not capable of bridging the fundamental referential similarity gap between learner inputs and exercise content, as illustrated by "Subject verb agreement", "Second person", and "Colloquial" inputs. In settings where learners ask about specific topics, direct similarity search returns exercises that most literally match the learner input, as shown with the "Preference", "Cooking" and "Sports" inputs. On the other hand, mHyER retrieves exercises well aligned with the learner input.

Experimental results
In this section, we present our experimental results on the DuoRD dataset and Tatoeba Tags dataset. For both datasets, we consider two starting points for fine-tuning the BERT embedding model: Unsupervised pretraining, where we contrastively train a BERT checkpoint that has been pretrained in an unsupervised manner, and supervised pretraining, where we start with a BERT checkpoint that has been pretrained on MS MARCO [2]. For all experiments, we take the [CLS] representation as the sentence representation, except when working with Contriever and mContriever, where we use their custom mean pooling.³ For all experiments with mHyER, we adopt the training setup of [34], which is adapted from [12], including all default hyperparameters. For retrieval, we synthesize $n_h = 10$ hypothetical retrieval candidates from GPT-4 and perform nearest neighbors search with the averaged embedding of all $n_h$ candidates.
DuoRD dataset. The evaluation results of baselines and mHyER on the DuoRD dataset are presented in Table 3. For both unsupervised and supervised settings, we contrastively finetune BERT checkpoints on the full 40K exercises in the DuoRD dataset. In the unsupervised pretraining setting, we start our contrastive finetuning with two multilingual checkpoints: mBERT and mContriever. In this setting, mHyER outperforms all relevant baselines in both AUC and accuracy, with mHyER mContriever achieving the best performance among all methods. mHyER mContriever results in 36.8% and 40.2% AUC gains over mContriever and mBERT, respectively. Notably, direct similarity search generally fails to perform well, highlighting the gap between learner inputs and relevant exercises: the BERT, mBERT, and mContriever baselines fail to reach even an AUC of 0.5, which corresponds to random guessing. This reinforces the fact that direct similarity search cannot overcome the fundamental mismatch between how learners describe what they want to learn and exercise content. In the supervised pretraining setting, we start our finetuning from the Contriever-sup and mContriever-sup checkpoints, which were finetuned on labeled MS MARCO data. mHyER once again outperforms all baselines, with mHyER mContriever-sup as the best performing method. Here, supervised pretraining modestly improves the performance of direct similarity search, suggesting that supervised pretraining can lessen the referential similarity gap in a limited manner. The improvement due to supervised pretraining is less pronounced when utilizing mHyER, with even one instance of decreased accuracy. This suggests that synthesized retrieval candidates bridge the gap to the point where further improvement is difficult.

Footnote 3: See https://huggingface.co/facebook/contriever for further details.
Tatoeba Tags dataset. The evaluation results of baselines and mHyER on the Tatoeba Tags dataset are presented in Table 4. On this dataset, we experiment with contrastive finetuning on out-of-distribution data. This experiment was inspired by empirical observations from finetuning mBERT. In particular, we contrastively finetuned mBERT on the Spanish from English benchmark (denoted es-from-en) and the English from Spanish benchmark (denoted en-from-es), as well as the 40K out-of-distribution sentence pairs from the DuoRD dataset (which contains English-Spanish pairs).
We observe that finetuning on the DuoRD dataset outperforms finetuning on in-distribution data. This surprising observation leads us to finetune Contriever and mContriever checkpoints with the DuoRD dataset in both the unsupervised and supervised settings.
In the unsupervised setting, we once again observe poor performance from direct similarity search baselines and sizable increases in performance when using mHyER: up to a 39% increase in AUC and more than double the precision@15 when comparing the best mHyER method with the best direct similarity search baseline. We observe similar gains in the supervised pretraining setting. Methods that use Contriever (pretrained only on English data) typically perform better when retrieving in English, whereas methods using mContriever typically perform better in multilingual settings.
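For readers less familiar with the two metrics reported above, the following sketch computes them on toy data. It assumes AUC refers to the standard ROC AUC over scored exercises with binary relevance labels (so that 0.5 corresponds to random guessing) and that precision@k is the fraction of the k highest-scoring exercises that are relevant. The function names and the scores are illustrative, not taken from the benchmarks.

```python
def roc_auc(scores, labels):
    """Probability that a random relevant item outscores a random irrelevant one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one irrelevant item")
    # Ties between a relevant and an irrelevant item count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scoring items that are relevant."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in ranked) / k

# Toy similarity scores and relevance labels (assumed data).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))
print(precision_at_k(scores, labels, k=2))
```

AUC depends on the ordering of every scored exercise, while precision@k looks only at the head of the ranking, which is why the two metrics can move differently across methods.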

Ablation study
The two key steps in mHyER are multilingual contrastive pretraining and synthesizing retrieval candidates. To characterize the relative contributions of each step, we create variants of mHyER that either perform direct similarity search after contrastive pretraining or retrieve with GPT-synthesized retrieval candidates using a non-finetuned encoder (i.e., HyDE [11]). As shown in Table 5, the combination of both stages yields the best performance in the vast majority of cases. Utilizing only synthesized retrieval candidates results in larger increases in precision compared to contrastive finetuning, while the opposite is true for AUC. This suggests that the two steps drive performance increases in complementary ways. Contrastive finetuning changes the similarity space such that relevant exercises are closer to learner inputs at a global level, resulting in increases in AUC (which measures a global ranking of predictions). However, direct similarity search still cannot overcome referential similarity gaps, and hence increases in precision@15 are relatively low. Meanwhile, synthesizing retrieval candidates directly improves the quality of the top retrieved exercises, resulting in higher precision@15, but does not change representations, resulting in relatively lower increases in AUC.

DISCUSSION
In this paper, we introduce the problem of exercise retrieval for learner directed language learning and highlight an important challenge in this setting: how learners express what they want to learn and exercise content are fundamentally semantically different. The effects of this referential similarity gap are especially pronounced when attempting to retrieve exercises via direct similarity search: even models supervised on MS MARCO, a large scale retrieval dataset, struggle to bridge this referential similarity gap. As a result, we propose mHyER, a zero-shot retrieval approach that leverages the generative capabilities of pretrained LLMs to synthesize relevant in-distribution sentences which are then used to retrieve exercises. We form two novel benchmark datasets by collecting human responses and processing publicly available data. mHyER outperforms several strong baselines, including ones trained in a supervised fashion.
Future work. mHyER lays the methodological foundation for self-directed online language learning. Many interesting directions of future work exist, ranging from investigating different learning areas to methodological extensions that accommodate labeled relevance information. We discuss several of these directions below. mHyER provides concrete methodology that can enable future investigations into the effects of self-directed learning on long-term learning outcomes and curriculum design at scale. Self-directed learning can also play a role in improving other components of personalized education systems. For example, a learner input into a self-directed learning system can be viewed as an indicator of a self-perceived weakness, which would provide a powerful form of supervision for estimating user skill levels. Studying how the inputs and outputs of complementary parts of a unified personalized education system interact is an important direction of future work.
Another interesting avenue of future work is investigating if analogous "language about language" phenomena appear in settings other than language learning. We hypothesize that such phenomena exist in one form or another across all learning settings. For example, how learners describe what they want to review in math (e.g., "right angles") exhibits a similar fundamental mismatch with exercise text (e.g., "Compute the length of the hypotenuse of this triangle"). If such gaps exist, methods capable of bridging the referential similarity gap, like mHyER, will be required across different learning settings. Characterizing the degree to which such gaps appear, as well as how such gaps differ, in different learning settings remains important and open work.
From a system design and learner experience perspective, developing machine learning methods to retrieve relevant exercises based on learner inputs is a foundational piece of any self-directed language learning system. However, serving a set of exercises that maximizes relevance may not lead to the best learner experience. Instead, the objective of exercise retrieval can be made more flexible: rather than retrieving k exercises that maximize relevance, we retrieve all exercises with scores that exceed some pre-determined threshold. Then, this set of relevant exercises can be re-ranked based on additional criteria, such as difficulty level (with information from Knowledge Tracing-based parts of the system) or diversity (in terms of difficulty or length). Regardless of the precise objective (top k vs. all relevant exercises), the referential similarity gap persists, making mHyER especially suitable for this initial retrieval step.
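The threshold-then-re-rank idea above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the relevance scores, the difficulty estimates, the threshold value, and the choice to re-rank by closeness to a target difficulty are hypothetical and not part of the evaluated system.

```python
def retrieve_above_threshold(scores, threshold):
    """Indices of all exercises whose relevance score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]

def rerank_by_difficulty(candidates, difficulties, target):
    """Order candidates by how close their difficulty is to the learner's target."""
    return sorted(candidates, key=lambda i: abs(difficulties[i] - target))

# Toy relevance scores from the initial retrieval step (assumed data).
scores = [0.82, 0.35, 0.91, 0.77]
# Toy per-exercise difficulty estimates, e.g. from a knowledge tracing model (assumed).
difficulties = [0.9, 0.2, 0.4, 0.5]

relevant = retrieve_above_threshold(scores, threshold=0.75)
reranked = rerank_by_difficulty(relevant, difficulties, target=0.5)
print(relevant)
print(reranked)
```

Separating the two stages keeps the relevance model unchanged while letting the secondary criterion (difficulty here, but diversity would work the same way) be swapped without retraining.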
Methodologically, mHyER was designed explicitly with the goal of zero-shot retrieval. However, opportunities to collect learner relevance feedback grow as self-directed learning systems get implemented. Such feedback can then be used to train retrieval methods. Investigating how to effectively use limited learner feedback to help retrieval methods bridge the referential similarity gap remains an open question. Additionally, extending mHyER to learning settings with multi-modal exercises is another direction of future work. Using newly developed multi-modal models to measure similarity in different domains, such as images or audio, can unlock a richer learning experience for learners.

Figure 2: mHyER consists of two stages. Contrastive finetuning (left) is employed as a training stage to optimize our semantic similarity space for multilingual exercises. Then at retrieval time (right), a large language model is employed to synthesize hypothetical retrieval candidates. These retrieval candidates are then used in direct similarity search to retrieve exercises.

Figure 4: Length of the top 3 retrieved exercise sentences, measured in number of characters, for direct similarity search and mHyER. Exercises retrieved via direct similarity search are inherently biased in length, with a majority of exercises being relatively short. Using mHyER results in exercises of more varied length. This variation in length aligns well with the global distribution of exercises, showing that mHyER effectively translates learner inputs to the in-distribution exercises.
MS MARCO [2] is a large scale retrieval dataset that covers different tasks, such as passage ranking and keyphrase extraction. In all settings, mHyER [model] denotes mHyER starting with [model] as its initial checkpoint for contrastive training. [model]-sup indicates that [model] was trained in a supervised manner. We emphasize that at no point in training mHyER is labeled training data for exercise retrieval used; sup only indicates that MS MARCO was used to train the initial BERT checkpoint.

Table 3: Evaluation results on the DuoRD dataset. mHyER [model] indicates that contrastive finetuning was employed with [model] as the initial checkpoint. +DuoRD dataset denotes that the DuoRD dataset was used for contrastive finetuning. In all cases, mHyER outperforms relevant baselines dramatically.

Table 4: Evaluation results on the Tatoeba Tags dataset. mHyER [model] indicates that contrastive finetuning was employed with [model] as the initial checkpoint. +[dataset name] denotes that [dataset name] data was used for contrastive finetuning. In all cases, mHyER outperforms relevant baselines dramatically, with large gains coming from finetuning on out-of-distribution data.

Table 5: Ablation results on the Tatoeba Tags dataset. We experiment by removing either the contrastive finetuning step or the retrieval candidate synthesis step. +GPT indicates that retrieval candidates were used with no contrastive finetuning, whereas +DuoRD dataset indicates that direct similarity search was used after contrastively finetuning on the DuoRD dataset. In a vast majority of cases, contrastive finetuning and retrieval candidate synthesis boost performance, with retrieval candidates generally contributing more.