Knowledge-enhanced Agents for Interactive Text Games

Communication via natural language is a key aspect of machine intelligence, and it requires computational models to learn and reason about world concepts, with varying levels of supervision. Significant progress has been made on fully-supervised non-interactive tasks, such as question-answering and procedural text understanding. Yet, various sequential interactive tasks, as in text-based games, have revealed limitations of existing approaches in terms of coherence, contextual awareness, and their ability to learn effectively from the environment. In this paper, we propose a knowledge-injection framework for improved functional grounding of agents in text-based games. Specifically, we consider two forms of domain knowledge that we inject into learning-based agents: memory of previous correct actions and affordances of relevant objects in the environment. Our framework supports two representative model classes: reinforcement learning agents and language model agents. Furthermore, we devise multiple injection strategies for the above domain knowledge types and agent architectures, including injection via knowledge graphs and augmentation of the existing input encoding strategies. We experiment with four models on the 10 tasks in the ScienceWorld text-based game environment, to illustrate the impact of knowledge injection on various model configurations and challenging task settings. Our findings provide crucial insights into the interplay between task properties, model architectures, and domain knowledge for interactive contexts.


INTRODUCTION
Communication through natural language is a crucial aspect of machine intelligence [7]. The recent progress of computational language models (LMs) has enabled strong performance on tasks with limited interaction, like question-answering and procedural text understanding [6,20,24]. Recognizing that interactivity is an essential aspect of communication, the community has turned its attention towards training and evaluating agents in interactive fiction (IF) environments, like text-based games, which provide a unique testing ground for investigating the reasoning abilities of LMs and the potential for AI agents to perform multi-step real-world tasks in a constrained environment. For instance, in Figure 1, an agent must pick a fruit in the living room and place it in a blue box in the kitchen. Text-based games use text instead of graphics, sounds, or animations to create interactive stories, and can include adventure, puzzle-solving, and role-playing themes. They allow us to study models' abilities to perform functional grounding separately from, e.g., the problem of multimodal grounding that is inherent in more complex robot simulation environments [11].
Recently developed text-based games, such as TextWorld [9] and ScienceWorld [30], have quickly become popular, inspiring a variety of methods. To succeed in these games, agents must manage their knowledge, reason, and generate language-based actions that produce desired and predictable changes in the game world. IF games can be formulated as Partially Observable Markov Decision Processes (POMDPs), a category of sequential decision-making challenges under uncertainty. POMDPs cover scenarios in which states are only partially observable and the effects of actions are uncertain. Thus, IF games can be modeled using reinforcement learning (RL), with states, actions, observations, transitions, and rewards [21]. Observations correspond to text descriptions from the environment, and states are based on descriptions of agent and item locations, inventory contents, and surroundings.
Given their natural language formulation, text-based games can also be tackled by LM approaches. The pros and cons of these two modeling paradigms are complementary. RL approaches function online and offer the advantage of modeling multi-step transitions, but they can become challenging to optimize if the reward structure and state information lack sufficient signals for effective learning. LMs offer flexibility in choosing subsequent actions, possess vast semantic knowledge, and can be advantageous for generating high-level, natural language instructions; yet, they operate within strict constraints on input size and do not support multi-step interactions natively.
Prior work has shown that RL- and LM-based agents struggle to reason about or to explain science concepts in IF environments [30], which raises questions about these models' ability to generalize to unseen situations beyond what has been observed during training [19]. For example, while tasks such as 'retrieving a known substance's melting (or boiling) point' may be relatively simple, 'determining an unknown substance's melting (or boiling) point in a specific environment' can be challenging for these models. To improve generalization, it may be effective to incorporate world knowledge, e.g., about object affordances; yet, no prior work has investigated this direction. In addition, existing models struggle to learn effectively from environmental feedback. For instance, when examining the conductivity of a specific substance, the agent must understand that it has already obtained the necessary wires and the particular substance, so that it then proceeds to locate a power source. Therefore, there is a need for a framework that can analyze and evaluate the effectiveness of different types of knowledge and knowledge-injection methods for text-based game agents.
In this paper, we design such a framework to augment existing text-based game agents with additional knowledge. We perform knowledge injection for two complementary paradigms based on training objectives: (1) online policy optimization through rewards, including pure RL [13] and RL enhanced with Knowledge Graphs (KGs) [1], and (2) single-step offline prediction, including both a pre-trained LM [23] and an instruction-tuned LM [22]. We consider these two model classes because they are representative of the existing approaches for text-based games, which allows us to investigate how different model paradigms respond to knowledge-injection techniques. We experiment with two types of additional knowledge, namely, task history and object affordances. We evaluate the effectiveness of our proposed framework on the diverse set of 10 elementary school science tasks of the ScienceWorld environment [30]. The results illustrate that knowledge injection exerts a more favorable influence on single-step offline prediction models, i.e., LMs. Also, adding affordance knowledge is more beneficial than adding historical knowledge.
Our contributions are as follows:
(1) We investigate the role of knowledge injection in learning-based agents for semi-Markov interactive text games. We specifically focus on injecting memory about previous correct actions and the affordances of the relevant objects in the agent's scene.
(2) We integrate our injection strategies in two model paradigms, each with two variants: RL ('pure' RL and KG-enhanced RL) and language modeling (pre-trained and instruction-tuned). We devise multiple injection strategies to enrich the information: as part of existing inputs, as new inputs, or as KG relations.
(3) We perform experiments on diverse tasks of the ScienceWorld environment to provide insights on the impact of including affordance knowledge and action memory across different architectures, tasks, and knowledge-injection strategies. Our extensive experiments advance the understanding of how external knowledge can produce better action selection in text-based games.

RELATED WORK
Reinforcement Learning for Text-based Games has been a popular idea due to the conventional formulation of text-based games as Markov decision processes. A common challenge in these games is the combinatorially large action space, which makes it difficult to find a good policy. Carta et al. [5] proposed an approach to achieve alignment through functional grounding, where an agent uses an LM as a policy to solve goals through online RL. Madotto et al. [26] introduced a new exploration and imitation-based agent to play text-based games, which can be seen as a testbed for language understanding and generation. The proposed method uses the exploration approach of Go-Explore [10] for solving games and trains a policy to imitate trajectories with high rewards. eXploit-Then-eXplore (XTX) [28] is a multi-stage episodic control algorithm that separates exploitation and exploration into distinct policies, guiding the agent's action selection at different phases within a single episode. Yao et al. [32] proposed a Contextual Action LM to generate a compact set of action candidates at each game state and combined it with an RL agent to re-rank the generated action candidates. The Deep Reinforcement Relevance Network (DRRN) model [13] uses a separate Gated Recurrent Unit (GRU) for processing action text into a vector, which is used to estimate a joint Q-value Q(o, a) over the observation o and each action a. Our work injects knowledge into the DRRN model to enhance agents' understanding of the game world. While these works have typically relied on the memory of the single previous action taken, regardless of its utility, our approach distinguishes itself by taking into account the memory of all previous actions that generated a positive reward. Thus, our agents obtain better performance by using this information to reinforce correct decision-making and avoid repeating past mistakes.
LMs for Text-based Games used in works such as Swift [22], ReAct [33], and SayCan [4] have revealed the feasibility of autonomous decision-making agents. Swift [22] is a model that takes the environment state and the history of the last ten actions as input strings for the LM, and it is trained in a supervised manner. ReAct [33] enables LMs to generate subgoals within action planning by incorporating a virtual 'think' action. This method requires human annotators to provide instances of 'thinking' that outline subsequent subgoals, along with comprehensive action trajectories. SayCan [4] integrates an LM and a value function, aligning with grounding affordances. Using historical and current context as textual inputs, SayCan generates a ranked list of actions, grounding LMs through value functions that reflect the likelihood of action success. While Swift and SayCan retain a record of action history, the contribution of this information is not systematically studied. Moreover, they do not include world knowledge like object affordances.
Knowledge-injection in Text-based Game Agents has been used to enhance the performance of RL and LM agents. Ahn et al. [4] identify potential actions using an LM and assign scores to these actions based on their likelihood of success in a given environment, which can be seen as a form of implicit affordance information.
Swift [22] introduced an additional layer of knowledge by incorporating the history of the previous ten actions within the episode.
Several works [1,2,15,31] have used KGs as an extra knowledge source to provide a structured representation of the game world, which can be used to guide agents' decision-making.

FRAMEWORK FOR KNOWLEDGE INJECTION IN TEXT-BASED GAME AGENTS
In most text-based games, the agent's input is comprised of three primary elements: the observation of the environment (obv), the contents of the agent's inventory (inv), and the task description (desc). These elements give the agent the context to make informed decisions and progress through the game. Based on these inputs, the agent is presented with a set of valid actions that it can perform, such as moving to a new location, interacting with objects in the environment, or using items in its inventory. Through these interactions, agents explore the game world, solve puzzles, and advance the story. In this section, we detail approaches for improving agents' downstream performance, thereby improving agents' coherence in action generation, their contextual awareness, and their abilities to learn effectively from the interactive environment. We consider two types of knowledge for enriching the inputs and two representative model classes as subjects for knowledge injection with their corresponding variants and knowledge-injection strategies.

Input Enrichment with Extra Knowledge
We expect that the raw inputs from the environment (observation, inventory, and task description) make it challenging for the agent to act coherently and learn from its mistakes. To improve the coherence and learning process, we enrich the apparent inputs with two complementary knowledge types: action memory and affordances.
Action Memory. Historical knowledge is necessary for an AI agent to learn how to predict future steps based on a sequence of steps that it has taken previously. The historical knowledge could be in the form of all past actions picked by the model or the sequence of correct actions chosen by the model. Our analysis shows that preserving the past correct actions is a superior approach because it helps to reinforce successful strategies and prevent the model from repeating unsuccessful ones. Hence, we preserve the memory of previous correct actions (MCA) taken by the agent in the current episode as input for all our models. Moreover, the memory can be short-term (within an episode) or long-term (across episodes). We focus on short-term memory from the current episode. MCA is determined by the environment feedback: if an action yields a reward, then it is considered correct. Therefore, correct actions cannot be fed to the agent initially, but are instead stored in memory as the agent progresses through the (train/test time) episode.
Affordance Knowledge. Essentially, affordances are the set of possible actions allowed in a particular state of the environment. Within the field of perceptual psychology, they are seen as a central tool through which living beings categorize their environment [12]. We expect that affordances can help models learn better by listing the possible interactions with the objects around them. Unlike historical knowledge, the environment does not provide the affordances, and they need to be retrieved from external sources. For this purpose, we use ConceptNet [27] and obtain its capableOf and usedFor relations for the objects in a given IF game episode. The obtained affordances are then aggregated with the original environment inputs. For the example in Figure 1, we inject information that an apple affords being eaten, and a box can contain objects.
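As a rough illustration of the two knowledge types described above, the following minimal sketch keeps a per-episode memory of rewarded actions and looks up affordances for objects in the scene. It is not the authors' implementation: the names `EpisodeMemory`, `lookup_affordances`, `enrich_input`, and the small `TOY_AFFORDANCES` table (a hypothetical stand-in for ConceptNet lookups) are assumptions made for this example, as is the reading of "yields a reward" as a strictly positive reward.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for ConceptNet lookups: (object, relation) -> targets.
TOY_AFFORDANCES = {
    ("apple", "usedFor"): ["eating"],
    ("box", "capableOf"): ["contain objects"],
    ("box", "usedFor"): ["contain", "hold"],
}


@dataclass
class EpisodeMemory:
    """Short-term memory of correct actions (MCA) for a single episode."""
    correct_actions: list = field(default_factory=list)

    def update(self, action: str, reward: float) -> None:
        # Assumption: an action is "correct" when its reward is positive.
        if reward > 0:
            self.correct_actions.append(action)


def lookup_affordances(objects, table=TOY_AFFORDANCES):
    """Collect capableOf / usedFor phrases for the given objects."""
    phrases = []
    for obj in objects:
        for rel in ("capableOf", "usedFor"):
            for target in table.get((obj, rel), []):
                phrases.append(f"{obj} {rel} {target}")
    return phrases


def enrich_input(obv, inv, desc, memory, objects):
    """Aggregate the raw environment inputs with MCA and affordances."""
    return {
        "obv": obv,
        "inv": inv,
        "desc": desc,
        "mca": list(memory.correct_actions),
        "aff": lookup_affordances(objects),
    }
```

In this sketch the memory starts empty at the beginning of an episode and grows only as rewarded actions are observed, which mirrors the point that correct actions cannot be supplied to the agent up front.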

Knowledge Injection in Methods
We support two complementary paradigms based on training objectives: (1) online policy optimization through rewards using reinforcement learning (RL), where we frame the task as a POMDP; and (2) single-step offline prediction achieved through supervised training, approached as a language modeling task.

Online Policy Optimization through Rewards (RL Methods).
(1) Pure RL-based Model. We employ DRRN [14], due to its strong performance across challenging interactive text-based environments [13]. DRRN leverages a GRU to encode the current game state into a vector, as shown in Figure 2. It uses a separate GRU to encode each of the valid actions into a vector and then combines the action vector with the game state vector through an interaction function to compute the Q-value, which estimates the total discounted reward expected if that action is taken. The policy is learned by maximizing the expected cumulative reward. In the model's formulation, ⊙ signifies concatenation, the GRU encoders are of size F×F, where F = 128 is the embedding dimension, and a Q-value is computed for each valid action.
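The data flow of this scoring scheme (encode the state once, encode each valid action separately, concatenate, and score) can be sketched structurally as below. This is only an illustration under stated assumptions: `toy_encode` is a hypothetical hashed bag-of-words stand-in for the GRU encoders, and the linear scorer in `q_value` stands in for a learned interaction function; neither is trained, and neither is taken from the paper.

```python
import zlib

DIM = 8  # toy vector size; the paper reports an embedding dimension of 128


def toy_encode(text, salt):
    """Hypothetical stand-in for a GRU encoder: hash tokens into a fixed-size vector."""
    vec = [0.0] * DIM
    for tok in text.lower().split():
        bucket = zlib.crc32((salt + tok).encode("utf-8")) % DIM
        vec[bucket] += 1.0
    return vec


def q_value(state_vec, action_vec, weights):
    """Concatenate state and action vectors and apply a linear scorer."""
    joint = state_vec + action_vec  # list concatenation
    if len(weights) != len(joint):
        raise ValueError("weights must match the concatenated vector length")
    return sum(w * x for w, x in zip(weights, joint))


def select_action(state_text, valid_actions, weights):
    """Score every valid action against the same state and return the argmax."""
    state_vec = toy_encode(state_text, "state:")
    scores = {
        action: q_value(state_vec, toy_encode(action, "action:"), weights)
        for action in valid_actions
    }
    best = max(scores, key=scores.get)
    return best, scores
```

In a trained agent the weights would come from optimizing the expected cumulative reward; here they are passed in by the caller purely to show how one Q-value per valid action is produced.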
(2) KG-enhanced RL Model. As a knowledge-augmented RL agent, we used the Knowledge-augmented Actor-Critic (KG-A2C) model [1]. For KG-A2C, in addition to the textual representation of the game state, the agent also builds a dynamic KG representing the state space by parsing the textual descriptions using OpenIE [3]. The KG's symbolic representation of the game states can help the agent reason effectively about the next course of action. The overall model architecture is shown in Figure 3. Each of the textual inputs is encoded with a GRU, and the KG is separately encoded with KG embeddings and a Graph Attention network [29]. In addition, the model takes into account the total score obtained so far through the binary score encoding. Formally, KG-A2C produces a KG encoding, an input encoding, and a score encoding. In the model's formulation, ⊙ signifies concatenation, the GRU encoder is of size 100×100, W and b are weights and biases, the attention weights come from the graph attention network, and h is the node feature vector. The score encoding is the binary encoding of the total score obtained so far, with a shape of 1×10, which is calculated using the cumulative reward attained up to the present moment.
The reward is first converted to a binary format with a length of 9, to which a '0' is prepended if the cumulative reward is positive and a '1' if it is negative. The final state information vector is calculated by concatenating the three input representations, and it is then used to generate actions for the agent. Overall, the model is trained with the actor-critic policy gradient. Instead of sampling directly from the valid action space, the policy network generates action templates and then populates the templates with objects from the knowledge graph. Thus, to make the learning more effective, KG-A2C also adds three auxiliary losses to encourage the model to generate valid actions, i.e., actions that would cause the game state to change: L_T, L_O, and L_E, which are the template loss, object loss, and entropy loss, respectively, defined over the valid actions, valid templates, and valid objects of a given state.
Knowledge Injection. As a baseline, we use a modified version of KG-A2C, where we utilize a single golden action sequence provided by the environment as the target, even though there may exist multiple possible golden sequences. We found this target to perform better than the original target of predicting a valid action. We devise the following knowledge-injection strategies to incorporate memory of correct actions and affordance knowledge for KG-A2C:
1. mca: On top of the baseline, we incorporate all previously correct actions by using a separate GRU encoding layer and concatenate the output vector along with the other output representations.
2. aff: The KG component in the KG-A2C model provides us with a convenient way to add more knowledge. In particular, we directly add the affordance knowledge into the KG as additional triples on top of the baseline model. For example, given the existing relation in the KG (living room, hasA, apple), we can add the affordance relation (apple, usedFor, eating). In this way, the KG encoding network can produce a more meaningful representation of the game state and potentially guide the model to produce better actions. In our experiments, we compare this approach to adding affordance knowledge using a separate GRU encoding layer, similar to the DRRN case.
3. aff ⊕ mca: We include both affordances in the KG and the memory of all previous correct actions with a separate GRU encoding layer.
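The binary score encoding described earlier in this subsection (a 9-bit magnitude preceded by one sign bit, giving a 1×10 vector) could be computed roughly as follows. This is a sketch under assumptions rather than the KG-A2C source code: the function name `score_encoding` is hypothetical, a cumulative reward of zero is treated as non-negative, the reward is assumed to be an integer, and magnitudes that do not fit in 9 bits are clamped.

```python
def score_encoding(cumulative_reward: int, n_bits: int = 9) -> list:
    """Encode the cumulative reward as [sign bit] + n_bits magnitude bits."""
    # Sign bit: 0 for a non-negative cumulative reward, 1 for a negative one.
    sign_bit = 1 if cumulative_reward < 0 else 0
    # Assumption: clamp the magnitude so it always fits in n_bits.
    magnitude = min(abs(cumulative_reward), 2 ** n_bits - 1)
    bits = [int(ch) for ch in format(magnitude, f"0{n_bits}b")]
    return [sign_bit] + bits
```

With the default of 9 magnitude bits, the result always has length 10, matching the 1×10 shape mentioned above.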

Single-step Offline Prediction (LM Methods).
(1) Pre-trained LM. We employed the RoBERTa [23] pre-trained LM due to its strong performance on various procedural understanding and commonsense reasoning tasks [20]. RoBERTa is a transformer-based, encoder-only model trained using masked language modeling. Due to its large size, we choose offline fine-tuning to train the agent. Here we view the task as multiple-choice QA. At each step, the current game state is treated as the question, and the model must predict the next action from a set of candidates. Similar to the RL agents, the model is given the environment observation (obv), inventory (inv), and task description (desc) at every step. We then concatenate this state with each candidate action and let the LM select the action with the highest score. Given the large set of possible actions, we randomly select only a small number of distractor actions during training to reduce the computational burden. The LM is trained with cross-entropy loss to select the correct action. At inference time, the model assigns scores to all valid actions, and we use top-p sampling for action selection to prevent it from being stuck in an action loop.
Knowledge Injection. We formalize three knowledge-injection strategies for the baseline RoBERTa model (Figure 4):
1. mca: Here, we enable the LM to be aware of its past correct actions by incorporating an MCA that lists them as a string, appended to the original input. Due to the token limitations of RoBERTa, we use a sliding window of fixed size, i.e., at each step, the model sees at most that many of its most recent correct actions.
2. aff: We inject affordance knowledge into the LM by first adapting it on a subset of the Commonsense Knowledge Graph [18] containing object utilities [17]. We adapt the model via an auxiliary QA task, following prior knowledge-injection work [34]. We use pre-training instead of simple concatenation because the substantial volume of affordance knowledge triples does not fit within RoBERTa's limited input length. Pre-training on affordances through an auxiliary QA task alleviates this challenge, while still enabling the model to learn the relevant knowledge. We then fine-tune our task model on top of the utility-enhanced model, as described for the baseline.
3. aff ⊕ mca: This variation simply combines mca and aff.
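The multiple-choice setup described above can be sketched at the data level as follows: sample a few distractors per training step, format the state with a sliding window over the memory of correct actions, and at inference time sample from the top-p nucleus of the action distribution. All function names and the textual field separators are assumptions for illustration; the scoring model itself (RoBERTa) is omitted, and `probs` is assumed to be an already normalized distribution over valid actions.

```python
import random


def build_choices(correct_action, valid_actions, k, rng):
    """Pick k random distractors and shuffle them together with the correct action."""
    pool = [a for a in valid_actions if a != correct_action]
    distractors = rng.sample(pool, min(k, len(pool)))
    choices = distractors + [correct_action]
    rng.shuffle(choices)
    return choices, choices.index(correct_action)


def format_input(obv, inv, desc, mca, action, window=5):
    """Join state fields and one candidate action; keep only the last `window` correct actions."""
    recent = mca[-window:] if window > 0 else []
    # The field labels and the " | " separator are hypothetical.
    return (f"task: {desc} | obs: {obv} | inv: {inv} | "
            f"correct: {', '.join(recent)} | action: {action}")


def top_p_filter(probs, p):
    """Indices of the smallest set of most probable actions whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept


def top_p_sample(probs, p, rng):
    """Sample one action index from the nucleus, proportionally to its probability."""
    kept = top_p_filter(probs, p)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Sampling from the nucleus, rather than always taking the argmax, is what lets the agent occasionally pick a different high-scoring action and escape a repeated action loop.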
(2) Instruction-tuned LM. We utilized the Swift model [22], which is based on the Flan-T5 [8] instruction-following architecture. The training follows a Seq2Seq methodology, wherein the input comprises state information, and the desired outcome is the correct action. The state information integrates task and environmental data: "desc - step number - score - action history - obv - inv - visited rooms - What action should you do next?" (Figure 5). The action history contains the last ten performed actions, each with the respective environmental reward, e.g., "go to outside (+16) -> You move to the outside."
Knowledge Injection. The Swift model inherently integrates the historical context of the preceding ten actions. Notably, in contrast to the three previously examined models, which exclusively consider the history of the last ten correct actions, the Swift model adheres to its original design by including the entire history of the ten previous actions. To establish a baseline comparable to the one used for the preceding three architectures, we omit the action history from the Swift model. The unaltered variation of Swift is denoted as the (1) mca version. Incorporating affordances into the baseline model yields the (2) aff model. Similarly, integrating affordances into the mca version yields the (3) aff ⊕ mca model. These affordances are introduced into the primary input sequence immediately following the inventory data and preceding the information about visited rooms.
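The four Swift variants differ only in which fields enter the input string, which the following sketch makes explicit. The function name `build_swift_input`, the field labels, and the separator are assumptions for illustration and are not taken from the Swift codebase; only the field order (affordances after the inventory and before the visited rooms) follows the description above.

```python
def build_swift_input(desc, step, score, history, obv, inv, visited,
                      affordances=None, use_history=True, sep=" - "):
    """Assemble the Seq2Seq source string for one step.

    use_history=False, affordances=None -> baseline
    use_history=True,  affordances=None -> mca (unaltered Swift)
    use_history=False, affordances=[..] -> aff
    use_history=True,  affordances=[..] -> aff + mca
    """
    parts = [desc, f"step {step}", f"score {score}"]
    if use_history:
        # Keep the last ten performed actions, as in the description above.
        parts.append("history: " + " | ".join(history[-10:]))
    parts.extend([obv, inv])
    if affordances:
        # Affordances go right after the inventory, before the visited rooms.
        parts.append("affordances: " + "; ".join(affordances))
    parts.append("visited rooms: " + ", ".join(visited))
    parts.append("What action should you do next?")
    return sep.join(parts)
```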

EXPERIMENTAL SETUP

Task and Evaluation Metrics
ScienceWorld is a virtual representation of the world in an intricate text-based environment in English, with a variety of objects, actions, and tasks [30]. It includes ten connected locations with 218 unique objects such as instruments, electrical components, plants/animals, substances, containers, and everyday objects like furniture, books, and paintings. There are 25 high-level actions, with up to 200,000 possible combinations per step, only a few of which have practical applications. ScienceWorld has 10 tasks with a total set of 30 sub-tasks. Due to the diversity within ScienceWorld, each task functions as an individual benchmark with distinct reasoning abilities, knowledge requirements, and varying numbers of actions needed to achieve the goal state. Moreover, each sub-task has a set of mandatory objectives that need to be met by any agent (such as focusing on a non-living object and putting it in a red box in the kitchen). The rewards for completing these tasks are highly quantized for learning purposes to guide the agent toward preferred solutions. Namely, for each performed action, the ScienceWorld environment provides a numeric score (reward) and a boolean indication of whether the task has been completed. The agent can take up to 100 steps (actions) in each episode, and its final score is scaled to fall between 0 and 100. Its score improves when both the episode goal and its sub-goals are achieved. The evaluation for an episode concludes, and the cumulative score is returned, when the agent receives information from the environment that the task has been completed or when the limit of 100 steps is reached.
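The episode protocol described above (a per-step reward and completion flag, a cap of 100 steps, and a final score between 0 and 100) can be sketched as a simple loop. `ToyEnv` is a hypothetical stand-in written for this example and does not reproduce the ScienceWorld API; the even split of reward across golden steps and the clamping of the total are likewise assumptions.

```python
class ToyEnv:
    """Hypothetical stand-in environment; not the ScienceWorld API."""

    def __init__(self, golden_actions):
        self.golden = list(golden_actions)
        self.pos = 0

    def step(self, action):
        # Reward only the next expected action of the golden sequence.
        if self.pos < len(self.golden) and action == self.golden[self.pos]:
            self.pos += 1
            reward = 100 // len(self.golden)
        else:
            reward = 0
        done = self.pos == len(self.golden)
        return reward, done


def run_episode(env, policy, max_steps=100):
    """Run until the task is completed or the step limit is reached."""
    total, steps, done = 0, 0, False
    while not done and steps < max_steps:
        action = policy(steps)
        reward, done = env.step(action)
        total += reward
        steps += 1
    # Keep the reported score within the 0-100 range.
    return max(0, min(100, total)), steps, done
```

A policy here is any callable that maps the step index to an action string; a real agent would instead condition on the observation, inventory, and task description.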
For experimentation purposes, we selected a single representative sub-task from each of the 10 tasks.The numbers in brackets in the 'Task' column of Table 1 signify the original ScienceWorld subtask number out of 30. 2 All evaluation results in this paper are averaged over three model runs on the test dataset.

Implementation and Modeling Details
Following the original methods, we use task-specific training for DRRN, KG-A2C, and RoBERTa, resulting in the creation of 10 distinct models for the 10 tasks. In contrast, Swift is trained once using the entire training dataset. While we conducted experiments with KG-A2C and RoBERTa to develop a unified model for a fairer comparison, the outcomes were detrimental to the performance. Hence, we use the same setup of DRRN and KG-A2C as in ScienceWorld. DRRN is trained with a learning rate of 1e-4, an embedding dimension of 128, and a hidden dimension of 128. KG-A2C uses a learning rate of 3e-3, a dropout rate of 0.2, an embedding dimension of 50, and a hidden dimension of 100. DRRN and KG-A2C utilized eight parallel environments to speed up the training process. These parameter values have been taken from the original ScienceWorld paper. For the RL models, we perform training for 40,000 steps, as we were able to reproduce the baseline results to within 5% of those in the original ScienceWorld paper. For the RoBERTa model, we use roberta-large for all of the experiments. For training, we use 3 epochs, a learning rate of 2e-5, 4 randomly selected distractors, and a batch size of 1. For RoBERTa's MCA variants, we use a window size of 5. For Swift, we use flan-T5-base with a learning rate of 1e-4 and a batch size of 6. The maximum source and target lengths are set to 1,024 and 16, respectively. For the Swift model, we used 8 training epochs, following the original paper.
Environment. We used two identical servers, each with an Intel(R) Xeon(R) Gold 5215 CPU @ 2.50GHz, featuring 40 cores and 256 GB of RAM. We also utilized eight NVIDIA RTX A5000 GPUs (per server) to accelerate the training and inference process.
2 Please refer to [30] for more information about the tasks and their train-test splits.

RESULTS & ANALYSIS

Effect of Knowledge Injection
Overall results. Table 1 compares our best model with the baseline: in 34 out of 40 cases, our knowledge-injection strategies improve over the baseline models. Among these cases, the most successful strategy is including affordances, which obtains the best results in 15 cases, followed by including MCA (8 cases). Including both knowledge types together led to the best results in 11 cases. The positive effect of adding affordances is confirmed in Table 2, which shows that including affordances improves the selection of the subsequent best action in 63% (25 out of 40) of cases. While the integration of affordances has a positive overall impact on the agents' action selection, in another 13 cases including affordances harms the model performance. Including the memory of previous correct actions taken by the agent also effectively enhances the decision-making capabilities of the architectures under consideration, though to a lesser extent compared to including affordances (Table 2). Given the varying effectiveness of affordances and MCA, we next study the performance variations across models and tasks.
Performance variations across architectures. To further study the isolated effect of different types of injected knowledge, we compare the model performance with and without knowledge injection for the four models. The RL-based DRRN model benefits from affordances most consistently, with 9/10 tasks showing the best performance after affordances are included (Table 1), leading to a 4% relative increase in performance (Table 2). The DRRN model relies on exploring the action space to learn optimal policies, and providing affordance information allowed the model to narrow down the search space and focus on actions that lead to successful outcomes. In terms of the overall impact across tasks, the LM variants, RoBERTa and Swift, benefit the most on average from including affordances, leading to a relative increase of 48% and 8%, respectively, over the baselines. Affordances improve the score of the KG-A2C method in 6/10 cases, yet the overall improvement over this baseline is marginal. For DRRN and KG-A2C, in slightly over half of the cases, integrating MCA improves performance in selecting the next best action by 2% relative to the baseline. Interestingly, MCA improves over the RoBERTa baseline by approximately 8% in relative terms, despite only helping in 3/10 tasks. Furthermore, eliminating the action history proves advantageous for Swift.
Performance variations across tasks. Table 2 shows granular performance per task for all four models with their corresponding knowledge-injection variants. While we note that the impact of knowledge varies across tasks, in most cases, the performance is boosted by either of the knowledge-injection strategies. We note that task 3 (Electricity) is the only one where both knowledge-injection strategies help across all architectures. Here, the DRRN and KG-A2C models experience an average increase of around 10% (relative) in performance, while the RoBERTa and Swift models show an average relative increase of 100% and 14%, respectively. An example goal in task 3 is to power a red light bulb using a renewable power source, which requires that the agents understand the affordances of the electrical circuitry involved and the renewable energy sources that can be used to power it. The affordances provide the agent with valuable information that the light bulb is capable of generating light. Furthermore, the agent acquired the ability to remember its prior successful selection of a light bulb, which facilitated the subsequent selection of the wire and solar panel while avoiding the repetition of its prior choice.
Meanwhile, we observe that tasks 8 and 10 require biological knowledge, while the affordances retrieved from ConceptNet contain information like 'dog capableOf {bark, guard}' that is not informative for inferring the lifespan or life stages of a dog. In contrast, the addition of affordances significantly improves the performance of the RoBERTa model in tasks 2 and 7, leading to 13x and 3x better performance, respectively. Moreover, RoBERTa with affordances achieves perfect scores 14 and 9 times for tasks 4 and 7, respectively, which is rare, especially given the relatively long sequences of correct actions in the ScienceWorld tasks. Notably, tasks 4 and 7 have an average length of less than 10, indicating that the model performs well in shorter tasks. The highest improvement for RoBERTa happens on task 7, which has the shortest sequence of correct actions on average. Swift (aff ⊕ mca) experienced a substantial performance improvement (with a 3.5x increase over the baseline) on task 3, where the agent successfully achieved the goal state in 40% of the episodes. Moreover, in 96% of the cases for task 4, the affordance variant of Swift was able to get a perfect score of 100. These results suggest applying techniques that tailor the learning process to the specific task at hand, like meta-learning [16], to empower the system to intelligently discern and apply the most suitable knowledge for optimal performance and adaptability.

Effect of Affordances
Given that affordances are a more effective knowledge-injection strategy than including the MCA, we perform a case study of injecting affordances in different models and we compare ways to inject affordances into KG-A2C.
Case study. Figure 6 presents a case study regarding the models' ability to incorporate affordance information for task 4. We opted for this task of finding a non-living object given the relatively high performance of all models on it. We see that the affordance 'wire is capable of connect' enhanced RoBERTa's comprehension of wires as non-living objects, yielding a positive environment reward at step 8. The LMs, as well as DRRN, also utilized the affordances associated with the table object (e.g., 'table is used for support') to identify it as a non-living object. The affordances associated with the term box (such as 'box is used for contain' and 'box is used for hold') enhanced the LMs' grasp of the box's attributes, facilitating the execution of the final action. While both LMs benefited from the affordance knowledge, RoBERTa required 63 steps to finish the episode, whereas Swift completed the task in just five steps. This supports our experimental finding that, compared to Swift, RoBERTa takes more time to pick the correct action. The RL agents (DRRN and KG-A2C) faced challenges in achieving perfect scores on this sub-task. KG-A2C struggled to reach the intended destination (the bedroom), often navigating to other locations and performing arbitrary actions. While DRRN managed to reach the bedroom and obtained a slightly better score, it had difficulty locating the box despite the provision of affordances. This case study suggests that LMs such as RoBERTa and Swift apply affordance knowledge more effectively than RL methods for such tasks.
Optimal way to inject affordances. We chose KG-A2C for the ablation study, as it has a larger number of modular components (KG, graph attention, and actor-critic module), which can be flexibly manipulated for experimentation. Moreover, KG-A2C benefits the least from affordance injection. We explore multiple variations of injecting affordance knowledge into KG-A2C: adding it as input to the observation, inventory, or description; creating a separate GRU encoding layer for affordances; and adding affordances to the KG itself. We evaluate the performance of each method on three sub-tasks: easy (task 4), medium (task 6), and hard (task 5), based on the number of actions and the performance of the baseline models. The results in Figure 7 consistently suggest that incorporating affordances as part of the KG performs better than including them as part of the other components (e.g., description) or encoding them separately. A possible explanation is that by adding affordances to the KG, we allow the agent to have a more structured and separate representation of the environment, which in turn helps the agent make more informed decisions. Adding affordances as strings concatenated to inputs or adding a separate encoding layer hurts performance; we think that these methods cause information overload or interference with the original inputs, thus confusing the agent. The separate encoding layer also introduces additional complexity to the architecture, making it harder for the agent to learn and generalize, especially considering the limited data size. Meanwhile, we note that an alternative approach to incorporating affordances is self-supervision through auxiliary tasks, which brings significant improvement for some tasks in the case of RoBERTa and suggests an avenue for RL-LM integration.
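To make the contrast between the injection variants concrete, the following minimal sketch compares concatenating verbalized affordances to a textual input with adding them as triples to a knowledge graph. It is an illustration under stated assumptions, not the KG-A2C implementation: the triple format, the function names (`verbalize`, `inject_into_text`, `inject_into_kg`), the example observation, and the substring-based relevance filter are all hypothetical, and the affordance phrases only mirror the style of those shown in Figure 6.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical affordance triples, phrased like the examples in Figure 6.
AFFORDANCES: List[Triple] = [
    ("box", "is used for", "contain"),
    ("wire", "is capable of", "connect"),
    ("table", "is used for", "support"),
]


def verbalize(triples: List[Triple]) -> str:
    """Turn triples into a comma-separated string of short phrases."""
    return ", ".join(f"{subj} {rel} {obj}" for subj, rel, obj in triples)


def inject_into_text(text: str, triples: List[Triple]) -> str:
    """Variant A (assumed): append affordances of mentioned objects to a text input."""
    # Naive relevance filter: keep triples whose subject occurs in the text.
    relevant = [t for t in triples if t[0] in text.lower()]
    if not relevant:
        return text
    return f"{text} Affordances: {verbalize(relevant)}."


def inject_into_kg(kg: Set[Triple], triples: List[Triple]) -> Set[Triple]:
    """Variant B (assumed): add affordances as edges for entities already in the graph."""
    entities = {s for s, _, _ in kg} | {o for _, _, o in kg}
    return kg | {t for t in triples if t[0] in entities}


if __name__ == "__main__":
    observation = "You see a box and a wire."
    kg = {("you", "in", "bedroom"), ("box", "in", "bedroom")}
    print(inject_into_text(observation, AFFORDANCES))
    print(sorted(inject_into_kg(kg, AFFORDANCES)))
```

In the text variant, the affordances lengthen the same token sequence that the encoder must already process, whereas in the graph variant they stay separate from the raw observation as additional edges. This is one way to picture the "structured and separate representation" explanation given above.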

CONCLUSIONS AND OUTLOOK
This paper investigated whether current AI agents can use knowledge injection in semi-Markov text-based games to act coherently and improve their ability to learn from the environment through enhanced contextual awareness. We proposed to inject knowledge about affordances and to keep a memory of previous correct actions in diverse architectures. Through rigorous evaluation, we showed improvement over the four baseline models across ten elementary school tasks. Among our injection methods, affordance knowledge was more beneficial than the memory of correct actions. The variable effect across tasks was frequently due to the relevance of the injected knowledge to the task at hand, with certain tasks (e.g., task 3: Electricity) benefiting more from the injection. Injecting affordances was most effective via KGs; incorporating them as raw inputs increased the learning complexity for the models. The insights into the usage of knowledge injection for improving the performance of RL agents and LMs in complex IF games have potential implications for interactive applications beyond the gaming domain, including customer service chatbots and personal assistants.
As the resulting models' performance is still far from ideal, we envision two future directions toward more coherent and efficient models. First, our results suggest that the models have complementary strengths and weaknesses: the RL model performed the best on the task Matter (task 1), the KG-augmented model yielded the best performance on the task of Measurement (task 2), and the language models outperformed the others on Biology I (task 5), Biology II (task 7), and Biology IV (task 10). Inspired by this insight, we propose to enhance the performance of the LMs by incorporating an RL policy network [35]. Second, few-shot prompting of large LMs has recently shown promise on reasoning tasks, as well as clear benefits from interactive communication and input clarification [25]. Exploring their role in interactive tasks, either as solutions that require less training data or as components that can generate synthetic data for knowledge distillation to smaller models, is another promising future direction.

Figure 1: Illustration of an Interactive Fiction (IF) game, where an agent must perform the task of picking a fruit (e.g., an apple) and then placing it in a blue box in the kitchen.

Figure 2: DRRN architecture, enhanced with the memory of previous correct actions and object affordances.

Figure 6: Actions taken by affordance models on Task 4. Blue = step index, green = cumulative score, and yellow = correct action.

Figure 7: Effect of five ways to add affordances in KG-A2C.

Table 2: Baselines () comparison with knowledge-injected model configurations ("": Affordance; "m": Memory of Correct Actions), based on average cumulative reward across task variants. Bold signifies better performance over baseline.

find a non-living object) & Variation: 292 [Affordance Model] Your task is to find a(n) non-living thing. First, focus on the thing. Then, move it to the yellow box in the bedroom. Affordances: table is capable of chip, table is used for support, box is used for contain, box is used for hold, box is used for seat, box is capable of assemble, box is capable of empty, _ _ _, wire is capable of connect, wire is capable of corrode, wire is capable of cut, wire is capable of crisscross, wire is capable of dangle.