Exploring Large Language Models for Trajectory Prediction: A Technical Perspective

Large Language Models (LLMs) have recently been proposed for trajectory prediction in autonomous driving, where they can potentially provide explainable reasoning about driving situations. Most studies use versions of OpenAI's GPT, while open-source alternatives have not yet been evaluated in this context. In this report, we study the trajectory prediction performance of these open-source models as well as their ability to reason about the situation. Our results indicate that open-source alternatives are feasible for trajectory prediction. However, their ability to describe situations and reason about potential consequences of actions appears limited and warrants future research.


INTRODUCTION
Human-robot interaction (HRI) is pivotal in integrating advanced robotic technologies into our daily lives, where the explainability of these systems plays a crucial role in ensuring that interactions are intuitive, safe, and beneficial for all. This is particularly evident in autonomous vehicles, where understanding human behavior, such as predicting pedestrian movements and interpreting other drivers' actions, is fundamental. Trajectory prediction for autonomous vehicles involves encoding information gathered from the surroundings to generate safe and feasible trajectories. Rule-based methods, while offering interpretability, often struggle to handle the diversity of real-world driving scenarios. Conversely, data-driven models excel by learning from extensive human-driving behavior datasets but are criticized for their 'black box' nature, which compromises their interpretability [16]. Both rule-based and learning-based methods lack the inherent common-sense reasoning of human driving, limiting their effectiveness in addressing rare and complex driving situations. This highlights the need for models that not only embody common-sense reasoning but also strike a balance between explainability and adaptability in trajectory prediction.
Recent literature shows efforts [8] to infuse human-like reasoning into autonomous vehicles, drawing inspiration from the capabilities of LLMs. One such strategy is re-imagining trajectory prediction as a language modeling problem. This method converts motion planner inputs, like detection and prediction outcomes, into unified language tokens. LLMs then process these tokens and articulate future driving trajectory waypoints as natural language descriptions; the models are fine-tuned for this specific task. Another strategy employs LLMs hierarchically within closed-loop environments, where the system generates queries influenced by current observations and past experiences. These queries then direct the decision-making process, with the system continually assessing and learning from its decisions, enhancing its ability to respond appropriately in future scenarios [2,14].
Modeling trajectory prediction using LLMs is key in human-robot interaction for understanding and anticipating human behavior, ensuring efficient, safe, and intuitive collaboration between humans and robots. Existing LLM-based methods use versions of OpenAI's GPT for trajectory prediction. However, several open-source alternatives are available that have not been assessed for this task. These alternatives, which potentially offer diverse approaches and methodologies, remain untested for their efficacy and applicability. To this end, in this report, we explore LLMs for the ego-vehicle trajectory prediction problem. We examine some open-source alternatives to OpenAI's GPT and how to fine-tune them using adapters for trajectory prediction tasks. Figure 2 illustrates the framework for using LLMs for the trajectory prediction task.
In our experimental analysis, we aim to answer the following research questions: (1) Can open-source models that run on a single GPU achieve similar results for trajectory prediction as querying the OpenAI API? (2) When using LLMs for trajectory prediction, can we benefit from the general knowledge acquired by the model and provide meaningful explanations of the situation for a human?

PROBLEM FORMULATION
Motion planning in the context of autonomous driving aims to devise a future trajectory, denoted as $T$, that ensures safety and comfort. The trajectory $T$ is defined as a sequence of waypoints corresponding to $t$ distinct timestamps, $T \in \mathbb{R}^{t \times 2}$:

$$T = \{(x_1, y_1), \ldots, (x_t, y_t)\}$$

Here, $(x_i, y_i)$ represents the two-dimensional waypoint corresponding to the vehicle's location at timestamp $i$. The trajectory prediction inputs encompass the historical waypoints, along with the outputs from perception and prediction systems. These outputs include, for example, detected object bounding boxes and projected trajectories indicating their future movements. To conceptualize trajectory prediction as a problem within the domain of large language modeling, the trajectory can be represented as a sequence of words that concisely describe it:

$$K(T) = \{w_1, \ldots, w_n\}$$

Here, $w_i$ represents the $i$-th word in the sequence, obtained through the application of a large language tokenizer, represented by $K$. By adopting this linguistic representation, the trajectory prediction problem can subsequently be reformulated as a language modeling problem:

$$\mathcal{L}_{LM} = -\sum_{i=1}^{n} \log P(\hat{w}_i \mid w_1, \ldots, w_{i-1})$$

Here, $\hat{w}$ and $w$ correspond to the words from the predicted trajectory $\hat{T}$ for the ego-vehicle and the human driving trajectory $T$, respectively. LLMs can effectively generate trajectories by maximizing the probability associated with the occurrence of words derived from the human driving trajectory $T$.
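As a rough illustration of this reformulation, the sketch below serializes waypoints into text, splits the text into tokens, and evaluates the negative log-likelihood of the human-driving tokens. The text format with two decimals, the regex-based `tokenize` function, and the `uniform_model` are assumptions made for this example; they stand in for the real LLM tokenizer and model and are not taken from the report.

```python
import math
import re

def trajectory_to_text(waypoints):
    """Serialize (x, y) waypoints into a compact string."""
    return "[" + ", ".join(f"({x:.2f},{y:.2f})" for x, y in waypoints) + "]"

def tokenize(text):
    """Toy tokenizer (assumption): each number and punctuation mark is one token."""
    return re.findall(r"-?\d+\.\d+|[\[\](),]", text)

def language_modeling_loss(target_tokens, next_token_prob):
    """Negative log-likelihood of each target token given the tokens before it."""
    loss = 0.0
    for i, token in enumerate(target_tokens):
        loss -= math.log(next_token_prob(target_tokens[:i], token))
    return loss

def uniform_model(prefix, token, vocab_size=100):
    """Hypothetical stand-in for an LLM: every token is equally likely."""
    return 1.0 / vocab_size

# Two waypoints of a made-up human-driving trajectory.
human_trajectory = [(0.12, 1.85), (0.31, 3.72)]
text = trajectory_to_text(human_trajectory)
tokens = tokenize(text)
loss = language_modeling_loss(tokens, uniform_model)
```

A fine-tuned model would assign higher probability to the human-driving tokens than the uniform stand-in does, which lowers this loss.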

EXPERIMENTS
The application of zero-shot prompting in LLMs for trajectory prediction yields sub-optimal outcomes [9,13]. To address this, our experiments involve fine-tuning LLMs specifically for the downstream task of trajectory prediction. However, fine-tuning is computationally intensive and time-consuming because of the large model size. A more efficient strategy involves the use of adapters to train the model on domain-specific data while maintaining the LLMs in a frozen state, which represents an advantageous design choice adopted in this study.

Experimental Setup
We conduct Parameter-Efficient Fine-Tuning (PEFT) [7] using a combination of five models and two adapters. Our implementation uses open-source models from the Hugging Face Transformers library [15]: OpenAI's GPT-2, as well as four recently proposed models with 7B parameters that can be trained on a single GPU.
The L2 metric is adopted as the evaluation metric. The average L2 error is determined by calculating the distance between corresponding waypoints in the predicted and ground-truth trajectories. This metric effectively reflects the extent to which a predicted trajectory approximates a human-driving trajectory. The input prompt requests the generation of a waypoint (x, y) every 0.5 seconds. Therefore, we evaluate L2 over 2, 4, and 6 waypoints of the predicted trajectory for the 1-, 2-, and 3-second measures.
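The metric described above can be sketched as follows. This is a minimal illustration, assuming that the error at each horizon is the mean Euclidean distance over all waypoints up to that horizon; the function names and the example trajectories are hypothetical.

```python
import math

def average_l2(predicted, ground_truth, num_waypoints):
    """Mean Euclidean distance over the first `num_waypoints` waypoint pairs."""
    # zip stops at the shorter trajectory, so short predictions are compared
    # only for the steps that are actually available.
    pairs = list(zip(predicted, ground_truth))[:num_waypoints]
    if not pairs:
        raise ValueError("no waypoints to compare")
    total = sum(math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in pairs)
    return total / len(pairs)

def l2_at_horizons(predicted, ground_truth, step_seconds=0.5, horizons=(1, 2, 3)):
    """Average L2 error for each horizon given one waypoint per `step_seconds`."""
    return {
        h: average_l2(predicted, ground_truth, int(round(h / step_seconds)))
        for h in horizons
    }

# Made-up example: the prediction drifts sideways by 0.1 m more at every step.
ground_truth = [(0.0, float(k)) for k in range(1, 7)]
predicted = [(0.1 * k, float(k)) for k in range(1, 7)]
errors = l2_at_horizons(predicted, ground_truth)
```

In this example the error grows with the horizon, from roughly 0.15 m at 1 second to roughly 0.35 m at 3 seconds.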

PEFT with LoRA
For fine-tuning with LoRA, satisfactory results were observed for both the Llama-7B and Llama-7B-Chat models after just three training epochs. However, extending the training duration led to a decrease in the quality of results produced by the Llama-7B model due to overfitting. The outcomes of fine-tuning with LoRA are detailed in Table 1. Among the models tested, LoRA fine-tuning yielded the most accurate results for the Llama-7B and Llama-7B-Chat models. These two models outperform the results reported in [8], based on the L2 metric. The other three models also achieve good results on the L2 metric for the predicted trajectories. The results for GPT-Driver are the ones reported in their paper [8]. In Figure 3, we provide examples of the model output for all five models, fine-tuned with LoRA. Although the predicted trajectories are not too far from the ground truth, the reasoning about the situation and the understanding of the environment are not always consistent. We also experimented with asking the model follow-up questions, such as "Are there any other vehicles on the road?" or "What would be the effects of a different meta action?", but the responses did not contain precise information. Figure 1 shows an example scene from the nuScenes dataset [1], with the ground-truth trajectory and the trajectories predicted by our fine-tuned models plotted, as well as the bounding boxes of the detected vehicles for this scene.
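For readers unfamiliar with LoRA, the following sketch shows the low-rank update on a single linear layer in plain Python. It is an illustration of the general idea only, not the implementation used in the experiments; the layer sizes, the rank, the scaling factor `alpha`, and all function names are assumptions chosen for the example.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def random_matrix(rows, cols, rng, scale=0.1):
    """Matrix with Gaussian entries, used here for toy weights and inputs."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def lora_forward(x, w_frozen, lora_down, lora_up, alpha):
    """Compute x W + (alpha / r) * (x A) B, with W frozen and A, B trainable."""
    rank = len(lora_up)                              # lora_up has shape (r, d_out)
    base = matmul(x, w_frozen)                       # frozen pretrained path
    update = matmul(matmul(x, lora_down), lora_up)   # low-rank adapter path
    scale = alpha / rank
    return [[b + scale * u for b, u in zip(base_row, update_row)]
            for base_row, update_row in zip(base, update)]

def num_params(matrix):
    """Number of entries in a matrix."""
    return len(matrix) * len(matrix[0])

rng = random.Random(0)
d_in, d_out, rank, alpha = 8, 8, 2, 4.0
w_frozen = random_matrix(d_in, d_out, rng)
lora_down = random_matrix(d_in, rank, rng)        # random initialization
lora_up = [[0.0] * d_out for _ in range(rank)]    # zero init: adapter starts as a no-op
x = random_matrix(1, d_in, rng, scale=1.0)

y_initial = lora_forward(x, w_frozen, lora_down, lora_up, alpha)
trainable = num_params(lora_down) + num_params(lora_up)
frozen = num_params(w_frozen)
```

Only the two small matrices would receive gradient updates during fine-tuning. For a 7B-parameter model the saving is far larger than in this toy layer, which is what makes single-GPU training practical.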

PEFT with P-tuning
With P-tuning, most models and data instances did not yield meaningful outputs. The outputs were characterized either by text in an inconsistent format, or by formats that were nearly correct but lacked numerical values for the trajectory, as illustrated in Figure 4(b) and Figure 4(e). The findings for Llama-7B and Llama-7B-Chat, when fine-tuned with P-tuning, are presented in Table 2. Notably, a high incidence of empty trajectories resulted in subpar performance on the L2 metric. Outputs from the other three models were not meaningful, and hence their results are not included. A single-word alteration in the prompt can dramatically affect the model's performance [6]. P-tuning combines trainable continuous prompt embeddings with discrete prompts. Given a discrete prompt, P-tuning appends continuous prompt embeddings to the discrete tokens and feeds them into the language model. The P-tuning approach was unsuccessful in our case for trajectory prediction, as specific prompts are crucial for achieving accurate outcomes. Altering certain words alters the states and environmental observations, leading to incorrect outputs. The GPT-2 model predictions in Figure 4(a) offer insights into this issue.
(a) The U.S. Department of Defense is developing a new system to detect and track the movement of a U.S. military plane, the Pentagon said Tuesday. The system, called the Joint-Missioned Tracking System (JMT), will be deployed to the U.S. military's Joint Expeditionary Force (JEF), which is planning to deploy to the region in the next two years. The JMT is a system that can detect and track a plane's flight path, and can also track the plane's trajectory.
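The way P-tuning combines continuous and discrete prompts can be sketched as below. This is a simplified illustration that only shows how the input sequence is assembled: it places the virtual-token embeddings in front of the discrete tokens and omits the prompt encoder that usually produces them. The sizes, token ids, and function names are assumptions made for the example.

```python
import random

def embed_tokens(token_ids, embedding_table):
    """Look up the frozen embeddings of the discrete prompt tokens."""
    return [list(embedding_table[t]) for t in token_ids]

def build_p_tuning_input(token_ids, embedding_table, prompt_embeddings):
    """Place trainable virtual-token embeddings in front of the token embeddings."""
    virtual = [list(v) for v in prompt_embeddings]
    return virtual + embed_tokens(token_ids, embedding_table)

rng = random.Random(0)
vocab_size, hidden_size, num_virtual_tokens = 50, 4, 3

# Frozen embedding table of the language model (toy values).
embedding_table = [[rng.gauss(0, 1) for _ in range(hidden_size)]
                   for _ in range(vocab_size)]
# Continuous prompt embeddings: the only parameters that would be trained.
prompt_embeddings = [[rng.gauss(0, 1) for _ in range(hidden_size)]
                     for _ in range(num_virtual_tokens)]

token_ids = [7, 12, 12, 31]   # hypothetical ids of the discrete prompt
model_input = build_p_tuning_input(token_ids, embedding_table, prompt_embeddings)
```

Because the virtual tokens are free vectors rather than words, training can move them to positions that do not correspond to any meaningful instruction, which is one plausible reading of the off-topic output shown in Figure 4(a).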

Failure Analysis
For all models, a certain number of predictions resulted in empty trajectories. This issue was particularly observed with the Mistral-7B model, where the majority of instances failed to yield trajectories in the correct format. The models failed to produce a correct trajectory prediction for several reasons:
• Empty output message.
• Messages that deviate from the prescribed output format. An illustration of this can be found in Figure 4(b).
• Messages that adhere to the output format yet fail to include a trajectory. Instead, they might provide information about the environment or other scene participants. An example is detailed in Figure 4(c).
• Messages containing a complete trajectory but in a format that does not align with the expected standard. These are excluded from consideration, as they do not constitute a valid output in general cases.
• Trajectories that contain fewer than six predictions. In such scenarios, we attempt to evaluate the predictions based on the available data and compare them for the corresponding number of steps. An example is provided in Figure 4(d).
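The failure categories above could be detected automatically with a small parser such as the one sketched here. The expected output format (a `Trajectory:` marker followed by `(x,y)` pairs), the category labels, and the function name are assumptions made for illustration; the report does not specify the exact format or the extraction code.

```python
import re

# Matches one waypoint such as "(1.25,-0.40)" or "( 3 , 4.5 )".
WAYPOINT = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")

def classify_output(message, expected_waypoints=6, marker="Trajectory:"):
    """Return (category, waypoints) for one model output message."""
    if not message.strip():
        return "empty", []
    if marker not in message:
        # Also covers trajectories written in a non-standard format.
        return "wrong_format", []
    tail = message.split(marker, 1)[1]
    waypoints = [(float(x), float(y)) for x, y in WAYPOINT.findall(tail)]
    if not waypoints:
        return "missing_trajectory", []
    if len(waypoints) < expected_waypoints:
        # Kept so that the available steps can still be evaluated.
        return "short_trajectory", waypoints
    return "valid", waypoints[:expected_waypoints]

examples = [
    "",
    "The road ahead is clear.",
    "Trajectory:\nThere is a car in front of the ego vehicle.",
    "Trajectory:\n[(0.1,1.0), (0.2,2.0)]",
    "Trajectory:\n[(0.1,1.0), (0.2,2.0), (0.3,3.0), (0.4,4.0), (0.5,5.0), (0.6,6.0)]",
]
categories = [classify_output(m)[0] for m in examples]
```

Short trajectories are returned together with their waypoints, which matches the choice described above to compare them against the ground truth for the corresponding number of steps.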

CONCLUSIONS AND FUTURE WORK
This research highlighted the effective use of open-source LLMs in the field of trajectory prediction. Through detailed experimental analysis, it was shown that when these open-source LLMs were fine-tuned for specific downstream tasks, they yielded results comparable to those of their GPT-based counterparts.
In the context of HRI, the use of LLMs for driving tasks would potentially allow the models to reason and provide explanations about the driving situation. This work serves as a pioneering step in employing open-source LLMs for trajectory prediction. While it does not introduce a novel learning method for training adapters in LLMs, it paves the way for future research in this direction, potentially exploring innovative training techniques and applications in trajectory prediction and beyond.

Figure 2: Pipeline for deploying LLMs to predict trajectories for autonomous vehicles. It includes fine-tuning the LLMs with adapters on a prompt database generated by off-the-shelf detection and prediction algorithms. The language output is converted back to planned trajectories using a decoder. Note: the decoder is a simple regressive extraction of trajectories from language outputs.

Figure 3: Example of assistant message output from all tested models, fine-tuned with Low-Rank Adaptation (LoRA).

Figure 4: Failure cases of output generated from our models. (a) P-tuning changed one too many words from the input prompt (GPT-2 with P-tuning); (b) Message not following the expected format (Llama-7B with P-tuning); (c) Message in the correct format, but with missing trajectory (Zephyr-7B with LoRA); (d) Trajectory with fewer than 6 predicted waypoints (Mistral-7B with LoRA).

Table 2: Results from PEFT fine-tuning of the Llama2 models with P-tuning.