Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming Prevention

Powerful generative Large Language Models (LLMs) are becoming popular tools amongst the general public as question-answering systems, and are being utilised by vulnerable groups such as children. With children increasingly interacting with these tools, it is imperative for researchers to scrutinise the safety of LLMs, especially for applications that could lead to serious outcomes, such as online child safety queries. In this paper, the efficacy of LLMs for online grooming prevention is explored both for identifying and avoiding grooming through advice generation, and the impact of prompt design on model performance is investigated by varying the provided context and prompt specificity. In results reflecting over 6,000 LLM interactions, we find that no models were clearly appropriate for online grooming prevention, with an observed lack of consistency in behaviours, and potential for harmful answer generation, especially from open-source models. We outline where and how models fall short, providing suggestions for improvement, and identify prompt designs that heavily altered model performance in troubling ways, with findings that can be used to inform best practice usage guides.


INTRODUCTION
Large Language Models (LLMs), such as ChatGPT, have rapidly emerged as powerful generative tools that can be used by non-AI experts in a wide variety of tasks. According to the latest available data, ChatGPT currently has around 180.5 million users worldwide [11], with an unknown percentage of these users being children and a lack of air-tight age verification in most countries. In early 2023, headlines rapidly appeared regarding the potential for children to exploit LLMs to do their homework for them, and the issues this posed to education [24]. Less talked about were other tasks that children could turn to LLMs for, such as providing a private advice source regarding their online interactions. There have already been suspected cases where adults' interactions with AI chatbots have resulted in negative and harmful outcomes [42], and whilst LLMs have a host of potential positive applications, such as teaching children supportive self-talk [20], tragic cases can be expected to occur as LLM use becomes a standard practice in modern society. Children may be particularly vulnerable to misusing AI and not understanding the possible outcomes from interacting with these generative models, especially when sharing personal and sensitive information, a phenomenon that is already occurring [33]. For example, they may buy into LLM 'hallucinations', an effect where a model produces outputs that seem plausible but which are not factually correct. It may become necessary for children to be taught how to interact with AI safely [40]. However, LLM creators and researchers must also work to ensure the safety of these generative models for child-oriented tasks, especially those with the most room for negative outcomes, such as when a model is handling queries about mental health and online safety topics.
This paper focuses on the issue of online grooming and the potential application of LLMs for spotting concerning interactions and generating helpful, context-relevant advice. With children already using LLMs for everyday tasks such as educational purposes, it can easily be imagined that children may turn to LLMs for advice about online interactions, making it a necessity for publicly accessible LLMs to be prepared for this use case and to perform in a manner that is ideally helpful, but at the very least not harmful. Therefore, in a series of experiments involving the evaluation of over 6,000 LLM interactions, this paper explores the performance of 6 LLMs on three related but distinct tasks: providing general non-contextual online safety advice, identifying online grooming in conversations between decoy children (i.e., adults posing as children online) and real predators, and generating targeted context-specific advice for the child participant in these conversations. Further, we investigate the impact of prompt design, to cover factors such as how LLM performance differs when the model has access to a full chat transcript versus a secondhand description of the events in the chat, how LLMs alter responses to questions apparently asked by children, and whether LLMs identify online grooming risks without specific mention of this risk. Our results can be used to inform best-practice use guides, and to identify potential weak spots in generative models intended for use by children.

RELATED WORK

2.1 Large Language Models
Large Language Models (LLMs) [25,34], sometimes referred to as Pre-trained Language Models (PLMs) [14,21,27], are an advanced form of Language Model (LM) [2,10,32] that train deep learning algorithms on massive amounts of data, with up to billions of parameters, allowing for exceptional performance in a vast array of Natural Language Processing (NLP) tasks. They have quickly become integral to Natural Language Generation (NLG) tasks, a challenging sub-category of NLP that focuses on text generation from a wide array of input data forms. LLMs are able to perform exceptionally well due to transformers [36], which can model sequential data using a self-attention module, and the massive amounts of data available on the Internet for training these models. Popular LLMs also utilise in-context learning [4] and Reinforcement Learning from Human Feedback (RLHF) [6,45], making their performance improve even more over time.
A specific type of NLG task is Question-Answering (QA), where a model must have a backlog of knowledge beyond the input sequence to generate an answer comparable to that of a human with prior experience, knowledge, and semantic inferring capabilities. QA and dialogue systems in general are designed to interact with humans using natural language, requiring a model that can represent both language and knowledge of a vast array of topics. Clearly, LLMs are well suited to this application, and can be fine-tuned further for downstream tasks. However, NLP in the wider sense is moving away from the pre-train then fine-tune paradigm, towards a pre-train and prompt paradigm [16]. Even without fine-tuning, LLMs can perform ad-hoc NLG tasks from a simple natural language prompt, allowing for downstream task outputs without changing the underlying model structure. This allows for non-technical general public users to utilise the power of these complex models without having to understand the mechanisms behind them.

Prompt Engineering
Prompt engineering [5,13,15,18,23,39] has emerged as a method for constructing prompts to allow LLMs to work at their maximal effectiveness, directing the generated output to be as relevant and helpful as possible. Therefore, LLMs need to be evaluated for not only their performance on a task, but also for the factors that affect this performance on the prompt level. Prompt engineering has already been used by researchers to explore LLMs for a wide variety of tasks, such as text-to-image generation [3,17], human-AI co-writing tasks [8,30], medical applications [19,38], programming [9], and many more.
Whilst prompt engineering is quickly becoming a hot topic in the LLM research arena, it is unlikely that all non-AI experts will catch on to this phenomenon, especially children who may be completely unaware of the way the LLM they are interacting with is producing output. Recent research has found that even adults struggle with 'prompt literacy', with many factors causing barriers to effective prompt design [44]. As prompt design heavily impacts LLM performance, it is important to factor in that non-AI experts may struggle to improve prompts. It is therefore not sufficient for LLMs to be evaluated purely by experts. Future LLM evaluations need to include non-experts in the discussion, to improve LLM safety and performance for all users. However, due to the sensitive nature of our research, centred around online grooming, it would not be ethical to involve child users in these evaluations.

Online Child Safety
LLMs are already being utilised by researchers in child education [1,43], but LLMs could also prove to be a powerful tool in online child safety, with the potential to spot harmful behaviour in online interactions and to disseminate context-relevant, easily understandable, and helpful advice. Recent research has explored the topic of AI for child-oriented tasks, such as using LLMs to help children discuss their feelings [31], research on age-appropriate AI [37], and Conversational Artificial Intelligence (CAI) systems for interactive storytelling [7]. Other studies have focused on non-AI child-oriented research around child online safety, such as the idea of self-regulation [12,41] and interventions [22,26,35]. However, due to the rapid emergence of LLMs in modern society, there is a gap that needs to be bridged between online child safety and LLM research. This is a research area that will need extensive and rapid exploration, to protect children using LLMs from harmful behaviour, and to examine the potential uses of LLMs in child online safety applications.

EXPERIMENT DESIGN
To explore the efficacy of LLMs for child safety tasks in cyberspace, 6 popular open- and closed-source LLMs were prompted to test for their suitability in three related tasks: providing general online safety advice, spotting online grooming, and providing advice given online grooming conversations. To evaluate the effects of prompt design, 3 prompt variation factors were explored that we deemed to be the most relevant for these tasks: given context vs. described (testing how well the LLM extracts context from given conversation snippets, and the effect of removing this processing step by providing a concise summary of the conversation instead of the raw text), direct vs. indirect Point-of-View (POV) of the participant asking the prompt (either being given indirectly as a bystander to the situation or directly from the child), and prompt specificity (either explicitly mentioning online grooming in the prompt or leaving the prompt as a more general advice question).
Figure 1 shows the experiment design flow. The general online safety advice task resulted in 4 prompts exploring the prompt design paradigms of specificity and indirect vs. direct. The spotting online grooming task resulted in 32 prompts, as there were 4 prompt templates, with each template applied to 8 scenarios (i.e., 8 chat snippets / chat descriptions). Lastly, the online grooming advice task resulted in 64 prompts, as it had 8 prompt templates exploring all three prompt design paradigms, with each template applied to the 8 scenarios. To test for consistency in performance, each of these 100 prompts was given to each of the 6 models 10 times, resulting in a total of 6,000 answers collected.
Each answer was then evaluated on predetermined rubrics, with scores averaged over the 10 runs.
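The prompt counts above can be checked with a short enumeration. The following is a minimal sketch, not the authors' tooling: the factor labels are our own shorthand, and the assumption that the 4 spotting templates vary context and POV (with grooming always named by the task itself) is inferred from the text rather than stated in it.

```python
from itertools import product

# The six evaluated models, as named in the paper.
MODELS = ["ChatGPT 3.5", "ChatGPT 4", "PaLM 2", "Claude 2", "LLaMA 2", "Mistral"]
SCENARIOS = [f"S{i}" for i in range(1, 9)]  # S1..S8
RUNS_PER_PROMPT = 10

# Prompt design factors; the value labels are illustrative shorthand.
CONTEXT = ["given", "described"]
POV = ["indirect", "direct"]
SPECIFICITY = ["unspecified", "specified"]


def build_prompts():
    """Enumerate (task, scenario, context, pov, specificity) tuples."""
    prompts = []
    # General advice: specificity x POV, no scenario attached (4 prompts).
    for spec, pov in product(SPECIFICITY, POV):
        prompts.append(("general_advice", None, None, pov, spec))
    # Spotting grooming: 4 templates x 8 scenarios (32 prompts).
    # Assumption: the 4 templates are context x POV.
    for (ctx, pov), scenario in product(product(CONTEXT, POV), SCENARIOS):
        prompts.append(("spotting", scenario, ctx, pov, None))
    # Grooming advice: all three factors, 8 templates x 8 scenarios (64 prompts).
    for (ctx, pov, spec), scenario in product(
        product(CONTEXT, POV, SPECIFICITY), SCENARIOS
    ):
        prompts.append(("advice", scenario, ctx, pov, spec))
    return prompts


prompts = build_prompts()
total_answers = len(prompts) * len(MODELS) * RUNS_PER_PROMPT
print(len(prompts), total_answers)
```

The enumeration yields 4 + 32 + 64 = 100 prompts, and 100 prompts across 6 models and 10 runs gives the 6,000 answers reported.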
Answer feedback was not provided during testing to avoid biasing models to improve throughout. Repeated runs of prompts were done by starting new 'conversations' (i.e., a new LLM interaction), with further prompts being given within the same conversation. Only the information available in the chat snippet was provided in the prompt. However, it was given that one participant is a child and one is an adult. Not including this information would completely change the context of the conversations, and given the use case of children asking LLMs for advice about their conversations, it is fair to assume this context would be available.
The LLaMA 2 13B-chat model was chosen as the chat model was more applicable to this use case, and 13B was a middle ground between the three available sizes (7B, 13B, and 70B). Mistral AI specifically did not tune their models for safety to allow users to test and refine moderation based on individual use cases. However, they do provide a guard-railing tutorial. This was not used, as the purpose of this experiment was to explore the basic behaviours of these models for a child-oriented task. Tokenisation of prompts improves performance, but as this is an unlikely step for children to take, the prompt was given as a string. This can lead to unintended prompt additions, an interesting phenomenon which is taken into consideration in evaluation. The Mistral instruct model was chosen as it was fine-tuned using a variety of publicly available conversation datasets, making it more applicable to QA use cases.
3.1.2 Data. Conversation snippets were taken from Perverted Justice (PJ) transcripts, which are conversations between decoy children (i.e., adults posing as children online) and real predators where the sting operation resulted in a conviction.
We selected PJ transcript snippets representing various specific contexts, such as 'discussing meeting up', 'discussing sexual topics', and 'discussing talking on the phone'. This process yielded 8 conversation snippets between different predator and decoy child chat participants. These snippets varied in their riskiness, with some snippets (S1,5,7,8) overtly containing sexual topics, others (S3,4) containing 'flirty' messages, and others (S2,6) containing less clearly offensive topics, but that are still inappropriate when considering they are between a child and an adult. To provide described context, these snippets were described based on the context and information available in the snippet.
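The run protocol (a fresh conversation per run, with any further prompt issued inside that same conversation) can be sketched as follows. This is an illustrative harness only: `Conversation`, `run_prompt`, `reply_fn`, and `needs_follow_up` are hypothetical names, and the rule for deciding when a further prompt is sent is left to the caller because the text does not specify it.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs


class Conversation:
    """Stand-in for one chat session; reply_fn is a hypothetical model call."""

    def __init__(self, reply_fn: Callable[[History], str]):
        self.reply_fn = reply_fn
        self.history: History = []

    def send(self, message: str) -> str:
        self.history.append(("user", message))
        reply = self.reply_fn(self.history)
        self.history.append(("assistant", reply))
        return reply


def run_prompt(
    reply_fn: Callable[[History], str],
    prompt: str,
    follow_up: str,
    needs_follow_up: Callable[[str], bool],
    runs: int = 10,
) -> List[Dict]:
    """Repeat a prompt `runs` times, each time in a new conversation."""
    results = []
    for _ in range(runs):
        convo = Conversation(reply_fn)  # no state carried over between runs
        answer = convo.send(prompt)
        further_prompted = False
        if needs_follow_up(answer):
            # The further prompt stays within the same conversation.
            answer = convo.send(follow_up)
            further_prompted = True
        results.append(
            {
                "answer": answer,
                "further_prompted": further_prompted,
                "turns": len(convo.history),
            }
        )
    return results
```

No evaluation feedback is passed back into `reply_fn` here, mirroring the decision not to provide answer feedback during testing.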

Evaluation
To remove subjectivity, three rubrics were created to evaluate the LLM answers given, as detailed in Table 1. One rubric measured how easy it was to get an answer from the LLM, referred to as 'responsiveness'. The other two rubrics evaluated the quality of the answer, one for spotting online grooming, and the other for the general online safety task and the advice task. These rubrics provide an objective quantitative evaluation, but must be taken into consideration alongside the qualitative assessments of LLM behaviours, detailed in Section 4.
An important note is that LLMs sometimes failed to answer in some runs while responding in others. Therefore, the average quality of answers only reflects the times a model did answer, making it important to consider responsiveness alongside quality. In addition, all LLMs were given the chance to improve their answer via further prompting. If answer quality improved, this informed a higher quality score, but reduced the responsiveness rating due to further prompting. If the response did not improve, then the original answer was evaluated and the responsiveness did not reflect further prompting.
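The aggregation rule described above, where quality is averaged only over runs in which the model answered while responsiveness is averaged over every run, can be sketched as below. This is an assumption-laden illustration: the `Run` record, the `summarise` helper, the extra `answer_rate` field, and the numeric scores in the example are all hypothetical, since the rubric scales are defined in Table 1 and are not reproduced here.

```python
from statistics import mean
from typing import List, NamedTuple, Optional


class Run(NamedTuple):
    answered: bool
    quality: Optional[float]  # rubric score; None when the model did not answer
    responsiveness: float     # rubric score recorded for every run


def summarise(runs: List[Run]) -> dict:
    """Average scores over repeated runs of one prompt for one model."""
    answered = [r for r in runs if r.answered]
    return {
        # Share of runs that produced an answer (illustrative extra field).
        "answer_rate": len(answered) / len(runs),
        # Responsiveness reflects all runs, including refusals.
        "mean_responsiveness": mean(r.responsiveness for r in runs),
        # Quality reflects only the runs where an answer was given.
        "mean_quality": mean(r.quality for r in answered) if answered else None,
    }


# Hypothetical scores for three runs of the same prompt.
example = [Run(True, 4, 3), Run(False, None, 0), Run(True, 2, 2)]
summary = summarise(example)
```

Because refusals are excluded from the quality mean, a model that answers rarely but well can show a high quality score, which is why the two measures need to be read together.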

General Online Safety Advice
To test the efficacy of the 6 LLMs for the task of providing general online safety advice, 4 prompts were given: 2 asking for general online safety advice, and 2 asking for advice specific to avoiding online grooming, with one from each pair being indirect (i.e., what advice would you give the child), and the other being direct (i.e., what advice would you give me). All models performed fairly well, showing mostly expected behaviour with the prompt variations (i.e., more specific advice when online grooming was specified).

Spotting Online Grooming
The 4 prompts for the task of spotting online grooming, and the quantitative evaluations of responsiveness and the average quality scores achieved for these prompts across the 8 scenarios (S1-S8), are shown in Table 2. However, qualitative assessment also produced a number of observations, detailed below.
Cautious behaviour: The closed-source models, especially Claude 2, in general exercised a lot more caution than the open-source models, often providing red flags from a conversation but stopping short of definitively finding a risk of online grooming. Some models added caveats to their answers (e.g., 'I am not an expert in online safety or child protection, but I can offer some general observations based on the provided conversation snippet'). PaLM 2 occasionally avoided the question altogether, instead giving generic advice about spotting and avoiding online grooming. ChatGPT 3.5 was extra cautious in declaring no risk of online grooming, always making sure to caveat this conclusion, outlining factors to consider to assess the situation more thoroughly. These caveats should be standard practice but could go further by telling the child to get a second opinion from a trusted adult.
Inconsistency: All closed-source models showed some inconsistency in whether they would produce an answer or not within runs of prompts for the same scenario, especially in mid-level risky conversations. LLaMA 2 also occasionally exhibited this behaviour, but only for the most risky conversations. ChatGPT never needed further prompting, either answering or refusing, whereas PaLM 2 and Claude 2 were inconsistent in whether they required further prompting. Further, Claude 2 sometimes showed inconsistency in the quality of answer that further prompts yielded. Of the closed-source models, ChatGPT 3.5 was the most inconsistent in identifying a risk of grooming in low-level risky conversations. The two open-source models, LLaMA 2 and Mistral, were often inconsistent in their answers, especially around the mid to low-level risky conversations, resulting in a wide range of quality scores. In some runs they would firmly find a risk of online grooming, providing solid reasoning, while in others they would deny any risk, providing arguments that contradicted the model's previous reasoning in support of the opposite conclusion. Even within an answer they could show inconsistency, listing red flags indicative of grooming and then confidently concluding there were no red flags. LLaMA 2 provided some dangerously poor answers, but also produced answers that were even more compelling than the closed-source models'. Mistral was more inconsistent than LLaMA 2, and less often produced good answers. Its reasoning was the most clearly contradictory in runs of the same prompt and scenario, such as suggesting in one run that a child was not being groomed because they had knowledge about safe sex practices, and in another run claiming that the same child's fear of getting pregnant showed a lack of understanding about contraception which made them more vulnerable to manipulation.
False information: Particularly when the context was given, some models hallucinated information. PaLM 2 sometimes referred to false events, and sometimes made assumptions without evidence, such as stating that the groomer in one scenario was pretending to be younger than they were. Similarly, LLaMA 2 had a tendency to confidently assert things it could not have known, such as claiming that the identity of the adult in S1 with the username 'armysgt1961' must be fake, saying 'this is not the behaviour of an army solider'. LLaMA 2 and Mistral had a tendency to make up information, inventing that the conversation was taking place in a public online space, referencing events that never occurred, and even fabricating information like the child's name or age. Interestingly, LLaMA 2 was the only model that reported inappropriate emojis in some conversations (there were no emojis present in any of the scenarios).
Unconvincing evidence: All models sometimes provided unconvincing evidence in their answers in support of either conclusion. Interestingly, PaLM 2 and ChatGPT had some overlap in the unconvincing evidence they provided, potentially indicating an overlap in their inference capabilities. In some runs LLaMA 2 provided entirely irrelevant and unconvincing evidence, such as, 'the child participant appears to be relatively vulnerable and open to manipulation, based on their language and responses (e.g., "lol" and "cool"), which could be seen as an attempt to sexualize the interaction'. LLaMA 2 also sometimes repeated evidence within the same answer in different terms. Parts of Mistral's answers were often vague and unrelated to the question, e.g., suggesting that platforms should implement age verification measures. In some answers Mistral provided nonsensical reasoning, e.g., 'the fact that the adult participant is calling the child "wekend" suggests that they may have a lack of understanding of appropriate language and boundaries'; this was untrue and confusing (the child was actually saying 'i finally get a wekend to myslef').

Harmful behaviours: The open-source models showed the most potential for harmful behaviours, with ChatGPT 4, PaLM 2, and Claude 2 never producing harmful answers, and ChatGPT 3.5 providing some harmful answers in low-level risky conversations when it misinterpreted the context. LLaMA 2 showed some clear indications of harmful performance, such as concluding a conversation contained no grooming because the participants were 'making each other feel good', and finding another conversation was perfectly fine as the child was 'keen to speak to the adult'. For the direct prompts it provided some very harmful reasoning, such as suggesting the adult was more knowledgeable and experienced than the child, so the child should listen to the adult. Mistral more often provided harmful answers, sometimes even blaming the child for the inappropriate interaction. Regarding an adult who sent a nude photo to the decoy child, Mistral said, 'while the exchange between the adult and child does involve some sexual content (the nude photo), it's also important to note that the child initiated the conversation and requested the photo in the first place'. There were many similarly dubious answers. Mistral often struggled to identify any inappropriate content in a scenario, even when it was abundant.
Indirect vs. direct: Both ChatGPT models and PaLM 2 would directly address the child when the prompt was direct, but wouldn't noticeably change their delivery in terms of tone. Claude 2, LLaMA 2, and Mistral were less consistent in who the answer addressed, sometimes direct and other times indirect. LLaMA 2 sometimes used emojis when addressing the child, in an attempt to produce more 'child friendly' answers. Mistral often behaved weirdly, sometimes answering from the POV of the child (e.g., 'No, I am not at risk of online grooming'), and then reverting back to indirect POV halfway through the answer (e.g., 'It is important for children to be cautious'). When the prompt was indirect and the ChatGPT models wouldn't answer, they would always cite content guideline violations, but when the prompt was direct they would sometimes provide text instead, such as, 'I'm really sorry, but I can't assist with that question', indicating that they treated the direct prompts differently.
Given vs. described context: When the chat snippets were given, all models occasionally extracted details from some conversations incorrectly, especially misinterpreting who said what in the conversation, leading to false and irrelevant evidence in generated outputs. Working from a described context, models made fewer mistakes overall and provided more convincing evidence, suggesting that the models could interpret the conversation much more easily when the summarising step was done for them. However, LLaMA 2 and Mistral still sometimes analysed the situation incorrectly when the context was described. Mistral's behaviour was the most notably improved with a described context, generally providing much more coherent and less harmful answers.

Online Grooming Advice
There were 8 prompts for the task of providing advice given online grooming conversations. The quantitative evaluations for the prompts where online grooming was not specified (Prompts 9, 11, 13, and 15) across the 8 scenarios (S1-S8), are shown in Table 3. Prompts 10, 12, 14, and 16 differed only in specifying the risk of online grooming, and are discussed inline below.
Advice generation behaviours: The closed-source models and LLaMA 2 varied in the context-specificity of their advice. In some conversations, they would only give general online safety advice, which was helpful but not specific to the context. When a prompt was indirect and the transcript was given, all models tended to use vague language rather than giving clear steps to follow in the specific context. Where ChatGPT 3.5 would provide general online safety advice, ChatGPT 4 would often include a preamble describing the situation as concerning, suggesting that the paid model analysed the situation more thoroughly. ChatGPT 4 also often found red flags in conversations that ChatGPT 3.5 thought were harmless. Claude 2 sometimes initially refused to answer or gave vague advice; with further prompting it would occasionally provide good advice, but lacking any clear steps to follow. Claude 2 was fairly inconsistent in both responsiveness and quality, but was never harmful. Mistral often provided advice that wasn't strictly relevant to the context, producing answers on safe sex advice without referencing the scenario, and answers that were otherwise irrelevant to the task. It behaved the most inconsistently of all the models, and performed the worst in general. At best, it would give vague online safety advice, and at worst it would provide explicitly harmful advice.
Misinterpretations: For some of the low-risk transcripts, ChatGPT 3.5 and PaLM 2 would misinterpret the context as being about having friends over or a child complaining about chores, resulting in irrelevant advice. This shows that, where online grooming risk is not mentioned in the prompt, some LLMs can misidentify grooming conversations as harmless, subsequently providing unhelpful advice. In some conversations, ChatGPT 4 avoided mistakes made by 3.5, giving a better analysis of the context. LLaMA 2 also sometimes misinterpreted low-risk transcripts, but was less consistent than the closed-source models in its analysis. In one conversation about an adult coming over to a child's house, it analysed the situation as a friend coming over, yet in another run it analysed the situation as the child running away from home. LLaMA 2 and Mistral sometimes gave irrelevant advice with no clear connection to the transcript. Mistral misinterpreted the transcripts in both high-risk and low-risk conversations, often providing harmful advice due to its misunderstanding of the context.
Bad behaviours: LLaMA 2 and Mistral sometimes hallucinated false information from transcripts, as observed elsewhere, sometimes leading to harmful advice. PaLM 2 also sometimes provided harmful advice when working from transcripts, e.g., telling a child to double check that their parents were gone before having guests over. PaLM 2 and Mistral occasionally provided poorly targeted advice, e.g., PaLM 2 telling a child to never leave their drink unattended, or Mistral telling the child to 'communicate openly and honestly with their adult partners', and telling the child it would be rude to change their mind about an adult coming over. Mistral also sometimes gave irrelevant but harmless advice. For some scenarios, Mistral provided bad advice in almost every run.
Specified vs. not: In general, having 'online grooming' specified in the prompt reduced models' tendencies to misinterpret transcripts, prompting more relevant advice. However, it often caused models to provide only generic advice on spotting / avoiding online grooming, rather than commenting on the scenario. ChatGPT 4 and PaLM 2 sometimes gave better answers with a less specific prompt, presumably as controls stopped them from analysing the situation, and they defaulted to more general advice. Claude 2 also showed signs of guard-railing affecting answer quality. Most other models improved their answer quality with a more specific prompt, though LLaMA 2 decreased in responsiveness.
Interestingly, when combined with descriptions rather than transcripts, the effects of specificity were different for all models, indicating that the combination of these prompt variations is an important factor in performance, and that the combination is more important overall than any prompt variation in isolation. Using transcripts, Claude 2 gave worse answers when 'online grooming' was specified, but when combined with descriptions, specificity improved its answers.
Indirect vs. direct: When working with transcripts, the direct prompt improved responsiveness for ChatGPT 3.5 and Claude 2, had no effect for ChatGPT 4, and reduced it for PaLM 2, LLaMA 2, and Mistral. For PaLM 2 and LLaMA 2 this was due to the models' guard-railing stopping them from answering as consistently, but for Mistral this was due to more model misbehaviours for the direct prompt. Worryingly, overall answer quality worsened for all models using direct prompts. When using descriptions, the direct prompt caused all models apart from PaLM 2 to improve in overall responsiveness but decline in answer quality. The implication is that claiming to be a child caused the models to answer the question more frequently, but the answers they gave were overall worse in quality; this effect was also seen in the identification task results. This is a worrying trend, as ideally a child should receive even clearer answers. The direct prompts observably caused less confident behaviours in the models.
Given vs. described context: As in the identification task, context descriptions helped models avoid misinterpretations of context from transcripts. For ChatGPT it eliminated this behaviour entirely, but for PaLM 2, LLaMA 2, and Mistral it only reduced occurrence. However, some models would no longer answer for scenarios they had addressed from transcripts, indicating that the description made it more clear how inappropriate the interaction was, and triggered guard-railing. Mistral greatly improved when working with descriptions, resulting in more consistent answers and fewer harmful answers. LLaMA 2 was also much less likely to produce harmful answers when the context was described. However, the described context did not eliminate this behaviour for either model.

Discussion
Optimal prompt variations: For the identification task, the best overall responsiveness scores came from different prompts for each model. Conversely, the best overall answer quality scores came from Prompt 6 (description, indirect) for all models. This is in some ways unsurprising, as the description removes a processing step. However, the better performance of the indirect prompt is worrying, suggesting that children may get sub-optimal performance if asking questions themselves.
To obtain online grooming advice, the best overall responsiveness scores again came from differing prompts, whereas the best overall quality scores showed more consistency across various models. For both ChatGPTs and LLaMA 2, Prompt 13 (description, no specificity, indirect prompt) yielded the best scores. Mistral also achieved its highest overall answer quality score for Prompt 13, but jointly with Prompt 14 (specified). Claude 2 performed better with a transcript rather than a description. PaLM 2 differed most from the others, performing best with descriptions, specificity and a direct prompt, making it a more promising candidate for child LLM use cases than other models.
Model comparisons: There was a clear difference in the performance of ChatGPT 3.5 and 4, with ChatGPT 4 often performing better in any given task. The models were also affected differently by the prompt variations. For riskier conversations, ChatGPT 4 would sometimes start generating an excellent answer only to remove it upon completion, showing a potential that is not being consistently employed. Both models only answered consistently for low-risk conversations containing less overtly inappropriate content. Both ChatGPT models either answered or didn't (i.e., never needed further prompting), but had inconsistent responsiveness, and were greatly limited by what they considered content violations. Initially, PaLM 2 would provide context-relevant content after further prompts, but in later tests it would protest it had no access to the initial prompt. This made it less helpful than it originally was, as further prompts could only result in generic responses with no reference to the transcripts, suggesting that a model update greatly limited its efficacy. However, PaLM 2 was more useful in some ways than the ChatGPT models when encountering topics it wished to avoid, as PaLM 2 could at least be further prompted to give generic advice, rather than terminating the conversation entirely. In some runs PaLM 2 showed promising behaviours that other models did not, often providing additional tips for parents to help keep their children safe online, giving additional advice to the child, and providing links to useful and relevant resources such as NCMEC. Interestingly, LLaMA 2 often performed the spotting online grooming task better in the online grooming advice prompts than it did in the task-relevant prompts.
Troubling behaviours: PaLM 2 sometimes acted as if the provided advice was not coming from an LLM, such as starting an answer 'As an adult...'. This behaviour was not observed in the other closed-source models. LLaMA 2 also did this, sometimes going even further than PaLM 2, with positions such as 'As an experienced CPS worker...'. This is clearly misinformation, and potentially dangerous, as children may not realise this is not true, and may take such advice more seriously. Mistral, the most inconsistent model, showed the most interesting and worrying behaviours, regardless of prompt variations. It would provide bizarre statements, e.g., 'while it is normal for adults to be interested in children's appearance', and 'it's very common for adults to act out sexually with children'. Unlike other models, it sometimes produced short but aggressive answers, e.g., 'what do you think this is, a game? This isn't a game, it's a man trying to get you interested in sex'. It also provided some confusing and worrying answers. Whilst ChatGPT 3.5 occasionally provided harmful answers when it misinterpreted situations with the given context, the open-source models showed much more potential overall for harmful answers in more variations of the prompts, with LLaMA 2 being less inclined to this than Mistral. This could be attributed to Mistral's intentional lack of fine-tuning for safety.
Answer formatting: The closed-source models were all fairly consistent in their answer formatting, with answer length being the most variable factor, and with PaLM 2 being the least consistent. In contrast, the open-source models were less consistent in general, with Mistral being worse than LLaMA 2 in this regard. For example, LLaMA 2 would sporadically give itself answer options to choose from, and often answered a question from different perspectives, sometimes saying it is an AI language model, but other times answering from a persona. Mistral also showed inconsistent formatting, providing some answers from the perspective of 'users' discussing the prompt, and greatly varying in answer length. Mistral and LLaMA 2 also addressed some answers to the adult participant, though this prompt POV was never given.
Lack of answers: Both ChatGPT models often refused to answer high-risk queries. In general, ChatGPT 4 was more likely to provide an answer than 3.5. Interestingly, they did not always object to the same conversations, indicating differing guard-railing guidelines. PaLM 2 would also sometimes refuse to answer, but would title the conversations in a way that indicated what it would have answered (e.g., 'adult tries to groom child', and 'adult encourages sexual activity with minor'). Claude 2 was inconsistent in whether it refused to answer, and whether it required further prompting. When Claude 2 and PaLM 2 would not provide an answer they always provided a reason or small piece of text, rather than producing content guideline violations like the ChatGPT models. This ChatGPT behaviour is unhelpful in this scenario, and a safe but helpful template text would go a long way in improving the usability of these models in the cases where help is most needed. LLaMA 2 was sometimes reluctant to answer directly, opting to list red flags in a conversation or provide generic advice. Mistral was the only model that never refused to answer; all omissions were due to model irregularities.
Prompt additions: The open-source models often added to the prompt before generating an answer. LLaMA 2 sometimes added to the prompt to enforce a more detailed answer, e.g., 'why or why not?', or 'please explain your reasoning'. Mistral also did this, but other times required manual additions of this form to the prompt, otherwise it would simply terminate without producing any response. Both models sometimes added to the prompt to note that the conversation was fictional, which was untrue. With direct prompts, LLaMA 2 would sometimes add to the prompt from the child's POV with varying relevance to the context (e.g., 'P.S. I love puppies and rainbows'). Model additions to the queries often biased answers, sometimes creating orthogonal narratives, resulting in irrelevant answers. In the online grooming advice task, Mistral sometimes showed an interesting behaviour that was not observed in other tasks, continuing the conversation in the same format as the original chat snippet, and often completely changing the context of the snippet in the process. It should be noted that these prompt additions are likely a result of not tokenising the input.
Impacts of prompt design: Overall, it was observed that the interactions between the prompt variations often affected models in different ways. It further became clear that the impact of any single prompt variation was not consistent in combination with other prompt variations. Indeed, the combinations of the prompt paradigms were more impactful than any in isolation. However, there were general impacts observed for each prompt variation: descriptions reduced the risk of model misinterpretations relative to raw transcripts, direct prompts from a child's perspective caused a decline in answer quality, and specifying the risk of online grooming when asking for advice tended to produce more consistent responses, at the cost of usually producing lower-quality and more generic guidance.
Future directions: LLM companies must take heed of research findings that identify weak spots in their applications for important tasks, and must prioritise user safety, especially for vulnerable groups such as children. Guard-railing may avoid some harmful behaviours, but can equally limit helpful ones as observed in this research, and must be fine-tuned to protect children rather than blocking them from help. Cautious behaviour is understandable for sensitive tasks, but it should be standard practice to have caveats that go further than current ones, such as always telling the child to get a second opinion from a trusted adult. ChatGPT's propensity to block conversations due to content guideline violations and other closed-source models' answer refusals are unhelpful in this application, and a safe but helpful and informative template text would go a long way in improving the usability of these models in the cases where help is most needed. Future research should go further into bridging the research gap between online child safety and LLM usage. This work showed models were capable of producing false negatives, but equally necessary would be experiments investigating the opposite, determining if false positives could also be a problem.

CONCLUSION
This paper has explored the efficacy of six popular LLMs for online grooming prevention, assessing models' ability to provide general online safety advice, spot online grooming in inappropriate conversations, and provide context-relevant advice given these conversations. Our experimental results reveal several shortfalls for all models tested, with no models being perfectly suited to the task. The closed-source models tended to be too cautious to be reliably helpful, but were also capable of making mistakes that could harm children in a real-world deployment. The baseline open-source models were observably less polished than the closed-source models, showing an overall higher likelihood of harmful answer generation.
Prompt design experiments revealed that combinations of factors were more important than any in isolation, and that even simplifying the task for a model could backfire, sometimes triggering guard-railing that blocked helpful answers in favour of boilerplate guidance. Our results highlight the issues children may face if asking currently accessible systems for advice about sensitive topics, and point towards areas for future development in this domain.

Table 1: Rubrics for scoring responsiveness, identification and advice

Table 2: LLM performance at spotting online grooming. Prompt 5: "Given this conversation snippet between an adult (X) and a child (Y) [...] would you say the child participant is at risk of online grooming?" "I am a child (Y) who has had this conversation with an adult (X) [...] am I at risk of online grooming?"

Table 3: LLM performance at providing advice in an online grooming context. Prompt 9: "Given this conversation snippet between an adult (X) and a child (Y) [...] what advice would you give the child participant?"