Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

This article presents a comprehensive and practical guide for practitioners and end-users working with Large Language Models (LLMs) in their downstream Natural Language Processing (NLP) tasks. We provide discussions and insights into the usage of LLMs from the perspectives of models, data, and downstream tasks. First, we offer an introduction and brief summary of current language models. Then, we discuss the influence of pre-training data, training data, and test data. Most importantly, we provide a detailed discussion about the use and non-use cases of large language models for various natural language processing tasks, such as knowledge-intensive tasks, traditional natural language understanding tasks, generation tasks, emergent abilities, and considerations for specific tasks. We present various use cases and non-use cases to illustrate the practical applications and limitations of LLMs in real-world scenarios. We also try to understand the importance of data and the specific challenges associated with each NLP task. Furthermore, we explore the impact of spurious biases on LLMs and delve into other essential considerations, such as efficiency, cost, and latency, to ensure a comprehensive understanding of deploying LLMs in practice. This comprehensive guide aims to provide researchers and practitioners with valuable insights and best practices for working with LLMs, thereby enabling the successful implementation of these models in a wide range of NLP tasks. A curated list of practical guide resources of LLMs, regularly updated, can be found at https://github.com/Mooler0410/LLMsPracticalGuide. An LLMs evolutionary tree, editable yet regularly updated, can be found at llmtree.ai.


INTRODUCTION
In recent years, the rapid development of Large Language Models has been revolutionizing the field of natural language processing [12,128,131]. These powerful models have shown great potential in addressing a variety of NLP tasks, ranging from natural language understanding (NLU) to generation tasks, even paving the way to Artificial General Intelligence (AGI). However, utilizing these models effectively and efficiently requires a practical understanding of their capabilities and limitations, as well as the data and tasks involved in NLP.
To provide a guide for practitioners and end-users, this work focuses on the practical aspects of working with LLMs in downstream NLP tasks. This guide aims to provide practical advice on why or why not to choose LLMs for a given task.
• Knowledge-intensive tasks. Leverage the extensive knowledge stored in LLMs for tasks requiring domain-specific expertise or general world knowledge.
• Reasoning ability. Understand and harness the reasoning capabilities of LLMs to improve decision-making and problem-solving in various contexts.

PRACTICAL GUIDE FOR MODELS
This section provides a brief introduction to state-of-the-art LLMs. These models differ in their training strategies, model architectures, and use cases. To provide a clearer understanding of the LLM landscape, we categorize them into two types: encoder-decoder or encoder-only language models and decoder-only language models. In Figure 1, we show the detailed evolution process of language models. From the evolutionary tree, we make the following interesting observations:
a) Decoder-only models have been gradually dominating the development of LLMs. At the early stage of LLM development, decoder-only models were not as popular as encoder-only and encoder-decoder models. However, after 2021, with the introduction of the game-changing LLM GPT-3, decoder-only models experienced a significant boom. Meanwhile, after the initial explosive growth brought about by BERT, encoder-only models gradually began to fade away.
1 From a practical standpoint, we consider models with fewer than 20B parameters to be fine-tuned models. While it is possible to fine-tune even larger models like PaLM (540B), in reality it can be quite challenging, particularly for academic research labs and small teams. Fine-tuning a model with 3B parameters can still be a daunting task for many individuals or organizations.
Fig. 1. The evolutionary tree of modern LLMs traces the development of language models in recent years and highlights some of the most well-known models. Models on the same branch have closer relationships. Transformer-based models are shown in non-grey colors: decoder-only models in the blue branch, encoder-only models in the pink branch, and encoder-decoder models in the green branch. The vertical position of the models on the timeline represents their release dates. Open-source models are represented by solid squares, while closed-source models are represented by hollow ones. The stacked bar plot in the bottom right corner shows the number of models from various companies and institutions.
b) OpenAI consistently maintains its leadership position in LLMs, both currently and potentially in the future. Other companies and institutions are struggling to catch up with OpenAI in developing models comparable to GPT-3 and the current GPT-4. This leadership position may be attributed to OpenAI's steadfast commitment to its technical path, even when it was not widely acknowledged initially.
d) LLMs exhibit a tendency towards closed-sourcing. In the early stages of LLM development (before 2020), the majority of models were open-sourced. However, with the introduction of GPT-3, companies have increasingly
Table 1. Summary of Large Language Models.
e) Encoder-decoder models remain promising, as this type of architecture is still being actively explored, and most of them are open-sourced. Google has made substantial contributions to open-source encoder-decoder architectures. However, the flexibility and versatility of decoder-only models seem to make Google's insistence on this direction less promising.
We also briefly summarize the characteristics and the representative LLMs of each type in Table 1.

BERT-style Language Models: Encoder-Decoder or Encoder-only
Natural language data is readily available, and unsupervised training paradigms have been proposed to better utilize extremely large datasets; together, these motivate the unsupervised learning of natural language. One common approach is to predict masked words in a sentence while considering the surrounding context. This training paradigm is known as the Masked Language Model (MLM). This type of training allows the model to develop a deeper understanding of the relationships between words and the context in which they are used. These models are trained on a large corpus of texts using techniques such as the Transformer architecture and have achieved state-of-the-art results in many NLP tasks, such as sentiment analysis and named entity recognition. Notable examples of Masked Language Models include BERT [28], RoBERTa [65], and T5 [84]. MLMs have become an important tool in the field of natural language processing due to their success in a wide range of tasks.
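The masked-word objective described above can be illustrated with a minimal sketch. It is a deliberately simplified toy, not the exact procedure of any cited model: the `[MASK]` string, the `mask_tokens` helper, and the masking probability are assumptions made for illustration, and whitespace splitting stands in for a real tokenizer.

```python
import random

MASK = "[MASK]"  # placeholder symbol; real tokenizers define their own mask token


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a toy masked-language-model example.

    labels[i] holds the original token at masked positions and None elsewhere,
    so a model would only be scored on the positions it cannot see.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # remember what must be predicted
        else:
            masked.append(tok)
            labels.append(None)   # no prediction target here
    return masked, labels


sentence = "the model predicts masked words from surrounding context".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the unmasked tokens on both sides of a masked position remain visible, the prediction can draw on context from the left and the right, which is the property the paragraph above attributes to this training paradigm.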

GPT-style Language Models: Decoder-only
Although language models are typically task-agnostic in architecture, these methods require fine-tuning on datasets of the specific downstream task. Researchers found that scaling up language models significantly improves the few-shot, even zero-shot performance [16]. The most successful models for better few-shot and zero-shot performance are Autoregressive Language Models, which are trained by generating the next word in a sequence given the preceding words. These models have been widely used for downstream tasks such as text generation and question answering.
Examples of Autoregressive Language Models include GPT-3 [16], OPT [126], PaLM [22], and BLOOM [92]. The game changer, GPT-3, for the first time, demonstrated reasonable few-/zero-shot performance via prompting and in-context learning, thus showing the superiority of autoregressive language models. There are also models optimized for specific tasks or domains, such as Codex [2] for code generation and BloombergGPT [117] for the financial domain. The recent breakthrough is ChatGPT, which refines GPT-3 specifically for conversational tasks, resulting in more interactive, coherent, and context-aware conversations for various real-world applications.
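The autoregressive objective, generating the next word given the preceding words, can be sketched with a toy bigram counter. This is only an illustration of next-token prediction under strong simplifying assumptions (it conditions on a single previous word and uses greedy decoding); the function names and the tiny corpus are hypothetical and unrelated to how the cited models are implemented.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each word, how often each next word follows it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_new_tokens=5):
    """Greedy decoding: repeatedly append the most frequent next word."""
    out = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:          # no continuation was ever observed
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


corpus = [
    "language models predict the next word",
    "language models generate text",
    "models predict the next word",
]
counts = train_bigram(corpus)
print(generate(counts, "language", max_new_tokens=3))
```

Each generated word is fed back as context for the next step, which is the left-to-right loop that decoder-only models run at a vastly larger scale and with much longer contexts.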

PRACTICAL GUIDE FOR DATA
In this section, we discuss the critical role that data plays in selecting appropriate models for downstream tasks. The impact of data on the models' effectiveness starts during the pre-training stage and continues through to the training and inference stages.
Remark 1
(1) LLMs generalize better than fine-tuned models in downstream tasks facing out-of-distribution data, such as adversarial examples and domain shifts.
(2) LLMs are preferable to fine-tuned models when working with limited annotated data, and both can be reasonable choices when abundant annotated data is available, depending on specific task requirements.
(3) It's advisable to choose models pre-trained on fields of data that are similar to downstream tasks.

Pretraining data
Pre-training data plays a pivotal role in the development of large language models. As the foundation of the remarkable capabilities [5,47] of LLMs, the quality, quantity, and diversity of pre-training data influence the performance of LLMs significantly [124]. The commonly used pretraining data consists of a myriad of text sources, including books, articles, and websites. The data is carefully curated to ensure a comprehensive representation of human knowledge, linguistic nuances, and cultural perspectives. The importance of pretraining data lies in its capacity to inform the language model with a rich understanding of word knowledge, grammar, syntax, and semantics, as well as the ability to recognize context and generate coherent responses. The diversity of pretraining data also plays a crucial role in shaping the model's performance, and the selection of LLMs highly depends on the components of the pretraining data. For example, PaLM [22] and BLOOM [92] excel in multilingual tasks and machine translation thanks to an abundance of multilingual pretraining data. Moreover, PaLM's performance in Question Answering tasks is enhanced by incorporating a considerable amount of social media conversations and the Books corpus [22]. Likewise, the code execution and code completion capabilities of GPT-3.5 (code-davinci-002) are amplified by the integration of code data in its pretraining dataset. In brief, when selecting LLMs for downstream tasks, it is advisable to choose a model pre-trained on a similar field of data.

Finetuning data
When deploying a model for downstream tasks, it is essential to consider three primary scenarios based on the availability of annotated data: zero, few, and abundant. In this section, we provide a succinct overview of the appropriate models to employ for each scenario.
Zero annotated data: In scenarios where annotated data is unavailable, utilizing LLMs in a zero-shot setting proves to be the most suitable approach. LLMs have been shown to outperform previous zero-shot methods [120]. Additionally, the absence of a parameter update process ensures that catastrophic forgetting [49] is avoided since the language model parameters remain unaltered.
Few annotated data: In this case, the few-shot examples are directly incorporated in the input prompt of LLMs, which is called in-context learning, and these examples can effectively guide LLMs to generalize to the task. As reported in [16], one-shot and few-shot performance make significant gains, even matching the performance of the SOTA fine-tuned open-domain models. LLMs' zero/few-shot ability can be improved further by scaling [16]. Alternatively, some few-shot learning methods have been developed to enhance fine-tuned models, such as meta-learning [56] or transfer learning [88]. However, their performance might be inferior to that of LLMs due to fine-tuned models' smaller scale and overfitting.
Abundant annotated data: With a substantial amount of annotated data for a particular task available, both fine-tuned models and LLMs can be considered. In most cases, fine-tuning the model can fit the data pretty well. However, LLMs can be used to meet some constraints such as privacy [99]. In this scenario, the choice between using a fine-tuned model or an LLM is task-specific and also depends on many factors, including desired performance, computational resources, and deployment constraints.
In brief: LLMs are more versatile with respect to data availability, while fine-tuned models can be considered when abundant annotated data is available.
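The three data-availability scenarios and the in-context learning format can be sketched as follows. This is a minimal illustration, not a prescribed recipe: the helper names, the `Input:`/`Label:` prompt layout, and the `few_shot_max` threshold are all assumptions introduced here, since the survey does not fix a numeric boundary between "few" and "abundant".

```python
def build_prompt(instruction, examples, query):
    """Assemble an in-context-learning prompt.

    With an empty `examples` list this is a zero-shot prompt; otherwise each
    (input, label) pair becomes a demonstration placed before the query.
    """
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Input: {text}\nLabel: {label}")
    parts.append(f"Input: {query}\nLabel:")
    return "\n\n".join(parts)


def suggest_approach(num_labeled, few_shot_max=32):
    """Map annotated-data availability to the three scenarios in the text.

    `few_shot_max` is an arbitrary illustrative threshold, not from the survey.
    """
    if num_labeled < 0:
        raise ValueError("num_labeled must be non-negative")
    if num_labeled == 0:
        return "zero-shot LLM"
    if num_labeled <= few_shot_max:
        return "few-shot LLM (in-context learning)"
    return "fine-tuned model or LLM (task-dependent)"


demos = [("great movie, loved it", "positive"), ("a waste of time", "negative")]
print(build_prompt("Classify the sentiment.", demos, "surprisingly good"))
print(suggest_approach(len(demos)))
```

Note that no parameters are updated in either LLM setting: the demonstrations only change the input text, which is why catastrophic forgetting does not arise there.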

Test data/user data
When deploying LLMs for downstream tasks, we often face challenges stemming from distributional differences between the test/user data and the training data. These disparities may encompass domain shifts [132], out-of-distribution variations [31], or even adversarial examples [82]. Such challenges significantly hinder fine-tuned models' effectiveness in real-world applications. These models fit a specific distribution and generalize poorly to out-of-distribution (OOD) data.
However, LLMs perform quite well facing such scenarios because they do not have an explicit fitting process. Moreover, recent advancements have further enhanced the ability of language models in this regard. The Reinforcement Learning from Human Feedback (RLHF) method has notably enhanced LLMs' generalization capabilities [77]. For example, InstructGPT demonstrates proficiency in following various instructions for a wide range of tasks and occasionally complying with instructions in different languages, even though such instructions are scarce. Similarly, ChatGPT exhibits consistent advantages on most adversarial and out-of-distribution (OOD) classification and translation tasks [109]. Its superiority in understanding dialogue-related texts led to an impressive performance on the DDXPlus dataset [101], a medical diagnosis dataset designed for OOD evaluation.

PRACTICAL GUIDE FOR NLP TASKS
In this section, we discuss in detail the use cases and non-use cases for LLMs in various downstream NLP tasks and the corresponding model abilities. In Figure 2, we summarize all discussions into a decision flow. It can serve as a guide for a quick decision when facing a task.

Traditional NLU tasks
Traditional NLU tasks are fundamental tasks in NLP, including text classification, named entity recognition (NER), entailment prediction, and so on.

Remark 2
Fine-tuned models are generally a better choice than LLMs in traditional NLU tasks, but LLMs can help when strong generalization ability is required.

No use case.
In most natural language understanding tasks, such as tasks in GLUE [106] and SuperGLUE [105], fine-tuned models still have better performance, provided that such tasks come with rich, well-annotated data and contain very few out-of-distribution examples in their test sets. For different tasks and datasets, the gap between small fine-tuned models and LLMs varies.
In text classification, on most datasets, LLMs perform slightly worse than fine-tuned models. For sentiment analysis, such as on IMDB [69] and SST [94], fine-tuned models and LLMs perform equally well. For toxicity detection, which is another iconic text classification task, the gap is much larger. No LLM performs well on this task, and on CivilComments [13] even the best one only beats random guessing [59]. On the other hand, most popular fine-tuned models can obtain much better performance [33], and the Perspective API is still one of the best for detecting toxicity. This API is powered by a multilingual BERT-based model, which is tuned on publicly available toxicity data, and several smaller single-language CNNs distilled from this model. The gap might be due to the fact that toxicity is defined by subtle nuances in linguistic expressions, and large language models are unable to accurately comprehend this task solely based on the provided input.
In information retrieval (IR) tasks, LLMs are not widely exploited yet. One major reason is that IR tasks are fundamentally different from others. There is no natural way to transform the thousands of candidate texts into the few/zero-shot form required by LLMs. The existing evaluation results on MS MARCO (regular/TREC) [73] show that methods based on fine-tuned models have better performance [59]. In this evaluation, the LLMs rank passages in an unorthodox way, which requires the LLMs to produce probabilities for passages one by one.
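The one-by-one ranking procedure mentioned above can be sketched as pointwise reranking: every candidate passage is scored independently and the scores are then sorted. The sketch below is an assumption-laden illustration rather than the evaluation protocol of [59]; in particular, `toy_relevance` is a hypothetical word-overlap stand-in for the probability an LLM would return, so that the example runs without any model.

```python
def rank_passages(query, passages, score_fn):
    """Score each passage independently, then sort by descending score."""
    scored = [(score_fn(query, p), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [p for _, p in scored]


def toy_relevance(query, passage):
    """Placeholder for an LLM call: fraction of query words found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


passages = [
    "Bananas are yellow",
    "France borders Spain",
    "Paris is the capital of France",
]
for passage in rank_passages("capital of france", passages, toy_relevance):
    print(passage)
```

The cost structure is visible in the sketch: one scoring call per candidate, which becomes expensive when each call is an LLM forward pass over thousands of passages.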
For some low-level intermediate tasks, which are not intended for regular users but rather serve high-level tasks, such as named entity recognition (NER) and dependency parsing, there are not enough results from LLMs, because most current evaluations of LLMs focus on practical tasks. According to available evaluation results, for the NER task, CoNLL03 [89] is still a challenge for LLMs [81], where the performance of fine-tuned models is around twice that of LLMs. These intermediate tasks may vanish soon because LLMs can take over high-level tasks without the help of those intermediate tasks (e.g., dependency parsing for coding tasks; NER for some text generation tasks).
In brief, for most traditional NLU tasks, a fine-tuned model is a better choice in terms of the performance on benchmark datasets and the computational cost. The scale of LLMs is usually 10× or even 100× larger than fine-tuned models.
One possible cause for the inferior performance of LLMs on certain tasks can be the design of instructions/prompts. Transforming input from tasks like IR and sentence labeling into a few/zero-shot instruction form is non-trivial. There may be better ways to adapt language models to traditional NLP tasks in the future. On the other hand, the upper limit of capabilities of fine-tuned models has not been reached, and some methods like FLAN-tuning [67] can further boost the performance on NLU tasks. Another interesting finding is that on NLU tasks, after fine-tuning, masked language models, like T5 [85], are better than most auto-regressive language models at the same scale, while some recent results imply that this gap can be bridged by scaling [22].
4.1.2 Use case. However, there are still some NLU tasks suitable for LLMs.
One of the representative tasks is miscellaneous text classification [59]. In contrast to classic domain-specific text classification tasks such as sentiment analysis, miscellaneous text classification deals with a diverse range of topics and categories that may not have a clear or strong relationship with one another. It is closer to real-world cases and hard to format for fine-tuned models. Another is Adversarial NLI (ANLI) [74]. It is a difficult dataset composed of adversarially mined natural language inference questions in three rounds (R1, R2, and R3). LLMs have shown superior performance on ANLI, especially on R3 and R2. Both examples demonstrate the exceptional ability of LLMs to generalize well on out-of-distribution and sparsely annotated data in traditional NLP tasks, surpassing that of fine-tuned models. We discussed this in Section 3.3 above.

Generation tasks
Natural Language Generation broadly encompasses two major categories of tasks, with the goal of creating coherent, meaningful, and contextually appropriate sequences of symbols. The first type focuses on converting input texts into new symbol sequences, as exemplified by tasks like paragraph summarization and machine translation. The second type, "open-ended" generation, aims to generate text or symbols from scratch to accurately match input descriptions, such as crafting emails, composing news articles, creating fictional stories, and writing code.

Remark 3
Due to their strong generation ability and creativity, LLMs show superiority at most generation tasks.
4.2.1 Use case. Generation tasks require models to have a comprehensive understanding of the input contents or requirements and a certain level of creativity. This is what LLMs excel at.
For summarization tasks, although LLMs do not have an obvious advantage over fine-tuned models under traditional automatic evaluation metrics, such as ROUGE [60], human evaluation results indicate that humans tend to prefer the results generated by LLMs [38,127] over those of fine-tuned models. For example, on CNN/DailyMail [71] and XSUM [72], fine-tuned models like Brio [66] and Pegasus [125] have much better performance than any LLMs w.r.t. ROUGE, but LLMs like OPT [126] perform far better in human evaluation considering all aspects including faithfulness, coherence, and relevance [127]. This demonstrates the superiority of LLMs in summarization tasks. On the other hand, it implies that either current summarization benchmarks do not contain high-quality reference summaries or the automatic metrics are not suitable for evaluating summarization.
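To make the mismatch between automatic metrics and human preference more concrete, here is a minimal sketch of unigram-overlap scoring in the spirit of ROUGE-1. It is a simplified illustration (lowercasing and whitespace tokenization only, no stemming), and the function name `rouge_1` is an assumption of this sketch rather than the API of any official ROUGE package.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, f1) over clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each word counted at most min(cand, ref) times
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "the cat sat on the mat"
print(rouge_1("the cat sat", reference))        # copies reference wording
print(rouge_1("a feline was resting", reference))  # paraphrase with no shared words
```

A summary that paraphrases the reference shares few or no surface words with it and is therefore scored low by this kind of metric, regardless of how faithful or fluent a human reader would find it.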
In machine translation (MT), LLMs can perform competent translation, although the average performance is slightly worse than some commercial translation tools [45] considering some automatic metrics like BLEU [78]. LLMs are particularly good at translating some low-resource language texts into English. For example, in the Romanian-English translation of WMT'16 [11], zero-shot or few-shot LLMs can perform better than the SOTA fine-tuned model [22]. This is mainly due to the fact that English resources compose the main part of the pre-training data. BLOOM [92] is pre-trained on more multi-lingual data, leading to better translation quality in both rich-resource and low-resource translation.
Another interesting finding is that BLOOM achieves good translation quality among Romance languages, even for translation from Galician, which is not included in the pre-training data. One reasonable explanation is that texts from some languages in the same language group can help the LLMs learn more from the similarity. If more multi-lingual texts can be added to the pre-training data, the translation capability may be improved further.
Additionally, LLMs are highly skilled in open-ended generation. One example is that humans can hardly distinguish the news articles generated by LLMs from real news articles [16]. LLMs are remarkably adept at code synthesis as well. Whether for text-to-code generation, such as HumanEval [18] and MBPP [7], or for code repair, such as DeepFix [39], LLMs can perform pretty well. GPT-4 can even pass 25% of the problems on LeetCode, which are not trivial for most human coders [76]. With training on more code data, the coding capability of LLMs can be improved further [22].
Although LLMs perform well on such tasks, the code they generate should be tested carefully to uncover any subtle bugs, which is one of the main challenges for applying LLMs in code synthesis.

No use case.
Fine-tuned models, such as DeltaLM+Zcode [118], still perform best on most rich-resource translation and extremely low-resource translation tasks. In rich-resource machine translation, fine-tuned models slightly outperform LLMs [22,92]. In extremely low-resource machine translation, such as English-Kazakh translation, fine-tuned models perform significantly better than LLMs.

Knowledge-intensive tasks
Knowledge-intensive NLP tasks refer to a category of tasks that have a strong reliance on background knowledge, domain-specific expertise, or general real-world knowledge. These tasks go beyond simple pattern recognition or syntax analysis, and they are highly dependent on memorization and proper utilization of knowledge about specific entities, events, and common sense of our real world.
Remark 4 (1) LLMs excel at knowledge-intensive tasks due to their massive real-world knowledge.
(2) LLMs struggle when the knowledge requirements do not match their learned knowledge, or when they face tasks that only require contextual knowledge, in which case fine-tuned models can work as well as LLMs.
on nearly all datasets, such as on NaturalQuestions [52], WebQuestions [9], and TriviaQA [46]. On TriviaQA, even zero-shot LLMs are still much better [22].
The massive multitask language understanding (MMLU) [40] benchmark is also highly knowledge-intensive. It contains multiple-choice questions spanning 57 different subjects and requires general knowledge from the model. It is pretty challenging even for LLMs, although the newly released GPT-4 [76] outperforms existing models by a considerable margin in English with a satisfactory 86.5% accuracy.
Also, some tasks in Big-bench [96], which are designed to probe LLMs and extrapolate their future capabilities, rely heavily on the memorization of real-world knowledge. In such tasks, the performance of some LLMs is better than the average level of humans, and even comparable to the best human performance. For example, the task Hindu_knowledge requires models to give facts about Hindu mythology, Periodic Elements requires the capability of predicting the element name from the periodic table, and Physics tests the physics knowledge of models by asking for the formula needed to solve a given physics problem.

No use case.
There are some other tasks requiring knowledge different from the real-world knowledge learned by LLMs. In such tasks, LLMs are not notably superior.
Some tasks only require the model to capture the self-contained knowledge in the contexts. The knowledge in the contexts from the input is enough for the model to make predictions. For these tasks, small fine-tuned models can work pretty well. One such task is machine reading comprehension (MRC). An MRC task provides several paragraphs and requires the model to predict the answer to questions based on these paragraphs. We discussed MRC in the previous section because it is also a traditional NLU task.
Another scenario is that the knowledge within LLMs about the real world is useless to the task, or the required knowledge is even counterfactual to the real world. As a result, the LLMs cannot work well on such tasks. In some cases, inconsistent knowledge may even make the LLMs worse than random guessing. For example, in Big-Bench, the Mnist ascii task requires the model to tell the digit represented by an ASCII art. The capability required by this task has nothing to do with real-world knowledge. Also, in the Inverse Scaling Phenomenon competition [70], the task Redefine-math redefines a common symbol and requires the model to choose between the original meaning and the meaning derived from the redefinition. What it requires contrasts with the LLMs' knowledge, and thus LLMs even perform worse than random guessing.
As an alternative to the real-world knowledge stored in LLMs, models can be given access to extra knowledge and thus obtain enough knowledge for a task via retrieval augmentation. The basic idea of retrieval augmentation is to add an extra information retrieval step prior to making predictions, in which some useful texts related to the task are retrieved from a large corpus. Then, the model makes predictions based on both the input contexts and the retrieved texts.
With retrieved additional information, the closed-book task can become "open-book". In such a scenario, fine-tuned models are pretty good with much smaller sizes, because the required knowledge can be obtained by retrieving. For example, on NaturalQuestions [52], with extra corpus, retrieval augmented models [44,48] are much better than any other methods.
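The retrieve-then-predict idea can be sketched in a few lines. This is a toy under stated assumptions, not the method of the cited retrieval-augmented models: retrieval here is plain word overlap instead of a learned retriever, and the function names, the three-sentence corpus, and the `Context:`/`Question:` layout are all hypothetical.

```python
def retrieve(query, corpus, k=2):
    """Return the k documents sharing the most words with the query."""
    q = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_open_book_input(question, corpus, k=2):
    """Prepend retrieved passages so the model can answer 'open-book'."""
    passages = retrieve(question, corpus, k)
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


corpus = [
    "Mount Fuji is in Japan",
    "The Eiffel Tower is in Paris",
    "The Nile flows through Egypt",
]
print(build_open_book_input("where is the eiffel tower", corpus, k=1))
```

Since the answer now sits in the input, the downstream model mainly needs reading-comprehension ability rather than memorized facts, which is consistent with the observation that much smaller fine-tuned models do well in this setting.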

Abilities Regarding Scaling
Scaling of LLMs (e.g., parameters, training computation, etc.) can greatly empower pretrained language models. As a model scales up, it generally becomes more capable in a range of tasks. Reflected in some metrics, the performance shows a power-law relationship with the model scale. For example, the cross-entropy loss, which is used to measure language modeling performance, follows a power law in the model scale, so it falls along a straight line when loss and scale are both plotted on logarithmic axes; this is also called the 'scaling law' [41,47]. For some crucial abilities, such as reasoning, scaling the model has gradually transformed these abilities from a very low state to a usable state, even approaching human capabilities. In this section, we provide an overview of the usage of LLMs in terms of the abilities and behaviors of LLMs along with scaling.
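A power-law relationship between loss and model scale can be written as L(N) = (N_c / N)^alpha, which becomes a straight line with slope -alpha when both axes are logarithmic. The sketch below illustrates this numerically. The constants `n_c` and `alpha` are illustrative assumptions chosen only to make the example concrete; they are not values reported in this survey, and the helper names are hypothetical.

```python
import math


def power_law_loss(n_params, n_c=1e14, alpha=0.08):
    """Toy scaling law L(N) = (N_c / N) ** alpha with illustrative constants."""
    return (n_c / n_params) ** alpha


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


sizes = [1e8, 1e9, 1e10, 1e11]
losses = [power_law_loss(n) for n in sizes]
for n, loss in zip(sizes, losses):
    print(f"N = {n:.0e}  loss = {loss:.3f}")
print("slope on log-log axes:", round(loglog_slope(sizes, losses), 4))
```

Under this form, every tenfold increase in parameters multiplies the loss by the same constant factor, 10 ** (-alpha), so improvements are steady but come with diminishing absolute gains.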
Remark 5
(1) With the exponential increase of model scales, LLMs become especially capable of reasoning, such as arithmetic reasoning and commonsense reasoning.
(2) Emergent abilities, such as word manipulation and logical ability, arise serendipitously as LLMs scale up and open up new uses.
(3) In many cases, performance does not steadily improve with scaling, and how large language models' abilities change as they scale up is still only partially understood.

Use Case with Reasoning. Reasoning, which involves making sense of information, drawing inferences, and making decisions, is one of the essential aspects of human intelligence. It is challenging for NLP. Many existing reasoning tasks can be classified into commonsense reasoning and arithmetic reasoning.
Arithmetic reasoning/problem solving. The arithmetic reasoning capability of LLMs benefits greatly from the scaling of model size. For GPT-3, the ability of two-digit addition only becomes apparent when the number of parameters exceeds 13B [16]. Tasks to test arithmetic reasoning are trivial for humans and designed to challenge the capability of transferring natural language into mathematical symbols and multi-step inference. On GSM8k [26], SVAMP [79] and AQuA [61], LLMs, as generalists, have competitive performance with most methods that have task-specific designs. GPT-4 outperforms all other methods [76], even some huge models particularly tuned for arithmetic problems [104]. Nevertheless, it should be noted that, without the intervention of external tools, LLMs may occasionally make mistakes in performing basic calculations, although chain-of-thought (CoT) prompting [115] can significantly improve LLMs' ability in calculations.
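The two remedies mentioned above, chain-of-thought prompting and external tools, can be sketched as follows. Both helpers are hypothetical illustrations: `cot_prompt` shows one plausible way to place a worked demonstration before the question, and `safe_eval` is a small stand-in for an external calculator that could be used to check arithmetic in a model's answer. Neither is taken from the cited works.

```python
import ast
import operator

# Arithmetic operators the toy calculator is allowed to apply.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr):
    """Evaluate +, -, *, / arithmetic by walking the syntax tree (no eval())."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)


def cot_prompt(question, demo_question, demo_reasoning, demo_answer):
    """Prepend one worked example whose answer spells out intermediate steps."""
    return (
        f"Q: {demo_question}\n"
        f"A: {demo_reasoning} The answer is {demo_answer}.\n\n"
        f"Q: {question}\nA:"
    )


prompt = cot_prompt(
    question="A shelf holds 23 boxes of 4 pens plus 7 loose pens. How many pens?",
    demo_question="Tom has 3 bags of 5 apples and 2 more apples. How many apples?",
    demo_reasoning="3 bags of 5 apples is 3 * 5 = 15. Adding 2 gives 15 + 2 = 17.",
    demo_answer=17,
)
print(prompt)
print(safe_eval("23 * 4 + 7"))  # the calculator, not the model, does the arithmetic
```

The design choice is a division of labor: the prompt encourages the model to write out the steps, while the final numeric computation is delegated to deterministic code.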
Commonsense reasoning. Commonsense reasoning not only requires LLMs to remember factual knowledge but also requires LLMs to perform several inference steps about the facts. Commonsense reasoning ability increases gradually with the growth of model size. Compared to fine-tuned models, LLMs remain superior on most datasets, such as StrategyQA [36] and ARC-C [25]. Especially on ARC-C, which contains difficult questions in science exams from grade 3 to grade 9, GPT-4 comes close to 100% accuracy (96.3%) [76].

Use Cases with Emergent Abilities. Scaling of models also endows the model with some unprecedented, fantastic abilities that go beyond the power-law rule. These abilities are called "emergent abilities". As defined in [113], emergent abilities of LLMs are abilities that are not present in smaller-scale models but are present in large-scale models. This means such abilities cannot be predicted by extrapolating the performance improvements on smaller-scale models, and the model suddenly gains good performance on some tasks once the scale exceeds a certain range. Emergent abilities are typically unpredictable and surprising, so the tasks on which they appear can seem random or unexpected. We examine concrete examples of the emergent abilities of LLMs and provide them as an important reference for deciding whether to leverage LLMs' emergent abilities.
Handling word manipulation is a typical emergent ability. It refers to the ability to learn symbolic manipulations, such as the reversed words [16], in which the model is given a word spelled backwards and must output the original word. For example, GPT-3 [16] shows the emergent ability for word sorting and word unscrambling tasks. PaLM [22] exhibits the emergent ability on the ASCII word recognition and hyperbaton tasks. The logical abilities of language models tend to emerge as the model scales up, such as logical deduction, logical sequence, and logic grid puzzles. Additionally, other tasks, such as advanced coding (e.g., auto debugging, code line description) and concept understanding (e.g., novel concepts, simple Turing concepts), are also use cases with the emergent abilities of large language models.

4.4.3 No-Use Cases and Understanding. Although, as discussed above, larger models bring better performance in most cases, there are still many exceptions that should be considered when choosing the appropriate model.
On certain tasks, performance begins to decrease as the size of LLMs increases. Examples include Redefine-math, which tests whether language models are able to work with common symbols when they are redefined to mean something else; Into-the-unknown, which requires the model to choose which piece of information would help answer a question; and Memo-trap, which asks an LM to write a phrase in a way that starts like a famous quote but ends differently⁶. This is called the Inverse Scaling Phenomenon. Another interesting phenomenon observed in the scaling of LLMs is called the U-shaped Phenomenon [114].
As the name implies, this phenomenon refers to the observation that, as LLM size increases, performance on certain tasks initially improves but then starts to decline before eventually improving again. Examples include Hindsight-neglect, which tests whether language models are able to assess whether a bet was worth taking based on its expected value; NegationQA, which takes an existing multiple-choice dataset and negates a part of each question to see if language models are sensitive to negation; and Quote-repetition, which asks models to repeat back sentences given in the prompt, with few-shot examples to help them recognize the task. Hence, the risk of diminishing performance should be noted, and if the task is similar to those just discussed, careful consideration should be given to whether or not to use huge LLMs.
Gaining a deeper understanding of emergent abilities, the inverse scaling phenomenon, and the U-shaped phenomenon in LLMs is essential for advancing research in this field. In a certain sense, the U-shaped phenomenon suggests that small-scale models and huge-scale models make predictions with different internal mechanisms. From this perspective, the U-shaped phenomenon can be seen as a transformation of the inverse scaling phenomenon caused by emergent abilities of sufficiently large models [114]. GPT-4 [76] exhibits a reversal of the inverse scaling phenomenon in some cases, such as on the Hindsight Neglect task. The explanation for these behaviors of LLMs during scaling is still an open problem, and several hypotheses have been proposed. For emergent abilities, one explanation is that a task may involve multiple key steps and the LLM cannot handle the task until it is large enough to handle every step; another explanation focuses on the granularity of evaluation metrics [113]. For the inverse scaling and U-shaped phenomena, the explanations mainly focus on the model's over-reliance on information from its prior rather than the input prompt, valid but misleading few-shot examples, and distracting easier tasks within a hard task [114].
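When deciding whether a task at hand resembles these cases, it can help to label the shape of a measured scaling curve. The sketch below is one possible heuristic, assuming scores are listed from the smallest to the largest model; the function name, the labels, and the tolerance parameter are hypothetical and do not come from [113] or [114].

```python
def classify_scaling_trend(scores: list[float], tol: float = 0.0) -> str:
    """Label a list of task scores ordered from smallest to largest model.

    Changes whose magnitude is at most `tol` are treated as noise and ignored.
    """
    if len(scores) < 2:
        raise ValueError("need scores for at least two model sizes")

    # Record the sequence of directions (up = 1, down = -1), merging repeats.
    directions: list[int] = []
    for prev, cur in zip(scores, scores[1:]):
        delta = cur - prev
        if delta > tol:
            step = 1
        elif delta < -tol:
            step = -1
        else:
            continue
        if not directions or directions[-1] != step:
            directions.append(step)

    if not directions:
        return "flat"
    if directions == [1]:
        return "standard scaling"   # performance only improves with size
    if directions[-1] == -1:
        return "inverse scaling"    # performance is declining at the largest sizes
    return "U-shaped"               # a decline followed by a recovery
```

A curve labeled "inverse scaling" on the sizes tested so far may still turn out to be U-shaped once larger models are evaluated, which is why the label should be read as a description of the observed range only.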

Miscellaneous tasks
This section explores miscellaneous tasks that do not fit into the previous discussions, to better understand LLMs' strengths and weaknesses.
Remark 6 (1) Fine-tuned models or specialized models still have their place in tasks that are far from LLMs' pretraining objectives and data.
(2) LLMs are excellent at mimicking humans, data annotation, and data generation. They can also be used for quality evaluation in NLP tasks and bring bonuses such as interpretability.

No use case.
LLMs generally struggle with some tasks due to differences in objectives and training data.
Although LLMs have achieved remarkable success in various natural language processing tasks, their performance in regression tasks has been less impressive. For example, ChatGPT's performance on the GLUE STS-B dataset, a regression task evaluating sentence similarity, is inferior to that of a fine-tuned RoBERTa [130]. Regression tasks typically involve predicting a continuous value rather than a discrete label, posing unique challenges for LLMs. One primary reason for their subpar performance is the inherent difference between the language modeling objective and the regression task objective. LLMs are designed to predict the next word in a sequence or generate coherent text, with their pre-training focused on capturing linguistic patterns and relationships. Consequently, their internal representations may not be well-suited for modeling continuous numerical outputs. Besides, LLMs have predominantly been trained on text data, focusing on capturing the intricacies of natural language. As a result, their performance on multimodal data, which involves handling multiple data types such as text, images, audio, video, actions, and robotics, remains largely unexplored. Fine-tuned multimodal models, like BEiT [110] and PaLI [19], still dominate many tasks such as visual question answering (VQA) and image captioning. The recently introduced GPT-4 [76] has taken a step toward multimodal fusion, but there is still a lack of detailed evaluation of its capabilities.
4.5.2 Use case. LLMs are particularly suitable for certain tasks.
LLMs are very good at mimicking humans, acting as chatbots, and performing various kinds of tasks. The LLM-powered ChatGPT⁷ is surprising for its consistency, reliability, informativeness, and robustness across multiple turns of conversation with humans. The human-feedback procedure plays an important role in acquiring such abilities. LLMs can act both as good annotators and as data generators for data augmentation, such as in [27,29,99,121,122].
Some LLMs have been found to be as good as human annotators [37] in some tasks. Texts collected from GPT-3.5 (text-davinci-003) have also been used as human-like instruction-following demonstrations to train other language models [100].
LLMs can also be used for quality assessment on some NLG tasks, such as summarization and translation. On summarization tasks, GPT-4 as an evaluator achieves a higher correlation with humans than other methods by a large margin [64]. Some other LLM-based evaluators [34,50,64,108] also show good human alignment on more NLG tasks, especially compared with traditional automatic metrics. However, an LLM evaluator may be biased towards LLM-generated texts [64].
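Agreement between an LLM evaluator and human judges is usually reported as a correlation between the two sets of scores. The sketch below computes Spearman rank correlation in pure Python as one common choice; it is an illustration only, and the specific correlation measures used in [64] are not restated here.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def spearman(llm_scores: list[float], human_scores: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    if len(llm_scores) != len(human_scores) or len(llm_scores) < 2:
        raise ValueError("need two aligned score lists with at least two items")
    return pearson(average_ranks(llm_scores), average_ranks(human_scores))
```

To probe the bias noted above, the same correlation can be computed separately on human-written and LLM-generated outputs and the two values compared.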
Also, as discussed above, some abilities of LLMs bring bonuses in addition to performance improvement, such as interpretability. The CoT reasoning ability of LLMs can show how an LLM reaches its prediction, which serves as a good instance-level interpretation while also improving performance.

Real world "tasks"
In the last part of this section, we would like to discuss the usage of LLMs and fine-tuned models in real-world "tasks". We use the term "tasks" loosely, as real-world scenarios often lack well-formatted definitions like those found in academia.
Many requests to models cannot even be treated as NLP tasks. Models face challenges in the real world from three perspectives:
• Noisy/Unstructured input. Real-world input comes from real-world non-experts. They have little knowledge about how to interact with the model, or may not even be able to write fluently. As a result, real-world input data can be messy, containing typos, colloquialisms, and mixed languages, unlike the well-formed data used for pre-training or fine-tuning.
• Tasks not formalized by academia. In real-world scenarios, tasks are often not well defined by academia and are much more diverse than those in academic settings. Users frequently present queries or requests that do not fall neatly into predefined categories, and sometimes a single query contains multiple tasks.
• Following users' instructions. A user's request may contain multiple implicit intents (e.g., a specific requirement on output format), or the desired predictions may be unclear without follow-up questions. Models need to understand user intents and provide outputs that align with those intents.
Essentially, these challenges arise because users' requests deviate significantly from the distribution of any NLP dataset designed for a specific task. Public NLP datasets are not reflective of how the models are used [77].

Remark 7
LLMs are better suited to handle real-world scenarios compared to fine-tuned models. However, evaluating the effectiveness of models in the real world is still an open problem.
Handling such real-world scenarios requires coping with ambiguity, understanding context, and handling noisy input. Compared to fine-tuned models, LLMs are better equipped for this because they have been trained on diverse datasets that encompass various writing styles, languages, and domains. Additionally, LLMs demonstrate a strong ability to generate open-domain responses, making them well-suited for these scenarios. Fine-tuned models, on the other hand, are often tailored to specific, well-defined tasks and may struggle to adapt to new or unexpected user requests. They rely heavily on clear objectives and well-formed training data that specify the types of instructions the models should learn to follow. Fine-tuned models may struggle with noisy input due to their narrower focus on specific distributions and structured data. An additional system is often required to assist fine-tuned models in processing unstructured context, determining possible intents, and refining model responses accordingly.
Additionally, some mechanisms such as instruction tuning [91,112] and human alignment tuning [77] further boost the capabilities of LLMs to better comprehend and follow user instructions. These methods improve the model's ability to generate helpful, harmless, and honest responses while maintaining coherence and consistency [77,91,112]. While both methods can make LLMs generalize better to unseen tasks and instructions, it has been noticed that human labelers prefer models tuned for human alignment [77] to models tuned with instructions from public NLP tasks, such as FLAN [112] and T0 [91]. The reason may be similar to the reasons for fine-tuned models' inferiority: public NLP tasks/datasets are designed for easy and automatic evaluation, and they can only cover a small part of real-world usage.
One of the main issues when it comes to real-world scenarios is how to evaluate whether the model is good or not.
Without any formalized tasks or metrics, the evaluation of model effectiveness can only rely on feedback from human labelers. Considering the complexity and cost of human evaluation, there is no massive and systematic comparison between fine-tuned models and LLMs yet. Nevertheless, the huge success and popularity of LLMs such as ChatGPT have confirmed the superiority of LLMs to some extent.

OTHER CONSIDERATIONS
Although LLMs are suitable for various downstream tasks, there are some other factors to consider, such as efficiency and trustworthiness. Our discussion of efficiency encompasses the training cost, inference latency, and parameter-efficient tuning strategies for LLMs. Meanwhile, our examination of trustworthiness includes robustness & calibration, fairness & biases, potential spurious correlations, and the safety challenges in LLMs.

Remark 8
(1) Light, local, fine-tuned models should be considered rather than LLMs, especially for those who are sensitive to cost or have strict latency requirements. Parameter-efficient tuning can be a viable option for model deployment and delivery.
(2) The zero-shot approach of LLMs prohibits the learning of shortcuts from task-specific datasets, which is prevalent in fine-tuned models. Nevertheless, LLMs still demonstrate a degree of shortcut learning.
(3) Safety concerns associated with LLMs should be given utmost importance, as potentially harmful or biased outputs and hallucinations from LLMs can result in severe consequences. Some methods such as human feedback have shown promise in mitigating these problems.

Efficiency
In real-world deployment, performance, cost, and latency are all important considerations, not just the performance of the models. While some parameter-efficient methods have been developed, practitioners must balance efficiency with effectiveness in practice.
Cost. LLMs have grown increasingly larger in recent years, with models such as GPT-1, GPT-2, and GPT-3 featuring 117 million, 1.5 billion, and 175 billion parameters, respectively. The cost of training an LLM is heavily influenced by its size, with estimates suggesting that training the 11B parameter variant of T5 costs well over $1.3 million for a single run, while a single training run of GPT-3 175B requires $4.6 million [3]. The energy consumption for training large models is equally substantial. The total energy consumption for training a transformer model with 6B parameters to completion is estimated to be around 103.5 MWh [30]. Google reports that training PaLM consumed about 3.4 GWh in about two months [6]. Furthermore, the dataset size also scales rapidly with the size of the model, with GPT-3 175B trained on 499 billion tokens [16]. Another key metric that reflects the computing cost is FLOPs, with GPT-3 175B requiring 3.14 × 10^23 FLOPs, while a T5 11B model only requires 3.30 × 10^22, which is roughly 10 times less. In addition to these costs, hardware requirements are also substantial. OpenAI has collaborated with Microsoft on a supercomputer hosted in the Microsoft Azure cloud, consisting of 285k CPU cores and 10k high-end GPUs to support the training of large models. For users of the OpenAI API, pricing varies based on the model and usage, with options such as GPT-3.5-turbo charging $0.002 per 1k tokens for chat service. However, for users who require custom models, training costs $0.03 per 1k tokens, while usage costs $0.12 per 1k tokens [4]. Therefore, for users who cannot afford such a large cost, such as small startups and individual users, a small, fine-tuned model is a better and more reasonable choice.
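The figures in this paragraph lend themselves to a quick back-of-the-envelope check. The sketch below uses only the numbers quoted above; the workload of one million tokens and all variable and function names are hypothetical, and API prices change over time, so the output should be read as an illustration rather than a current quote.

```python
# Training compute figures quoted in the text.
GPT3_175B_TRAIN_FLOPS = 3.14e23
T5_11B_TRAIN_FLOPS = 3.30e22

# API prices in USD per 1k tokens, as quoted in the text [4].
PRICE_PER_1K_TOKENS = {
    "gpt-3.5-turbo chat": 0.002,
    "custom model training": 0.03,
    "custom model usage": 0.12,
}


def api_cost(num_tokens: int, price_per_1k: float) -> float:
    """Cost in USD of processing `num_tokens` at a given per-1k-token price."""
    return num_tokens / 1000 * price_per_1k


# Ratio of training compute between the two models (about 9.5).
flops_ratio = GPT3_175B_TRAIN_FLOPS / T5_11B_TRAIN_FLOPS

if __name__ == "__main__":
    print(f"GPT-3 175B needs about {flops_ratio:.1f}x the FLOPs of T5 11B")
    workload_tokens = 1_000_000  # hypothetical workload size
    for name, price in PRICE_PER_1K_TOKENS.items():
        print(f"{name}: ${api_cost(workload_tokens, price):.2f} per 1M tokens")
```

At these prices, serving the same workload through a custom model costs 60 times as much as through the chat endpoint, which is the kind of gap that makes a small fine-tuned model attractive for cost-sensitive users.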
Latency. Latency is a crucial factor to consider in real-world applications of LLMs. Inference time is a commonly used metric to measure latency, and it is highly dependent on the model size, architecture, and token size. For instance, the inference time for the GPT-J 6B model is 0.077s, 0.203s, and 0.707s when the max token size is set to 2, 8, and 32, respectively. Additionally, when the max token size is fixed at 32, the inference time for the InstructGPT model (davinci v2) is 1.969s. As LLMs are often too large to be run on a single user's machine, companies provide LLM services via APIs. The API latency can vary depending on the user's location, and the average latency of the OpenAI API service for a single request can range from a few hundred milliseconds to several seconds. In scenarios where high latency is not acceptable, LLMs may not be appropriate. For example, scalability is critical in many information retrieval applications. To deploy information retrieval systems on the web, search engines require very efficient inference for systems to be useful. The idealized denoised inference time for the InstructGPT davinci v2 (175B*) model is 0.21s per request (i.e., a query-passage pair to be scored), which is too slow for web search engines.
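Latency is best measured on the actual deployment path rather than taken from published numbers. The sketch below times repeated calls to an arbitrary callable and estimates the sequential cost of scoring many query-passage pairs. The stand-in model call, the function names, and the candidate count of 100 passages are assumptions for illustration; only the 0.21s per-pair figure comes from the text.

```python
import statistics
import time
from typing import Callable


def measure_latency(call: Callable[[], object], runs: int = 5) -> dict[str, float]:
    """Time `runs` invocations of `call` and summarize wall-clock latency."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        timings.append(time.perf_counter() - start)
    return {"median_s": statistics.median(timings), "max_s": max(timings)}


def fake_model_call() -> str:
    """Stand-in for a real model or API request (hypothetical)."""
    time.sleep(0.01)
    return "ok"


def sequential_rerank_time(per_pair_s: float, num_passages: int) -> float:
    """Total time to score every query-passage pair one after another."""
    return per_pair_s * num_passages


if __name__ == "__main__":
    print(measure_latency(fake_model_call))
    # 0.21 s per pair is quoted in the text; 100 candidates is an assumption.
    print(f"{sequential_rerank_time(0.21, 100):.1f} s to rerank 100 passages")
```

Under that assumption a single query would take about 21 seconds to rerank sequentially, which illustrates why the per-pair figure is considered too slow for web search.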
Parameter-Efficient Tuning. In practice, we may tune the model on some specific datasets. Parameter-Efficient Tuning (PET) is an efficient technique that tunes a small portion of model parameters (or extra parameters) while freezing most parameters of the pre-trained LLM. The main goal of PET is to greatly decrease the computational and storage costs while keeping the performance of the original model. Common PET techniques include LoRA [42], Prefix Tuning [58], and P-Tuning [62,63]. As an illustration, the LoRA method keeps the weights of the pre-trained model frozen and incorporates trainable low-rank matrices into every layer of the Transformer architecture. This approach considerably reduces the number of parameters that require training for downstream tasks, thereby increasing overall efficiency.
Alpaca-LoRA⁸ proposes integrating Low-Rank Adaptation (LoRA) into LLaMA-Alpaca, which makes it possible to fine-tune LLaMA within hours on a single RTX 4090. All these PET methods can be helpful either for fine-tuning a model to a specific task or for tuning LLMs to meet special requirements like human alignment.
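The saving from LoRA can be seen by counting parameters. For a frozen weight matrix of shape d_out × d_in, LoRA trains two low-rank factors of shapes d_out × r and r × d_in instead of a full update. The sketch below does this count; the layer size of 4096 and rank of 8 are illustrative assumptions, not settings reported for any model in the text.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d_out = d_in = 4096  # hypothetical layer width
    rank = 8             # hypothetical LoRA rank
    full = full_update_params(d_out, d_in)
    lora = lora_update_params(d_out, d_in, rank)
    print(f"full update: {full:,} parameters")
    print(f"LoRA update: {lora:,} parameters ({lora / full:.2%} of full)")
```

Because the adapted weight is the frozen matrix plus the product of the two factors, only the small factors need to be stored per task, which is what makes delivery of many task-specific variants cheap.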

Trustworthiness
Given that LLMs are now involved in sensitive areas such as healthcare, finance, and law, it is crucial to ensure that they are trustworthy and capable of producing reliable output.
Robustness and Calibration. The accuracy and robustness of LLMs have been shown to be very strongly correlated [59].
Models that have high accuracy on a scenario also have good robustness. However, zero-shot robustness becomes worse after a model is tuned on extra application-specific task data [116]. This may be due to overfitting, which leads to poor generalizability because of the extremely high complexity of the model and the limited training samples from downstream tasks [43]. In a similar vein, it has been observed that fine-tuning a model can result in significant miscalibration, owing to over-parameterization [51]. Therefore, fine-tuned models may not be an optimal choice when robustness and calibration are critical considerations. However, human alignment has been found to be a potential solution for enhancing model robustness. InstructGPT davinci v2 (175B*) has been shown to outperform other models in terms of robustness. On the other hand, achieving optimal calibration of the model depends on the scenario and adaptation procedure employed.
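Calibration can be quantified by comparing a model's stated confidence with its observed accuracy. The sketch below computes expected calibration error (ECE) with equal-width confidence bins, a widely used metric chosen here for illustration; the text does not say which calibration metric the cited works report, and the bin count of 10 is an assumption.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], num_bins: int = 10
) -> float:
    """ECE: sample-weighted average of |mean confidence - accuracy| over bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("confidences and correct must be non-empty and aligned")

    # Assign each prediction to an equal-width bin over [0, 1];
    # a confidence of exactly 1.0 goes into the last bin.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(mean_conf - accuracy)
    return ece
```

A model that answers with 90% confidence but is right only a quarter of the time in that bin contributes a gap of 0.65, whereas a well-calibrated model yields an ECE near zero.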
Fairness and Bias. LLMs have been shown to exhibit disparate treatment and impact, perpetuating societal biases and potentially leading to discrimination [10,17]. To ensure fairness and equity for all users, it is crucial to address these issues in the development and deployment of NLP models. Disparities in performance between demographic groups can serve as an indicator of fairness problems. LLMs are particularly susceptible to fairness issues, as significant performance disparities have been observed across demographic categories such as dialect, religion, gender, and race [59]. However, research has shown that aligning models with human instructions can improve LLM performance regardless of their size, with the InstructGPT model (davinci v2) exhibiting smaller performance disparities than other LLMs [23].
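Performance disparity between groups can be measured directly from labeled evaluation records. The following is a minimal sketch, assuming each record carries a group tag and a correctness flag; the record format, the group names, and the use of the maximum accuracy gap as the summary statistic are illustrative assumptions rather than the protocol of [59] or [23].

```python
from collections import defaultdict
from typing import Iterable


def accuracy_by_group(records: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Compute accuracy per group from (group, is_correct) records."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    return {group: hits[group] / totals[group] for group in totals}


def max_disparity(group_accuracy: dict[str, float]) -> float:
    """Gap between the best-served and worst-served group."""
    if not group_accuracy:
        raise ValueError("no groups to compare")
    return max(group_accuracy.values()) - min(group_accuracy.values())
```

A small gap does not by itself establish fairness, but a large one is a concrete signal that the model should be examined before deployment for the affected group.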
Spurious Biases. The shortcut learning problem has been observed in various natural language understanding tasks under the pretraining and fine-tuning paradigm, where models rely heavily on spurious correlations between inputs and labels in the fine-tuning data for prediction [31,35,98]. For example, in reading comprehension tasks, fine-tuned models tend to focus on the lexical matching of words between the question and the original passage, neglecting the intended reading comprehension task itself [53]. In contrast, large language models are not directly trained on fine-tuning datasets, which makes it less likely for them to learn shortcut features present in those datasets, thereby enhancing the model's generalization capabilities. However, LLMs are not infallible and may exhibit some shortcut learning during in-context learning. For example, recent preliminary studies have begun investigating the robustness of prompt-based methods in large-scale language models [111,129]. One such study evaluates the few-shot learning performance of GPT-3 on text classification and information extraction tasks [129] and reveals that the examined LLMs are susceptible to majority label bias and position bias, where they tend to predict answers based on the frequency or position of the answers in the training data. Moreover, these LLMs exhibit common token bias, favoring answers that are prevalent in their pre-training corpus. Recent studies show that this positional bias can be mitigated by selecting proper prompts [68]. In summary, while LLMs significantly reduce the shortcut learning problem prevalent in fine-tuned models, they still exhibit some shortcut learning issues and should be approached with caution when deploying them in downstream applications.
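One simple precaution against majority label bias and position bias is to inspect the few-shot demonstrations before building a prompt. The sketch below balances the number of demonstrations per label and shuffles their order. This is a generic heuristic offered for illustration; it is not the mitigation method of [68] or [129], and the function names and demonstration format are hypothetical.

```python
import random
from collections import Counter


def label_counts(demos: list[tuple[str, str]]) -> Counter:
    """Count how often each label appears among (text, label) demonstrations."""
    return Counter(label for _, label in demos)


def balance_and_shuffle(
    demos: list[tuple[str, str]], seed: int = 0
) -> list[tuple[str, str]]:
    """Keep an equal number of demonstrations per label, in shuffled order."""
    by_label: dict[str, list[tuple[str, str]]] = {}
    for text, label in demos:
        by_label.setdefault(label, []).append((text, label))
    if not by_label:
        return []

    per_label = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the prompt reproducible
    balanced = [
        example
        for items in by_label.values()
        for example in rng.sample(items, per_label)
    ]
    rng.shuffle(balanced)  # avoid grouping all examples of one label together
    return balanced
```

Comparing predictions across several seeds gives a rough sense of how sensitive the model is to demonstration order on the task at hand.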

Safety challenges
LLMs have demonstrated extremely strong capabilities in many areas such as reasoning, knowledge retention, and coding. As they become more powerful and human-like, their potential to influence people's opinions and actions in significant ways grows. As a result, some new safety challenges to our society should be considered and have attracted considerable attention in recent works [75,76].
Hallucinations. The potential for LLMs to "hallucinate," or generate nonsensical or untruthful content, can have significant negative impacts on the quality and reliability of information in various applications. As LLMs become increasingly convincing and believable, users may develop an overreliance on them and trust them to provide accurate information in areas with which they are somewhat familiar. This can be particularly dangerous if the model produces content that is entirely false or misleading, leading to incorrect decisions or actions taken based on that information. Such outcomes can have serious consequences in many domains, such as healthcare, finance, or public policy, where the accuracy and reliability of information are critical. To mitigate these issues, reinforcement learning from human feedback (RLHF) is widely used [75,77], and LLMs themselves have been integrated into the loop [75].
Harmful content. Due to the high coherence, quality, and plausibility of texts generated by LLMs, harmful content from LLMs can cause significant harm, including hate speech, discrimination, incitement to violence, false narratives, and even social engineering attacks. Implementing safeguards to detect and correct such content can serve as a mitigation [97]. These LLMs can also have dual-use potential by providing requested illicit information, leading to risks such as the proliferation of weapons [75] and even the planning of terrorist attacks. It is crucial to ensure that these LLMs are used responsibly, with safeguards in place to prevent harm. Also, in existing work, feedback from humans plays an important role in getting rid of harmful outputs.
Privacy. LLMs can face serious security issues. An example is the issue of user privacy. It is reported that Samsung employees were using ChatGPT to process their work when they inadvertently leaked top-secret data, including the source code of a new program and internal meeting minutes related to hardware. The Italian data protection agency declared that OpenAI, the developer of ChatGPT, illicitly gathered personal user data, leading Italy to become the first government to prohibit ChatGPT over privacy concerns [1].

CONCLUSION AND FUTURE CHALLENGES
Recent advances in large language models have been revolutionizing the field of natural language processing. Effectively using LLMs requires understanding their capabilities and limitations for various NLP tasks. This work presents a practical guide to working with LLMs for downstream NLP tasks. We first discuss prominent models like GPT-style and BERT-style architectures and the factors influencing their performance. We then explore using LLMs for downstream tasks, including knowledge-intensive tasks, NLU, and NLG tasks, and provide concrete examples of successes and limitations. This practical guide offers insights into LLMs and best practices for harnessing LLMs across NLP tasks.
We hope it will enable researchers and practitioners to leverage their potential and drive innovation in language technologies.
In the following, we outline future challenges for LLMs:
• Evaluation of proposed models on real-world "datasets". Existing deep learning models are primarily evaluated on standard academic datasets, such as ImageNet, which have been milestones in deep learning development. However, because of their limitations, standard academic datasets cannot exactly reflect real-world performance. As models advance, it is crucial to assess them on more diverse, complex, and realistic data that reflect real-world needs. Evaluating models on real-world "datasets", in addition to academic ones, will provide a more rigorous test of their capabilities, as well as a better understanding of their effectiveness in real-world applications. This ensures that the models are capable of addressing real-world challenges and delivering practical solutions.
• Model Alignment. Ensuring that increasingly powerful and autonomous models align with human values and priorities is essential. Methods must be developed to guarantee that these models behave as intended and do not optimize for undesirable outcomes. It is crucial to integrate alignment techniques from the start of the model development process. Model transparency and interpretability are also important factors for evaluating and ensuring alignment. Additionally, as we look toward the future, an even more daunting challenge looms: aligning superhuman systems. While this task is currently beyond our needs, it is important to consider and prepare for the potential implications of aligning such advanced systems, as they may present unique complexities and ethical concerns [8,15].
• Safety Alignment. While discussion of AI existential risks is important, concrete research is needed to guarantee the safe development of advanced AI. This includes techniques for interpretability, scalable oversight and governance, and formal verification of model properties. Safety should be considered not just an add-on but an integral part of the model-building process.
• Performance Prediction with Scaling. It is difficult to anticipate how model performance will change as model size and complexity increase dramatically. Developing methods to better predict model performance after scaling up or as new architectures are developed would allow for more efficient use of resources and accelerated progress. Some possibilities include training a smaller 'seed' model and extrapolating its growth, simulating the effects of increased scale or model tweaks, and benchmarking iterations of the model at different scales to build scaling laws. These could provide insight into the performance of models even before they are built.
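As an illustration of building a scaling law from benchmarked model sizes, the sketch below fits a power law of the form loss ≈ c · N^k by least squares in log-log space and extrapolates to a larger size. The functional form, the function names, and the synthetic data points are assumptions for illustration; no measured losses from the text are used.

```python
import math


def fit_power_law(sizes: list[float], losses: list[float]) -> tuple[float, float]:
    """Fit loss = coef * size**exponent via linear regression on logarithms."""
    if len(sizes) != len(losses) or len(sizes) < 2:
        raise ValueError("need at least two aligned (size, loss) points")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    spread = sum((x - mean_x) ** 2 for x in xs)
    if spread == 0:
        raise ValueError("model sizes must not all be identical")
    exponent = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / spread
    coef = math.exp(mean_y - exponent * mean_x)
    return coef, exponent


def predict_loss(coef: float, exponent: float, size: float) -> float:
    """Extrapolate the fitted power law to a new model size."""
    return coef * size ** exponent


if __name__ == "__main__":
    # Synthetic points generated from loss = 10 * N**-0.1 (not real measurements).
    sizes = [1e6, 1e7, 1e8]
    losses = [10 * s ** -0.1 for s in sizes]
    coef, exponent = fit_power_law(sizes, losses)
    print(f"fitted: loss = {coef:.2f} * N^{exponent:.3f}")
    print(f"extrapolated loss at 1e9 parameters: {predict_loss(coef, exponent, 1e9):.3f}")
```

Such a fit captures only smooth trends. Emergent abilities, inverse scaling, and U-shaped behavior are exactly the cases where extrapolation of this kind fails, which is why performance prediction remains an open challenge.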
c) Meta contributes significantly to open-source LLMs and promotes research of LLMs. When considering contributions to the open-source community, particularly those related to LLMs, Meta stands out as one of the most generous commercial companies, as all the LLMs developed by Meta are open-sourced.

Fig. 2. The decision flow for choosing LLMs or fine-tuned models² for users' NLP applications. The decision flow helps users assess whether their downstream NLP applications at hand meet specific conditions and, based on that evaluation, determine whether LLMs or fine-tuned models are the most suitable choice for their applications. During the decision process in the figure, Y means meeting the condition, and N means not meeting the condition. The yellow circle for Y of the last condition means there is no model working well on this kind of application.

4.3.1 Use case. In general, with billions of training tokens and parameters, LLMs have much more real-world knowledge than fine-tuned models. Closed-book question-answering tasks require the model to answer a given question about factual knowledge without any external information. This requires the memorization of real-world knowledge in the model. LLMs perform better