Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

Zero-shot Visual Question Answering (VQA) is a prominent vision-language task that examines both the visual and textual understanding capability of systems in the absence of training data. Recently, by converting images into captions, information across modalities is bridged, and Large Language Models (LLMs) can apply their strong zero-shot generalization capability to unseen questions. To design ideal prompts for solving VQA via LLMs, several studies have explored different strategies to select or generate question-answer pairs as exemplar prompts, which guide LLMs to answer the current questions effectively. However, they largely ignore the role of question prompts. The original questions in VQA tasks often involve ellipsis and ambiguity, which require intermediate reasoning to resolve. To this end, we present Reasoning Question Prompts for VQA tasks, which can further activate the potential of LLMs in zero-shot scenarios. Specifically, for each question, we first generate self-contained questions as reasoning question prompts via an unsupervised question edition module that considers sentence fluency, semantic integrity and syntactic invariance. Each reasoning question prompt clearly indicates the intent of the original question, which results in a set of candidate answers. Then, the candidate answers, associated with confidence scores acting as answer heuristics, are fed into LLMs to produce the final answer. We evaluate reasoning question prompts on three VQA benchmarks. Experimental results demonstrate that they significantly improve the results of LLMs in the zero-shot setting and outperform existing state-of-the-art zero-shot methods on three out of four data sets. Our source code is publicly released at \url{https://github.com/ECNU-DASE-NLP/RQP}.

Figure 1: Comparison between existing prompting methods and our method on VQA tasks using frozen LLMs [6,56]. The images are first converted into captions. Prior studies proposed different strategies to select exemplars from training data, like PICa [54], or to generate synthetic exemplars, like Img2Prompt [16]. In contrast, our method focuses on question prompt generation, where self-contained questions are produced in an unsupervised manner such that LLMs can easily capture the intent of questions and fully exert their potential.

INTRODUCTION
Visual Question Answering (VQA) tasks require a system to answer a textual question about an image. Diverse studies have focused on solving visual questions whose answers can be directly derived from the image [3], or questions requiring outside knowledge beyond the image content [32,49]. Due to the enormous manual effort required to annotate VQA datasets and the risk of human biases [1,7], quite a few studies have proposed methods to solve zero-shot VQA tasks, where no image-question pair is provided for training [4,17,44].
To solve zero-shot VQA tasks, early studies developed methods to synthesize training data so that conventional VQA models could be trained on the synthetic data [7,8,17,44]. Recently, Large Language Models (LLMs), which are trained on general text corpora, have shown excellent generalization capability on zero-shot tasks such as information extraction [51] and logical reasoning [59]. Inspired by these intriguing properties, Yang et al. (2022) first proposed PICa, which converts images into captions, after which a frozen off-the-shelf LLM answers the question based on the caption context. This not only saves the effort of pre-training a multimodal model, but also provides world knowledge to answer the questions. Take the question in Figure 1 as an example: the image is converted into the caption "This is a blow dryer in a bathroom.", and the question "What is the appliance the woman is holding used for?" should be answered with the caption as context. To guide LLMs to better understand the task, in-context examples are selected from the training data as prompts. Soon after, another study proposed Img2Prompt [16], which generates synthetic question-answer pairs via template-based and neural question-generation methods based on the images, and has shown impressive performance on zero-shot VQA benchmark datasets.
We observe that most existing prompting methods for VQA tasks focus on developing different exemplar selection/generation strategies to help LLMs better comprehend the task and thus enhance their capacity. However, there is a need to eliminate the semantic gap between captions and questions, which can be illustrated from two aspects: (1) Current methods rely entirely on the understanding capability of LLMs to resolve the ambiguity and infer the intent of the questions, which might involve unexpected bias [21,41]. In Figure 1, the question asks about "this appliance", which refers to the "blow dryer". Due to the bias existing in LLMs, they may fail to parse the question correctly. (2) LLMs are brittle to ill-posed questions, especially under the zero-shot setting. In Figure 1, "the woman is holding" is irrelevant to the image. LLMs are sensitive to such noisy information, and it may cause confusion [58]. In this case, disambiguating the question is in high demand.
Motivated by this observation, we present RQ prompts, i.e., Reasoning Question prompts, for improving the understanding capability of LLMs under zero-shot VQA scenarios. Specifically, we design an unsupervised question edition module to convert original questions into self-contained questions by editing segments of the question. We propose a search algorithm to generate possible edited questions and rank them with a scoring function that measures sentence fluency, semantic integrity and syntactic invariance. Eventually, the top-ranked reasoning question prompts are utilized to generate a set of candidate answers. Following the heuristic prompting in Prophet [43], where prompting is divided into answer generation and answer choosing steps, we encode both the answer candidates and a confidence score to form answer heuristics for choosing. The confidence score takes both the confidence of the reasoning question prompt and that of the generated answer into consideration, which produces a comprehensive score for choosing. Our contributions can be summarized as follows:
• We propose RQ prompts, which aim to improve zero-shot VQA via LLMs by providing edited questions as prompts. No extra data or supervision is needed for the RQ prompt generation procedure.
• We design a novel confidence scoring function for the answer heuristics, which comprehensively measures the answer candidates.
• Reasoning question prompts generally improve existing baselines, with absolute improvements ranging from 0.3 to 5.2 points. Our method achieves new state-of-the-art results on three out of four evaluated zero-shot VQA data sets.

RELATED WORK
2.1 VQA tasks
Given a textual question, VQA tasks require a system to answer the question by decoding the information in an image and even utilizing external knowledge. Several benchmark datasets [32,42,48,49], including complex reasoning questions, facilitate the development of this field. To incorporate external knowledge, early methods turned to textual Knowledge Bases (KBs) and applied either graph-based [24,36,60,61] or transformer-based approaches [11,13] to introduce the KB information into the question answering module. Besides, multi-modal KBs are also leveraged to solve VQA tasks. Wu et al. (2022) combine Wikipedia, ConceptNet and Google images to supplement multi-modal knowledge. With the emergence of language models, researchers consider them as implicit KBs [43,54], and several studies [12,15,28,31] combine explicit and implicit knowledge to improve the model's ability to handle visual questions. Recently, large language models have impressed with a quantum leap in understanding and reasoning capabilities. Several studies [43,54] reformulate VQA tasks into a textual question answering task by converting the images into captions and apply in-context learning to activate the implicit knowledge in LLMs [6]. In this paper, we discuss VQA tasks under zero-shot scenarios, which introduces new challenges.

2.2 Zero/Few-shot VQA tasks
There is a line of work focusing on solving zero/few-shot VQA tasks. A general solution is to augment image-question pairs for training. Multi-modal pre-training models like CLIP [38] are frequently leveraged to generate synthetic question-answer pairs from images [4,7].
After that, a VQA model can be trained with the augmented data so that it can learn patterns and answer questions in the test set.

Figure 2: The illustration of our prompting method that enables LLMs to perform VQA tasks with two-step reasoning. The blue blocks denote the modules with frozen parameters and the orange blocks denote the modules we propose to generate reasoning question prompts and answer heuristics.

2.3 Prompt Tuning of LLMs
Prompts are significant to the inference of LLMs: they guide LLMs to activate their potential for understanding and reasoning. A question can be part of the prompt, and it should be well designed to fit the nature of the evaluated tasks [40]. For example, the pattern-verbalizer pair is one type of question prompt that maps diverse tasks into a word prediction task. Besides, there are other kinds of prompts. An instructional prompt primarily contains a natural language description of the underlying task; generally, a narrative sentence is annotated manually as the instruction prompt [35,50]. Recently, researchers have decomposed complex tasks into sub-tasks so that multiple instruction prompts guide LLMs to handle the sub-tasks step by step [22,49,52,57,59]. An exemplar prompt guides LLMs by showing some examples from the training data, and a number of studies propose different strategies to select or generate good exemplar prompts for LLMs [6,20,29,34]. Instead of discrete text, prompts can also take the form of continuous embeddings, and researchers have developed diverse methods to learn better embeddings [25,37]. Our work focuses on improving the question prompt by eliminating the semantic gap between the original question and the image for zero-shot VQA tasks.

METHODS
3.1 Overview
In this section, we introduce our prompting method for solving zero-shot VQA tasks. Following Prophet [43], which is a heuristic prompting framework, we decompose the task into two steps, as shown in Figure 2. In the prompting step for answer generation, we convert an image into a caption with a frozen caption model [55] as the context of the given question. In particular, we edit the question with an unsupervised method, namely the Unsupervised Question Edition module, to transform the original question into reasoning question prompts. For each reasoning question prompt, we generate a candidate answer from a frozen LLM. In the prompting step for answer choosing, we construct answer heuristics via the Answer Heuristics Construction module based on the candidate answers generated above. Then a frozen LLM is asked to choose the correct answer among these candidates. Each candidate answer in the prompt is associated with a confidence score that takes into account the confidence of both the question prompt and the answer.
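The two-step pipeline above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the caption model, the question-edition module, and the frozen LLM are abstracted as callables, and all function names here are assumptions for illustration.

```python
# Illustrative sketch of the two-step prompting pipeline.
# All components are passed in as callables (stand-ins for the frozen
# caption model, the unsupervised question-edition module, and the LLM).

def answer_vqa(image, question, caption_model, edit_questions,
               llm_generate, llm_choose):
    # Step 1: prompting for answer generation.
    caption = caption_model(image)                   # frozen captioner
    rq_prompts = edit_questions(question, caption)   # unsupervised edition
    candidates = [llm_generate(caption, rq) for rq in rq_prompts]
    # Step 2: prompting for answer choosing among the candidates.
    return llm_choose(caption, question, candidates)
```

With stub components, the function simply threads the caption and reasoning question prompts through the two steps.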

3.2 Prompting for Answer Generation
To bridge the gap between the image captions and the questions, we generate reasoning question prompts to avoid errors resulting from missing reasoning steps, and then generate candidate answers based on them. We require that reasoning question prompts meet the following criteria:
• The generated questions should not contain any ellipsis or ambiguity; in other words, they should be self-contained, so that LLMs can easily understand the question without guessing the implicit information. Take the question in Figure 2 as an example: "this weather phenomenon" in the question should be explicated by "clouds and lightning".
• The self-contained questions should be produced in the absence of supervision signals under a zero-shot setting. A neural network-based model is difficult to apply, as it requires a large volume of labeled data to learn how to generate a self-contained question.
To meet the above criteria, we propose an unsupervised method to edit the original question in consideration of its image caption. There are two advantages of editing the original questions instead of generating new ones: (1) Revising the original questions by substitution is controllable: only segments of the questions are changed, and the major semantics of the original questions are maintained. (2) Even without parallel labeled data, it is possible to edit the original question via a search algorithm with a search objective. On this basis, we design an unsupervised question edition module to convert the original question into a reasoning question prompt.
Unsupervised Question Edition. Inspired by existing work on text simplification [23], we design an edit-based search algorithm to produce the reasoning question prompts by conducting substitution operations on the constituency parse tree. As shown in Figure 3, "clouds and lightning" (NP) and "this weather phenomenon" (NP) are both phrase-level constituents with the same root tag "Noun Phrase" based on the constituency parse trees. By replacing "this weather phenomenon" with "clouds and lightning", we can obtain a self-contained question "what unpleasant emotional does clouds and lightning often cause?". Given a caption and a question, our search algorithm iteratively performs edits to search for candidates. Specifically, starting from the constituents of the caption, we consider all the constituents of the original question and conduct substitutions to generate candidates. Each candidate is measured by a scoring function considering sentence fluency, semantic integrity and syntactic invariance. A candidate with a score higher than a threshold is saved and further edited. The detailed search algorithm is displayed in Algorithm 1.
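The core substitution step can be sketched as follows. This is a toy illustration, not the paper's code: constituents are assumed to be pre-extracted (phrase, tag) pairs from a constituency parser, and the parser itself is not shown.

```python
# Illustrative sketch of tag-matched constituent substitution.
# Constituents are (phrase, root_tag) pairs, e.g. ("clouds and lightning", "NP").

def generate_candidates(question, question_constituents, caption_constituents):
    """Substitute caption constituents into the question wherever the
    constituency root tags match, yielding candidate self-contained questions."""
    candidates = []
    for q_phrase, q_tag in question_constituents:
        for c_phrase, c_tag in caption_constituents:
            # Syntactic invariance: only swap constituents with the same tag.
            if q_tag == c_tag and q_phrase in question:
                candidates.append(question.replace(q_phrase, c_phrase))
    return candidates

question = "what unpleasant emotion does this weather phenomenon often cause?"
q_cons = [("this weather phenomenon", "NP")]
c_cons = [("clouds and lightning", "NP"), ("the sky", "NP")]
print(generate_candidates(question, q_cons, c_cons))
```

In the full method, each candidate produced this way is then filtered by the scoring function before being edited further.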
Algorithm 1 Search Algorithm of Reasoning Question Prompts
1: procedure QE(C, Q)  ⊲ C and Q are the parse trees of the caption and the original question, respectively.
2:   for i = 1, ..., m do  ⊲ m: number of constituents in C
3:     for Q′ in S_search do
4:       for j = 1, ..., n do  ⊲ n: number of constituents in Q′
5:         Q̂ ← substitute the j-th constituent of Q′ with the i-th constituent of C
6:         if f(Q̂) > (f(Q′) − δ) then
7:           add Q̂ to S_search
8:   return S

Next, we present our scoring function. To evaluate the quality of a candidate Q̂, we consider the following aspects comprehensively:
• LM Score. We employ a probabilistic language model (LM) to measure the language fluency of a candidate, which is widely applied in unsupervised text compression and simplification tasks [19,33]. As the training objective of LMs is to maximize the likelihood of sentences, a fluent sentence has a higher joint probability, which can be denoted as f_lm(Q̂) = ∏_{i=1}^{n} P(q_i | q_{<i}), where q_i is the i-th token in Q̂ and n is the length of the sentence.
• Semantic Integrity. To avoid dramatic changes to the semantics of the original question after edition, we employ cosine similarity to measure meaning preservation, where the sentence embedding is computed as the weighted average of the token embeddings in the sentence. We denote it as f_si(Q̂) = cos(Q̂, Q).
• Syntactic Invariance. We would like to ensure that the alternative constituents hold the same syntactic attributes as the original ones, which maintains the syntactic structure of the original question and effectively avoids grammatical confusion. We identify whether the root tags of these constituents are the same or not, denoted by the indicator f_syn(Q̂) ∈ {0, 1}.
The overall scoring function is the product of the above aspects:
f(Q̂) = f_lm(Q̂)^α · f_si(Q̂)^β · f_syn(Q̂),  (1)
where the weights α and β denote the importance of the LM score and semantic integrity, respectively. It is worth noting that syntactic invariance is a hard indicator function: it only accepts the case where the root tag of the replaced constituent is unchanged, so no importance weight is needed. As we can see, f(Q̂) is a scalar that indicates how likely Q̂ is to act as a good reasoning question prompt for Q. Eventually, we obtain a set S that contains K reasoning question prompts.
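The overall score can be sketched as a product of the three aspects. This is a toy sketch, assuming pre-computed ingredients: the LM probability and the sentence embeddings here are stand-ins for the real models, and the weight values mirror the text only as defaults.

```python
import math

# Toy sketch of the overall scoring function: LM fluency, semantic
# integrity (cosine similarity), and a hard syntactic-invariance indicator.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score(lm_prob, emb_candidate, emb_original, same_root_tag,
          alpha=0.3, beta=1.0):
    """f(Q_hat) = f_lm^alpha * f_si^beta * f_syn, with f_syn a 0/1 indicator."""
    syntactic = 1.0 if same_root_tag else 0.0   # hard indicator, no weight
    return (lm_prob ** alpha) * (cosine(emb_candidate, emb_original) ** beta) * syntactic
```

A candidate whose replaced constituent has a different root tag is rejected outright, since the indicator zeroes the whole product.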
Prompt Design. With the K generated reasoning question prompts, we construct the prompts for answer generation by concatenating the caption and each reasoning question prompt. Following prior studies on prompt tuning [16,43,54], we construct the prompt with consideration of the instruction, context and question: Instruction: Please answer the question according to the contexts.

Context: [caption].
Question: [reasoning question prompt]. Answer:
We feed the K prompts into LLMs in turn, and greedy decoding is performed for each prompt. This results in K candidate answers with their confidence scores.
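The instruction/context/question layout above can be assembled as a simple template. This is an illustrative sketch; the helper name and exact whitespace are assumptions, not the authors' code.

```python
# Illustrative template for one answer-generation prompt, following the
# instruction / context / question layout described in the text.

def build_generation_prompt(caption, rq_prompt):
    return (
        "Instruction: Please answer the question according to the contexts.\n"
        f"Context: {caption}.\n"
        f"Question: {rq_prompt}\n"
        "Answer:"
    )

print(build_generation_prompt(
    "This is a blow dryer in a bathroom",
    "What is the appliance a blow dryer used for?"))
```

One such prompt is built per reasoning question prompt, so K prompts yield K greedy-decoded candidate answers.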
In Figure 2, different reasoning question prompts capture different objects in the image, such as "clouds and lightning" and "this clock"; together they cover the possible intents of the original question, which helps LLMs decode answers along diverse reasoning paths. This strategy follows a similar principle to Chain-of-Thought [22,57,59], which explicates the intermediate reasoning chains of the questions and makes it easier for LLMs to parse the question and perform complicated reasoning. After prompting for answer generation, we obtain two candidate answers, namely "fear" and "anxiety", which correspond to the two reasoning question prompts.

3.3 Prompting for Answer Choosing
Once we obtain multiple candidate answers, we construct prompts to let LLMs choose the final answer among these candidates, which are known as heuristics-enhanced prompts in Prophet. This helps the LLMs narrow down the range of answers. We follow this strategy but define different confidence scores in the Answer Heuristics Construction module.
Answer Heuristics Construction. Starting from the candidate answers generated from the different reasoning question prompts, we define the confidence score of a candidate answer as:
c(a_i) = P(Q̂_i) · P(a_i | Q̂_i),  (2)
where c(a_i) denotes the confidence score for answer a_i, which reminds LLMs to focus more on the candidate answers with higher scores. We consider the answer generated by the choosing prompt as the final answer.
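The confidence computation can be sketched as follows. This is an illustrative sketch under the assumptions that the edit score of each reasoning question prompt is normalized over all prompts and that scores of duplicate answers are accumulated; the function and variable names are ours, not the authors'.

```python
from collections import defaultdict

# Sketch of answer-heuristic confidence scores: c(a) = P(Q_hat) * P(a | Q_hat),
# where P(Q_hat) comes from normalizing the prompt edit scores f(Q_hat), and
# duplicate answers from different prompts have their scores summed.

def answer_confidences(prompt_scores, answers, answer_probs):
    total = sum(prompt_scores)
    conf = defaultdict(float)
    for f_q, ans, p_a in zip(prompt_scores, answers, answer_probs):
        conf[ans] += (f_q / total) * p_a
    return dict(conf)

print(answer_confidences([0.6, 0.4], ["fear", "anxiety"], [0.8, 0.5]))
```

The resulting scores are the ones attached to each candidate in the answer-choosing prompt.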
Compared with the two-stage prompting method of Prophet, our method differs in the way it generates and scores answer candidates, which is rooted in our different motivation. Prophet generates answer candidates by including frequent answers from the training set, which replays the answer prediction in the training data. Our method generates answer candidates by filling the original questions with possible intents, which shortens the semantic gap between images and questions under the zero-shot setting. It is worth noting that even though the prompting method is designed for the zero-shot VQA task, we can still insert in-context examples after the instructional prompt if needed.

EXPERIMENTS
In this section, we evaluate reasoning question prompts on zero-shot VQA tasks and compare them with existing methods. Furthermore, we perform a comprehensive analysis to interpret their performance under different scenarios. We also conduct an ablation study on important design choices and show some qualitative examples.

Experimental Setup
Datasets. We evaluate reasoning question prompts on OK-VQA [32], A-OKVQA [42] and VQAv2 [14], which contain image-question pairs derived from the COCO dataset [27]. The questions in these datasets require perception of the image, and some even require commonsense beyond the image content to answer. Specifically, OK-VQA contains 5,046 test questions. A-OKVQA contains 1,100 and 6,700 questions for validation and testing, respectively. VQAv2 is a large dataset; we leverage its validation set for evaluation, which contains 214,354 questions. We follow the official evaluation metrics of each dataset to measure performance.
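For reference, the soft accuracy at the core of the official VQA metric can be sketched as below. This is a simplified sketch: the official evaluation additionally averages the score over annotator subsets and normalizes answer strings, which is omitted here.

```python
# Core of the standard VQA soft-accuracy rule: an answer counts as fully
# correct if at least three of the ten annotators gave it, with partial
# credit for one or two matches.

def vqa_accuracy(prediction, human_answers):
    matches = sum(1 for a in human_answers if a == prediction)
    return min(matches / 3.0, 1.0)

assert vqa_accuracy("girl", ["girl"] * 4 + ["boy"] * 6) == 1.0
```

A prediction matching only two annotators, for example, receives a score of 2/3 rather than zero.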
Comparable Methods. As our reasoning question prompts can collaborate with any LLM, we evaluate our method with different LLMs as backbones. Notably, existing methods like PICa and Img2Prompt are prompting methods that provide exemplar prompts for VQA tasks. We consider these methods as baselines, then include our reasoning question prompts and observe whether the involvement of our method brings any performance improvement.
Besides, we compare our method with other pre-trained zero-shot VQA methods, such as Flamingo [2], Frozen [45], VL-T5 [9], FewVLM [18] and VLKD [10]. These methods pre-train different multi-modal models on large-scale vision-language datasets, which can be easily adapted to new VQA challenges without training.
Implementation Details. For the LM used in the unsupervised question edition module, we leverage a pre-trained LM from existing work, which is a two-layer, 256-dimensional recurrent neural network with gated recurrent units (GRU) [23], fine-tuned on in-domain questions (see Appendix A.1). The weights α and β in Equation (1) are set to 0.3 and 1, respectively. We set the threshold to 0.5 to avoid overwhelming reasoning question prompts. If there is a maximum limit on the number of generated reasoning question prompts, we sort all candidate reasoning question prompts and select the top-K based on their scores. More details about the unsupervised edition module can be found in Appendix A.1.

Table 1: Zero-shot evaluation on VQAv2, OK-VQA, and A-OKVQA. The first section contains zero-shot methods with LLMs, which utilize no training data but may synthesize some exemplars. The middle section contains zero-shot methods with end-to-end training on other multi-modal data. The last section contains few-shot methods with LLMs. The numbers in brackets denote the improvement brought by our reasoning question prompts. The results with ⋄ denote baselines we implement ourselves; otherwise, we copy results from the original papers.

Regarding LLMs, to show the generalization capability of our reasoning question prompts, we conduct experiments on LLMs with different sizes, including the open-source OPT, GPT-3 and BLOOM. Regarding different baselines, such as Img2Prompt and PICa, we follow their official implementations to convert images into captions via either the VinVL-base pre-trained checkpoint or BLIP, and generate exemplar prompts via either CLIP or a fine-tuned T5-large model. Notably, we implement a light version of Img2Prompt on the VQAv2 dataset due to our computational limitations; the details can be found in Appendix A.2.

Main Results
We display our main results in Table 1 and make the following observations. Overall effect of reasoning question prompts. Our reasoning question prompts improve the performance of zero-shot VQA methods on most baselines, with absolute improvements ranging from 0.3 to 5.2 points. The largest gain is on the A-OKVQA validation set with the PICa baseline, where the absolute improvement is 5.2 points. This is a setting without any exemplar, which indicates the potential of reasoning question prompts in scenarios with no access to any VQA data. Even with the synthetic exemplars generated by the Img2Prompt model, there is a general improvement from reasoning question prompts. We observe a similar effect of RQ prompts with different LLMs; the results are displayed in Appendix A.3.
Comparison with other methods. Compared with existing zero-shot methods, reasoning question prompts with the Img2Prompt {GPT-175B} baseline outperform all existing zero-shot VQA methods and achieve new state-of-the-art results on zero-shot evaluation with frozen LLMs on three out of four data sets. Although we do not surpass Img2Prompt {OPT-175B} on the VQAv2 validation set, our reasoning question prompts still bring a performance gain over our light re-implementation results. We notice that some competitive methods, like the pre-trained VQA method Flamingo {80B} and the few-shot LLM-based method Prophet {GPT-175B}, lead to higher results than ours. The former is computationally expensive, as it is pre-trained on billion-scale multi-modal datasets. The latter makes use of VQA training samples, obtaining more guidance directly from the training data.

More Analysis of RQ Prompts
Effect of K for RQ Prompts on Different Shot Numbers. As the results in Table 1 mix the effects of shot number and LLMs, to analyze the effect of reasoning question prompts under different shot numbers, we keep the other settings fixed and see how the performance changes with an increasing number of shots. The results are displayed in Figure 4 (a)-(c). As we can see, the performance gradually improves with increasing K. During answer generation, the more reasoning question prompts generated, the more likely we are to recall the correct answers. We observe the largest performance gain in the zero-shot setting, which indicates that a reasoning question prompt is most likely to help when there is little guidance for the question. Providing LLMs with self-contained questions, which explicate the intermediate reasoning, can fully activate the potential of LLMs. When the shot number reaches 16, the gain from reasoning question prompts becomes the least visible.
Effect of K for RQ Prompts on Different LLMs. We further fix the shot number to 0 and test the effect of K for reasoning question prompts on LLMs with different parameter sizes. The results are displayed in Figure 4 (d)-(f). Similarly, the performance gain increases with increasing K. Furthermore, the performance gain grows with model size. For GPT-Neo-2.7B, the performance increase brought by reasoning question prompts is not obvious, at only around 1 point; for OPT 30B, it reaches around 3 points. This is because larger LLMs usually contain more knowledge to answer a question: after the implicit intent is resolved by the reasoning question prompts, we can take full advantage of their knowledge to answer questions correctly.
Ablation Study. We further evaluate the performance of different prompt design strategies; the results are displayed in Table 2. If we eliminate the two-stage prompting and simply choose the answer with the highest c(a) in Equation (2) as the final answer, we observe a drop of around 1 point. This indicates that the answer choosing step is needed: it gives LLMs a chance to review the original question in consideration of the candidate answers. For prompting for answer generation, we omit the aspects of the scoring function in turn; the results indicate that all aspects are important for generating a reasoning question prompt. Among them, syntactic invariance, which measures the consistency of the substituted segments, is the most significant: a replaced constituent with a different syntactic tag easily leads to a chaotic sentence that cannot be understood by LLMs. The LM score and semantic integrity are also helpful in measuring the fluency and meaning preservation of the sentences.
For prompting for answer choosing, we omit the candidate construction and simply include the candidate answers without their confidence scores, which results in a performance drop. After changing the confidence score to P(a_i | Q̂_i) alone, the performance also decreases, which indicates the importance of our answer heuristics construction.
Case Study. We display some cases in Figure 5 to investigate how our reasoning question prompts work in zero-shot VQA tasks. Example (a) contains an image of a blow dryer, and the generated caption is "This is a blow dryer in a bathroom". The visual question is "What is the appliance the woman is holding used for?". As we can see, this is an ill-posed question, as there is no woman shown in the image. The result of LLMs without any reasoning question prompt is "cutting hair", which may be caused by the unexpected bias of the LLMs. Based on the caption and question, we generate reasoning question prompts such as "What is the appliance a blow dryer used for?" and "What is the appliance a bathroom is holding used for?", which successfully bridge the gap between the image and the question, so the LLMs can predict the correct answer. Similarly, in example (b), there is a gap between "the child eating" and the image: the queried object is not explicitly mentioned in the question, so LLMs must infer the object that the question is asking about. Reasoning question prompts such as "What is dishes in front of her?" and "What is a cup with food in dishes in front of her?" explicate the queried object so that LLMs can easily understand the question and return the correct answers. More cases can be found in Appendix A.4.

CONCLUSION
In this paper, we investigate zero-shot VQA tasks via LLMs, where images are first converted into captions and then LLMs answer questions based on the caption contents. We propose a way to generate reasoning question prompts, which explicate the intermediate reasoning step of a question and eliminate the semantic gap between the question and the caption. The experiments show that reasoning question prompts improve existing zero-shot VQA methods with different LLM backbones and achieve new state-of-the-art performance on multiple zero-shot VQA data sets.

A APPENDIX
A.1 Details about Unsupervised Question Edition
For each sentence, we use CoreNLP (https://stanfordnlp.github.io/CoreNLP/) to construct the constituency tree and Spacy (https://spacy.io/) to obtain the part-of-speech and dependency tags of the words.
The syntax-aware LM we use takes words, POS tags and dependency tags as input, which can be denoted as w(x) = [v(x); p(x); d(x)], where v(x) is the word embedding, p(x) is the POS tag embedding and d(x) is the dependency tag embedding. The dimensions of the POS tag and dependency tag embeddings are 150, and the dimension of the word embedding is 300. w is fed into the LM [23], which makes the LM sensitive to sentence structure. We directly take the checkpoint (https://github.com/ddhruvkr/Edit-Unsup-TS) of the syntax-aware LM from the prior study [23] on text simplification as the initialization. This checkpoint is initially trained on the WikiLarge dataset. We fine-tune the model on the questions in the OK-VQA test data so that the LM can be quickly adapted to the VQA domain. For fine-tuning, we use the stochastic gradient descent algorithm with a dropout rate of 0.4 and a batch size of 32.
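The per-token input construction can be sketched as a plain concatenation. This is a toy illustration of w(x) = [v(x); p(x); d(x)] only; the embedding lookups and the GRU itself are omitted, and the vectors here are placeholders.

```python
# Toy sketch of the syntax-aware input: per token, the word, POS-tag, and
# dependency-tag embeddings are concatenated before being fed to the GRU LM.
# Dimensions follow the text: 300 (word) + 150 (POS) + 150 (dependency).

def token_input(word_vec, pos_vec, dep_vec):
    """w(x) = [v(x); p(x); d(x)] as one flat feature vector."""
    return list(word_vec) + list(pos_vec) + list(dep_vec)

v = [0.1] * 300   # word embedding v(x)
p = [0.2] * 150   # POS tag embedding p(x)
d = [0.3] * 150   # dependency tag embedding d(x)
assert len(token_input(v, p, d)) == 600
```

Each token thus contributes a 600-dimensional input vector to the two-layer, 256-dimensional GRU language model.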

A.2 Details of Re-implementation of Img2Prompt on VQAv2 Dataset
We re-implement the source code of Img2Prompt to generate synthetic examples. Due to the large size of the VQAv2 dataset and our limited computational resources, we implement a light version. Specifically, Img2Prompt leverages BLIP to generate captions from a given image and conducts image-question matching. In the official implementation, they sample 10 image patches and then generate 100 question-relevant captions, from which they produce 30 question-answer pairs. We also sample 10 image patches for each image but generate only 20 captions by adjusting the generation number. Based on these 20 captions, we subsequently generate 10 question-answer pairs. These question-answer pairs are utilized as the exemplar prompts for answer generation and answer choosing. Therefore, there might be some information loss in our implementation, as some important exemplars might be filtered out.

A.3 RQ Prompts with Different LLMs
To verify the scaling effect of reasoning question prompts with different LLMs, we conduct experiments on the A-OKVQA validation set with Img2Prompt as the baseline but with different LLMs. The results are displayed in Table 3. Specifically, we evaluate GPT-3.5 175B, GPT-Neo 2.7B [5], BLOOM 7.1B [39], GPT-J 6B [47] and OPT-125M [56]. As we can see, the performance of zero-shot VQA tasks is affected by the size of the LLM: an LLM with a larger model size usually yields better performance, which is also verified in a prior paper [16]. Importantly, including reasoning question prompts always improves the performance, which further verifies the generalization capability of our method.

A.4 Case Study


Figure 3: The generation process of reasoning question prompts in the unsupervised question edition module. Both the question and the caption are transformed into constituency parse trees. The phrase-level constituents in the caption correspond to different objects in the image, which are shown in different colors. They are utilized to substitute segments of the original question to form a complete self-contained question. The yellow shades indicate the constituents we substitute to form a reasoning question prompt.

Figure 4: (a)-(c) denote the evaluation of RQ prompts on the OK-VQA test set with OPT 6.7B as the LLM but with different shot numbers. (d)-(f) denote the evaluation of RQ prompts on the OK-VQA test set with the shot number equal to 0 but with different LLMs. We display the results of prompting for answer generation and answer choosing. The x-axis denotes the value of K and the y-axis denotes the accuracy. For prompting for answer generation, we report the maximum accuracy among all the reasoning question prompts.

Figure 5: Examples in the A-OKVQA validation set whose predictions are originally incorrect but become correct with RQ prompts.
We display more examples in the OK-VQA test set to show how the reasoning question prompts work in zero-shot VQA tasks. The displayed examples are predicted by Img2Prompt+RQ prompts {OPT-30B}, and the original predictions without RQ prompts are incorrect. In the figure, the edited segments of the question are highlighted in red and the correctly predicted answers are highlighted in green.
Example 1
Caption: A child's bedroom with pink and white decor.
Question: Is this a room for a boy or girl?
GT answer: girl
Prompting for Answer Generation:
Q1: Is pink and white decor a room for a boy or girl? A1: It is room for boy.
Q2: Is this pink and white decor for a boy or girl? A2: girl
Q3: Is this a room for pink and white decor? A3: no
Prompting for Answer Choosing:
Question: Is this a room for a boy or girl?
Candidates: girl (0.51); it is room for boy (0.21); no (0.27)
Predicted Answer: girl

P(Q̂_i) is the probability that we generate Q̂_i, obtained by normalizing f(Q̂) over the K prompts, and P(a_i | Q̂_i) is the probability of the generated answer a_i given Q̂_i via the LLM. Since different reasoning question prompts may lead to the same answer, we can have M candidate answers, where M ≤ K. As we can see, the confidence score takes the confidences of both question edition and answer generation into account, which comprehensively depicts the likelihood of a candidate answer for answer choosing.

Table 2: Performance on the A-OKVQA validation set with Img2Prompt as the baseline but with different prompt designs.

Table 3: Zero-shot performance on the A-OKVQA validation set with Img2Prompt as the baseline but with different LLMs. Δ denotes the performance gain brought by RQ prompts.