A Lightweight Constrained Generation Alternative for Query-focused Summarization

Query-focused summarization (QFS) aims to provide a summary of a document that satisfies the information need of a given query and is useful in various IR applications, such as abstractive snippet generation. Current QFS approaches typically involve injecting additional information, e.g. query-answer relevance or fine-grained token-level interaction between a query and document, into a finetuned large language model. However, these approaches often require extra parameters and training, and generalize poorly to new dataset distributions. To mitigate this, we propose leveraging a recently developed constrained generation model, Neurological Decoding (NLD), as an alternative to current QFS regimes, which rely on additional sub-architectures and training. We first construct lexical constraints by identifying important tokens from the document using a lightweight gradient attribution model, then force the generated summary to satisfy these constraints by directly manipulating the final vocabulary likelihood. This lightweight approach requires no additional parameters or finetuning, as it utilizes both an off-the-shelf neural retrieval model to construct the constraints and a standard generative language model to produce the QFS. We demonstrate the efficacy of this approach on two public QFS collections, achieving near parity with the state-of-the-art model with substantially reduced complexity.


Introduction
In modern search systems, users are often presented with short snippets of a candidate document on their search results page. This snippet serves as a critical element in helping users determine whether a document satisfies their information needs without requiring them to invest additional time. The effectiveness of a snippet largely depends on its ability to accurately and concisely capture the relevant information from the corresponding document in just a few lines of text [5,23].
This task of query-focused summarization (QFS) snippet generation, commonly referred to as query-biased summarization [31] or abstractive snippet generation [4], aims to construct a summary that succinctly addresses the information need of a query by extracting essential information from a document. Traditionally, QFS has used extractive methods that rely on the most relevant spans of text from a candidate document based on the prevalence of query terms [2,32]. Although efficient, this extractive approach is constrained by the format of the original document, with the effectiveness of the snippet heavily dependent on the length and information density of the candidate document [31]. Moreover, this paradigm limits the possibility of personalization or multiple-document QFS.
With the advent of large language models (LMs), a new paradigm has emerged that attempts to address the limitations of extractive snippet generation. These LM-based approaches directly generate abstractive snippets that do not necessarily appear anywhere in the original document [4,14,23]. While these methods hold promise, successful application is nontrivial. They often require specific architectures with additional parameters to incorporate the relevance signal, along with extensive fine-tuning. Moreover, a significant challenge with natural language generation, including QFS, is the problem of hallucination, where the model confidently generates false information [11,19]. This can lead to unreliable snippets that do not accurately reflect the content of the original document.
In this paper, we propose a novel lightweight alternative approach for QFS, relevance-constrained QFS, that achieves near parity with current state-of-the-art methods with significantly reduced complexity. Our approach does not require training an additional model or adding more parameters to the existing one. Instead, we demonstrate that QFS can be effectively and efficiently achieved by constraining a general language model (LM) to summarize the document under predefined lexical constraints, i.e., constraining a pretrained LM to favor the terms most important for the relevance of the document. While the notion of constrained generation [18] has emerged as a way to combat hallucination in other natural language generation tasks such as commonsense generation, constrained machine translation, conversation [35] and table-to-text generation [17], we show the unique advantages it possesses in the case of abstractive QFS.
To achieve this, we first identify the most critical tokens according to the ranking model, using the gradient signal of each token [26], as these salient terms capture the most important aspects of the document's relevance to the query. We then convert these tokens into predicate logic constraints and use them as input to a version of constrained generation, Neurological Decoding [18]. By constraining the LM to simultaneously satisfy these constraints and maintain fluency, we generate an abstractive summary that is optimized for relevance to the query. This approach allows us to effectively generate snippets without requiring additional complex modules or training methods, making it a lightweight yet effective alternative to the current state-of-the-art method.
Our experiments on two benchmark snippet generation datasets [12,20] demonstrate that this application of relevance-constrained QFS achieves comparable results to the current state-of-the-art method, suggesting a promising alternative perspective to the snippet generation task.

Related Work
Query-focused Summarization: To generate a query-focused summary, several studies used an additional query-attention mechanism. QR-BERTSUM-TL [13] incorporates query relevance scores into a pre-trained summarization model. Su et al. [29] propose merging the representation of an answer span predicted by a separate QA model into the Seq2Seq model's training and inference process to enforce the summary's coherence w.r.t. the query. QSG Transformer [23] suggests using a separate graph neural network model to learn per-token representations and fuse them into the Seq2Seq model to effectively generate a QFS. These mechanisms can be viewed as enforcing soft semantic constraints during the generation process, and require additional modules and parameters to function effectively. We opt for a different approach, i.e. explicitly enforcing lexical constraints during the generation process, without the additional machinery that is necessary to handle the soft semantic constraints.
Constrained Generation (or Conditional Generation) is a family of natural language generation (NLG) methods that aim to generate natural language including/excluding a set of specific words, i.e. lexical constraints. The standard recipe in the NLG domain leverages pre-trained large language models (LLMs) finetuned on specific datasets [7]. However, as pointed out by Lu et al. [18], such models, finetuned only in an end-to-end manner, do not learn to follow the underlying constraints reliably even when supervised with large amounts of training examples. Therefore, a line of work [1,10,17,18] in constrained generation proposes to explicitly modify the likelihood of next-word prediction in the generation stage, such that the predefined lexical constraints can be better satisfied.

Relevance-Constrained QFS
Problem Formulation: Given a query-document pair $(q, d)$, our task is to generate an abstractive summary $s$, which addresses the information need of the query.
We propose addressing this problem by leveraging relevance-constrained generation. In this section, we first introduce how we construct the set of constraints used by the language model to generate the abstractive summary. We then present the constrained generation process itself.
Identifying Constraints: In order to identify the most effective constraints for QFS, we first assume that each candidate document is relevant to the query. We then use a pointwise cross-entropy loss, L, to identify how each token contributes to the relevance of the document. To quantify this impact, we use a saliency-based mapping approach, as gradient-based attribution methods have been widely adopted in existing NLP literature [8,25,34].
Formally, denote an input sequence $(w_1, w_2, \dots, w_n)$, where $w_i$ is the $i$-th token, and let $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n)$ be the sequence of corresponding static token embeddings. Let $f(\cdot)$ be a function that takes $\mathbf{x}$ as input and outputs a prediction logit, e.g., a transformer-style model with a classification head. The gradients w.r.t. each input token can be regarded as each token's contribution, or saliency, to the final prediction $f(\mathbf{x})$. We denote this per-token gradient vector as $\mathbf{a} = (a_1, a_2, \dots, a_n)$, which is the normalized saliency across all tokens,
$$a_i = \frac{g(\nabla_{\mathbf{x}_i} L, \mathbf{x}_i)}{\sum_{j=1}^{n} g(\nabla_{\mathbf{x}_j} L, \mathbf{x}_j)}, \tag{1}$$
where $L$ denotes the loss between $f(\mathbf{x})$ and label $y = 1$, and $g(\cdot, \cdot)$ is the saliency function.
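While there exist various methods to estimate the saliency via $g(\cdot, \cdot)$ [8,27,28,30,34], we adopt InteGrad [30], as it is robust to input perturbations [34]. Specifically, InteGrad sums the gradients along the path from a baseline input $\mathbf{x}' = \mathbf{0}$ to the actual input $\mathbf{x}$:
$$g(\nabla_{\mathbf{x}_i} L, \mathbf{x}_i) = (\mathbf{x}_i - \mathbf{x}'_i) \times \sum_{k=1}^{m} \nabla_{\mathbf{x}_i} L\!\left(\mathbf{x}' + \tfrac{k}{m}(\mathbf{x} - \mathbf{x}')\right), \tag{2}$$
where $m$ is the number of steps to interpolate the input $\mathbf{x}$ and $\times$ denotes the dot product; thus $g(\nabla_{\mathbf{x}_i} L, \mathbf{x}_i)$ is a scalar indicating the saliency of token $w_i$ before normalization (Eq. 1). In our implementation, we follow the original setup in [30] and set $m$ to 10 steps.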
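The attribution step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy linear scorer `loss_gradient`, the weight vector `w`, and the two-dimensional embeddings are all assumptions chosen so that the gradient can be written by hand, standing in for a real reranker. The sketch sums gradients over 10 interpolation steps from an all-zero baseline and then normalizes across tokens.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(embeddings, w):
    """Gradient of L = -log(sigmoid(f(x))) w.r.t. each token embedding (label y = 1).

    f is a hypothetical linear scorer, f(x) = sum_i w . x_i, used only for illustration.
    """
    logit = sum(sum(wj * xj for wj, xj in zip(w, x_i)) for x_i in embeddings)
    dl_dlogit = sigmoid(logit) - 1.0  # derivative of -log(sigmoid(z)) w.r.t. z
    return [[dl_dlogit * wj for wj in w] for _ in embeddings]

def integrated_gradients(embeddings, w, steps=10):
    """Sum gradients along the straight path from a zero baseline to the input."""
    n, dim = len(embeddings), len(w)
    grad_sum = [[0.0] * dim for _ in range(n)]
    for k in range(1, steps + 1):
        alpha = k / steps
        # Interpolated input x' + alpha * (x - x') with baseline x' = 0.
        scaled = [[alpha * v for v in x_i] for x_i in embeddings]
        grads = loss_gradient(scaled, w)
        for i in range(n):
            for j in range(dim):
                # The usual 1/steps averaging factor is omitted; it cancels when normalizing.
                grad_sum[i][j] += grads[i][j]
    # Dot product of (x_i - x'_i) with the accumulated gradient: one scalar per token.
    return [sum(x * g for x, g in zip(embeddings[i], grad_sum[i])) for i in range(n)]

def normalize(scores):
    """Normalize across tokens; assumes the raw scores share a sign, as in this toy case."""
    total = sum(scores)
    return [s / total for s in scores]

tokens = ["privatization", "of", "water"]           # illustrative tokens
embeddings = [[2.0, 1.0], [0.1, 0.1], [1.0, 0.5]]   # made-up static embeddings
w = [0.5, 0.2]                                      # made-up scorer weights
saliency = normalize(integrated_gradients(embeddings, w))
```

In practice the gradients would come from automatic differentiation through the reranker rather than a hand-written derivative; the loop structure and the final normalization are the parts this sketch is meant to show.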
We note that any differentiable retrieval function can be used in place of $f(\cdot)$ within this framework. In this paper, we use a standard DistilBERT document reranker trained on MS MARCO using a cross-entropy loss [3,9,21,22].
In our preliminary experiments, we observed that the saliency scores are often noisy, attributing gradients to stopwords and/or punctuation. Therefore, we filter out stopwords and punctuation in a post hoc manner and keep only the top-3 important tokens from the document to construct the actual decoding constraints $C$.
Constructing Constraints: Having identified the most salient tokens, we construct the lexical constraints in a format appropriate for constrained generation, Conjunctive Normal Form (CNF):
$$C = (D_1 \lor \dots \lor D_i) \land \dots \land (D_k \lor \dots \lor D_n),$$
where each single $D_i$ denotes one positive or negative constraint, which we refer to as a literal, and the logical disjunction of literals is referred to as a clause, e.g. the disjunction of $D_1$ to $D_i$. In our implementation, we construct 3 clauses, with each clause initially consisting of a single literal. We then expand each clause by all possible forms of the original token via WordForms. An example of this logic corresponds to Row 1 of Table 2.
Constrained Generation: At inference time, we run a simplified version of the Neurological Decoding (NLD) algorithm using the set of constraints $C$ acquired from Section 3. As we do not use negative constraints in QFS, i.e. we do not avoid certain tokens, we consider only two states within the original NLD algorithm: reversible unsatisfaction, where an unsatisfied logical clause with a positive literal can be satisfied at a future point, and irreversible satisfaction, where a positive literal will remain satisfied. This predicate logic is then applied within a conventional beam search during generation.
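The constraint construction described above (filter stopwords and punctuation, keep the top-3 tokens, expand each into a clause of word forms) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the stopword list is a tiny placeholder, the `WORD_FORMS` table is a hypothetical stand-in for a word-form expansion library, and the tokens and saliency scores are made up.

```python
import string

STOPWORDS = {"the", "of", "a", "an", "is", "to", "and", "in"}  # tiny illustrative list

# Hypothetical inflection table standing in for a word-form expansion library.
WORD_FORMS = {
    "privatization": ["privatization", "privatize", "privatized"],
    "water": ["water", "waters"],
}

def top_k_tokens(tokens, saliency, k=3):
    """Drop stopwords and punctuation, then keep the k most salient distinct tokens."""
    candidates = {}
    for tok, score in zip(tokens, saliency):
        tok = tok.lower()
        if tok in STOPWORDS or all(ch in string.punctuation for ch in tok):
            continue
        # Keep the highest score if a token occurs more than once.
        candidates[tok] = max(score, candidates.get(tok, float("-inf")))
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:k]

def build_cnf(salient_tokens):
    """One clause per salient token; each clause is a disjunction (list) of word forms."""
    return [WORD_FORMS.get(tok, [tok]) for tok in salient_tokens]

# Made-up document tokens and normalized saliency scores.
tokens = ["The", "privatization", "of", "water", "improves", "efficiency", "."]
saliency = [0.05, 0.40, 0.02, 0.25, 0.08, 0.15, 0.05]
constraints = build_cnf(top_k_tokens(tokens, saliency))
```

Here `constraints` is a list of three clauses, read as a conjunction of disjunctions: the summary must contain at least one form from each inner list.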
At timestep $t$, the simplified algorithm performs three individual steps when filling in beam candidates: Pruning, Grouping, and Selecting. Pruning filters out candidates that are of low likelihood or satisfy fewer clauses; Grouping implicitly constructs the power set of all irreversibly satisfied clauses, leading to at most $2^{|C|}$ groups; and Selecting populates the beam with candidates within each group that are most likely to satisfy the remaining reversibly unsatisfied clauses by modifying the likelihood. Specifically, within each group, the likelihood is modified by the NLD score function:
$$s = P_\theta(y_t \mid y_{<t}) + \lambda \cdot \max_{D_i \in C_j,\ \mathbb{I}(C_j) = 0} \frac{|\hat{D}_i|}{|D_i|},$$
where $P_\theta(y_t \mid y_{<t})$ is the likelihood of the LM generating token $y_t$, $\mathbb{I}(C_j)$ indicates whether clause $C_j$ has been satisfied or not, $|\hat{D}_i| / |D_i|$ is the overlap between the ongoing generation and the partially satisfied literal $D_i$, e.g. $\hat{D}_i$ = "apple" and $D_i$ = "apple tree" yields 0.5, and $\lambda = 0.1$ acts as the hyperparameter. Intuitively, this score modification favors candidates moving toward fully satisfying a positive literal within an unsatisfied clause, with $\lambda$ controlling the strength of this signal. After this explicit likelihood modification, we visit each group and select the highest scoring candidate in rotation until the beam is filled. After this process is complete, we select the beam candidate with the highest score and proceed to generating the next token at $t + 1$. Although the group construction suggests a high-complexity runtime, implicit construction results in this algorithm having the same runtime complexity as standard beam search [18].
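The Grouping and Selecting steps, together with the score modification, can be sketched as below. This is a simplified illustration, not the NLD reference implementation: Pruning is omitted, candidates are plain (likelihood, token list) pairs rather than real beam hypotheses, literals are lists of tokens, and visiting groups that satisfy more clauses first is an assumption made here to fix a rotation order.

```python
def contains(generated, literal):
    """True if the literal's tokens occur contiguously in the generation."""
    n = len(literal)
    return any(generated[i:i + n] == literal for i in range(len(generated) - n + 1))

def partial_overlap(generated, literal):
    """Fraction of the literal matched by the tail of the generation (proper prefixes only)."""
    for k in range(len(literal) - 1, 0, -1):
        if generated[-k:] == literal[:k]:
            return k / len(literal)
    return 0.0

def satisfied_clauses(generated, clauses):
    """Indices of clauses with at least one literal already generated."""
    return frozenset(
        j for j, clause in enumerate(clauses)
        if any(contains(generated, lit) for lit in clause)
    )

def nld_score(likelihood, generated, clauses, lam=0.1):
    """LM likelihood plus a bonus for progress toward a still-unsatisfied clause."""
    done = satisfied_clauses(generated, clauses)
    bonus = max(
        (partial_overlap(generated, lit)
         for j, clause in enumerate(clauses) if j not in done
         for lit in clause),
        default=0.0,
    )
    return likelihood + lam * bonus

def select_beam(candidates, clauses, beam_size):
    """Group candidates by their satisfied clauses, then pick the best of each group in rotation."""
    groups = {}
    for likelihood, generated in candidates:
        key = satisfied_clauses(generated, clauses)
        score = nld_score(likelihood, generated, clauses)
        groups.setdefault(key, []).append((score, generated))
    for members in groups.values():
        members.sort(key=lambda item: item[0], reverse=True)
    # Assumption: groups satisfying more clauses are visited first.
    ordered_keys = sorted(groups, key=len, reverse=True)
    beam = []
    while len(beam) < beam_size and any(groups[key] for key in ordered_keys):
        for key in ordered_keys:
            if groups[key] and len(beam) < beam_size:
                beam.append(groups[key].pop(0))
    return beam

# Two clauses: one two-token literal, one single-token literal (illustrative only).
clauses = [[["apple", "tree"]], [["river"]]]
candidates = [
    (0.30, ["an", "apple"]),
    (0.32, ["a", "pear"]),
    (0.20, ["the", "river"]),
    (0.10, ["a", "river"]),
]
beam = select_beam(candidates, clauses, beam_size=3)
```

In this toy run the candidate ending in "apple" receives a bonus of 0.1 × 0.5 for being halfway through "apple tree", which lifts it above the otherwise more likely "a pear" within the group that has satisfied no clause yet.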
For fair comparison with existing methods, we use BART [14] and T5 [24] as the generating LMs for abstractive QFS. As this method introduces no additional parameters or modules, details of these backbone LMs are discussed in Section 4.

Table 1: Results on the test sets, including ROUGE-1, ROUGE-2 and ROUGE-L. Baseline results (the first section) are from [23]. Italic indicates the best performing system in the literature. † denotes that the constrained method is significantly better than its unconstrained counterpart with a paired t-test at the 0.05 level.

PubMedQA is a long-form abstractive question-answering dataset from the biomedical domain with the contexts available. We use the standard train-test split from the original datasets.
Compared Methods: To evaluate the performance of the proposed relevance-constrained generation method, we introduce the following baseline methods in order of increasing complexity:
• End-to-End approaches: Transformer [33], BART [14] and T5 [24] are finetuned for Seq2Seq summarization. In this configuration, there are no constraints during the generation process. BART and T5 additionally act as the backbone LMs for the proposed relevance-constrained QFS approach, i.e. Constrained-BART and Constrained-T5, such that the results are directly comparable.
• Improved query-document cross attention: SD2 [20] adds additional cross attention between the query and document encoders, then uses the combined representation for generation. CSA Transformer [36] adds conditional self-attention layers, originally designed for conditional dependency modeling, to the Seq2Seq model.
• Incorporated query-document relevance: QR-BERTSUM-TL [13] injects query relevance scores into a pretrained Seq2Seq summarization model; MSG [6] utilizes query relevance and the interrelation between sentences of the document for fine-grained representation. Similarly, BART-QFS [29] uses a pre-trained QA model to determine answer relevance in the document and injects this information into the Seq2Seq LM.
• Additional module utilization: QSG-BART [23] utilizes an additional graph neural network module to model token-level interaction between query and document, and injects this information into the Seq2Seq model. It reaches state-of-the-art performance on the QFS task, but requires additional parameters and training.
Evaluation Metrics: We evaluate the effectiveness of Constrained QFS with ROUGE-1, ROUGE-2, and ROUGE-L [15] for fair comparison to existing works [23,29].
Implementation Details: We adopt an off-the-shelf Cross Encoder model as our saliency model. We identify the top-3 important tokens with Eq. 2 and construct the constraints $C$ from them. We experiment with two pre-trained Seq2Seq models as the base generator, T5 [24] and BART [14]. Different from the previous works BART-QFS and QSG-BART [23,29], we do not warm start BART or T5 by pre-finetuning on existing abstractive summarization datasets; instead, we only finetune them on our target datasets, Debatepedia and PubMedQA. For T5, we format the input as Summarize: Document: d Question: q: and finetune the model weights on each dataset's training set with golden references. At inference time, we use the same input format and finetuned model weights for both relevance-constrained and unconstrained generation. For BART, we format the input as [CLS] d [SEP] q [EOS], where [CLS], [SEP], [EOS] are special tokens indicating the start, separator and end of the sequence; we then finetune and generate text in a similar fashion to T5. For both models, we finetune with the AdamW optimizer [16], a learning rate of 2e-5, and early stopping after no improvement on the dev set for three consecutive epochs. We make our code publicly available at https://github.com/zhichaoxu-shufe/Constrained-QFS.
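The two input formats described above can be written as small template functions. This is only a sketch: the function names are hypothetical, and the special tokens are shown as the literal strings used in the text, whereas a real tokenizer would typically insert its own start, separator and end tokens.

```python
def format_t5_input(document, query):
    """T5 input template as described in the text: document first, then the question."""
    return f"Summarize: Document: {document} Question: {query}:"

def format_bart_input(document, query):
    """BART input template; [CLS], [SEP], [EOS] are written out literally for illustration."""
    return f"[CLS] {document} [SEP] {query} [EOS]"

# Made-up document and query strings.
example_doc = "Privatization of water can improve efficiency."
example_query = "Should water be privatized?"
t5_text = format_t5_input(example_doc, example_query)
bart_text = format_bart_input(example_doc, example_query)
```

The same formatted string would be used at finetuning and at inference time, so that constrained and unconstrained generation differ only in the decoding procedure.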

Results and Analysis
We address two RQs in this section:
• RQ1: How competitive is the proposed constrained generation method in terms of performance compared to baselines?
• RQ2: How does constrained generation affect QFS performance?
To answer RQ1, we observe in Table 1 that the relevance-constrained methods achieve competitive performance on both datasets. On the Debatepedia dataset, Constrained-BART achieves near parity with the current state-of-the-art system and substantially outperforms all other baselines. This result is particularly interesting given the reduced complexity of Constrained-BART. On the PubMedQA dataset, Constrained-BART achieves slightly better performance than QSG-BART. A possible explanation for this improved performance might be the length of the documents in PubMedQA, where the relevance-constrained process results in a more consistent snippet. We therefore conclude that the proposed relevance-constrained generation paradigm can achieve competitive performance without additional parameters or finetuning.
To answer RQ2, we specifically draw a comparison between the proposed methods and their unconstrained baselines, which were finetuned end-to-end and generated QFS without constraints. In the second section of Table 1, we observe that the proposed constrained generation methods consistently outperform their unconstrained counterparts across different datasets and backbone LMs. For instance, on the Debatepedia dataset, Constrained-BART outperforms BART by 14.9% in R-2. Therefore, we conclude that by adding carefully constructed constraints into the generation stage, the performance of the QFS task can be significantly improved without modifying the backbone LMs.
Qualitative Analysis: We show two examples in Table 2. In the first example, the BART generation hallucinates "public owners", which is not faithful to the document; however, Constrained-BART is able to successfully summarize the document, as $C$ contains "privatization". In the second example, despite the underspecified query, the saliency model still extracts critical tokens, which aid in the generation of a meaningful summary.
Ablation Study: In Table 3 we study the effect of different sources of constraints. Query-only denotes that the top-3 important tokens are taken from the query; Document-only and Query+Document are defined analogously. We observe that on the Debatepedia dataset, Document-only constraints significantly outperform the other two approaches, while on PubMedQA this improvement is minor. After manual examination, we find that the golden references in the Debatepedia dataset overlap more with documents than with queries, while PubMedQA does not adhere to this trend.
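The three constraint sources compared in the ablation differ only in which tokens are eligible for the top-3 selection. A minimal sketch, assuming each token carries a hypothetical segment label ("query" or "document") and that stopword and punctuation filtering has already been applied upstream; the tokens and scores are made up:

```python
def select_by_source(tokens, saliency, segments, source="document", k=3):
    """Keep the k most salient tokens, restricted to the chosen source segment(s)."""
    allowed = {
        "query": {"query"},
        "document": {"document"},
        "query+document": {"query", "document"},
    }[source]
    scored = [(s, t) for t, s, seg in zip(tokens, saliency, segments) if seg in allowed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]

# Illustrative query tokens followed by document tokens.
tokens = ["water", "privatized", "benefits", "privatization", "improves", "efficiency", "costs"]
segments = ["query"] * 3 + ["document"] * 4
saliency = [0.20, 0.22, 0.05, 0.30, 0.10, 0.15, 0.08]

query_only = select_by_source(tokens, saliency, segments, source="query")
document_only = select_by_source(tokens, saliency, segments, source="document")
both = select_by_source(tokens, saliency, segments, source="query+document")
```

With these made-up scores, the Query+Document setting mixes tokens from both segments, which shows how query tokens can displace document tokens from the constraint set.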

Conclusion and Future Work
In this work, our lightweight relevance-constrained generation approach achieves competitive performance compared to the state-of-the-art method, and it can easily generalize to new domains provided that an effective retrieval model exists to guide the constraint construction. Our future work may involve investigating the effectiveness and summarization faithfulness/factuality of this approach in real-world IR systems.

Acknowledgement
Zhichao Xu is supported partially by NSF IIS-2205418 and NSF DMS-2134223. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

Table 2: Sample qualitative study on the Debatepedia dataset; tokens are marked salient and included in the constraint set $C$.

Table 3: Effect of the source of constraints on QFS performance with Constrained-BART. † denotes significantly better than the other two methods with a paired t-test at the 0.05 level.