Deep Offline Reinforcement Learning for Real-world Treatment Optimization Applications

There is increasing interest in data-driven approaches for recommending optimal treatment strategies in many chronic disease management and critical care applications. Reinforcement learning methods are well-suited to this sequential decision-making problem, but must be trained and evaluated exclusively on retrospective medical record datasets as direct online exploration is unsafe and infeasible. Despite this requirement, the vast majority of treatment optimization studies use off-policy RL methods (e.g., Double Deep Q Networks (DDQN) or its variants) that are known to perform poorly in purely offline settings. Recent advances in offline RL, such as Conservative Q-Learning (CQL), offer a suitable alternative. However, challenges remain in adapting these approaches to real-world applications where suboptimal examples dominate the retrospective dataset and strict safety constraints need to be satisfied. In this work, we introduce a practical and theoretically grounded transition sampling approach to address action imbalance during offline RL training. We perform extensive experiments on two real-world tasks for diabetes and sepsis treatment optimization to compare the performance of the proposed approach against prominent off-policy and offline RL baselines (DDQN and CQL). Across a range of principled and clinically relevant metrics, we show that our proposed approach enables substantial improvements in expected health outcomes while remaining in accordance with relevant practice and safety guidelines.


Introduction
Deep reinforcement learning (RL) has recently experienced a surge in popularity thanks to demonstrated successes in game playing (e.g., Atari and Go [Mnih et al., 2013, Silver et al., 2017]) and with AI bots (e.g., ChatGPT). Given its ability to learn from large real-world experience datasets, there is also immense excitement about the potential of deep RL for clinical decision support applications. In many such applications, the objective is to leverage historical medical records containing information on patient characteristics, disease state evolution, treatment decisions and clinical outcomes, and to learn treatment policies that will optimize clinical outcomes of interest. Notably, deep RL has been used for treatment optimization for a range of clinical conditions including sepsis, hypertension, type 2 diabetes, and cancer [Raghu et al., 2017a, Roggeveen et al., 2021, Sun et al., 2021, Zheng et al., 2021, Tseng et al., 2017]. However, unlike traditional game-playing or consumer-oriented applications on which deep RL methods have been developed and widely tested, treatment optimization applications are not amenable to learning through active interaction. This is due to critical safety concerns, which forbid direct online exploration of treatment alternatives on patients.
Offline RL (also known as batch RL [Lange et al., 2012]), an approach to learn from large, previously collected datasets without any interaction with an environment, is then ideally suited for treatment optimization applications. Yet, the treatment optimization literature has almost exclusively focused on traditional value-based off-policy RL methods, particularly Double Deep Q Networks (DDQN) [van Hasselt et al., 2016] and its variants [Sun et al., 2021, Zheng et al., 2021, Raghu et al., 2017a, Lu et al., 2020, Peng et al., 2018, Yu et al., 2019, Zhu et al., 2021]. However, direct use of off-policy RL algorithms in an offline setting is known to perform poorly in general, due to issues with bootstrapping from out-of-distribution (OOD) actions [Kumar et al., 2019a] and overfitting to unseen actions [Agarwal et al., 2020, Fu et al., 2019]. In other words, off-policy methods could overestimate the Q-values of unseen state-action pairs, and mistakenly select unacceptable or even unsafe actions. Recently proposed offline RL methods such as Conservative Q-Learning (CQL) and Model-based Offline Policy Optimization (MOPO) [Kumar et al., 2020, Yu et al., 2020a] address this overestimation problem by regularizing the Q-values for unseen actions during training and by lower-bounding value function estimates.
However, translating these advances in offline RL directly to real-world treatment optimization applications remains challenging. One key challenge is that medical record datasets reflect real-world clinical practice and hence contain a mixture of both optimal and suboptimal actions. In many cases, suboptimal treatments may be prescribed due to patient preferences, communication difficulties, time and resource restrictions, limitations in clinician experience, and/or inherent uncertainty in determining the best treatment strategy (e.g., due to conflicting clinical trial evidence) [Nemati et al., 2016a, Shah et al., 2005]. Often, these practice barriers give rise to behavior policy distributions that are heavily imbalanced, with the frequency of suboptimal actions even outweighing the frequency of optimal actions. In this context, offline RL methods that severely penalize out-of-distribution actions may result in overly conservative policies.
To address this challenge, we leverage sampling methodologies to adapt offline RL methods for scenarios where suboptimal examples dominate the retrospective dataset. Specifically, our approach samples historical records of transitions (i.e., state, action, and reward tuples) corresponding to each action without altering the transition probabilities, in order to increase the proportion of less frequently seen actions in the training data. We performed extensive experiments to compare the performance of a popular off-policy RL baseline (DDQN) and a SOTA offline RL method (Conservative Q-Learning) with and without our sampling approach on two real-world tasks for type 2 diabetes and sepsis treatment optimization. We assessed expected health outcomes via principled off-policy evaluations and characterized consistency with relevant practice and safety guidelines for the different methods.
Our main contributions are summarized as follows:
• We demonstrate that CQL, a SOTA offline RL method, can be applied in real-world treatment optimization applications to make recommendations that are more aligned with clinical practice than DDQN, a popular off-policy RL method, while also improving expected health outcomes over DDQN and the standard of care (SoC).
• We argue theoretically and demonstrate empirically that when an intuitive heuristic to strictly enforce safety constraints during policy execution is applied to CQL's and DDQN's recommendations, the relative improvement of CQL over DDQN extends to constrained recommendations.
• We propose a transition sampling approach to address action imbalance, and show that this increases the likelihood of CQL selecting less frequently seen actions, while continuing to penalize value estimates for out-of-distribution actions.
• Extensive experimental results for two real-world healthcare applications demonstrate that CQL with sampling substantially improves expected health outcomes over the SoC and CQL baselines, while ensuring high alignment with clinical practice.
Our results suggest that offline RL, as opposed to off-policy RL, should be leveraged as a means of devising safe and clinically effective policies for treatment optimization problems.

Related Work
Our work focuses on deep offline RL methods for treatment optimization. We categorize existing relevant research into three threads: (a) treatment optimization using RL; (b) offline RL methods and their applications; and (c) practical challenges associated with RL-based treatment optimization.

Treatment optimization using deep RL:
The overall literature on RL for treatment optimization is comprehensively surveyed in Yu et al. [2021a]. Here, we review recent studies on deep reinforcement learning for treatment optimization applications. This literature has largely focused on optimizing management of complex syndromes in intensive inpatient settings [Raghu et al., 2017a,b, Roggeveen et al., 2021], optimizing medication dosing in anesthesia and critical care [Schamberg et al., 2020, Yu et al., 2020b], and optimizing treatment choices for chronic diseases in outpatient settings [Tseng et al., 2017, Sun et al., 2021, Zheng et al., 2021]. Commonly, these works apply value-based off-policy deep RL algorithms such as DQN, DDQN, and variants (e.g., with dueling architecture, recurrent networks) on retrospective or batch datasets, an approach which is known to suffer from distribution shift between the learned and behavior policies and to overfit to unseen actions [Raghu et al., 2017a,b, Peng et al., 2018, Yu et al., 2019, Lu et al., 2020, Zheng et al., 2021, Sun et al., 2021].
Offline RL: The need to learn optimal policies in practical data-driven decision making scenarios has led to the development of offline RL algorithms. These algorithms are set up to learn effectively from retrospective data collected under some behavior policy, without any direct exploration or environmental interaction during training. We review recent works on deep offline RL. First, implicit constraint Q-Learning [Yang et al., 2021] leverages imitation learning [Wang et al., 2018, Chen et al., 2020] to address the overfitting problem by avoiding querying OOD samples. Second, value-based and policy-based offline RL algorithms (e.g., CQL [Kumar et al., 2020], MOPO [Yu et al., 2020a], UWAC [Wu et al., 2021], Fisher-BRC [Kostrikov et al., 2021], COMBO [Yu et al., 2021b]) prevent over-optimism by penalizing the learned value function for OOD actions with regularization during training. A third set of offline RL algorithms (e.g., BCQ [Fujimoto et al., 2019], BEAR [Kumar et al., 2019b], BRAC [Wu et al., 2019]) uses regularization to penalize deviations between the learned and behavior policies. A few recent studies have explored offline RL algorithms for treatment optimization [Fatemi et al., 2022, 2021, Killian et al., 2020]. Among these, Fatemi et al. [2022] proposed a modification to BCQ [Fujimoto et al., 2019] for a continuous-time semi-MDP setting. Further, Fatemi et al. [2021] proposed a method to identify states from which negative outcomes are unavoidable. Finally, Killian et al. [2020] used state representation learning with discretized BCQ. However, translation of advances in offline RL to real-world treatment optimization applications is still nascent, and several practical challenges remain.
Practical challenges in RL-based treatment optimization: A key challenge in real-world clinical applications stems from action imbalance due to dominance of suboptimal, and often conservative, actions in the data. The offline RL community is starting to recognize that overly conservative policies impede generalization and performance, and attempts to reduce conservativeness of CQL and variants are emerging. For example, the very recently proposed MCQ [Lyu et al., 2022] adapts CQL to actively train OOD actions. However, such strategies are not designed to directly address action imbalance in the retrospective data and cannot be generalized across offline RL methods. Inspired by sampling-based approaches to handle class imbalance in supervised learning [Kubat and Matwin, 1997, Ling and Li, 1998, Schistad Solberg and Solberg, 1996, Chawla et al., 2002, He et al., 2008], we propose a transition sampling scheme to address action imbalance and demonstrate its ability to improve quality and relevance of the resulting treatment policies.

Background
In this section, we represent the treatment optimization problem within an RL framework and introduce baseline approaches such as Q-learning and DDQN.

Problem Formulation
We consider a setting where the patient state evolves according to an underlying Markov Decision Process (MDP). This MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ denotes the set of all possible states, $\mathcal{A}$ denotes the set of discrete permissible actions, $P : \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathcal{S})$ represents the transition function providing the next-state distribution after executing action $a \in \mathcal{A}$ in state $s \in \mathcal{S}$, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ denotes a reward function providing the expected immediate reward for executing action $a$ in state $s$, and $\gamma \in [0, 1]$ is a discount factor. Let $T_i$ denote the treatment horizon length for patient $i$. At time $t$, for patient $i$, a clinician observes patient state $s_{i,t} \in \mathcal{S}$ and recommends a treatment, or action, $a_{i,t}$ from a finite and discrete action set $\mathcal{A} = \{1, 2, \dots, A\}$. The reward $r(s, a)$ is increasing in positive health outcomes (e.g., lab results within target range) and decreasing in negative health outcomes (e.g., mortality). The goal is to identify a treatment policy $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ that chooses the action at each time step $t$ that maximizes the expected cumulative reward (Equation 1). For some medical conditions, e.g., type 2 diabetes, feasible treatments are constrained by clinical practice and safety guidelines. This setting can be modeled as a constrained optimization problem (Equation 2), in which $C_j(s_{i,t'}, \pi(s_{i,t'}))$ denotes the type $j$ constraint violation cost at time $t'$ for taking action $\pi(s_{i,t'})$ at state $s_{i,t'}$, and $c_j$ is the threshold on the cumulative constraint violation cost over the entire horizon for type $j$ constraints. This threshold can be set to 0 for treatment recommendation applications, to represent strict safety guidelines.
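As a reading aid, the two problems described above can be written in a standard form that is consistent with the surrounding definitions. This is a sketch only: the summation limits and the discount exponent $\gamma^{t'-t}$ are assumptions, and the exact statements of Equations 1 and 2 may differ.

```latex
% Unconstrained problem: maximize the expected discounted cumulative reward.
\max_{\pi} \; \mathbb{E}\left[ \sum_{t'=t}^{T_i} \gamma^{\,t'-t}\, r\big(s_{i,t'}, \pi(s_{i,t'})\big) \right]

% Constrained problem: same objective, with a cap c_j on each cumulative violation cost.
\max_{\pi} \; \mathbb{E}\left[ \sum_{t'=t}^{T_i} \gamma^{\,t'-t}\, r\big(s_{i,t'}, \pi(s_{i,t'})\big) \right]
\quad \text{s.t.} \quad
\sum_{t'=t}^{T_i} C_j\big(s_{i,t'}, \pi(s_{i,t'})\big) \le c_j \quad \text{for all } j
```

Under the additional assumption that the costs $C_j$ are non-negative, setting $c_j = 0$ forces every recommended action along the trajectory to incur zero violation cost, which matches the strict-safety reading described above.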
We are specifically interested in the offline reinforcement learning setting, where the policy that solves Equation 1 must be learned solely from some retrospective dataset $\mathcal{D} = \{(s_{i,t}, a_{i,t}, r_{i,t}, s_{i,t+1}),\ i = 1, \dots, I;\ t = 1, \dots, T_i\}$ generated by a behavior policy $\pi_\beta(s)$. This behavior policy corresponds to the SoC, and need not be optimal or strictly satisfy constraints due to real-world clinical practice challenges. As we are interested in comparing the performance of offline RL methods with off-policy RL methods, we begin by introducing the off-policy RL method, DDQN, which leverages the canonical RL algorithm, Q-learning.

Q-learning
Our goal is to learn a (potentially randomized) policy $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ that specifies the action to take in each state $s \in \mathcal{S}$. The value $V^{\pi}(s)$ at state $s$ with respect to policy $\pi$ is the expected cumulative reward for executing $\pi$ (Equation 3). The goal is to learn an optimal policy $\pi^*$ that maximizes $V^{\pi}(s)$ for all $s$. In a finite MDP setting, tabular Q-learning can asymptotically learn $\pi^*$ by learning the optimal Q function $Q^{\pi^*}$ [Watkins, 1989], which satisfies the Bellman equation (Equation 4). Once $Q^{\pi^*}$ is known, the optimal action $a^*$ for a given state $s$ is computed by $a^* = \pi^*(s) = \arg\max_a Q^{\pi^*}(s, a)$. $Q^{\pi^*}$ can be estimated by iteratively applying the Bellman operator (Equation 5) to obtain a sequence of estimates $Q^k$. Further, each $Q^k$ can be represented using a neural network $N(s, a; \theta_k)$ with input $s$, output of dimension $A$, and parameters $\theta_k$, which are iteratively updated during training.
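The quantities in this subsection have well-known textbook forms, sketched below for reference. The infinite-horizon notation and the squared-error fitting step are assumptions made here for concreteness, and are not necessarily the exact forms of Equations 3 to 5.

```latex
% State value under policy \pi
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s \right]

% Bellman optimality equation for the optimal Q function
Q^{\pi^*}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ \max_{a'} Q^{\pi^*}(s', a') \right]

% Iterative (fitted) application of the Bellman operator on the dataset
Q^{k+1} \leftarrow \arg\min_{Q}\; \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[ \Big( r(s, a) + \gamma \max_{a'} Q^{k}(s', a') - Q(s, a) \Big)^{2} \right]
```

The inner $\max_{a'}$ in the last line is the term that DDQN and CQL each modify, in different ways, to limit overestimation.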

DDQN
DDQN approximates Q values using two neural networks: a master network with parameters $\theta_k$ and a target network with parameters $\theta'_k$. At every iteration $k$, the parameters of the master network are updated using Equation 5, where the inner maximization is replaced with a function approximator (Equation 6) in which the master network selects the next action and the target network evaluates it. The target network is then updated by setting $\theta'_k = \theta_k$ every $\tau$ iterations for some hyperparameter $\tau$. By using different networks to select and evaluate actions, DDQN attempts to address the issue of Q value overestimation that may arise from taking the maximum of the estimated Q values in Equation 5 [van Hasselt et al., 2016].
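The difference between the standard target and the double Q-learning target of van Hasselt et al. [2016] can be illustrated for a single transition. This is a minimal sketch with hypothetical function names and made-up Q estimates; it is not taken from any particular library.

```python
from typing import Sequence


def dqn_target(reward: float, q_next_target: Sequence[float], gamma: float) -> float:
    """Standard target: the target network both selects and evaluates the next action."""
    return reward + gamma * max(q_next_target)


def ddqn_target(reward: float,
                q_next_master: Sequence[float],
                q_next_target: Sequence[float],
                gamma: float) -> float:
    """Double DQN target: the master network selects, the target network evaluates."""
    greedy = max(range(len(q_next_master)), key=lambda a: q_next_master[a])
    return reward + gamma * q_next_target[greedy]


# Hypothetical Q estimates for the next state s' over three actions.
q_master = [0.2, 0.9, 0.1]
q_target = [5.0, 0.5, 3.0]
print(dqn_target(1.0, q_target, 0.5))             # 3.5
print(ddqn_target(1.0, q_master, q_target, 0.5))  # 1.25
```

In this example the target network assigns a large value to action 0, but the master network prefers action 1, so the double Q-learning target is smaller than the standard one.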

Methods
In this work, we employ offline RL methods to improve the quality of treatment recommendations. We first describe a SOTA offline RL algorithm, CQL, that is used to generate a recommendation policy. Then, we explain our methodology to address the challenge of training CQL on retrospective datasets with action imbalance. Finally, we describe an intuitive heuristic to enforce strict constraint satisfaction, and discuss how this is expected to impact the performance of CQL, or more generally of offline RL methods.

Conservative Q-Learning (CQL)
A critical issue with directly applying Equation 5 in the offline RL setting is that the Q values of OOD actions may be overestimated. To mitigate this, CQL combines the standard Q-learning objective of solving the Bellman equation with an additional regularization term that minimizes Q values for OOD actions (i.e., large Q values for OOD actions are penalized) [Kumar et al., 2020]. This gives rise to the objective in Equation 7. Here, $\mu$ is some distribution over the actions conditional on state, and $R(\mu)$ is a regularization term. Alternatives for $R(\mu)$ proposed in Kumar et al. [2020] include: (i) entropy, (ii) the negative of the KL divergence between $\mu$ and some given distribution, e.g., the learned policy at the previous iteration during model training, and (iii) a variant inspired by distributionally robust optimization, which penalizes the variance in the Q-function across actions and states.
Finally, $\alpha > 0$ is a hyperparameter controlling the trade-off between the two objectives. While $\alpha$ can be tuned to learn less conservative policies, this alone may be insufficient to address the impact of severe action imbalance in the retrospective dataset $\mathcal{D}$. Hence, we propose a sampling method to reduce action imbalance.
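To make the role of $\alpha$ concrete, the sketch below computes a per-transition loss for discrete actions, assuming the entropy-regularized variant of CQL, for which the maximization over $\mu$ reduces to a log-sum-exp over actions [Kumar et al., 2020]. The function names and the plain squared Bellman error are illustrative assumptions, not the exact training loss of any library.

```python
import math
from typing import Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_loss(q_values: Sequence[float], action: int,
             td_target: float, alpha: float) -> float:
    """Per-transition loss: alpha * conservative penalty + squared Bellman error.

    q_values  : current Q(s, a') for every action a' in state s
    action    : the action actually taken in the dataset
    td_target : bootstrapped target, e.g. r + gamma * max_a' Q_target(s', a')
    """
    # Pushes down Q values of all actions and pushes up the observed action.
    conservative = logsumexp(q_values) - q_values[action]  # always >= 0
    bellman = 0.5 * (q_values[action] - td_target) ** 2
    return alpha * conservative + bellman


# A large Q value on an action not taken in the data increases the penalty.
print(cql_loss([0.0, 0.0], action=0, td_target=0.0, alpha=1.0))
print(cql_loss([0.0, 5.0], action=0, td_target=0.0, alpha=1.0))
```

Setting `alpha` to 0 recovers the plain Bellman error, while larger values shrink Q estimates for actions that the dataset does not support in that state.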

CQL with sampling
CQL is designed to ensure that Q value estimates for OOD actions are conservative. In settings like treatment optimization where the behavior policy exhibits large action imbalance, this may result in CQL predominantly recommending actions that are frequently recommended by the behavior policy, even when this is suboptimal. We thus propose to apply sampling approaches, which have been successfully used in many real-world classification problems to address class imbalance [Kubat and Matwin, 1997, Schistad Solberg and Solberg, 1996, Ling and Li, 1998, He et al., 2008, Chawla et al., 2002], to similarly address the problem of action imbalance in offline RL.
Consider the following sampling procedure: Given action $a$, $1 \le a \le A$, suppose we sample with replacement a dataset $\tilde{\mathcal{D}}$ from the set of historical transitions $\{(s_t, a, s_{t+1})\}$. Denote the ratio of the proportion of this set after sampling to its proportion before sampling as $w_a$ (the case $w_a > 1$ corresponds to an action $a$ that is used less frequently in the retrospective dataset and is upweighted by sampling). Then $\tilde{\mathcal{D}}$ has the distribution given in Equation 8. Since $\tilde{\mathcal{D}}$ is created by sampling with replacement from the set of historical transitions with action $a$, Equation 9 also holds for each $a$, i.e., the transition probabilities are the same for $\mathcal{D}$ and $\tilde{\mathcal{D}}$. In the following, we will denote $\Pr_{\mathcal{D}}[a \mid s_t]$ as $\pi_\beta$. To see how this approach affects the CQL recommendations, we write the CQL objective (Equation 7) under the distribution $\tilde{\mathcal{D}}$ (Equation 10). Here, the expectation is taken over the state $s_t$ under $\tilde{\mathcal{D}}$. Note that $s_t \sim \tilde{\mathcal{D}}$ is distinct from $s_t \sim \mathcal{D}$, as the frequency of transitions with state $s_t$ is the sum of the frequencies of transitions with state $s_t$ and action $a$ across all $a$; this differs between $\mathcal{D}$ and $\tilde{\mathcal{D}}$ due to Equation 8. The right hand side of Equation 10 then follows from our assumptions about the sampling process, where the first summand in the expectation follows from Equation 8 and the second summand is derived from Equation 9. Thus when $w_a > 1$, the CQL overestimation term (first summand) decreases with sampling compared to without sampling, while the Bellman loss term (second summand) receives greater weight after sampling is applied, i.e., CQL becomes less conservative with respect to action $a$. Conversely, when $w_a < 1$, the CQL overestimation term (first summand) increases with sampling compared to without sampling, while the Bellman loss term (second summand) receives smaller weight after sampling is applied, i.e., CQL becomes more conservative with respect to action $a$.
Motivated by this analysis, we propose combining CQL with sampling as follows: Instead of adopting $A$ hyperparameters $w_a$, which could be computationally costly to tune, we introduce a single hyperparameter $K$. By tuning $K$ through grid-search, we can in turn tune all $w_a$ to reduce action imbalance. Denoting the mean number of transitions per action in $\mathcal{D}$ as $\sigma$, we have: 1) Undersampling: for each action with more than $K\sigma$ transitions, sample $K\sigma$ transitions with replacement; 2) Oversampling: for each action with fewer than $K\sigma$ transitions, sample $K\sigma$ transitions with replacement; and 3) Under+oversampling: for each action, sample $\sigma$ transitions with replacement. We then trained CQL on the sampled datasets.
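The three sampling schemes above can be sketched in a few lines. The tuple layout `(state, action, reward, next_state)`, the function names, and the choice to compute $\sigma$ over the actions that actually appear in the data are assumptions made for illustration; stratification and other implementation details are not reproduced here.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Transition = Tuple[object, int, float, object]  # (state, action, reward, next_state)


def sample_transitions(transitions: List[Transition], mode: str,
                       k: float = 1.0, seed: int = 0) -> List[Transition]:
    """Resample whole transitions per action; mode is 'under', 'over' or 'under+over'."""
    if mode not in {"under", "over", "under+over"}:
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)
    by_action: Dict[int, List[Transition]] = defaultdict(list)
    for tr in transitions:
        by_action[tr[1]].append(tr)
    sigma = len(transitions) / len(by_action)  # mean transitions per observed action
    target = max(1, round(sigma if mode == "under+over" else k * sigma))
    sampled: List[Transition] = []
    for group in by_action.values():
        n = len(group)
        resample = (mode == "under+over"
                    or (mode == "under" and n > target)
                    or (mode == "over" and n < target))
        # Sampling with replacement within an action keeps (s, a) -> s' pairs intact.
        sampled.extend(rng.choices(group, k=target) if resample else group)
    return sampled


def sampling_weights(before: List[Transition], after: List[Transition]) -> Dict[int, float]:
    """w_a: proportion of action a after sampling divided by its proportion before."""
    cb = Counter(tr[1] for tr in before)
    ca = Counter(tr[1] for tr in after)
    return {a: (ca[a] / len(after)) / (cb[a] / len(before)) for a in cb}


# Toy dataset: action 0 dominates (8 transitions), action 1 is rare (2 transitions).
data = [(i, 0, 0.0, i + 1) for i in range(8)] + [(i, 1, 1.0, i + 1) for i in range(2)]
balanced = sample_transitions(data, "under+over")
print(sampling_weights(data, balanced))
```

In the toy dataset, under+oversampling gives the rare action a weight above 1 and the dominant action a weight below 1, which is the direction of change analyzed in the preceding paragraph.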

Constraint satisfaction
The constraints from the diabetes setting take the form given in Equation 11. For constrained optimization problems (see Equation 2), direct application of CQL and DDQN does not guarantee constraint satisfaction during policy execution. For this special case, the learned policies of any value-based RL agent can be easily adapted to ensure strict constraint satisfaction. Given state $s_t$, denote the feasible set of actions as $\mathcal{A}_F(s_t)$.
For a given value-based RL agent $\pi_{RL}$, define the corresponding constrained policy as $\pi_{RL,c}(s_t) = \arg\max_{a \in \mathcal{A}_F(s_t)} Q^{\pi_{RL}}(s_t, a)$ (Equation 12). Here, $Q^{\pi_{RL}}(s_t, a)$ denotes the RL agent's estimated Q value. $\pi_{RL,c}(s_t)$ thus recommends the feasible action with the highest predicted Q value. It follows that the constrained and unconstrained recommended actions are the same whenever the latter is feasible.
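The heuristic amounts to restricting the argmax over Q values to the feasible set. A minimal sketch follows, with a hypothetical function name and made-up Q values; ties are broken in favor of the lowest action index, which is an arbitrary choice made here.

```python
from typing import Iterable, Sequence


def constrained_action(q_values: Sequence[float], feasible: Iterable[int]) -> int:
    """Return the feasible action with the highest estimated Q value."""
    candidates = sorted(set(feasible))
    if not candidates:
        raise ValueError("no feasible action in this state")
    return max(candidates, key=lambda a: q_values[a])


q = [0.1, 0.9, 0.5]                      # hypothetical Q(s_t, a) for three actions
print(constrained_action(q, [0, 1, 2]))  # 1: unconstrained argmax is feasible
print(constrained_action(q, [0, 2]))     # 2: best action among the feasible ones
```

When the unconstrained argmax is feasible, both calls agree, which is the property used in the optimality gap argument.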
Intuitively, this means that if the RL agent's rate of constraint satisfaction is high, the optimality gap for the constrained recommendations should be close to the optimality gap for the unconstrained recommendations. This is precisely what we expect for offline RL algorithms in treatment optimization settings. Since the SoC's rate of constraint satisfaction should be high (the SoC reflects domain experts' attention to safety considerations), and the offline RL agent's recommendations should not deviate too much from the SoC recommendations, the latter's constraint satisfaction rate should be high as well. Any performance guarantees for the offline RL agents' unconstrained recommendations should therefore continue to hold for constrained recommendations. This observation is supported by Property 1 below (proof in Appendix A.1).
Property 1. Let $\pi^*$ be an unconstrained policy solving Equation 1, and let $\pi^*_c$ be a constrained policy solving Equation 11. Assuming that the reward $r(s, a)$ is bounded from below and above, the optimality gap for $\pi_{RL,c}(s_t)$ can be bounded in terms of the optimality gap for $\pi_{RL}(s_t)$.

Experimental Design

Treatment optimization applications
We conducted experiments comparing the recommendations of the different RL agents for two clinical applications: 1) type 2 diabetes, and 2) sepsis. We detail the tasks and datasets for these below.

Diabetes treatment recommendation
Task. We considered the problem of recommending antidiabetic treatment regimens to type 2 diabetes patients in an outpatient setting. At each visit, the doctor prescribes a treatment regimen from among 13 options: 1) maintain, 2) increase, or 3) decrease the dosages of previously prescribed drugs, or start a new drug from among the subgroups 4) acarbose, 5) DPP-4 inhibitors, 6) biguanides, 7) SGLT2 inhibitors, 8) sulphonylureas, 9) thiazolidinediones, 10) GLP-1 RAs, 11) long or intermediate acting insulins, 12) premixed insulins, and 13) rapid acting insulins. The goal is to treat the patient's glycated haemoglobin (HbA1c) down to a target of 7%, minimize the incidence of severe hypoglycemia, and reduce the incidence of complications such as heart failure [Association, 2022]. This results in an expression for the reward at patient $i$'s visit at time $t$ (Equation 14) in terms of $s^{HbA1c}_{i,t}$, the HbA1c; $s^{Hypo}_{i,t}$, which indicates the occurrence of hypoglycemia; and $s^{Compl}_{i,t}$, which indicates the occurrence of complications or death at time $t$. Similar reward functions are used in Sun et al. [2021] and Zheng et al. [2021]. For safety reasons, the treatment recommendation must also adhere strictly to clinical guidelines. These are expressed as constraints on the recommended action, including conditions that apply when the estimated glomerular filtration rate (eGFR) is below 30 ml/min/1.73 m² (Equation 15) or below 45 ml/min/1.73 m² (Equation 16), and conditions involving $s^{pancr}_{i,t}$, the incidence of pancreatitis, and $s^{age}_{i,t}$, the patient's age. Here, $\mathcal{A}_D = \{1, 2, \dots, 13\}$ is the set of actions, and the feasible set $\mathcal{A}_F(s_{i,t})$ is the subset of $\mathcal{A}_D$ that satisfies these constraints in state $s_{i,t}$.
Data. We studied this problem using electronic medical records from outpatient prescription visits for type 2 diabetes patients within the Singapore Diabetes Registry [Lim et al., 2021]. Our study was approved by the relevant Institutional Review Board with a waiver of informed consent. For each patient visit at time $t$, the state was defined by 55 variables describing the patient medical profile, including demographics (age, gender, ethnicity), physical measurements (heart rate, blood pressure, BMI), blood and urine laboratory data (HbA1c, fasting glucose, lipid panel, full blood counts, creatinine, estimated glomerular filtration rate, urine albumin/creatinine ratio), medical history (diabetes duration, utilization details, comorbidities and complications), and the previous visit prescription. We included prescription visits based on two inclusion criteria: (a) the visit had at least one preceding prescription, and (b) the visit had at least 1 HbA1c measurement within the past month and at least 1 eGFR reading within the past year. This yielded 1,302,461 visits for 71,863 patients.
Among these visits, we observe significant action imbalance. The most and least common actions account for 64.0% and 0.01% of visits respectively. The most common action is "No change" and is prescribed in 51.2% of visits where the patient's HbA1c is above the target of 7%. This could be indicative of clinical inertia, where treatments are not intensified appropriately due to lack of time during the consultation, or lack of expertise in primary care settings [Shah et al., 2005]. The action imbalance observed in this dataset, along with an over-representation of suboptimal actions, suggests that it is a good candidate for the methods proposed in Section 4.2.

Sepsis treatment recommendation
Task. We also considered the problem of treating ICU patients with sepsis. At each 4 hour window $t$ during patient $i$'s ICU stay, the clinician administers fluids and/or vasopressors, each discretized into 5 volumetric categories. The treatment is described by the tuple $(u, v)$, $1 \le u, v \le 5$, with 25 treatment options in total. As the goal is to prevent patient mortality, we model the treatment optimization problem with a reward function defined in terms of $s^{Mort}_{i,t}$, the incidence of mortality within a 48 hour window of time $t$. Similar reward functions have been used in Fatemi et al. [2021] and Komorowski et al. [2018].
Data. We studied this problem using data on a cohort of sepsis patients generated from the publicly available MIMIC (Medical Information Mart for Intensive Care)-III dataset [Johnson et al., 2016] following Fatemi et al. [2021]. For each ICU stay and 4 hour window $t$, the state was defined by 44 variables describing the patient medical profile, including demographics (age, gender), physical measurements (heart rate, blood pressure, weight), and blood and urine laboratory data (glucose, creatinine). There were a total of 18,923 unique ICU stays. We observed large action imbalance, with the most and least common actions accounting for 27.1% and 1.9% of visits respectively. The most common action was (1, 1), corresponding to the lowest possible dose ranges for IV fluids and vasopressors. Applying existing offline RL methods to learn from this dataset may thus result in recommendations for insufficiently intensive treatments.

Evaluations
In the absence of data on counterfactuals, we considered evaluation techniques that use retrospective data.

Weighted Importance Sampling (WIS)
We applied WIS, an off-policy evaluation technique widely used in the treatment optimization literature [Komorowski et al., 2018, Raghu et al., 2017a, Roggeveen et al., 2021, Raghu et al., 2017b, Peng et al., 2018], to estimate the value of each RL agent's policy. We define the WIS score for an RL agent as follows. First, we define the importance ratios $\rho_{i,t}$ as the ratio of the likelihood of the RL agent policy $\pi_{RL}$ and the likelihood of the SoC policy $\pi_{Clin}$ selecting the SoC action $a_{i,t}$ given state $s_{i,t}$. We then define trajectory-wise WIS estimators $V^{WIS}_i$ for trajectory $i$, $i = 1, \dots, N$, in terms of these importance ratios, and average $V^{WIS}_i$ across trajectories. This gives a biased but consistent estimator of the expected cumulative reward under the RL agent's policy [Hesterberg, 1995].
Following Komorowski et al. [2018], we estimated $\pi_{Clin}$ by training a multinomial logistic regression model with the one-hot-encoded selected action as the output and the state as the features, and then taking the predicted probabilities for the different actions. We estimated $\pi_{RL}$ by "softening" the RL agent policy, i.e., approximating it with a random policy $\pi^{WIS}_{RL}$ that selects from the non-optimal actions uniformly at random with some small probability. Specifically, the softened policy is parameterized by a softening probability $\epsilon$ ($0 < \epsilon \ll 1$). To ensure a fair comparison between the SoC and the RL agents, we also applied softening to the SoC policy when calculating its WIS estimate.
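The evaluation pipeline described above can be sketched as follows. This uses one common trajectory-wise form of WIS, in which each trajectory's discounted return is weighted by its cumulative importance ratio and the weights are normalized across trajectories; the exact normalization used in the experiments may differ. The softening rule (probability $1 - \epsilon$ on the recommended action, $\epsilon$ spread uniformly over the others) and all function names are likewise assumptions.

```python
from typing import List, Sequence, Tuple


def softened_prob(action: int, greedy_action: int, n_actions: int, eps: float) -> float:
    """Probability of `action` under a softened version of a deterministic policy."""
    if action == greedy_action:
        return 1.0 - eps
    return eps / (n_actions - 1)


def wis_estimate(trajectories: Sequence[List[Tuple[float, float]]], gamma: float) -> float:
    """Weighted importance sampling over trajectories.

    Each trajectory is a list of (rho_t, reward_t) pairs, where rho_t is the
    ratio pi_RL(a_t | s_t) / pi_Clin(a_t | s_t) for the observed SoC action.
    """
    weights, returns = [], []
    for traj in trajectories:
        w, g = 1.0, 0.0
        for t, (rho, reward) in enumerate(traj):
            w *= rho                   # cumulative importance ratio
            g += (gamma ** t) * reward  # discounted return of the trajectory
        weights.append(w)
        returns.append(g)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all importance weights are zero")
    return sum(w * g for w, g in zip(weights, returns)) / total


# Two hypothetical trajectories with made-up ratios and rewards.
toy = [[(2.0, 1.0)], [(0.5, 0.0), (1.0, 4.0)]]
print(wis_estimate(toy, gamma=0.5))
```

Softening matters here because a deterministic policy assigns zero probability to every action it does not recommend, which would zero out the weight of any trajectory that disagrees with it even once.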

Additional metrics.
We defined metrics of how well the RL agents' recommendations were aligned with clinical practice.
Model Concordance Rate. This is the fraction of visits where the RL policy's recommendation matches the SoC [Sun et al., 2021, Lin et al., 2018, Nemati et al., 2016b].
Appropriate Intensification Rate. For the type 2 diabetes application, this is the fraction, out of visits with HbA1c over 7.0%, where the RL agent recommends treatment intensification (i.e., increase dose or add new drug).
Constraint Satisfaction Rate (CSR). For the diabetes application, we defined the CSR for each constraint $j$, $j = 1$ to $4$, as the fraction, out of all visits where constraint $j$ applies, where the RL agent's recommendation satisfies constraint $j$. As an illustrative example, for constraint 1, the CSR is the fraction, out of visits with eGFR under 30 ml/min/1.73 m², where the recommendation is not "add metformin."
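Each of these metrics is a conditional fraction over visits, as the sketch below illustrates. The function names and toy inputs are hypothetical. The action codes follow the 13-option list in the diabetes task description, with intensification taken to be action 2 or any of actions 4 to 13, and with constraint 1 forbidding action 6 (biguanides, the class that includes metformin); this mapping is an interpretation made for illustration.

```python
from typing import Sequence, Set


def _rate(hits: int, total: int) -> float:
    if total == 0:
        raise ValueError("no eligible visits for this metric")
    return hits / total


def concordance_rate(rl_actions: Sequence[int], soc_actions: Sequence[int]) -> float:
    """Fraction of visits where the RL recommendation equals the SoC action."""
    pairs = list(zip(rl_actions, soc_actions))
    return _rate(sum(rl == soc for rl, soc in pairs), len(pairs))


def intensification_rate(rl_actions: Sequence[int], hba1c: Sequence[float],
                         intensify_actions: Set[int], target: float = 7.0) -> float:
    """Among visits with HbA1c above target, fraction recommending intensification."""
    eligible = [a for a, h in zip(rl_actions, hba1c) if h > target]
    return _rate(sum(a in intensify_actions for a in eligible), len(eligible))


def constraint_satisfaction_rate(rl_actions: Sequence[int], applies: Sequence[bool],
                                 forbidden_actions: Set[int]) -> float:
    """Among visits where the constraint applies, fraction avoiding forbidden actions."""
    eligible = [a for a, flag in zip(rl_actions, applies) if flag]
    return _rate(sum(a not in forbidden_actions for a in eligible), len(eligible))


# Toy example with four visits.
rl = [1, 2, 6, 7]
soc = [1, 1, 6, 2]
hba1c = [6.5, 8.0, 7.5, 9.0]
egfr_below_30 = [False, False, True, True]
intensify = {2} | set(range(4, 14))

print(concordance_rate(rl, soc))                               # 0.5
print(intensification_rate(rl, hba1c, intensify))              # 1.0
print(constraint_satisfaction_rate(rl, egfr_below_30, {6}))    # 0.5
```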

Implementation details
We set three seeds, and for each seed, randomly split the patients in each dataset into training, validation, and test sets with the ratio 60:20:20. We then used the d3rlpy library in Python to train CQL and DDQN. For CQL, d3rlpy sets the regularization term $R(\mu)$ in Equation 7 as entropy, i.e., $R(\mu) = \mathbb{E}[-\log(\mu)]$. This variant was also shown through ablation studies in Kumar et al. [2020] to generally outperform the other proposed variants.
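Splitting at the patient level, rather than the visit level, keeps all transitions of a patient in the same partition. A minimal sketch of such a seeded 60:20:20 split is given below; the function name and the rounding of partition sizes are assumptions.

```python
import random
from typing import Iterable, List, Sequence, Tuple


def split_patients(patient_ids: Iterable[int], seed: int,
                   ratios: Sequence[float] = (0.6, 0.2, 0.2)
                   ) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle unique patient IDs with a fixed seed and cut into train/val/test."""
    ids = sorted(set(patient_ids))       # deduplicate: one entry per patient
    random.Random(seed).shuffle(ids)
    n_train = round(ratios[0] * len(ids))
    n_val = round(ratios[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Visit-level records would then be assigned according to their patient's partition.
for seed in (0, 1, 2):
    train, val, test = split_patients(range(10), seed)
    print(seed, len(train), len(val), len(test))
```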
For each seed, we trained (i) CQL with α ∈ [0.1, 0.5, 0.8, 0.9, 1.0]; (ii) CQL with undersampling and K ∈ [0.4, 0.8, 1.2], CQL with oversampling and K ∈ [0.4, 0.8], and CQL with under+oversampling; and (iii) DDQN. Transition sampling was applied to the training set but not the validation and test sets. These settings resulted in the sampling weights $w_a$, $1 \le a \le A$, given in Table 1 (recall from Section 4.2 that $w_a$ denotes the frequency of transitions with action $a$ after sampling compared to before sampling). For both the diabetes and sepsis applications, under+oversampling gives rise to the greatest decrease in frequency (smallest $w_a$) and the greatest increase in frequency (largest $w_a$).
Under+oversampling thus seems to be the most aggressive sampling approach for these settings, and undersampling the least aggressive approach. Finally, we applied stratified random sampling to increase sample representativeness and to ensure that the empirical distributions of the sampled datasets are close to $\tilde{\mathcal{D}}$ [Teddlie and Yu, 2007]. Details are in Appendix B.2.
For each RL agent, we used multilayer perceptron architectures for the master and target Q networks and considered 2 different configurations: (i) 2 linear layers with 256 hidden units, and (ii) 3 linear layers with 512 hidden units. We set the batch size to 64, the learning rate to 6.25e−5, and the target update interval to 8000 steps. We used grid-search across all hyperparameter combinations to select the model with the highest WIS score on the validation sets, and evaluated this model. To compute WIS scores, we trained multinomial logistic regression models to approximate the clinician policy (details in Appendix B.1), and applied a softening factor of 0.99. We also calculated WIS scores for the SoC using Equation 22.

Results and Discussion
Results for the diabetes application are in Fig. 1, Table 2, and Table 3, while results for the sepsis application are in Fig. 2 and Table 4. We organize our findings into three subtopics: (i) comparison of the offline RL method CQL with the off-policy RL method DDQN; (ii) effect of sampling and regularization hyperparameter tuning on the performance of CQL; and (iii) performance of the RL methods after the constraint satisfaction heuristic (Equation 12) is applied.

Comparing CQL and DDQN
Comparing the unconstrained recommendations of CQL with α = 1.0 and DDQN for the diabetes treatment application, we first see from Fig. 1 that the distribution across the treatment options is much closer to the SoC under CQL than under DDQN. In particular, CQL is far more likely than DDQN to recommend no change, increasing dose, or decreasing dose, and far less likely than DDQN to recommend adding new medications. The model concordance rates in Table 2 support this picture. Across 3 seeds, CQL exhibits a mean model concordance rate of 62.5% with the SoC's recommendations, compared to only 1.5% under DDQN. As a result of the greater concordance of the SoC with CQL than with DDQN, we see from Table 3 that across the four constraints, the constraint satisfaction rates are higher under CQL (between 98.7% and 99.7%) than under DDQN (between 88.1% and 94.3%), and are closer to the constraint satisfaction rates under the SoC (between 98.1% and 100.0%). Thus CQL generates recommendations that are far better aligned with clinical practice than those of DDQN. In terms of impact on health outcomes, Table 2 shows that the WIS score averaged across 3 seeds is higher under CQL (3.653) than under DDQN (-3.055). Both CQL and DDQN achieve greater WIS scores than the SoC (-6.741). We can then conclude from Equation 14 that, unlike the DDQN and SoC recommendations, the CQL recommendations are not expected to lead to complications and/or mortality on average.
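For concreteness, a model concordance rate of the kind reported in Table 2 can be sketched as the fraction of evaluated states in which the agent's recommended action equals the clinician's recorded action. This reading of the metric is an assumption based on how it is used above; the function name and the toy action sequences are hypothetical.

```python
def model_concordance_rate(agent_actions, clinician_actions):
    """Fraction of states where the agent's action matches the clinician's."""
    if len(agent_actions) != len(clinician_actions):
        raise ValueError("action sequences must have the same length")
    matches = sum(a == c for a, c in zip(agent_actions, clinician_actions))
    return matches / len(agent_actions)

# Hypothetical recommendations over 8 test states (actions coded 0..3).
agent_actions = [0, 0, 1, 2, 0, 3, 0, 1]
clinician_actions = [0, 0, 1, 2, 0, 0, 0, 2]
mcr = model_concordance_rate(agent_actions, clinician_actions)
```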

Effect of sampling and regularization hyperparameter tuning
Next, we compare the recommendations of the various CQL agents for the diabetes and sepsis applications. For sampling, results correspond to the agent with K selected via grid-search. Appendix C provides a sensitivity analysis of how performance depends on K.
Diabetes treatment application. Sampling and lowering α lead to greater divergence between the CQL and SoC recommendations: both contribute to a drop in the frequency of the no change recommendation and to increases in the frequencies of each of the remaining treatment options (Fig. 1). Similarly, the model concordance rate under CQL, as averaged over 3 seeds, decreases from 62.5% with α = 1.0 to 50.7% with α = 0.1, and to 43.6% and 41.3% with undersampling and under+oversampling respectively (Table 2). Thus both regularization hyperparameter tuning and sampling have the expected effect of reducing the action imbalance of CQL's prescribed treatments. At the same time, the CQL agents' recommendations are still more similar to the SoC than those of the DDQN baseline, i.e., as suggested in Section 4.2, hyperparameter tuning and sampling continue to encourage conservativeness. This translates to generally higher rates of constraint satisfaction than under DDQN: constraint satisfaction rates are between 98.4% and 100.0% for CQL with α < 1, and between 93.1% and 100.0% for CQL with sampling (Table 3).
In terms of the optimality of the recommendations with hyperparameter tuning and sampling, Table 2 shows an increase in the rate of appropriate treatment intensification from 37.8% for CQL with α = 1.0 to 57.5% with α = 0.  with α = 1.0, which has a WIS score of 3.653, the WIS scores are also higher for CQL with α = 0.1 (5.530) and with α = 0.8 (3.795); as well as for all the sampling schemes, with under+oversampling achieving the highest WIS score among all methods (5.721), followed by undersampling (5.082), then oversampling (4.070). Thus, sampling and hyperparameter tuning both offer alternative means of improving the recommendations under CQL, with sampling outperforming hyperparameter tuning.
Sepsis treatment application. For the sepsis application as well, sampling and lowering α lead to greater divergence between the CQL and SoC recommendations. Fig. 2 shows that both reduce action imbalance, contributing to increases in the frequencies of the highest doses (top right section of each plot) of both IV fluids and vasopressors, as well as increases in the frequencies of higher IV fluid doses for the lowest vasopressor dose category. The difference is more pronounced for sampling than for hyperparameter tuning, and indeed the model concordance rate in Table 4 also decreases for all the sampling settings (from 30.1% for α = 1.0 to between 25.3% and 28.5% for the different sampling settings), while no decrease is observed for any α < 1.
The WIS scores are also higher for CQL with α = 0.1 (0.203), undersampling (0.191), oversampling (0.180), and under+oversampling (0.210), compared to CQL with α = 1.0 (0.175). CQL with under+oversampling attains the highest score. From Equation 20, this implies that survival rates are expected to increase for these CQL agents over CQL with α = 1.0. Thus, the finding that sampling and hyperparameter tuning can both improve the recommendations under CQL, with sampling outperforming hyperparameter tuning, generalizes from the diabetes to the sepsis application. The relative performance of the different sampling approaches also generalizes to the sepsis application. Under+oversampling outperforms undersampling, which in turn outperforms oversampling, in terms of WIS scores. This is in contrast to the finding in the supervised learning setting that oversampling tends to outperform undersampling [Zha et al., 2022].
A key difference between the supervised learning setting and our treatment optimization setting is that in the former, accuracy can suffer due to loss of information on majority class samples, while in the latter, it is the more conservative and suboptimal actions that are overrepresented. Thus, undersampling is unlikely to cause a loss of information on the optimal action.

Evaluation of constrained recommendations
With constrained recommendations, the distribution across actions did not change noticeably from Fig. 1 (see Appendix C). Model concordance rates for the unconstrained and constrained recommendations of each CQL agent are
Our other findings also generalize to the constrained recommendations. Compared to DDQN, CQL with α = 1.0 exhibits higher model concordance with the SoC (62.6% vs. 1.6%), and achieves a higher WIS score (3.698 vs. -0.190), implying that CQL's recommendations are more closely aligned with the SoC while also translating to improved expected health outcomes. The WIS scores are higher for CQL with α = 0.8 (3.798), α = 0.1 (5.496), undersampling (5.763), oversampling (4.049), and under+oversampling (4.668), compared to CQL with α = 1.0 (3.698). CQL with undersampling attains the highest score. We conclude here as well that sampling and hyperparameter tuning can both improve over CQL recommendations, with sampling again outperforming hyperparameter tuning.

Conclusion
We have demonstrated that offline reinforcement learning (based on CQL) outperforms a popular deep off-policy RL method (DDQN) for a real-world diabetes treatment optimization application. We found that offline RL recommendations are not only more closely aligned with clinical practice, but also translate to substantial improvements in expected health outcomes. Further, to address the common challenge of action imbalance encountered in real-world treatment optimization tasks, we devised a practical but theoretically grounded offline RL strategy for transition sampling of training data. Via extensive experiments on two real-world treatment optimization applications, we demonstrated improvements with this strategy over off-policy (DDQN) and offline (CQL) RL baselines in terms of expected health outcomes, as well as in terms of alignment of the recommendations with clinical practice guidelines. In addition, we showed theoretically and empirically that our results extend to settings in which hard safety constraints are enforced via an intuitive heuristic. Our findings strongly suggest that offline RL should be chosen over off-policy RL for treatment optimization applications, as a means of enhancing safety and efficacy. Finally, we highlight that transition sampling could find application in broader domains with critical safety considerations.

Figure 1: Distributions of unconstrained treatment recommendations under the various RL agents (as selected via grid-search) for the diabetes application.

Figure 2: Distributions of treatment recommendations under SoC and the various RL agents (as selected via grid-search) on the sepsis dataset.

Table 4: Comparison of treatment recommendations by the SoC, and the unconstrained recommendations of the different RL agents, on the sepsis dataset. Metrics are Model Concordance Rate (MCR) and WIS score. The mean and standard deviations of the results across 3 seeds are reported. For each metric, the highest mean value across the RL agents is bolded, along with the associated standard deviation.

Table 1: min_a {w_a} and max_a {w_a} with different sampling approaches for the diabetes and sepsis applications. For undersampling and oversampling, ranges across various K are provided.

Table 2: Comparison of treatment recommendations by the SoC, and the unconstrained and constrained recommendations by the different RL agents, for the diabetes application. The mean and standard deviations of the results across 3 seeds are reported. For each metric, the highest mean value across the RL agents is bolded, along with the associated standard deviation.

Table 3: Mean and standard deviations across 3 seeds of the CSRs for the four constraints (Equations 15-18) in the diabetes application. For each constraint, we bold the highest mean CSR(s) across RL agents, and the associated standard deviation(s).