Enhancing Safety in Learning from Demonstration Algorithms via Control Barrier Function Shielding

Learning from Demonstration (LfD) is a powerful method for non-roboticist end-users to teach robots new tasks, enabling them to customize the robot's behavior. However, modern LfD techniques do not explicitly synthesize safe robot behavior, which limits the deployability of these approaches in the real world. To enforce safety in LfD without relying on experts, we propose a new framework, ShiElding with Control barrier fUnctions in inverse REinforcement learning (SECURE), which learns a customized Control Barrier Function (CBF) from end-users that prevents robots from taking unsafe actions while imposing little interference with task completion. We evaluate SECURE in three sets of experiments. First, we empirically validate that SECURE learns a high-quality CBF from demonstrations and outperforms conventional LfD methods on simulated robotic and autonomous driving tasks, with improvements in safety by up to 100%. Second, we demonstrate that roboticists can leverage SECURE to outperform conventional LfD approaches on a real-world knife-cutting, meal-preparation task by 12.5% in task completion while driving the number of safety violations to zero. Finally, we demonstrate in a user study that non-roboticists can use SECURE to effectively teach the robot safe policies that avoid collisions with the person and prevent coffee from spilling.


INTRODUCTION
Recent advances in robot learning have offered the potential to aid people in a range of applications, including driving [47], manufacturing [48], and household tasks [10], like tidying up or serving someone a drink. Reinforcement learning (RL) has become a ubiquitous approach to developing robot controllers; however, defining the reward function to elicit desired behaviors can be difficult, and engineered reward functions might overfit to particular RL algorithms [7]. Instead, the field of Learning from Demonstration (LfD) seeks to empower non-roboticist end-users to teach robots skills and customized behaviors through demonstrations [13, 14, 23, 39].
Like RL, LfD research has yielded strong results in laboratory settings [13, 14, 36], but few LfD techniques enable robots to learn safe policies, hindering the deployment of LfD with end-users in the real world. Recently, Brown et al. [8] provided high-confidence bounds on the quality of the inferred human intention as a proxy for safety. While promising, such approaches do not allow specifying constraints on the learned policy to explicitly prevent the robot from taking unsafe actions.
To ensure safety, Control Barrier Functions (CBFs) are a state-of-the-art method for designing safe robotic controllers that adhere to explicit safety constraints. CBFs have been applied successfully in RL and HRI settings [3, 4, 16, 29, 30, 35, 46], and we hypothesize that CBFs could similarly help learned LfD policies avoid unsafe states. However, conventional CBF approaches still require experts to formally define and construct such constraints. Instead, we aim to enforce safety in LfD settings without relying on experts by allowing users to define safety via demonstration.
We present SECURE, a novel Safe Learning from Demonstration (LfD) framework that learns personalized CBFs from end-user demonstrations. In contrast to approaches solely focusing on physical safety, SECURE acknowledges the variability in individuals' safety preferences [24, 38]. This user-centric approach not only enhances perceived safety but also ensures physical safety, as demonstrated in a coffee-serving task where safety demonstrations define a minimum distance and a maximum cup angle to avoid spills (see Figure 1). Our contributions in this work are four-fold: (1) We propose a new framework, ShiElding with Control barrier fUnctions in inverse REinforcement learning (SECURE), that learns a CBF from human demonstrations. We then develop two techniques, namely the CBF Shield and Adaptive Resampling, which shield the LfD policy to be safe and enhance the sample efficiency of SECURE for improved usability in HRI; (2) We demonstrate SECURE's ability to learn a high-quality CBF, in comparison to an expert-designed CBF, in a 2D Double Integrator system. Empirical evaluation on simulated robot control tasks showcases SECURE's task performance on par with or exceeding LfD baselines, while significantly reducing safety constraint violations by up to 100%; (3) We demonstrate that roboticists can leverage SECURE to synthesize safe policies from demonstrations on a real-world knife-cutting, meal-preparation task. SECURE outperforms conventional LfD approaches by 12.5% in task completion and eliminates 100% of unsafe cases (i.e., "cutting" human arms); (4) We further conduct a user study in which participants first provide demonstrations in a coffee-cup placing task and then work on a secondary task in the robot's proximity. SECURE effectively learns user-specific safe policies from the provided demonstrations, enabling the robot to complete its task while being perceived as safe by users operating in its proximity.

RELATED WORK
Ensuring safe and reliable robot operation, particularly in interactions with human users, is of paramount importance [9]. In the RL realm, safety challenges arise due to the learning process's exploration in unknown environments, where various safety approaches tailored to RL have emerged, including constrained policy optimization [1, 17, 32, 40, 43], safe exploration [20, 33, 34], learning a safety critic [5, 41, 44], risk-averse RL [45, 51], and shielding [2, 11]. Shielding, in particular, is a framework that ensures the safety of a control policy by verifying that each applied action keeps the system within a predefined safe set of states [6]. CBFs are mathematical functions utilized in control theory to enforce safety constraints by defining a safe set of states [3, 4]. CBFs are a popular technique to shield robots from unsafe actions, as they enforce the system to always remain within a set of safe states.
To develop safe controllers, prior work has explored synthesizing CBFs from data, including expert demonstrations [26, 27, 37, 42]. However, these approaches work with expert demonstrations, limiting their applicability with end-users, which is central in LfD. Researchers have also explored tuning specific CBF parameters according to user data [18, 25, 31, 46]. In the context of RL safety, researchers have investigated the utilization of expert-designed CBFs to synthesize control policies that confine the system within safe states [15, 16, 29, 30, 35]. Recent efforts have also focused on leveraging data-driven methods to learn CBFs within the RL framework for safety assurance [50]. However, these approaches have been limited to RL and have not been extended to LfD methods, where robots directly learn from and interact with humans.
While a recent method extended CBFs to the domain of imitation learning [19], it requires a manually designed CBF to supplement the Behavioral Cloning (BC) policy, which is not practical for real-world LfD settings. Castañeda et al. [12] propose constructing a CBF from data to detect out-of-safe-distribution cases. Still, the approach risks being overly conservative. To the best of our knowledge, our study is the first to successfully integrate CBFs with IRL algorithms and effectively increase policy performance while mitigating potential safety concerns.

PRELIMINARIES
In this section, we introduce three building blocks of SECURE: the Markov Decision Process, Inverse Reinforcement Learning, and the Control Barrier Function. Markov Decision Process: We model the environment as a Markov Decision Process (MDP) [49], $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, R, T, \gamma, \rho_0 \rangle$. $\mathcal{S}$ and $\mathcal{A}$ denote the state and action space, respectively. $R : \mathcal{S} \to \mathbb{R}$ is the reward of a given state. $T : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is a deterministic transition function that gives the next state, $s'$, for applying the action, $a$, in state, $s$. $\gamma \in (0, 1)$ is the temporal discount factor. $\rho_0 : \mathcal{S} \to \mathbb{R}$ denotes the initial state probability distribution. A stochastic policy $\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a mapping from states to probabilities over actions. A trajectory, $\tau = (s_0, a_0, \cdots, s_t, a_t, \cdots)$, is generated by executing the policy within the environment: $s_0 \sim \rho_0$, $a_t \sim \pi(s_t)$, $s_{t+1} = T(s_t, a_t)$ for all $t \geq 0$. The expected discounted return of a policy, $\pi$, is $J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_t \gamma^t R(s_t)\right]$. The objective of RL is to find the optimal policy, $\pi^* = \arg\max_\pi J(\pi)$.
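Although the formal statement does not survive above, the requirements R1, R2, and R3 referenced throughout Section 4 follow the standard discrete-time CBF formulation; a reconstruction in our own notation (the decay rate $\alpha$ and the set names are our assumptions, not necessarily the paper's exact symbols) is:

```latex
% Safe set encoded by the barrier function h:  C = { s in S : h(s) >= 0 }
\begin{align*}
\textbf{R1:}\;& h(s) \ge 0 \quad \forall\, s \in \mathcal{S}_{\text{safe}} \\
\textbf{R2:}\;& h(s) < 0  \quad \forall\, s \in \mathcal{S}_{\text{unsafe}} \\
\textbf{R3:}\;& h\!\left(T(s, a)\right) - h(s) + \alpha\, h(s) \ge 0,
               \qquad \alpha \in (0, 1]
\end{align*}
```

Under R3, $h$ can shrink by at most a factor of $(1 - \alpha)$ per step, so a policy whose actions always satisfy R3 can never drive the state from $h \ge 0$ into $h < 0$.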

METHOD
We describe SECURE in three steps. In Section 4.1, we first describe how SECURE learns a CBF, represented by a neural network, from user-provided safety demonstrations (Figure 2, top). Second, Section 4.2 describes how SECURE utilizes a shielding mechanism with the learned neural CBF to prevent the robot from entering dangerous states while still allowing for task completion (Figure 2, middle). Finally, in Section 4.3, we introduce a novel adaptive sampling method for SECURE that improves the efficiency of finding safe and task-aware actions (Figure 2, bottom).

Safe LfD with CBF
To enable end-users to define customized safety boundaries, we seek to learn user-specific safety constraints, represented by a CBF, from user demonstrations. To learn the CBF, we need access to the safe state set, $\mathcal{S}_{\text{safe}}$, and the unsafe state set, $\mathcal{S}_{\text{unsafe}}$. While we can construct the safe state set from demonstrations, $\mathcal{S}_{\text{safe}} = \{s \mid s \in \tau, \tau \in \mathcal{D}\}$, we should not request demonstrators to take the risk of hurting themselves to provide unsafe demonstrations. Instead, we define the near-dangerous state set, $\mathcal{S}_{\text{nd}}$, as a set that the robot has to pass through before entering $\mathcal{S}_{\text{unsafe}}$, shown in Equation 2.
Intuitively, $\mathcal{S}_{\text{nd}}$ would be a set that "wraps" the actual physically unsafe states, e.g., collisions. For instance, if a robot helps a person by serving a cup of coffee, the person can demonstrate near-dangerous states by moving their arms around the static robot arm holding the cup of coffee at distances that they perceive as near-dangerous. Note that one user may define a large distance as "near" dangerous even if the expected harm may be low, and SECURE respects such user-defined safety concepts.
Having defined $\mathcal{S}_{\text{nd}}$, we amend the CBF's second requirement as R2′: for all $s \in \mathcal{S}_{\text{nd}}$, $h(s) < 0$. As a corollary of the CBF property introduced in Section 3, if R1, R2′, and R3 are satisfied, the policy cannot enter $\mathcal{S}_{\text{nd}}$, which further means the policy cannot enter the dangerous state set, $\mathcal{S}_{\text{unsafe}}$, according to the definition of $\mathcal{S}_{\text{nd}}$. While R2′ is a stricter requirement than R2, it allows people to personally demonstrate what they deem as unsafe. We replace $\mathcal{S}_{\text{unsafe}}$ in Equation 1 with $\mathcal{S}_{\text{nd}}$, resulting in Equation 3.
Finding a solution of $h$ and $\pi$ that makes the amended objective in Equation 3 positive will satisfy the CBF requirements and ensure that the agent does not enter dangerous or near-dangerous states. One observation when maximizing this objective is that the first two terms depend only on the CBF, $h$, while the third term relies on $\pi$. Although one can jointly optimize $h$ and $\pi$, such an optimization suffers from empirical difficulty because $\pi$ is chasing the moving target $h$. To show this, we conduct an empirical experiment in the Demolition Derby domain (see Section 6). Joint optimization of $h$ and $\pi$ yields a 32.3% ± 11.0% success rate with a high 77.7% ± 3.4% occurrence of dangerous cases. SECURE instead takes a two-stage approach: 1) optimize the CBF, $h$, to satisfy R1 and R2′; 2) modulate $\pi$ to satisfy R3 via the CBF Shield we introduce in Section 4.2. As a result, SECURE achieves a high 52.3% ± 2.5% success rate and a low 3.3% ± 1.2% occurrence of dangerous cases.
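Equation 3 itself does not survive extraction; a plausible form of the amended objective, consistent with the three-term structure discussed above (a sketch only; the exact weighting and expectation structure are our assumptions), is:

```latex
\max_{h,\,\pi}\;
\underbrace{\mathbb{E}_{s \sim \mathcal{S}_{\text{safe}}}\big[\, h(s) \,\big]}_{\text{R1: safe states}}
\;+\;
\underbrace{\mathbb{E}_{s \sim \mathcal{S}_{\text{nd}}}\big[ -h(s) \big]}_{\text{R2}':\text{ near-dangerous states}}
\;+\;
\underbrace{\mathbb{E}_{s,\; a \sim \pi(\cdot \mid s)}\big[\, h(T(s,a)) - h(s) + \alpha\, h(s) \,\big]}_{\text{R3: forward invariance}}
```

The first two terms involve only $h$, while the third couples $h$ and $\pi$, which is precisely what makes joint optimization unstable.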
For Stage 1, we formulate the loss function $\mathcal{L}_{\text{barrier}}$ as shown in Equation 4, where $h_\theta(\cdot)$ is a neural network parameterized by $\theta$. Intuitively, minimizing $\mathcal{L}_{\text{barrier}}$ yields an $h_\theta(\cdot)$ that can discriminate safe states, which have positive $h$ values, from near-dangerous states, which have negative $h$ values, when trained on the safe and near-dangerous states specified through demonstrations.
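A minimal PyTorch sketch of a hinge-style barrier loss with this discriminative behavior (the margin, network architecture, and all names here are our assumptions; Equation 4 may differ in detail):

```python
import torch
import torch.nn as nn

class NeuralCBF(nn.Module):
    """h_theta: maps a state to a scalar barrier value."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s).squeeze(-1)

def barrier_loss(h: NeuralCBF, safe_states: torch.Tensor,
                 nd_states: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    # Penalize safe states whose h falls below +margin (R1 violation) ...
    loss_safe = torch.relu(margin - h(safe_states)).mean()
    # ... and near-dangerous states whose h rises above -margin (R2' violation).
    loss_nd = torch.relu(margin + h(nd_states)).mean()
    return loss_safe + loss_nd
```

The margin pushes $h_\theta$ away from zero on both label sets, so the learned boundary does not sit exactly on the training states.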

Shielding Unsafe Actions
After learning the CBF, $h_\theta(\cdot)$, from human demonstrations encoding safe and near-dangerous states, one naïve way to avoid danger is to choose actions leading to states with $h > 0$. However, this approach is myopic and can still lead to danger. Consider a scenario where a fast-moving vehicle approaches unsafe states: merely choosing actions with $h > 0$ results in the vehicle approaching the unsafe boundary and inevitably entering an unsafe state. In contrast, CBF requirement R3 (Equation 5, where $a \sim \pi(\cdot \mid s)$) enables SECURE to assess the gradual decline of $h$ from safe to unsafe states, ensuring the agent never enters unrecoverable states. Therefore, SECURE employs the CBF Shield to find actions aligned with R3.
The CBF Shield directly finds safe actions that satisfy R3, i.e., $\mathcal{L}_{\text{derivative}} \geq 0$. We summarize the CBF Shield procedure in Algorithm 1. For each safe action choice, we begin by sampling a batch of actions $\{a_i\}_{i=1}^{N}$ from the AIRL policy (lines 1-2). Specifically, the policy output is modeled as a Gaussian distribution with mean $\mu(s)$ and standard deviation $\sigma(s)$, and each action is sampled by $a \sim \mathcal{N}(\mu(s), \sigma(s))$. Next, a straightforward approach could be to randomly select one safe action from the batch. However, while the selected action is safe, it may interfere with task completion (yellow arrows in Figure 3). Instead, the CBF Shield aggregates multiple safe actions (green arrows in Figure 3) to better reflect the policy's intention of accomplishing the task. As such, we calculate the ratio of safe actions within a sampled action batch, $r = N_{\text{safe}} / N$, where $N$ is the sampled batch size. When the ratio exceeds a threshold, $r_0$, we have more confidence that the average of the safe actions aligns well with the policy's mean output (i.e., aims at accomplishing the task). Thus, we aggregate the safe actions within this batch (line 6). When $r \leq r_0$, the current batch does not contain enough safe actions, and we resort to the Adaptive Resampling method (Section 4.3) to explore and find more safe actions efficiently (lines 4-5).
To ensure the safety of the executed action, we first aggregate the safe actions by averaging, $\bar{a} = \frac{1}{N_{\text{safe}}} \sum_i a_i$ (line 6). If the averaged action (brighter green arrow in Figure 3) is itself deemed safe, i.e., it satisfies R3 (line 7), $\bar{a}$ is returned for execution. Otherwise, we select from the safe action set the action closest to the average, $\tilde{a} = \arg\min_{a_i \in \mathcal{A}_{\text{safe}}} \|a_i - \bar{a}\|$ (line 9). In summary, the CBF Shield procedure ensures the satisfaction of R3 (i.e., policy safety) by always returning an action that satisfies the derivative condition while also being task-aware, which helps the agent accomplish the task while respecting personalized safety definitions.
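A minimal NumPy sketch of Algorithm 1's selection logic; the safety test `satisfies_r3` (wrapping the R3 condition under the known transition model), the defaults `N` and `r0`, and the callable interface of `adaptive_resample` are assumptions based on the description above:

```python
import numpy as np

def cbf_shield(mu, sigma, s, satisfies_r3, adaptive_resample,
               N: int = 100, r0: float = 0.3,
               rng=np.random.default_rng(0)):
    """Return a safe, task-aware action (sketch of Algorithm 1)."""
    while True:  # a real implementation would cap the number of retries
        actions = rng.normal(mu, sigma, size=(N, len(mu)))      # lines 1-2
        safe = np.array([a for a in actions if satisfies_r3(s, a)])
        if len(safe) / N <= r0:
            # Too few safe actions: widen sigma along unsafe directions
            # and resample (lines 4-5, Section 4.3).
            sigma = adaptive_resample(mu, sigma, s)
            continue
        a_bar = safe.mean(axis=0)                               # line 6
        if satisfies_r3(s, a_bar):                              # line 7
            return a_bar
        # Fall back to the safe action closest to the average (line 9).
        return safe[np.argmin(np.linalg.norm(safe - a_bar, axis=1))]
```

Here `adaptive_resample` is the standard-deviation adjustment sketched in Section 4.3, adapted to this three-argument interface, e.g., via `functools.partial`.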

Adaptive Resampling
The CBF Shield introduced in Section 4.2 assumes a minimum percentage of safe actions in the sampled action batch in order to obtain an action that is both safe and task-aware. However, the AIRL policy may be overly confident in a task-oriented but unsafe action, and thus might not sample an action batch containing even a single safe action, let alone enough for safe action aggregation. Therefore, we need a strategy for greater exploration within the action space. To address this, SECURE modifies the policy's action distribution, $\mathcal{N}(\mu, \sigma)$, and resamples from the modified distribution. To preserve the task completion goal represented by the action mean, $\mu$, we refrain from modifying it. Instead, we amplify the standard deviation in certain directions. To reduce the probability of generating safe but undesired actions, we selectively increase the standard deviation along the directions identified as unsafe, as sketched below.
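A minimal sketch of this adjustment, corresponding to Algorithm 2 described next (the probe count `M`, step size `eta`, uniform action bounds, and the element-wise treatment of $\sigma$ are our assumptions):

```python
import numpy as np

def adaptive_resample(mu, sigma, s, h, transition, M: int = 64,
                      eta: float = 0.1, a_low=-1.0, a_high=1.0,
                      rng=np.random.default_rng(0)):
    """Widen sigma along unsafe directions (sketch of Algorithm 2).

    The mean mu is intentionally left untouched to preserve the task goal.
    """
    probes = rng.uniform(a_low, a_high, size=(M, len(mu)))       # line 1
    h_next = np.array([h(transition(s, a)) for a in probes])
    unsafe = h_next < 0
    if not unsafe.any():
        return sigma  # no unsafe probes found; keep sigma as-is
    # Weighted average of unsafe probes, weights = negative h values (line 2).
    w = -h_next[unsafe]
    d = (w[:, None] * probes[unsafe]).sum(axis=0) / w.sum()
    d /= np.linalg.norm(d) + 1e-8
    # Take a small step along the normalized unsafe direction (lines 3-4).
    return sigma + eta * np.abs(d)
```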
Algorithm 2 and Figure 4 show how our approach finds unsafe directions and adjusts the standard deviation. First, we sample probing actions (the blue and green arrows in Figure 4) uniformly from the action space (line 1). To determine the unsafe action direction, we compute a weighted average of the unsafe probing actions (i.e., green arrows in Figure 4, identified by $h_\theta(\cdot) < 0$), where the weights are given by the negative $h$ values (line 2). We can then adjust the standard deviation (i.e., the purple lines) by taking a small step, with a given step size, in the normalized direction of the unsafe actions (lines 3-4). A new batch of actions is sampled for a subsequent verification loop conducted by the CBF Shield. Our Adaptive Resampling approach provides an efficient way to find safe and effective actions.

LEARNED CBF EVALUATION
We first evaluate the quality of the learned CBF against an expert-designed, ground-truth CBF in the 2D Double Integrator domain. We collect a dataset comprising 800 safe states and 800 unsafe states by sampling from the state space and labeling each state with the ground-truth CBF, separating the impact of data quality from the CBF learning process itself. To test the learned CBF, we discretize the state space with a grid size of 0.1 within the ranges [0, 10], [0, 10], [−1.5, 1.5], [−1.5, 1.5] for $x$, $y$, $\dot{x}$, $\dot{y}$, respectively. As such, we obtain 100 × 100 × 30 × 30 = 9,000,000 test states. We summarize the evaluation results in Table 1, which shows a low overly-conservative rate (1.9%) and a low under-conservative rate (4.1%). We observe that SECURE is effective in learning a high-quality approximation of the ground-truth CBF with limited data. Additionally, SECURE strikes a good balance between being over-conservative and under-conservative.
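The grid construction and the two conservativeness rates are straightforward to reproduce; a sketch follows (`learned_h` and `true_h` are assumed vectorized callables returning barrier values; the rate definitions reflect our reading of Table 1's metrics):

```python
import numpy as np

# Discretize the double-integrator state space as described above.
xs = np.arange(0.0, 10.0, 0.1)      # 100 values each for x and y
vs = np.arange(-1.5, 1.5, 0.1)      # 30 values each for x-dot and y-dot
X, Y, VX, VY = np.meshgrid(xs, xs, vs, vs, indexing="ij")
states = np.stack([X, Y, VX, VY], axis=-1).reshape(-1, 4)
assert states.shape[0] == 100 * 100 * 30 * 30  # 9,000,000 test states

pred_safe = learned_h(states) >= 0   # learned_h, true_h: assumed callables
true_safe = true_h(states) >= 0
over = np.mean(~pred_safe & true_safe)    # flags truly safe states as unsafe
under = np.mean(pred_safe & ~true_safe)   # misses truly unsafe states
print(f"overly conservative: {over:.1%}, under-conservative: {under:.1%}")
```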

SIMULATION EXPERIMENTS
We evaluate SECURE in the following simulated domains. Demolition Derby Domain: a car is tasked to reach a target location while avoiding 16 other randomly moving cars (Figure 6). We utilize the approach from Qin et al. [35] to collect safe demonstrations by filtering out trajectories with collisions. We generate near-dangerous states by collecting states where the distance between the car and an obstacle is below a predefined threshold.
Panda Arm Push Domain: the objective is to push a block with a high center of gravity to a target location without toppling it [22] (Figure 7). We collect demonstrations by teleoperation via a keyboard. We collect three near-dangerous scenarios that knock down the block: a) pushing the upper part of the block (count: 442), b) pushing with high velocity (count: 590), and c) pushing the upper part of the block with high velocity (count: 444).
The number of safe and near-dangerous states for training the CBF, the number of demonstrations used to train the policy, and the architectures of the neural network CBFs are tabulated in Table 2. Please refer to the supplementary material for auxiliary details of the experiments.

Results
We develop two metrics to evaluate task completion and safety: "Success Rate," which quantifies the rate of successful task completion, and "Dangerous Rate," which is the rate of hazardous scenarios encountered. We evaluate both metrics across 100 trajectories with ten random seeds for both domains. Since SECURE is the first method to address safety issues for IRL, there is no existing benchmark tailored for the same task. Therefore, we select two baselines: 1) behavior cloning (BC), as BC remains a prevalent approach; 2) the state-of-the-art IRL approach, AIRL, as it has a strong capability to imitate demonstrated behaviors.
The results are summarized in Table 3, showcasing the exceptional performance of SECURE. With BC displaying the lowest performance, our results analysis focuses on comparing SECURE and AIRL. In the Demolition Derby domain, AIRL and SECURE have similar success rates (two one-sided t-tests with bound = 10, $p < .01$), but SECURE achieves significantly fewer dangerous cases (71.2% fewer, Mann-Whitney $U = 0$, $p < .001$). In the Panda Arm Push domain, SECURE not only eliminates all instances of the block toppling over (compared with AIRL, Mann-Whitney $U = 0$, $p < .001$) but also achieves a 43.7% improvement in the success rate, significantly outperforming AIRL (Mann-Whitney $U = 99.5$, $p < .001$).
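For reference, the one-sided nonparametric comparison reported above can be run with SciPy; the per-seed arrays below are illustrative placeholders, not the paper's data:

```python
from scipy.stats import mannwhitneyu

# Ten per-seed dangerous rates per method (illustrative placeholders).
airl_danger = [0.41, 0.38, 0.45, 0.40, 0.39, 0.44, 0.37, 0.42, 0.43, 0.40]
secure_danger = [0.12, 0.10, 0.13, 0.11, 0.09, 0.12, 0.10, 0.11, 0.13, 0.12]

# One-sided test: is SECURE's dangerous rate lower than AIRL's?
u, p = mannwhitneyu(secure_danger, airl_danger, alternative="less")
print(f"U = {u}, p = {p:.4g}")
```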

Ablation Study of Resampling Method
To evaluate each component's contribution to SECURE, we conduct ablation studies in the simulated domains. In the first ablation study, to examine the importance of averaging the safe actions within the shield, we randomly select a safe action from the batch instead of averaging all safe actions. For the second ablation study, we remove the adaptive resampling approach. Instead, we keep resampling with the policy output until a predetermined resampling limit is reached, upon which a random action is selected. The second ablation allows us to assess the effect of not adapting for resampling.
The results of the ablation study are presented in Figure 9, showing the significant impact of the CBF Shield and the adaptive resampling. In the Demolition Derby domain, SECURE achieves a significant improvement (18.0% and 68.2%) in safety with respect to the two ablations (Kruskal-Wallis $H(2) = 16.25$, $p < .001$; pairwise post-hoc comparisons using Dunn's test indicate SECURE significantly outperforms both ablations with $p < .01$ and $p < .001$, respectively), while maintaining similar or higher task performance. In the Panda Arm Push domain, SECURE eliminates all unsafe executions (Kruskal-Wallis $H(2) = 17.33$, $p < .001$; Dunn post-hoc shows SECURE significantly outperforms both ablations with $p < .01$ and $p < .001$, respectively) as well as achieves a significant task performance gain of 28.2% and 43.8% with respect to the two ablations (Kruskal-Wallis $H(2) = 14.56$, $p < .001$; Dunn post-hoc shows SECURE significantly outperforms both ablations with $p < .01$ and $p < .001$, respectively). These findings validate our design choices.

Sensitivity Analysis
Due to the data-driven nature of SECURE, performance can be impacted by the data size and quality. As such, we conduct a sensitivity analysis for SECURE from three perspectives: 1) dataset size; 2) label imbalance; and 3) noisy labels, and show SECURE is robust to non-ideal data.
Dataset Size: In the dataset size sensitivity test, we reduce the overall dataset size for CBF learning while preserving the ratio of safe and unsafe states. We observe that SECURE is robust to dataset size in easier tasks, such as Demolition Derby, even with only 1% of the original dataset. The performance drops for harder tasks (e.g., Panda Arm Push) when the dataset size is reduced to 10%.

Label Imbalance: In the label imbalance test, we reduce the number of unsafe states in observance of the relative difficulty in collecting near-dangerous demonstrations. The results demonstrate that SECURE is empirically robust to a data imbalance ratio of 1:2 in Demolition Derby and a ratio of 1:4 in Panda Arm Push. Beyond these ratios, the learned CBF becomes under-conservative due to the overwhelming number of safe states within the dataset.

Noisy Data: In the noisy data test, we consider a possibly noisy data collection process with naïve users by flipping safe/unsafe labels within the dataset to examine SECURE's robustness. The results show SECURE is robust to noisy data in both domains, exhibiting strong performance even when up to 50% of the labels are wrong.
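The label-flipping perturbation is simple to replicate; a sketch (the function name and flip fraction are our choices):

```python
import numpy as np

def flip_labels(labels: np.ndarray, frac: float,
                rng=np.random.default_rng(0)) -> np.ndarray:
    """Flip a random fraction of binary safe(1)/unsafe(0) labels."""
    noisy = labels.copy()
    idx = rng.choice(len(labels), size=int(frac * len(labels)), replace=False)
    noisy[idx] = 1 - noisy[idx]
    return noisy

# e.g., corrupt half of the CBF training labels, as in the 50% noise test:
# noisy_labels = flip_labels(labels, frac=0.5)
```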

REAL-ROBOT EXPERIMENTS
We conduct two real-robot experiments to demonstrate SECURE's applicability to roboticists and users, respectively. In the first case study, we (roboticists) provide demonstrations for a knife-cutting task and evaluate the success of SECURE in avoiding cutting our arms. In the second, a user study, we ask users to provide demonstrations in a coffee-placing task and evaluate SECURE via users' ratings of task completion, safety, and perceived safety. The number of safe and near-dangerous states for training the CBF in each domain, along with the number of demonstrations used to train the policy and the size of the neural network CBF, are tabulated in Table 2.

Demonstration with Roboticists
In this demonstration, we compare SECURE with the benchmarks in a tofu-cutting task in close proximity to a human. We (roboticists) provide a set of safe demonstrations via kinesthetic teaching. Because of the possible danger the knife may pose, we collect 450 near-dangerous states of close proximity between the robot and human arms from experimenters, ensuring they adhere to all necessary safety precautions. Following previous CBF literature [35], we assume the robot's forward kinematics model is available. Similar to the simulated domain experiments, we evaluate SECURE against BC and AIRL over ten episodes and calculate the success rate and dangerous rate metrics. In this cutting task, where avoiding collision is of utmost importance, SECURE achieves zero collision cases and nine successful episodes, surpassing the baseline methods, BC and AIRL (Table 3 and Figure 10). The results demonstrate the safer execution of SECURE, effectively eliminating collisions without compromising task completion. Recordings of SECURE's execution can be found in the supplementary video.

User Study
We conducted a user study to understand non-roboticist users' abilities to provide helpful demonstrations for SECURE. In this study, we create a context where the user needs to prepare for a lecture by reaching for one of four books and turning to certain pages, while the robot serves coffee for the user (Figure 11). In the first session of the experiment, human participants first demonstrate how to serve the coffee (i.e., the task) via kinesthetic teaching. The user then provides demonstrations for safe/unsafe human arm positions with respect to the robot and safe/unsafe cup tilt angles. Specifically, to collect safe and unsafe demonstrations, we replay the user's kinesthetic teaching trajectory on the robot, pause at four states, and invite the participant to provide safe/unsafe demonstrations for arm positions by moving their arm around the robot, and for cup tilts by changing the tilt angle of the robot end effector holding the cup. We collect five kinesthetic teaching trajectories, and the entire session lasts less than one hour for each participant. As such, we obtain task demonstrations and the user's defined safe/unsafe demonstrations in the first session of the experiment.
Once we finish the demonstration collection in the first session with all participants, we prepare four different setups of data to train SECURE's policy and CBF. To see how different components within SECURE respond to the amount of data available and to whether data is personalized for each user, we consider a 2 × 2 within-subject design with the two factors being policy training data (grouped vs. individual) and CBF training data (grouped vs. individual). The grouped condition represents pooling all participants' data for training, while the individual condition means only using one participant's own data for training. As such, we obtain two behavior-cloning-trained policies and two CBFs.
In the second session of the experiment, the participant is tasked with reaching for a book while the robot places the coffee. We test twelve episodes with each participant, with three episodes corresponding to each of the four conditions. After each episode, the participant evaluates the robot's task completion, safety, and perceived safety via a 10-item Likert scale. We depict the experiment procedure in the supplementary video for a better visual understanding of the setup.
The user study was approved by the Institutional Review Board, and we recruited twelve participants (ten male, two female; three within age range 18-25 and seven within age range 26-35). We summarize the results in Table 4. In all four conditions, we demonstrate that SECURE successfully accomplishes the task (i.e., coffee placing) while being safe with the human subjects who reach for books and closely interact with the robot, evidenced by the high ratings in task, safety, and perceived safety. Comparing the four conditions, the grouped policy and individual CBF yields the highest ratings on all three metrics. We hypothesize this result may suggest the utility of learning the policy from a larger number of task demonstrations, as well as the value of personalized training for the CBF. Users commented on executions with the individual CBF as "P10: exactly how I defined my comfort zone" and "P12: it is not unsafe nor overly safe," compared with their comments regarding the grouped CBF as "P7: it felt like the robot was aiming the coffee cup to my face" and "P2: the robot is overly safe -- as long as my arm is visible, it tries to avoid me even if there is large distance." However, due to the limited number of subjects in our study, we could not reach a statistically significant conclusion regarding the performance of grouped vs. individual SECURE, but we believe our study still demonstrates that SECURE is successful in the hands of users.

DISCUSSION AND LIMITATIONS
The success of SECURE shown in the previous sections is grounded in the novel integration of neural CBFs, IRL, and adaptive sampling. SECURE enables the robot to acquire an effective barrier function, which plays a crucial role in shielding the system from dangerous states. By incorporating the CBF Shield, SECURE ensures that the system remains within a safe state and avoids potential hazards, and that the executed action is in line with the task objective. Furthermore, our adaptive sampling increases the efficiency of finding safe actions. Overall, the proposed SECURE method stands out among all the ablations and design choices and presents a promising paradigm for empowering end-users to teach robots new behaviors while maintaining their definition of safety.
SECURE operates under a foundational set of assumptions. SECURE assumes all states within the task demonstrations are safe, which could be invalid if the user provides demonstrations containing undesirable behaviors. Additionally, SECURE assumes that users can provide a collection of undesired states. Nonetheless, we acknowledge that this presumption might not be feasible in certain domains (e.g., autonomous driving, where demonstrating undesirable states could jeopardize human safety). Therefore, the proposed algorithm, SECURE, offers empirical safety assurances rather than absolute safety guarantees. Additionally, SECURE relies on access to the transition dynamics of the domain to assess the safety of proposed actions. We recognize that establishing these transition dynamics in complex domains can present considerable challenges.
In future work, we aim to explore methods that enable active inquiries about uncertain regions, opening up possibilities for proactive learning and further enhancing safety. Another future direction is to investigate users' perceptions toward grouped vs. individualized policies and safety modules in a larger-scale user study.

CONCLUSION
We introduce a novel Safe LfD framework, SECURE, which combines Control Barrier Functions (CBFs) with Inverse Reinforcement Learning (IRL) methods to learn a safe policy from demonstrations. By integrating a CBF learned from human demonstrations, SECURE establishes a CBF Shield that ensures the IRL policy avoids unsafe regions. Through empirical evaluations in two simulated domains and two real-robot tasks, we demonstrate the effectiveness of SECURE. SECURE achieves comparable or superior task performance compared to traditional IRL methods while significantly reducing the number of unsafe cases.

Figure 2: This figure illustrates SECURE's architecture. End-users contribute demonstrations and near-dangerous states to train the policy, $\pi(\cdot)$, and the CBF, $h_\theta(\cdot)$. The CBF Shield prevents the IRL policy from entering dangerous states while minimizing interference with task completion. Adaptive Resampling, introduced in the CBF Shield, generates safe and task-aware actions efficiently.


Figure 6: This figure shows the Demolition Derby domain.

Figure 7: This figure illustrates the Panda Arm Push domain.

Figure 8: This figure shows the setup for the real-robot banana-cutting task.
Figure 10: Timelapse of the execution of SECURE and baselines on the kitchen cutting task. (a) Behavior Cloning: the robot ignores the human arm, leading to arm-knife contact. (b) AIRL: the robot ignores the human arm, leading to arm-knife contact. (c) SECURE (ours): the robot yields for the human arm, then safely continues. Unlike the baselines, SECURE successfully finishes the task without cutting the nearby human.

Figure 11: Setup for the user study. The robot is tasked to place coffee on the pink square, and the human is tasked to get a book and turn to certain chapters.

Table 1: Means and standard deviations of the learned CBF's performance with five different random seeds for training on the 2D Double Integrator domain.

Table 2: Number of safe and near-dangerous states for CBF training, number of task demonstration states for policy learning, and the neural network CBF's architecture in the simulated and real-robot domains. CNN refers to Convolutional Neural Networks and FC refers to Fully-Connected networks with hidden-layer node counts specified in parentheses.

Table 3: Comparison of SECURE (ours) with BC and AIRL in three domains. The standard deviation is calculated over ten runs with different random seeds for each algorithm. Bold denotes the best-performing algorithm.

Table 4: Task (out of 105), safety (out of 42), and perceived safety (out of 42) ratings in the user study for the four conditions. Ratings are reported as mean (standard error). Bold denotes the highest-scoring condition.