Pandora’s Problem with Nonobligatory Inspection: Optimal Structure and a PTAS

Weitzman (1979) introduced Pandora’s box problem as a mathematical model of sequential search with inspection costs, in which a searcher is allowed to select a prize from one of n alternatives. Several decades later, Doval (2018) introduced a closely related version of the problem, in which the searcher does not need to incur the inspection cost of an alternative and can select it uninspected. Unlike in the original problem, the optimal solution to the nonobligatory inspection variant was shown by Doval (2018) to require adaptivity, and by recent work of Fu, Li, and Liu (2022), finding the optimal solution is NP-hard. Our first main result is a structural characterization of the optimal policy: we show there exists an optimal policy that follows only two different pre-determined orders of inspection, and transitions from one to the other at most once. Our second main result is a polynomial time approximation scheme (PTAS). Our proof involves a novel reduction to a framework developed by Fu, Li, and Xu (2018), utilizing our optimal two-phase structure. Furthermore, we show that Pandora’s problem with nonobligatory inspection belongs to the class NP, which, combined with the hardness result of Fu, Li, and Liu (2022), settles the computational complexity of the problem. Finally, we provide a tight 0.8-approximation and a novel proof for committing policies (informally, the set of nonadaptive policies) for general classes of distributions, which was previously shown only for discrete and finite distributions by Guha, Munagala, and Sarkar (2008).


Introduction
Pandora's box problem, defined by Weitzman [Wei79], is a model of sequential search in which a searcher is presented a list of options to choose from, and obtaining information about the value of each option is costly. More formally, in a Pandora's box problem, a searcher is allowed to select a prize from one of n initially closed boxes. The values of the prizes inside the boxes are independent random variables, sampled from (not necessarily identical) distributions that are known to the searcher. The searcher chooses a sequence of operations, each of which is either opening a box or selecting a box. Opening box i has an associated cost c_i and results in learning the value v_i of the prize contained inside. Selecting box i results in a payoff of v_i and immediately ends the search process. The searcher's goal is to design an adaptive policy (i.e., a choice of which operation to perform next, for every possible past history of operations and their outcomes) to maximize its expected utility, defined as the expectation of the prize selected, minus the sum of the inspection costs paid while opening boxes. Weitzman shows that in a model of the problem where acquiring a box is only allowed after opening it, referred to as the obligatory inspection model, the optimal solution is nonadaptive and has a simple index-based structure.
However, in many real-world environments such as hiring or school search, the agent can acquire a box (select an option) "blind", i.e., without opening it and paying the inspection cost. Such scenarios motivate the nonobligatory inspection model, introduced by Doval [Dov18], where the searcher is allowed to acquire a box without opening it first. Prior literature presented evidence of the complexity of the optimal solution for Pandora's box problem with nonobligatory inspection. In particular, Doval presents an example of a problem instance (Problem 3 in [Dov18]) with three boxes, A, B, and C, such that the optimal policy first opens box A, but whether it subsequently opens box B before C or vice versa depends on the value of the prize discovered inside box A, making the order of inspection adaptive. Furthermore, [FLL22] recently showed that finding the optimal solution is NP-hard. It was even unknown whether the problem belongs to the class NP.
We study the Pandora's box problem with nonobligatory inspection and its optimal structure, and provide structural, complexity-class, and approximation scheme results. In what follows, we overview our main results and techniques.

Structure of the Optimal Policy
We show that despite the seemingly complicated nature of the optimal policy, e.g., its adaptive order of visiting boxes and its computational hardness, it has a simple structure. In fact, we show that there exists an optimal policy that follows only two different pre-determined orders and transitions from one to the other at most once.
A two-phase structure. We prove that the optimal policy sets an initial ordering π and a cutoff index k. It opens boxes one at a time according to this ordering until it either: (a) sees a sufficiently large value, in which case it concludes by running Weitzman's policy with obligatory inspection on the unopened boxes, or (b) reaches box π(k) without seeing a sufficiently large value, in which case it accepts box π(k) without inspection. Observe, for example, that this implies that there is just a single box that will ever be accepted without inspection. In other words, the optimal solution consists of two phases, where in each phase the order of visiting boxes is pre-determined and nonadaptive. Whenever the maximum observed value, hereafter called the outside option and denoted by w, exceeds the threshold, the policy switches to the second phase.
This result is summarized in the following statement, and also illustrated as Algorithm 1 in Section 3. The theorem is proved in Section 3.
Theorem 1.1. There exists an optimal policy specified by an ordering π : [n] → [n] of the boxes, a threshold τ : [n] → R for each index, and an index k, where 0 ≤ k ≤ n, such that while it has not terminated, it runs the following procedure for i = 1, . . ., k, sequentially.
• If i < k and the maximum observed value is less than the next threshold, i.e., w = max_{1 ≤ j < i} v_{π(j)} ≤ τ(i), then the policy opens box π(i).
• If i = k and the maximum observed value is less than the next threshold, i.e., w = max_{1 ≤ j < i} v_{π(j)} ≤ τ(i), then the policy claims box π(i) closed and terminates.
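To make the two-phase structure concrete, the following is a minimal Python simulation of such a policy on one realization of the prizes. All instance data (values, costs, reservation values, thresholds, order, cutoff) are hypothetical placeholders, not quantities from the paper; the real policy, of course, observes a value only upon paying the inspection cost.

```python
def two_phase_policy(values, costs, reservations, order, tau, k):
    """Simulate the two-phase policy of Theorem 1.1 on one realization.

    values[b], costs[b], reservations[b]: per-box prize, inspection cost,
    and Weitzman reservation value (illustrative inputs).
    order: inspection order (list of box ids); tau[i]: threshold for step i;
    k: cutoff step (1-based). Returns the realized utility.
    """
    utility, best = 0.0, 0.0
    opened = set()
    # Phase one: open boxes in the pre-determined order while the maximum
    # observed value stays below the per-step thresholds.
    for i in range(k - 1):
        if best > tau[i]:
            break                      # threshold exceeded: go to phase two
        box = order[i]
        utility -= costs[box]
        best = max(best, values[box])
        opened.add(box)
    else:
        if best <= tau[k - 1]:
            # Claim the backup box order[k-1] closed: take it uninspected.
            return utility + values[order[k - 1]]
    # Phase two: Weitzman's index policy on the unopened boxes, in
    # decreasing order of reservation value.
    for box in sorted((b for b in order if b not in opened),
                      key=lambda b: -reservations[b]):
        if best >= reservations[box]:
            break                      # current best beats every index left
        utility -= costs[box]
        best = max(best, values[box])
    return utility + best
```

Note how the single backup box is only claimed when phase one completes without the threshold ever being exceeded, matching the i = k case of the theorem.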
A few papers [GMS08, CL09, AKLS17, Dov18] have studied the same model in different contexts; see the related work section. [Dov18] introduced the model in the context of search theory as a variant of Weitzman's model.
Although the property that there is a unique box to be claimed closed has been shown previously by [GMS08] for discrete and finite distributions, the two-phase structure is a novel contribution.

Optimal Structure
We first consider a standard generalization of Pandora's box problem, where an outside option is given for free, and the searcher can select it at any point (as an alternative to selecting one of the boxes). This generalization provides a unified format for the original problem and its subproblems. Then, we study the behavior of the optimal searcher and the optimal expected utility, for any set of uninspected boxes, as a function of the outside option. Our key lemma (Lemma 3.3) proves that for any set of uninspected boxes there is a threshold such that for outside options above the threshold, the optimal policy never claims a closed box, and for outside options below the threshold, the optimal expected utility is constant. The constant optimal expected utility property implies that the optimal policy with any outside option below the threshold can simply mimic the actions of an optimal policy with outside option 0. On the other hand, since having an outside option above the threshold coincides with never claiming a closed box, in this situation the optimal policy can mimic the actions of Weitzman's policy. Furthermore, we extract additional properties of the outside options, which imply that as the searcher inspects boxes and the outside option (maximum observed value) is updated, there is at most one point where the outside option switches from being below the threshold of the uninspected boxes to above it. Altogether, these structural properties yield our main structural result, Theorem 1.1.

PTAS
As a consequence of Theorem 1.1 (and also of [GMS08] for discrete and finite distributions), there is an optimal policy that has at most one fixed box that it may claim closed. Therefore, based on which box the fixed one is (if any), we can limit the search to one of n + 1 possible optimal policies. In other words, we consider all n + 1 possibilities, find a PTAS for each, and output the one with the highest expected utility.
Our proof involves a novel reduction to a framework by [FLX18]. We first overview the framework, how it is used for stochastic probing problems, and the challenges in tailoring it to our problem. We conclude with a summary of how we overcame the challenges and performed the reduction.
[FLX18] establish a general framework for online stochastic problems and devise a PTAS for this general formulation. The stochastic dynamic program formulation in [FLX18] models a general online probing setting, where there is a set of elements, and the agent's goal is to adaptively probe the elements to maximize the expected reward. Whenever the agent probes an element, they get an immediate reward, and their internal state is updated. At the end of the process, the agent also gets a final reward dependent on their internal state. This framework has been successfully applied to many stochastic probing problems, the most relevant to ours being Probemax (adaptively choose elements to probe and get the maximum value among the probed elements) and the committed Pandora's box problem (similar to Pandora's problem with obligatory inspection, but elements are forfeited forever if not selected). These two problems respectively share two critical aspects with Pandora's problem with nonobligatory inspection: 1) the agent gets the maximum value among all elements probed, and 2) there is a cost of inspection. Although this makes a reduction from our problem to the [FLX18] framework a plausible approach, we face additional technical barriers not present in the prior reductions for Probemax and committed Pandora's box. While resolving these technical barriers, we uncover additional structure of our problem that may be relevant beyond our specific PTAS reduction.
For the original problem, this outside option is initially set to 0. Note that although this construction seems similar to committing policies [BK19], in contrast, here the policies can be order-adaptive (similar to the two-phase optimal policy), and the fixed box may be opened or claimed closed.
For a formal discussion of the [FLX18] framework, see Section 4.
Challenge 1: negative terms reflecting costs. We define the internal state to represent the best value (or an approximation of the best value) that the agent has seen in the past. However, almost all previous problems that reduce to [FLX18] and use internal states to represent element values have no cost of inspection. The framework requires the internal states to be supported on a set of constant size, which necessitates a discretization of the values. The canonical way to discretize the values is to round them (up or down) to an approximate value. However, since the reward at each step is the difference between the internal state and the cost incurred, s_t − c (where s_t denotes the internal state at step t, and c is the cost), rounding values to a nearby approximate value may completely distort the difference, preventing us from guaranteeing a small multiplicative approximation loss.
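A tiny numeric sketch (with made-up numbers) shows why naive rounding interacts badly with the cost term: when the value and the cost are close, rounding the value onto a grid that is fine relative to the value can still wipe out the entire marginal reward.

```python
def round_down(x, step):
    """Round x down to the nearest multiple of step (x, step > 0)."""
    return step * int(x / step)

value, cost = 100.5, 100.0                       # hypothetical numbers
exact_reward = value - cost                      # true marginal reward: 0.5
rounded_reward = round_down(value, 1.0) - cost   # reward after rounding: 0.0
# The rounding error (0.5) is under 0.5% of the value, yet it destroys
# 100% of the marginal reward value - cost.
```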
Prior techniques for eliminating costs. [KWW16] introduce a reduction from Pandora's box with obligatory inspection to a maximization problem without costs. They also introduce a property of policies called non-exposure and show that the optimal policy of the obligatory inspection variant satisfies it. Informally speaking, a policy is non-exposed if it selects any inspected box whose value is above the threshold of the box. In any non-exposed policy, whenever a box is selected, the gain is equal to a virtual value defined as a function of the revealed value and the properties of the box. The insight from [KWW16] for removing costs from the expected utility function was successfully utilized in [SS21] to prove the equivalence of Pandora's box with commitment and free-order prophets. Also, for the Pandora's box with nonobligatory inspection problem, [BK19] previously used ideas from [KWW16] to provide a utility upper bound and additional structure for the problem.

Failure of previous techniques, and a new reduction.
Unfortunately, the optimal policy for Pandora's problem with nonobligatory inspection may not always be non-exposed (see Example A.1 in Appendix A). However, given our knowledge of the two-phase structure of the optimal policy, we draw parallels between our two-phase policy and non-exposed policies, and introduce stage-non-exposed policies. Basically, we argue that although the optimal policy might not select a box when its value is above the threshold, the optimal policy will always enter phase two and gain its respective utility. It is easy to calculate the expected utility during and after the phase transition.
Challenge 2: discretizing values. Recall that by Theorem 1.1, our two-phase policy is determined by an order over the boxes and their thresholds. To define the internal states of the [FLX18] framework, after our cost-elimination reduction, we need to discretize the observed values and the potential thresholds onto a poly(1/ε)-sized support. We show that the optimal thresholds are fairly robust to minor changes and can be rounded down to multiples of ε · OPT between 0 and OPT/ε, where OPT is the optimal expected utility. However, discretizing the values proved more challenging. The standard way to discretize an element value is to truncate the value space at E[max_i v_i]/ε (the truncation at E[max_i v_i]/ε is essential to ensure that the probability of the value of any element being above the truncated upper limit is at most ε), and then discretize the values into increments of ε · E[max_i v_i]. However, since there is a potentially super-constant gap between the optimal utility OPT and the expected maximum value E[max_i v_i], the standard discretization methods do not work: discretizing the values into multiples of ε · E[max_i v_i] is too coarse to yield meaningful approximation guarantees, and discretizing into multiples of ε · OPT yields a good approximation of the agent's utility, but the resulting support has super-constant size. We resolve this issue by taking advantage of the contribution of each value in the utility formula and in the internal states of the [FLX18] framework. We conclude that although we cannot truncate the distribution at a constant multiple of OPT, for any fixed order, selecting only a constant-size support on this large range and discretizing onto it incurs a limited loss.
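For instance, the threshold grid described above (multiples of ε · OPT up to OPT/ε) can be generated as follows; `opt_estimate` stands in for any estimate of OPT, and the grid size 1/ε² + 1 is poly(1/ε):

```python
def threshold_grid(opt_estimate, eps):
    """Candidate thresholds: multiples of eps * OPT in [0, OPT / eps].

    Rounding the optimal thresholds down onto this grid is the
    discretization sketched above; the grid has 1/eps^2 + 1 points,
    which is poly(1/eps).
    """
    step = eps * opt_estimate
    count = int(round(1.0 / eps ** 2))   # (OPT/eps) / (eps*OPT) = 1/eps^2
    return [i * step for i in range(count + 1)]
```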
Challenge 3: dependence of the discretized support on the inspection order. At this point, given a fixed order of boxes, we have resolved how to discretize the values onto a support of constant size (although within a large range) so as to reasonably preserve the agent's utility. The next challenge is that we do not know the optimal order, and hence cannot select the discretization support! To resolve this issue, we show there is a bounded number of discretization methods. First, we show that the multiplicative gap between the optimal expected utility OPT and the expected maximum E[max_i v_i] is bounded by n: for each i, OPT ≥ E[v_i], because the optimal policy can claim any box closed, and hence (for nonnegative prizes) E[max_i v_i] ≤ Σ_i E[v_i] ≤ n · OPT. Then, as mentioned before, the support can always be truncated at E[max_i v_i]/ε. Thus, the number of distinct supports of constant size is bounded by a polynomial in n for any fixed ε; i.e., there are only this many discretization methods. Therefore, as input to the [FLX18] framework, we try all of these discretization possibilities, run the PTAS for each (one per discretization method), and use the discretization that results in the highest agent utility from the PTAS policy.
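Schematically, the outer loop over discretization methods looks as follows; `run_ptas` and `evaluate` are hypothetical placeholders for the [FLX18] PTAS specialized to a grid and for a utility estimator, not interfaces defined in the paper:

```python
def best_discretization(candidate_grids, run_ptas, evaluate):
    """Try every candidate discretization, run the PTAS on each, and keep
    the policy whose (estimated) expected utility is highest."""
    best_policy, best_utility = None, float("-inf")
    for grid in candidate_grids:
        policy = run_ptas(grid)        # PTAS specialized to this grid
        utility = evaluate(policy)     # estimate this policy's utility
        if utility > best_utility:
            best_policy, best_utility = policy, utility
    return best_policy
```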

Related Work
Prior work. Pandora's problem (with obligatory inspection) was first proposed and analyzed in [Wei79], which shows that an elegant nonadaptive policy (which opens boxes in a pre-defined order with pre-defined thresholds, and selects the first box with value above its threshold) is optimal. [KWW16] provide a new interpretation of the problem and study various applications. Since the introduction of Pandora's problem, multiple papers in different communities [GMS08, CL09, AKLS17, Dov18] have independently introduced and studied a stochastic probing problem that is in essence equivalent to Pandora's problem, but with nonobligatory inspection. This variant was then further studied in [BK19, FLL22]. We overview the prior works most related to ours below.
[Dov18] explicitly formulates the nonobligatory inspection problem as a generalization of the original Pandora's problem and shows that the optimal policy may have a complicated structure. In particular, unlike in the original Pandora's problem, there exist distributions for which no nonadaptive policy is optimal. This inspired the theory community to work on approximation algorithms and hardness results, as well as to develop other variants of Pandora's problem. In addition, she provides sufficient conditions on the parameters of the problem under which she characterizes the optimal policy.
[GMS08] focus on discrete and finite distributions, and provide a structural result showing that in the optimal policy, at most one box will ever be claimed closed. They also provide a 0.8-approximately optimal solution. Due to the discrete nature of the environment, they focus on optimal decision trees, where each node in the tree represents the remaining uninspected boxes and the maximum observed value (outside option). For their structural result, they start with an arbitrary optimal policy, and replace subtrees with higher outside options by subtrees with lower outside options while maintaining optimality. In our structural result, we use a similar idea. In particular, after we characterize the optimal utility as a function of the outside option, our optimal policy mimics the actions of an optimal policy with outside option 0 in the constant part of the utility function. However, in contrast to [GMS08], our techniques work for general distributions, and we give an explicit characterization of the optimal policy.
[KWW16] provided an alternative proof for Pandora's problem by reducing it to a maximization problem without costs. This also lets them compute the expected utility of Weitzman's policy, which we make extensive use of. A more detailed discussion can be found in Section 2 and Section 4.
Concurrent Work. Concurrently with and independently of our present work, Fu, Li, and Liu also obtain a PTAS for Pandora's problem with nonobligatory inspection.
To the best of our knowledge, their concurrent work contains a structural result, and their proof of the PTAS contains some similar ideas (e.g., their work also uses the [FLX18] framework, and they use similar techniques for discretizing the random variables). In addition, Fu, Li, and Liu prove that finding the optimal policy for Pandora's problem with nonobligatory inspection is NP-hard. An initial manuscript of their paper [FLL22] includes the hardness result as well as an improved approximation ratio for committing policies over [BK19]. We learned this through personal correspondence with the authors.
Additional Related Work. Finally, there is a growing body of work that extends Pandora's box problem to various other settings, such as Pandora's box with additional order constraints [BFLL20]; with correlated value distributions [CGT+20]; where the agent needs to commit to taking the box or forfeiting it forever at each step [FLX18, SS21]; where each box can be partially opened at a reduced cost [AJS20]; where each box can be inspected using different methods, each at a different cost (a generalization of the nonobligatory inspection model) [Bey19]; where the cost-of-inspection model is generalized to various combinatorial optimization problems [Sin18]; etc. This recent trend illustrates a general community interest in exploring online decision problems that model the cost of inspection.

Organization
The rest of the paper is organized as follows. In Section 2, we introduce the model and provide preliminaries. In Section 3, we characterize the structure of the optimal policy and prove Theorem 1.1. In Section 4, we provide a PTAS for Pandora's problem with nonobligatory inspection. In Appendix A, Appendix B, and Appendix C, we provide missing proofs from Section 1, Section 3, and Section 4, respectively.

Model and Preliminaries
An agent has a set of n boxes, denoted by M. Box i, 1 ≤ i ≤ n, contains a prize v_i, distributed according to distribution F_i with expected value E[v_i]. The support of the distribution of box i is Θ_i, and Θ = ∪_i Θ_i is the union of all supports. Prizes inside boxes are independently distributed. Box i has inspection cost c_i. While F_i and c_i are known, v_i is not.
The agent sequentially inspects boxes, and search is with recall. Given a set of uninspected boxes U and a vector of realized sampled prizes v, the agent decides whether to stop or to continue search; if she decides to continue, she decides which box in U to inspect next. If she decides to inspect box i, she pays cost c_i to instantaneously learn her value v_i. If she decides to stop, she can select whichever box she pleases, regardless of whether it is inspected or not. We use I_i as an indicator for box i being inspected and A_i as an indicator for the agent obtaining box i. Since only one box can be obtained, Σ_i A_i ≤ 1. The agent is an expected-utility maximizer, where utility u is defined as the value of the box selected minus the sum of inspection costs paid. Given v, the vector of realized sampled prizes, and the two vectors of indicator variables, A and I, respectively indicating which boxes were selected and inspected, we have u = Σ_i (A_i · v_i − I_i · c_i). An important variant of the problem, in which inspection is required, was introduced and optimally solved by Weitzman [Wei79]. He showed that when A_i ≤ I_i for every i, an index-based policy is optimal. In this policy, the agent inspects boxes in decreasing order of their indices σ_i, where σ_i is the unique solution to c_i = E[(v_i − σ_i)^+], and is also known as the reservation value of box i. The search stops either when one of the realized values is above the reservation value of every remaining uninspected box, or when the agent has inspected all of the boxes. Kleinberg et al. 
[KWW16] develop a new interpretation of Weitzman's characterization. They introduce a family of random variables κ_i := min{v_i, σ_i}, defined for each box i. These random variables are used to reduce Pandora's problem with obligatory inspection to a problem without costs, and to provide an upper bound on its optimal expected utility. They also introduce an important property of policies for the original Pandora's box problem called non-exposure, which they show Weitzman's policy satisfies, and hence prove that the upper bound is tight. We provide the definition and related statements below.

Definition 2.1. [KWW16] A policy is non-exposed if it is guaranteed to select any inspected box i whose value satisfies v_i > σ_i. Namely, (I_i − A_i) · (v_i − σ_i)^+ is always exactly equal to 0.

Lemma 2.2. [KWW16] The expected utility of any policy is at most E[Σ_i A_i · κ_i] ≤ E[max_i κ_i], with equality in the first inequality if the policy is non-exposed.

In order to represent the internal states of the Pandora's box problem, we consider a generalization in which we are given a set of uninspected boxes U, and the setting is exactly the same as in the original problem, except that we are also given an outside option w for free. We denote this problem, i.e., Pandora's box problem with nonobligatory inspection for uninspected boxes U and outside option w, by P(U, w). Using the same notation, our original problem is P(M, 0). Similarly, we denote the state of the problem with the set of uninspected boxes U and maximum observed value w by (U, w). Due to this formulation, we use the terms outside option and maximum observed value interchangeably, and denote both by w.
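To make these preliminaries concrete, the following sketch computes the reservation value σ_i (the root of c_i = E[(v_i − σ_i)^+]) for a finite distribution by bisection, and the quantity E[max_i κ_i] with κ_i = min{v_i, σ_i} by exact enumeration over the product of supports. The instance data in the test are illustrative.

```python
from itertools import product

def reservation_value(dist, cost, tol=1e-9):
    """Bisection for sigma solving cost = E[(v - sigma)^+].

    dist: list of (value, probability) pairs. Assumes the root is
    nonnegative, i.e., cost <= E[v^+], so bisection on [0, max v] works.
    """
    def excess(s):  # E[(v - s)^+], nonincreasing in s
        return sum(p * max(v - s, 0.0) for v, p in dist)
    lo, hi = 0.0, max(v for v, _ in dist)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def capped_max_expectation(dists, sigmas):
    """Exact E[max_i kappa_i], kappa_i = min(v_i, sigma_i), over the
    product of independent finite distributions (exponential in n;
    for illustration only)."""
    total = 0.0
    for outcomes in product(*dists):
        prob, capped_best = 1.0, 0.0
        for (v, p), s in zip(outcomes, sigmas):
            prob *= p
            capped_best = max(capped_best, min(v, s))
        total += prob * capped_best
    return total
```

For a single box whose prize is 0 or 2 with equal probability and cost 0.5, the reservation value is 1 and E[max_i κ_i] = 0.5, which matches Weitzman's expected utility for this instance (open and keep: 0.5 · 2 − 0.5 = 0.5), illustrating tightness of the bound.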
Without loss of optimality, we only consider policies whose actions depend only on the set of uninspected boxes and the maximum observed value (outside option). Also, when studying optimal policies, we consider those that are pointwise optimal, i.e., optimal for any state (U, w) they reach, even states reached with probability 0. We denote the optimal expected utility of problem P(U, w) by OPT(U, w). Furthermore, without loss of optimality, we focus on deterministic policies.
For a policy ALG and current state (U, w), we define the following.

Definition 2.4 (state transition). For any policy ALG, we write T_ALG(U, w) for the set of all valid state transitions from state (U, w) when using policy ALG.

Definition 2.5 (plausible sequence of states). We call a sequence of states plausible for ALG if every state in the sequence is a valid transition (per Definition 2.4) of the state preceding it.

Definition 2.6 (reachable state). For any policy ALG, we write R_ALG(U, w) for the set of all states that are reachable by policy ALG from state (U, w). Formally, a state (U′, w′) ∈ R_ALG(U, w) if and only if there exists a plausible sequence of states for ALG that starts at (U, w) and ends at (U′, w′). For the sake of simplicity, we write R(ALG) for the set of all states that are reachable by policy ALG from state (M, 0). For instance, if policy ALG opens box i first, then for any i′ ∈ M, i′ ≠ i, the state (M \ {i′}, 0) is not reachable by ALG from (M, 0), since ALG must inspect box i as its first action.
Definition 2.7 (use a backup box). We say that a policy ALG uses a backup box for problem P(U, w) if either ALG claims a box closed up front (namely, ALG(U, w) = Close), or there exists a state (U′, w′) ∈ R_ALG(U, w) such that ALG uses a backup box for problem P(U′, w′).

Structure of the Optimal Policy
The main contribution of this section is proving the two-phase structure of the optimal policy stated in Theorem 1.1. First, we study the optimal expected utility as a function of the outside option. As an immediate observation, the optimal utility is an increasing function of the outside option; however, as we show, there is more structure to it. Specifically, for any set of uninspected boxes U, there exists a threshold τ(U) such that the optimal utility for P(U, w) is the same for any outside option w that does not exceed the threshold, and is strictly higher for those exceeding the threshold. Furthermore, there is always an optimal policy that uses a backup box when the outside option is below the threshold, while no optimal policy uses a backup box when the outside option exceeds the threshold. Then, we show that in any optimal policy for P(M, 0), there is at most one transition point such that before this point the outside option (current maximum observed value) is always below the threshold of the current uninspected boxes, and after the point, it is always above. Finally, using this structure, we show there exists an optimal policy that, while the outside option is below the threshold, takes the next action as if the outside option were 0, and after the transition point, follows Weitzman's policy, proving the structure of Theorem 1.1.
Full proofs of the section are in Appendix B.
Definition 3.2 (τ(U), threshold for uninspected boxes). With abuse of notation, if there exists an optimal policy of P(U, 0) that uses a backup box with positive probability, let τ(U) ≥ 0 be the value satisfying the following property: there exists an optimal policy of P(U, w) that uses a backup box with positive probability if 0 ≤ w ≤ τ(U), and there does not exist any optimal policy of P(U, w) that uses a backup box if w > τ(U). If no optimal policy of P(U, 0) uses a backup box with positive probability, let τ(U) = NEG, where for ease of notation we assume NEG < 0.
Lemma 3.3 asserts that for any set of boxes such a threshold exists.
Lemma 3.3. For each set of boxes U, the threshold τ(U), as defined in Definition 3.2, exists.
Proof. If there is no optimal policy that uses a backup box with positive probability for P(U, 0), then τ(U) = NEG and exists by definition. Therefore, for the remainder of the proof, we focus on the case that there is an optimal policy for P(U, 0) that uses a backup box. The proof consists of two main steps. In the first step, we show that for any set of boxes U, there exists a threshold τ(U) such that for outside options w > τ(U), no optimal policy for P(U, w) uses a backup box with positive probability, and for w ≤ τ(U), OPT(U, w) = OPT(U, 0). In the second step, we show that τ(U) from the first step equals max{w : OPT(U, w) = OPT(U, 0)}, and that there is an optimal policy using a backup box with positive probability for every outside option below the threshold.
The proof of the first step is by induction over |U|, the number of boxes in the problem. Let τ(U) be the largest value such that an optimal policy with outside option τ(U) uses a backup box. If there is a single box, this means that the optimal utility of P(U, τ(U)) is equal to the expected value of the box, which is equal to the optimal utility in the no-outside-option scenario P(U, 0). Using Observation 3.1, this concludes the base case of the induction. For |U| > 1, there are two possibilities. If an optimal policy for P(U, τ(U)) claims a closed box in the first step, the argument is similar to the case |U| = 1. Otherwise, if the optimal policy starts by opening box i and observing value v_i, designing the optimal policy for the remaining boxes is equivalent to designing the optimal policy for the boxes other than i with an outside option equal to the maximum of τ(U) and v_i (Equality 1). Since the optimal policy for P(U, τ(U)) uses a backup box, there exists some value v′ for which the subproblem (the problem for U \ {i}) uses a backup box, implying τ(U \ {i}) ≥ τ(U). We split the utility into the two parts: one where v_i ≤ τ(U \ {i}), so the outside option is at most τ(U \ {i}), and one where v_i > τ(U \ {i}), so the outside option equals v_i (Equality 2). By the induction hypothesis, the part where v_i ≤ τ(U \ {i}) has optimal utility equal to that of the subproblem with outside option 0 (Equality 3). The sum of the two parts equals the utility of the problem with the set of boxes U and no outside option, where the first action is opening box i and the remaining actions follow an optimal policy for the resulting subproblem (Equality 4). This constructs a policy for P(U, 0), so the utility is at most OPT(U, 0) (Inequality 5). By Observation 3.1, the inequality is in fact an equality. This concludes the first step of the proof.
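The chain of steps referenced as Equalities 1–4 and Inequality 5 can be written, up to the exact formalization given in Appendix B, as the following sketch (with τ for the thresholds of Definition 3.2, the expectation taken over v_i, and 1{·} denoting an indicator):

```latex
\begin{align*}
\mathrm{OPT}(U,\tau(U))
 &= -c_i + \mathbb{E}\big[\mathrm{OPT}\big(U\setminus\{i\},\,\max(\tau(U), v_i)\big)\big] \tag{1}\\
 &= -c_i + \mathbb{E}\big[\mathrm{OPT}\big(U\setminus\{i\},\,\max(\tau(U), v_i)\big)\cdot \mathbf{1}\{v_i \le \tau(U\setminus\{i\})\}\big] \\
 &\qquad\;\, + \mathbb{E}\big[\mathrm{OPT}\big(U\setminus\{i\},\,v_i\big)\cdot \mathbf{1}\{v_i > \tau(U\setminus\{i\})\}\big] \tag{2}\\
 &= -c_i + \mathbb{E}\big[\mathrm{OPT}\big(U\setminus\{i\},\,0\big)\cdot \mathbf{1}\{v_i \le \tau(U\setminus\{i\})\}\big] \\
 &\qquad\;\, + \mathbb{E}\big[\mathrm{OPT}\big(U\setminus\{i\},\,v_i\big)\cdot \mathbf{1}\{v_i > \tau(U\setminus\{i\})\}\big] \tag{3}\\
 &= -c_i + \mathbb{E}\big[\mathrm{OPT}\big(U\setminus\{i\},\,v_i\big)\big] \tag{4}\\
 &\le \mathrm{OPT}(U, 0). \tag{5}
\end{align*}
```

Here (2) splits on whether v_i exceeds τ(U∖{i}); (3) uses the induction hypothesis together with max(τ(U), v_i) ≤ τ(U∖{i}) on the first branch; (4) recombines the branches, since OPT(U∖{i}, v_i) = OPT(U∖{i}, 0) whenever v_i ≤ τ(U∖{i}); and (5) holds because the expression in (4) is the utility of a feasible policy for P(U, 0) that opens box i first.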
Now, we move on to the second step of the proof. So far, we have shown that there exists τ(U) such that no optimal policy with a strictly larger outside option uses a backup box, and all optimal policies with outside option below the threshold have the same utility. We first show that for any nonnegative outside option w′ below the threshold, there exists an optimal policy that uses a backup box. This is straightforward, because OPT(U, w′) = OPT(U, 0) implies that following any optimal policy for P(U, 0) is optimal for P(U, w′). Since we assumed there exists an optimal policy of P(U, 0) that uses a backup box, there exists one that uses a backup box for any P(U, w′) with 0 ≤ w′ ≤ τ(U). Since all the problems with outside options w′ ≥ 0 satisfying OPT(U, w′) = OPT(U, 0) have an optimal policy that uses a backup box, for any outside option w′′ ≥ 0 for which no optimal policy uses a backup box, OPT(U, w′′) > OPT(U, 0). This concludes the proof.
The following lemma shows that in any optimal policy, the thresholds from Definition 3.2 for the sets of uninspected boxes are such that once the maximum observed value exceeds the threshold at some stage, it also exceeds the thresholds at all later stages.

Lemma 3.4. Let OAL be an arbitrary optimal policy for problem P(U, w) and let (U, w), (U_1, w_1), . . ., (U_m, w_m) be any plausible sequence of states for OAL. Then, if w_i > τ(U_i) for some i, then w_j > τ(U_j) for every j ≥ i.
Proof sketch. The proof is by contradiction. We show that if w_i < τ(U_i), then P(U_i, w_i), and therefore P(U, w), have optimal policies that use backup boxes, which implies w ≤ τ(U).
The following lemma states that there exists an optimal policy that, whenever the outside option (maximum observed value) exceeds the threshold of the uninspected boxes, runs Weitzman's policy, and whenever the maximum observed value is below the threshold of the uninspected boxes, takes the same action as it would with outside option 0.

Lemma 3.5. There exists an optimal policy OAL for problem P(M, 0) that satisfies the following for any reachable state (U, w) ∈ R(OAL):
• When w ≤ τ(U), OAL(U, w) = OAL(U, 0);
• When w > τ(U), OAL(U, w) = W(U, w), where W(U, w) represents the action Weitzman's policy would take given that U is the set of uninspected boxes and w is the maximum value obtained so far by the algorithm.
Proof sketch. If no optimal policy uses a backup box for P(M, 0) with positive probability, then Weitzman's policy is an optimal policy satisfying the statement. Note that for any reachable state (U, v) of Weitzman's policy, v > τ(U); otherwise there is an optimal policy that claims a closed box, which contradicts the initial assumption. Now, suppose there exists an optimal policy that uses a backup box for P(M, 0). Let a_1, …, a_k be the first actions of pointwise optimal deterministic policies for problems P(U_0, 0), P(U_1, 0), …, P(U_{k−1}, 0), respectively, where U_0 = M, U_i = M \ {b_1, …, b_i} with b_i the box opened by action a_i, and k is the first time in the sequence where the action taken, i.e., a_k, is terminal. Note that since for each problem in the sequence the outside option is 0, claiming a closed box yields at least as much utility as taking the outside option. Therefore, we may assume a_k = Close. The remaining step of the proof constructs OAL that follows the sequence of actions a_i as long as the maximum observed value is below the threshold, and follows Weitzman's policy whenever it is above the threshold. Note that by Lemma 3.3, the optimal utility of P(U_i, v) is equal to that of P(U_i, 0) when v ≤ τ(U_i), and therefore following the optimal action for P(U_i, 0) is also optimal for P(U_i, v). Also, when v > τ(U_i), no optimal policy uses a backup box with positive probability, and conditioned on not using a backup box, following Weitzman's policy is optimal. The formal discussion can be found in Appendix B.
Proof of Theorem 1.1. Let OAL be an optimal policy for the problem P(M, 0) satisfying the conditions in Lemma 3.5. If τ(M) is negative, then by Lemmas 3.4 and 3.5, OAL follows Weitzman's policy, implying the statement of the theorem. Now, suppose τ(M) ≥ 0. By definition, OAL uses a backup box. Let a_1, …, a_k be the sequence of actions that OAL takes as long as the observed values are below the threshold (where by Lemma 3.5, these observed values may be treated as 0). Without loss of optimality, we may assume that a_k is the first action in this sequence by which OAL claims a closed box and that a_1, …, a_{k−1} correspond to opening boxes. This is trivial: by assumption this sequence includes an action claiming a closed box, all actions before claiming a closed box are opening boxes, and once a box is claimed closed the policy is at a terminal state. By Lemma 3.5, while the observed values are below the threshold, OAL takes a_1, …, a_k. By Lemma 3.4, the maximum observed value switches at most once from being below the threshold to being above the threshold, and once the maximum observed value is above the threshold, by Lemma 3.5, OAL follows Weitzman's policy.
We conclude by defining the parameters in the statement of the theorem. (As mentioned in Section 2, a policy is pointwise optimal if it is optimal for any reachable state, even those with probability 0.) The index k corresponds to the time at which OAL claims a closed box when all observed values until that time were below their thresholds; k is 0 if no optimal policy uses a backup box. For i ≤ k, π(i) corresponds to the box visited at time i by OAL if all the observed values were below their thresholds. Finally, τ_i = τ(M \ {π(1), …, π(i − 1)}) when τ(M \ {π(1), …, π(i − 1)}) ≥ 0, and is equal to a negative value otherwise.
Algorithmically, the optimal policy that satisfies the conditions in Theorem 1.1 belongs to a class of policies that, given an initial order and thresholds over the boxes, switches its order of inspection (from the initial order provided to Weitzman's order) at most once. We term this class of policies two-phase policies, described in Algorithm 1. (Notice that the thresholds could be negative.)

PTAS
In this section, we will present a PTAS for Pandora's problem with nonobligatory inspection (denoted as the problem P := P(M, 0)). We will eventually reduce our problem to the general stochastic dynamic program formulation in [FLX18], but we need several intermediate steps to overcome difficulties caused by 1) our reward function having a negative cost term, and 2) the values of the boxes needing discretization. We will describe our reduction in the following order. In Section 4.1, we will introduce the stochastic dynamic program formulation of [FLX18] and its relevance to our problem. In Section 4.2, we will reduce P to a variant that fixes the unique box b* that may be claimed closed, hereafter referred to as the backup box. This variant, which we call P*, enables us to focus on a fixed backup box for future reductions. In Section 4.3, we introduce the notion of a pre-specified index-threshold sequence that is relevant to all steps in our reduction. In Section 4.4, we focus on the P* problem and rephrase it using the new notion. In Section 4.5, we will prove that the thresholds in the optimal two-phase policy are robust to additive perturbations. Hence, we can reduce the search space for each threshold to O(1/ε) values without much loss in the utility. In Section 4.6, we will reduce the P* problem to a problem that always has nonnegative reward at each step, which we call Tweaked P* (abbreviated as TP*). This resolves our concern about the negative cost terms in our reward function. In Section 4.7, we will discretize the TP* problem so that the value space of the system has constant support, and call the resulting problem DTP*. Finally, in Section 4.8, we formulate the DTP* problem as the stochastic dynamic program (ST*) specified in [FLX18], for which there exists a PTAS.
All missing proofs of this section are in Appendix C.

The Stochastic Dynamic Program Formulation in [FLX18]
Here, we formally introduce the stochastic dynamic program, which is specified by a tuple (V, A, f, g, h, T) and admits a PTAS with parameter ε. We will also discuss several constraints on the parameters that are crucial to the existence of the PTAS (that text will be in italic).
• V describes the set of all possible internal values, which needs to be of a size that only depends on ε.
• A = ∪_b A_b ∪ {⊥} describes the action set, where A_b describes the different ways to probe element b, and ⊥ represents not probing anything. For each element b, A_b must be of a size that only depends on ε, and A must be polynomial in the number of elements. Moreover, the agent can never probe the same element twice (namely, pick two actions from the same A_b set).
• f describes how the value of the system changes from step t to t + 1 (i.e., the internal value at step t + 1 is s_{t+1} = f(s_t, a_t), where a_t is the action at step t). The value of the system must be non-decreasing in t.
• g(s_t, a_t) describes the immediate reward the agent gets at step t, given internal value s_t and that the agent takes action a_t. Notice that g can only depend on the value and action at step t, but not on the values and actions before step t. Furthermore, g(s_t, a_t) can be stochastic but must have nonnegative expected value.
• Finally, T represents the maximum number of steps the policy can take before terminating. h(s_{T+1}) describes the final additional reward at the end of the process, which depends on the value of the system before the policy terminates. h(s_{T+1}) must be pointwise nonnegative.
• At the end of the process, the agent gets total reward h(s_{T+1}) + Σ_{t=1}^{T} g(s_t, a_t). Here, if the agent decides to terminate the process early at step t*, we view it as the agent taking the null action ⊥ for all steps t′ > t*, getting zero immediate reward for those steps.
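The total-reward accounting above can be sketched as a single simulated trajectory. This is an illustrative sketch only: `policy(t, s)` returns an action (or `None` for the null action), and `f`, `g` take an `rng` argument since they may be stochastic.

```python
def run_dynamic_program(s0, policy, f, g, h, T, rng=None):
    """Simulate one trajectory of a program in the [FLX18] format (V, A, f, g, h, T)."""
    s, total, used = s0, 0.0, set()
    for t in range(T):
        a = policy(t, s)
        if a is None:          # null action: zero immediate reward, no probe
            continue
        assert a not in used, "the same element may never be probed twice"
        used.add(a)
        total += g(s, a, rng)  # immediate reward (nonnegative in expectation)
        s = f(s, a, rng)       # internal value transition (non-decreasing in t)
    return total + h(s)        # pointwise nonnegative final reward
```

Early termination is modeled exactly as in the bullet above: the policy returns `None` for every step after it stops.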

Algorithmic Representation and Fixing Backup Box
As we have seen in previous sections, the optimal policy (or at least one optimal policy) is a two-phase policy, described in Algorithm 1, with some initial order π and thresholds τ_1, …, τ_k, where b* is the unique box that may be claimed closed, hereafter referred to as the backup box. In particular, when τ_1 is negative, the two-phase policy does not use any backup box; in this case, the two-phase policy must be Weitzman's policy. Otherwise, when the two-phase policy uses the backup box with nonzero probability, there are only n = |M| choices for the backup box. In this case, all of the τ_i's (for 1 ≤ i ≤ k) are nonnegative.
To make our life easier in our reductions, we will mainly study a variant of the P problem (which we will call P*), where we are only allowed to claim one specific box b* closed without inspection. If for each b* ∈ M we could find an approximately optimal policy ALG(b*) for problem P* with nonnegative thresholds, then simply taking the utility-maximizing policy among ALG(b*) for each b* ∈ M and Weitzman's policy gives an approximately optimal policy for problem P. From now on, we will consider two-phase policies with a predetermined backup box b* and nonnegative thresholds τ_1, …, τ_k (illustrated in Algorithm 2). From this point on, we will use OPT := OPT(M, 0) to denote the optimal expected utility of problem P := P(M, 0). Similarly, we will use OPT* to denote the optimal expected utility of problem P*, which fixes the backup box b*.
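The reduction from P to the fixed-backup-box variant is just a maximum over n + 1 candidates, as the paragraph above describes. In this sketch, `solve_p_star(b)` and `weitzman_utility()` are assumed helpers standing in for an (approximate) solver of P with backup box `b` and for the utility of plain Weitzman's policy, respectively.

```python
def approx_opt(boxes, solve_p_star, weitzman_utility):
    """Combine the fixed-backup-box solutions into a candidate for P (sketch)."""
    best = weitzman_utility()          # covers the no-backup-box case
    for b in boxes:
        best = max(best, solve_p_star(b))  # one P* instance per backup box
    return best
```

If each `solve_p_star(b)` is a (1 − ε)-approximation of OPT*, the maximum is a (1 − ε)-approximation of OPT.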

Index-Threshold Sequence, Classes of Policies, and Utilities
First, we introduce the index-threshold sequence, which is crucial for all the reduction steps and the various classes of policies to be defined. Having fixed a backup box b* and a position k + 1 for the backup box in the order, the index-threshold sequence determines the boxes visited in order before the backup box and their respective thresholds.
We will define an index-threshold sequence as an ordered sequence of box indices followed by an ordered sequence of threshold values of the same length. We will use ord = (b_1, …, b_k, τ_1, …, τ_k) to denote such a sequence.

Algorithm 2 (two-phase policy with fixed backup box b*):
1: for i = 1, …, k do
2: Open box b_i, observe value v_{b_i} from the box.
3: if v_{b_i} > τ_i then
4: Run Weitzman's policy on remaining boxes from state (U_{i+1}, v_{b_i}).
5: Terminate.
6: end if
7: end for
8: Claim box b* closed.
As we shall soon see in our reductions, for each problem P ∈ {P*, TP*, DTP*, ST*}, we will construct a class of policies C_P such that an index-threshold sequence ord = (b_1, …, b_k, τ_1, …, τ_k) completely determines a specific policy within this class. Furthermore, there exists an optimal policy to problem P that lies in the set C_P. For instance, for the problem P*, C_{P*} would be the class of all two-phase policies with backup box b*. C_{TP*}, C_{DTP*}, and C_{ST*} will contain closely related policies to the two-phase policies. If a policy ALG belongs to the class of policies C_P and is determined by ord, we will say that ALG is parameterized with ord.
We will also define a property of policies called below-threshold-nonadaptive that holds for every policy in all policy classes C_P that we will define. Note that unlike the two-phase property, which specifies the action when a value exceeds the threshold (following Weitzman's policy), the below-threshold-nonadaptive property is more general and does not specify the action in this case. It only constrains behavior while the values are below the thresholds, and captures policies that are nonadaptive in that regime.

Definition 4.2 (Below-Threshold-Nonadaptive). A policy ALG parameterized with ord = (b_1, …, b_k, τ_1, …, τ_k) is below-threshold-nonadaptive if:
1. ALG opens boxes in the fixed order b_1, b_2, …, b_k while none of the values of the previously opened boxes has exceeded its threshold.
2. Given that before step i none of the values of the previously opened boxes has exceeded its threshold, ALG's expected utility from steps ≥ i is independent of the values it sees in steps < i.
Definition 4.3 (u_P(ord)_{≥i} and u_P(ord)). Let ALG be the algorithm parametrized by ord in class C_P. We define u_P(ord)_{≥i} as the expected utility ALG gets from step i onward, conditioned on the values of the boxes in steps 1, …, i − 1 being below their respective thresholds. Since all policies we consider are below-threshold-nonadaptive, namely their utility is independent of previous values as long as no box with an above-threshold value has been seen, u_P(ord)_{≥i} is well defined. We will use u_P(ord) to denote the overall expected utility from ord (namely, u_P(ord) = u_P(ord)_{≥1}).
We will make extensive use of these utility notations in our proofs, especially when comparing the achievable utility of related problem formulations. Correspondingly, E[Weitz_U] and E[Weitz_U(v)] will denote the utility of Weitzman's policy on box set U with no outside option and with outside option v, respectively.

When analyzing the utility of a policy parameterized with ord, we use the following convention for expectations.
Note (Expectations over Weitz_{≥i} and Weitz_{≥i}(v)). When we take expectations over the terms Weitz_{≥i} and Weitz_{≥i}(v), we will always take the expectation over v_j : j ∈ U_i, irrespective and independent of the range of v we are taking the expectation over. Hence, we will omit the subscript v_j : j ∈ U_i when taking expectations. E.g., when we use the notation E[Weitz_{≥i}], we mean E_{v_j : j ∈ U_i}[Weitz_{≥i}], and when we use the notation E_{v > τ_i}[Weitz_{≥i}(v)], we mean E_{v > τ_i, v_j : j ∈ U_i}[Weitz_{≥i}(v)].

P*
We will begin by defining C_{P*}, which will simply be the set of all two-phase policies with nonnegative thresholds. Recall that the two-phase policy (Algorithm 1) for problem P is determined by an initial box order and thresholds. Given that problem P* fixes the backup box, the class of two-phase policies C_{P*} is determined by ord = (b_1, …, b_k, τ_1, …, τ_k). This proves the validity of our choice of C_{P*}.
We will also write out the utility recurrence formula for a two-phase policy ALG parameterized by ord at step i. At step i, ALG inspects box b_i and pays cost c_{b_i}. Then with probability Pr[v_{b_i} ≤ τ_i], the algorithm ignores the current value and transitions to step i + 1 in phase one. With probability Pr[v_{b_i} > τ_i], the algorithm transitions into phase two and gets the same utility as Weitzman's policy would with outside option v_{b_i}. Hence we have the following recurrence:

u_{P*}(ord)_{≥i} = −c_{b_i} + Pr[v_{b_i} ≤ τ_i] · u_{P*}(ord)_{≥i+1} + Pr[v_{b_i} > τ_i] · E_{v_{b_i} > τ_i}[Weitz_{U_{i+1}}(v_{b_i})],

with base case u_{P*}(ord)_{≥k+1} = E[v_{b*}]. Finally, we have the following property for the optimal policy.
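For discrete prize distributions, the recurrence above can be evaluated by a single backward pass. In this sketch, `weitz_value(i, v)` is an assumed helper returning the expected utility of Weitzman's policy on the boxes remaining after step `i` with outside option `v`.

```python
def two_phase_utility(boxes, order, thresholds, backup, weitz_value):
    """Backward evaluation of the two-phase utility recurrence (sketch).

    boxes[b] = (cost, [(value, prob), ...]) for each box's discrete distribution.
    """
    # base case: if phase one finishes, the backup box is claimed closed
    u = sum(v * p for v, p in boxes[backup][1])
    for i in reversed(range(len(order))):
        cost, dist = boxes[order[i]]
        stay = sum(p for v, p in dist if v <= thresholds[i])          # Pr[v <= tau_i]
        jump = sum(p * weitz_value(i, v) for v, p in dist if v > thresholds[i])
        u = -cost + stay * u + jump
    return u
```

Each iteration pays the inspection cost, keeps the phase-one continuation with the below-threshold probability, and otherwise collects the Weitzman continuation utility.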

Discretizing Action (Threshold) Space
To have a polynomial-sized action space A, we need to discretize the thresholds. In this section, we will prove that the utility of the optimal index-threshold sequence ord is fairly robust to perturbations of the threshold values for the problem P*.
Our first claim says that there exists an optimal two-phase policy for problem P* where all the thresholds are no larger than OPT. This claim provides an upper bound on the search space for the optimal thresholds.

Claim 4.6. For problem P*, any optimal two-phase policy parametrized by ord = (b_1, …, b_k, τ_1, …, τ_k) satisfies τ_i ≤ OPT for all i ∈ [k].

Next, we prove that we can search only through index-threshold sequences with thresholds in increments of ε · OPT, and find a good ord whose associated two-phase policy gets at least OPT* − ε · OPT utility in the P* problem. This enables us to restrict ourselves to thresholds that are multiples of ε · OPT during our reductions in the next few sections.

Proposition 4.7. Let ord* = (b_1, …, b_k, τ_1, …, τ_k) be the parameter associated with an optimal two-phase policy for problem P* that satisfies Claim 4.5. Then there exists another index-threshold sequence ord′ whose thresholds are multiples of ε · OPT and whose associated two-phase policy gets utility at least OPT* − ε · OPT.
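The resulting threshold search space is just a grid of multiples of ε · OPT up to OPT, which the following sketch enumerates (here `opt_upper` stands in for OPT, which in practice is only known approximately).

```python
def threshold_grid(opt_upper, eps):
    """Candidate thresholds {0, eps*OPT, 2*eps*OPT, ..., OPT} (sketch)."""
    step = eps * opt_upper
    n = int(1 / eps)          # grid has 1/eps + 1 points, constant for fixed eps
    return [i * step for i in range(n + 1)]
```

Searching only over this grid, rather than all real thresholds, is what makes the per-box action set of size O(1/ε) in the later reduction.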

Removing Cost Terms (Reducing P * to TP * )
In this section, we reduce problem P* to a problem with no costs, TP*. This step is helpful for obtaining a finite internal value space V in our eventual reduction to the [FLX18] framework, while approximately preserving the attainable utility.
[FLX18] requires the internal values to be supported on a set V of constant size. This will necessitate a discretization of the element values, as those values are usually not supported on a small set. In various reductions to [FLX18], there are generally two ways to define the internal value s_t. The first option is to use s_t to represent the best value (or an approximation of the best value) that the agent has seen in the past. The second option is to use s_t to represent the number of elements the policy has seen or selected. Given that in Pandora's problem with nonobligatory inspection, the value the agent selects depends on all probed elements and not just a constant-size subset of elements, it is much more reasonable for us to use the first option: use s_t to represent some form of element value. However, almost all problems that reduce to [FLX18] and use s_t to represent element values do not have inspection costs. The canonical way to discretize the values is to round the value up or down to an approximate value. However, if the reward at step t is v − c for some cost c, then rounding v to a nearby approximate value may completely distort v − c multiplicatively. To deal with this issue, we reduce the original P* problem to a problem without costs (we will call it TP*) by drawing parallels between our two-phase policy and the non-exposed policies introduced by [KWW16] for the original Pandora's box problem.

Stage-Non-Exposed Policies
[KWW16] introduced the notion of non-exposed policies (see Definition 2.1), which has been successfully applied to related problems with inspection costs [SS21]. Since the optimal policy for P (and hence P* for some b*) may not always be non-exposed (see Example A.1 in Appendix A), we provide a new, related property that our policy satisfies.
Observe the following fact about non-exposed nonadaptive policies.

A nonadaptive policy is parametrized by ord = (b_1, …, b_k, τ_1, …, τ_k) and inspects boxes b_1, …, b_k in sequential order. At step i, if v_{b_i} > τ_i, then the policy selects box b_i and terminates the process.

Claim 4.9 ([SS21]). A nonadaptive policy parametrized by ord = (b_1, …, b_k, τ_1, …, τ_k) with τ_i ≤ σ_{b_i} for each i ∈ [k] is non-exposed, where σ_b denotes the reservation value of box b.
We can prove a claim with similar conditions to Claim 4.9 for two-phase policies despite the adaptivity of two-phase policies. We prove that although the optimal two-phase policy might not select a box when its value is above the threshold, the optimal policy will always enter phase two. It is also easy to calculate the expected utility during and after the phase transition: if during the phase-transition step i the observed value is v, then the total utility from steps ≥ i is just the expected utility of Weitzman's policy on the remaining boxes with outside option v, minus the cost c_{b_i}. This quantity is always at least v − c_{b_i}, the utility the agent would have gotten by just selecting box b_i and ending the process at step i. This gives us an alternative view of our two-phase policy: during the phase transition at step i, we immediately select box b_i and get utility v − c_{b_i}, but we also get the "leftover utility" from the remaining boxes through Weitzman's policy. This enables us to get rid of the cost terms in a similar manner to [SS21].
Elements in the stochastic dynamic program formulation correspond to boxes in our setting. Note that Weitzman's policy is non-exposed. Claim 4.9 is almost an equivalence statement: a non-exposed policy with τ_i > σ_{b_i} must satisfy v_{b_i} ∈ (σ_{b_i}, τ_i] with probability 0. This can be formally dealt with easily.
From Section 3 we know that if a value is below its threshold, the optimal policy can ignore it. Therefore, treating v, the first value above the threshold, as the maximum observed value, and hence as the outside option, is valid.

Definition 4.10 (stage-non-exposed). A two-phase policy with backup box b* parameterized by ord = (b_1, …, b_k, τ_1, …, τ_k) is stage-non-exposed if τ_i ≤ σ_{b_i} for each i ∈ [k].
Claim 4.11. For problem P*, there exists an optimal two-phase policy parametrized by ord that is stage-non-exposed, namely, for each i ∈ [k], τ_i ≤ σ_{b_i}.
Corollary 4.12. For problem P*, there exists an optimal two-phase policy parametrized by ord that is stage-non-exposed.

Proposition 4.13. Let ALG be a stage-non-exposed two-phase policy parameterized by ord, and let PT_i denote the event that the phase transition happens at step i. Then ALG gets expected utility

Σ_{i=1}^{k} Pr[PT_i] · E_{v_{b_i} > τ_i}[v_{b_i} + (Weitz_{U_{i+1}} − v_{b_i})^+] + Pr[no phase transition] · E[v_{b*}].

Proposition 4.13 gives rise to our problem formulation TP*, which, given an index-threshold sequence ord = (b_1, …, b_k, τ_1, …, τ_k), computes the utility of the associated stage-non-exposed two-phase policy.

Formulation with No Cost
Tweaked P* (abbreviated as TP*): • Box set: U = M \ {b*}. Let U_i denote the remaining available box set at the beginning of step i.
• In each step i, the agent can either open (with no repetition) a box b_i and specify a priori a threshold τ_i ≤ σ_{b_i}, or choose to stop the process. If the value v_{b_i} of box b_i is at most τ_i, then the agent gets 0 reward. Otherwise the agent gets E_{v_{b_i} > τ_i}[v_{b_i}] + E_{v_{b_i} > τ_i}[(Weitz_{U_{i+1}} − v_{b_i})^+] reward, and the agent has to stop the process in the next round.
• When the agent decides to stop, if none of the boxes they have opened has value larger than its specified threshold τ_i, then they get final reward E[v_{b*}]. Otherwise they get nothing when they stop.
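The expected utility of a nonadaptive policy in this cost-free formulation can be computed directly for discrete distributions. In this sketch, `leftover(i, v)` is an assumed helper standing in for E[(Weitz_{U_{i+1}} − v)^+], the precomputed leftover Weitzman utility, and `backup_mean` stands in for E[v_{b*}].

```python
def tp_star_utility(boxes, ord_seq, thresholds, backup_mean, leftover):
    """Expected utility of a nonadaptive policy for TP* (sketch).

    boxes[b] = [(value, prob), ...]. No cost terms appear: on the first
    above-threshold value the agent collects v plus the leftover term and
    stops; if every value stays below its threshold, the final reward is
    the expected value of the backup box, claimed closed.
    """
    total, p_reach = 0.0, 1.0
    for i, b in enumerate(ord_seq):
        dist, tau = boxes[b], thresholds[i]
        for v, p in dist:
            if v > tau:
                total += p_reach * p * (v + leftover(i, v))
        p_reach *= sum(p for v, p in dist if v <= tau)  # survive to step i+1
    return total + p_reach * backup_mean
```

Note how the inspection costs of P* have disappeared from the accounting, which is exactly what the reduction is designed to achieve.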
We will now formally analyze the relationship between the utility of ord in problems P* and TP*. We define C_{TP*} as the class of nonadaptive policies, which can be parametrized by ord = (b_1, …, b_k, τ_1, …, τ_k). A nonadaptive policy parametrized by ord opens boxes b_1, …, b_k in sequential order until it sees a value above the corresponding threshold τ_i, in which case it claims the reward and stops. If none of the boxes among b_1, …, b_k has value above its threshold, then the nonadaptive policy gets final reward E[v_{b*}]. We first prove that C_{TP*} contains an optimal policy, verifying that C_{TP*} is well defined. Claim 4.14. C_{TP*} contains an optimal policy for TP*.
We then verify that ord induces the same utility for both the P* problem (as a parameter to a two-phase policy) and the TP* problem (as a parameter to a nonadaptive policy).

Proposition 4.15. Given any stage-non-exposed two-phase policy parameterized by ord for problem P*, the nonadaptive policy parametrized by the same ord obtains the same expected utility in TP*; that is, u_{P*}(ord) = u_{TP*}(ord).
Finally, we combine the threshold discretization from Section 4.5 with the equivalence of utility between TP* and P* established in this section.

Discretization (Reducing TP * to DTP * )
Currently, it is still not entirely clear how we would reduce TP* to a stochastic dynamic program. We will first briefly describe (without proof) how we could modify TP* into a problem that still has the same optimal utility, but whose reward functions are myopic (as required by the stochastic dynamic program formulation). Observing this new and more adaptive formulation of TP* (we call it Adaptive TP*) will help us decide which values we need to discretize.
Adaptive TP*: • Box set: U = M \ {b*}. Let U_i denote the remaining available box set at the beginning of step i.
• During phase one, in each step i, the agent can either open (with no repetition) a box b_i and specify a priori a threshold τ_i ≤ σ_{b_i}, or choose to stop the process. If the value v_{b_i} of box b_i is at most τ_i, then the agent gets 0 reward. Otherwise the agent gets v_{b_i} as reward, updates their internal value to v_{b_i}, and phase two starts.
• During phase two, in each step t the agent can open a box b and get reward (v_b − s_{t−1})^+. The agent updates their internal value to s_t = max(v_b, s_{t−1}).
• When the agent decides to stop, if none of the boxes they have opened has value larger than its specified threshold τ_i, then they get final reward E[v_{b*}]. Otherwise they get nothing when they stop.
In order for Adaptive TP* to be converted to a stochastic dynamic program (with the format and constraints specified in Section 4.1), we need the internal value to have constant support and the number of threshold choices for each box to be O(1/ε). We have already seen in Section 4.5 and Corollary 4.16 that we can assume the thresholds are supported on W = {0, ε · OPT, 2ε · OPT, …, OPT} with only an additive ε · OPT loss in the attainable utility. The remaining challenge is to discretize the internal state onto a constant-sized support. When the internal value is updated, it is updated either to the value of a box during the phase transition, or to the running maximum during phase two; hence we need to discretize both of these quantities. Normally, for a problem without costs such as Probemax, the optimal expected utility of the agent is either above, or within a constant factor of, E[max_i v_i], the expected maximum value of the elements. In this case, the standard way to discretize an element value is to truncate the value space at E[max_i v_i]/ε (the truncation at E[max_i v_i]/ε is essential to ensure that the probability that the value of any element exceeds the truncated upper limit is at most ε), and then discretize the values into increments of ε · E[max_i v_i]. Since the optimal agent utility is close to E[max_i v_i], this rounding only affects the agent utility by an ε factor. This method indeed works for discretizing the values handled during phase two: since Weitzman's policy is a valid policy for Pandora's box with nonobligatory inspection, its expected utility is at most OPT, so the usual truncation-plus-discretization scheme works.
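The truncate-then-round scheme described above can be sketched as follows; `emax` stands in for E[max_i v_i], and rounding down to the grid is an illustrative choice (the text leaves the rounding direction open).

```python
import math

def discretize(v, emax, eps):
    """Standard truncate-then-round discretization (sketch).

    Truncate at emax / eps, then round down to a multiple of eps * emax,
    so the support has size 1/eps^2 + 1, a constant for fixed eps.
    """
    v = min(v, emax / eps)         # truncation: crossed with prob. at most eps
    step = eps * emax              # grid increment
    return step * math.floor(v / step)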
However, for approximating the phase-transition values as internal states, the above scheme no longer works, since there is a potentially super-constant gap between OPT and E[max_i v_i]. Discretizing the values into multiples of ε · E[max_i v_i] is too coarse to yield meaningful approximation guarantees. On the other hand, discretizing into multiples of ε · OPT yields a good approximation of the agent's utility, but the resulting support has super-constant size.
Discretized TP* (abbreviated as DTP*): • Box set: U = M \ {b*}. For a value v, its discretization ṽ is equal to the smallest support point in W that has value at least v (if v exceeds the largest support point, then ṽ = ∞).
• In each step i, the agent can either open (with no repetition) a box b_i and specify a priori a threshold τ_i ∈ W such that τ_i ≤ σ_{b_i}, or choose to stop the process. If the value of box b_i is at most τ_i, then the agent gets 0 reward. Otherwise the agent gets E_{v_{b_i} > τ_i}[(Weitz_{U_{i+1}} − ṽ_{b_i})^+] + E_{v_{b_i} > τ_i}[ṽ_{b_i}] reward, and the agent has to stop the process in the next round.
• When the agent decides to stop, if none of the boxes they have opened has value larger than its specified threshold τ_i, then they get final reward E[v_{b*}]. Otherwise they get nothing when they stop.
The exact same argument as in Claim 4.14 shows that the optimal strategy for the DTP* problem is nonadaptive. Thus we can define C_{DTP*} as the subset of C_{TP*} that only allows thresholds in W.

Corollary 4.20. There exists a constant-size support W such that the optimal expected utility from DTP* is at least OPT* − 3ε · OPT.

At this point we have essentially established that if we knew a near-optimal index-threshold sequence for problem DTP*, we could immediately find the optimal support W that satisfies Corollary 4.20. Unfortunately, we do not have such super power, as the near-optimal solution is what we are trying to find in the first place! However, remember that each w ∈ W must be at most E[max_i v_i]/ε, and must also be a multiple of ε² · OPT. We first establish that the ratio between E[max_i v_i] and OPT is polynomially bounded. Consequently, there are only a polynomial number of possibilities for each element of W. Since W is of size poly(1/ε), there are only polynomially many (for constant ε) possible choices for W.

Claim 4.21. E[max_i v_i] ≤ n · OPT. (Indeed, claiming any single box i closed uninspected is a valid policy, so OPT ≥ E[v_i] for every i, and hence E[max_i v_i] ≤ Σ_i E[v_i] ≤ n · OPT.)
Hence, once we provide a PTAS for the problem DTP* given a particular W, we can then simply run the PTAS for all possible configurations of W and choose the W whose PTAS policy yields the maximum expected reward. By Corollary 4.20, this expected reward must be at least (1 − ε) · (OPT* − 3ε · OPT).
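The enumerate-all-supports step above can be sketched as a brute-force loop. Here `grid` stands for the polynomially many candidate grid points an element of W may take, and `evaluate(W)` is an assumed helper standing in for running the DTP* PTAS with support W and returning the expected reward of the policy it outputs.

```python
from itertools import combinations

def best_support(grid, support_size, evaluate):
    """Enumerate candidate supports W of a fixed constant size and keep
    the one whose PTAS policy yields the maximum expected reward (sketch)."""
    best_w, best_val = None, float("-inf")
    for w in combinations(sorted(grid), support_size):
        val = evaluate(w)
        if val > best_val:
            best_w, best_val = w, val
    return best_w, best_val
```

Since the support size is a constant depending only on ε, the number of enumerated supports is polynomial in n for fixed ε.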

Obtaining the PTAS (Reducing DTP * to ST * )
Now we are at the last step, which is to show that there exists a PTAS for the DTP* problem by reducing DTP* to the stochastic dynamic program with constant value space whose format is defined in [FLX18]. Based on the discretized version of Adaptive TP* (formulated in Section 4.7), our stochastic dynamic program is as follows.
Stochastic Dynamic Program (abbreviated as ST*): • M is the set of all boxes. As before, let b* denote the backup box we have fixed.
• T is the maximum number of rounds, which is just |M|.
• V is the set of all possible values of the system, which we will set as V = W ∪ {∞}.
• A will represent the action space. Specifically, for all b ≠ b*, let A_b = {a_{b,w}}_{w ∈ W_b} ∪ {a_{b,∞}}, where a_{b,w} represents the action of opening box b with threshold w, and W_b includes all elements of W = {0, ε · OPT, …, OPT} that are at most σ_b. For the backup box b*, let A_{b*} = {a_{b*}}, where a_{b*} represents the action of opening the backup box without a threshold. (The reward from claiming b* closed will be encoded in the final reward function h.) A = ∪_{b ∈ M} A_b is just the union over all possible actions for each box.
• s_t represents the current value of the program, while f is the transition function. Assume at step t we take an action a_{b,w} for box b ≠ b*; then, mirroring Adaptive TP*, we set s_{t+1} = ṽ_b if s_t = 0 and ṽ_b > w, s_{t+1} = s_t if s_t = 0 and ṽ_b ≤ w, and s_{t+1} = max(s_t, ṽ_b) if s_t > 0. For the action a_{b*}, which opens the backup box, we set s_{t+1} = max(s_t, ṽ_{b*}). • We define the reward function for each state transition as, for any b ≠ b*, g(s_t, a_{b,w}) = ṽ_b · 1[s_t = 0, ṽ_b > w] + (ṽ_b − s_t)^+ · 1[s_t > 0] (and analogously for a_{b*}, with the first term dropped). • Finally, we define the final reward function h as the reward the agent gets by claiming the backup box closed. This reward is not allowed if the threshold is crossed for some box.
Firstly, as always when we change the formulation of the problem, we need to define the class of policies that can be parameterized by ord. To do this, we first observe that once s_t > 0, it is optimal to open all remaining boxes in problem ST*. This makes sense, since we deliberately designed the stochastic dynamic program so that in "phase two" (once s_t > 0), the expected reward of the optimal policy corresponds to the expected reward from DTP*. Claim 4.22. Once s_t > 0, it is optimal to open all remaining boxes, including the backup box.
Using the same argument as in Claim 4.14 (essentially, that policies taking deterministic actions given an internal value do as well as policies that can take randomized actions), we can show that there exists an optimal policy that is below-threshold-nonadaptive. Combining this with Claim 4.22, we conclude that there exists an optimal policy of the following form.
(While s_t = 0, for i = 1, …, k: probe element b_i with threshold τ_i and observe the value from the box; once s_t > 0, open all remaining boxes.) As with the previous subsections' classes of policies, we define C_{ST*} as the set of all valid ST* policies with thresholds τ_i supported on W and at most σ_{b_i} for each i. This class is used to establish the following result.

Proposition 4.23. Given a valid index-threshold sequence ord, the ST* policy parameterized by ord obtains the same expected reward as the DTP* policy parameterized by ord.
We are finally ready to prove the main theorem of the section, Theorem 1.3. The proof uses the relationships among P*, TP*, DTP*, and ST*, and is provided in the appendix.

never opens this box). Therefore, u(ALG_open) and u(ALG_closed) are both at most OPT_commit, where OPT_commit is the maximum utility among committing policies. Upper bounding u(ALG_open), u(ALG_closed), and u(ALG) by OPT_commit gives an upper bound on the right-hand side. Since 0 ≤ q ≤ 1, the minimum value of the right-hand side occurs at q = 1/2, implying the statement.
Example A.1. This example shows that the optimal policy of Pandora's problem with nonobligatory inspection may not be non-exposed. Consider two boxes, 1 and 2, with ε a sufficiently small number, whose distributions and costs are such that σ_1 = 2 − 2ε and σ_2 = 1/(2ε). The optimal policy starts by opening box 1. If v_1 = 0, then it claims box 2 closed. However, if v_1 = 2, it continues by opening box 2 and selecting it if v_2 = 1/ε. Therefore, in this case, although box 1 has been inspected and v_1 > σ_1, the optimal policy does not select it, which makes this an example of the optimal policy not satisfying non-exposure.

B Missing Proofs of Section 3
Observation 3.1. OPT(U, v) is increasing in v.
Proof of Observation 3.1. For any v, v′ where v′ > v, let OAL be an optimal policy for P(U, v). We will show that there exists a policy for P(U, v′) which gets at least as much utility as OPT(U, v). Consider a policy OAL′ for P(U, v′) that pretends the outside option is v and at each stage does exactly what OAL would do, conditioned on the revealed information. For any fixed sequence of values of the boxes, OAL′ always pays the same costs as OAL and returns a value that is either equal to or greater than the value returned by OAL.
The following lemma shows that given an optimal policy LOAL of a subproblem P(U, v) and any optimal policy OAL, we can construct another optimal policy OAL′ that follows LOAL at any state reachable by LOAL from (U, v), and follows OAL at any other reachable state. This lemma is used in the proofs of Lemmas 3.4 and 3.5.
Lemma B.1. Let LOAL be a policy that is optimal for the problem P(U, v). Then for any optimal policy OAL for the problem P(M, 0), we can construct another optimal policy OAL′ such that for any state (U′, v′) ∈ S_LOAL(U, v), OAL′(U′, v′) = LOAL(U′, v′).

Proof. For any optimal policy OAL, let us construct the policy OAL′ such that at any state that is not in S_LOAL(U, v), OAL′ always takes the same action as OAL; however, at a state (U′, v′) ∈ S_LOAL(U, v), OAL′ takes the same action as LOAL. We will first verify that this policy is valid (namely, OAL′ never reaches a state where the action at that state is ill-defined). To prove this, we will show that S(OAL′) ⊆ S(OAL) ∪ S_LOAL(U, v). For any state (U′, v′) ∈ S(OAL′) with (U′, v′) ∉ S_LOAL(U, v), any sequence of states that starts with (M, 0), ends with (U′, v′), and is plausible for policy OAL′ must not include the state (U, v) (otherwise, since (U, v) ∈ S_LOAL(U, v), OAL′ would take the same action as LOAL at (U, v), and similarly at the next state, and so on until OAL′ reaches (U′, v′); therefore (U′, v′) would be reachable by LOAL from (U, v), which is a contradiction). Thus OAL′ must take the same action as OAL at all states in this sequence, which means that this sequence of states is plausible for policy OAL as well, and hence (U′, v′) ∈ S(OAL). We conclude that (U′, v′) is either in S(OAL) or in S_LOAL(U, v). Now, since LOAL is locally optimal at (U, v), for any (U′, v′) ∈ S_LOAL(U, v), LOAL(U′, v′) is the optimal first action for the problem P(U′, v′). Similarly, for any (U′, v′) ∈ S(OAL), OAL(U′, v′) is the optimal first action for the problem P(U′, v′). We conclude that at any state (U′, v′) ∈ S_LOAL(U, v), OAL′(U′, v′) = LOAL(U′, v′) is the locally optimal first action, and at any other reachable state OAL′(U′, v′) = OAL(U′, v′) is the locally optimal first action as well. Hence OAL′ is optimal.
Proof of Lemma 3.4. By Definition 3.2, if v₀ > σ(U₀), then no optimal policy uses a backup box for the problem P(U₀, v₀). Since (U₀, v₀) ∈ R(OAL), OAL is also optimal for the problem P(U₀, v₀). Assume for contradiction that vᵢ ≤ σ(Uᵢ) for some i ∈ [k]. Then there exists an optimal policy LOAL that uses a backup box for the problem P(Uᵢ, vᵢ). By Lemma B.1, there exists another optimal policy OAL′ for the problem P(U₀, v₀) such that OAL′ uses a backup box for the problem P(Uᵢ, vᵢ) and (Uᵢ, vᵢ) ∈ R_OAL′(U₀, v₀). Since OAL′ uses a backup box for P(Uᵢ, vᵢ) and (Uᵢ, vᵢ) is reachable by OAL′ from (U₀, v₀), OAL′ must also use a backup box for the problem P(U₀, v₀), contradicting the fact that no optimal policy uses a backup box for P(U₀, v₀).

Lemma 3.5. There exists an optimal policy OAL for the problem P(M, 0) that satisfies the following for any reachable state (U, v) ∈ R(OAL):
• When v ≤ σ(U), OAL(U, v) = OAL(U, 0);
• When v > σ(U), OAL(U, v) = w(U, v). In fact, for any (U′, v′) ∈ R_OAL(U, v), OAL(U′, v′) = w(U′, v′), where w(U, v) represents the action Weitzman's policy would take given that U is the set of uninspected boxes and v is the maximum value obtained so far by the algorithm.
Proof of Lemma 3.5. If no optimal policy uses a backup box for P(M, 0) with positive probability, then Weitzman's policy is an optimal policy satisfying the statement. Note that for any reachable state (U, v) of Weitzman's policy, v > σ(U); otherwise there would be an optimal policy that claims a closed box, in contradiction with the initial assumption. Now, suppose there exists an optimal policy that uses a backup box for P(M, 0). Let (a₁, o₁), …, (a_k, o_k) be the first actions of pointwise optimal deterministic policies for the problems P(U₀, 0), P(U₁, 0), …, P(U_{k−1}, 0), respectively, where U₀ = M, Uᵢ = M \ {a₁, …, aᵢ}, and k is the first step in the sequence at which the action taken, i.e., (a_k, o_k), is terminal. Note that since the outside option is 0 for each problem in the sequence, claiming a closed box yields at least as much utility as taking the outside option; therefore, we may assume o_k = Close. Now, consider a deterministic optimal policy that takes actions (a₁, o₁), …, (a_k, o_k) for the problems P(U₀, 0), P(U₁, 0), …, P(U_{k−1}, 0), respectively. We show how to modify it to satisfy the conditions in the statement.
From a repeated application of the claim we have just proven, we know that from our original optimal policy OAL, we can construct another optimal policy OAL′ such that for all i ∈ [k] and all v ≤ σ(Uᵢ), OAL′(Uᵢ, v) = OAL′(Uᵢ, 0). This also implies that all reachable states (U, v) ∈ R(OAL′) with v ≤ σ(U) are of the form (Uᵢ, v) for some i ∈ [k]; hence the first condition in our lemma is satisfied. We now modify our optimal policy further so that the second condition is satisfied as well. We know that for any U ⊆ M and any v > σ(U), no optimal policy for P(U, v) claims a box closed. Conditioned on not claiming any box closed, an optimal policy for P(U, v) is Weitzman's policy. Thus by Corollary B.1, we can modify OAL′ and construct another policy OAL′′ such that for any (U, v) ∈ R(OAL′′) with v > σ(U) and any (U′, v′) ∈ R_OAL′′(U, v), OAL′′(U′, v′) = w(U′, v′). For any (U, v) ∈ R(OAL′′) that does not satisfy this condition, OAL′′(U, v) = OAL′(U, v) = OAL′(U, 0). By Lemma 3.4, for any (U, v) ∈ R(OAL′′) with v > σ(U) and any (U′, v′) ∈ R_OAL′′(U, v), we have v′ > σ(U′). Hence OAL′′ takes the same action as OAL′ at any reachable state (U, v) with v ≤ σ(U). We conclude that OAL′′ satisfies our second condition while still satisfying the first.
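Weitzman's policy w, which the construction above falls back to once v > σ(U), admits a short implementation. The following is a minimal Python sketch (not the paper's code): the reserve value of a box is found by binary search on the standard defining equation E[(vᵢ − σᵢ)⁺] = cᵢ, and boxes are opened in decreasing reserve order until the best value in hand is at least every remaining reserve. The box representation (dicts with `sample`/`cost`/`sigma` fields) is an illustrative assumption.

```python
def reserve_value(sample, cost, lo=0.0, hi=1e6, iters=60, n_samples=10000):
    """Binary-search the reserve value sigma solving E[(v - sigma)^+] = cost."""
    vs = [sample() for _ in range(n_samples)]  # fixed sample set for a monotone objective
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        gain = sum(max(v - mid, 0.0) for v in vs) / len(vs)
        if gain > cost:
            lo = mid   # expected gain too large: reserve is higher
        else:
            hi = mid
    return (lo + hi) / 2.0

def weitzman(boxes, outside=0.0):
    """Weitzman's index policy: open boxes in decreasing reserve order and stop
    once the best value in hand is at least every remaining reserve."""
    order = sorted(boxes, key=lambda b: -b["sigma"])
    best, cost_paid = outside, 0.0
    for b in order:
        if best >= b["sigma"]:   # no remaining box is worth opening
            break
        cost_paid += b["cost"]
        best = max(best, b["sample"]())
    return best - cost_paid
```

With a deterministic value distribution (v ≡ 10, cost 3) the reserve solves 10 − σ = 3, i.e., σ = 7, which the binary search recovers.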

C Missing Proofs of Section 4
The following claims use the fact that when the distributions of the boxes i ∈ U are fixed, Weitz_U(v) can still be viewed as a function of v.

Proof of Claim 4.5. Essentially, at each step of a two-phase policy, the agent decides whether to move to phase two, or to forfeit the value just observed forever and continue in phase one. Hence for any value v > τᵢ, it must be at least as good to continue to phase two, namely, P*(ord)_{≥(i+1)} ≤ Weitz_{≥(i+1)}(v). Changing the thresholds τᵢ so that they comply with the condition P*(ord)_{≥(i+1)} = Weitz_{≥(i+1)}(τᵢ) therefore does not affect optimality.

Proof of Claim 4.6. Let OAL, parametrized by ord* = (π₁, …, π_k, τ₁, …, τ_k), be any optimal two-phase policy for the problem P* that satisfies Claim 4.5. Notice that for any i < k, P*(ord*)_{≥i} ≥ P*(ord*)_{≥(i+1)}. Moreover, the expected future utility of OAL at the first step is just P*(ord*)_{≥1} = OPT. Hence at any step i, we know that P*(ord*)_{≥(i+1)} ≤ OPT. Since by Claim 4.5 P*(ord*)_{≥(i+1)} = Weitz_{≥(i+1)}(τᵢ) ≥ τᵢ, the threshold τᵢ must also be at most OPT.
Proposition 4.7. Let ord* = (π₁, …, π_k, τ₁, …, τ_k) be the parameter associated with an optimal two-phase policy for the problem P* that satisfies Claim 4.5. Then there exists another index-threshold sequence ord′ whose thresholds all lie in W and whose expected utility is at least OPT − k·ε·OPT.

Proof. Let ord′ have the same initial order π₁, …, π_k as ord*, with each threshold of ord* rounded down to a multiple of ε·OPT. Namely, in ord′, for i = 1, …, k, the threshold is τ̃ᵢ = ε·OPT·⌊τᵢ/(ε·OPT)⌋. Notice that since τᵢ ≤ OPT by Claim 4.6, and τ̃ᵢ ≤ τᵢ, the rounded threshold τ̃ᵢ is also at most OPT. Hence τ̃ᵢ ∈ W = {0, ε·OPT, 2·ε·OPT, …, OPT}. We now prove by induction that P*(ord′)_{≥i} ≥ P*(ord*)_{≥i} − (k − i)·ε·OPT for all i ∈ [k]. In the base case i = k, any two-phase policy simply claims box π_k closed at step k; thus P*(ord′)_{≥k} = E[v_{π_k}] = P*(ord*)_{≥k}. Now fix i < k and assume the inequality holds for i + 1; ord* and ord′ open box πᵢ with thresholds τᵢ and τ̃ᵢ, respectively, and we expand the utility recurrence at step i. By the induction hypothesis,

P*(ord′)_{≥(i+1)} ≥ P*(ord*)_{≥(i+1)} − (k − i − 1)·ε·OPT. (6)

Given that ord* is the parameter of an optimal two-phase policy satisfying Claim 4.5, we know that for any v > τᵢ, Weitz_{≥(i+1)}(v) ≥ P*(ord*)_{≥(i+1)}. By Claim C.1, Weitz_{≥(i+1)}(·) is monotone and subadditive under addition; since τᵢ ≤ τ̃ᵢ + ε·OPT, this gives

Weitz_{≥(i+1)}(τ̃ᵢ) ≥ Weitz_{≥(i+1)}(τᵢ) − ε·OPT. (7)

Plugging inequalities (6) and (7) into our expansion of P*(ord′)_{≥i}, we get P*(ord′)_{≥i} ≥ P*(ord*)_{≥i} − (k − i)·ε·OPT. Finally, we conclude that the expected utility of ord′, which equals P*(ord′)_{≥1}, is at least P*(ord*)_{≥1} − (k − 1)·ε·OPT = OPT − (k − 1)·ε·OPT ≥ OPT − k·ε·OPT.

Claim 4.11. For the problem P*, there exists an optimal two-phase policy parametrized by an index-threshold sequence ord whose thresholds are additionally capped so that the policy is stage-non-exposed.

Proof. Let OAL be an optimal two-phase policy for P* parametrized by ord. Since ord is optimal, removing a box from the order-threshold sequence must not improve the utility. Therefore, for any step i < k, P*(ord)_{≥i} ≥ P*(ord)_{≥(i+1)}. Expanding the utility recurrence formula for two-phase policies and exchanging terms, Claim 4.5 implies that capping the thresholds does not decrease the utility. For all i, let τ′ᵢ be τᵢ capped at the corresponding bound. Then ord′, with the same order π₁, …, π_k and thresholds τ′₁, …, τ′_k, is also optimal and satisfies the conditions in the claim.
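The rounding step in Proposition 4.7 is purely mechanical. A minimal Python sketch (function and parameter names are illustrative, not from the paper):

```python
def round_thresholds(taus, opt, eps):
    """Round each threshold down to the grid W = {0, eps*OPT, 2*eps*OPT, ..., OPT}."""
    step = eps * opt
    return [step * int(t // step) for t in taus]
```

Each threshold drops by strictly less than ε·OPT, which is exactly the per-step loss charged in the induction, and every rounded threshold lies on a grid of size 1/ε + 1, which is what makes enumeration over W feasible.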
Corollary C.3. For the problem P*, there exists an optimal two-phase policy parametrized by an index-threshold sequence ord that simultaneously satisfies the conditions of Claim 4.5 and Claim 4.11.

Proof. Immediate by combining Claim 4.5 and Claim 4.11.

Proposition 4.13. Let ALG be a stage-non-exposed two-phase policy parameterized by ord, and let PTᵢ denote the indicator of whether the phase transition happens at step i. Then the expected utility of ALG is given by the formula obtained below by unrolling the utility recurrence.

Proof. First, we again use the recurrence formula for two-phase policies and expand the definition of PTᵢ.

Taking the expectation over the realized values gives us the per-step contribution. We can now rewrite the utility recurrence for ALG accordingly; unrolling the recurrence gives the formula in the claim.
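To make the two-phase recurrence concrete, here is a Monte Carlo sketch of a simplified two-phase policy in Python. It is an illustration under stated assumptions, not the paper's construction: in phase one, values at or below the current threshold are forfeited; the first value exceeding its threshold triggers the transition; phase two is modeled as a reserve-order Weitzman run over the remaining boxes (assumed already sorted by decreasing reserve), and the backup-box behavior of P* is omitted.

```python
def two_phase_utility(order, taus, boxes, trials=20000):
    """Monte Carlo estimate of the expected utility of a simplified two-phase
    policy: open boxes in `order` with thresholds `taus`; the first value
    exceeding its threshold triggers the phase transition."""
    total = 0.0
    for _ in range(trials):
        u = 0.0
        for i, b in enumerate(order):
            u -= boxes[b]["cost"]
            v = boxes[b]["sample"]()
            if v > taus[i]:                      # phase transition at step i
                for b2 in order[i + 1:]:         # phase two: Weitzman-style run
                    if v >= boxes[b2]["sigma"]:  # stop once v beats every reserve
                        break
                    u -= boxes[b2]["cost"]
                    v = max(v, boxes[b2]["sample"]())
                u += v
                break                            # no transition => value forfeited
        total += u
    return total / trials
```

With deterministic box values the estimate is exact, which makes the recurrence easy to sanity-check by hand.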
Proof. We prove that there is an optimal nonadaptive solution for the problem TP* by induction. Assume that for any available item set U′ with |U′| < |U| = |M \ {b*}| = n − 1, there exists an optimal nonadaptive solution to the tweaked problem. Observe that there must exist an optimal policy for the problem TP* on the set U whose first action is deterministic: if the first action is randomized, then the actions in its support are equally good, and we may fix any one of them. Let OAL denote this optimal deterministic policy. If the first action of OAL is to stop, then OAL is already nonadaptive. Otherwise, the first action of OAL is to open some box π₁ with threshold τ₁. After the first step, either v_{π₁} > τ₁ and the process stops, or v_{π₁} ≤ τ₁ and the agent still has 0 reward. Thus a nonadaptive optimal policy OAL₂ for U \ {π₁} is also a locally optimal policy for the second case (where v_{π₁} ≤ τ₁). We can now devise a new nonadaptive optimal policy OAL′ for the tweaked problem on U: in the first step, OAL′ opens box π₁ with threshold τ₁, and in the case where v_{π₁} ≤ τ₁, OAL′ takes future actions according to OAL₂.
Proposition 4.15. Given any stage-non-exposed two-phase policy parameterized by ord for the problem P*, the nonadaptive policy parametrized by the same ord obtains the same expected utility for the problem TP*.

Proof. Given a stage-non-exposed two-phase policy parametrized by ord = (π₁, …, π_k, τ₁, …, τ_k), by Proposition 4.13 its utility satisfies the two-phase recurrence. Similarly, for a nonadaptive policy parametrized by ord for the problem TP*, at step i the policy stops with probability Pr[v_{πᵢ} > τᵢ], in which case the agent obtains the corresponding reward. Hence the utility recurrence for TP* is identical, and the two expected utilities coincide.

Proof. Let t be the first iteration where v_{π_t} > τ_t, and let S be the set of boxes the agent ends up opening in rounds after t. The total reward the agent obtains is just max(v_{π_t}, max_{i∈S} vᵢ) (if no such t exists, the final reward is 0). In order to maximize this term, we should make the set S as large as possible, namely, open all remaining boxes. Let ALG(b*) denote the two-phase policy parametrized by the ord returned from the problem P* with backup box b*. We conclude that by performing our reduction for P* for every b* ∈ M, and then taking the better of the best ALG(b*) over all b* ∈ M and Weitzman's policy, we can find a policy with reward at least OPT − O(ε)·OPT.
In our reduction to a stochastic dynamic program, all steps are fully polynomial except that we enumerate all choices of thresholds from W, which takes poly(1/ε) time. Running the PTAS for the stochastic dynamic program itself also takes time with polynomial dependence on n. Therefore, our policy-finding scheme is a PTAS for the Pandora's problem with nonobligatory inspection.
A policy consists of two maps: the first, (U, v) → U, outputs the index of the next box considered; the second, (U, v) → {Open, Close, Stop}, outputs the operation on the next box, where the operations are opening the box, claiming the box closed, and terminating the policy without probing. • Action(U, v) denotes the pair of the two maps' outputs, indicating the next box and operation. An action is called terminal if its operation is equal to Close or Stop.

Definition 4.4.
For a set U of boxes and a fixed outside option v, we define Weitz_U := E[max_{i∈U} κᵢ] and Weitz_U(v) := E[max(max_{i∈U} κᵢ, v)], where κᵢ := min(vᵢ, σᵢ) is the value of box i capped at its reserve value.
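Under the standard amortization identity for Weitzman's policy (its expected utility with outside option v equals the expected maximum of v and the reserve-capped values κᵢ = min(vᵢ, σᵢ)), the quantity Weitz_U(v) of Definition 4.4 can be estimated by plain sampling. A Python sketch; the box representation (dicts with `sample`/`sigma` fields) is an illustrative assumption:

```python
def weitz_value(boxes, v, trials=50000):
    """Estimate Weitz_U(v) = E[max(v, max_{i in U} min(v_i, sigma_i))]
    by averaging the capped maximum over sampled box values."""
    total = 0.0
    for _ in range(trials):
        best = v
        for b in boxes:
            best = max(best, min(b["sample"](), b["sigma"]))
        total += best
    return total / trials
```

Note that, as this form makes apparent, Weitz_U(v) is monotone in v and satisfies Weitz_U(v + δ) ≤ Weitz_U(v) + δ, the monotonicity and subadditivity used via Claim C.1.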
Furthermore, this holds with equality for every box if and only if the policy is non-exposed.
It is clear that, given a nonadaptive strategy ALG parametrized with an index-threshold sequence ord ∈ C_DTP*, ALG gets at least as much expected reward from TP* as from DTP* (because we round the positive terms, namely the values, downward, and we round the negative terms, namely the costs, upward). We now prove that the reward ALG gets from DTP* is within an additive 2ε·OPT of the reward ALG gets from TP*.

Proposition 4.19. Given any nonadaptive policy ALG parametrized with ord = (π₁, …, π_k, τ₁, …, τ_k), the expected reward ALG gets from DTP* is at least the expected reward ALG gets from TP* minus 2ε·OPT. Formally, DTP*(ord) ≥ TP*(ord) − 2ε·OPT.
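The discretization behind Proposition 4.19 can be sketched directly: values are rounded down and costs rounded up to the grid of multiples of ε·OPT, so a nonadaptive policy's reward under the discretized problem never exceeds its reward under TP*. A hedged Python sketch (the grid granularity and function name are assumptions for illustration):

```python
def discretize(value, cost, opt, eps):
    """Round a value down and a cost up to multiples of eps*opt, so the
    discretized reward of any fixed policy can only decrease."""
    step = eps * opt
    value_down = step * int(value // step)
    cost_up = step * (-int(-cost // step))  # ceiling via negated floor division
    return value_down, cost_up
```

Each rounded term moves by less than ε·OPT, which is consistent with the total additive loss of 2ε·OPT in the proposition.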