Anytime State-Based Solution Methods
for Decision Processes with non-Markovian Rewards
Sylvie Thi´ebaux
Froduald Kabanza
John Slaney
of the request being made, or even simply for the veryfirst achievement of a goal which becomes irrelevant af- A popular approach to solving a decision pro- terwards. A decision process in which rewards depend on cess with non-Markovian rewards (NMRDP) is the sequence of states passed through rather than merely to exploit a compact representation of the re- on the current state is called a decision process with non- ward function to automatically translate the NM- Markovian rewards (NMRDP) (Bacchus et al., 1996).
RDP into an equivalent Markov decision pro-cess (MDP) amenable to our favorite MDP so- A difficulty with NMRDPs is that the most efficient MDP lution method. The contribution of this paper is solution methods do not directly apply to them. The tradi- a representation of non-Markovian reward func- tional way to circumvent this problem is to formulate the tions and a translation into MDP aimed at mak- NMRDP as an equivalent MDP, whose states are those of ing the best possible use of state-based any- the underlying system expanded to encode enough history- time algorithms as the solution method.
dependent information to determine the rewards. Hand explicitly constructing and exploring only parts crafting such an expanded MDP (XMDP) can however be of the state space, these algorithms are able to very difficult in general. This is exacerbated by the fact trade computation time for policy quality, and that the size of the XMDP impacts on the effectiveness of have proven quite effective in dealing with large many solution methods. Therefore, there has been interest MDPs. Our representation extends future linear in automating the translation into an XMDP, starting from a temporal logic (FLTL) to express rewards. Our natural specification of non-Markovian rewards and of the translation has the effect of embedding model- system’s dynamics (Bacchus et al., 1996; Bacchus et al., checking in the solution method. It results in 1997). This is the problem we focus on.
an MDP of the minimal size achievable with- When solving NMRDPs in this setting, the central issue is out stepping outside the anytime framework, and to define a non-Markovian reward specification language consequently in better policies by the deadline.
and a translation into an XMDP adapted to the class ofMDP solution methods and representations we would like INTRODUCTION
to use for the type of problems at hand. The two previousproposals within this line of research both rely on past lin- Markov decision processes (MDPs) are now widely ac- ear temporal logic (PLTL) formulae to specify the behav- cepted as the preferred model for decision-theoretic plan- iors to be rewarded (Bacchus et al., 1996; Bacchus et al., ning problems (Boutilier et al., 1999). The fundamental as- 1997), but adopt two very different translations adapted to sumption behind the MDP formulation is that not only the two very different types of solution methods and represen- system dynamics but also the reward function are Marko- tations. The translation in (Bacchus et al., 1996) targets vian. Therefore, all information needed to determine the classical state-based solution methods such as policy iter- reward at a given state must be encoded in the state itself.
ation (Howard, 1960) which generate complete policies atthe cost of enumerating all states in the entire MDP, while This requirement is not always easy to meet for plan- that in (Bacchus et al., 1997) targets structured solution ning problems, as many desirable behaviors are naturally methods and representations, which do not require explicit expressed as properties of execution sequences, see e.g.
state enumeration, see e.g. (Boutilier et al., 2000).
(Drummond, 1989; Haddawy and Hanks, 1992; Bacchusand Kabanza, 1998; Pistore and Traverso, 2001). Typ- The aim of the present paper is to provide a language and ical cases include rewards for the maintenance of some a translation adapted to another class of solution methods property, for the periodic achievement of some goal, for which have proven quite effective in dealing with large the achievement of a goal within a given number of steps MDPs, namely anytime state-based methods such as (Barto et al., 1995; Dean et al., 1995; Thi´ebaux et al., 1995; BACKGROUND
Hansen and Zilberstein, 2001). These methods typicallystart with a compact representation of the MDP based on probabilistic planning operators, and search forward froman initial state, constructing new states by expanding the A Markov decision process of the type we consider is a envelope of the policy as time permits. They may produce 5-tuple S, s0, A, Pr, R , where S is a finite set of fully ob- an approximate and even incomplete policy, but only ex- servable states, s0 ∈ S is the initial state, A is a finite set plicitly construct and explore a fraction of the MDP. Nei- of actions (A(s) denotes the subset of actions applicable in ther of the two previous proposals is well-suited to such so- s ∈ S), {Pr(s, a, •) | s∈S, a∈A(s)} is a family of proba- lution methods, the first because the cost of the translation bility distributions over S, such that Pr(s, a, s ) is the prob- (most of which is performed prior to the solution phase) ability of being in state s after performing action a in state annihilates the benefits of anytime algorithms, and the sec- s, and R : S → IR is a reward function such that R(s) is the ond because the size of the XMDP obtained is an obstacle immediate reward for being in state s. It is well known that to the applicability of state-based methods. Since here both such an MDP can be compactly represented using prob- the cost of the translation and the size of the XMDP it re- abilistic extensions of traditional planning languages, see sults in will severely impact on the quality of the policy e.g., (Kushmerick et al., 1995; Thi´ebaux et al., 1995).
obtainable by the deadline, we need an appropriate resolu- A stationary policy for an MDP is a function π : S → A, tion of the tradeoff between the two.
such that π(s) ∈ A(s) is the action to be executed in state Our approach has the following main features. The transla- S. We note E(π) the envelope of the policy, that is the set tion is entirely embedded in the anytime solution method, of states that are reachable (with a non-zero probability) to which full control is given as to which parts of the from the initial state s0 under the policy. If π is defined at XMDP will be explicitly constructed and explored. While all s ∈ E(π), we say that the policy is complete, and that it the XMDP obtained is not minimal, it is of the minimal size is incomplete otherwise. We note F (π) the set of states in achievable without stepping outside of the anytime frame- E(π) at which π is undefined. F (π) is called the fringe of work, i.e., without enumerating parts of the state or ex- the policy. We stipulate that the fringe states are absorbing.
panded state spaces that the solution method would not nec- The value of a policy π at any state s ∈ E(π), noted (s) essarily explore. This relaxed notion of minimality, which is the sum of the expected rewards to be received at each we call blind minimality is the most appropriate in the con- future time step, discounted by how far into the future they text of anytime state-based solution methods.
occur. That is, for a non-fringe state s ∈ E(π) \ F (π): When the rewarding behaviors are specified in PLTL, there (s) = R(s) + β Pr(s, π(s), s )(s ) does not appear to be a way of achieving a relaxed no- tion of minimality as powerful as blind minimality with- where 0 ≤ β ≤ 1 is the discounting factor controlling out a prohibitive translation. Therefore instead of PLTL, the contribution of distant rewards. For a fringe state s ∈ we adopt a variant of future linear temporal logic (FLTL) F (π), (s) is heuristic or is the value at s of a complete as our specification language, which we extend to handle default policy to be executed in absence of an explicit one.
rewards. While the language has a more complex seman- For the type of MDP we consider, the value of a policy π is tics than PLTL, it enables a natural translation into a blind- the value (s0) of π at the initial state s0, and the larger minimal XMDP by simple progression of the reward for- mulae. Moreover, search control knowledge expressed inFLTL (Bacchus and Kabanza, 2000) fits particularly nicely STATE-BASED ANYTIME ALGORITHMS
in this model-checking framework, and can be used to dra-matically reduce the fraction of the search space explored Traditional state-based solution methods such as policy it- eration (Howard, 1960) can be used to produce an optimal The paper is organised as follows. Section 2 begins with complete policy. Policy iteration can also be viewed as an background material on MDPs, NMRDPs, XMDPs, and anytime algorithm, which returns a complete policy whose anytime state-based solution methods. Section 3 describes value increases with computation time and converges to op- our language for specifying non-Markovian rewards and timal. The main drawback of policy iteration is that it ex- the progression algorithm. Section 4 defines our translation plicitly enumerates all states that are reachable from s0 in into an XMDP along with the concept of blind minimal- the entire MDP. Therefore, there has been interest in other ity it achieves, and presents our approach to the embedded anytime solution methods, which may produce incomplete construction and solution of the XMDP. Finally, Section 5, policies, but only enumerate an increasing fraction of the provides a detailed comparison with previous approaches, and concludes with some remarks about future work. The For instance, (Dean et al., 1995) describes methods which proofs of the theorems appear in (Thi´ebaux et al., 2002).
deploy policy iteration on judiciously chosen larger and In the initial state s0, p is false and two A decision process with non-Markovian rewards is identi- actions are possible: a causes a transition cal to an MDP except that the domain of the reward func- to s1 with probability 0.1, and no change with probability 0.9, while for b the transi- tion is S∗. The idea is that if the process has passed through tion probabilities are 0.5. In state s state sequence Γ(i) up to stage i, then the reward R(Γ(i)) true, and actions c and d (“stay” and “go”) is received at stage i. Figure 1 gives an example. Like the A reward is received the first time p is true, reward function, a policy for an NMRDP depends on his- but not subsequently. That is, the rewarded tory, and is a mapping from S ∗ to A. As before, the value of policy π is the expectation of the discounted cumulative (Γ(i)) | π, Γ0 = s0 larger envelopes. Another example is (Thi´ebaux et al., The clever algorithms developed to solve MDPs cannot be 1995), in which a backtracking forward search in the space directly applied to NMRDPs. One way of dealing with this of (possibly incomplete) policies rooted at s 0 is performed problem is to formulate the NMRDP as an equivalent MDP until interrupted, at which point the best policy found so with an expanded state space (Bacchus et al., 1996). The far is returned. Real-time dynamic programming (RTDP) expanded states in this XMDP (e-states, for short) augment (Barto et al., 1995), is another popular anytime algorithm, the states of the NMRDP by encoding additional informa- which is to MDPs what learning real-time A(Korf, 1990) tion sufficient to make the reward history-independent. An is to deterministic domains. It can be run on-line, or off- e-state can be seen as labeled by a state of the NMRDP line for a given number of steps or until interrupted. A (via the function τ in Definition 1 below) and by history more recent example is the LAOalgorithm (Hansen and information. The dynamics of NMRDPs being Markovian, Zilberstein, 2001) which combines dynamic programming the actions and their probabilistic effects in the XMDP are exactly those of the NMRDP. The following definition, All these algorithms eventually converge to the optimal adapted from (Bacchus et al., 1996), makes this concept of policy but need not necessarily explore the entire state equivalent XMDP precise. Figure 2 gives an example.
space to guarantee optimality.1 When interrupted beforeconvergence, they return a possibly incomplete but often Definition 1 MDP D = S , s0, A , Pr, R is an equivalent
useful policy. Another common point of these approaches expansion (or XMDP) for NMRDP D = S, s0, A, Pr, R is that they perform a forward search, starting from s if there exists a mapping τ : S → S such that: repeatedly expanding the envelope of the current policy one 1. τ (s0) = s0. step forward. Since planning operators are used to com-pactly represent the state space, these methods will only 2. For all s ∈ S , A (s ) = A(τ (s )). explicitly construct a subset of the MDP. In this paper, we 3. For all s1, s2 ∈ S, if there is a ∈ A(s1) such that will use these solution methods to solve decision processes Pr(s1, a, s2) > 0, then for all s1 ∈ S such that with non-Markovian rewards which we define next.
τ(s1)=s1, there exists a unique s2∈S , τ(s2)=s2, suchthat for all a ∈ A (s1), Pr (s1, a, s2)=Pr(s1, a, s2). NMRDPs AND EQUIVALENT XMDPs
4. For any feasible state sequence Γ ∈ D(s0) and We first need some notation. Let S ∗ be the set of finite Γ ∈ D (s0) such that τ(Γ ) = Γ sequences of states over S, and S ω be the set of possibly (Γ ) = R(Γ(i)) for all i. infinite state sequences. In the following, where ‘Γ’ standsfor a possibly infinite state sequence in S ω and i is a natural Items 1–3 ensure that there is a bijection between feasi- ble state sequences in the NMRDP and feasible e-state se- i’ we mean the state of index i in Γ, by ‘Γ(i)’ quences in the XMDP. Therefore, a stationary policy for the 0, . . . , Γi ∈ S∗ of Γ, and by pre(Γ) we mean the set of finite prefixes of Γ. Γ XMDP can be reinterpreted as a non-stationary policy for the NMRDP. Furthermore, item 4 ensures that the two poli- 1 ∈ S∗ and Γ2 ∈ Sω. For a decision cies have identical values, and that consequently, solving 0, A, Pr, R and a state s ∈ S, D(s) stands for the set of state sequences rooted at s that are an NMRDP optimally reduces to producing an equivalent feasible under the actions in D, that is: D(s) = {Γ XMDP and solving it optimally (Bacchus et al., 1996): Sω | Γ0 = s and ∀i ∃a ∈ A(Γ Γi+1) > 0}.
Proposition 1 Let D be an NMRDP, D an equivalent
Note that the definition of D(s) does not depend on R and XMDP for it, and π a policy for D . Let π be the func- therefore also stands for NMRDPs which we describe now.
tion defined on the sequence prefixes Γ(i) ∈ D(s0) by 1This is also true of the basic envelope expansion algorithm in π(Γ(i)) = π (Γ ), where for all j ≤ i τ(Γ ) = Γ (Dean et al., 1995), under the same conditions as for LAO.
is a policy for D such that Vπ(s0) = (s0). from now on). We also adopt the notations modality (f will be true in exactly k steps), i for 1 ≤ i ≤ k (f will be true within the next k steps), and if for 1 ≤ i ≤ k (f will be true at all the next k steps).
Although negation officially occurs only in literals, i.e., the formulae are in negation formal form (NNF), we allow our- selves to write formulae involving it in the usual way, pro- vided that they have an equivalent in NNF. Not every for- mula has such an equivalent, because there is no such literal as ¬$ and because eventualities (‘f will be true some time’) are not expressible. These restrictions are deliberate.
Figure 2: An XMDP equivalent to the NMRDP in Figure1. τ (s0) = τ(s2) = s0. τ(s1) = τ(s3) = s1. State s1 is The semantics of this language is similar to that of FLTL, rewarded; the other three states are not.
with an important difference: unlike the interpretation ofthe propositional constants in P, which is fixed (i.e. each When solving NMRDPs in this setting, the two key is- state is a fixed subset of P), the interpretation of the con- sues are how to specify non-Markovian reward functions stant $ is not. Remember that $ means ‘The behavior we compactly, and how to exploit this compact representation want to reward has just happened’. Therefore the interpre- to automatically translate an NMRDP into an equivalent tation of $ depends on the behavior B we want to reward XMDP amenable to our favorite solution methods. The (whatever that is), and consequently the modelling relation goal of this paper is to provide a reward function specifica- |= stating whether a formula holds at the i-th stage of an tion language and a translation that are adapted to the any- arbitrary sequence Γ ∈ S ω, is indexed by B. Defining |= time state-based solution methods previously mentioned.
is the first step in our description of the semantics: We take these problems in turn in the next two sections.
, i) |= $ iff Γ(i) ∈ B REWARDING BEHAVIORS
, i) |= p, for p ∈ P, iff p ∈ Γi Representing non-Markovian reward functions compactly reduces to compactly representing the behaviors of in- , i) |= ¬p, for p ∈ P, iff p ∈ Γ where by behavior we mean a set of fi- , i) |= f nite sequences of states (a subset of S ∗), e.g. the 1 ∧ f2 iff (Γ, i) |= { s0, s1 , s0, s0, s1 , s0, s0, s0, s1 . . .} in Figure 1. Re- , i) |= f1 ∨ f2 iff (Γ, i) |= f1 or (Γ, i) |= f2 call that we get rewarded at the end of any prefix Γ(i) in , i) |= f iff (Γ, i + 1) |= f that set. Once behaviors are compactly represented, it is straightforward to represent non-Markovian reward func- , i) |= f1 Uf2 iff ∀k ≥ i tions as mappings from behaviors to real numbers – we if (∀j, i ≤ j ≤ k , j) |= f2) then (Γ, k) |= f1 shall defer looking at this until Section 3.5.
Note that except for subscript B and for the first rule, this is To represent behaviors compactly, we adopt a version of fu- just the standard FLTL semantics, and that therefore $-free ture linear temporal logic (FLTL) augmented with a propo- formulae keep their FLTL meaning. As with FLTL, we say sitional constant ‘$’, intended to be read ‘The behavior we Γ |= f iff (Γ, 0) |= f, and |= f iff Γ |= f for all Γ ∈ Sω.
want to reward has just happened’ or ‘The reward is re- ceived now’. The language $FLTL begins with a set of The modelling relation |= can be seen as specifying when basic propositions P giving rise to literals: a formula holds, on which reading it takes B as input. Our next and final step is to use the |= relation to define, for a formula f , the behavior Bf that it represents, and for this and stand for ‘true’ and ‘false’, respectively.
we must rather assume that f holds, and then solve for B.
The connectives are classical and , and the temporal (next) and U (weak until), giving formulae: every time p is true. We would like Bf to be the set of all F ::= L | F ∧ F | F ∨ F | F | F U F finite sequences ending with a state containing p. For anarbitrary f , we take B Because our ‘until’ is weak (f f to be the set of prefixes that have 1 U f2 means f1 will be true to be rewarded if f is to hold in all sequences: from now on until f2 is, if ever), we can define the usefuloperator f ≡ f U (f will always be true Definition 2 B
To understand Definition 2, recall that B contains prefixes REWARD NORMALITY
at the end of which we get a reward and $ evaluates to true.
Since f is supposed to describe the way rewards will be re- $FLTL is so expressive that it is possible to write formulae ceived in an arbitrary sequence, we are interested in behav- which describe “unnatural” allocations of rewards. For in- iors B which make $ true in such a way as to make f hold stance, they may make rewards depend on future behaviors regardless of the sequence considered. However, there may rather than on the past, or they may leave open a choice be many behaviors with this property, so we take their inter- as to which of several behaviors is to be rewarded. 3 An f will only reward a prefix if it has to because that prefix is in every behavior satisfying f . In ward now if p is going to hold next. We call such formula all but pathological cases (see Section 3.3), this makes B reward-unstable. What a reward-stable f amounts to is that coincide with the (set-inclusion) minimal behavior B such whether a particular prefix needs to be rewarded in order to that |= f . The reason for this ‘stingy’ semantics, making make f true does not depend on the future of the sequence.
rewards minimal, is that f does not actually say that re- (p → $) (¬p → $) which wards are allocated to more prefixes than are required for says we should either reward all achievements of the goal (p → $) says only that a reward p or reward achievements of ¬p but does not determine is given every time p is true, even though a more generous which. We call such formula reward-indeterminate. What distribution of rewards would be consistent with it.
a reward-determinate f amounts to is that the set of behav-iors modelling f , i.e. {B | |= f }, has a unique minimum.
If it does not, Bf is insufficient (too small) to make f true.
It is intuitively clear that many behaviors can be specified In (Thi´ebaux et al., 2002), we show that formulae that are by means of $FLTL formulae. There is a list in (Bacchus both reward-stable and reward-determinate – we call them et al., 1996) of behaviors expressible in PLTL which it reward-normal – are precisely those that capture the notion might be useful to reward. All of those examples are ex- of “no funny business”. This is this intuition that we ask the pressible naturally in $FLTL, as follows.
reader to note, as it will be needed in the rest of the paper.
Just for reference then, we define: A simple example is the classical goal formula g sayingthat a goal p is rewarded whenever it happens: Definition 3 f is reward-normal iff for every Γ ∈ S ω and
As mentioned earlier, Bg is the set of finite sequences of every B ⊆ S∗ Γ |= f iff Bf ∩ pre(Γ) ⊆ B. states such that p holds in the last state. If we only care that p is achieved once and get rewarded at each state from While reward-abnormal formulae may be interesting, for (p → $). The behavior that this for- present purposes we restrict attention to reward-normal mula represents is the set of finite state sequences having at ones. Naturally, all formulae in Section 3.2 are normal.
least one state in which p holds. By contrast, the formula¬p U (p ∧ $) stipulates that only the first occurrence of p $FLTL FORMULA PROGRESSION
is rewarded (i.e. it specifies the behavior in Figure 1). Toreward the occurrence of p at most once every k steps, we Having defined a language to represent behaviors to be re- (( k+1p ∧ ¬ ≤kp) → k+1$).
warded, we now turn to the problem of computing, given areward formula, a minimum allocation of rewards to states For response formulae, where the achievement of p is trig- actually encountered in an execution sequence, in such a way as to satisfy the formula. Because we ultimately wish reward every state in which p is true following the first issue to use anytime solution methods which generate state se- of the command. To reward only the first occurrence p after quences incrementally via forward search, this computa- (c → (¬p U (p ∧ $))). As for tion is best done on the fly, while the sequence is being bounded variants for which we only reward goal achieve- generated. We therefore devise an incremental algorithm ment within k steps of the command, we write for example inspired from a model-checking technique normally used ≤k(p → $)) to reward all such states in which p to check whether a state sequence is a model of an FLTL formula (Bacchus and Kabanza, 1998). This technique is It is also worth noting how to express simple behaviors known as formula progression because it ‘progresses’ or involving past tense operators. To stipulate a reward if p ‘pushes’ the formula through the sequence.
has always been true, we write $ U ¬p. To say that we In essence, our progression algorithm computes the mod- are rewarded if p has been true since q was, we write elling relation |= given in Section 3.1, but unlike the def- 3These difficulties are inherent in the use of linear-time for- 2If there is no B such that |= f, which is the case for any malisms in contexts where the principle of directionality must be $-free f which is not a logical theorem, then Bf is enforced. They are shared for instance by formalisms developed This limiting case is a little artificial, but since such formulae do for reasoning about actions such as the Event Calculus and LTL not describe the attribution of rewards, it does no harm.
action theories, see e.g. (Calvanese et al., 2002).
inition of |= , it is designed to be useful when states in (Γ, 0) |= f iff (Γ, 1) |= Prog(b0, Γ0, f0), where f0 = f the sequence become available one at a time, in that it de- and b0 stands for Γ(0) ∈ B. So B must be such that fers the evaluation of the part of the formula that refers to (Γ, 1) |= Prog(b0, Γ0, f0). To ensure minimality, we first the future to the point where the next state becomes avail- assume that Γ(0) ∈ B, i.e. b0 is false, and compute able. Let Γi be a state, say the last state of the sequence Prog(false, Γ0, f0). If the result is , then since no mat- prefix Γ(i) that has been generated so far, and let b be a ter what Γ1 turns out to be (Γ, 1) |= , we know that the boolean true iff Γ(i) is in the behavior B to be rewarded.
assumption about b0 being false does not suffice to sat- The progression of the $FLTL formula f through Γ i given isfy f . The only way to get f to hold is to assign a re- b, written Prog(b, Γ , f ), is a new formula satisfying the ward to Γ(0), so we take Γ(0) to be in B, i.e. b0 is true, following property. Where b ⇔ (Γ(i) ∈ B), we have: and set the formula to be considered in the next state tof1 = Prog(true, Γ0, f0). If on the other hand the result is Property 1 , i) |= f iff , i + 1) |= Prog(b, Γ , f )
not , then we need not reward Γ(0) to make f hold, so wetake Γ(0) not to be in B and set f1 = Prog(false, Γ0, f0).
That is, given that b tells us whether or not to reward Γ(i), i iff the new formula Prog(b, Γi When Γ1 becomes available, we can iterate this reasoning to compute the smallest value of b1 such that (Γ, 1) |= 1 and that of the corresponding f2 = Prog(b1, Γ1, f1). Andso on: progression through a sequence of states defines a Algorithm 1 $FLTL Progression
sequence of booleans b0, b1, . . . and a sequence of formu- lae f0, f1, . . . . When Γi becomes available, we can com- i such that (Γ, i) |= corresponding fi+1. The value of bi represents Γ(i) ∈ Bf and tells us whether we should allocate a reward at that iff p ∈ s and otherwise stage, while fi+1 is the new formula with which to iterate iff p ∈ s and otherwise the process. In Algorithm 1, the function Rew takes Γ i and Prog(b, s, f1 ∧ f2) = Prog(b, s, f1) Prog(b, s, f2) fi as parameter, and returns bi by computing the value of Prog(b, s, f1 ∨ f2) = Prog(b, s, f1) Prog(b, s, f2) i). The function $Prog takes Γi and fi as i+1 by calling Prog(bi Γi Prog(b, s, f1 U f2) = Prog(b, s, f2) (Prog(b, s, f1) ∧ f1 Uf2) The following theorem states that under weak assumptions,rewards are correctly allocated by progression: = true iff Prog(false, s, f ) = Theorem 1 Let f be reward-normal, and let f0, f1, . . .
be the result of progressing it through the successive states
of a sequence
Γ. Then, provided no fi is ⊥, for all i
This is to be matched with the definition of |= in Sec- i) iff Γ(i) ∈ Bf . tion 3.1. Whenever |= evaluates a subformula whose truth only depends on the current state, Prog does the same The premiss of the theorem is that f does not eventually progress to . Indeed if fi = for some i, it means that |= evaluates a subformula whose truth depends on future even rewarding Γ(i) does not suffice to make f true, so states, Prog defers the evaluation by returning a new sub- something must have gone wrong: at some earlier stage, formula to be evaluated in the next state. Note that Prog is the boolean b was made false where it should have been computable in linear time in the length of f , and that for made true. The usual explanation is that the original f $-free formulae, it collapses to FLTL formula progression (Bacchus and Kabanza, 1998), regardless of the value of b.
reward unstable, progresses to in the next state if p istrue there: regardless of Γ Like |= , the function Prog assumes that B is known, but of course we only have f and one new state at a 0 = false, and f1 = ¬p, so if p ∈ Γ1 then f2 = .
However, other (admittedly bizarre) possibilities exist: for time of Γ, and what we really want to do is compute p → $ is reward-unstable, its substi- the appropriate B, namely that represented by f .
$, which also progresses to in a similarly as in Section 3.1, we now turn to the second few steps, is logically equivalent to $ and is reward-normal.
step, which is to use Prog to decide on the fly whethera newly generated sequence prefix Γ(i) is in Bf and so If the progression method is to deliver the correct minimal should be allocated a reward. This amounts to incremen- behavior in all cases (even in all reward-normal cases) it has tally computing Bf ∩ pre(Γ), which provided f is reward to backtrack on the choice of values for the b is. In the inter- normal, is the minimal behavior B such that (Γ, 0) |= f .
est of efficiency, we choose not to allow backtracking. In- stead, our algorithm raises an exception whenever a reward formula progresses to , and informs the user of the se- SOLVING NMRDPs
quence which caused the problem. The onus is thus placedon the domain modeller to select sensible reward formulae TRANSLATION INTO XMDP
so as avoid possible progression to . It should be notedthat in the worst case, detecting reward-normality cannot We now exploit the compact representation of a non- be easier than the decision problem for $FLTL so it is not Markovian reward function as a reward function specifi- to be expected that there will be a simple syntactic criterion cation to translate an NMRDP into an equivalent XMDP for reward-normality. In practice, however, commonsense amenable to state-based anytime solution methods. Recall precautions such as avoiding making rewards depend ex- from Section 2.3 that each e-state in the XMDP is labeled plicitly on future tense expressions suffice to keep things by a state of the NMRDP and by history information suf- ficient to determine the immediate reward. In the case ofa compact representation as a reward function specificationφ REWARD FUNCTIONS
0, this additional information can be summarized by the progression of φ0 through the sequence of states passed With the language defined so far, we are able to compactly through. So an e-state will be of the form s, φ , where represent behaviors. The extension to a non-Markovian re- s ∈ S is a state, and φ ⊆ $FLTL × IR is a reward function ward function is straightforward. We represent such a func- specification (obtained by progression). Two e-states s, φ tion by a set φ ⊆ $FLTL × IR of formulae associated with and t, ψ are equal if s = t, the immediate rewards are the real valued rewards. We call φ a reward function specifi- same, and the results of progressing φ and ψ through s are cation. Where formula f is associated with reward r in φ, we write ‘(f : r) ∈ φ’. The rewards are assumed to beindependent and additive, so that the reward function R Definition 5 Let D = S, s0, A, Pr, R be an NMRDP,
and φ0 be a reward function specification representing R(i.e., Rφ = R , see Definition 4). We translate D into the Definition 4 (Γ(i)) =
XMDP D = S , s0, A , Pr , R defined as follows: E.g, if φ is {¬p U p ∧ $ : 5.2, (q → a reward of 5.2 the first time that p holds, a reward of 7.3 from the first time that q holds onwards, a reward of 12.5 3. A ( s, φ ) = A(s) when both conditions are met, and 0 in otherwise.
4. If a ∈ A ( s, φ ), then Pr ( s, φ , a, s , φ ) = Again, we can progress a reward function specification φto compute the reward at all stages i of Γ. As before, pro- Pr(s, a, s ) if φ = SProg(s, φ) 0, φ1, . . . of reward function If a ∈ A ( s, φ ), then Pr ( s, φ , a, •) is undefined. the function that applies Prog to all formulae in a rewardfunction specification: SProg(s, φ) = {(Prog(s, f ) : r) | (f : r) ∈ φ} Then, the total reward received at stage i is simply the sum Item 1 says that the e-states are labeled by a state and a of the real-valued rewards granted by the progression func- reward function specification. Item 2 says that the initial tion to the behaviors represented by the formulae in φ e-state is labeled with the initial state and with the original reward function specification. Item 3 says that an action is applicable in an e-state if it is applicable in the state label- ing it. Item 4 explains how successor e-states are and their By proceeding that way, we get the expected analog of The- probabilities are computed. Given an action a applicable orem 1, which states progression correctly computes non- in an e-state s, φ , each successor e-state will be labeled by a successor state s of s via a in the NMRDP and bythe progression of φ through s. The probability of that e- Theorem 2 Let φ be a reward-normal4 reward function
state is Pr(s, a, s ) as in the NMRDP. Note that the cost of specification, and let φ0, φ1 . . . be the result of pro- computing Pr is linear in that of computing Pr and in the gressing it through the successive states of a sequenceΓ sum of the lengths of the formulae in φ. Item 5 has been Then, provided (: r) ∈ φi for any i, then )} = (Γ(i)). It is easy to show that this translation leads to an equivalent XMDP, as defined in Definition 1. Obviously, the function We extend the definition of reward-normality to reward spec- ification functions the obvious way, by requiring that all reward τ required for Definition1 is given by τ( s, φ ) = s, and then the proof is a matter of checking conditions.
The size of the XMDP obtained, i.e. the number of e-states R(Γ(i − 1); ∆) if ∈ D(s) it contains is a key issue for us, as it has to be amenable to state-based solution methods. Ideally, we would like the Blind minimality is similar, except that, since there is no XMDP to be of minimal size. However, we do not know looking ahead, no distinction can be drawn between feasi- of a method building the minimal equivalent XMDP incre- ble trajectories and others in the future of s: mentally, adding parts as required by the solution method.
Definition 6 Let S be the set of e-states in an equivalent
And since in the worst case even the minimal XMDP can XMDP D for an NMRDP D = S, s0, A, Pr, R . D is be larger than the NMRDP by a factor exponential in the blind minimal iff for each e-state s, r ∈ S there exists a length of the reward formulae (Bacchus et al., 1996), con- prefix Γ(i) ∈ D(s0) such that Γ structing it entirely would nullify the interest of anytime i = s and for all ∈ S ∗: R(Γ(i − 1); ∆) if ∆ However, as we now explain, Definition 5 leads to an equiv-alent XMDP exhibiting a relaxed notion of minimality, and Theorem 4 Let D be the translation of D as in Defini-
which is amenable to incremental construction. By inspec- tion 5. D is a blind minimal equivalent XMDP for D. tion, we may observe that wherever an e-state s, φ hasa successor s , φ via action a, this means that in order EMBEDDED SOLUTION/CONSTRUCTION
to succeed in rewarding the behaviors described in φ by Blind minimality is essentially the best achievable with means of execution sequences that start by going from s tos anytime state-based solution methods which typically ex- via a, it is necessary that the future starting with s suc- tend their envelope one step forward without looking ceeds in rewarding the behaviors described in φ . If s, φ deeper into the future. Our translation into a blind-minimal is in the minimal equivalent XMDP, and if there really are XMDP can be trivially embedded in any of these solution such execution sequences succeeding in rewarding the be- methods. This will result in an ‘on-line construction’ of the haviors described in φ, then s , φ must also be in the min- XMDP: the method will entirely drive the construction of imal XMDP. That is, construction by progression can only those parts of the XMDP which it feels the need to explore, introduce e-states which are a priori needed. Note that an and leave the others implicit. If time is short, a subopti- e-state that is a priori needed may not really be needed: mal or even incomplete policy may be returned, but only there may in fact be no execution sequence using the avail- a fraction of the state and expanded state spaces will be able actions that exhibits a given behavior. For instance, constructed. Note that the solution method should raise an exception as soon as one of the reward formulae progresses every time command p is issued, we will be rewarded k to , i.e., as soon as an expanded state s, φ is built such steps later provided q is true then. Obviously, whether p is that (: r) ∈ φ, since this acts as a detector of unsuitable true at some stage affects the way future states should be rewarded. However, if k steps from there a state satisfyingq can never be reached, then a posteriori p is irrelevant, and To the extent enabled by blind minimality, our approach al- there was no need to label e-states differently according to lows for a dynamic analysis of the reward formulae, much whether p was true or not. To detect such cases, we would as in (Bacchus et al., 1997). Indeed, only the execution have to look perhaps quite deep into feasible futures. Hence sequences realisable under a particular policy actually ex- the relaxed notion which we call blind minimality does not plored by the solution method contribute to the analysis of always coincide with absolute minimality.
rewards for that policy. Specifically, the reward formulaegenerated by progression for a given policy are determined We now formalise the difference between true and blind by the prefixes of the execution sequences realisable under minimality. To simplify notation (avoiding functions like this policy. This dynamic analysis is particularly useful, the τ of Definition 1), we represent each e-state as a pair since relevance of reward formulae to particular policies where s ∈ S and r is a function from S ∗ to IR intu- (e.g. the optimal policy) cannot be detected a priori.
itively assigning rewards to sequences in the NMRDP start-ing from s. A given s may be paired with several functions The forward-chaining planner TLPlan (Bacchus and Ka- r corresponding to relevantly different histories of s. The banza, 2000) introduced the idea of using FLTL to spec- XMDP is minimal if every such r is needed to distinguish ify domain-specific search control knowledge and formula between reward patterns in the feasible futures of s: progression to prune unpromising sequential plans (plansviolating this knowledge) from deterministic search spaces.
Theorem 3 Let S be the set of e-states in a minimal equiv-
This has been shown to provide enormous time gains, lead- alent XMDP D for D = S, s0, A, Pr, R . Then for each ing TLPlan to win the 2002 planning competition hand- e-state s, r ∈ S there exists a prefix Γ(i) ∈ D(s0) such tailored track. Because our approach is based on progres- that Γi = s and for all ∈ S∗: sion, it provides an elegant way to exploit search control knowledge, yet in the context of decision-theoretic plan- formulae. For the labeling, two extreme cases are consid- ning. Here this results in a dramatic reduction of the frac- ered: one very simple and the other elaborate. In the sim- tion of the XMDP to be constructed and explored, and ple case, an e-state is labeled by the set of all subformu- therefore in substantially better policies by the deadline.
lae which are true at it. The computation of such simplelabels can be done forward starting from the initial state, We achieve this as follows. We specify, via a $-free formulac and so could be embedded in an anytime solution method.
0, properties which we know must be verified by paths fea- However, because the structure of the original reward for- sible under promising policies. Then we simply progressc mulae is lost when considering subformulae individually, 0 alongside the reward function specification, making e- fine distinctions between histories are drawn which are to- states triples s, φ, c where c is a $-free formula obtained tally irrelevant to the reward function. Consequently, the by progression. To prevent the solution method to apply an expanded state space easily becomes exponentially bigger action that leads to the control knowledge being violated, than the blind-minimal one. This is problematic with the the action applicability condition (item 3 in Definition 5) solution methods we consider, because size severely affects becomes: a ∈ A ( s, φ, c ) iff a ∈ A(s) and c = (the their performance in solution quality.
other changes are straightforward). For instance, the effectof the control knowledge formula In the elaborate case, a pre-processing phase uses PLTL from consideration any feasible path in which p is not fol- formula regression to find sets of subformulae as poten- lowed by q. This is detected as soon as violation occurs, tial labels for possible predecessor states, so that the sub- when the formula progresses to . Although this paper sequent generation phase builds an XMDP representing all focuses on non-Markovian rewards rather than dynamics, and only the histories which make a difference to the way it should be noted that $-free formulae can also be used to actually feasible execution sequences should be rewarded.
express non-Markovian constraints on the system’s dynam- The XMDP produced is minimal, and so in the best case ex- ics, which can be incorporated in our approach exactly as ponentially smaller than the blind-minimal one. However, the prohibitive cost of the pre-processing phase makes itunusable for anytime solution methods (it requires expo- RELATED AND FUTURE WORK
nential space and a number of iterations through the statespace exponential in the size of the reward formulae). We It is evident that our thinking about solving NMRDPs and do not consider that any method based on PLTL and regres- the use of temporal logic to represent them draws on (Bac- sion will achieve a meaningful relaxed notion of minimality chus et al., 1996). Both this paper and (Bacchus et al., without a costly pre-processing phase. Our main contribu- 1997) advocate the use of PLTL over a finite past to spec- tion is an approach based on FLTL and progression which ify non-Markovian rewards. In the PLTL style of specifi- does precisely that, letting the solution method resolve the cation, we describe the past conditions under which we get tradeoff between quality and cost in a principled way inter- rewarded now, while with $FLTL we describe the condi- mediate between the two extreme suggestions above.
tions on the present and future under which future states The structured representation and solution methods tar- will be rewarded. While the behaviors and rewards maybe the same in each scheme, the naturalness of thinking in geted by (Bacchus et al., 1997) differ from the any- one style or the other depends on the case. Letting the kids time state-based solution methods our framework primar-ily aims at, in particular in that they do not require explicit have a strawberry dessert because they have been good allday fits naturally into a past-oriented account of rewards, state enumeration at all (Boutilier et al., 2000; Hoey et al.,1999). Accordingly, the translation into XMDP given in whereas promising that they may watch a movie if they tidytheir room (indeed, making sense of the whole notion of (Bacchus et al., 1997) keeps the state and expanded state promising) goes more naturally with FLTL. One advantage space implicit, and amounts to adding temporal variablesto the problems together with the decision-tree describing of the PLTL formulation is that it trivially enforces the prin-ciple that present rewards do not depend on future states.
their dynamics. It is very efficient but rather crude: the In $FLTL, this responsibility is placed on the domain mod- encoded history features do not even vary from one state tothe next, which strongly compromises the minimality of the On the other hand, the greater expressive power of $FLTL opens the possibility of considering a richer class XMDP.5 However, non-minimality is not as problematic aswith the state-based approaches, since structured solution of decision processes, e.g. with uncertainty as to whichrewards are received (the dessert or the movie) and when methods do not enumerate states and are able to dynami- (some time next week, before it rains). This is a topic for cally ignore some of the variables that become irrelevant atsome point of policy construction.
future work. At any rate, as we now explain, $FLTL is bet-ter suited than PLTL to solving NMRDPs using anytime 5(Chomicki, 1995) uses a similar approach to extend a database with auxiliary relations containing additional informa-tion sufficient to check temporal integrity constraints. As there is (Bacchus et al., 1996) proposes a method whereby an e- only ever one sequence of databases, what matters is more the size state is labeled by a set of subformulae of the PLTL reward of these relations than avoiding making redundant distinctions.
In another sense, too, our work represents a middle way, References
combining the advantages conferred by state-based and Bacchus, F., Boutilier, C., and Grove, A. (1996). Rewarding be- structured approaches, e.g. by (Bacchus et al., 1996) on haviors. In Proc. AAAI-96, pages 1160–1167.
one side, and (Bacchus et al., 1997) on the other. From the Bacchus, F., Boutilier, C., and Grove, A. (1997). Structured so- former we inherit a meaningful notion of minimality. As lution methods for non-markovian decision processes. InProc. AAAI-97, pages 112–117.
with the latter, approximate solution methods can be used Bacchus, F. and Kabanza, F. (1998). Planning for temporally ex- and can perform a restricted dynamic analysis of the reward tended goals. Annals of Mathematics and Artificial Intelli- formulae. In virtue of the size of the XMDP produced, the translation proposed in (Bacchus et al., 1997) is clearly Bacchus, F. and Kabanza, F. (2000). Using temporal logic to ex- unsuitable to anytime state-based methods. The question press search control knowledge for planning. Artificial In-telligence, 116(1-2).
of the appropriateness of our translation to structured solu- Barto, A., Bardtke, S., and Singh, S. (1995). Learning to act us- tion methods, however, cannot be settled as clearly. On the ing real-time dynamic programming. Artificial Intelligence, one hand, our approach does not preclude the exploitation of a structured representation of system’s states,6 and for- Boutilier, C., Dean, T., and Hanks, S. (1999). Decision-theoretic mula progression enables even state-based methods to ex- planning: Structural assumptions and computational lever-age.
In Journal of Artificial Intelligence Research, vol- ploit some of the structure in ‘$FLTL space’. On the other hand, the gap between blind and true minimality indicates Boutilier, C., Dearden, R., and Goldszmidt, M. (2000). Stochastic that progression alone is insufficient to always fully exploit dynamic programming with factored representations. Artifi- that latter structure (reachability is not exploited). With our cial Intelligence, 121(1-2):49–107.
translation, even structured solution methods will not rem- Calvanese, D., De Giacomo, G., and Vardi, M. (2002). Reasoning about actions and planning in LTL action theories. In Proc. edy this. There is a hope that (Bacchus et al., 1997) is able to take advantage of the full structure of the reward func- Chomicki, J. (1995). Efficient checking of temporal integrity con- tion, but also a possibility that it will fail to exploit even as straints using bounded history encoding. ACM Transactions much structure as our approach would, as efficiently.
on Database Systems, 10(2):149-186.
Dean, T., Kaelbling, L., Kirman, J., and Nicholson, A. (1995).
The most important item for future work is an empirical Planning under time constraints in stochastic domains. Ar- comparison of the three approaches in view to answering tificial Intelligence, 76:35–74.
this question and identifying the domain features favoring Drummond, M. (1989). Situated control rules. In Proc. KR-89, one over the other. Ours has been fully implemented as Feng, Z. and Hansen, E. (2002). Symbolic LAOsearch for fac- an extension (rewards, probabilities) of TLplan’s planning tored markov decision processes. In Proc. AIPS-02 Work- language, and also includes functions and bounded quan- shop on Planning via Model-Checking. To appear.
tification (Bacchus and Kabanza, 2000). To allow for com- Haddawy, P. and Hanks, S. (1992). Representations for decision- theoretic planning: Utility functions and deadline goals. In parisons to be reported in a longuer version of this paper, we are in the process of implementing the other two ap- Hansen, E. and Zilberstein, S. (2001). LAO: A heuristic search proaches, no implementation of either of them being re- algorithm that finds solutions with loops. Artificial Intelli- ported in the literature. Another exciting area for future work is the investigation of temporal logic formalisms for Hoey, J., St-Aubin, R., Hu, A., and Boutilier, C. (1999). SPUDD: stochastic planning using decision diagrams. In Proc. UAI- specifying heuristics for NMRDPs or more generally for search problems with temporally extended goals, as good Howard, R. (1960). Dynamic Programming and Markov Pro- heuristics are important to some of the solution methods we cesses. MIT Press, Cambridge, MA.
are targeting. Related to this is the problem of extending, to Korf, R. (1990). Real-time heuristic search. Artificial Intelligence, temporally extended goals, the GOALP predicate (Bacchus Kushmerick, N., Hanks, S., and Weld, D. (1995). An algorithm and Kabanza, 2000) which is the key to the specification for probabilistic planning. Artificial Intelligence, 76:239– of reusable control knowledge in the case of reachability goals. Finally, we should investigate the precise relation- Pistore, M. and Traverso, P. (2001). Planning as model-checking ship between our line of work and recent work on planning for extended goals in non-deterministic domains. In Proc. for weak temporally extended goals in non-deterministic Pistore, M. and Traverso, P. (2002). Planning with a language for domains, such as attempted reachability and maintenance extended goals. In Proc. AAAI-02.
Thi´ebaux, S., Hertzberg, J., Shoaff, W., and Schneider, M. (1995).
A stochastic model of actions and plans for anytime plan- Acknowledgements
ning under uncertainty. International Journal of Intelligent Rajeev Gor´e, Charles Gretton, David Price, Leonore Zuck, and reviewers for useful discussions and comments.
Thi´ebaux, S., Kabanza, F., and Slaney, J. (2002).
checking approach to decision-theoretic planning with 6Symbolic implementations of the solution methods we con- sider, e.g. (Feng and Hansen, 2002), as well as formula progres- tralian National University, Computer Sciences Laboratory.
sion in the context of symbolic state representations (Pistore and
Traverso, 2001) could be investigated for that purpose.


Microsoft word - frequently asked questions boat tours 2013 final.doc

PORT OF LOS ANGELES SCHOOL BOAT TOUR PROGRAM FREQUENTLY ASKED QUESTIONS Who is my Port of Los Angeles School Boat Tour contact? Trisha Malahni, Direct number: (310) 732-3960 Public Relations Division main line: (310) 732-3508 Email: [email protected] How can I find information about the School Boat Tour Program? Schools interested in participating in the Port of Los An

[courrier - 30] courrier/courrier<pag30-31> . 30-09-05

Prevenir a pandemia Um negócio de milhões A ameaça estendeu-se ao homem e fez do medicamento para a gripeTamiflu uma mina de ouro para o gigante farmacêutico Roche H Á dias estive em Washington e mentar da NATO sobre a preparaçãoda segurança civil em caso de ata- as pessoas queiram comprar os nossos produtos e o mes- mo se aplica ao Tamiflu.» «Claro. Por isso,

© 2010-2018 Modern Medicine