Anytime State-Based Solution Methods for Decision Processes with non-Markovian Rewards Sylvie Thi´ebaux Froduald Kabanza John Slaney Abstract
of the request being made, or even simply for the veryfirst achievement of a goal which becomes irrelevant af-
A popular approach to solving a decision pro-
terwards. A decision process in which rewards depend on
cess with non-Markovian rewards (NMRDP) is
the sequence of states passed through rather than merely
to exploit a compact representation of the re-
on the current state is called a decision process with non-
ward function to automatically translate the NM-
Markovian rewards (NMRDP) (Bacchus et al., 1996).
RDP into an equivalent Markov decision pro-cess (MDP) amenable to our favorite MDP so-
A difficulty with NMRDPs is that the most efficient MDP
lution method. The contribution of this paper is
solution methods do not directly apply to them. The tradi-
a representation of non-Markovian reward func-
tional way to circumvent this problem is to formulate the
tions and a translation into MDP aimed at mak-
NMRDP as an equivalent MDP, whose states are those of
ing the best possible use of state-based any-
the underlying system expanded to encode enough history-
time algorithms as the solution method.
dependent information to determine the rewards. Hand
explicitly constructing and exploring only parts
crafting such an expanded MDP (XMDP) can however be
of the state space, these algorithms are able to
very difficult in general. This is exacerbated by the fact
trade computation time for policy quality, and
that the size of the XMDP impacts on the effectiveness of
have proven quite effective in dealing with large
many solution methods. Therefore, there has been interest
MDPs. Our representation extends future linear
in automating the translation into an XMDP, starting from a
temporal logic (FLTL) to express rewards. Our
natural specification of non-Markovian rewards and of the
translation has the effect of embedding model-
system’s dynamics (Bacchus et al., 1996; Bacchus et al.,
checking in the solution method. It results in
1997). This is the problem we focus on.
an MDP of the minimal size achievable with-
When solving NMRDPs in this setting, the central issue is
out stepping outside the anytime framework, and
to define a non-Markovian reward specification language
consequently in better policies by the deadline.
and a translation into an XMDP adapted to the class ofMDP solution methods and representations we would like
INTRODUCTION
to use for the type of problems at hand. The two previousproposals within this line of research both rely on past lin-
Markov decision processes (MDPs) are now widely ac-
ear temporal logic (PLTL) formulae to specify the behav-
cepted as the preferred model for decision-theoretic plan-
iors to be rewarded (Bacchus et al., 1996; Bacchus et al.,
ning problems (Boutilier et al., 1999). The fundamental as-
1997), but adopt two very different translations adapted to
sumption behind the MDP formulation is that not only the
two very different types of solution methods and represen-
system dynamics but also the reward function are Marko-
tations. The translation in (Bacchus et al., 1996) targets
vian. Therefore, all information needed to determine the
classical state-based solution methods such as policy iter-
reward at a given state must be encoded in the state itself.
ation (Howard, 1960) which generate complete policies atthe cost of enumerating all states in the entire MDP, while
This requirement is not always easy to meet for plan-
that in (Bacchus et al., 1997) targets structured solution
ning problems, as many desirable behaviors are naturally
methods and representations, which do not require explicit
expressed as properties of execution sequences, see e.g.
state enumeration, see e.g. (Boutilier et al., 2000).
(Drummond, 1989; Haddawy and Hanks, 1992; Bacchusand Kabanza, 1998; Pistore and Traverso, 2001). Typ-
The aim of the present paper is to provide a language and
ical cases include rewards for the maintenance of some
a translation adapted to another class of solution methods
property, for the periodic achievement of some goal, for
which have proven quite effective in dealing with large
the achievement of a goal within a given number of steps
MDPs, namely anytime state-based methods such as (Barto
et al., 1995; Dean et al., 1995; Thi´ebaux et al., 1995;
BACKGROUND
Hansen and Zilberstein, 2001). These methods typicallystart with a compact representation of the MDP based on
probabilistic planning operators, and search forward froman initial state, constructing new states by expanding the
A Markov decision process of the type we consider is a
envelope of the policy as time permits. They may produce
5-tuple S, s0, A, Pr, R , where S is a finite set of fully ob-
an approximate and even incomplete policy, but only ex-
servable states, s0 ∈ S is the initial state, A is a finite set
plicitly construct and explore a fraction of the MDP. Nei-
of actions (A(s) denotes the subset of actions applicable in
ther of the two previous proposals is well-suited to such so-
s ∈ S), {Pr(s, a, •) | s∈S, a∈A(s)} is a family of proba-
lution methods, the first because the cost of the translation
bility distributions over S, such that Pr(s, a, s ) is the prob-
(most of which is performed prior to the solution phase)
ability of being in state s after performing action a in state
annihilates the benefits of anytime algorithms, and the sec-
s, and R : S → IR is a reward function such that R(s) is the
ond because the size of the XMDP obtained is an obstacle
immediate reward for being in state s. It is well known that
to the applicability of state-based methods. Since here both
such an MDP can be compactly represented using prob-
the cost of the translation and the size of the XMDP it re-
abilistic extensions of traditional planning languages, see
sults in will severely impact on the quality of the policy
e.g., (Kushmerick et al., 1995; Thi´ebaux et al., 1995).
obtainable by the deadline, we need an appropriate resolu-
A stationary policy for an MDP is a function π : S → A,
tion of the tradeoff between the two.
such that π(s) ∈ A(s) is the action to be executed in state
Our approach has the following main features. The transla-
S. We note E(π) the envelope of the policy, that is the set
tion is entirely embedded in the anytime solution method,
of states that are reachable (with a non-zero probability)
to which full control is given as to which parts of the
from the initial state s0 under the policy. If π is defined at
XMDP will be explicitly constructed and explored. While
all s ∈ E(π), we say that the policy is complete, and that it
the XMDP obtained is not minimal, it is of the minimal size
is incomplete otherwise. We note F (π) the set of states in
achievable without stepping outside of the anytime frame-
E(π) at which π is undefined. F (π) is called the fringe of
work, i.e., without enumerating parts of the state or ex-
the policy. We stipulate that the fringe states are absorbing.
panded state spaces that the solution method would not nec-
The value of a policy π at any state s ∈ E(π), noted Vπ(s)
essarily explore. This relaxed notion of minimality, which
is the sum of the expected rewards to be received at each
we call blind minimality is the most appropriate in the con-
future time step, discounted by how far into the future they
text of anytime state-based solution methods.
occur. That is, for a non-fringe state s ∈ E(π) \ F (π):
When the rewarding behaviors are specified in PLTL, there
Vπ(s) = R(s) + β
Pr(s, π(s), s )Vπ(s )
does not appear to be a way of achieving a relaxed no-
tion of minimality as powerful as blind minimality with-
where 0 ≤ β ≤ 1 is the discounting factor controlling
out a prohibitive translation. Therefore instead of PLTL,
the contribution of distant rewards. For a fringe state s ∈
we adopt a variant of future linear temporal logic (FLTL)
F (π), Vπ(s) is heuristic or is the value at s of a complete
as our specification language, which we extend to handle
default policy to be executed in absence of an explicit one.
rewards. While the language has a more complex seman-
For the type of MDP we consider, the value of a policy π is
tics than PLTL, it enables a natural translation into a blind-
the value Vπ(s0) of π at the initial state s0, and the larger
minimal XMDP by simple progression of the reward for-
mulae. Moreover, search control knowledge expressed inFLTL (Bacchus and Kabanza, 2000) fits particularly nicely
STATE-BASED ANYTIME ALGORITHMS
in this model-checking framework, and can be used to dra-matically reduce the fraction of the search space explored
Traditional state-based solution methods such as policy it-
eration (Howard, 1960) can be used to produce an optimal
The paper is organised as follows. Section 2 begins with
complete policy. Policy iteration can also be viewed as an
background material on MDPs, NMRDPs, XMDPs, and
anytime algorithm, which returns a complete policy whose
anytime state-based solution methods. Section 3 describes
value increases with computation time and converges to op-
our language for specifying non-Markovian rewards and
timal. The main drawback of policy iteration is that it ex-
the progression algorithm. Section 4 defines our translation
plicitly enumerates all states that are reachable from s0 in
into an XMDP along with the concept of blind minimal-
the entire MDP. Therefore, there has been interest in other
ity it achieves, and presents our approach to the embedded
anytime solution methods, which may produce incomplete
construction and solution of the XMDP. Finally, Section 5,
policies, but only enumerate an increasing fraction of the
provides a detailed comparison with previous approaches,
and concludes with some remarks about future work. The
For instance, (Dean et al., 1995) describes methods which
proofs of the theorems appear in (Thi´ebaux et al., 2002).
deploy policy iteration on judiciously chosen larger and
In the initial state s0, p is false and two
A decision process with non-Markovian rewards is identi-
actions are possible: a causes a transition
cal to an MDP except that the domain of the reward func-
to s1 with probability 0.1, and no change
with probability 0.9, while for b the transi-
tion is S∗. The idea is that if the process has passed through
tion probabilities are 0.5. In state s
state sequence Γ(i) up to stage i, then the reward R(Γ(i))
true, and actions c and d (“stay” and “go”)
is received at stage i. Figure 1 gives an example. Like the
A reward is received the first time p is true,
reward function, a policy for an NMRDP depends on his-
but not subsequently. That is, the rewarded
tory, and is a mapping from S ∗ to A. As before, the value
of policy π is the expectation of the discounted cumulative
(Γ(i)) | π, Γ0 = s0
larger envelopes. Another example is (Thi´ebaux et al.,
The clever algorithms developed to solve MDPs cannot be
1995), in which a backtracking forward search in the space
directly applied to NMRDPs. One way of dealing with this
of (possibly incomplete) policies rooted at s 0 is performed
problem is to formulate the NMRDP as an equivalent MDP
until interrupted, at which point the best policy found so
with an expanded state space (Bacchus et al., 1996). The
far is returned. Real-time dynamic programming (RTDP)
expanded states in this XMDP (e-states, for short) augment
(Barto et al., 1995), is another popular anytime algorithm,
the states of the NMRDP by encoding additional informa-
which is to MDPs what learning real-time A∗ (Korf, 1990)
tion sufficient to make the reward history-independent. An
is to deterministic domains. It can be run on-line, or off-
e-state can be seen as labeled by a state of the NMRDP
line for a given number of steps or until interrupted. A
(via the function τ in Definition 1 below) and by history
more recent example is the LAO∗ algorithm (Hansen and
information. The dynamics of NMRDPs being Markovian,
Zilberstein, 2001) which combines dynamic programming
the actions and their probabilistic effects in the XMDP
are exactly those of the NMRDP. The following definition,
All these algorithms eventually converge to the optimal
adapted from (Bacchus et al., 1996), makes this concept of
policy but need not necessarily explore the entire state
equivalent XMDP precise. Figure 2 gives an example.
space to guarantee optimality.1 When interrupted beforeconvergence, they return a possibly incomplete but often
Definition 1 MDP D = S , s0, A , Pr, R is an equivalent
useful policy. Another common point of these approaches
expansion (or XMDP) for NMRDP D = S, s0, A, Pr, R
is that they perform a forward search, starting from sif there exists a mapping τ : S → S such that:
repeatedly expanding the envelope of the current policy one
1. τ (s0) = s0.
step forward. Since planning operators are used to com-pactly represent the state space, these methods will only
2. For all s ∈ S , A (s ) = A(τ (s )).
explicitly construct a subset of the MDP. In this paper, we
3. For all s1, s2 ∈ S, if there is a ∈ A(s1) such that
will use these solution methods to solve decision processes
Pr(s1, a, s2) > 0, then for all s1 ∈ S such that
with non-Markovian rewards which we define next. τ(s1)=s1, there exists a unique s2∈S , τ(s2)=s2, suchthat for all a ∈ A (s1), Pr (s1, a, s2)=Pr(s1, a, s2).NMRDPs AND EQUIVALENT XMDPs 4. For any feasible state sequence Γ ∈ D(s0) and
We first need some notation. Let S ∗ be the set of finite
Γ ∈ D (s0) such that τ(Γ ) = Γ
sequences of states over S, and S ω be the set of possibly
(Γ ) = R(Γ(i)) for all i.
infinite state sequences. In the following, where ‘Γ’ standsfor a possibly infinite state sequence in S ω and i is a natural
Items 1–3 ensure that there is a bijection between feasi-
ble state sequences in the NMRDP and feasible e-state se-
i’ we mean the state of index i in Γ, by ‘Γ(i)’
quences in the XMDP. Therefore, a stationary policy for the
0, . . . , Γi ∈ S∗ of Γ, and by pre(Γ)
we mean the set of finite prefixes of Γ. Γ
XMDP can be reinterpreted as a non-stationary policy for
the NMRDP. Furthermore, item 4 ensures that the two poli-
1 ∈ S∗ and Γ2 ∈ Sω. For a decision
cies have identical values, and that consequently, solving
0, A, Pr, R and a state s ∈ S, D(s)
stands for the set of state sequences rooted at s that are
an NMRDP optimally reduces to producing an equivalent
feasible under the actions in D, that is: D(s) = {Γ ∈
XMDP and solving it optimally (Bacchus et al., 1996):
Sω | Γ0 = s and ∀i ∃a ∈ A(Γ
Γi+1) > 0}. Proposition 1 Let D be an NMRDP, D an equivalent
Note that the definition of D(s) does not depend on R and
XMDP for it, and π a policy for D . Let π be the func-
therefore also stands for NMRDPs which we describe now. tion defined on the sequence prefixes Γ(i) ∈ D(s0) by
1This is also true of the basic envelope expansion algorithm in
π(Γ(i)) = π (Γ ), where for all j ≤ i τ(Γ ) = Γ
(Dean et al., 1995), under the same conditions as for LAO∗. is a policy for D such that Vπ(s0) = Vπ (s0).
from now on). We also adopt the notations
modality (f will be true in exactly k steps),
i for 1 ≤ i ≤ k (f will be
true within the next k steps), and
if for 1 ≤ i ≤ k (f will be true at all the next k steps).
Although negation officially occurs only in literals, i.e., the
formulae are in negation formal form (NNF), we allow our-
selves to write formulae involving it in the usual way, pro-
vided that they have an equivalent in NNF. Not every for-
mula has such an equivalent, because there is no such literal
as ¬$ and because eventualities (‘f will be true some time’)
are not expressible. These restrictions are deliberate.
Figure 2: An XMDP equivalent to the NMRDP in Figure1. τ (s0) = τ(s2) = s0. τ(s1) = τ(s3) = s1. State s1 is
The semantics of this language is similar to that of FLTL,
rewarded; the other three states are not.
with an important difference: unlike the interpretation ofthe propositional constants in P, which is fixed (i.e. each
When solving NMRDPs in this setting, the two key is-
state is a fixed subset of P), the interpretation of the con-
sues are how to specify non-Markovian reward functions
stant $ is not. Remember that $ means ‘The behavior we
compactly, and how to exploit this compact representation
want to reward has just happened’. Therefore the interpre-
to automatically translate an NMRDP into an equivalent
tation of $ depends on the behavior B we want to reward
XMDP amenable to our favorite solution methods. The
(whatever that is), and consequently the modelling relation
goal of this paper is to provide a reward function specifica-
|= stating whether a formula holds at the i-th stage of an
tion language and a translation that are adapted to the any-
arbitrary sequence Γ ∈ S ω, is indexed by B. Defining |=
time state-based solution methods previously mentioned.
is the first step in our description of the semantics:
We take these problems in turn in the next two sections. • (Γ, i) |= $ iff Γ(i) ∈ BREWARDING BEHAVIORS • (Γ, i) |=B• (Γ, i) |= ⊥LANGUAGE AND SEMANTICS • (Γ, i) |= p, for p ∈ P, iff p ∈ Γi
Representing non-Markovian reward functions compactly
reduces to compactly representing the behaviors of in-
• (Γ, i) |= ¬p, for p ∈ P, iff p ∈ Γ
where by behavior we mean a set of fi-
• (Γ, i) |= f
nite sequences of states (a subset of S ∗), e.g. the
1 ∧ f2 iff (Γ, i) |=
{ s0, s1 , s0, s0, s1 , s0, s0, s0, s1 . . .} in Figure 1. Re-
• (Γ, i) |= f1 ∨ f2 iff (Γ, i) |= f1 or (Γ, i) |= f2
call that we get rewarded at the end of any prefix Γ(i) in
• (Γ, i) |= f iff (Γ, i + 1) |= f
that set. Once behaviors are compactly represented, it is
straightforward to represent non-Markovian reward func-
• (Γ, i) |= f1 Uf2 iff ∀k ≥ i
tions as mappings from behaviors to real numbers – we
if (∀j, i ≤ j ≤ k (Γ, j) |= f2) then (Γ, k) |= f1
shall defer looking at this until Section 3.5.
Note that except for subscript B and for the first rule, this is
To represent behaviors compactly, we adopt a version of fu-
just the standard FLTL semantics, and that therefore $-free
ture linear temporal logic (FLTL) augmented with a propo-
formulae keep their FLTL meaning. As with FLTL, we say
sitional constant ‘$’, intended to be read ‘The behavior we
Γ |= f iff (Γ, 0) |= f, and |= f iff Γ |= f for all Γ ∈ Sω.
want to reward has just happened’ or ‘The reward is re-
ceived now’. The language $FLTL begins with a set of
The modelling relation |= can be seen as specifying when
basic propositions P giving rise to literals:
a formula holds, on which reading it takes B as input. Our
next and final step is to use the |= relation to define, for a
formula f , the behavior Bf that it represents, and for this
and ⊥ stand for ‘true’ and ‘false’, respectively.
we must rather assume that f holds, and then solve for B.
The connectives are classical ∧ and ∨, and the temporal
(next) and U (weak until), giving formulae:
every time p is true. We would like Bf to be the set of all
F ::= L | F ∧ F | F ∨ F | F | F U F
finite sequences ending with a state containing p. For anarbitrary f , we take B
Because our ‘until’ is weak (ff to be the set of prefixes that have
1 U f2 means f1 will be true
to be rewarded if f is to hold in all sequences:
from now on until f2 is, if ever), we can define the usefuloperator
f ≡ f U ⊥ (f will always be true
Definition 2 B
To understand Definition 2, recall that B contains prefixes
REWARD NORMALITY
at the end of which we get a reward and $ evaluates to true. Since f is supposed to describe the way rewards will be re-
$FLTL is so expressive that it is possible to write formulae
ceived in an arbitrary sequence, we are interested in behav-
which describe “unnatural” allocations of rewards. For in-
iors B which make $ true in such a way as to make f hold
stance, they may make rewards depend on future behaviors
regardless of the sequence considered. However, there may
rather than on the past, or they may leave open a choice
be many behaviors with this property, so we take their inter-
as to which of several behaviors is to be rewarded. 3 An
f will only reward a prefix if it has
to because that prefix is in every behavior satisfying f . In
ward now if p is going to hold next. We call such formula
all but pathological cases (see Section 3.3), this makes Breward-unstable. What a reward-stable f amounts to is that
coincide with the (set-inclusion) minimal behavior B such
whether a particular prefix needs to be rewarded in order to
that |= f . The reason for this ‘stingy’ semantics, making
make f true does not depend on the future of the sequence.
rewards minimal, is that f does not actually say that re-
(p → $) ∨ (¬p → $) which
wards are allocated to more prefixes than are required for
says we should either reward all achievements of the goal
(p → $) says only that a reward
p or reward achievements of ¬p but does not determine
is given every time p is true, even though a more generous
which. We call such formula reward-indeterminate. What
distribution of rewards would be consistent with it.
a reward-determinate f amounts to is that the set of behav-iors modelling f , i.e. {B | |= f }, has a unique minimum. EXAMPLES
If it does not, Bf is insufficient (too small) to make f true.
It is intuitively clear that many behaviors can be specified
In (Thi´ebaux et al., 2002), we show that formulae that are
by means of $FLTL formulae. There is a list in (Bacchus
both reward-stable and reward-determinate – we call them
et al., 1996) of behaviors expressible in PLTL which it
reward-normal – are precisely those that capture the notion
might be useful to reward. All of those examples are ex-
of “no funny business”. This is this intuition that we ask the
pressible naturally in $FLTL, as follows.
reader to note, as it will be needed in the rest of the paper. Just for reference then, we define:
A simple example is the classical goal formula g sayingthat a goal p is rewarded whenever it happens:
Definition 3 f is reward-normal iff for every Γ ∈ S ω and
As mentioned earlier, Bg is the set of finite sequences of
every B ⊆ S∗ Γ |= f iff Bf ∩ pre(Γ) ⊆ B.
states such that p holds in the last state. If we only care
that p is achieved once and get rewarded at each state from
While reward-abnormal formulae may be interesting, for
(p → $). The behavior that this for-
present purposes we restrict attention to reward-normal
mula represents is the set of finite state sequences having at
ones. Naturally, all formulae in Section 3.2 are normal.
least one state in which p holds. By contrast, the formula¬p U (p ∧ $) stipulates that only the first occurrence of p$FLTL FORMULA PROGRESSION
is rewarded (i.e. it specifies the behavior in Figure 1). Toreward the occurrence of p at most once every k steps, we
Having defined a language to represent behaviors to be re-
(( k+1p ∧ ¬ ≤kp) → k+1$).
warded, we now turn to the problem of computing, given areward formula, a minimum allocation of rewards to states
For response formulae, where the achievement of p is trig-
actually encountered in an execution sequence, in such a
way as to satisfy the formula. Because we ultimately wish
reward every state in which p is true following the first issue
to use anytime solution methods which generate state se-
of the command. To reward only the first occurrence p after
quences incrementally via forward search, this computa-
(c → (¬p U (p ∧ $))). As for
tion is best done on the fly, while the sequence is being
bounded variants for which we only reward goal achieve-
generated. We therefore devise an incremental algorithm
ment within k steps of the command, we write for example
inspired from a model-checking technique normally used
≤k(p → $)) to reward all such states in which p
to check whether a state sequence is a model of an FLTL
formula (Bacchus and Kabanza, 1998). This technique is
It is also worth noting how to express simple behaviors
known as formula progression because it ‘progresses’ or
involving past tense operators. To stipulate a reward if p
‘pushes’ the formula through the sequence.
has always been true, we write $ U ¬p. To say that we
In essence, our progression algorithm computes the mod-
are rewarded if p has been true since q was, we write
elling relation |= given in Section 3.1, but unlike the def-
3These difficulties are inherent in the use of linear-time for-
2If there is no B such that |= f, which is the case for any
malisms in contexts where the principle of directionality must be
$-free f which is not a logical theorem, then Bf is
enforced. They are shared for instance by formalisms developed
This limiting case is a little artificial, but since such formulae do
for reasoning about actions such as the Event Calculus and LTL
not describe the attribution of rewards, it does no harm.
action theories, see e.g. (Calvanese et al., 2002).
inition of |= , it is designed to be useful when states in
(Γ, 0) |= f iff (Γ, 1) |= Prog(b0, Γ0, f0), where f0 = f
the sequence become available one at a time, in that it de-
and b0 stands for Γ(0) ∈ B. So B must be such that
fers the evaluation of the part of the formula that refers to
(Γ, 1) |= Prog(b0, Γ0, f0). To ensure minimality, we first
the future to the point where the next state becomes avail-
assume that Γ(0) ∈ B, i.e. b0 is false, and compute
able. Let Γi be a state, say the last state of the sequence
Prog(false, Γ0, f0). If the result is ⊥, then since no mat-
prefix Γ(i) that has been generated so far, and let b be a
ter what Γ1 turns out to be (Γ, 1) |= ⊥, we know that the
boolean true iff Γ(i) is in the behavior B to be rewarded.
assumption about b0 being false does not suffice to sat-
The progression of the $FLTL formula f through Γ i given
isfy f . The only way to get f to hold is to assign a re-
b, written Prog(b, Γ , f
), is a new formula satisfying the
ward to Γ(0), so we take Γ(0) to be in B, i.e. b0 is true,
following property. Where b ⇔ (Γ(i) ∈ B), we have:
and set the formula to be considered in the next state tof1 = Prog(true, Γ0, f0). If on the other hand the result is
Property 1 (Γ, i) |= f iff (Γ, i + 1) |= Prog(b, Γ , f )
not ⊥, then we need not reward Γ(0) to make f hold, so wetake Γ(0) not to be in B and set f1 = Prog(false, Γ0, f0).
That is, given that b tells us whether or not to reward Γ(i),
i iff the new formula Prog(b, Γi
When Γ1 becomes available, we can iterate this reasoning
to compute the smallest value of b1 such that (Γ, 1) |= 1
and that of the corresponding f2 = Prog(b1, Γ1, f1). Andso on: progression through a sequence of states defines a
Algorithm 1 $FLTL Progression
sequence of booleans b0, b1, . . . and a sequence of formu-
lae f0, f1, . . . . When Γi becomes available, we can com-
i such that (Γ, i) |=
corresponding fi+1. The value of bi represents Γ(i) ∈ Bf
and tells us whether we should allocate a reward at that
iff p ∈ s and ⊥ otherwise
stage, while fi+1 is the new formula with which to iterate
iff p ∈ s and ⊥ otherwise
the process. In Algorithm 1, the function Rew takes Γ i and
Prog(b, s, f1 ∧ f2) = Prog(b, s, f1) ∧ Prog(b, s, f2)
fi as parameter, and returns bi by computing the value of
Prog(b, s, f1 ∨ f2) = Prog(b, s, f1) ∨ Prog(b, s, f2)
i). The function $Prog takes Γi and fi as
i+1 by calling Prog(bi Γi
Prog(b, s, f1 U f2) = Prog(b, s, f2) ∨
(Prog(b, s, f1) ∧ f1 Uf2)
The following theorem states that under weak assumptions,rewards are correctly allocated by progression:
= true iff Prog(false, s, f ) = ⊥Theorem 1 Let f be reward-normal, and let f0, f1, . . . be the result of progressing it through the successive states of a sequence Γ. Then, provided no fi is ⊥, for all i
This is to be matched with the definition of |= in Sec-
i) iff Γ(i) ∈ Bf .
tion 3.1. Whenever |= evaluates a subformula whose
truth only depends on the current state, Prog does the same
The premiss of the theorem is that f does not eventually
progress to ⊥. Indeed if fi = ⊥ for some i, it means that
|= evaluates a subformula whose truth depends on future
even rewarding Γ(i) does not suffice to make f true, so
states, Prog defers the evaluation by returning a new sub-
something must have gone wrong: at some earlier stage,
formula to be evaluated in the next state. Note that Prog is
the boolean b was made false where it should have been
computable in linear time in the length of f , and that for
made true. The usual explanation is that the original f
$-free formulae, it collapses to FLTL formula progression
(Bacchus and Kabanza, 1998), regardless of the value of b.
reward unstable, progresses to ⊥ in the next state if p istrue there: regardless of Γ
Like |= , the function Prog assumes that B is known,
but of course we only have f and one new state at a
0 = false, and f1 = ¬p, so if p ∈ Γ1 then f2 = ⊥.
However, other (admittedly bizarre) possibilities exist: for
time of Γ, and what we really want to do is computep → $ is reward-unstable, its substi-
the appropriate B, namely that represented by f . → $, which also progresses to ⊥ in a
similarly as in Section 3.1, we now turn to the second
few steps, is logically equivalent to $ and is reward-normal.
step, which is to use Prog to decide on the fly whethera newly generated sequence prefix Γ(i) is in Bf and so
If the progression method is to deliver the correct minimal
should be allocated a reward. This amounts to incremen-
behavior in all cases (even in all reward-normal cases) it has
tally computing Bf ∩ pre(Γ), which provided f is reward
to backtrack on the choice of values for the b is. In the inter-
normal, is the minimal behavior B such that (Γ, 0) |= f .
est of efficiency, we choose not to allow backtracking. In-
stead, our algorithm raises an exception whenever a reward
formula progresses to ⊥, and informs the user of the se-
SOLVING NMRDPs
quence which caused the problem. The onus is thus placedon the domain modeller to select sensible reward formulae
TRANSLATION INTO XMDP
so as avoid possible progression to ⊥. It should be notedthat in the worst case, detecting reward-normality cannot
We now exploit the compact representation of a non-
be easier than the decision problem for $FLTL so it is not
Markovian reward function as a reward function specifi-
to be expected that there will be a simple syntactic criterion
cation to translate an NMRDP into an equivalent XMDP
for reward-normality. In practice, however, commonsense
amenable to state-based anytime solution methods. Recall
precautions such as avoiding making rewards depend ex-
from Section 2.3 that each e-state in the XMDP is labeled
plicitly on future tense expressions suffice to keep things
by a state of the NMRDP and by history information suf-
ficient to determine the immediate reward. In the case ofa compact representation as a reward function specificationφREWARD FUNCTIONS
0, this additional information can be summarized by the
progression of φ0 through the sequence of states passed
With the language defined so far, we are able to compactly
through. So an e-state will be of the form s, φ , where
represent behaviors. The extension to a non-Markovian re-
s ∈ S is a state, and φ ⊆ $FLTL × IR is a reward function
ward function is straightforward. We represent such a func-
specification (obtained by progression). Two e-states s, φ
tion by a set φ ⊆ $FLTL × IR of formulae associated with
and t, ψ are equal if s = t, the immediate rewards are the
real valued rewards. We call φ a reward function specifi-
same, and the results of progressing φ and ψ through s are
cation. Where formula f is associated with reward r in φ,
we write ‘(f : r) ∈ φ’. The rewards are assumed to beindependent and additive, so that the reward function RDefinition 5 Let D = S, s0, A, Pr, R be an NMRDP, and φ0 be a reward function specification representing R(i.e., Rφ = R, see Definition 4). We translate D into theDefinition 4 Rφ(Γ(i)) = XMDP D = S , s0, A , Pr , R defined as follows:
E.g, if φ is {¬p U p ∧ $ : 5.2, (q →
a reward of 5.2 the first time that p holds, a reward of 7.3
from the first time that q holds onwards, a reward of 12.5
3. A ( s, φ ) = A(s)
when both conditions are met, and 0 in otherwise. 4. If a ∈ A ( s, φ ), then Pr ( s, φ , a, s , φ ) =
Again, we can progress a reward function specification φto compute the reward at all stages i of Γ. As before, pro-
Pr(s, a, s ) if φ = SProg(s, φ)
0, φ1, . . . of reward function
If a ∈ A ( s, φ ), then Pr ( s, φ , a, •) is undefined.
the function that applies Prog to all formulae in a rewardfunction specification:
SProg(s, φ) = {(Prog(s, f ) : r) | (f : r) ∈ φ}
Then, the total reward received at stage i is simply the sum
Item 1 says that the e-states are labeled by a state and a
of the real-valued rewards granted by the progression func-
reward function specification. Item 2 says that the initial
tion to the behaviors represented by the formulae in φ
e-state is labeled with the initial state and with the original
reward function specification. Item 3 says that an action is
applicable in an e-state if it is applicable in the state label-
ing it. Item 4 explains how successor e-states are and their
By proceeding that way, we get the expected analog of The-
probabilities are computed. Given an action a applicable
orem 1, which states progression correctly computes non-
in an e-state s, φ , each successor e-state will be labeled
by a successor state s of s via a in the NMRDP and bythe progression of φ through s. The probability of that e-
Theorem 2 Let φ be a reward-normal4 reward function
state is Pr(s, a, s ) as in the NMRDP. Note that the cost of
specification, and let φ0, φ1 . . . be the result of pro-
computing Pr is linear in that of computing Pr and in the
gressing it through the successive states of a sequenceΓ
sum of the lengths of the formulae in φ. Item 5 has been
Then, provided (⊥ : r) ∈ φi for any i, then
)} = Rφ(Γ(i)).
It is easy to show that this translation leads to an equivalent
XMDP, as defined in Definition 1. Obviously, the function
We extend the definition of reward-normality to reward spec-
ification functions the obvious way, by requiring that all reward
τ required for Definition1 is given by τ( s, φ ) = s, and
then the proof is a matter of checking conditions. BLIND MINIMALITY
The size of the XMDP obtained, i.e. the number of e-states
R(Γ(i − 1); ∆) if ∆ ∈ D(s)
it contains is a key issue for us, as it has to be amenable
to state-based solution methods. Ideally, we would like the
Blind minimality is similar, except that, since there is no
XMDP to be of minimal size. However, we do not know
looking ahead, no distinction can be drawn between feasi-
of a method building the minimal equivalent XMDP incre-
ble trajectories and others in the future of s:
mentally, adding parts as required by the solution method. Definition 6 Let S be the set of e-states in an equivalent
And since in the worst case even the minimal XMDP can
XMDP D for an NMRDP D = S, s0, A, Pr, R . D is
be larger than the NMRDP by a factor exponential in the
blind minimal iff for each e-state s, r ∈ S there exists a
length of the reward formulae (Bacchus et al., 1996), con-
prefix Γ(i) ∈ D(s0) such that Γ
structing it entirely would nullify the interest of anytime
i = s and for all ∆ ∈ S ∗:R(Γ(i − 1); ∆) if ∆
However, as we now explain, Definition 5 leads to an equiv-alent XMDP exhibiting a relaxed notion of minimality, and
Theorem 4 Let D be the translation of D as in Defini-
which is amenable to incremental construction. By inspec-
tion 5. D is a blind minimal equivalent XMDP for D.
tion, we may observe that wherever an e-state s, φ hasa successor s , φ
via action a, this means that in order
EMBEDDED SOLUTION/CONSTRUCTION
to succeed in rewarding the behaviors described in φ by
Blind minimality is essentially the best achievable with
means of execution sequences that start by going from s tos
anytime state-based solution methods which typically ex-
via a, it is necessary that the future starting with s suc-
tend their envelope one step forward without looking
ceeds in rewarding the behaviors described in φ . If s, φ
deeper into the future. Our translation into a blind-minimal
is in the minimal equivalent XMDP, and if there really are
XMDP can be trivially embedded in any of these solution
such execution sequences succeeding in rewarding the be-
methods. This will result in an ‘on-line construction’ of the
haviors described in φ, then s , φ must also be in the min-
XMDP: the method will entirely drive the construction of
imal XMDP. That is, construction by progression can only
those parts of the XMDP which it feels the need to explore,
introduce e-states which are a priori needed. Note that an
and leave the others implicit. If time is short, a subopti-
e-state that is a priori needed may not really be needed:
mal or even incomplete policy may be returned, but only
there may in fact be no execution sequence using the avail-
a fraction of the state and expanded state spaces will be
able actions that exhibits a given behavior. For instance,
constructed. Note that the solution method should raise an
exception as soon as one of the reward formulae progresses
every time command p is issued, we will be rewarded k
to ⊥, i.e., as soon as an expanded state s, φ is built such
steps later provided q is true then. Obviously, whether p is
that (⊥ : r) ∈ φ, since this acts as a detector of unsuitable
true at some stage affects the way future states should be
rewarded. However, if k steps from there a state satisfyingq can never be reached, then a posteriori p is irrelevant, and
To the extent enabled by blind minimality, our approach al-
there was no need to label e-states differently according to
lows for a dynamic analysis of the reward formulae, much
whether p was true or not. To detect such cases, we would
as in (Bacchus et al., 1997). Indeed, only the execution
have to look perhaps quite deep into feasible futures. Hence
sequences realisable under a particular policy actually ex-
the relaxed notion which we call blind minimality does not
plored by the solution method contribute to the analysis of
always coincide with absolute minimality.
rewards for that policy. Specifically, the reward formulaegenerated by progression for a given policy are determined
We now formalise the difference between true and blind
by the prefixes of the execution sequences realisable under
minimality. To simplify notation (avoiding functions like
this policy. This dynamic analysis is particularly useful,
the τ of Definition 1), we represent each e-state as a pair
since relevance of reward formulae to particular policies
where s ∈ S and r is a function from S ∗ to IR intu-
(e.g. the optimal policy) cannot be detected a priori.
itively assigning rewards to sequences in the NMRDP start-ing from s. A given s may be paired with several functions
The forward-chaining planner TLPlan (Bacchus and Ka-
r corresponding to relevantly different histories of s. The
banza, 2000) introduced the idea of using FLTL to spec-
XMDP is minimal if every such r is needed to distinguish
ify domain-specific search control knowledge and formula
between reward patterns in the feasible futures of s:
progression to prune unpromising sequential plans (plansviolating this knowledge) from deterministic search spaces. Theorem 3 Let S be the set of e-states in a minimal equiv-
This has been shown to provide enormous time gains, lead-
alent XMDP D for D = S, s0, A, Pr, R . Then for each
ing TLPlan to win the 2002 planning competition hand-
e-state s, r ∈ S there exists a prefix Γ(i) ∈ D(s0) such
tailored track. Because our approach is based on progres-
that Γi = s and for all ∆ ∈ S∗:
sion, it provides an elegant way to exploit search control
knowledge, yet in the context of decision-theoretic plan-
formulae. For the labeling, two extreme cases are consid-
ning. Here this results in a dramatic reduction of the frac-
ered: one very simple and the other elaborate. In the sim-
tion of the XMDP to be constructed and explored, and
ple case, an e-state is labeled by the set of all subformu-
therefore in substantially better policies by the deadline.
lae which are true at it. The computation of such simplelabels can be done forward starting from the initial state,
We achieve this as follows. We specify, via a $-free formulac
and so could be embedded in an anytime solution method.
0, properties which we know must be verified by paths fea-
However, because the structure of the original reward for-
sible under promising policies. Then we simply progressc
mulae is lost when considering subformulae individually,
0 alongside the reward function specification, making e-
fine distinctions between histories are drawn which are to-
states triples s, φ, c where c is a $-free formula obtained
tally irrelevant to the reward function. Consequently, the
by progression. To prevent the solution method to apply an
expanded state space easily becomes exponentially bigger
action that leads to the control knowledge being violated,
than the blind-minimal one. This is problematic with the
the action applicability condition (item 3 in Definition 5)
solution methods we consider, because size severely affects
becomes: a ∈ A ( s, φ, c ) iff a ∈ A(s) and c = ⊥ (the
their performance in solution quality.
other changes are straightforward). For instance, the effectof the control knowledge formula
In the elaborate case, a pre-processing phase uses PLTL
from consideration any feasible path in which p is not fol-
formula regression to find sets of subformulae as poten-
lowed by q. This is detected as soon as violation occurs,
tial labels for possible predecessor states, so that the sub-
when the formula progresses to ⊥. Although this paper
sequent generation phase builds an XMDP representing all
focuses on non-Markovian rewards rather than dynamics,
and only the histories which make a difference to the way
it should be noted that $-free formulae can also be used to
actually feasible execution sequences should be rewarded.
express non-Markovian constraints on the system’s dynam-
The XMDP produced is minimal, and so in the best case ex-
ics, which can be incorporated in our approach exactly as
ponentially smaller than the blind-minimal one. However,
the prohibitive cost of the pre-processing phase makes itunusable for anytime solution methods (it requires expo-
RELATED AND FUTURE WORK
nential space and a number of iterations through the statespace exponential in the size of the reward formulae). We
It is evident that our thinking about solving NMRDPs and
do not consider that any method based on PLTL and regres-
the use of temporal logic to represent them draws on (Bac-
sion will achieve a meaningful relaxed notion of minimality
chus et al., 1996). Both this paper and (Bacchus et al.,
without a costly pre-processing phase. Our main contribu-
1997) advocate the use of PLTL over a finite past to spec-
tion is an approach based on FLTL and progression which
ify non-Markovian rewards. In the PLTL style of specifi-
does precisely that, letting the solution method resolve the
cation, we describe the past conditions under which we get
tradeoff between quality and cost in a principled way inter-
rewarded now, while with $FLTL we describe the condi-
mediate between the two extreme suggestions above.
tions on the present and future under which future states
The structured representation and solution methods tar-
will be rewarded. While the behaviors and rewards maybe the same in each scheme, the naturalness of thinking in
geted by (Bacchus et al., 1997) differ from the any-
one style or the other depends on the case. Letting the kids
time state-based solution methods our framework primar-ily aims at, in particular in that they do not require explicit
have a strawberry dessert because they have been good allday fits naturally into a past-oriented account of rewards,
state enumeration at all (Boutilier et al., 2000; Hoey et al.,1999). Accordingly, the translation into XMDP given in
whereas promising that they may watch a movie if they tidytheir room (indeed, making sense of the whole notion of
(Bacchus et al., 1997) keeps the state and expanded state
promising) goes more naturally with FLTL. One advantage
space implicit, and amounts to adding temporal variablesto the problems together with the decision-tree describing
of the PLTL formulation is that it trivially enforces the prin-ciple that present rewards do not depend on future states.
their dynamics. It is very efficient but rather crude: the
In $FLTL, this responsibility is placed on the domain mod-
encoded history features do not even vary from one state tothe next, which strongly compromises the minimality of the
On the other hand, the greater expressive power of
$FLTL opens the possibility of considering a richer class
XMDP.5 However, non-minimality is not as problematic aswith the state-based approaches, since structured solution
of decision processes, e.g. with uncertainty as to whichrewards are received (the dessert or the movie) and when
methods do not enumerate states and are able to dynami-
(some time next week, before it rains). This is a topic for
cally ignore some of the variables that become irrelevant atsome point of policy construction.
future work. At any rate, as we now explain, $FLTL is bet-ter suited than PLTL to solving NMRDPs using anytime
5(Chomicki, 1995) uses a similar approach to extend a
database with auxiliary relations containing additional informa-tion sufficient to check temporal integrity constraints. As there is
(Bacchus et al., 1996) proposes a method whereby an e-
only ever one sequence of databases, what matters is more the size
state is labeled by a set of subformulae of the PLTL reward
of these relations than avoiding making redundant distinctions.
In another sense, too, our work represents a middle way,
References
combining the advantages conferred by state-based and
Bacchus, F., Boutilier, C., and Grove, A. (1996). Rewarding be-
structured approaches, e.g. by (Bacchus et al., 1996) on
haviors. In Proc. AAAI-96, pages 1160–1167.
one side, and (Bacchus et al., 1997) on the other. From the
Bacchus, F., Boutilier, C., and Grove, A. (1997). Structured so-
former we inherit a meaningful notion of minimality. As
lution methods for non-markovian decision processes. InProc. AAAI-97, pages 112–117.
with the latter, approximate solution methods can be used
Bacchus, F. and Kabanza, F. (1998). Planning for temporally ex-
and can perform a restricted dynamic analysis of the reward
tended goals. Annals of Mathematics and Artificial Intelli-
formulae. In virtue of the size of the XMDP produced,
the translation proposed in (Bacchus et al., 1997) is clearly
Bacchus, F. and Kabanza, F. (2000). Using temporal logic to ex-
unsuitable to anytime state-based methods. The question
press search control knowledge for planning. Artificial In-telligence, 116(1-2).
of the appropriateness of our translation to structured solu-
Barto, A., Bardtke, S., and Singh, S. (1995). Learning to act us-
tion methods, however, cannot be settled as clearly. On the
ing real-time dynamic programming. Artificial Intelligence,
one hand, our approach does not preclude the exploitation
of a structured representation of system’s states,6 and for-
Boutilier, C., Dean, T., and Hanks, S. (1999). Decision-theoretic
mula progression enables even state-based methods to ex-
planning: Structural assumptions and computational lever-age.
In Journal of Artificial Intelligence Research, vol-
ploit some of the structure in ‘$FLTL space’. On the other
hand, the gap between blind and true minimality indicates
Boutilier, C., Dearden, R., and Goldszmidt, M. (2000). Stochastic
that progression alone is insufficient to always fully exploit
dynamic programming with factored representations. Artifi-
that latter structure (reachability is not exploited). With our
cial Intelligence, 121(1-2):49–107.
translation, even structured solution methods will not rem-
Calvanese, D., De Giacomo, G., and Vardi, M. (2002). Reasoning
about actions and planning in LTL action theories. In Proc.
edy this. There is a hope that (Bacchus et al., 1997) is able
to take advantage of the full structure of the reward func-
Chomicki, J. (1995). Efficient checking of temporal integrity con-
tion, but also a possibility that it will fail to exploit even as
straints using bounded history encoding. ACM Transactions
much structure as our approach would, as efficiently. on Database Systems, 10(2):149-186.
Dean, T., Kaelbling, L., Kirman, J., and Nicholson, A. (1995).
The most important item for future work is an empirical
Planning under time constraints in stochastic domains. Ar-
comparison of the three approaches in view to answering
tificial Intelligence, 76:35–74.
this question and identifying the domain features favoring
Drummond, M. (1989). Situated control rules. In Proc. KR-89,
one over the other. Ours has been fully implemented as
Feng, Z. and Hansen, E. (2002). Symbolic LAO∗ search for fac-
an extension (rewards, probabilities) of TLplan’s planning
tored markov decision processes. In Proc. AIPS-02 Work-
language, and also includes functions and bounded quan-
shop on Planning via Model-Checking. To appear.
tification (Bacchus and Kabanza, 2000). To allow for com-
Haddawy, P. and Hanks, S. (1992). Representations for decision-
theoretic planning: Utility functions and deadline goals. In
parisons to be reported in a longuer version of this paper,
we are in the process of implementing the other two ap-
Hansen, E. and Zilberstein, S. (2001). LAO∗: A heuristic search
proaches, no implementation of either of them being re-
algorithm that finds solutions with loops. Artificial Intelli-
ported in the literature. Another exciting area for future
work is the investigation of temporal logic formalisms for
Hoey, J., St-Aubin, R., Hu, A., and Boutilier, C. (1999). SPUDD:
stochastic planning using decision diagrams. In Proc. UAI-
specifying heuristics for NMRDPs or more generally for
search problems with temporally extended goals, as good
Howard, R. (1960). Dynamic Programming and Markov Pro-
heuristics are important to some of the solution methods we
cesses. MIT Press, Cambridge, MA.
are targeting. Related to this is the problem of extending, to
Korf, R. (1990). Real-time heuristic search. Artificial Intelligence,
temporally extended goals, the GOALP predicate (Bacchus
Kushmerick, N., Hanks, S., and Weld, D. (1995). An algorithm
and Kabanza, 2000) which is the key to the specification
for probabilistic planning. Artificial Intelligence, 76:239–
of reusable control knowledge in the case of reachability
goals. Finally, we should investigate the precise relation-
Pistore, M. and Traverso, P. (2001). Planning as model-checking
ship between our line of work and recent work on planning
for extended goals in non-deterministic domains. In Proc.
for weak temporally extended goals in non-deterministic
Pistore, M. and Traverso, P. (2002). Planning with a language for
domains, such as attempted reachability and maintenance
extended goals. In Proc. AAAI-02.
Thi´ebaux, S., Hertzberg, J., Shoaff, W., and Schneider, M. (1995).
A stochastic model of actions and plans for anytime plan-
Acknowledgements
ning under uncertainty. International Journal of Intelligent
Rajeev Gor´e, Charles Gretton, David Price, Leonore Zuck,
and reviewers for useful discussions and comments.
Thi´ebaux, S., Kabanza, F., and Slaney, J. (2002).
checking approach to decision-theoretic planning with
6Symbolic implementations of the solution methods we con-
sider, e.g. (Feng and Hansen, 2002), as well as formula progres-
tralian National University, Computer Sciences Laboratory.
sion in the context of symbolic state representations (Pistore and
http://csl.anu.edu.au/∼thiebaux/papers/nmr.ps.gz.
Traverso, 2001) could be investigated for that purpose.
PORT OF LOS ANGELES SCHOOL BOAT TOUR PROGRAM FREQUENTLY ASKED QUESTIONS Who is my Port of Los Angeles School Boat Tour contact? Trisha Malahni, Direct number: (310) 732-3960 Public Relations Division main line: (310) 732-3508 Email: [email protected] How can I find information about the School Boat Tour Program? Schools interested in participating in the Port of Los An
Prevenir a pandemia Um negócio de milhões A ameaça estendeu-se ao homem e fez do medicamento para a gripeTamiflu uma mina de ouro para o gigante farmacêutico Roche H Á dias estive em Washington e mentar da NATO sobre a preparaçãoda segurança civil em caso de ata- as pessoas queiram comprar os nossos produtos e o mes- mo se aplica ao Tamiflu.» «Claro. Por isso,