Markov Decision Process
A stationary discrete Markov Decision Process (MDP) is characterized by the tuple $(\mathcal{X},\mathcal{A},P,r,\gamma)$, where
- $\mathcal{X}$ is the finite set of states;
- $\mathcal{A}$ is the finite set of actions;
- $P : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to [0,1]$ is the transition probability function, such that $P(x,a,x^+)$ is the probability of ending up in state $x^+ \in \mathcal{X}$ from state $x \in \mathcal{X}$ when taking action $a \in \mathcal{A}$;
- $r : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the reward function, so that $r(x,x^+)$ returns the reward for
transitioning from state $x \in \mathcal{X}$ to state $x^+ \in \mathcal{X}$;
- $\gamma \in [0,1)$ is the discount factor [5].
The aim is to find a policy, i.e., a function $\pi : \mathcal{X} \to \mathcal{A}$, returning the best action for any given state. A solution concept for an MDP is the value function, $v^{\pi} : \mathcal{X} \to \mathbb{R}$, defined as
\[\begin{equation} v^{\pi}(x) = \overbrace{\sum_{x^+ \in \mathcal{X}} P(x,\pi(x),x^+) \left( r(x,x^+) + \gamma v^{\pi}(x^+) \right)}^{=:\mathsf{T}(v^{\pi})(x)} \end{equation}\]
returning the "goodness" of policy $\pi$. The expression in $(1)$ is known as the Bellman equation, and its right-hand side defines an operator $\mathsf{T}$ acting on $v^{\pi}$. It can be shown that the value function yielded by the optimal policy, $v^*$, solves the fixed-point problem $v^* = \mathsf{T}(v^*)$, where $\mathsf{T}$ now takes the maximum over actions. Therefore, the latter can be formulated as a canonical VI, with $F = \mathsf{I} - \mathsf{T}$.
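For completeness, the optimal value function satisfies the Bellman optimality equation, in which the fixed policy $\pi(x)$ is replaced by a maximization over actions; this is the form implemented in the code below:
\[v^*(x) = \max_{a \in \mathcal{A}} \sum_{x^+ \in \mathcal{X}} P(x,a,x^+) \left( r(x,x^+) + \gamma v^*(x^+) \right)\]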
Let's start by defining the problem data
using Random, LinearAlgebra
using Monviso, JuMP, Clarabel, Tullio
Random.seed!(2024)
# Number of states and actions, discount factor
num_X, num_A = 20, 10
γ = 0.8
# Transition probabilities and reward
P = let p = rand(num_X, num_A, num_X); p ./ sum(p; dims=3); end
R = rand(num_X, num_X)
Let's define the Bellman operator (as a fixed-point map), the VI mapping, and the initial solution.
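Since each $(x,a)$ pair of `P` must induce a probability distribution over successor states, it can be worth sanity-checking the normalization. A hypothetical snippet on a small instance built the same way as above (not part of the original example):

```julia
using Random
Random.seed!(2024)

# Small hypothetical instance, constructed like P above
num_X, num_A = 4, 3
P = let p = rand(num_X, num_A, num_X); p ./ sum(p; dims=3); end

# Every (state, action) pair must define a valid distribution over x⁺
@assert all(P .>= 0)
@assert all(isapprox.(sum(P; dims=3), 1.0; atol=1e-10))
```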
# Bellman operator
T(v) = maximum(@tullio q[i,j] := P[i,j,k] * (R[i,k] + γ * v[k]); dims=2)
F(x) = x .- T(x)
# VI
model = Model(Clarabel.Optimizer)
set_silent(model)
@variable(model, y[1:num_X])
vi = VI(F; y=y, model=model)
x0 = rand(num_X)
As examples, we can use the forward-reflected-backward (Monviso.frb), the projected reflected gradient (Monviso.prg), and the extra anchored gradient (Monviso.eag) methods.
max_iter = 300
step_size = 0.05
residuals = (
frb = Vector{Float32}(undef, max_iter),
prg = Vector{Float32}(undef, max_iter),
eag = Vector{Float32}(undef, max_iter)
)
# Copy the same initial solution for the three methods
let sol_frb = (xk = x0, x1k = x0),
sol_prg = (xk = x0, x1k = x0),
sol_eag = (xk = x0,)
for k in 1:max_iter
# Forward-reflected-backward
xk1 = frb(vi, sol_frb..., step_size)
residuals.frb[k] = norm(xk1 - sol_frb.xk)
sol_frb = (xk = xk1, x1k = sol_frb.xk)
# Projected reflected gradient
xk1 = prg(vi, sol_prg..., step_size)
residuals.prg[k] = norm(xk1 - sol_prg.xk)
sol_prg = (xk = xk1, x1k = sol_prg.xk)
# Extra anchored gradient
xk1 = eag(vi, sol_eag..., x0, k, step_size)
residuals.eag[k] = norm(xk1 - sol_eag.xk)
sol_eag = (xk = xk1,)
end
end
The API docs have some detail on the convention for naming iterative steps. Let's check out the residuals for each method.
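Once one of the methods has (approximately) converged, the greedy policy can be recovered from the value iterate by an argmax over the one-step lookahead values. A sketch on a hypothetical small instance; in the example above, `v` would be the final iterate `sol_frb.xk` (the `Q` and `policy` names below are illustrative, not part of Monviso):

```julia
using Random
Random.seed!(2024)

# Hypothetical small instance mirroring the problem data above
num_X, num_A, γ = 4, 3, 0.8
P = let p = rand(num_X, num_A, num_X); p ./ sum(p; dims=3); end
R = rand(num_X, num_X)
v = rand(num_X)  # stand-in for the converged value iterate

# One-step lookahead value Q(x, a), then the maximizing action per state
Q = [sum(P[i,j,k] * (R[i,k] + γ * v[k]) for k in 1:num_X) for i in 1:num_X, j in 1:num_A]
policy = [argmax(Q[i, :]) for i in 1:num_X]

@assert length(policy) == num_X && all(1 .<= policy .<= num_A)
```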
using Plots, LaTeXStrings
plot(
1:max_iter, [residuals.frb, residuals.prg, residuals.eag];
xlabel=L"Iterations ($k$)",
ylabel=L"$||\mathbf{x}_k - \mathbf{x}_{k+1}||$",
labels=[
"Forward-reflected-backward"
"Projected reflected gradient"
"Extra anchored gradient"
] |> permutedims,
linewidth=2
)