
A note on chapter 3 of Sutton & Barto: Finite Markov Decision Process (MDP)

December 8, 2025
Danger

This article was originally written in Vietnamese. The following is an English translation created with the assistance of Gemini 3.0 Pro to make the content accessible to a broader audience. You can find the original Vietnamese post here.

Also, this post represents my personal notes and my best effort to understand and explain the deep concepts from the Sutton & Barto book: “Reinforcement Learning: An Introduction” [2]. I really love this book, and this is my attempt to make it feel more accessible.

1. The Agent-Environment Interaction

Important

Before diving into the main content, we need a proper mental model. Reinforcement Learning is not about working with a static dataset like supervised Machine Learning or Deep Learning; it is learning through interaction—a continuous loop of action and feedback.

The two main components of Reinforcement Learning (RL) are the Agent and the Environment. The environment is the world that the agent inhabits and interacts with. At each step (or time step), the agent perceives itself to be in a state of the environment and, based on that state, decides what it needs to do.

Tip

According to [1], there are two definitions we need to distinguish: state and observation. Formally, a state $s$ is a complete description of the state of the world. For example, in chess, a state $s$ tells us exactly where every piece is on the board, including the opponent’s pieces.

However, in poker, the agent only knows the cards in its own hand, not the cards held by others or those remaining in the deck. We call an incomplete description of the environment an observation $o$.

In this context, however, we assume that descriptions are always complete, so $o = s$ (the observation is the state).

Note (The Boundary Between Agent and Environment)

A crucial point Sutton & Barto make: This boundary is not physical (like the skin of a robot). Anything the Agent cannot change arbitrarily is considered part of the Environment.

Example: The robot’s battery level, the motors in a robotic arm… The Agent cannot simply command “Battery, become full,” so the Battery is part of the Environment (specifically, part of the State).

By performing an action on the environment, the agent receives a reward signal (or simply reward). The reward is a real number indicating how good or bad the current state of the environment is. The agent’s goal is to maximize the cumulative reward it receives over time. We call this accumulated reward the Return.

MDP Image
Definition (What is an MDP?)

A Markov Decision Process (MDP) is a mathematical framework used to formalize sequential decision making problems. In this problem, at each time step, the agent must make a decision (action) based on the state. Crucially, the agent’s action influences not just the immediate reward and next state, but also future rewards and states.

MDPs assume that regardless of the details of the agent (like torques, sensors, etc.) and regardless of the goal the agent is trying to achieve, any goal-directed problem can be reduced to three signals passing between the agent and its environment:

  • A signal representing the choices made by the agent (action).
  • A signal representing the basis on which choices are made (state).
  • A signal defining the agent’s goal (reward).

We can formalize this process as follows:

  • At each time step $t$, the agent receives a representation of the environment’s state, $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of all possible states.
  • Based on $S_t$, the agent selects an Action $A_t \in \mathcal{A}(S_t)$ (note that the set of available actions may differ per state).
  • One time step later (at $t+1$), the agent receives a reward $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ and finds itself in a new state $S_{t+1} \in \mathcal{S}$.

This sequence of interactions forms a Trajectory:

$$S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, S_{2}, A_{2}, \dots$$
Note

Notice that we need an initial state $S_0$ and an initial action $A_0$ to start the learning process. Choosing $S_0$ and $A_0$ can be critically important (read more here).
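
To make this loop concrete, here is a minimal Python sketch (my own illustration, not from the book) that collects a trajectory $S_0, A_0, R_1, S_1, \dots$ from a hypothetical two-state toy environment; `ToyEnv` and `random_policy` are made-up names, not a standard API.

```python
import random

class ToyEnv:
    """A hypothetical two-state environment, used only to illustrate the loop."""

    def reset(self):
        self.state = 0                      # S_0
        return self.state

    def step(self, action):
        # Made-up dynamics: action 1 usually moves us to the terminal state 1.
        next_state = 1 if (action == 1 and random.random() < 0.8) else 0
        reward = 1.0 if next_state == 1 else 0.0
        done = (next_state == 1)            # the episode ends in state 1
        self.state = next_state
        return next_state, reward, done

def random_policy(state):
    return random.choice([0, 1])            # A_t chosen uniformly at random

env = ToyEnv()
state = env.reset()
trajectory = [state]                         # S_0, A_0, R_1, S_1, A_1, R_2, ...
done = False
while not done:
    action = random_policy(state)
    next_state, reward, done = env.step(action)
    trajectory += [action, reward, next_state]
    state = next_state

print(trajectory)
```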

Definition (Dynamics of the MDP)

In a finite MDP, the sets $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$ are all finite. Therefore, the random variables $S_t$ and $R_t$ have well-defined discrete probability distributions that depend only on the preceding state and action ($S_{t-1}$, $A_{t-1}$). We call this the Dynamics of the MDP:

$$p(s', r \mid s, a) = \text{Pr} \{ S_{t} = s', R_{t} = r \mid S_{t-1} = s, A_{t-1} = a \}$$

Since $p$ is a probability distribution, the sum of probabilities for all possible pairs $(s', r)$ must equal 1:

$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \forall s \in \mathcal{S}, a \in \mathcal{A}(s)$$

In a finite MDP, the agent’s world (the environment) follows rules. These rules are probabilistic, and these rules are exactly the dynamics of the environment.

The next key point of MDPs is the Markov Property: the probability of $S_t$ and $R_t$ depends only on $S_{t-1}$ and $A_{t-1}$, not on the longer history. The current state encapsulates all information necessary to make decisions for the future.

From the dynamics pp, we can compute other important information such as:

  • State-transition probability:
$$p(s' \mid s, a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$$
  • Expected reward: The reward we expect to receive when taking action $a$ in state $s$.
$$\begin{aligned} r(s, a) &= \mathbb{E}[R_{t} \mid S_{t-1} = s, A_{t-1} = a] \\ &= \sum_{r \in \mathcal{R}} r \, p(r \mid s, a) \\ &= \sum_{r \in \mathcal{R}} r \left[ \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \right] \end{aligned}$$
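
As a concrete (and entirely made-up) illustration, the dynamics $p(s', r \mid s, a)$ of a small finite MDP can be stored as a table, and the state-transition probability and expected reward above then follow by simple summation. The sketch below assumes a hypothetical two-state, two-action MDP of my own invention:

```python
# Hypothetical dynamics table: p[(s, a)][(s_next, r)] = probability.
# Two states ("low", "high") and two actions ("wait", "search"); numbers are made up.
p = {
    ("low", "wait"):    {("low", 0.0): 1.0},
    ("low", "search"):  {("high", 1.0): 0.6, ("low", -1.0): 0.4},
    ("high", "wait"):   {("high", 0.0): 0.9, ("low", 0.0): 0.1},
    ("high", "search"): {("high", 2.0): 1.0},
}

def transition_prob(s_next, s, a):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """r(s, a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * prob for (sp, r), prob in p[(s, a)].items())

# Each conditional distribution must sum to 1.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in p.values())

print(transition_prob("high", "low", "search"))   # 0.6
print(expected_reward("low", "search"))           # 0.6 * 1.0 + 0.4 * (-1.0) = 0.2
```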

2. Goals, Rewards, and Returns

Definition (The Reward Hypothesis)

That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

Based on the reward hypothesis, to achieve a goal, we must model that goal and the problem into rewards; the agent then simply learns based on those rewards. Furthermore, it is crucial to remember that the reward is a way to tell the agent what needs to be achieved, not how to achieve it.

So, what is the Agent’s true goal? It is not to get the highest reward immediately, but to maximize the total accumulated reward in the long run. We call this accumulated sum the Return ($G_t$).

Note

To be precise, $G_t$ is the sum of rewards the agent accumulates in the future (i.e., from time step $t+1$ until the end). Why future sums instead of past sums? Recall the agent’s purpose is to maximize the reward it accumulates—like constantly looking ahead and choosing the future path that yields the most reward.

Depending on the task type, the calculation of $G_t$ differs:

  1. Episodic Tasks: Interaction breaks down into subsequences called episodes. Each episode ends in a terminal state at time $T$.
$$G_t = R_{t+1} + R_{t+2} + \dots + R_T$$
  2. Continuing Tasks: Interaction goes on forever ($T = \infty$). To prevent $G_t \to \infty$, we use Discounting with a parameter $0 \leq \gamma \leq 1$.
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
Note (Why Discounting?)

There are two simple reasons:

  • Uncertainty: We cannot be certain about the future, so rewards further down the line are less “sure” than immediate rewards.
  • Patience: In finance, for example, money held now (immediate reward) is worth more than money received later. $\gamma$ can be viewed as the agent’s “patience.” If $\gamma = 0$, the agent is myopic (greedy), caring only about the immediate reward.

Mathematically, $\gamma < 1$ ensures the infinite sum $G_t$ converges in continuing tasks (assuming the rewards are bounded). If $\gamma = 1$, the sum might not converge.

We can observe a recursive relationship:

$$\begin{aligned} G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \\ &= R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \\ &= R_{t+1} + \gamma G_{t+1} \end{aligned}$$

The formula above is a critically important recursive property. It means: The return of the current sequence equals the immediate reward plus the discounted return of the next sequence.

To define a single return formula for both cases, we use:

$$G_{t} = \sum_{k=t+1}^{T} \gamma^{k - (t + 1)} R_{k}$$

In continuing cases, $T = \infty$ and $\gamma < 1$. In episodic cases, $\gamma = 1$ (usually) and $T$ is finite.
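
Here is a tiny sketch (with made-up rewards) showing that the forward definition of the return and the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ give the same numbers:

```python
# A made-up episodic reward sequence R_1, ..., R_T and a discount factor.
rewards = [1.0, 0.0, -2.0, 3.0, 5.0]      # R_1 ... R_5, so T = 5
gamma = 0.9

# Forward definition: G_0 = sum_{k=0}^{T-1} gamma^k * R_{k+1}
g0_forward = sum(gamma**k * r for k, r in enumerate(rewards))

# Backward recursion: G_T = 0, then G_t = R_{t+1} + gamma * G_{t+1}
g = 0.0
returns = []
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns.reverse()                          # now returns[t] == G_t

print(g0_forward, returns[0])              # the two agree up to floating-point rounding
```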

3. Policies and Value Functions

How does an agent know if a state is “good”? A state is only good if it leads to a high return. But the return depends on how the agent behaves in the future. Therefore, the agent needs to know how to select and evaluate its actions.

We have Value Functions to estimate the goodness of a state (or a state-action pair). The goodness of a state ss is simply the expected return achievable from that state. Meanwhile, the Policy is the agent’s brain—it helps the agent make decisions based on the current state.

Definition (Policy)

A Policy $\pi$ is a mapping from states to probabilities of selecting each possible action. If an agent follows policy $\pi$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$ given $S_t = s$. Since $\pi(a \mid s)$ is a probability distribution, $\sum_{a \in \mathcal{A}(s)} \pi(a \mid s) = 1$.

There are two main types of Value Functions:

  1. State-value function $v_{\pi}(s)$: The expected return when starting in $s$ and following policy $\pi$ thereafter.
$$v_{\pi}(s) = \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s] = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_{t} = s \right]$$
  2. Action-value function $q_{\pi}(s, a)$: The expected return starting from $s$, taking action $a$, and then following policy $\pi$.
$$q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s, A_{t} = a] = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_t = s, A_t = a \right]$$

The Q in the famous Q-Learning algorithm (and later Deep Q-Networks - DQN) stands for this action-value function $q(s, a)$.

Note

A key phrase here is following policy $\pi$. In other words, at any state $s$, we could follow many different policies to generate an action. The Value Function gives the expected return for one specific policy $\pi$, assuming all future states are also handled using that same policy $\pi$.
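
As a rough illustration of “expected return when following $\pi$,” the sketch below estimates $v_\pi(s)$ by Monte Carlo: it samples many (truncated) trajectories under a fixed stochastic policy on a hypothetical two-state MDP and averages the discounted returns. Everything here (the dynamics table, the policy, the truncation horizon) is my own made-up example:

```python
import random

# Hypothetical two-state MDP: p[(s, a)] = list of (s_next, reward, probability),
# plus a fixed stochastic policy pi(a | s). All numbers are made up.
p = {
    ("low", "wait"):    [("low", 0.0, 1.0)],
    ("low", "search"):  [("high", 1.0, 0.6), ("low", -1.0, 0.4)],
    ("high", "wait"):   [("high", 0.0, 0.9), ("low", 0.0, 0.1)],
    ("high", "search"): [("high", 2.0, 1.0)],
}
pi = {"low": {"wait": 0.3, "search": 0.7}, "high": {"wait": 0.5, "search": 0.5}}
gamma = 0.9

def sample_return(s, horizon=100):
    """One sampled (truncated) return obtained by following pi from state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = random.choices(list(pi[s]), weights=list(pi[s].values()))[0]
        outcomes = p[(s, a)]
        s, r, _ = random.choices(outcomes, weights=[item[2] for item in outcomes])[0]
        g += discount * r
        discount *= gamma
    return g

# Monte Carlo estimate of v_pi("low"): the average of many sampled returns.
n_episodes = 2000
v_low = sum(sample_return("low") for _ in range(n_episodes)) / n_episodes
print(v_low)
```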

4. The Bellman Equation

Because $G_t$ is recursive, the Value Function must also be recursive. This relationship allows us to compute Value Functions by “backtracking” (or backing up) information from the next state to the current state. We can write $v_{\pi}$ recursively as:

$$\begin{aligned} v_{\pi}(s) &= \mathbb{E}[G_{t} \mid S_{t} = s] \\ &= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_{t} = s] \\ &= \mathbb{E}[R_{t+1} \mid S_{t} = s] + \gamma\mathbb{E}[G_{t+1} \mid S_{t} = s] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_{\pi}(s')] \\ &= \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s] \end{aligned}$$

The equation above is the Bellman equation for the state-value function (see proof in Appendix A).

  • The Bellman equation states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way. This is the Consistency Condition.
  • It shows that the value of a state is the weighted average of the immediate rewards the agent expects, plus the discounted value of the states the agent enters next.

Similarly, we have the Bellman equation for the action-value function $q_{\pi}(s, a)$:

$$\begin{aligned} q_{\pi}(s, a) &= \mathbb{E}[G_{t} \mid S_{t} = s, A_{t} = a] \\ &= \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma \sum_{a' \in \mathcal{A}(s')} \pi(a' \mid s') q_{\pi}(s', a') \right] \end{aligned}$$

Proof provided in Appendix B.

Note

Why do we need $q(s, a)$ if we already have $v(s)$?

If we only know how “good” a state is ($v(s)$), we still need the environment dynamics to calculate which action leads to that good state (based on the Bellman equation for $v(s)$). However, if we know $q(s, a)$, we can simply choose the action with the best $q$-value without knowing anything about the dynamics. This is the motivation for the next section and the core idea behind Model-Free Reinforcement Learning.

Backup Diagram
The Backup Diagram can be seen as a visualization of the Bellman equation. Source: https://towardsdatascience.com/all-about-backup-diagram-fefb25aaf804/

Another important concept is the Backup Diagram, which visualizes the Bellman equation. Based on the equation, we can interpret the Backup Diagram as a tree where leaves are future states, and non-leaf nodes represent the weighted average of the nodes below them.

  • Root (open circle, representing a state): State $s$.
  • First branch: Policy $\pi(a \mid s)$.
  • First node (solid circle, representing a state-action pair): Pair $(s, a)$.
  • Second branch: Environment dynamics $p(s', r \mid s, a)$.
  • Leaves: Next state $s'$, where the value $v_{\pi}(s')$ has been computed previously (to calculate the current $s$, we look ahead to $s'$).
Remark (Backup Operation)

We call this a Backup. Unlike simulation (which goes forward in time), a backup transfers information from the future ($s'$) back to the present ($s$). All RL algorithms (Dynamic Programming, TD Learning, Monte Carlo) revolve around approximating this Bellman equation.
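
To see the backup in action, here is a minimal iterative policy evaluation sketch (my own illustration; the book develops this idea properly with Dynamic Programming later). It repeatedly applies the Bellman equation for $v_\pi$ as an update rule over a hypothetical dynamics table until the values stop changing:

```python
# Hypothetical finite MDP: p[(s, a)] = list of (s_next, reward, probability),
# and a fixed policy pi(a | s). All numbers are made up for illustration.
p = {
    ("low", "wait"):    [("low", 0.0, 1.0)],
    ("low", "search"):  [("high", 1.0, 0.6), ("low", -1.0, 0.4)],
    ("high", "wait"):   [("high", 0.0, 0.9), ("low", 0.0, 0.1)],
    ("high", "search"): [("high", 2.0, 1.0)],
}
pi = {"low": {"wait": 0.3, "search": 0.7}, "high": {"wait": 0.5, "search": 0.5}}
states, gamma = ["low", "high"], 0.9

v = {s: 0.0 for s in states}               # initial guess: v(s) = 0 everywhere
for _ in range(10_000):
    delta = 0.0
    for s in states:
        # One Bellman backup: a weighted average over actions and (s', r) pairs.
        new_v = sum(
            pi[s][a] * sum(prob * (r + gamma * v[sp]) for sp, r, prob in p[(s, a)])
            for a in pi[s]
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:                       # stop once the values no longer change
        break

print(v)                                    # approximately v_pi(s) for each state
```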

5. Bellman Optimality Equation

We have defined value functions for a specific policy π\pi. But we don’t just want to evaluate a policy; we want to find the best policy (the Optimal Policy).

Definition (Optimal Policy)

A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states $s \in \mathcal{S}$. In other words:

$$\pi \geq \pi' \iff v_{\pi}(s) \geq v_{\pi'}(s), \ \forall s \in \mathcal{S}$$

There is always at least one policy that is better than or equal to all other policies. We call this the optimal policy, denoted by $\pi_{\ast}$.

$$\pi^{\ast} = \arg \max_{\pi} v_{\pi}(s), \ \forall s \in \mathcal{S}$$

All optimal policies share the same state-value function, called the optimal state-value function:

$$v_{\ast}(s) = \max_{\pi} v_{\pi}(s), \ \forall s \in \mathcal{S}$$

Similarly, optimal policies share the same optimal action-value function:

$$q_{\ast}(s, a) = \max_{\pi} q_{\pi}(s, a), \ \forall s \in \mathcal{S}, \forall a \in \mathcal{A}(s)$$

Just like the value function for a specific policy $\pi$, the value function for the optimal policy $\pi^{\ast}$ can be written in a Bellman form, called the Bellman Optimality Equation.

The relationship between optimal action-value and optimal state-value is:

$$v_{\ast}(s) = \max_{a \in \mathcal{A}(s)} q_{\ast}(s, a)$$

Bellman Optimality Equation for state-value function:

$$\begin{aligned} v_{\ast}(s) &= \max_{a \in \mathcal{A}(s)} q_{\ast}(s, a) \\ &= \max_{a \in \mathcal{A}(s)} \mathbb{E}[R_{t+1} + \gamma v_{\ast}(S_{t+1}) \mid S_{t} = s, A_{t} = a] \\ &= \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_{\ast}(s')] \end{aligned}$$

Bellman Optimality Equation for action-value function:

$$\begin{aligned} q_{\ast}(s, a) &= \mathbb{E}[R_{t+1} + \gamma v_{\ast}(S_{t+1}) \mid S_{t} = s, A_{t} = a] \\ &= \mathbb{E}[R_{t+1} + \gamma \max_{A_{t+1} \in \mathcal{A}(S_{t+1})} q_{\ast}(S_{t+1}, A_{t+1}) \mid S_{t} = s, A_{t} = a] \\ &= \sum_{s', r} p(s', r \mid s, a) [r + \gamma \max_{a' \in \mathcal{A}(s')} q_{\ast}(s', a')] \end{aligned}$$
Backup Diagram 2
Backup Diagram for the Bellman Optimality Equation. The arc between branches represents taking the max instead of the weighted average.
Warning (Why is the Bellman Optimality Equation hard to solve?)

The standard Bellman equation is a System of Linear Equations, as proven in Appendix D. It can be solved using matrix operations.

However, the Bellman Optimality Equation contains the $\max$ operator. This makes it Non-linear. We cannot use linear algebra to solve it in “one shot” (closed form). We are forced to use Iterative methods like Value Iteration or Q-Learning to approximate the solution gradually.
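
As a rough sketch of such an iterative method (again using the same kind of made-up dynamics table as in the earlier sketches), value iteration repeatedly applies the Bellman optimality backup:

```python
# Hypothetical dynamics table (made-up numbers).
p = {
    ("low", "wait"):    [("low", 0.0, 1.0)],
    ("low", "search"):  [("high", 1.0, 0.6), ("low", -1.0, 0.4)],
    ("high", "wait"):   [("high", 0.0, 0.9), ("low", 0.0, 0.1)],
    ("high", "search"): [("high", 2.0, 1.0)],
}
states, actions, gamma = ["low", "high"], ["wait", "search"], 0.9

v = {s: 0.0 for s in states}
for _ in range(10_000):
    delta = 0.0
    for s in states:
        # Bellman optimality backup: max over actions instead of a weighted average.
        new_v = max(
            sum(prob * (r + gamma * v[sp]) for sp, r, prob in p[(s, a)])
            for a in actions
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:
        break

print(v)                                    # approximately v_*(s) for each state
```

The only change compared to the policy evaluation sketch shown earlier is that the weighted average over $\pi(a \mid s)$ is replaced by a $\max$ over actions.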

Remark

Remember to read Appendix E.

6. References

  1. Achiam, Joshua. Spinning Up in Deep Reinforcement Learning. 2018. https://spinningup.openai.com/en/latest/index.html
  2. Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press, 2018. http://incompleteideas.net/book/the-book-2nd.html
  3. Agarwal, Alekh; Jiang, Nan; Kakade, Sham M.; and Sun, Wen. Reinforcement Learning: Theory and Algorithms. 2021. https://rltheorybook.github.io/
  4. Vazquez-Reina, Amelio (https://stats.stackexchange.com/users/2798/amelio-vazquez-reina). Deriving Bellman Equation in Reinforcement Learning. https://stats.stackexchange.com/q/243384
  5. krishnab (https://cs.stackexchange.com/users/17922/krishnab). How to setup the Bellman Equation as a linear system of equation. https://cs.stackexchange.com/q/142128

Appendix

A. Proof of Bellman Equation for State-Value Function

We have:

$$\begin{aligned} v_{\pi}(s) &= \mathbb{E}[G_{t} \mid S_{t} = s] \\ &= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_{t} = s] \\ &= \mathbb{E}[R_{t+1} \mid S_{t} = s] + \gamma\mathbb{E}[G_{t+1} \mid S_{t} = s] \end{aligned}$$

There are two expectations we need to resolve: the expectation of the immediate reward and the expectation of the next return $G_{t+1}$.

Expectation of the next reward:

We have:

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{r} r \, p(r \mid s)$$

We need to “decompose” the probability $p(r \mid s)$ into two parts: the policy $\pi(a \mid s)$ and the dynamics $p(r, s' \mid s, a)$. Applying the sum rule and product rule (more details here), we get:

$$\begin{aligned} p(r \mid s) &= \sum_{a \in \mathcal{A}(s)} p(r, a \mid s) \\ &= \sum_{a \in \mathcal{A}(s)} p(r \mid a, s) \, p(a \mid s) \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left[ \sum_{s'} p(r, s' \mid a, s) \right] \end{aligned}$$

Substituting this back into the expectation:

$$\begin{aligned} \mathbb{E}[R_{t+1} \mid S_t = s] &= \sum_{r} r \, p(r \mid s) \\ &= \sum_{r} r \left[ \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left\{ \sum_{s'} p(r, s' \mid a, s) \right\} \right] \end{aligned}$$

Since the policy $\pi(a \mid s)$ does not depend on the summation variable $r$, we can “swap” the sums. Finally:

$$\mathbb{E}[R_{t+1} \mid S_{t} = s] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left[ \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(r, s' \mid a, s) \, r \right]$$

Expectation of the return at the next time step:

Recall the law of total expectation (which also holds when every term is further conditioned on the same event):
$$\mathbb{E}[X] = \sum_{y}\mathbb{E}[X \mid Y = y] \, \text{Pr} \{ Y = y \}$$

Applying the property above:

$$\begin{aligned} \mathbb{E}[G_{t+1} \mid S_{t} = s] &= \sum_{a \in \mathcal{A}(s)} \mathbb{E}[G_{t+1} \mid S_{t} = s, A_{t} = a] \, \text{Pr} \{A_{t} = a \mid S_{t} = s \} \\ &= \sum_{a \in \mathcal{A}(s)} \mathbb{E}[G_{t+1} \mid S_{t} = s, A_{t} = a] \, \pi(a \mid s) \end{aligned}$$

Further expanding the conditional expectation:

$$\mathbb{E}[G_{t+1} \mid S_{t} = s, A_{t} = a] = \sum_{r, s'} p(r, s' \mid s, a) \, \mathbb{E}[G_{t+1} \mid S_{t+1} = s', S_{t} = s, R_{t+1} = r, A_{t} = a]$$
Definition (Conditional Independence)

If $P(A \mid B, C) = P(A \mid C)$, we say $A$ and $B$ are conditionally independent given $C$.

Due to the Markov property of MDPs, $G_{t+1}$ and $S_t = s$ are conditionally independent given $S_{t+1} = s'$ (the same applies to $A_t$ and $R_{t+1}$). Thus:

$$\begin{aligned} \mathbb{E}[G_{t+1} \mid S_{t} = s, A_{t} = a] &= \sum_{s', r} p(r, s' \mid s, a) \, \mathbb{E}[G_{t+1} \mid S_{t+1} = s'] \\ &= \sum_{s', r} p(r, s' \mid s, a) \, v_{\pi}(s') \end{aligned}$$

Combining everything, we arrive at:

$$v_{\pi}(s) = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left[ \sum_{s', r} p(s', r \mid s, a) \{ \gamma v_{\pi}(s') + r \} \right]$$

B. Proof of Bellman Equation for Action-Value Function

From the proof in Appendix A, we see:

$$\mathbb{E}[v_{\pi}(S_{t+1}) \mid S_{t} = s, A_{t} = a] = \sum_{s', r} p(s', r \mid s, a) \, v_{\pi}(s') = \mathbb{E}[G_{t+1} \mid S_{t} = s, A_{t} = a]$$

Also:

$$\begin{aligned} \sum_{s', r} p(s', r \mid s, a) \{ \gamma v_{\pi}(s') + r \} &= \mathbb{E}[\gamma v_{\pi}(S_{t+1}) + R_{t+1} \mid S_{t} = s, A_{t} = a] \\ &= \mathbb{E}[R_{t+1} \mid S_{t} = s, A_{t} = a] + \gamma \mathbb{E}[v_{\pi}(S_{t+1}) \mid S_{t} = s, A_{t} = a] \\ &= \mathbb{E}[R_{t+1} \mid S_{t} = s, A_{t} = a] + \gamma \mathbb{E}[G_{t+1} \mid S_{t} = s, A_{t} = a] \\ &= q_{\pi}(s, a) \end{aligned}$$

To satisfy the recursive nature of the Bellman equation, we need to express $q_{\pi}$ in terms of another $q_{\pi}$. First:

$$\begin{aligned} v_{\pi}(s) &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left[ \sum_{s', r} p(s', r \mid s, a) \{ \gamma v_{\pi}(s') + r \} \right] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \, q_{\pi}(s, a) \\ \Leftrightarrow v_{\pi}(s') &= \sum_{a' \in \mathcal{A}(s')} \pi(a' \mid s') \, q_{\pi}(s', a') \end{aligned}$$

Substituting this into $q_{\pi}(s, a)$, we get the final result:

$$q_{\pi}(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a' \in \mathcal{A}(s')} \pi(a' \mid s') \, q_{\pi}(s', a') \right]$$
Note

Notice that we can write the state-value function $v_{\pi}(s)$ as a sum over action-value functions $q_{\pi}(s, a)$:

$$\begin{aligned} v_{\pi}(s) &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \, q_{\pi}(s, a) \\ &= \mathbb{E}[q_{\pi}(S_t, A_t) \mid S_t = s] \end{aligned}$$

C. Unified Notation for Episodic and Continuing Tasks

Let $\mathcal{S}$ be the set of non-terminal states and $\mathcal{S}^+$ be the set of all states (including terminal states).

Note

For the dynamics $p(r, s' \mid s, a)$ to remain consistent in both cases, we need to define the probability that the agent “escapes” the terminal state (or rather, how it transitions when in a terminal state).

Looking closer, a terminal state is a state the agent cannot escape. Thus, the dynamics $p(s', r \mid S_T, a)$ will be $0$ for all $s' \neq S_T$ and equal to $1$ only if $s' = S_T$. In other words:

$$p(s', r \mid S_{T}, a) = \begin{cases} 1 & \text{if } s' = S_{T} \text{ and } r = 0 \\ 0 & \text{otherwise} \end{cases}$$

Where:

  • $s' = S_T$: If we are in the terminal state, we are “stuck” there forever. We call this an absorbing state.
  • $r = 0$: The reward for the terminal state is 0.

Since the dynamics remain consistent:

$$\sum_{s' \in \mathcal{S}^{+}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \forall a \in \mathcal{A}(s), \forall s \in \mathcal{S}^{+}$$

By simply defining dynamics for the terminal state and switching to $\mathcal{S}^+$, we have unified both continuing and episodic cases into one.

D. Linearity of the Bellman Equation

For each state $s$, we have a value $v_{\pi}(s)$. If we assume $n$ states, we can write $n$ equations, turning the task of finding $v_{\pi}$ into solving a system of equations:

$$\begin{cases} v_{\pi}(s_{1}) &= \sum_{a \in \mathcal{A}(s_{1})} \pi(a \mid s_{1}) \sum_{s_{1}', r} p(s_{1}', r \mid s_{1}, a)[r + \gamma v_{\pi}(s_{1}')] \\ v_{\pi}(s_{2}) &= \sum_{a \in \mathcal{A}(s_{2})} \pi(a \mid s_{2}) \sum_{s_{2}', r} p(s_{2}', r \mid s_{2}, a)[r + \gamma v_{\pi}(s_{2}')] \\ &\dots \\ v_{\pi}(s_{n}) &= \sum_{a \in \mathcal{A}(s_{n})} \pi(a \mid s_{n}) \sum_{s_{n}', r} p(s_{n}', r \mid s_{n}, a)[r + \gamma v_{\pi}(s_{n}')] \end{cases}$$

This is beautiful because we can write this system in matrix form, showing that the Bellman equation for a specific policy is linear. First, define:

  • State-transition matrix $P^{\pi}(s' \mid s)$: If the agent is in state $s$, what is the probability of transitioning to state $s'$, averaged over the actions of policy $\pi$?
$$P^{\pi}(s' \mid s) = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{r} p(s', r \mid s, a)$$
  • Expected reward vector $R^{\pi}(s)$: If the agent is in state $s$, what is the expected reward, averaged over the actions of policy $\pi$?
$$R^{\pi}(s) = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \, r$$

Thus, the Bellman equation becomes:

$$v_{\pi}(s) = R^{\pi}(s) + \gamma \sum_{s' \in \mathcal{S}} P^{\pi}(s' \mid s) \, v_{\pi}(s')$$

Writing this as a system:

$$\underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ \vdots \\ v_\pi(s_n) \end{bmatrix}}_{\mathbf{v}_\pi} = \underbrace{\begin{bmatrix} R^\pi(s_1) \\ R^\pi(s_2) \\ \vdots \\ R^\pi(s_n) \end{bmatrix}}_{\mathbf{r}_{\pi}} + \gamma \underbrace{\begin{bmatrix} P^\pi(s_1 \mid s_1) & P^\pi(s_2 \mid s_1) & \dots & P^\pi(s_n \mid s_1) \\ P^\pi(s_1 \mid s_2) & P^\pi(s_2 \mid s_2) & \dots & P^\pi(s_n \mid s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P^\pi(s_1 \mid s_n) & P^\pi(s_2 \mid s_n) & \dots & P^\pi(s_n \mid s_n) \end{bmatrix}}_{\mathbf{P}_\pi} \underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ \vdots \\ v_\pi(s_n) \end{bmatrix}}_{\mathbf{v}_\pi}$$

Finally, the matrix form is:

$$\mathbf{v}_\pi = \mathbf{r}_\pi + \gamma \mathbf{P}_\pi \mathbf{v}_\pi$$

We can find the value of any policy $\pi$ by solving this matrix equation:

$$\mathbf{v}_{\pi} = (\mathbf{I} - \gamma \mathbf{P}_{\pi})^{-1} \mathbf{r}_{\pi}$$
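
As a quick sanity check (my own sketch with made-up numbers), the closed-form solution can be computed directly once $\mathbf{P}_\pi$ and $\mathbf{r}_\pi$ are known for a tiny MDP:

```python
import numpy as np

gamma = 0.9
# Hypothetical policy-induced transition matrix and expected-reward vector
# for a two-state MDP (made-up numbers; each row of P_pi sums to 1).
P_pi = np.array([[0.7, 0.3],
                 [0.2, 0.8]])
r_pi = np.array([1.0, -0.5])

# v_pi = (I - gamma * P_pi)^{-1} r_pi; np.linalg.solve avoids an explicit inverse.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# The result satisfies the Bellman equation v = r + gamma * P v.
assert np.allclose(v_pi, r_pi + gamma * P_pi @ v_pi)
print(v_pi)
```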
Note

We can theoretically calculate $v_{\pi}$ exactly, but in practice, we rarely do because:

  • First, matrix inversion (via LU decomposition or Gaussian elimination) has a complexity of $\mathcal{O}(n^3)$, where $n$ is the number of states. If $n$ is small, it works. But for chess, where $n \approx 10^{50}$, inversion is impossible.
  • Second, we often do not know the dynamics $p(s', r \mid s, a)$ of the environment (the rules of the world). Therefore, we cannot calculate $\mathbf{P}_{\pi}$ or $\mathbf{r}_{\pi}$.

E. How to Find the Optimal Policy

We have the following formulas:

$$\begin{aligned} \pi^{\ast} &= \arg \max_{\pi} v_{\pi}(s), \\ v_{\ast}(s) &= \max_{\pi} v_{\pi}(s), \\ v_{\ast}(s) &= \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_{\ast}(s')] \\ \implies \pi^{\ast}(s) &= \arg \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_{\ast}(s')] \\ &= \arg \max_{a \in \mathcal{A}(s)} q_{\ast}(s, a) \end{aligned}$$

There are two ways to select the optimal policy, based on $v_{\ast}$ (the second-to-last formula) and $q_{\ast}$ (the last formula).

Based on optimal state-value function:

  • First, to find $\pi^{\ast}$ knowing $v_{\ast}$, we must know the environment dynamics $p(s', r \mid s, a)$. We must also perform a one-step (look-ahead) search, summing over all possible next states $s'$ and rewards $r$.
  • This selection appears greedy because we only look at the immediate reward $r$ and the value of the next state $v_{\ast}(s')$. However, this yields the optimal result for the long term. This works because $v_{\ast}(s')$ already accounts for the optimal reward sequence in the future.

Based on optimal action-value function: In this case, things are much more tractable. The agent does not need to perform a one-step search; it simply picks the action with the highest $q_{\ast}(s, a)$. Furthermore, this method does not rely on environment dynamics—something we often don’t know.
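
Here is a small sketch of both routes (my own illustration; the dynamics table and the $v_{\ast}$ / $q_{\ast}$ numbers are placeholders standing in for values that some algorithm, such as value iteration or Q-Learning, has already computed):

```python
# Hypothetical dynamics plus placeholder optimal values v_* and q_*.
p = {
    ("low", "wait"):    [("low", 0.0, 1.0)],
    ("low", "search"):  [("high", 1.0, 0.6), ("low", -1.0, 0.4)],
    ("high", "wait"):   [("high", 0.0, 0.9), ("low", 0.0, 0.1)],
    ("high", "search"): [("high", 2.0, 1.0)],
}
actions, gamma = ["wait", "search"], 0.9
v_star = {"low": 14.1, "high": 17.8}                         # placeholder numbers
q_star = {("low", "wait"): 12.7, ("low", "search"): 14.1,
          ("high", "wait"): 16.0, ("high", "search"): 17.8}  # placeholder numbers

# Route 1: greedy w.r.t. v_* -- needs the dynamics p for a one-step look-ahead.
def greedy_from_v(s):
    return max(actions,
               key=lambda a: sum(prob * (r + gamma * v_star[sp])
                                 for sp, r, prob in p[(s, a)]))

# Route 2: greedy w.r.t. q_* -- no dynamics needed, just an argmax over actions.
def greedy_from_q(s):
    return max(actions, key=lambda a: q_star[(s, a)])

print(greedy_from_v("low"), greedy_from_q("low"))
```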