
A note on chapter 3 of Sutton & Barto: Finite Markov Decision Process (MDP)

December 8, 2025
Danger

This article was originally written in Vietnamese. The following is an English translation created with the assistance of Gemini 3.0 Pro to make the content accessible to a broader audience. You can find the original Vietnamese post here.

Also, this post represents my personal notes and my best effort to understand and explain the deep concepts from the Sutton & Barto book: “Reinforcement Learning: An Introduction” [2]. I really love this book, and this is my attempt to make it feel more accessible.

1. The Agent-Environment Interaction

Important

Before diving into the main content, we need a proper mental model. Reinforcement Learning is not about working with a static dataset like supervised Machine Learning or Deep Learning; it is learning through interaction—a continuous loop of action and feedback.

The two main components of Reinforcement Learning (RL) are the Agent and the Environment. The environment is the world that the agent inhabits and interacts with. At each step (or time step), the agent perceives itself to be in a state of the environment and, based on that state, decides what it needs to do.

Tip

According to [1], there are two definitions we need to distinguish: state and observation. Formally, a state $s$ is a complete description of the state of the world. For example, in chess, a state $s$ tells us exactly where every piece is on the board, including the opponent’s pieces.

However, in poker, the agent only knows the cards in its own hand, not the cards held by others or those remaining in the deck. We call an incomplete description of the environment an observation $o$.

In this context, however, we assume that descriptions are always complete, so $o = s$ (the observation is the state).

Note (The Boundary Between Agent and Environment)

A crucial point Sutton & Barto make: This boundary is not physical (like the skin of a robot). Anything the Agent cannot change arbitrarily is considered part of the Environment.

Example: The robot’s battery level, the motors in a robotic arm… The Agent cannot simply command “Battery, become full,” so the Battery is part of the Environment (specifically, part of the State).

By performing an action on the environment, the agent receives a reward signal (or simply reward). The reward is a real number indicating how good or bad the current state of the environment is. The agent’s goal is to maximize the cumulative reward it receives over time. We call this accumulated reward the Return.

MDP Image
Definition (What is an MDP?)

A Markov Decision Process (MDP) is a mathematical framework used to formalize sequential decision making problems. In this problem, at each time step, the agent must make a decision (action) based on the state. Crucially, the agent’s action influences not just the immediate reward and next state, but also future rewards and states.

MDPs assume that regardless of the details of the agent (like torques, sensors, etc.) and regardless of the goal the agent is trying to achieve, any goal-directed problem can be reduced to three signals passing between the agent and its environment:

  • A signal representing the choices made by the agent (action).
  • A signal representing the basis on which choices are made (state).
  • A signal defining the agent’s goal (reward).

We can formalize this process as follows:

  • At each time step $t$, the agent receives a representation of the environment’s state, $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of all possible states.
  • Based on $S_t$, the agent selects an Action $A_t \in \mathcal{A}(S_t)$ (note that the set of available actions may differ per state).
  • One time step later (at $t+1$), the agent receives a reward $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ and finds itself in a new state $S_{t+1} \in \mathcal{S}$.

This sequence of interactions forms a Trajectory:

$$S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, S_{2}, A_{2}, \dots$$
Note

Notice that we need an initial state $S_0$ and an initial action $A_0$ to start the learning process. Choosing $S_0$ and $A_0$ can be critically important (read more here).
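
To make this loop concrete, here is a minimal Python sketch (my own illustration, not from the book) that collects a trajectory $S_0, A_0, R_1, S_1, \dots$ from a hypothetical two-state toy environment; `ToyEnv` and `random_policy` are made-up names, not a standard API.

```python
import random

class ToyEnv:
    """A hypothetical two-state environment, used only to illustrate the loop."""

    def reset(self):
        self.state = 0                      # S_0
        return self.state

    def step(self, action):
        # Made-up dynamics: action 1 usually moves us to the terminal state 1.
        next_state = 1 if (action == 1 and random.random() < 0.8) else 0
        reward = 1.0 if next_state == 1 else 0.0
        done = (next_state == 1)            # the episode ends in state 1
        self.state = next_state
        return next_state, reward, done

def random_policy(state):
    return random.choice([0, 1])            # A_t chosen uniformly at random

env = ToyEnv()
state = env.reset()
trajectory = [state]                         # S_0, A_0, R_1, S_1, A_1, R_2, ...
done = False
while not done:
    action = random_policy(state)
    next_state, reward, done = env.step(action)
    trajectory += [action, reward, next_state]
    state = next_state

print(trajectory)
```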

Definition (Dynamics of the MDP)

In a finite MDP, the sets $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$ are all finite. Therefore, the random variables $S_t$ and $R_t$ have well-defined discrete probability distributions that depend only on the preceding state and action ($S_{t-1}$, $A_{t-1}$). We call this the Dynamics of the MDP:

$$p(s', r \mid s, a) = \text{Pr} \{ S_{t} = s', R_{t} = r \mid S_{t-1} = s, A_{t-1} = a \}$$

Since $p$ is a probability distribution, the sum of probabilities for all possible pairs $(s', r)$ must equal 1:

$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \forall s \in \mathcal{S}, a \in \mathcal{A}(s)$$

In a finite MDP, the agent’s world (the environment) follows rules. These rules are probabilistic, and these rules are exactly the dynamics of the environment.

The next key point of MDPs is the Markov Property: the probability of $S_t$ and $R_t$ depends only on $S_{t-1}$ and $A_{t-1}$, not on the longer history. The current state encapsulates all information necessary to make decisions for the future.

From the dynamics pp, we can compute other important information such as:

  • State-transition probability:
$$p(s' \mid s, a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$$
  • Expected reward: The reward we expect to receive when taking action $a$ in state $s$.
$$\begin{aligned} r(s, a) &= \mathbb{E}[R_{t} \mid S_{t-1} = s, A_{t-1} = a] \\ &= \sum_{r \in \mathcal{R}} r \, p(r \mid s, a) \\ &= \sum_{r \in \mathcal{R}} r \left[ \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \right] \end{aligned}$$
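
As a concrete (and entirely made-up) illustration, the dynamics $p(s', r \mid s, a)$ of a small finite MDP can be stored as a table, and the state-transition probability and expected reward above then follow by simple summation. The sketch below assumes a hypothetical two-state, two-action MDP of my own invention:

```python
# Hypothetical dynamics table: p[(s, a)][(s_next, r)] = probability.
# Two states ("low", "high") and two actions ("wait", "search"); numbers are made up.
p = {
    ("low", "wait"):    {("low", 0.0): 1.0},
    ("low", "search"):  {("high", 1.0): 0.6, ("low", -1.0): 0.4},
    ("high", "wait"):   {("high", 0.0): 0.9, ("low", 0.0): 0.1},
    ("high", "search"): {("high", 2.0): 1.0},
}

def transition_prob(s_next, s, a):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """r(s, a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * prob for (sp, r), prob in p[(s, a)].items())

# Each conditional distribution must sum to 1.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in p.values())

print(transition_prob("high", "low", "search"))   # 0.6
print(expected_reward("low", "search"))           # 0.6 * 1.0 + 0.4 * (-1.0) = 0.2
```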

2. Goals, Rewards, and Returns

Definition (The Reward Hypothesis)

That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

Based on the reward hypothesis, to achieve a goal, we must model that goal and the problem into rewards; the agent then simply learns based on those rewards. Furthermore, it is crucial to remember that the reward is a way to tell the agent what needs to be achieved, not how to achieve it.

So, what is the Agent’s true goal? It is not to get the highest reward immediately, but to maximize the total accumulated reward in the long run. We call this accumulated sum the Return ($G_t$).

Note

To be precise, $G_t$ is the sum of rewards the agent accumulates in the future (i.e., from time step $t+1$ until the end). Why future sums instead of past sums? Recall the agent’s purpose is to maximize the reward it accumulates—like constantly looking ahead and choosing the future path that yields the most reward.

Depending on the task type, the calculation of $G_t$ differs:

  1. Episodic Tasks: Interaction breaks down into subsequences called episodes. Each episode ends in a terminal state at time $T$.
$$G_t = R_{t+1} + R_{t+2} + \dots + R_T$$
  2. Continuing Tasks: Interaction goes on forever ($T = \infty$). To prevent $G_t \to \infty$, we use Discounting with a parameter $0 \leq \gamma \leq 1$.
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
Note (Why Discounting?)

There are two simple reasons:

  • Uncertainty: We cannot be certain about the future, so rewards further down the line are less “sure” than immediate rewards.
  • Patience: In finance, for example, money held now (immediate reward) is worth more than money received later. $\gamma$ can be viewed as the agent’s “patience.” If $\gamma = 0$, the agent is myopic (greedy), caring only about the immediate reward.

Mathematically, $\gamma < 1$ ensures the infinite sum $G_t$ converges in continuing tasks (assuming the rewards are bounded). If $\gamma = 1$, the sum might not converge.

We can observe a recursive relationship:

$$\begin{aligned} G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \\ &= R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \\ &= R_{t+1} + \gamma G_{t+1} \end{aligned}$$

The formula above is a critically important recursive property. It means: The return of the current sequence equals the immediate reward plus the discounted return of the next sequence.

To define a single return formula for both cases, we use:

$$G_{t} = \sum_{k=t+1}^{T} \gamma^{k - (t + 1)} R_{k}$$

In continuing cases, $T = \infty$ and $\gamma < 1$. In episodic cases, $\gamma = 1$ (usually) and $T$ is finite.
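
Here is a tiny sketch (with made-up rewards) showing that the forward definition of the return and the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ give the same numbers:

```python
# A made-up episodic reward sequence R_1, ..., R_T and a discount factor.
rewards = [1.0, 0.0, -2.0, 3.0, 5.0]      # R_1 ... R_5, so T = 5
gamma = 0.9

# Forward definition: G_0 = sum_{k=0}^{T-1} gamma^k * R_{k+1}
g0_forward = sum(gamma**k * r for k, r in enumerate(rewards))

# Backward recursion: G_T = 0, then G_t = R_{t+1} + gamma * G_{t+1}
g = 0.0
returns = []
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns.reverse()                          # now returns[t] == G_t

print(g0_forward, returns[0])              # the two agree up to floating-point rounding
```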

3. Policies and Value Functions

How does an agent know if a state is “good”? A state is only good if it leads to a high return. But the return depends on how the agent behaves in the future. Therefore, the agent needs to know how to select and evaluate its actions.

We have Value Functions to estimate the goodness of a state (or a state-action pair). The goodness of a state ss is simply the expected return achievable from that state. Meanwhile, the Policy is the agent’s brain—it helps the agent make decisions based on the current state.

Definition (Policy)

A Policy $\pi$ is a mapping from states to probabilities of selecting each possible action. If an agent follows policy $\pi$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$ given $S_t = s$. Since $\pi(a \mid s)$ is a probability distribution, $\sum_{a \in \mathcal{A}(s)} \pi(a \mid s) = 1$.

There are two main types of Value Functions:

  1. State-value function $v_{\pi}(s)$: The expected return when starting in $s$ and following policy $\pi$ thereafter.
$$v_{\pi}(s) = \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s] = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_{t} = s \right]$$
  2. Action-value function $q_{\pi}(s, a)$: The expected return starting from $s$, taking action $a$, and then following policy $\pi$.
$$q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s, A_{t} = a] = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_t = s, A_t = a \right]$$

The Q in the famous Q-Learning algorithm (and later Deep Q-Networks - DQN) stands for this action-value function $q(s, a)$.

Note

A key phrase here is following policy $\pi$. In other words, at any state $s$, we could follow many different policies to generate an action. The Value Function gives the expected return for one specific policy $\pi$, assuming all future states are also handled using that same policy $\pi$.
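
As a rough illustration of “expected return when following $\pi$,” the sketch below estimates $v_\pi(s)$ by Monte Carlo: it samples many (truncated) trajectories under a fixed stochastic policy on a hypothetical two-state MDP and averages the discounted returns. Everything here (the dynamics table, the policy, the truncation horizon) is my own made-up example:

```python
import random

# Hypothetical two-state MDP: p[(s, a)] = list of (s_next, reward, probability),
# plus a fixed stochastic policy pi(a | s). All numbers are made up.
p = {
    ("low", "wait"):    [("low", 0.0, 1.0)],
    ("low", "search"):  [("high", 1.0, 0.6), ("low", -1.0, 0.4)],
    ("high", "wait"):   [("high", 0.0, 0.9), ("low", 0.0, 0.1)],
    ("high", "search"): [("high", 2.0, 1.0)],
}
pi = {"low": {"wait": 0.3, "search": 0.7}, "high": {"wait": 0.5, "search": 0.5}}
gamma = 0.9

def sample_return(s, horizon=100):
    """One sampled (truncated) return obtained by following pi from state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = random.choices(list(pi[s]), weights=list(pi[s].values()))[0]
        outcomes = p[(s, a)]
        s, r, _ = random.choices(outcomes, weights=[item[2] for item in outcomes])[0]
        g += discount * r
        discount *= gamma
    return g

# Monte Carlo estimate of v_pi("low"): the average of many sampled returns.
n_episodes = 2000
v_low = sum(sample_return("low") for _ in range(n_episodes)) / n_episodes
print(v_low)
```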

4. The Bellman Equation

Because $G_t$ is recursive, the Value Function must also be recursive. This relationship allows us to compute Value Functions by “backtracking” (or backing up) information from the next state to the current state. We can write $v_{\pi}$ recursively as:

$$\begin{aligned} v_{\pi}(s) &= \mathbb{E}[G_{t} \mid S_{t} = s] \\ &= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_{t} = s] \\ &= \mathbb{E}[R_{t+1} \mid S_{t} = s] + \gamma\mathbb{E}[G_{t+1} \mid S_{t} = s] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_{\pi}(s')] \\ &= \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s] \end{aligned}$$

The equation above is the Bellman equation for the state-value function (see proof in Appendix A).

  • The Bellman equation states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way. This is the Consistency Condition.
  • It shows that the value of a state is the weighted average of the immediate rewards the agent expects, plus the discounted value of the states the agent enters next.

Similarly, we have the Bellman equation for the action-value function $q_{\pi}(s, a)$:

$$\begin{aligned} q_{\pi}(s, a) &= \mathbb{E}[G_{t} \mid S_{t} = s, A_{t} = a] \\ &= \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma \sum_{a' \in \mathcal{A}(s')} \pi(a' \mid s') q_{\pi}(s', a') \right] \end{aligned}$$

Proof provided in Appendix B.

Note

Why do we need $q(s, a)$ if we already have $v(s)$?

If we only know how “good” a state is ($v(s)$), we still need the environment dynamics to calculate which action leads to that good state (based on the Bellman equation for $v(s)$). However, if we know $q(s, a)$, we can simply choose the action with the best $q$-value without knowing anything about the dynamics. This is the motivation for the next section and the core idea behind Model-Free Reinforcement Learning.

Backup Diagram
The Backup Diagram can be seen as a visualization of the Bellman equation. Source: https://towardsdatascience.com/all-about-backup-diagram-fefb25aaf804/

Another important concept is the Backup Diagram, which visualizes the Bellman equation. Based on the equation, we can interpret the Backup Diagram as a tree where leaves are future states, and non-leaf nodes represent the weighted average of the nodes below them.

  • Root (open circle, representing a state): State $s$.
  • First branch: Policy $\pi(a \mid s)$.
  • First node (solid circle, representing a state-action pair): Pair $(s, a)$.
  • Second branch: Environment dynamics $p(s', r \mid s, a)$.
  • Leaves: Next state $s'$, where the value $v_{\pi}(s')$ has been computed previously (to calculate the current $s$, we look ahead to $s'$).
Remark (Backup Operation)

We call this a Backup. Unlike simulation (which goes forward in time), a backup transfers information from the future ($s'$) back to the present ($s$). All RL algorithms (Dynamic Programming, TD Learning, Monte Carlo) revolve around approximating this Bellman equation.
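
To see the backup in action, here is a minimal iterative policy evaluation sketch (my own illustration; the book develops this idea properly with Dynamic Programming later). It repeatedly applies the Bellman equation for $v_\pi$ as an update rule over a hypothetical dynamics table until the values stop changing:

```python
# Hypothetical finite MDP: p[(s, a)] = list of (s_next, reward, probability),
# and a fixed policy pi(a | s). All numbers are made up for illustration.
p = {
    ("low", "wait"):    [("low", 0.0, 1.0)],
    ("low", "search"):  [("high", 1.0, 0.6), ("low", -1.0, 0.4)],
    ("high", "wait"):   [("high", 0.0, 0.9), ("low", 0.0, 0.1)],
    ("high", "search"): [("high", 2.0, 1.0)],
}
pi = {"low": {"wait": 0.3, "search": 0.7}, "high": {"wait": 0.5, "search": 0.5}}
states, gamma = ["low", "high"], 0.9

v = {s: 0.0 for s in states}               # initial guess: v(s) = 0 everywhere
for _ in range(10_000):
    delta = 0.0
    for s in states:
        # One Bellman backup: a weighted average over actions and (s', r) pairs.
        new_v = sum(
            pi[s][a] * sum(prob * (r + gamma * v[sp]) for sp, r, prob in p[(s, a)])
            for a in pi[s]
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:                       # stop once the values no longer change
        break

print(v)                                    # approximately v_pi(s) for each state
```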

5. Bellman Optimality Equation

We have defined value functions for a specific policy π\pi. But we don’t just want to evaluate a policy; we want to find the best policy (the Optimal Policy).

Definition (Optimal Policy)

A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states $s \in \mathcal{S}$. In other words:

$$\pi \geq \pi' \iff v_{\pi}(s) \geq v_{\pi'}(s), \ \forall s \in \mathcal{S}$$

There is always at least one policy that is better than or equal to all other policies. We call this the optimal policy, denoted by $\pi_{\ast}$.

$$\pi^{\ast} = \arg \max_{\pi} v_{\pi}(s), \ \forall s \in \mathcal{S}$$

All optimal policies share the same state-value function, called the optimal state-value function:

$$v_{\ast}(s) = \max_{\pi} v_{\pi}(s), \ \forall s \in \mathcal{S}$$

Similarly, optimal policies share the same optimal action-value function:

$$q_{\ast}(s, a) = \max_{\pi} q_{\pi}(s, a), \ \forall s \in \mathcal{S}, \forall a \in \mathcal{A}(s)$$

Just like the value function for a specific policy $\pi$, the value function for the optimal policy $\pi^{\ast}$ can be written in a Bellman form, called the Bellman Optimality Equation.

The relationship between optimal action-value and optimal state-value is:

$$v_{\ast}(s) = \max_{a \in \mathcal{A}(s)} q_{\ast}(s, a)$$

Bellman Optimality Equation for state-value function:

$$\begin{aligned} v_{\ast}(s) &= \max_{a \in \mathcal{A}(s)} q_{\ast}(s, a) \\ &= \max_{a \in \mathcal{A}(s)} \mathbb{E}[R_{t+1} + \gamma v_{\ast}(S_{t+1}) \mid S_{t} = s, A_{t} = a] \\ &= \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_{\ast}(s')] \end{aligned}$$

Bellman Optimality Equation for action-value function:

$$\begin{aligned} q_{\ast}(s, a) &= \mathbb{E}[R_{t+1} + \gamma v_{\ast}(S_{t+1}) \mid S_{t} = s, A_{t} = a] \\ &= \mathbb{E}[R_{t+1} + \gamma \max_{A_{t+1} \in \mathcal{A}(S_{t+1})} q_{\ast}(S_{t+1}, A_{t+1}) \mid S_{t} = s, A_{t} = a] \\ &= \sum_{s', r} p(s', r \mid s, a) [r + \gamma \max_{a' \in \mathcal{A}(s')} q_{\ast}(s', a')] \end{aligned}$$
Backup Diagram 2
Backup Diagram for the Bellman Optimality Equation. The arc between branches represents taking the max instead of the weighted average.
Warning (Why is the Bellman Optimality Equation hard to solve?)

The standard Bellman equation is a System of Linear Equations, as proven in Appendix D. It can be solved using matrix operations.

However, the Bellman Optimality Equation contains the $\max$ operator. This makes it Non-linear. We cannot use linear algebra to solve it in “one shot” (closed form). We are forced to use Iterative methods like Value Iteration or Q-Learning to approximate the solution gradually.
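
As a rough sketch of such an iterative method (again using the same kind of made-up dynamics table as in the earlier sketches), value iteration repeatedly applies the Bellman optimality backup:

```python
# Hypothetical dynamics table (made-up numbers).
p = {
    ("low", "wait"):    [("low", 0.0, 1.0)],
    ("low", "search"):  [("high", 1.0, 0.6), ("low", -1.0, 0.4)],
    ("high", "wait"):   [("high", 0.0, 0.9), ("low", 0.0, 0.1)],
    ("high", "search"): [("high", 2.0, 1.0)],
}
states, actions, gamma = ["low", "high"], ["wait", "search"], 0.9

v = {s: 0.0 for s in states}
for _ in range(10_000):
    delta = 0.0
    for s in states:
        # Bellman optimality backup: max over actions instead of a weighted average.
        new_v = max(
            sum(prob * (r + gamma * v[sp]) for sp, r, prob in p[(s, a)])
            for a in actions
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:
        break

print(v)                                    # approximately v_*(s) for each state
```

The only change compared to the policy evaluation sketch shown earlier is that the weighted average over $\pi(a \mid s)$ is replaced by a $\max$ over actions.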

Remark

Remember to read Appendix E.

6. References

  1. Achiam, Joshua. Spinning Up in Deep Reinforcement Learning. 2018. https://spinningup.openai.com/en/latest/index.html
  2. Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press, 2018. http://incompleteideas.net/book/the-book-2nd.html
  3. Agarwal, Alekh; Jiang, Nan; Kakade, Sham M.; and Sun, Wen. Reinforcement Learning: Theory and Algorithms. 2021. https://rltheorybook.github.io/
  4. Vazquez-Reina, Amelio (https://stats.stackexchange.com/users/2798/amelio-vazquez-reina). Deriving Bellman Equation in Reinforcement Learning. https://stats.stackexchange.com/q/243384
  5. krishnab (https://cs.stackexchange.com/users/17922/krishnab). How to setup the Bellman Equation as a linear system of equation. https://cs.stackexchange.com/q/142128

Appendix

A. Proof of Bellman Equation for State-Value Function

We have:

$$\begin{aligned} v_{\pi}(s) &= \mathbb{E}[G_{t} \mid S_{t} = s] \\ &= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_{t} = s] \\ &= \mathbb{E}[R_{t+1} \mid S_{t} = s] + \gamma\mathbb{E}[G_{t+1} \mid S_{t} = s] \end{aligned}$$

There are two expectations we need to resolve: the expectation of the immediate reward and the expectation of the next return $G_{t+1}$.

Expectation of the next reward:

We have:

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{r} r \, p(r \mid s)$$

We need to “decompose” the probability $p(r \mid s)$ into two parts: the policy $\pi(a \mid s)$ and the dynamics $p(r, s' \mid s, a)$. Applying the sum rule and product rule (more details here), we get:

$$\begin{aligned} p(r \mid s) &= \sum_{a \in \mathcal{A}(s)} p(r, a \mid s) \\ &= \sum_{a \in \mathcal{A}(s)} p(r \mid a, s) \, p(a \mid s) \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left[ \sum_{s'} p(r, s' \mid a, s) \right] \end{aligned}$$

Substituting this back into the expectation:

$$\begin{aligned} \mathbb{E}[R_{t+1} \mid S_t = s] &= \sum_{r} r \, p(r \mid s) \\ &= \sum_{r} r \left[ \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left\{ \sum_{s'} p(r, s' \mid a, s) \right\} \right] \end{aligned}$$

Since the policy $\pi(a \mid s)$ does not depend on the summation variable $r$, we can “swap” the sums. Finally:

$$\mathbb{E}[R_{t+1} \mid S_{t} = s] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left[ \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(r, s' \mid a, s) \, r \right]$$

Expectation of the return at the next time step:

Recall the law of total expectation (which also holds when every term is further conditioned on the same event):
$$\mathbb{E}[X] = \sum_{y}\mathbb{E}[X \mid Y = y] \, \text{Pr} \{ Y = y \}$$

Applying the property above:

$$\begin{aligned} \mathbb{E}[G_{t+1} \mid S_{t} = s] &= \sum_{a \in \mathcal{A}(s)} \mathbb{E}[G_{t+1} \mid S_{t} = s, A_{t} = a] \, \text{Pr} \{A_{t} = a \mid S_{t} = s \} \\ &= \sum_{a \in \mathcal{A}(s)} \mathbb{E}[G_{t+1} \mid S_{t} = s, A_{t} = a] \, \pi(a \mid s) \end{aligned}$$

Further expanding the conditional expectation:

$$\mathbb{E}[G_{t+1} \mid S_{t} = s, A_{t} = a] = \sum_{r, s'} p(r, s' \mid s, a) \, \mathbb{E}[G_{t+1} \mid S_{t+1} = s', S_{t} = s, R_{t+1} = r, A_{t} = a]$$
Definition (Conditional Independence)

If $P(A \mid B, C) = P(A \mid C)$, we say $A$ and $B$ are conditionally independent given $C$.

Due to the Markov property of MDPs, $G_{t+1}$ and $S_t = s$ are conditionally independent given $S_{t+1} = s'$ (the same applies to $A_t$ and $R_{t+1}$). Thus:

$$\begin{aligned} \mathbb{E}[G_{t+1} \mid S_{t} = s, A_{t} = a] &= \sum_{s', r} p(r, s' \mid s, a) \, \mathbb{E}[G_{t+1} \mid S_{t+1} = s'] \\ &= \sum_{s', r} p(r, s' \mid s, a) \, v_{\pi}(s') \end{aligned}$$

Combining everything, we arrive at:

$$v_{\pi}(s) = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left[ \sum_{s', r} p(s', r \mid s, a) \{ \gamma v_{\pi}(s') + r \} \right]$$

B. Proof of Bellman Equation for Action-Value Function

From the proof in Appendix A, we see:

$$\mathbb{E}[v_{\pi}(S_{t+1}) \mid S_{t} = s, A_{t} = a] = \sum_{s', r} p(s', r \mid s, a) \, v_{\pi}(s') = \mathbb{E}[G_{t+1} \mid S_{t} = s, A_{t} = a]$$

Also:

$$\begin{aligned} \sum_{s', r} p(s', r \mid s, a) \{ \gamma v_{\pi}(s') + r \} &= \mathbb{E}[\gamma v_{\pi}(S_{t+1}) + R_{t+1} \mid S_{t} = s, A_{t} = a] \\ &= \mathbb{E}[R_{t+1} \mid S_{t} = s, A_{t} = a] + \gamma \mathbb{E}[v_{\pi}(S_{t+1}) \mid S_{t} = s, A_{t} = a] \\ &= \mathbb{E}[R_{t+1} \mid S_{t} = s, A_{t} = a] + \gamma \mathbb{E}[G_{t+1} \mid S_{t} = s, A_{t} = a] \\ &= q_{\pi}(s, a) \end{aligned}$$

To satisfy the recursive nature of the Bellman equation, we need to express $q_{\pi}$ in terms of another $q_{\pi}$. First:

$$\begin{aligned} v_{\pi}(s) &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left[ \sum_{s', r} p(s', r \mid s, a) \{ \gamma v_{\pi}(s') + r \} \right] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \, q_{\pi}(s, a) \\ \Leftrightarrow v_{\pi}(s') &= \sum_{a' \in \mathcal{A}(s')} \pi(a' \mid s') \, q_{\pi}(s', a') \end{aligned}$$

Substituting this into $q_{\pi}(s, a)$, we get the final result:

$$q_{\pi}(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a' \in \mathcal{A}(s')} \pi(a' \mid s') \, q_{\pi}(s', a') \right]$$
Note

Notice that we can write the state-value function $v_{\pi}(s)$ as a sum over action-value functions $q_{\pi}(s, a)$:

$$\begin{aligned} v_{\pi}(s) &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \, q_{\pi}(s, a) \\ &= \mathbb{E}[q_{\pi}(S_t, A_t) \mid S_t = s] \end{aligned}$$

C. Unified Notation for Episodic and Continuing Tasks

Let $\mathcal{S}$ be the set of non-terminal states and $\mathcal{S}^+$ be the set of all states (including terminal states).

Note

For the dynamics $p(r, s' \mid s, a)$ to remain consistent in both cases, we need to define the probability that the agent “escapes” the terminal state (or rather, how it transitions when in a terminal state).

Looking closer, a terminal state is a state the agent cannot escape. Thus, the dynamics $p(s', r \mid S_T, a)$ will be $0$ for all $s' \neq S_T$ and equal to $1$ only if $s' = S_T$. In other words:

$$p(s', r \mid S_{T}, a) = \begin{cases} 1 & \text{if } s' = S_{T} \text{ and } r = 0 \\ 0 & \text{otherwise} \end{cases}$$

Where:

  • $s' = S_T$: If we are in the terminal state, we are “stuck” there forever. We call this an absorbing state.
  • $r = 0$: The reward for the terminal state is 0.

Since the dynamics remain consistent:

$$\sum_{s' \in \mathcal{S}^{+}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \forall a \in \mathcal{A}(s), \forall s \in \mathcal{S}^{+}$$

By simply defining dynamics for the terminal state and switching to $\mathcal{S}^+$, we have unified both continuing and episodic cases into one.

D. Linearity of the Bellman Equation

For each state $s$, we have a value $v_{\pi}(s)$. If we assume $n$ states, we can write $n$ equations, turning the task of finding $v_{\pi}$ into solving a system of equations:

$$\begin{cases} v_{\pi}(s_{1}) &= \sum_{a \in \mathcal{A}(s_{1})} \pi(a \mid s_{1}) \sum_{s_{1}', r} p(s_{1}', r \mid s_{1}, a)[r + \gamma v_{\pi}(s_{1}')] \\ v_{\pi}(s_{2}) &= \sum_{a \in \mathcal{A}(s_{2})} \pi(a \mid s_{2}) \sum_{s_{2}', r} p(s_{2}', r \mid s_{2}, a)[r + \gamma v_{\pi}(s_{2}')] \\ &\dots \\ v_{\pi}(s_{n}) &= \sum_{a \in \mathcal{A}(s_{n})} \pi(a \mid s_{n}) \sum_{s_{n}', r} p(s_{n}', r \mid s_{n}, a)[r + \gamma v_{\pi}(s_{n}')] \end{cases}$$

This is beautiful because we can write this system in matrix form, showing that the Bellman equation for a specific policy is linear. First, define:

  • State-transition matrix $P^{\pi}(s' \mid s)$: If the agent is in state $s$, what is the probability of transitioning to state $s'$, averaged over the actions of policy $\pi$?
$$P^{\pi}(s' \mid s) = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{r} p(s', r \mid s, a)$$
  • Expected reward vector $R^{\pi}(s)$: If the agent is in state $s$, what is the expected reward, averaged over the actions of policy $\pi$?
$$R^{\pi}(s) = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \, r$$

Thus, the Bellman equation becomes:

$$v_{\pi}(s) = R^{\pi}(s) + \gamma \sum_{s' \in \mathcal{S}} P^{\pi}(s' \mid s) \, v_{\pi}(s')$$

Writing this as a system:

$$\underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ \vdots \\ v_\pi(s_n) \end{bmatrix}}_{\mathbf{v}_\pi} = \underbrace{\begin{bmatrix} R^\pi(s_1) \\ R^\pi(s_2) \\ \vdots \\ R^\pi(s_n) \end{bmatrix}}_{\mathbf{r}_{\pi}} + \gamma \underbrace{\begin{bmatrix} P^\pi(s_1 \mid s_1) & P^\pi(s_2 \mid s_1) & \dots & P^\pi(s_n \mid s_1) \\ P^\pi(s_1 \mid s_2) & P^\pi(s_2 \mid s_2) & \dots & P^\pi(s_n \mid s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P^\pi(s_1 \mid s_n) & P^\pi(s_2 \mid s_n) & \dots & P^\pi(s_n \mid s_n) \end{bmatrix}}_{\mathbf{P}_\pi} \underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ \vdots \\ v_\pi(s_n) \end{bmatrix}}_{\mathbf{v}_\pi}$$

Finally, the matrix form is:

$$\mathbf{v}_\pi = \mathbf{r}_\pi + \gamma \mathbf{P}_\pi \mathbf{v}_\pi$$

We can find the value of any policy $\pi$ by solving this matrix equation:

$$\mathbf{v}_{\pi} = (\mathbf{I} - \gamma \mathbf{P}_{\pi})^{-1} \mathbf{r}_{\pi}$$
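
As a quick sanity check (my own sketch with made-up numbers), the closed-form solution can be computed directly once $\mathbf{P}_\pi$ and $\mathbf{r}_\pi$ are known for a tiny MDP:

```python
import numpy as np

gamma = 0.9
# Hypothetical policy-induced transition matrix and expected-reward vector
# for a two-state MDP (made-up numbers; each row of P_pi sums to 1).
P_pi = np.array([[0.7, 0.3],
                 [0.2, 0.8]])
r_pi = np.array([1.0, -0.5])

# v_pi = (I - gamma * P_pi)^{-1} r_pi; np.linalg.solve avoids an explicit inverse.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# The result satisfies the Bellman equation v = r + gamma * P v.
assert np.allclose(v_pi, r_pi + gamma * P_pi @ v_pi)
print(v_pi)
```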
Note

We can theoretically calculate $v_{\pi}$ exactly, but in practice, we rarely do because:

  • First, matrix inversion (via LU decomposition or Gaussian elimination) has a complexity of $\mathcal{O}(n^3)$, where $n$ is the number of states. If $n$ is small, it works. But for chess, where $n \approx 10^{50}$, inversion is impossible.
  • Second, we often do not know the dynamics $p(s', r \mid s, a)$ of the environment (the rules of the world). Therefore, we cannot calculate $\mathbf{P}_{\pi}$ or $\mathbf{r}_{\pi}$.

E. How to Find the Optimal Policy

We have the following formulas:

$$\begin{aligned} \pi^{\ast} &= \arg \max_{\pi} v_{\pi}(s), \\ v_{\ast}(s) &= \max_{\pi} v_{\pi}(s), \\ v_{\ast}(s) &= \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_{\ast}(s')] \\ \implies \pi^{\ast}(s) &= \arg \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_{\ast}(s')] \\ &= \arg \max_{a \in \mathcal{A}(s)} q_{\ast}(s, a) \end{aligned}$$

There are two ways to select the optimal policy, based on $v_{\ast}$ (the second-to-last formula) and $q_{\ast}$ (the last formula).

Based on optimal state-value function:

  • First, to find $\pi^{\ast}$ knowing $v_{\ast}$, we must know the environment dynamics $p(s', r \mid s, a)$. We must also perform a one-step (look-ahead) search, summing over all possible next states $s'$ and rewards $r$.
  • This selection appears greedy because we only look at the immediate reward $r$ and the value of the next state $v_{\ast}(s')$. However, this yields the optimal result for the long term. This works because $v_{\ast}(s')$ already accounts for the optimal reward sequence in the future.

Based on optimal action-value function: In this case, things are much more tractable. The agent does not need to perform a one-step search; it simply picks the action with the highest $q_{\ast}(s, a)$. Furthermore, this method does not rely on environment dynamics—something we often don’t know.
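
Here is a small sketch of both routes (my own illustration; the dynamics table and the $v_{\ast}$ / $q_{\ast}$ numbers are placeholders standing in for values that some algorithm, such as value iteration or Q-Learning, has already computed):

```python
# Hypothetical dynamics plus placeholder optimal values v_* and q_*.
p = {
    ("low", "wait"):    [("low", 0.0, 1.0)],
    ("low", "search"):  [("high", 1.0, 0.6), ("low", -1.0, 0.4)],
    ("high", "wait"):   [("high", 0.0, 0.9), ("low", 0.0, 0.1)],
    ("high", "search"): [("high", 2.0, 1.0)],
}
actions, gamma = ["wait", "search"], 0.9
v_star = {"low": 14.1, "high": 17.8}                         # placeholder numbers
q_star = {("low", "wait"): 12.7, ("low", "search"): 14.1,
          ("high", "wait"): 16.0, ("high", "search"): 17.8}  # placeholder numbers

# Route 1: greedy w.r.t. v_* -- needs the dynamics p for a one-step look-ahead.
def greedy_from_v(s):
    return max(actions,
               key=lambda a: sum(prob * (r + gamma * v_star[sp])
                                 for sp, r, prob in p[(s, a)]))

# Route 2: greedy w.r.t. q_* -- no dynamics needed, just an argmax over actions.
def greedy_from_q(s):
    return max(actions, key=lambda a: q_star[(s, a)])

print(greedy_from_v("low"), greedy_from_q("low"))
```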