Danger
This article was originally written in Vietnamese. The following is an English translation created with the assistance of Gemini 3.0 Pro to make the content accessible to a broader audience. You can find the original Vietnamese post here.
Also, this post represents my personal notes and my best effort to understand and explain the deep concepts from the Sutton & Barto book, "Reinforcement Learning: An Introduction".
1. The Agent-Environment Interaction
Important
Before diving into the main content, we need a proper mental model. Reinforcement Learning does not work with a static dataset the way (supervised) Machine Learning or Deep Learning typically does; it learns through interaction: a continuous loop of action and feedback.
The two main components of Reinforcement Learning (RL) are the Agent and the Environment. The environment is the world that the agent inhabits and interacts with. At each step (or time step), the agent perceives itself to be in a state of the environment and, based on that state, decides what it needs to do.
Tip
According to Spinning Up, a state is a complete description of the world: no information about the environment is hidden from it.
However, in poker, the agent only knows the cards in its own hand, not the cards held by others or those remaining in the deck. We call such an incomplete description of the environment an observation.
In this context, however, we assume that descriptions are always complete, so the observation and the state coincide ($O_t = S_t$).
Note (The Boundary Between Agent and Environment)
A crucial point Sutton & Barto make: This boundary is not physical (like the skin of a robot). Anything the Agent cannot change arbitrarily is considered part of the Environment.
Example: The robot’s battery level, the motors in a robotic arm… The Agent cannot simply command “Battery, become full,” so the Battery is part of the Environment (specifically, part of the State).
By performing an action on the environment, the agent receives a reward signal (or simply reward). The reward is a real number indicating how good or bad the current state of the environment is. The agent’s goal is to maximize the cumulative reward it receives over time. We call this accumulated reward the Return.
Definition (What is an MDP?)
A Markov Decision Process (MDP) is a mathematical framework used to formalize sequential decision making problems. In this problem, at each time step, the agent must make a decision (action) based on the state. Crucially, the agent’s action influences not just the immediate reward and next state, but also future rewards and states.
MDPs assume that regardless of the details of the agent (like torques, sensors, etc.) and regardless of the goal the agent is trying to achieve, any goal-directed problem can be reduced to three signals passing between the agent and its environment:
- A signal representing the choices made by the agent (action).
- A signal representing the basis on which choices are made (state).
- A signal defining the agent’s goal (reward).
We can formalize this process as follows:
- At each time step $t$, the agent receives a representation of the environment's state, $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of all possible states.
- Based on $S_t$, the agent selects an action $A_t \in \mathcal{A}(s)$ (note that the set of available actions may differ per state).
- One time step later (at $t+1$), the agent receives a reward $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ and finds itself in a new state $S_{t+1}$.
This sequence of interactions forms a Trajectory:
$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$
Note
Notice that we need an initial state $S_0$ and an initial action $A_0$ to start the learning process. Choosing $S_0$ and $A_0$ can be critically important (read more here).
Definition (Dynamics of the MDP)
In a finite MDP, the sets $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$ are all finite. Therefore, the random variables $R_t$ and $S_t$ have well-defined discrete probability distributions that depend only on the preceding state and action ($S_{t-1}$, $A_{t-1}$). We call this the Dynamics of the MDP:
$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$
Since $p$ is a probability distribution, the sum of probabilities over all possible pairs $(s', r)$ must equal 1:
$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S}, \, a \in \mathcal{A}(s)$$
In a finite MDP, the agent’s world (the environment) follows rules. These rules are probabilistic, and these rules are exactly the dynamics of the environment.
The next key point of MDPs is the Markov Property: the probability of $S_t$ and $R_t$ depends only on $S_{t-1}$ and $A_{t-1}$, not on the longer history. The current state encapsulates all information necessary to make decisions for the future.
From the dynamics $p(s', r \mid s, a)$, we can compute other important quantities (a small code sketch follows this list), such as:
- State-transition probability: $p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$
- Expected reward: $r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)$, the reward we expect to receive when taking action $a$ in state $s$.
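To make the dynamics concrete, here is a minimal Python sketch (not from the book; the two-state MDP, its state and action names, and the numbers are made up) that stores $p(s', r \mid s, a)$ as a lookup table and derives the state-transition probability and expected reward from it:

```python
# Toy finite-MDP dynamics: dynamics[(s, a)] is a list of (next_state, reward, probability).
# States "s0"/"s1" and actions "stay"/"go" are invented purely for illustration.
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 2.0, 1.0)],
}

# Each p(., . | s, a) must be a proper distribution: probabilities sum to 1.
for (s, a), outcomes in dynamics.items():
    assert abs(sum(p for _, _, p in outcomes) - 1.0) < 1e-9

def transition_prob(s, a, s_next):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(p for sn, _, p in dynamics[(s, a)] if sn == s_next)

def expected_reward(s, a):
    """r(s, a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * p for _, r, p in dynamics[(s, a)])

print(transition_prob("s0", "go", "s1"))  # 0.8
print(expected_reward("s0", "go"))        # 0.8
```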
2. Goals, Rewards, and Returns
Definition (The Reward Hypothesis)
That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Based on the reward hypothesis, to achieve a goal we must encode that goal (and the problem) as a reward signal; the agent then simply learns from those rewards. Furthermore, it is crucial to remember that the reward tells the agent what needs to be achieved, not how to achieve it.
So, what is the Agent's true goal? It is not to get the highest reward immediately, but to maximize the total accumulated reward in the long run. We call this accumulated sum the Return ($G_t$).
Note
To be precise, $G_t$ is the sum of rewards the agent accumulates in the future (i.e., from time step $t+1$ until the end). Why future sums instead of past sums? Recall that the agent's purpose is to maximize the reward it will accumulate: it constantly looks ahead and chooses the future path that yields the most reward.
Depending on the task type, the calculation of differs:
- Episodic Tasks: Interaction breaks down into subsequences called episodes. Each episode ends in a terminal state at time $T$: $G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T$.
- Continuing Tasks: Interaction goes on forever ($T = \infty$). To prevent $G_t$ from diverging, we use Discounting with a parameter $\gamma \in [0, 1)$: $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.
Note (Why Discounting?)
There are two simple reasons:
- Uncertainty: We cannot be certain about the future, so rewards further down the line are less “sure” than immediate rewards.
- Patience: In finance, for example, money held now (immediate reward) is worth more than money received later. $\gamma$ can be viewed as the agent's "patience." If $\gamma = 0$, the agent is myopic (greedy), caring only about the immediate reward $R_{t+1}$.
Mathematically, $\gamma < 1$ ensures the infinite sum converges in continuing tasks (as long as the reward sequence is bounded). If $\gamma = 1$, the sum might not converge.
We can observe a recursive relationship:
$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \cdots \right) = R_{t+1} + \gamma G_{t+1}$$
The formula above is a critically important recursive property. It means: the return at the current time step equals the immediate reward plus the discounted return from the next time step.
To define a single return formula covering both cases, we use:
$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$$
In the continuing case, $T = \infty$ and $\gamma < 1$. In the episodic case, $\gamma = 1$ (usually) and $T$ is finite. We just cannot have both $T = \infty$ and $\gamma = 1$ at the same time.
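As a quick illustration of the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, here is a short, self-contained Python sketch (the reward list and $\gamma$ are arbitrary) that computes the returns of one episode by sweeping backwards from the final step:

```python
def returns(rewards, gamma=0.9):
    """rewards[t] is R_{t+1}; result[t] is G_t, computed via G_t = R_{t+1} + gamma * G_{t+1}."""
    G = 0.0                      # the return after the terminal step is 0
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

# Example: reward only at the last step of a 3-step episode.
print(returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```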
3. Policies and Value Functions
How does an agent know if a state is “good”? A state is only good if it leads to a high return. But the return depends on how the agent behaves in the future. Therefore, the agent needs to know how to select and evaluate its actions.
We have Value Functions to estimate the goodness of a state (or a state-action pair). The goodness of a state is simply the expected return achievable from that state. Meanwhile, the Policy is the agent’s brain—it helps the agent make decisions based on the current state.
Definition (Policy)
A Policy is a mapping from states to probabilities of selecting each possible action. If an agent follows policy $\pi$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$ given $S_t = s$. Since $\pi(\cdot \mid s)$ is a probability distribution over actions, $\sum_{a \in \mathcal{A}(s)} \pi(a \mid s) = 1$.
There are two main types of Value Functions:
- State-value function $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$: the expected return when starting in $s$ and following policy $\pi$ thereafter.
- Action-value function $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$: the expected return starting from $s$, taking action $a$, and then following policy $\pi$.
The Q in the famous Q-Learning algorithm (and later Deep Q-Networks, DQN) stands for this action-value function $q$.
Note
A key phrase here is following policy $\pi$. In other words, at any state $s$, we could use many different policies to generate an action. The value function only gives the expected return for one specific policy $\pi$, assuming all future decisions are also made using that same policy $\pi$.
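To see what "expected return under a specific policy $\pi$" means operationally, here is a rough Monte Carlo sketch in Python: it rolls out a made-up stochastic policy on a made-up two-state episodic MDP many times and averages the discounted returns to estimate $v_\pi(s)$. Everything here (state names, rewards, the policy) is invented for illustration only.

```python
import random

# Toy dynamics: dynamics[(s, a)] -> list of (next_state, reward, probability); "T" is terminal.
dynamics = {
    ("s0", "left"):  [("T", 0.0, 1.0)],
    ("s0", "right"): [("s1", 1.0, 1.0)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("T", 2.0, 1.0)],
}
# An arbitrary stochastic policy pi(a | s).
policy = {"s0": {"left": 0.5, "right": 0.5},
          "s1": {"left": 0.5, "right": 0.5}}

def step(s, a):
    """Sample (next_state, reward) from p(., . | s, a)."""
    outcomes = dynamics[(s, a)]
    i = random.choices(range(len(outcomes)), weights=[p for _, _, p in outcomes])[0]
    return outcomes[i][0], outcomes[i][1]

def estimate_v(s, gamma=0.9, n_episodes=20_000):
    """Monte Carlo estimate of v_pi(s): average the return over many rollouts of pi."""
    total = 0.0
    for _ in range(n_episodes):
        state, G, discount = s, 0.0, 1.0
        while state != "T":
            a = random.choices(list(policy[state]), weights=list(policy[state].values()))[0]
            state, r = step(state, a)
            G += discount * r
            discount *= gamma
        total += G
    return total / n_episodes

print(estimate_v("s0"))  # should approach ~1.19, the exact Bellman solution for this toy MDP
```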
4. The Bellman Equation
Because $G_t$ is recursive ($G_t = R_{t+1} + \gamma G_{t+1}$), the value function must also be recursive. This relationship allows us to compute value functions by "backtracking" (or backing up) information from the next state to the current state. We can write $v_\pi$ recursively as:
$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big], \quad \text{for all } s \in \mathcal{S}$$
The equation above is the Bellman equation for the state-value function (see proof in Appendix A).
- The Bellman equation states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way. This is the Consistency Condition.
- It shows that the value of a state is the weighted average of the immediate rewards the agent expects, plus the discounted value of the states the agent enters next.
Similarly, we have the Bellman equation for the action-value function $q_\pi$:
$$q_\pi(s, a) = \sum_{s'} \sum_{r} p(s', r \mid s, a) \Big[ r + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \Big]$$
Proof provided in Appendix B.
Note
Why do we need $q_\pi$ if we already have $v_\pi$?
If we only know how "good" a state is ($v_\pi$), we still need the environment dynamics $p(s', r \mid s, a)$ to work out which action leads to that good state (via the Bellman equation relating $q_\pi$ to $v_\pi$). However, if we know $q_\pi$, we can simply choose the action with the best $q$-value without knowing anything about the dynamics. This is the motivation for the next section and the core idea behind Model-Free Reinforcement Learning.
Another important concept is the Backup Diagram, which visualizes the Bellman equation. Based on the equation, we can interpret the Backup Diagram as a tree where leaves are future states, and non-leaf nodes represent the weighted average of the nodes below them.
- Root (open circle, representing a state): state $s$.
- First branch: the policy $\pi(a \mid s)$.
- First node (solid circle, representing a state-action pair): the pair $(s, a)$.
- Second branch: the environment dynamics $p(s', r \mid s, a)$.
- Leaves: the next states $s'$, whose values $v_\pi(s')$ have been computed previously (to calculate the current $v_\pi(s)$, we look ahead to $v_\pi(s')$).
Remark (Backup Operation)
We call this a Backup. Unlike simulation (which goes forward in time), a backup transfers information from the future ($v_\pi(s')$) back to the present ($v_\pi(s)$). All RL algorithms (Dynamic Programming, TD Learning, Monte Carlo) revolve around approximating this Bellman equation; a small policy-evaluation sketch follows.
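The backup idea turns directly into iterative policy evaluation from Dynamic Programming: start from arbitrary values and repeatedly apply the Bellman backup for $v_\pi$ until the values stop changing. Below is a minimal Python sketch; the two-state MDP and the policy are invented, and the sweep is the standard in-place variant, not code from the book.

```python
# Iterative policy evaluation: v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
GAMMA = 0.9
states = ["s0", "s1"]
dynamics = {  # dynamics[(s, a)] -> list of (next_state, reward, probability)
    ("s0", "a"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "b"): [("s1", 0.0, 1.0)],
    ("s1", "a"): [("s0", 2.0, 1.0)],
    ("s1", "b"): [("s1", 0.0, 1.0)],
}
policy = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"a": 1.0, "b": 0.0}}

v = {s: 0.0 for s in states}          # arbitrary initial values
while True:
    delta = 0.0
    for s in states:                  # one sweep of backups over all states
        backup = sum(
            pi_a * sum(p * (r + GAMMA * v[s_next]) for s_next, r, p in dynamics[(s, a)])
            for a, pi_a in policy[s].items()
        )
        delta = max(delta, abs(backup - v[s]))
        v[s] = backup                 # in-place update: new values are used immediately
    if delta < 1e-8:                  # stop when no state's value changes noticeably
        break

print(v)  # fixed point of the Bellman equation for this (made-up) policy
```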
5. Bellman Optimality Equation
We have defined value functions for a specific policy . But we don’t just want to evaluate a policy; we want to find the best policy (the Optimal Policy).
Definition (Optimal Policy)
A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words: $\pi \geq \pi'$ if and only if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s \in \mathcal{S}$.
There is always at least one policy that is better than or equal to all other policies. We call this the optimal policy, denoted by $\pi_*$.
All optimal policies share the same state-value function, called the optimal state-value function:
$$v_*(s) \doteq \max_{\pi} v_\pi(s), \quad \text{for all } s \in \mathcal{S}$$
Similarly, all optimal policies share the same optimal action-value function:
$$q_*(s, a) \doteq \max_{\pi} q_\pi(s, a), \quad \text{for all } s \in \mathcal{S}, \, a \in \mathcal{A}(s)$$
Just like the value functions for a specific policy $\pi$, the value functions for the optimal policy can be written in a Bellman form, called the Bellman Optimality Equation.
The relationship between the optimal action-value and optimal state-value functions is:
$$q_*(s, a) = \mathbb{E}\big[ R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a \big]$$
Bellman Optimality Equation for the state-value function:
$$v_*(s) = \max_{a \in \mathcal{A}(s)} \sum_{s'} \sum_{r} p(s', r \mid s, a) \big[ r + \gamma v_*(s') \big]$$
Bellman Optimality Equation for the action-value function:
$$q_*(s, a) = \sum_{s'} \sum_{r} p(s', r \mid s, a) \big[ r + \gamma \max_{a'} q_*(s', a') \big]$$
Warning (Why is the Bellman Optimality Equation hard to solve?)
The standard Bellman equation is a System of Linear Equations, as proven in Appendix D. It can be solved using matrix operations.
However, the Bellman Optimality Equation contains the $\max$ operator. This makes it non-linear. We cannot use linear algebra to solve it in "one shot" (closed form). We are forced to use iterative methods like Value Iteration or Q-Learning to approximate the solution gradually.
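To make the "iterative instead of closed form" point concrete, here is a small value iteration sketch in Python. The tiny MDP is invented; the loop simply applies the $\max$-backup of the Bellman optimality equation until the values converge.

```python
# Value iteration: v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
GAMMA = 0.9
states = ["s0", "s1"]
actions = ["a", "b"]
dynamics = {  # dynamics[(s, a)] -> list of (next_state, reward, probability)
    ("s0", "a"): [("s0", 0.0, 1.0)],
    ("s0", "b"): [("s1", 1.0, 1.0)],
    ("s1", "a"): [("s0", 2.0, 1.0)],
    ("s1", "b"): [("s1", 0.0, 1.0)],
}

def backup(s, a, v):
    """One-step look-ahead: sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    return sum(p * (r + GAMMA * v[s_next]) for s_next, r, p in dynamics[(s, a)])

v = {s: 0.0 for s in states}
while True:
    new_v = {s: max(backup(s, a, v) for a in actions) for s in states}
    if max(abs(new_v[s] - v[s]) for s in states) < 1e-8:
        break
    v = new_v

print(v)  # approximates v* for this toy MDP; no linear solve is possible because of the max
```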
Remark
Remember to read Appendix E.
6. References
- Achiam, Joshua (2018). Spinning Up in Deep Reinforcement Learning. https://spinningup.openai.com/en/latest/index.html
- Sutton, Richard S. and Barto, Andrew G. (2018). Reinforcement Learning: An Introduction. The MIT Press. http://incompleteideas.net/book/the-book-2nd.html
- Agarwal, Alekh; Jiang, Nan; Kakade, Sham M.; Sun, Wen (2021). Reinforcement Learning: Theory and Algorithms. https://rltheorybook.github.io/
- Amelio Vazquez-Reina (https://stats.stackexchange.com/users/2798/amelio-vazquez-reina). Deriving Bellman Equation in Reinforcement Learning. https://stats.stackexchange.com/q/243384
- krishnab (https://cs.stackexchange.com/users/17922/krishnab). How to setup the Bellman Equation as a linear system of equation. https://cs.stackexchange.com/q/142128
Appendix
A. Proof of Bellman Equation for State-Value Function
We have:
$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma \, \mathbb{E}_\pi[G_{t+1} \mid S_t = s]$$
There are two expectations we need to resolve: the expectation of the immediate reward, $\mathbb{E}_\pi[R_{t+1} \mid S_t = s]$, and the expectation of the next return, $\mathbb{E}_\pi[G_{t+1} \mid S_t = s]$.
Expectation of the next reward:
We have:
$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_{r} r \, p(r \mid s)$$
We need to "decompose" the probability $p(r \mid s)$ into two parts: the policy $\pi(a \mid s)$ and the dynamics $p(s', r \mid s, a)$. Applying the sum rule and product rule (more details here), we get:
$$p(r \mid s) = \sum_{a} \sum_{s'} \pi(a \mid s) \, p(s', r \mid s, a)$$
Substituting this back into the expectation:
$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_{r} r \sum_{a} \sum_{s'} \pi(a \mid s) \, p(s', r \mid s, a)$$
Since the policy $\pi(a \mid s)$ does not depend on the summations over $r$ and $s'$, we can "swap" the sums and pull it out. Finally:
$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} r \, p(s', r \mid s, a)$$
Expectation of the return at the next time step:
Recall the law of total expectation: $\mathbb{E}[X \mid Y] = \mathbb{E}\big[ \mathbb{E}[X \mid Y, Z] \mid Y \big]$.
Applying the property above:
$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \mathbb{E}_\pi\big[ \mathbb{E}_\pi[G_{t+1} \mid S_t = s, S_{t+1}] \mid S_t = s \big]$$
Further expanding the conditional expectation:
$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a) \, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, S_{t+1} = s']$$
Definition (Conditional Independence)
If $p(x \mid y, z) = p(x \mid z)$, we say $X$ and $Y$ are conditionally independent given $Z$.
Due to the Markov property of MDPs, $G_{t+1}$ and $S_t$ are conditionally independent given $S_{t+1}$ (the same applies to $G_{t+1}$ and $A_t$). Thus:
$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] = v_\pi(s')$$
Combining everything, we arrive at:
$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big]$$
B. Proof of Bellman Equation for Action-Value Function
From the proof in Appendix A, we see:
$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big]$$
Also:
$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \sum_{s'} \sum_{r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big]$$
To satisfy the recursive nature of the Bellman equation, we need to express $q_\pi$ in terms of another $q_\pi$. First:
$$v_\pi(s') = \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a')$$
Substituting this into the expression for $q_\pi(s, a)$, we get the final result:
$$q_\pi(s, a) = \sum_{s'} \sum_{r} p(s', r \mid s, a) \Big[ r + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \Big]$$
Note
Notice that we can write the state-value function $v_\pi$ as a weighted sum of action-value functions $q_\pi$:
$$v_\pi(s) = \sum_{a} \pi(a \mid s) \, q_\pi(s, a)$$
C. Unified Notation for Episodic and Continuing Tasks
Let $\mathcal{S}$ be the set of non-terminal states and $\mathcal{S}^{+}$ be the set of all states (including terminal states).
Note
For the dynamics to remain consistent in both cases, we need to define the probability that the agent "escapes" the terminal state (or rather, how it transitions when it is in a terminal state).
Looking closer, a terminal state is a state the agent cannot escape. Thus, for a terminal state $s$, the dynamics $p(s', r \mid s, a)$ will be 0 for all pairs $(s', r)$ and equal to 1 only if $s' = s$ and $r = 0$. In other words:
$$p(s', r \mid s, a) = \begin{cases} 1 & \text{if } s' = s \text{ and } r = 0 \\ 0 & \text{otherwise} \end{cases} \quad \text{for terminal } s$$
Where:
- $s' = s$: if we are in the terminal state, we are "stuck" there forever. We call this an absorbing state.
- $r = 0$: the reward for (staying in) the terminal state is 0.
Since the dynamics remain consistent, the return can be written with a single formula:
$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k, \quad \text{including the possibility that } T = \infty \text{ or } \gamma = 1 \text{ (but not both)}$$
By simply defining the dynamics for the terminal state (an absorbing state with zero reward) and switching to this unified return formula, we have unified the continuing and episodic cases into one.
D. Linearity of the Bellman Equation
For each state $s$, we have a value $v_\pi(s)$. If we assume $|\mathcal{S}| = n$ states, we can write $n$ equations, turning the task of finding $v_\pi$ into solving a system of equations:
$$v_\pi(s_i) = \sum_{a} \pi(a \mid s_i) \sum_{s'} \sum_{r} p(s', r \mid s_i, a) \big[ r + \gamma v_\pi(s') \big], \quad i = 1, \dots, n$$
This is beautiful because we can write this system in matrix form, showing that the Bellman equation for a specific policy $\pi$ is linear. First, define:
- State-transition matrix $P_\pi \in \mathbb{R}^{n \times n}$, with $[P_\pi]_{ij} \doteq \sum_{a} \pi(a \mid s_i) \, p(s_j \mid s_i, a)$: if the agent is in state $s_i$, what is the probability of transitioning to state $s_j$, averaged over the actions of policy $\pi$?
- Expected reward vector $\mathbf{r}_\pi \in \mathbb{R}^{n}$, with $[\mathbf{r}_\pi]_i \doteq \sum_{a} \pi(a \mid s_i) \, r(s_i, a)$: if the agent is in state $s_i$, what is the expected immediate reward, averaged over the actions of policy $\pi$?
Thus, the Bellman equation for each state $s_i$ becomes:
$$v_\pi(s_i) = [\mathbf{r}_\pi]_i + \gamma \sum_{j=1}^{n} [P_\pi]_{ij} \, v_\pi(s_j)$$
Writing this as a system:
$$\begin{bmatrix} v_\pi(s_1) \\ \vdots \\ v_\pi(s_n) \end{bmatrix} = \begin{bmatrix} [\mathbf{r}_\pi]_1 \\ \vdots \\ [\mathbf{r}_\pi]_n \end{bmatrix} + \gamma P_\pi \begin{bmatrix} v_\pi(s_1) \\ \vdots \\ v_\pi(s_n) \end{bmatrix}$$
Finally, the matrix form is:
$$\mathbf{v}_\pi = \mathbf{r}_\pi + \gamma P_\pi \mathbf{v}_\pi$$
We can find the value of any policy $\pi$ by solving this matrix equation:
$$\mathbf{v}_\pi = (I - \gamma P_\pi)^{-1} \mathbf{r}_\pi$$
Note
We can theoretically calculate $\mathbf{v}_\pi$ exactly this way, but in practice we rarely do, because:
- First, matrix inversion (via LU decomposition or Gaussian elimination) has a complexity of $O(n^3)$, where $n$ is the number of states. If $n$ is small, it works. But for chess, where the number of states is astronomically large, inversion is impossible.
- Second, we often do not know the dynamics of the environment (the rules of the world). Therefore, we cannot compute $P_\pi$ or $\mathbf{r}_\pi$ in the first place. Still, for tiny MDPs where the model is known, the direct solve is a handy sanity check; a small sketch follows this note.
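Here is a minimal NumPy sketch of that sanity check; the two-state transition matrix and reward vector are made-up numbers, not taken from any real task.

```python
import numpy as np

gamma = 0.9
# P_pi[i, j] = sum_a pi(a|s_i) p(s_j | s_i, a); each row sums to 1.
P_pi = np.array([[0.25, 0.75],
                 [1.00, 0.00]])
# r_pi[i] = expected immediate reward in state s_i under the policy.
r_pi = np.array([0.5, 2.0])

# Solve (I - gamma * P_pi) v = r_pi; np.linalg.solve avoids forming the inverse explicitly.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)
```

Using `np.linalg.solve` instead of `np.linalg.inv` is the usual numerical practice, but the result is the same $(I - \gamma P_\pi)^{-1} \mathbf{r}_\pi$.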
E. How to Find the Optimal Policy
We have the following formulas:
$$\pi_*(s) = \arg\max_{a \in \mathcal{A}(s)} \sum_{s'} \sum_{r} p(s', r \mid s, a) \big[ r + \gamma v_*(s') \big]$$
$$\pi_*(s) = \arg\max_{a \in \mathcal{A}(s)} q_*(s, a)$$
There are two ways to select the optimal policy: based on $v_*$ (the first formula) and based on $q_*$ (the second formula).
Based on optimal state-value function:
- First, to find $\pi_*$ knowing only $v_*$, we must know the environment dynamics $p(s', r \mid s, a)$. We must also perform a one-step (look-ahead) search, summing over all possible next states $s'$ and rewards $r$.
- This selection appears greedy because we only look at the immediate reward $r$ and the value of the next state $v_*(s')$. However, it yields the optimal result for the long term. This works because $v_*$ already accounts for the optimal reward sequence in the future.
Based on optimal action-value function: In this case, things are much more tractable. The agent does not need to perform a one-step search; it simply picks the action with the highest $q_*(s, a)$. Furthermore, this method does not rely on the environment dynamics, something we often do not know. Both options are sketched in the code below.
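The contrast fits in a few lines of Python. The $q_*$ and $v_*$ tables and the dynamics below are made-up toy numbers (chosen to be mutually consistent), not the output of any real algorithm; the point is only that the $q_*$ route needs no dynamics, while the $v_*$ route does.

```python
GAMMA = 0.9
states, actions = ["s0", "s1"], ["a", "b"]

# Made-up optimal value tables (v* equals max_a q* in each state).
q_star = {("s0", "a"): 1.0, ("s0", "b"): 2.5, ("s1", "a"): 3.0, ("s1", "b"): 0.4}
v_star = {"s0": 2.5, "s1": 3.0}

# Made-up dynamics, only needed for the v*-based extraction.
dynamics = {  # (s, a) -> list of (next_state, reward, probability)
    ("s0", "a"): [("s0", -1.25, 1.0)],
    ("s0", "b"): [("s1", -0.20, 1.0)],
    ("s1", "a"): [("s0",  0.75, 1.0)],
    ("s1", "b"): [("s1", -2.30, 1.0)],
}

# Model-free extraction: just an argmax over q*.
pi_from_q = {s: max(actions, key=lambda a: q_star[(s, a)]) for s in states}

# Model-based extraction: one-step look-ahead through the dynamics onto v*.
def lookahead(s, a):
    return sum(p * (r + GAMMA * v_star[s_next]) for s_next, r, p in dynamics[(s, a)])

pi_from_v = {s: max(actions, key=lambda a: lookahead(s, a)) for s in states}

print(pi_from_q)  # {'s0': 'b', 's1': 'a'}
print(pi_from_v)  # same greedy choices, but only because we had access to the dynamics
```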