**Published:** September 20, 2024 | **Tags:** #Math #ReinforcementLearning
**Last Updated:** March 4, 2025 (Fixed errors in the Practical Example section)
<details class="toc-container">
<summary><strong>Table of Contents</strong></summary>
<ul>
<li>Introduction</li>
<li>Foundational Concepts
<ul>
<li>Markov Property</li>
<li>Markov Chain</li>
<li>Markov Reward Process (MRP)</li>
</ul>
</li>
<li>Deriving the Bellman Equation</li>
<li>Practical Example</li>
<li>Conclusion</li>
</ul>
</details>
## 1. Introduction
The Bellman Equation is a cornerstone of dynamic programming, used to solve optimization problems across many domains. Named after Richard Bellman, it breaks complex problems into manageable subproblems, enabling the identification of globally optimal solutions. Its applications span diverse fields, including:
- Shortest path problems
- Resource allocation
- Control theory
- Reinforcement learning
This blog post will explain the derivation of the Bellman Equation, starting with some key underlying concepts.
## 2. Foundational Concepts
### 2.1 Markov Property
The Markov Property characterizes a stochastic process where the future state depends solely on the present state, independent of the sequence of preceding events. Mathematically, this is expressed as:
$
P(X_{t+1} = x_{t+1} | X_{0:t} = x_{0:t}) = P(X_{t+1} = x_{t+1} | X_t = x_t)
$
Where $X_{0:t}$ represents the set of random variables $X_0, X_1, ..., X_t$, and $x_{0:t}$ is the state sequence.
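To make the property concrete, here is a minimal Python sketch (assuming a made-up two-state chain, unrelated to the examples below) that estimates the next-state distribution conditioned on the current state alone versus conditioned on the current and previous states. For a Markov process the two estimates agree up to sampling noise.
```python
import numpy as np

# Hypothetical two-state chain (states 0 and 1), used purely for illustration.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)
states = [0]
for _ in range(200_000):
    states.append(rng.choice(2, p=P[states[-1]]))  # next state depends only on the current one
states = np.array(states)

x_prev, x_curr, x_next = states[:-2], states[1:-1], states[2:]

# P(X_{t+1}=1 | X_t=0) should match P(X_{t+1}=1 | X_t=0, X_{t-1}=1): both come out near 0.1.
print(x_next[x_curr == 0].mean())
print(x_next[(x_curr == 0) & (x_prev == 1)].mean())
```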
### 2.2 Markov Chain
A Markov Chain is a stochastic process that exhibits the Markov Property. For a random state sequence $s_1, s_2, s_3, ..., s_t$, the next state $s_{t+1}$ depends only on $s_t$. If we define the historical state set as $h_t = \{s_1, s_2, ..., s_t\}$, a Markov process satisfies:
$
P(s_{t+1}|s_t) = P(s_{t+1}|h_t)
$
Here's an example of a Markov Process:
![[Pasted image 20240920102013.png]]
In this diagram, four states transition among one another (a code sketch of this chain appears after the list below). Starting from $s_1$:
- It has a 0.1 probability of staying in $s_1$
- A 0.2 probability of transitioning to $s_2$
- A 0.7 probability of transitioning to $s_4$
From $s_4$:
- It has a 0.3 probability of transitioning to $s_2$
- A 0.2 probability of transitioning to $s_3$
- A 0.5 probability of staying in $s_4$
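Such a chain is fully described by a transition matrix whose rows sum to one. The sketch below encodes the diagram in Python and samples a trajectory from it; note that the figure only specifies the outgoing probabilities for $s_1$ and $s_4$, so the rows for $s_2$ and $s_3$ are placeholder assumptions added just to make the example runnable.
```python
import numpy as np

# Transition matrix: rows = current state, columns = next state (s1, s2, s3, s4).
# Only the s1 and s4 rows come from the diagram; the s2 and s3 rows are assumed.
P = np.array([
    [0.1, 0.2, 0.0, 0.7],  # s1 (from the diagram)
    [0.3, 0.4, 0.3, 0.0],  # s2 (assumed)
    [0.0, 0.5, 0.5, 0.0],  # s3 (assumed)
    [0.0, 0.3, 0.2, 0.5],  # s4 (from the diagram)
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution

def sample_trajectory(P, start=0, steps=10, seed=42):
    """Sample a state sequence; each next state depends only on the current one."""
    rng = np.random.default_rng(seed)
    states = [start]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print([f"s{i + 1}" for i in sample_trajectory(P)])
```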
### 2.3 Markov Reward Process (MRP)
A Markov Reward Process extends the Markov Chain by incorporating rewards. It models situations where, in addition to state transitions, there's a reward associated with each state or transition.
*Key Concepts in MRP*
1. **Return ($G_t$)**: The total discounted reward accumulated from time step $t$ onward (a code sketch after this list shows how to compute it). It's calculated as:
$
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + ... + \gamma^{T-t-1} r_T
$
Where $\gamma$ is the discount factor.
2. **State-Value Function ($V(s)$)**: Represents the expected return starting from a specific state $s$ and following a particular policy or set of transition rules:
$
V(s) = E[G_t | s_t = s]
$
3. **Discount Factor ($\gamma$)**: A value between 0 and 1 that determines the present value of future rewards. It helps in:
- Preferring immediate rewards over future ones
- Handling infinite time horizons by ensuring the total sum of rewards remains finite
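Given a sampled sequence of rewards, the return is straightforward to compute. The sketch below (plain Python, with made-up rewards) sums the discounted rewards backwards, which is exactly the recursion $G_t = r_{t+1} + \gamma G_{t+1}$ that the derivation in the next section relies on.
```python
def discounted_return(rewards, gamma=0.5):
    """Compute G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...

    `rewards` holds the rewards received after time t, i.e. [r_{t+1}, r_{t+2}, ..., r_T].
    """
    g = 0.0
    # Summing backwards applies the recursion G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([10, 8, 5], gamma=0.5))  # 10 + 0.5*8 + 0.25*5 = 15.25
```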
## 3. Deriving the Bellman Equation
Starting with the definition of the state-value function:
$
\begin{align}
V(s) &= E[G_t | s_t = s] \tag{1} \\
&= E[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + ... + \gamma^{T-t-1} r_T | s_t = s] \tag{2} \\
&= E[r_{t+1} | s_t = s] + \gamma E[G_{t+1} | s_t = s] \tag{3} \\
&= E[r_{t+1} | s_t = s] + \gamma \sum_{G_{t+1}} P(G_{t+1} | s_t = s) G_{t+1} \tag{4} \\
&= E[r_{t+1} | s_t = s] + \gamma \sum_{s'} \sum_{G_{t+1}} P(G_{t+1} | s_{t+1} = s', s_t = s) P(s_{t+1} = s' | s_t = s) G_{t+1} \tag{5} \\
&= E[r_{t+1} | s_t = s] + \gamma \sum_{s'} E[G_{t+1} | s_{t+1} = s', s_t = s] P(s_{t+1} = s' | s_t = s) \tag{6} \\
&= E[r_{t+1} | s_t = s] + \gamma \sum_{s'} V(s') P(s_{t+1} = s' | s_t = s) \tag{7} \\
&= r(s) + \gamma \sum_{s' \in S} P(s' | s) V(s') \tag{8}
\end{align}
$
The final form in equation (8) is the Bellman Equation.
**Key Step Explanation**:
* The transition from step (4) to (5) involves applying the law of total probability:
$
\begin{aligned}
P(G_{t+1} | s_t = s) &= \frac{P(G_{t+1}, s_t = s)}{P(s_t = s)} \\
&= \frac{\sum_{s'} P(G_{t+1}, s_{t+1} = s', s_t = s)}{P(s_t = s)} \\
&= \sum_{s'} P(G_{t+1} | s_{t+1} = s', s_t = s) P(s_{t+1} = s' | s_t = s)
\end{aligned}
$
* Another key step occurs from step (6) to (7). Due to the Markov property, once we know the next state $s^{\prime}$, the future return $G_{t+1}$ becomes independent of the current state $s$. Therefore,
$E[G_{t+1} | s_{t+1} = s', s_t = s] = E[G_{t+1} | s_{t+1} = s'] = V(s')$. This crucial step leverages the recursive nature of the value function, allowing us to express the expected future return in terms of
the value function itself.
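Because equation (8) is linear in $V$, a finite MRP can be solved directly: stacking the equation over all states gives $V = r + \gamma P V$, hence $V = (I - \gamma P)^{-1} r$, and the same fixed point is reached by simply iterating the update $V \leftarrow r + \gamma P V$. The sketch below demonstrates both, reusing the hypothetical four-state transition matrix from Section 2.2 with an assumed reward vector.
```python
import numpy as np

# Transition matrix from the earlier sketch (s2/s3 rows assumed) and made-up rewards r(s).
P = np.array([
    [0.1, 0.2, 0.0, 0.7],
    [0.3, 0.4, 0.3, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.3, 0.2, 0.5],
])
r = np.array([1.0, 0.0, -1.0, 2.0])
gamma = 0.9

# Closed form: V = (I - gamma * P)^{-1} r, valid because the Bellman equation is linear in V.
V_exact = np.linalg.solve(np.eye(4) - gamma * P, r)

# Fixed-point iteration of the Bellman equation: V <- r + gamma * P @ V.
V = np.zeros(4)
for _ in range(1000):
    V = r + gamma * P @ V

print(V_exact)
print(V)  # matches V_exact up to numerical precision
```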
## 4. Practical Example
Let's illustrate the Bellman Equation with a simple scenario:
Imagine you've just scored 100 points on a difficult math exam (state $s_1$), earning a reward of 10 points. You now have two options:
1. Play a video game (state $s_2$) with a probability of 0.7 and a reward of 8 points.
2. Continue studying (state $s_3$) with a probability of 0.3 and a reward of 5 points.
Assume a discount factor $\gamma = 0.5$.
Given:
- $r(s_1) = 10$
- $r(s_2) = 8$, $V(s_2) = 8$ (terminal state)
- $r(s_3) = 5$, $V(s_3) = 5$ (terminal state)
- $P(s_2 | s_1) = 0.7$
- $P(s_3 | s_1) = 0.3$
Applying the Bellman Equation:
$
\begin{aligned}
V(s_1) &= r(s_1) + \gamma [P(s_2 | s_1) V(s_2) + P(s_3 | s_1) V(s_3)] \\
&= 10 + 0.5 \times (0.7 \times 8 + 0.3 \times 5) \\
&= 10 + 0.5 \times 7.1 \\
&= 13.55
\end{aligned}
$
Therefore, the value of being in state $s_1$ (having just aced the math exam) is 13.55 points.
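As a quick sanity check, here is the same arithmetic in a few lines of Python (state names and numbers taken directly from the example above):
```python
gamma = 0.5
r_s1 = 10
# next state -> (transition probability from s1, value of that next state)
transitions = {"s2": (0.7, 8), "s3": (0.3, 5)}

v_s1 = r_s1 + gamma * sum(p * v for p, v in transitions.values())
print(v_s1)  # 13.55
```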
## 5. Conclusion
The Bellman Equation is a powerful tool in reinforcement learning and dynamic programming. It provides a recursive definition of the value function, allowing us to compute the optimal value of a state by considering immediate rewards and the discounted future value of subsequent states. Understanding this equation is crucial for developing efficient algorithms in various optimization problems and decision-making processes.