
Understanding Value and Policy Iteration

In this blog post (and the ones that follow) we will try to answer some fundamental questions common to almost all RL algorithms.

  1. What does the agent see and do in the environment?
  2. What is the objective of the algorithm?
  3. How does the agent explore and experience the environment? Extrinsic vs. intrinsic?
  4. How does the agent use the learnt experience to achieve its objective?
  5. How does the agent learn new things without forgetting what already works?

Value and policy iteration are very simple algorithms used when the environment is fully known: at each time step the agent knows the state it is in as well as the probability of each next state for a particular action.
The objective of these algorithms is to find a policy $\pi$ with the largest value function at every state $s$. Both algorithms do this iteratively. Value iteration starts with arbitrary values for each state and updates them based on the observed reward:

$$V_{k+1}(s) = \max_{a} \bigg(r(s, a) + \gamma \sum_{s'} P(s'|s, a)\, V_{k}(s')\bigg)$$

In the above equation the observed reward is the only true signal, and it is used to update the arbitrarily initialized value functions.
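
As a minimal NumPy sketch of this update (the array layout `P[s, a, s']` for transition probabilities, `R[s, a]` for expected rewards, and the function name are illustrative assumptions, not from the post):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Value iteration for a fully known MDP.

    Assumed (illustrative) model layout:
      P[s, a, s'] -- probability of landing in s' after taking action a in state s
      R[s, a]     -- expected immediate reward for taking action a in state s
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                       # arbitrarily initialized values
    for _ in range(max_iters):
        # Q[s, a] = r(s, a) + gamma * sum_{s'} P(s'|s, a) * V_k(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)                    # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:      # stop when the update is negligible
            V = V_new
            break
        V = V_new
    greedy_policy = Q.argmax(axis=1)             # greedy policy w.r.t. the final values
    return V, greedy_policy
```
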
Policy iteration is a two-step process consisting of policy evaluation and policy improvement, as sketched in code below the list.

  • Policy Evaluation: Evaluate the existing policy: $V_{k}(s) = r(s, \pi(s)) + \gamma \sum_{s'} P(s'|s, \pi(s))\, V_{k}(s')$
  • Policy Improvement: Use the evaluated value function to update the policy: $\pi(s) = \argmax_{a \in A} \bigg(r(s, a) + \gamma \sum_{s'} P(s'|s, a)\, V_{k}(s')\bigg)$
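
A corresponding NumPy sketch of this two-step loop, under the same assumed `P[s, a, s']` / `R[s, a]` layout as above; here policy evaluation solves the linear fixed-point system exactly, which is where the cubic cost discussed below comes from:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iters=1_000):
    """Policy iteration for a fully known MDP (same illustrative P[s, a, s'], R[s, a] layout)."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)       # start from an arbitrary policy
    for _ in range(max_iters):
        # Policy evaluation: solve V = r_pi + gamma * P_pi V exactly (an |S| x |S| linear system).
        P_pi = P[np.arange(n_states), policy]    # transitions under the current policy, shape (|S|, |S|)
        r_pi = R[np.arange(n_states), policy]    # rewards under the current policy, shape (|S|,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

        # Policy improvement: act greedily w.r.t. the evaluated value function.
        Q = R + gamma * (P @ V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # policy is stable -> converged
            break
        policy = new_policy
    return V, policy
```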

For an infinite-horizon discounted ($\gamma < 1$) MDP, both algorithms are guaranteed to converge to the optimal value function. Value and policy iteration also converge for $\gamma = 1$ if the rewards are either all positive or all negative and the MDP is episodic/terminating.
The major difference between value and policy iteration is the per-iteration cost. The per-iteration cost of value iteration is $|S|^2|A|$, while that of policy iteration is $|S|^3 + |S|^2|A|$; the cubic term appears because policy evaluation requires solving a system of implicit (linear) equations.
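For a rough sense of scale (illustrative numbers, not from the post): with $|S| = 1000$ states and $|A| = 10$ actions, one value-iteration sweep touches on the order of $|S|^2|A| = 10^7$ transition entries, while one policy-iteration step additionally pays roughly $|S|^3 = 10^9$ operations for the exact policy evaluation step.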