
Finite Markov Decision Process

Agent-Environment Interface

In a finite MDP, the sets of states, actions, and rewards all have a finite number of elements. In this case, the random variables $R_t, S_t$ have well-defined discrete probability distributions that depend only on the preceding state and action, i.e.

$$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\} \quad \forall\, s', s \in \mathbb{S},\ a \in \mathbb{A}(s)$$

The function $p$ is called the dynamics of the MDP; it is an ordinary deterministic function of four arguments.

$$\sum_{s' \in \mathbb{S}} \sum_{r \in \mathbb{R}} p(s', r \mid s, a) = 1 \quad \forall\, s \in \mathbb{S},\ a \in \mathbb{A}(s)$$
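As a concrete illustration, here is a minimal sketch of the dynamics of a hypothetical two-state MDP (the states, actions, rewards, and probabilities are all invented for the example) stored as a lookup table from $(s, a)$ to $(s', r, \text{probability})$ triples, together with a check of the normalisation condition above.

```python
# Hypothetical two-state MDP: p[(s, a)] is a list of (s', r, probability) triples.
states = ["s0", "s1"]
actions = {"s0": ["stay", "go"], "s1": ["stay"]}

p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.5, 1.0)],
}

# The probabilities p(s', r | s, a) must sum to 1 for every state-action pair.
for s in states:
    for a in actions[s]:
        total = sum(prob for _, _, prob in p[(s, a)])
        assert abs(total - 1.0) < 1e-12, f"p(.|{s},{a}) does not sum to 1"
print("dynamics are properly normalised")
```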

The Markov property means that the distribution of $S_t, R_t$ depends only on the immediately preceding state and action, not on the earlier history. This is best viewed as a restriction not on the environment but on the state: the state must include information about all aspects of the past agent-environment interaction that make a difference to the future.

Goals and Rewards

Reward Hypothesis: the goals and purposes of the agent can be thought of as the maximization of the expected cumulative reward. If we want the agent to behave as intended, we need to design the reward signal accordingly.

Returns and Episodes

For the sequence of rewards received after time step $t$, $R_{t+1}, R_{t+2}, \ldots$, we generally wish to maximize the expected return, which is some function of the reward sequence, e.g.

$$G_t \overset{\cdot}{=} R_{t+1} + R_{t+2} + \cdots + R_T$$

Here $T$ is the final time step (which makes sense when there is a natural notion of a final state, called the terminal state). Such a sequence is called an episode. Each new episode begins independently of how the previous one terminated.
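A minimal sketch with made-up reward values: the undiscounted return $G_t$ is simply the sum of the rewards received after step $t$.

```python
# Illustrative episode: rewards[0..T-1] hold R_1, ..., R_T (values invented).
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]

def episodic_return(rewards, t=0):
    """G_t = R_{t+1} + R_{t+2} + ... + R_T; rewards[t:] holds R_{t+1}..R_T."""
    return sum(rewards[t:])

print(episodic_return(rewards))       # G_0 = 6.0
print(episodic_return(rewards, t=3))  # G_3 = 5.0
```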

Tasks in which the agent-environment interaction does not break naturally into episodes are called continuing tasks. The previous formulation is problematic here because the final step may be at $T = \infty$, so the return itself could be infinite. The modified return is called the discounted return:

$$\begin{align*} G_t &\overset{\cdot}{=} R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \\ &= \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \\ &= R_{t+1} + \gamma G_{t+1} \end{align*}$$

$0 \leq \gamma \leq 1$ is called the discount factor. The notions of episodic and continuing tasks can be unified by assuming that, for episodic tasks, there exists an absorbing state that, once reached, the process can never leave and in which the reward is always 0.
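A minimal sketch, again with invented numbers, computing the discounted return both directly from the sum and via the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ (assuming a finite reward sequence followed by zero rewards) and checking that the two agree.

```python
# Illustrative values: gamma and the reward sequence R_1..R_4 are made up.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 3.0]

def discounted_return(rewards, gamma, t=0):
    """G_t = sum_k gamma^k * R_{t+k+1}, computed directly from the definition."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

# Recursive form, evaluated backwards from the end of the episode.
G = 0.0
returns = []
for r in reversed(rewards):
    G = r + gamma * G          # G_t = R_{t+1} + gamma * G_{t+1}
    returns.append(G)
returns.reverse()              # returns[t] now equals G_t

assert all(abs(returns[t] - discounted_return(rewards, gamma, t)) < 1e-12
           for t in range(len(rewards)))
print(returns[0])              # G_0 = 1 + 0.9*0 + 0.81*2 + 0.729*3 = 4.807
```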

Policies and Value Functions