Finite Markov Decision Process
Agent Environment Interface
In a finite MDP, the sets of states, actions, and rewards all have a finite number of elements.
In this case, the random variables $R_t$ and $S_t$ have well-defined discrete probability distributions that depend only on the preceding state and action, i.e.

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}, \quad \forall\, s', s \in \mathcal{S},\ r \in \mathcal{R},\ a \in \mathcal{A}(s)$$
The function $p$ is called the dynamics of the MDP; it is an ordinary deterministic function of four arguments.
$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \forall\, s \in \mathcal{S},\ a \in \mathcal{A}(s)$$
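As a concrete illustration, the dynamics of a small finite MDP can be stored as a table mapping each $(s, a)$ pair to a distribution over $(s', r)$ pairs. The sketch below is a minimal, hypothetical example (the two-state MDP and names such as `dynamics` are not from the text); it also checks the normalization condition above.

```python
# Minimal sketch of finite-MDP dynamics p(s', r | s, a), stored as a table.
# The two-state example and all names here are hypothetical illustrations.

# dynamics[(s, a)] is a dict mapping (s', r) -> probability.
dynamics = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 2.0): 0.1, ("s1", 0.0): 0.9},
}

def p(s_next, r, s, a):
    """Return p(s', r | s, a) from the table (0 if the pair never occurs)."""
    return dynamics.get((s, a), {}).get((s_next, r), 0.0)

# Normalization: for every (s, a), the probabilities over (s', r) sum to 1.
for (s, a), dist in dynamics.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)
```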
The Markov property means that the probability distribution of $S_t$ and $R_t$ depends only on the immediately preceding state and action, not on the earlier history.
This is best viewed as a restriction not on the environment but on the state: the state must include information about all aspects of the past agent-environment interaction that make a difference for the future.
Goals and Rewards
Reward Hypothesis: the goals and purposes of the agent can be thought of as the maximization of the expected cumulative reward.
If we want the agent to behave as intended, we must design the reward signal accordingly.
Returns and Episodes
For the sequence of rewards received after time step $t$, $R_{t+1}, R_{t+2}, \dots$, we wish to maximize the expected return, where the return $G_t$ is some function of the reward sequence. In the simplest case,

$$G_t \doteq R_{t+1} + R_{t+2} + \dots + R_T$$
Here $T$ is the final time step (this makes sense when the interaction breaks naturally into subsequences, each ending in a special state called the terminal state). Each such subsequence is called an episode. The next episode begins independently of how the previous one ended.
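A minimal sketch of the undiscounted episodic return, assuming the rewards $R_{t+1}, \dots, R_T$ of one episode are collected in a Python list (the sample rewards are hypothetical):

```python
def episodic_return(rewards):
    """Undiscounted return G_t = R_{t+1} + R_{t+2} + ... + R_T.

    `rewards` is assumed to hold the rewards received after time step t,
    i.e. [R_{t+1}, ..., R_T] for a single episode.
    """
    return sum(rewards)

# Hypothetical episode with three remaining rewards.
print(episodic_return([1.0, 0.0, 2.0]))  # -> 3.0
```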
Tasks in which the agent-environment interaction does not break naturally into episodes are called continuing tasks. The previous formulation is problematic for them because the final time step would be $T = \infty$ and the return could itself be infinite. The modified return is called the discounted return:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1}$$
where $0 \le \gamma \le 1$ is called the discount factor; with $\gamma < 1$ and bounded rewards, the infinite sum is finite.
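A sketch of the discounted return, computed both directly from the sum and via the recursion $G_t = R_{t+1} + \gamma G_{t+1}$; the reward sequence and $\gamma$ below are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Direct sum: G_t = sum_k gamma^k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma):
    """Recursion G_t = R_{t+1} + gamma * G_{t+1}, with the return after the
    final step taken to be 0."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0, 5.0]   # hypothetical finite reward sequence
gamma = 0.9
assert abs(discounted_return(rewards, gamma)
           - discounted_return_recursive(rewards, gamma)) < 1e-9
```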
The notions of episodic and continuing tasks can be unified by assuming that every episodic task ends in a special absorbing state that transitions only to itself and generates rewards of 0, so that once it is reached the process never leaves it and contributes nothing further to the return.
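This unification can be checked numerically: padding an episode's reward sequence with zeros (the rewards generated by the absorbing state) leaves the return unchanged. A small self-contained sketch, with a hypothetical episode and $\gamma$:

```python
# Appending absorbing-state rewards (all 0) does not change the return,
# so the episodic return is a special case of the discounted formulation.
# The reward sequence and gamma below are hypothetical.
gamma = 0.9
episode = [1.0, 0.0, 2.0]
padded = episode + [0.0] * 100   # "continuing" version via the absorbing state

g_episode = sum(gamma**k * r for k, r in enumerate(episode))
g_padded = sum(gamma**k * r for k, r in enumerate(padded))
assert abs(g_episode - g_padded) < 1e-9
```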
Policies and Value Functions