- Agent: The learner (decision maker) is called the agent.
- Bellman Optimality Equation: Expresses the value function of a state with respect to an optimal policy.
  $$V^*(s) = \max_a \left[ r(s, a) + \gamma \, \mathbb{E}_{s'}[V^*(s')] \right]$$
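To make the update concrete, here is a minimal value-iteration sketch in Python that repeatedly applies this backup; the two-state transition table `P` and reward table `r` are invented purely for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; all numbers are illustrative.
# P[s, a, s'] = transition probability, r[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

V = np.zeros(2)  # initial guess for V*
for _ in range(1000):
    # Bellman optimality backup: V(s) <- max_a [ r(s, a) + gamma * E_{s'}[V(s')] ]
    Q = r + gamma * (P @ V)   # Q[s, a] = r(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)  # approximate optimal state values V*(s)
```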
- Environment: The RL environment is the universe in which an RL agent lives. It consists of a set of states that capture
  the agent's situation (e.g., its location) and a set of actions from which the agent chooses what to do. For each action
  taken in each state, the environment awards the agent a reward and moves it to a new state.
  Mathematically, an environment is given by an MDP $(S, A, P, R, \gamma)$; a toy example in code follows this list.
  - $S$: Set of all possible states.
  - $A$: Set of all possible actions.
  - $P(s' \mid s, a)$: Probability of transitioning from state $s$ to $s'$ under action $a$.
  - $R(s, a, s')$: Reward for taking action $a$ in state $s$ and landing in state $s'$.
  - $\gamma$: Discount factor.
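For concreteness, here is the toy example promised above: the five elements of the tuple written out as plain Python data. The states, actions, and numbers are made up and carry no special meaning.

```python
# A toy MDP (S, A, P, R, gamma) written out as plain Python data (illustrative only).
S = ["s0", "s1", "terminal"]
A = ["left", "right"]

# P[(s, a)] = list of (next_state, probability) pairs
P = {
    ("s0", "left"):  [("s0", 1.0)],
    ("s0", "right"): [("s1", 0.9), ("s0", 0.1)],
    ("s1", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("terminal", 1.0)],
}

# R[(s, a, s_next)] = reward for that transition
R = {
    ("s0", "left", "s0"): 0.0,
    ("s0", "right", "s1"): 1.0,
    ("s0", "right", "s0"): 0.0,
    ("s1", "left", "s0"): 0.0,
    ("s1", "right", "terminal"): 10.0,
}

gamma = 0.95
```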
- Episode: A natural notion of a final step at which the agent-environment interaction ends. This end state is often called the terminal state.
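A minimal episode loop might look as follows; the random-walk dynamics and terminal states are invented just to show the interaction stopping at a terminal state.

```python
import random

# One episode: interact until a terminal state is reached (made-up random-walk dynamics).
state, steps, done = 0, 0, False
while not done:
    action = random.choice([-1, +1])        # agent selects an action
    state += action                         # environment moves to a new state
    reward = 1.0 if state == 3 else 0.0     # environment emits a reward
    done = state in (-3, 3)                 # terminal states end the episode
    steps += 1
print(f"episode ended after {steps} steps in terminal state {state}")
```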
- Policy: A mapping from states to probabilities of selecting each possible action. For a policy $\pi$, $\pi(a \mid s)$
  is the probability that $A_t = a$ given $S_t = s$.
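A stochastic policy of this form can be stored as a simple table of per-state action probabilities, as in the sketch below; the states, actions, and probabilities are arbitrary.

```python
import random

# pi[s][a] = probability of choosing action a in state s (illustrative numbers).
pi = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state):
    """Draw A_t ~ pi(. | S_t = state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "s0"))  # "right" with probability 0.7
```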
- Rewards: The environment produces a reward as a result of the agent's interaction, and the agent seeks to maximize
  the reward it accumulates over time through its choice of actions.
  The expected reward can be expressed as a function of two or of three arguments:
  $$r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)$$
  $$r(s, a, s') = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$$
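When the joint dynamics $p(s', r \mid s, a)$ are available as a table, both expectations reduce to plain sums; the tiny distribution below is invented only to show the arithmetic.

```python
# p[(s_next, r)] = p(s', r | s, a) for one fixed (s, a) pair (made-up numbers, sums to 1).
p = {
    ("s1", 1.0): 0.6,
    ("s1", 0.0): 0.1,
    ("s2", 5.0): 0.3,
}

# r(s, a): sum over r and s' of r * p(s', r | s, a)
r_sa = sum(r * prob for (s_next, r), prob in p.items())

# r(s, a, s') for s' = "s1": sum over r of r * p(s', r | s, a) / p(s' | s, a)
p_s1 = sum(prob for (s_next, _), prob in p.items() if s_next == "s1")
r_sa_s1 = sum(r * prob for (s_next, r), prob in p.items() if s_next == "s1") / p_s1

print(r_sa, r_sa_s1)  # 2.1 and 0.6 / 0.7 ≈ 0.857
```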
- Value Function: A metric for evaluating a policy $\pi$. It tells what an agent can expect to achieve in the long run when
  starting from a state $s_t$ and following policy $\pi$. Mathematically,
  $$V_\pi(s_t) = \mathbb{E}_\pi\left[\sum_{i=0}^{\infty} \gamma^i \, r_{t+1+i}\right]$$
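As a numerical illustration, the discounted return inside this expectation can be computed for one sampled trajectory of rewards; the reward sequence and discount factor below are arbitrary. $V_\pi(s_t)$ is then the average of such returns over trajectories generated by $\pi$.

```python
# Discounted return G_t = sum_i gamma^i * r_{t+1+i} for one sampled reward sequence
# (rewards and gamma are illustrative only).
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
gamma = 0.9

G = sum(gamma**i * r for i, r in enumerate(rewards))
print(G)  # 0.9**2 * 1.0 + 0.9**4 * 5.0 ≈ 4.09
```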