
Glossary

  1. Agent: The learner (decision maker) is called the agent.

  2. Bellman Optimality Equation: Represents the value function of a state with respect to an optimal policy (see the value-iteration sketch after this glossary).

    $$V^*(s) = \max_{a}\left[ r(s, a) + \gamma\, \mathbb{E}_{s'}\big[V^*(s')\big] \right]$$
  3. Environment: The RL environment is the universe of an RL agent. It consists of states, which describe the agent's situation (e.g., its location), and a set of actions from which the agent chooses what to do. For each action taken in each state, the environment awards a reward and moves the agent to a new state. Mathematically, an environment is given by an MDP $(S, A, P, R, \gamma)$:

    • $S$: Set of all possible states.
    • $A$: Set of all possible actions.
    • $P(s' \mid s, a)$: Transition probability from state $s$ to $s'$ under action $a$.
    • $R(s, a, s')$: Reward for taking action $a$ in state $s$ and landing in the following state $s'$.
    • $\gamma$: Discount factor.
  4. Episode: A natural notion of a final step at which the agent-environment interaction breaks; this end state is often called the terminal state.

  5. Policy: A mapping from states to probabilities of selecting each possible action. For a policy $\pi$, $\pi(a \mid s)$ is the probability that $A_t = a$ given $S_t = s$.

  6. Rewards: The environment produces a reward as a result of the agent's interaction, and the agent seeks to maximize the total reward over time through its choice of actions. The expected reward can be expressed in terms of a two-tuple $(s, a)$ or a three-tuple $(s, a, s')$ (see the expected-reward sketch after this glossary):

    $$r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in R} r \sum_{s' \in S} p(s', r \mid s, a)$$

    $$r(s, a, s') = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in R} r\, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$$
  7. Value Function: A metric to evaluate a policy $\pi$. It tells what an agent can expect to achieve in the long run starting from a state $s_t$ and following the policy $\pi$ (see the policy-evaluation sketch after this glossary). Mathematically,

    $$V_{\pi}(s_t) = \mathbb{E}_{\pi}\!\left[ \sum_{i=0}^{\infty} \gamma^{i}\, r_{t+1+i} \right]$$
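
A minimal sketch of how the Bellman optimality equation and the MDP tuple $(S, A, P, R, \gamma)$ fit together: value iteration repeatedly applies the backup $V(s) \leftarrow \max_a \big[r(s,a) + \gamma\, \mathbb{E}_{s'}[V(s')]\big]$ until the values stop changing. The two-state, two-action MDP below (transition tensor `P`, reward table `R`) is a made-up toy example, not one taken from this post.

```python
# Value iteration on a hypothetical 2-state, 2-action tabular MDP.
import numpy as np

gamma = 0.9
# P[s, a, s'] = P(s' | s, a); R[s, a] = r(s, a). Toy values for illustration only.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: V(s) <- max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * P @ V          # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("V* =", V)                   # optimal state values
print("greedy policy:", Q.argmax(axis=1))
```

Once the values converge, the greedy action in each state (the arg max over `Q`) is an optimal policy for this toy MDP.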
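
The two expected-reward formulas can be evaluated directly once the joint distribution $p(s', r \mid s, a)$ is tabulated. The dictionary `p` below is a hypothetical two-state example used only to illustrate the sums; the state and action names are made up.

```python
# Expected rewards r(s, a) and r(s, a, s') from a tabulated p(s', r | s, a).

# p[(s, a)][(s_next, r)] = p(s', r | s, a)   (hypothetical values)
p = {
    ("s0", "a0"): {("s0", 0.0): 0.2, ("s1", 1.0): 0.8},
    ("s0", "a1"): {("s1", 0.0): 0.5, ("s1", 5.0): 0.5},
}

def r_sa(s, a):
    # r(s, a) = sum over r and s' of r * p(s', r | s, a)
    return sum(r * prob for (_, r), prob in p[(s, a)].items())

def r_sas(s, a, s_next):
    # r(s, a, s') = sum over r of r * p(s', r | s, a) / p(s' | s, a)
    p_next = sum(prob for (sn, _), prob in p[(s, a)].items() if sn == s_next)
    total = sum(r * prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)
    return total / p_next

print(r_sa("s0", "a0"))            # 0.8
print(r_sas("s0", "a1", "s1"))     # 2.5
```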
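
A sketch tying the policy and value-function entries together: iterative policy evaluation applies the Bellman expectation backup for a fixed stochastic policy $\pi(a \mid s)$, and the last lines compute the discounted return $\sum_{i} \gamma^{i} r_{t+1+i}$ for one sampled reward sequence. The MDP, the policy table `pi`, and the reward list are all assumed toy values.

```python
# Policy evaluation for a fixed stochastic policy on a hypothetical tabular MDP.
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s'] = P(s' | s, a)
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                  # R[s, a] = r(s, a)
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])                 # pi[s, a] = pi(a | s)

V = np.zeros(2)
for _ in range(1000):
    # Bellman expectation backup:
    # V(s) <- sum_a pi(a|s) [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * P @ V
    V_new = (pi * Q).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("V_pi =", V)

# Discounted return of one sampled episode: G_t = sum_i gamma^i * r_{t+1+i}
rewards = [1.0, 0.0, 2.0, 1.0]              # hypothetical rewards r_{t+1}, r_{t+2}, ...
G = sum(gamma**i * r for i, r in enumerate(rewards))
print("G_t =", G)
```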