Kishan Chakraborty
Research Scholar

Temporal Difference (TD) Learning

The concept of TD learning was introduced by Richard Sutton in 1988 as a class of incremental learning procedures for prediction problems. Whereas conventional prediction methods are driven by minimizing the difference between the prediction and the actual outcome, TD learning is driven by the difference between *temporally successive* predictions.
For example, consider the task of predicting on Monday the probability of rain on Saturday. The agent can take one of two approaches:

  • It can update its prediction only after the actual observation on Saturday (the conventional, supervised approach).
  • It can observe the weather on each successive day, starting from Tuesday, and update its prediction as it goes (TD learning).

In the second approach, the agent updates one estimate based on another estimate; this is called bootstrapping. Sutton advocates viewing TD learning as a method for efficiently learning to predict arbitrary events, not only goal-related ones.
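As a rough numeric sketch of the difference (the probabilities below are made up, and the variable names are mine):

```python
import numpy as np

# Hypothetical predictions, made on Monday..Friday, of the probability of rain
# on Saturday, followed by the outcome actually observed on Saturday (it rained).
predictions = np.array([0.5, 0.6, 0.4, 0.7, 0.9])   # P_Mon ... P_Fri (made-up)
outcome = 1.0

# Supervised-style errors: every prediction is judged against the final outcome,
# so no error is available until Saturday.
supervised_errors = outcome - predictions

# TD-style errors: each prediction is judged against the *next* prediction
# (the last one against the outcome), so each error is available one day later.
td_errors = np.append(predictions[1:], outcome) - predictions

# Telescoping: the TD errors from day t onward sum to the supervised error of day t.
print(np.allclose(np.cumsum(td_errors[::-1])[::-1], supervised_errors))   # True
```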

Temporal-difference and supervised-learning approaches to prediction

A prediction problem can be cast in the supervised-learning paradigm by taking the first part of the data as the input and the second part as the label. In the example above, the measurement on Monday would be used to predict the weather on Saturday, then the measurement on Tuesday, and so on. This, however, ignores the sequential nature of the problem.

Single-step and multi-step prediction
In single-step prediction, the correctness of each prediction is revealed at the very next step (e.g., predicting Tuesday’s weather on Monday). In multi-step prediction, the actual outcome is not revealed immediately, but partial information relevant to it becomes available at each step (e.g., using Tuesday’s observation to update the prediction for Saturday). In single-step problems there is no distinction between supervised and TD learning; the two differ only in the multi-step case.

Supervised updates are equivalent to TD
Consider an observation sequence $\{x_t\}_1^m$ and let $z$ be the outcome of the sequence. For each observation $x_t$, let $P_t$ denote the prediction of $z$. Assume for simplicity that $P_t$ depends only on $x_t$; it could also depend on previous observations, in which case the formulation would be modified accordingly. The predictions are computed from a modifiable weight vector $w$, written $P_t = P(x_t, w)$, and learning corresponds to updating $w$ to make better predictions.

The weights are updated only after the full sequence has been observed, not after each intermediate observation. For each observation an increment $\Delta w_t$ is computed, and all increments are applied at the end of the sequence:

$$w \leftarrow w + \sum_{t=1}^{m} \Delta w_t$$

In the supervised approach, the data are represented as the pairs $\{(x_t, z)\}_1^m$, and each increment is given by

$$\Delta w_t = \alpha (z - P_t)\,\nabla_w P_t$$

If $P(x_t, w) = w^T x_t$ (linear predictions), this becomes the Widrow-Hoff rule

$$\Delta w_t = \alpha (z - w^T x_t)\,x_t$$
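As a minimal sketch of this procedure, assuming linear predictions and a single observation sequence (the function name, data shapes, and learning rate are illustrative, not from the paper):

```python
import numpy as np

def widrow_hoff_update(w, xs, z, alpha=0.1):
    """Supervised (Widrow-Hoff) procedure: every increment is computed against
    the final outcome z, and the weights change only after the full sequence."""
    delta_w = np.zeros_like(w)
    for x_t in xs:                        # one increment per observation
        P_t = w @ x_t                     # linear prediction P(x_t, w) = w^T x_t
        delta_w += alpha * (z - P_t) * x_t
    return w + delta_w                    # w <- w + sum_t Delta w_t
```

Note that every increment here needs $z$, so nothing can be computed before the sequence ends.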

The same result can also be obtained using incremental TD learning.
Using the telescoping identity $z - P_t = \sum_{k=t}^{m}(P_{k+1} - P_k)$, where $P_{m+1} \equiv z$, we have

$$\Delta w_t = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_w P_k$$
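To check that summing these increments over a sequence recovers the supervised total, substitute the telescoping identity into the supervised rule and swap the order of summation:

$$\sum_{t=1}^{m} \alpha (z - P_t)\,\nabla_w P_t = \sum_{t=1}^{m} \alpha \sum_{k=t}^{m} (P_{k+1} - P_k)\,\nabla_w P_t = \sum_{k=1}^{m} \alpha (P_{k+1} - P_k) \sum_{t=1}^{k} \nabla_w P_t,$$

which, after relabelling the outer index, is exactly the sum of the TD increments above.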

Each increment in this form depends only on two successive predictions and on a running sum of past gradients, so it can be computed incrementally as the sequence unfolds, without waiting for the final outcome $z$.
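A minimal sketch of that incremental computation, again assuming linear predictions so that $\nabla_w P_k = x_k$ (the names and the random test data are mine); it reproduces the Widrow-Hoff total update exactly:

```python
import numpy as np

def td_update(w, xs, z, alpha=0.1):
    """TD form: each increment uses two successive predictions and a running
    sum of past gradients (for linear predictions, grad_w P_k = x_k)."""
    delta_w = np.zeros_like(w)
    grad_sum = np.zeros_like(w)              # running sum of gradients up to step t
    preds = [w @ x for x in xs] + [z]        # P_1, ..., P_m, with P_{m+1} = z
    for t, x_t in enumerate(xs):
        grad_sum += x_t
        delta_w += alpha * (preds[t + 1] - preds[t]) * grad_sum
    return w + delta_w

# On the same data, the TD form gives the same total update as the supervised form.
rng = np.random.default_rng(0)
xs, w0, z, alpha = rng.normal(size=(5, 3)), rng.normal(size=3), 1.0, 0.1
supervised = w0 + sum(alpha * (z - w0 @ x) * x for x in xs)
print(np.allclose(td_update(w0, xs, z, alpha), supervised))   # True
```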