Temporal Difference (TD) Learning
TD learning was introduced by Richard Sutton in 1988 as a class of incremental learning procedures
for prediction problems.
Conventional prediction methods are driven by minimizing the difference between the prediction
and the actual outcome; TD learning is instead driven by the difference between *temporally successive* predictions.
For example, consider the task of predicting, on each day of the week, the probability of rain on the coming
Saturday. The agent has two options for learning from its predictions:
- It can update its prediction only after the actual observation on Saturday (the supervised approach).
- It can observe the weather of each successive day, starting from Tuesday, and update its prediction right away (TD learning).
In the second case the agent estimates a value based on another estimate; this is called bootstrapping.
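To make the contrast concrete, here is a minimal sketch of the two update styles, assuming a tabular setup of
our own invention: predictions indexed by day of the week, a step size `alpha`, and a binary rain outcome (none
of these names are from the paper).

```python
import numpy as np

# Minimal sketch (our own simplification, not from the paper):
# P[d] is the predicted probability of rain on Saturday, made on day d,
# where days 0..4 stand for Monday..Friday. `outcome` is 1.0 if it
# actually rained on Saturday, else 0.0. `alpha` is an assumed step size.

alpha = 0.1

def supervised_updates(P, outcome):
    # Wait until Saturday, then correct every day's prediction
    # toward the actual outcome.
    for d in range(len(P)):
        P[d] += alpha * (outcome - P[d])

def td_updates(P, outcome):
    # Move each day's prediction toward the *next* day's prediction
    # (bootstrapping); only Friday's prediction is moved toward the
    # actual outcome.
    for d in range(len(P)):
        target = P[d + 1] if d + 1 < len(P) else outcome
        P[d] += alpha * (target - P[d])

P = np.full(5, 0.5)   # start with uninformative predictions
td_updates(P, 1.0)    # one rainy week of experience
```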
Sutton advocates viewing TD learning as a method for efficiently learning to predict
arbitrary events, not only goal-related ones.
Temporal-difference and supervised-learning approaches to prediction
A prediction problem can be cast in the supervised-learning paradigm by taking the first part of the data as
input and the second part as label. In the weather problem above, the measurement on Monday is used to predict
the outcome on Saturday, and the same is repeated for Tuesday and so on. This formulation ignores the
sequential nature of the problem.
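A hypothetical sketch of this pairing (the feature names and the outcome encoding are placeholders of our own):

```python
# One observed week: each day's measurement is paired with the same
# final outcome z, discarding the order in which the days occurred.
week = ["mon_obs", "tue_obs", "wed_obs", "thu_obs", "fri_obs"]  # placeholder features
z = 1.0  # assumed encoding: 1.0 if it rained on Saturday, else 0.0

pairs = [(x, z) for x in week]
# [('mon_obs', 1.0), ('tue_obs', 1.0), ..., ('fri_obs', 1.0)]
```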
Single-step and Multi-step prediction
In single-step prediction, the correctness of each prediction is revealed at the next step (e.g., predicting
Tuesday's weather on Monday). In multi-step prediction, the actual outcome is not revealed at the immediate
step, but partial information relevant to it is available at each step (e.g., Tuesday's weather observation is
used to refine the prediction for Saturday). In single-step problems there is no distinction between supervised
learning and TD learning; they differ only in the multi-step case.
Supervised updates are equivalent to TD
Consider an observation sequence $\{x_t\}_{t=1}^{m}$ whose outcome is $z$. For each observation $x_t$, let
$P_t$ be the corresponding prediction of $z$. Assume that $P_t$ depends only on $x_t$; it could also depend on
previous observations, in which case the formulation can be modified accordingly. The predictions are computed
using a modifiable parameter (weight) vector $w$, so $P_t = P(x_t, w)$, and learning corresponds to updating
the weights to make better predictions. The weights are updated only after the full sequence has been observed,
not after each intermediate observation: for each observation an increment $\Delta w_t$ is computed, and after
the full sequence
$$w \leftarrow w + \sum_{t=1}^{m} \Delta w_t$$
In the supervised approach, the data is represented as the pairs $\{(x_t, z)\}_{t=1}^{m}$, and after the full
sequence each increment is given by
$$\Delta w_t = \alpha\,(z - P_t)\,\nabla_w P_t$$
If $P(x_t, w) = w^\top x_t$ (linear), then we have the Widrow-Hoff rule
$$\Delta w_t = \alpha\,(z - w^\top x_t)\,x_t$$
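As a sketch of this batch procedure in the linear case (assuming NumPy arrays; the function name, shapes, and
step size are our own choices):

```python
import numpy as np

def widrow_hoff_episode(w, xs, z, alpha=0.1):
    """Supervised update for one sequence: accumulate
    Δw_t = α (z − wᵀ x_t) x_t for every observation, then apply the
    summed increment once, after the full sequence."""
    dw = np.zeros_like(w)
    for x in xs:
        P_t = w @ x                  # linear prediction P(x_t, w) = wᵀ x_t
        dw += alpha * (z - P_t) * x  # Δw_t, using the fixed w throughout
    return w + dw                    # w ← w + Σ_t Δw_t
```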
The same total update can also be obtained with an incremental TD computation.
Using the telescoping identity $z - P_t = \sum_{k=t}^{m} (P_{k+1} - P_k)$, with $P_{m+1} \equiv z$, we have
$$\Delta w_t = \alpha\,(P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_w P_k$$
This equation can be computed incrementally: at step $t$ it requires only the current prediction difference
$P_{t+1} - P_t$ and a running sum of past gradients, without waiting for the final outcome $z$.
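The incremental computation can be sketched under the same assumptions as above: the running sum
$\sum_{k=1}^{t} \nabla_w P_k$ is kept as a single accumulator (an eligibility trace), and in the linear case
$\nabla_w P_k = x_k$.

```python
import numpy as np

def td_episode(w, xs, z, alpha=0.1):
    """TD form of the same update: Δw_t = α (P_{t+1} − P_t) Σ_{k≤t} ∇_w P_k,
    with P_{m+1} = z. Each Δw_t is computable as soon as P_{t+1} is known."""
    dw = np.zeros_like(w)
    e = np.zeros_like(w)             # running sum of gradients (eligibility trace)
    P_prev = w @ xs[0]               # P_1, computed with the fixed w
    for t, x in enumerate(xs):
        e += x                       # ∇_w P_t = x_t in the linear case
        P_next = z if t == len(xs) - 1 else w @ xs[t + 1]
        dw += alpha * (P_next - P_prev) * e
        P_prev = P_next
    return w + dw

# For the same data, the total update matches the supervised sketch
# (assumes widrow_hoff_episode from the earlier sketch is in scope):
rng = np.random.default_rng(0)
xs = list(rng.normal(size=(5, 3)))
w0 = np.zeros(3)
assert np.allclose(td_episode(w0, xs, 1.0),
                   widrow_hoff_episode(w0, xs, 1.0))
```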