As usual, I have been playing around with different kinds of reinforcement learning algorithms. I am searching for the best match for HTM-style algorithms (memory predictors), and I think I may have found what I am looking for.
I don’t know of any papers on this, although there probably have been some. Essentially, I took vanilla policy gradients and computed them with respect to Q values. Originally, I did this by finding the gradient of Q with respect to the action, and following that gradient to improve the action. The following video shows this in action on a simulated robot I made in Box2D + SFML:
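To make the Q-action gradient idea concrete, here is a minimal sketch in numpy. The quadratic critic `q` is a hypothetical stand-in for a trained Q function (the original uses a free-energy model); the point is only the update rule: nudge the action uphill along ∂Q/∂a.

```python
import numpy as np

# Toy differentiable critic: Q(s, a) = -(a - w.s)^2, whose greedy action
# for a state s is w.s. This is an illustrative stand-in, not the
# free-energy model from the post.
w = np.array([0.5, -0.3])

def q(s, a):
    return -(a - w @ s) ** 2

def dq_da(s, a):
    # Analytic gradient of Q with respect to the (scalar) action.
    return -2.0 * (a - w @ s)

# Q-action gradient ascent: repeatedly move the action uphill on Q.
s = np.array([1.0, 2.0])
a = 0.0
lr = 0.1
for _ in range(100):
    a += lr * dq_da(s, a)

# a converges toward the greedy action w.s
```

This is the part that demands a differentiable architecture: computing `dq_da` requires backpropagating through the critic to the action input.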
The video above uses my simple FERL (free-energy reinforcement learner) with Q-action gradients. However, it is not without problems. First of all, the architecture used in that video doesn’t scale very well, and it is difficult to make hierarchical. It only works with experience replay at the moment, so it has forgetting issues. It also requires that the architecture be differentiable; with HTM’s SDRs (sparse distributed representations), this is basically impossible.
So, instead, I found a way of estimating the Q-action gradient. I simply predict the exploratory difference of the last taken action (the perturbation added on top of the predicted action), but only when that action resulted in a positive temporal-difference error. This is similar to the CACLA algorithm (continuous actor-critic learning automaton), but with an additional layer of policy-gradient estimation. Once I have the Q-action gradient estimate, I can apply it to the action itself by learning to predict the action with the estimated gradient added on to it.
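The update described above can be sketched without any backpropagation. Below is a minimal, hedged interpretation with linear function approximators; all names (`actor_w`, `grad_w`, `value_w`) and the linear critic are my illustrative assumptions, not the actual SDR implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

state_dim, gamma, lr = 4, 0.99, 0.05
actor_w = np.zeros(state_dim)  # predicts a scalar action from the state
grad_w = np.zeros(state_dim)   # predicts the estimated Q-action gradient
value_w = np.zeros(state_dim)  # linear critic V(s) for the TD error

def step(s, s_next, reward, noise_scale=0.1):
    """One gradient-free actor update, CACLA-style."""
    global actor_w, grad_w, value_w
    a_pred = actor_w @ s
    noise = rng.normal(0.0, noise_scale)  # exploratory difference
    a_taken = a_pred + noise

    # Temporal-difference error from the critic.
    td = reward + gamma * (value_w @ s_next) - value_w @ s
    value_w += lr * td * s

    if td > 0.0:
        # Keep the exploratory difference as a Q-action gradient
        # estimate only when it led to a positive TD error.
        grad_w += lr * (noise - grad_w @ s) * s

    # Learn to predict the action with the estimated gradient added on.
    target = a_pred + grad_w @ s
    actor_w += lr * (target - a_pred) * s
    return a_taken
```

Because the gradient is estimated from outcomes rather than computed by differentiation, nothing here requires the underlying representation to be differentiable, which is what makes it compatible with SDRs.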
I have a video, similar to the one above, showing this idea working successfully. However, it currently uses only a single SDR layer. I am still working on applying it to a full HTM-style hierarchy. That will likely be the subject of the next post! 🙂
Until next time!