Hello again!
As I stated in my previous posts, the precision of the function approximation remains a problem. However, I have found that using SARSA instead of Q-learning (on-policy instead of off-policy) mitigates the problem a lot, since it no longer bootstraps on the largest estimated action value (which is prone to overestimation) but on the estimated value of the action that was actually selected.
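To make the difference concrete, here is a rough sketch of the two bootstrap targets (illustrative names only, not the actual HTMRL code):

```cpp
#include <algorithm>
#include <vector>

// Illustrative only: qNext holds the estimated action values for the next state.
// Q-learning bootstraps on the greedy (maximum) value of the next state,
// while SARSA bootstraps on the value of the action that is actually taken next.
float qLearningTarget(float reward, float gamma, const std::vector<float>& qNext) {
    return reward + gamma * *std::max_element(qNext.begin(), qNext.end());
}

float sarsaTarget(float reward, float gamma, const std::vector<float>& qNext, int nextAction) {
    return reward + gamma * qNext[nextAction];
}

// Either target then goes into the same TD update:
// q[state][action] += alpha * (target - q[state][action]);
```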
Due to this improvement, I decided to move back to feed-forward neural networks with RMS training. The RBF networks train much faster, but they also seem to forget faster (things that are not part of the replay chain). I assume this is due to the online supervised learning involved, which can easily overwrite previously learned information in favor of better matching the new information.
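Roughly, an RMS-style update scales each weight's step by a running average of its squared gradient; here is an illustrative sketch of that idea (my simplification, not the actual AILib code):

```cpp
#include <cmath>

// Illustrative RMSProp-style update for a single weight (an assumption about
// what the RMS training above refers to; not taken from the actual AILib code).
struct RMSState {
    float meanSquare = 0.0f; // running average of this weight's squared gradient
};

void rmsUpdate(float& weight, float gradient, RMSState& state,
               float learnRate = 0.001f, float decay = 0.9f, float epsilon = 1e-6f) {
    // Decay the running average of the squared gradient toward the new gradient...
    state.meanSquare = decay * state.meanSquare + (1.0f - decay) * gradient * gradient;
    // ...then scale the step so every weight takes a roughly normalized step.
    weight -= learnRate * gradient / (std::sqrt(state.meanSquare) + epsilon);
}
```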
I now have HTMRL performing both pole balancing and the mountain car problem with very little training time. It usually figures both of them out in under a minute.
I started scaling up to the Arcade Learning Environment and did some short test runs. I also did a lot of optimization, since I want it to run in real time (DeepMind's algorithm did not run in real time as far as I know, and it used more than just one standard desktop PC). I have not yet trained it enough to get decent results (5 minutes is not much), but I will probably try an overnight run soon.
For the continuous-action version of the algorithm, I experimented with free-energy based reinforcement learning. It learns a value function as usual, but a continuous policy can be derived from that value function quite easily. I have started implementing replay updates for this as well.
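As a rough illustration of what I mean by deriving a policy from the value function (a simplified sketch with made-up names, not the actual implementation): treat the exponentiated value as unnormalized probabilities over candidate actions and sample from that distribution.

```cpp
#include <cmath>
#include <functional>
#include <random>
#include <vector>

// Illustrative sketch only: given some learned value function value(state, action),
// derive a stochastic policy by treating exp(value / temperature) as unnormalized
// probabilities over a set of candidate actions (a Boltzmann/softmax policy).
// In free-energy based RL, the value would be the negative free energy of the model.
float sampleAction(const std::vector<float>& state,
                   const std::function<float(const std::vector<float>&, float)>& value,
                   const std::vector<float>& candidateActions,
                   float temperature, std::mt19937& rng) {
    std::vector<float> weights;
    for (float a : candidateActions)
        weights.push_back(std::exp(value(state, a) / temperature));

    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return candidateActions[dist(rng)];
}
```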
Here is a short video of the system learning the mountain car problem:
For those just seeing this for the first time, the source code is available in the “htmrl” directory here: https://github.com/222464/AILib. It uses the CMake build system.
It’s getting there!
See you next time!