My Attempt at Outperforming Deepmind’s Atari Results – UPDATE 6


Another update!

I have been working on the image-to-features system. Deepmind did not have a separate feature extractor and reinforcement learner. They used a convolutional neural network to predict Q values for discrete actions. While I have both a discrete and continuous version of HTMRL, I really want to focus on the continuous version since it can be applied to more problems in the future. However, the way I am currently obtaining a continuous policy from a Q function makes it far more efficient to separate the structures for image feature extraction and reinforcement learning instead of combining them.

I fixed several large bugs in my HTM implementation that were causing it to predict poorly. In case the HTM approach doesn’t work out, though, I also tested a convolutional restricted Boltzmann machine (Conv-RBM) network on the pole balancing problem. If I do end up pursuing this route instead of HTM, I would need to handle partial observability in the reinforcement learning portion of the algorithm.

In HTMRL, partial observability is handled by the HTM itself: it produces novel patterns that combine both the current state and the predicted future state. A problem I am currently having, though, is that unless I make the receptive field of the HTM columns very small, the HTM does not perceive small changes in the image and simply ignores them. It is possible to remedy this by adding more and more columns, but this drastically slows down the algorithm. Therefore, I am currently experimenting with different parameters and topologies to find what works best. If I absolutely cannot get it to work, I still have that Conv-RBM.

In case I do end up using Conv-RBM instead, I would need to handle partial observability differently. Fortunately, I was already able to solve problems such as the T-maze problem by simply feeding the hidden units used by the free-energy based reinforcement learner back to the visible units. To be honest, I didn’t expect such a simple solution to work, but it did. So, unlike Deepmind’s algorithm, I can now learn “infinitely” long-term dependencies (in theory, anyway). As far as I know, Deepmind used a fixed-size history window, which both increases the complexity of the resulting MDP and cannot account for long-term dependencies that require knowledge from beyond that window.
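As a rough illustration of the feedback idea (not the actual code from the repository), here is a minimal Python sketch. It assumes a plain, non-convolutional RBM with binary hidden units whose visible layer is the concatenation of the observation, the action, and the hidden activations from the previous step. The class name, method names, and sizes are all hypothetical, and no learning rule is included.

```python
import math
import random


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class RecurrentFreeEnergyQ:
    """Toy RBM whose visible layer is [observation, action, previous hidden]."""

    def __init__(self, n_obs, n_act, n_hidden, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        # The hidden units are fed back as extra visible units.
        n_visible = n_obs + n_act + n_hidden
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(n_visible)]
                  for _ in range(n_hidden)]
        self.b_vis = [0.0] * n_visible
        self.b_hid = [0.0] * n_hidden
        self.prev_hidden = [0.0] * n_hidden  # empty memory at episode start

    def reset(self):
        # Clear the recurrent memory, e.g. at the start of an episode.
        self.prev_hidden = [0.0] * self.n_hidden

    def _visible(self, obs, action):
        return list(obs) + list(action) + self.prev_hidden

    def _pre_activations(self, v):
        return [c + sum(w * x for w, x in zip(row, v))
                for row, c in zip(self.W, self.b_hid)]

    def q_value(self, obs, action):
        # Q is approximated by the negative free energy of the visible vector.
        v = self._visible(obs, action)
        free_energy = -sum(b * x for b, x in zip(self.b_vis, v))
        free_energy -= sum(softplus(p) for p in self._pre_activations(v))
        return -free_energy

    def step(self, obs, action):
        # Store the hidden activations so the next step can see them as inputs.
        v = self._visible(obs, action)
        self.prev_hidden = [sigmoid(p) for p in self._pre_activations(v)]
        return self.prev_hidden


agent = RecurrentFreeEnergyQ(n_obs=2, n_act=1, n_hidden=4)
q_before = agent.q_value([1.0, 0.0], [0.5])
agent.step([0.0, 1.0], [0.5])  # something happened earlier in the episode
q_after = agent.q_value([1.0, 0.0], [0.5])
print(q_before, q_after)
```

The two printed values differ even though the observation and action are identical, because the fed-back hidden state carries information about the earlier step. A fixed-size history window, by contrast, would lose that information once the step falls outside the window.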

In case you missed it, the free-energy based reinforcement learner (FERL) I use for the continuous version of HTMRL is derived from this: I modified it heavily to both produce continuous actions and handle partial observability.
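For readers unfamiliar with the term, the general idea behind free-energy based reinforcement learning can be written down compactly. The following is a sketch assuming a standard RBM with binary hidden units. The symbols (visible vector v built from state s and action a, weights W, visible biases b, hidden biases c) are introduced here for illustration and do not describe the modifications mentioned above.

```latex
F(s, a) = -\sum_i b_i v_i \;-\; \sum_j \log\!\left(1 + \exp\!\left(c_j + \sum_i W_{ji} v_i\right)\right),
\qquad v = [\, s ;\, a \,]

Q(s, a) \approx -F(s, a)
```

In words: state-action pairs that the network assigns low free energy are treated as having high value, so training the Q function amounts to adjusting W, b, and c so that the negative free energy tracks the expected return.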

So, to summarize: HTM, or Conv-RBM + recurrent hidden nodes in the reinforcement learner portion. I will probably end up doing both 😉

For those just seeing this for the first time, the source code for this is available here, under the directories “htmrl” and “convrl”: It uses the CMake build system.

Until next time!
