My Attempt at Outperforming Deepmind’s Atari Results – UPDATE 10

Hello again!

I said in the last post that I would open-source my continuous HTM GPU implementation. Well, here it is! Sorry for the delay! Link: https://github.com/222464/ContinuousHTMGPU

I did a rough performance test on GPU HTM, and I was able to simulate a region with 16384 columns, 65536 cells, 1.3 million column connections and 21 million lateral connections at 80 iterations per second. I don’t know if this is a lot or not, but keep in mind that I have not yet optimized it. My GPU is a Radeon HD5970.

I have been working on my newest version of HTM-RL. The old version was able to solve the pole balancing and mountain car problems quite easily, but had difficulties as soon as one started adding more regions to the HTM. As new regions were added, the HTM would get increasingly stable. However, it gets so stable that one cannot perceive subtle changes in the input from the sparse distributed representation of the topmost region.

So previously, HTMRL could only learn off of very general concepts that don’t change very rapidly (when more than one region is used). So, it could for instance identify where the most enemies in space invaders were located, but it could not distinguish them individually.

I have come up with a new method that will allow it to use information at all levels of the hierarchy. It is a form of hierarchical reinforcement learning.

The idea is as follows: Have each HTM region represent both the state and the last action taken. Then, for each level of the hierarchy, use a function approximator that maps the column outputs to a Q value. Starting at the topmost level in the hierarchy, use a form of gradient descent on the “action” columns to optimize the Q value (make it as large as possible). This becomes the region’s suggested action. From there, descend down the hierarchy, and again optimize the Q values. But, before the optimization takes place, incorporate the next higher layer’s suggested action into the region’s action output (using a weighted average, for instance).

This way we essentially start with the most general, high-level description of the action, and then perturb it into a more and more fine-grained action as we descend down the hierarchy, based on the Q values.

I sort of doubt the biological plausibility of this approach. But I do think it will work (at all). As long as it ends up working, I don’t care about the biological plausibility that much 🙂

Here is a Blender render I made that shows the structure of the function approximator that is attached to a single HTM region. Please excuse my poor modeling skills!

Key: