Sorry for not posting in a while! I said I wanted to get a working version of HTMRL before my next post, so that delayed it a little 🙂
After coming up with many wild theories and spending several afternoons thinking until my head hurt, I finally came up with a solution for incorporating HTM with reinforcement learning. The original idea in my last post is still somewhat related, but it has changed a lot.
At first I wanted to take a “predict what you want to see” approach with HTM. So I perturbed the columns and cells in the direction of maximum reward (basically just following the Q weights). However, I quickly realized that this wasn’t going to work due to the strong interdependence between the columns. I then tried to find a way to maximize reward by modifying actions analytically, but the solutions I came up with were either way too complex or simply didn’t work.
So finally, I took a somewhat strange way out. In order to find the action that maximizes the Q values, I simply used simulated annealing to derive the optimal action.
This solution uses nothing besides HTM and simulated annealing. It is a “pure” approach to HTM reinforcement learning in a way.
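To make the idea concrete, here is a minimal sketch of simulated annealing used to search for an action that maximizes a Q function. This is not the actual HTMRL code: `anneal_action`, `q_value`, the Gaussian perturbation, the action bounds, and the geometric cooling schedule are all assumptions chosen for illustration, and a toy quadratic stands in for the Q values an HTM region would produce.

```python
import math
import random


def anneal_action(q_value, start_action, iterations=200, start_temp=1.0,
                  cooling=0.97, step=0.2, low=-1.0, high=1.0, rng=None):
    """Search for an action with a high Q value using simulated annealing.

    q_value is any callable mapping an action (list of floats) to a scalar.
    Everything here is a hypothetical stand-in, not the HTMRL implementation.
    """
    rng = rng or random.Random(0)
    current = list(start_action)
    current_q = q_value(current)
    best, best_q = list(current), current_q
    temp = start_temp

    for _ in range(iterations):
        # Propose a nearby action, clamped to the allowed range.
        candidate = [min(high, max(low, a + rng.gauss(0.0, step)))
                     for a in current]
        candidate_q = q_value(candidate)
        delta = candidate_q - current_q

        # Always accept improvements; accept a worse action with a
        # probability that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_q = candidate, candidate_q
            if current_q > best_q:
                best, best_q = list(current), current_q

        temp *= cooling  # geometric cooling schedule (an assumption)

    return best, best_q


def toy_q(action):
    # Toy Q function with its maximum at action = 0.3.
    return -(action[0] - 0.3) ** 2


best_action, best_q = anneal_action(toy_q, [-1.0])
```

Early on, the high temperature lets the search wander away from the starting action; as it cools, the search settles near whatever maximum it has found.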
Now one would think that this is super slow, inefficient, and in no way biologically plausible. The last one might be true, but it is actually very efficient and fast. I don’t have to derive completely new actions from scratch every time: I can use the action that the HTM region predicts as a starting point. So I really only need one simulated annealing iteration per step, since the HTM region stores intermediate results that can be built on over time. It’s sort of like annealing over time with memory.
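Here is a rough sketch of that “one iteration per step” idea, under some loud assumptions: `anneal_step` is a hypothetical helper, a plain variable stands in for the action the HTM region would predict and remember, and the toy Q function and slow temperature decay are made up for the example.

```python
import math
import random


def anneal_step(q_value, predicted_action, temp, step=0.1,
                low=-1.0, high=1.0, rng=random):
    """Run a single annealing iteration starting from a predicted action.

    Returns either a perturbed action or the prediction itself. This is a
    hypothetical sketch, not the HTMRL implementation.
    """
    candidate = [min(high, max(low, a + rng.gauss(0.0, step)))
                 for a in predicted_action]
    delta = q_value(candidate) - q_value(predicted_action)

    # Improvements are always kept; worse actions are kept only with a
    # temperature-dependent probability (temp of 0 means purely greedy).
    if delta >= 0 or rng.random() < math.exp(delta / max(temp, 1e-9)):
        return candidate
    return list(predicted_action)


def toy_q(action):
    # Toy Q function with its maximum at action = 0.3.
    return -(action[0] - 0.3) ** 2


rng = random.Random(1)
predicted = [-1.0]  # stand-in for the action the HTM region predicts
temp = 0.5
for _ in range(300):
    # One annealing iteration per environment step; the result becomes
    # the starting point ("memory") for the next step.
    predicted = anneal_step(toy_q, predicted, temp, rng=rng)
    temp *= 0.98
```

The cost per time step is just two Q evaluations, which is why carrying the result forward is so much cheaper than running a full annealing loop every step.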
This property also makes the system highly scalable, something I will need once I start learning from raw Atari game frames.
And it actually works! I just got it to work right before writing this blog post, so my test of its capabilities is kind of lame (mountain car problem), but don’t worry, more complicated tests (the Atari environment) will come!
So here it is doing the mountain car task. I am using my CPU implementation in this video, since it made experimenting easier.