Fun with Reinforcement Learning


As usual, I have been playing around with different kinds of reinforcement learning algorithms. I am searching for the best match for HTM-style algorithms (memory predictors), and I think I may have found what I am looking for.

I don’t know of any papers on this, although there probably have been some. I essentially took vanilla policy gradients and made them with respect to Q values. Originally, I just did this by finding the action gradient with respect to Q. The following video shows this in action on a simulated robot I made in Box2D + SFML:

The video above uses my simple FERL (free-energy reinforcement learner) with Q-action gradients. However, it is not without problems. First of all, the architecture used in that video doesn’t scale very well, it is difficult to make it hierarchical. It only works with experience replay at the moment, so it has forgetting issues. Also, it requires that the architecture be differentiable. With HTM’s SDRs (sparse distributed representations), this becomes basically impossible.

So, instead, I found a way of estimating the Q-action gradient. I simply predict the exploratory difference of the last taken action only if that action resulted in a positive temporal difference error. This is similar to the CACLA algorithm (continuous actor-critic learning automaton), but with an additional layer of policy gradient estimation capabilities. Once I have the Q-action gradient estimation, I can then apply it to the action itself, by learning to predict the action with the gradient added on to it.

I have a video similar to the one above using this idea successfully. However, it does it using only a single SDR layer. I am still working on applying it to a full HTM-style hierarchy. That will likely be the subject of the next post! 🙂

Until next time!

HTSL2, Evolving the Neocortex, and Human-Like Reinforcement Learning

Hello everyone,

It’s been a while since I have made a post here! This is mostly because I recently started working for GoodAI, an AI company that focuses on AGI research. I was kindly given permission to post about some of the stuff I have been working on here!

So, here is the latest!


The original HTSL was a MPF (memory prediction framework) that used very simple learning rules and architecture to understand its environment. I re-implemented it into GoodAI’s Brain Simulator, a handy tool for running CUDA-based AI simulations with reusable nodes.

The original HTSL was quite limited in that the predictions did not really affect the feature extraction process. So, it tried to “compress” everything as opposed to giving more attention to things that improve predictions. HTSL2 fixes this. In additional, it adds traces that help propagate credit for predictions back in time (as well as rewards when used with reinforcement learning).

Aside from some simple tests like the T-maze and character prediction, one of the more interesting things I was able to do was to take music and store it entirely in the working memory of HTSL2. This doesn’t show much generalization (only when it runs off of its own messy predictions), but it memorizes it quite well. We compared it to LSTM superficially on the same task, but HTSL2 gave better results (a clearer sounding audio that doesn’t decay into noise). To be fair though, HTSL2’s strength is scalability, and the audio task (I used 44100 samples/second) is relatively large. Below is some example music, the original (I don’t own it – it’s from Kevin Macleod) followed by the memorized version’s reconstruction.



Another interesting test is that of having HTSL2 learn to perform simple mathematical operations based on images of numbers. The predictions are still a bit messy right now, but it seems to work at all 🙂 . Below is an image of HTSL2 performing addition (left two digits are the first number, second two digits are the second number, and the last two digits are the result).


31 + 19 = 50!

Evolving a better MPF

MPFs are basically “environment understanders”, which can be used as a world model for reinforcement learning (model-based). They are a theory of how the neocortex operates. Nobody has created a MPF sufficient for AGI thus far. HTSL2 is just a starting point, which will hopefully develop into something truly useful eventually.

What if we could have something that designs a MPF for us? One direction of research I have been looking into is that of using genetic algorithms to create a MPF. So, I created an architecture that is generic except that it must use layers of nodes (“columns”). Everything else, the learning rules and firing dynamics included, is evolved. Right now, I have not gotten it to produce the holy grail of MPFs 😉 . I have only run it up to around generation 200 (several hours), due to lack of processing power. We are looking into ways of parallelizing the search to improve performance. It only needs to find the solution once!

Below is an image of the node (column) controller of one particular run. I made this visualizer to help us figure out why it works (when it does).


An idea for MPF-based “Human-Like” reinforcement learning

Say we have the perfect MPF, and can get perfect environmental predictions. How to use this for reinforcement learning?

I recently came up with what I believe to be a realistic explanation of what might be going on in the human brain. It accounts for Pavlovian conditioning experimental results, such as perceiving reward despite not receiving actual reward due to conditioning.

Say we predict the reward, and the action, along with the state. There are 2 kinds of rewards: external (from some sort of reward interpreter or environment), and internal. The final reward used to update the MPF predictions is then some combination of these two, where the internal reward is weaker (multiplied by some sort of decay factor).

So what is the internal reward? As the name suggests, it is generated internally by the system itself. Say that instead of just predicting the next input with our MPF, we leak in some of our own predictions. This causes a “train of thought”, which I have been able to produce in other experiments as well. Now, we can assign an internal reward for predicted rewards as well: When the agent thinks it is going to get a reward, reward the MPF, even if no external reward is present. This internal reward must be weaker than the external reward such that real rewards always have priority.

The end result is that the agent can think ahead of time what to do in order to maximize reward. This “thinking” is spread out over multiple timesteps as predictions leak in to the input.

I am currently trying to implement this idea with HTSL2. I hope that HTSL2 is good enough for this task. I am excited to see if it works like I think it should!

That’s it for now! Until next time!

Time Series Predictions and More


I have been working on improving the algorithms part of HTFE in a new iteration, which so far is not yet open source, but will be eventually. It is called HTSL, the hierarchical temporal sparse learner. It is a further simplification of the concepts of HTFE which in itself was a simplification of HTM. Despite being the simplest iteration yet, HTSL gets the best results out of all my attempts so far.

So here are some tests I made for HTSL!

Music Generation

I started with the piano rolls dataset from here:

First I just tested the memorization capabilities of HTSL. So, I started by feeding a 16×16 and 12×12 (2 layer) HTSL hierarchy a series of 200 notes. After iterating on it for only 30 iterations, it achieved 100% remembrance. So at the very least, it is capable of extremely fast learning (recurrent neural networks like LSTM require iterations in the thousands).

I then tested a weird feature: Generalization of music to produce unique new music. Basically I ran HTSL on a bunch of songs, randomly choosing which ones, and then activated HTSL into a random state. From there, I fed its own predictions to itself as input, so it would start generating its own music based on experience.

Here is an example of some music it generated:

Animation Prediction

Another test was to feed HTSL video data, and have it predict the next frame of an animation.

For this, I just used some sprite sheets. I started with a simple test that was just a sprite sheet of incrementing numbers. So, here is an image of HTSL doing image-based counting!


Left is the input, right is the prediction.

If I feed it its own predictions, it starts counting on its own!

Mouse Movement Prediction

I prepared a super-simple demo, to show off the online learning capabilities of HTSL. In this demo, HTSL tries to predict where your mouse will be in the next 1/12 of a second. I do this by giving HTSL an input to predict every 5 frames (60 fps), and interpolating the results.

So, you should be able to move your mouse in patterns (say, a circle, or an 8 shape), and it will indicating the future position of your cursor in blue.

You do not have to be perfect with your movements for this to work, it seems to generalize very will to different but similar patterns of movement.

Here is a download link:

How does it work?

Unfortunately I cannot reveal too much yet, sorry 😦 Soon though!

But, know that it is still similar to HTFE, and still is based on SDRs (sparse distributed representations).

One thing I will give away: HTSL uses a new sparse coding mechanism, that to my knowledge learns faster than anything else out there right now. It does not need iterative solving for sparse codes, can handle non-contrasting inputs, and produces great reconstructions. Here is an example:


I am also working on using HTSL for reinforcement learning again. Right now HTSL runs on the CPU, so I will be moving to the GPU eventually, so I can have a go at the Atari arcade learning environment!

Until next time!

A Tutorial on Sparse Distributed Representations (Sparse Codes)


When working on HTFE, I spend a lot of time dealing with how one properly creates SDRs (Sparse Distributed Representations, aka sparse codes) for inputs. I have recently created a semi-new method of generating sparse codes in a very fast manner that, to my knowledge, is biologically plausible.

What are SDRs/sparse codes? SDRs are representations of data where most entries are 0 (sparse). The main reason I find this interesting in particular is how it can be used for efficient feature extraction as well as reducing or eliminating catastrophic interference in reinforcement learning. SDRs are also the way the brain stores information. SDR receptive fields tend to exhibit Gabor-wavelet-like behavior like those found experimentally in animals.

So, how do we go about creating our own SDR learning algorithms? By searching the web, you will find several algorithms which produce similar results. However, many of them are not very biologically plausible or are difficult to reproduce. So, I present an algorithm I came up with after researching this topic for a while.

The algorithm consists of a population of nodes which fire when their input is similar to their receptive field. Then, to induce sparsity, these nodes inhibit each other through learned lateral connections. This means that the more often two nodes fire together, the stronger their inhibitory links get. This results in a separation of nodes that respond to similar inputs (decorrelation).

I will try to approach this tutorial from a programming perspective, as I feel it makes it easier to understand what is going on. I will use Python with numpy, as it is relatively easy to understand.

Let’s start by defining some data structures:

self.weights = np.random.rand(sdrSize, numInputs) * (maxWeight - minWeight) + minWeight
self.biases = np.random.rand(sdrSize) * (maxWeight - minWeight) + minWeight
self.activations = np.zeros(sdrSize)
self.inhibition = np.zeros((sdrSize, sdrSize))

Here, weights represents the receptive field. It is a matrix where each row represents a node, and each column represents its weights. The biases act as a sort of threshold weight, it is an array the size of the number of nodes. Activations is also an array the size of the number of nodes, representing the firings of the nodes before inhibition. Finally, inhibition is a matrix of weights that represents the inhibitory connections between nodes (where rows are the current node and columns are the surrounding nodes).

In the above code snippet, the weights and biases are initialized randomly and everything else is set to zero.

To obtain the SDR from these parameters, we do the following:

sums =, inputs.T) + self.biases

self.activations = np.maximum(0.0, sums)

sdr = self.activations -, self.activations)

for i in range(0, self.activations.size):
    if sdr[i] > 0.0:
        sdr[i] = 1.0
        sdr[i] = 0.0

First off, let me say I am not a master Python coder 🙂 Also, I do realize that this code is quite inefficient. But it should work for explanatory purposes.

So, we start by computing sums, which is the integration of each node’s inputs. Then, we apply the activation function to get activations, in this case a simply linear rectifier (max(0, value)).

From there, we get our SDR by inhibiting the activations using the inhibition weights, and then setting each SDR entry to be either 0 or 1.

This completes the SDR generation step. To actually learn proper SDRs, we need to add some additional code:

recon =, sdr.T)

error = inputs - recon

for i in range(0, self.weights.shape[0]):
    lscf = sparsity - sdr[i]

    self.weights[i] += alpha * sdr[i] * error

    self.biases[i] += gamma * lscf

    self.inhibition[i] += beta * lscf * self.activations[i] * self.activations

    self.inhibition[i] = np.maximum(0.0, self.inhibition[i])

    self.inhibition[i][i] = 0.0

Here, alpha, beta, and gamma are learning rates.

We start by getting a reconstruction of the SDR. This is basically the information the system “thinks” it is seeing. What we want to do is get the reconstruction to match the actual input as closely as possible, while forming unique SDRs for unique inputs.

Once we have the reconstruction, we proceed to find the error between the inputs and the reconstruction. Then, we go through each node, and first determine the lscf (lifetime sparsity correction factor). This is a measurement of how close the SDR is to the target sparsity.

The weights are then updated through a simple Hebbian learning rule in a way that reduces reconstruction error. The biases are then updated based only on the lscf, since the biases control the relative sparsity of the node.

Finally, the inhibition weights are updated through another Hebbian-like learning rule. This time, it is a measurement of the correlation between the firings of the current node and the nodes around it. This is then modulated by the lscf to achieve the desired target sparsity. The inhibition weights are kept above 0, so that the remain exclusively inhibitory and not excitatory. Finally, we zero out the inhibitory connection a node has to itself, since it should not be able to inhibit itself.

This is the basic algorithm, I am sure many upgrades exist. The full source code is available below (slightly different from what was shown above, but identical in functionality):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SparseAutoencoder(object):
    """Sparse Autoencoder"""

    weights = np.matrix([[]])
    biases = np.array([])
    activations = np.array([])
    inhibition = np.matrix([[]])

    def __init__(self, numInputs, sdrSize, sparsity, minWeight, maxWeight):
        self.weights = np.random.rand(sdrSize, numInputs) * (maxWeight - minWeight) + minWeight
        self.biases = np.random.rand(sdrSize) * (maxWeight - minWeight) + minWeight
        self.activations = np.zeros(sdrSize)
        self.inhibition = np.zeros((sdrSize, sdrSize))
    def generate(self, inputs, sparsity, dutyCycleDecay):
        sums =, inputs.T) + self.biases

        self.activations = np.maximum(0.0, sums)

        sdr = self.activations -, self.activations)

        for i in range(0, self.activations.size):
            if sdr[i] > 0.0:
                sdr[i] = 1.0
                sdr[i] = 0.0
        return sdr

    def learn(self, inputs, sdr, sdrPrev, sparsity, alpha, beta, gamma, decay):
        recon = self.reconstruct(sdr)

        error = inputs - recon

        for i in range(0, self.weights.shape[0]):
            lscf = sparsity - sdr[i]

            learn = sdr[i]

            self.weights[i] += alpha * learn * error

            self.biases[i] += gamma * lscf

            self.inhibition[i] += beta * lscf * self.activations[i] * self.activations

            self.inhibition[i] = np.maximum(0.0, self.inhibition[i])

            self.inhibition[i][i] = 0.0
    def reconstruct(self, sdr):
        return, sdr.T)

So, does this actually work? Is it worth even testing out for yourself? Well, let me give you some test results!

I constructed a network of 256 nodes, with 256 inputs. I then trained it by randomly sampling small regions of an image. Afterwards, I ran a convolutional reconstruction on the image, where it had to form SDRs from the image and then reconstruct the image from those SDRs. It is convolutional in the sense that I did it by moving a small window across the image, row by row, and summing the resulting reconstructions.

Here is the image I used for training:


Here is an example of some features I was able to learn:


If you are familiar with sparse coding literature, you will recognize the wave-like receptive fields immediately 🙂

Here is an example SDR created from an image by sampling it in blocks (convolution where the stride equals the size of the receptive field):


Here is a convolutional reconstruction of the image, with a stride of 4 (I got impatient 🙂 ):


And finally, I leave you with an animated gif of the formation of the receptive fields over time:


Until next time!

My Attempt at Outperforming Deepmind’s Atari Results – UPDATE 13

Hello again!

I finally got my HTFERL agent churning at the Atari game “Breakout”. It isn’t at DeepMind levels of proficiency yet, but it is much better than random 😉

I trained it for about 15 minutes, and it quite obviously follows the ball around. It usually gets the first and second bounce after these 15 minutes of training. Before I run it overnight, though, I want to still do some performance improvements to the system so that I don’t have to wait so long.

Picture of it playing Breakout after 15 minutes:


I would have made a video, but since the ALE uses software rendering with SDL I cannot use Fraps on it. I will look for some other recording method for a video soon though.

At this point this is more a proof of concept, but I am sure that eventually I will get somewhere within DeepMind territory of game proficiency. The system has come a long way, and I am quite happy with where it is going right now.

For those who haven’t been following this blog, here is a quick summary of how the system works and what does differently from DeepMind’s system.

HTFERL (hierarchical temporal free energy reinforcement learner) is a derivative of HTM (hierarchical temporal memory). It tries to mimic the algorithmic properties of the human neocortex. It can be thought of as a stack of predictive k-sparse autoencoders with recurrent lateral connections and feed back connections (top-down). Information doesn’t only flow upwards, but also can flow in the reverse direction if it helps with prediction.

Unlike DeepMind’s system, HTFERL doesn’t need experience replay, or any form of stochastic sampling. It is 100% online and real-time, and runs nicely at 60 fps with 400000 neurons on my R290. DeepMind’s system was limited to very few discrete actions (due to the poor complexity of “one hot” vectors), HTFERL can handle hundreds of thousands of continuous actions. HTFERL also incorporates eligibility traces, and has the ability to remember things indefinitely (DeepMind’s system was limited to a few seconds fixed history window). Furthermore, HTFERL has a feature known as temporal pooling, which allows it to group similar events into conceptual “bins” over time.

Since the system runs so fast, I was able to use the full-resolution video feed of the Atari games and able to stay capped at 60 fps. DeepMind had to crop and downsample.

So while it isn’t at DeepMind levels of game proficiency yet (but I think it will get there), this approach has some advantages.

I will push the ALE repository to GitHub soon. If you are just interested in HTFERL, the repository for it is available here:

If you want to experiment with HTFE (without the RL part), I recommend using the Python bindings here:

Until next time!



While working on reinforcement learning with HTFERL, I tried using the algorithm for some other things as well. I tried using it for some natural language processing, with which I have no experience. So here is what happened…

I decided to start out with a single layer of HTFERL for predicting words ahead of time. The algorithm my brother (also interested in NLP!) and I came up with works like this:

  • If the word has never been seen before (not part of some dictionary D), create a new entry in D for this word, and assign it the current predicted word vector as the feature.
  • If the word has been seen before (it already exists in D), update the prediction to match the feature vector of this word.

So the layer of HTFERL goes through the sentence, word by word (or some other tokenizing method), and automatically starts assigning word vectors (features) to words it doesn’t know while keeping predictions up-to-date on words it does know.

This may seem very similar to word2vec, that’s because it is. The features generated by this process describe words by their grammatical properties, without actually knowing what the words mean. Just like word2vec, similar word vectors are similar in meaning, and just like word2vec it is possible to perform arithmetic on the word vectors.

So what makes this special compared to word2vec? Well, the word vectors are really only a side-effect of the technique. The interesting part is when we start using the system to understand sentences.

As the HTFERL layer parses the text, it builds an internal sparse distributed representation (SDR) of the text as whole. It learns whatever is necessary to predict the next word in the sentence, so we can be sure that the SDR contains very complete information about the meaning of the sentence.

From here we can either use the SDRs HTFERL generates as input to a classifier or some other system. Alternatively, it is also possible to use the text predictions for something useful stand-alone.

I have developed and tested a system that predicts what you are about to type based on the current typing pattern and the history of what you typed. I am currently developing a Visual Studio plugin that uses this system as a form of smart code completion.

Another interesting test I did is using the system for sentence generation. If you feed the predicted word back in to the system as input, then it will start generating a sentence. If you perturb the predictions a bit, it will start using different words with similar meanings, and form “random” but still grammatically valid sentences.

The code for Word2SDR (although really it is “text2SDR” 😉 ) is available here:

So there you have it, using a HTM-derivative for NLP!

Until next time!

A new idea: HTFERL


This post is indirectly related to my series of posts on Deepmind-style Atari playing. It is about an idea that I wanted to share for building massive reinforcement learners.

While working on my CHTM implementation, I started to think of ways that it could be simplified. Perhaps to the point where it is no longer biologically plausible even, but while still keeping the functionality intact.

The closest thing I can think of in standard deep learning literature that can do similar things to HTM is the sparse recurrent autoencoder. HTM makes predictions about its own input, similar to a recurrent autoencoder. The sparsity comes in handy to minimize forgetting.

So I started experimenting with some autoencoders, one of which made its way into the spatial pooler of CHTM. Not many papers exist on sparse autoencoders with explicit lateral inhibition, so I played around with some ideas and came up with the following code (Python):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SparseAutoencoder(object):
    """Sparse Autoencoder with Explicit Lateral Inhibition"""

    weights = np.matrix([[]])
    activations = np.array([])
    dutyCycles = np.array([])

    def __init__(self, numInputs, sdrSize, sparsity, minWeight, maxWeight):
        self.weights = np.random.rand(sdrSize, numInputs) * (maxWeight - minWeight) + minWeight
        self.visibleBiases = np.random.rand(numInputs) * (maxWeight - minWeight) + minWeight
        self.hiddenBiases = np.random.rand(sdrSize) * (maxWeight - minWeight) + minWeight
        self.activations = np.zeros_like(self.hiddenBiases)
        self.dutyCycles = np.array([sparsity for i in range(0, sdrSize)])

    def generate(self, inputs, sparsity, dutyCycleDecay):
        localActivity = np.round(len(self.hiddenBiases) * sparsity)

        # Activation
        sums =, inputs.T) + self.hiddenBiases.T

        self.activations = np.array([sigmoid(sums[i]) for i in range(0, len(sums))])

        sdr = np.array([0.0 for i in range(0, len(self.activations))]);

        for i in range(0, len(self.activations)):
            numHigher = 0.0

            for j in range(0, len(self.activations)):
                if sums[j] > sums[i]:
                    numHigher += 1.0

            if numHigher < localActivity:
                sdr[i] = 1.0

        self.dutyCycles = (1.0 - dutyCycleDecay) * self.dutyCycles + dutyCycleDecay * sdr

        return sdr

    def learn(self, inputs, sdr, sparsity, alpha, beta):
        # Reconstruct
        recon = self.reconstruct(sdr)

        errors = inputs - recon

        hiddenErrors =, errors) / len(errors) * self.activations * (1.0 - self.activations)

        self.weights += alpha * 0.5 * (, np.matrix(sdr)).T +, np.matrix(inputs)))# + beta *[sparsity for i in range(0, len(self.dutyCycles))]) - self.dutyCycles)).T, np.matrix(inputs))

        self.visibleBiases += alpha * errors
        self.hiddenBiases += beta * (np.array([sparsity for i in range(0, len(self.dutyCycles))]) - self.dutyCycles)

        return recon

    def reconstruct(self, sdr):
        return, sdr.T) + self.visibleBiases.T

This autoencoder has been tested on some image reconstruction tasks, where it functions properly.

Now we essentially have a simplified spatial pooler. Make it not-fully-connected, and you can make large layers of it for local SDRs. But how can we encode temporal information as well, such that we instead of reconstructing the current input we reconstruct the input at _t+1_?

One possible solution, which is mostly still just an idea at the moment, is to make recurrent connections at the SDR level. That is, connections between hidden nodes of the autoencoder. Hopefully this will allow us to predict one step ahead of time.

So assuming that that works as intended, how can we scale this up? Naturally we would want to have some sort of layering architecture. We can stack these autoencoders as is standard in deep learning. But how do we make predictions from the autoencoders take information from the above layers into account? In the end we want to predict the input at the lowest layer only.

My proposal is to add recurrent connections to the next layer as well. So first we do an upwards pass on the input so that the hidden representations (SDRs) are formed, and then we go back down and start reconstructing the SDRs using information from previous layers.

If this plan works out, we would have a highly scalable way of predicting a sequence one step ahead of time. From that point, we can apply reinforcement learning in a way similar to CHTM, by only learning to predict when the temporal difference errors is positive. Q values can be stored as a sort of free energy similarly to the way CHTM does it as well.

If all this comes to pass, then the result would have temporal pooling, since the SDRs for spatial and temporal data are shared. This should allow the prediction of very large sequences efficiently.

I am excited to code this concept, here is the repository I have started working on: HTFERL

HTFERL stands for hierarchical temporal free-energy reinforcement learner.

If you are interested in joining this project, let me know! I could use a hand 🙂

Until next time!

My Attempt at Outperforming Deepmind’s Atari Results – UPDATE 12


It has been some time since I last posted here. Since my last post, many changes have occurred to my CHTM architecture.

First off, the reinforcement learning algorithm is now based on CACLA (continuous actor-critic learning automaton), but the actor and critic are part of the same structure. The temporal prediction system only learns to predict the last action when the temporal difference error is positive. Read more about CACLA here:

Second, the Q values are now stored within the HTM cells. Since each cell *should* only be active in a unique state (taking partial observability into account), this means that each cell can effectively store the Q value for that state. This is both more biologically plausible (I guess…), and far more efficient. It can now be thought of as a “smart, self-adapting” look-up table.

Third, cells can now predict far more accurately with the addition of dendrite segments. In standard machine learning terms, this is a small perceptron with one hidden layer and an output layer where all weights are 1 (OR operation). This allows the cell to differentiate more complex patterns, limited only by the number of segments (hidden units).

Fourth, the spatial pooler has been reworked several times, and is now essentially a sparse autoencoder with explicit lateral inhibition. It still uses boosting to help out “dead” columns.

So, what about results? Well, I am not quite satisfied with the results yet, but they have been improving a lot. I am performing pole balancing with a twist: It uses vision data (64 x 32 pixels). I made a little plotter to plot my  reward values, and this is one of the more recent runs:


The system now looks like this when visualized with the volumetric renderer:


You may notice that the SDRs are no longer uniform across the layers. The new spatial pooler leaves columns off when all the inputs are 0. I am mostly doing this for debugging purposes at the moment, since it makes it easier to see what is going on.

So that’s it for now, I will hopefully have more interesting results soon, perhaps with a video. I want to start giving updates regularly again as well!

Continuous HTM – Multiple Layers and Reinforcement Learning

Hello again!

In this post I want to cover how to create a hierarchy of the layers described in my previous posts, as well as how to use it for reinforcement learning!

The previous posts are available here:

Forming Hierarchical Predictions

Continuing from where I left off, we now have layers that produce sparse distributed representations of their input and then predict that input one step ahead of time.

Adding a hierarchy allows predictions to traverse longer distances along the layers. The basic idea is this: Create a hierarchy where the higher layers are typically smaller than lower ones. Then form SDRs in a bottom-up order, such that each layer forms a SDR of the layer below it.

In the bottom-pass we only form SDRs, we don’t form predictions yet. This happens in the next top-down pass.

To get less local predictions at lower layers, we can then perform a top-down pass where each layer gets an additional “temporary” context cell whose state is equivalent to the prediction of the next higher layer at the same coordinates. This temporary cell is treated like all other cells, and the other cells must maintain connections to it that are learned as usual.

In summary: We ascend the hierarchy by forming SDRs, and then descend again and form predictions where each layer gets additional context from the previous layer.

Reinforcement Learning, a Quick Summary

In reinforcement learning, the AI or “agent” seeks to maximize a reward signal. This is similar to how animals and humans learn. Since HTM is inspired by biology, it seems natural to try to use HTM for reinforcement learning as well.

Here we will use temporal difference (TD) learning. More specifically, we will use Q learning. As a result we must somehow maintain a Q function that represents the “quality” or “desirability” of each state. If we can somehow use HTM as a function approximator, we can simply store and update Q values using that.

Recall the Q learning equation (I “stole” this image from wikipedia):

The portion of the equation after and including the learning rate is the temporal difference error, which is usually backpropagated as an error value in standard neural networks to update the Q function.

Function Approximation with HTM

As stated above, we need a function approximator to model the Q function. HTM isn’t well know for use in function approximation, but there is a way of doing so, and it is quite effective! Not only do we get to approximate the Q value for each observable state, but it is trivial to also include the context available from the context cells. So, we can essentially reproduce the results of standard recurrent neural network architectures like LSTM, and can do so without the problem of forgetting previously learned values!

This sounds great and all, but how do we do it?

The secret is as follows: Make another standard neural network pass through all cells/columns and use standard backpropagation techniques to update it.

It is possible to pass just through columns if you don’t care about contextual information, like so:


However, for the most effective reinforcement learning we want each cell to connect to all cells in the previous layer within a receptive radius so that we also have contextual information. So a single column is connected fully to a column of the previous layer like so:


For those of you accustomed to deep learning, you may be wondering why the HTM portion even exists if it only provides pathways for a standard network. You may also be wondering how to update such a monstrosity with backpropagation.

Recall that a layer of HTM is very sparse, both in terms of column and cell states. So, consider that we only activate and train portions of this standard network only if they pass through an active cell/column. Suddenly, training becomes far more manageable, since only very small amounts of the network are active at a time. It also becomes incredibly efficient, since we can effectively ignore cells/columns and their sub-network when they have a very small or equal to 0 state value.

Some of you may notice that this is astoundingly similar to the dropout technique used in standard neural networks. Indeed it is, and also comes with the nice regularization benefits.

Here is an example of a cross-section showing a SDR-routed subnetwork:


Furthermore, this has the benefit of only activating certain subnetworks in certain situations while taking context into account!

So, we can train this as our Q function now using standard backpropagation! The only difference is that in the forward pass we multiply the output by the HTM cell/column state, and in the backward pass the derivative changes a bit as a result.

Action Generation

The Q function is now done, but then how do we get actions from it now?

So here it gets kind of weird. There are several alternative solutions to getting actions from a Q function, but I found that simply backpropagating very high Q values all the way to the inputs and then moving those inputs along the error results in an efficient hierarchical action selection scheme.

So when defining the input to the system (input to the bottom-most layer), we need at least 2 different kinds of inputs: State, and action. Here is an example of how they may be organized:


Where green values are state values and blue values are action values.

The scenario above could for instance be an image presented as a state and 2 action outputs.

So what we do is backpropagate a high positive error value all the way to the action inputs, and the move the action inputs along the error to get an action with a higher Q value.

From there, we can apply some exploration (random perturbation), and ta da, we now have actions!

The number of iterations necessary to produce the actions needs to be at least 1, but it doesn’t need to be much larger than 1 either. Since the action is also presented as input to the HTM, we can use the HTM’s predictions to get a starting point for the action backpropagation process. So we can remember “where we left off” and continue the action search from there the next time we enter the same (or similar) state.


I am currently still in the process of fine-tuning these algorithms (the difficulty is in the details), but what I have presented is the overarching idea. I hope it made sense!

If you would like to follow the development of my GPU implementation of this theory, here is the GitHub repository:

Have any ideas for improvements to the system? Let me know in the comments!

Continuous Hierarchical Temporal Memory Temporal Inference

Hello again! Time for part 2 of my description of the latest CHTM! In this post I will discuss the temporal inference portion of the algorithm. Continuing from where I left off in the last post, we now have a sparse distributed representation of a layer’s input (represented by columns with high state values). Remember that I will use the notation _(t-1) to indicate values from the previous timestep. We now want to activate the cells within the column:

void cellActivate(Layer l, float cellIntensity) {
    foreach (Column col in l.columns) {
        float minPredictionError = 1;

        foreach (Cell cell in col.cells) {
            float predictionError = abs(col.state - cell.prediction_(t-1));

            minPredictionError = min(minPredictionError, predictionError);

        foreach (Cell cell in col.cells) {
            float predictionError = abs(col.state - cell.prediction_(t-1));
            cell.state = exp((minPredictionError - predictionError) * cellIntensity) * col.state;

Here we are running a competitive process among the cells to activate the one that predicted the column the best and deactivate the others. This process forms a context for future predictions, whose predictions will then again be used to form new contexts, and so on. Next, we can form new predictions for each cell:

void cellPredict(Layer l, float predictionIntensity, Radius cellRadius) {
    foreach (Column col in l.columns) {
        foreach (Cell cell in col.cells) {
            float sum = 0;

            foreach (Connection con in cell.lateralConnections)
                sum += con.weight * cells[con.connectionIndex].state;

            cell.prediction = max(0, sigmoid(sum * predictionIntensity) * 2 - 1);

Here we are treating each cell as a perceptron with connections to all cells within a radius (including cells in the same column as well as itself, but this is optional). The activation function is a sigmoid scaled into the [-1, 1] range, and then clamped to always be positive. This change allows us to not include a bias unit and still get accurate predictions. From these cell predictions we derive a prediction for the entire column:

void columnPredict(Layer l) {
    foreach (Column col in l.columns) {
        float maxPrediction = 0;

        foreach (Cell cell in col.cells)
            maxPrediction = max(maxPrediction, cell.prediction);

        col.prediction = maxPrediction;

The output of a column is simply the maximum cell prediction. Finally, we update the cell weights using a simple perceptron learning rule:

void learnCells(Layer l, float learningRate) {
    foreach (Column col in l.columns) {
        float error = learningRate * (col.state - col.prediction_(t-1));

        foreach (Connection con in cell.lateralConnections)
            con.weight += error * cells[con.connectionIndex].state_(t-1);

The error for all cells in a column is the same: It is the difference between what we predicted for this column the last timestep and what we actually got. That’s the basic algorithm for a single layer of CHTM! In the next post I will discuss how to use multiple layers to make more accurate predictions! Until then!