Hello!
This post is indirectly related to my series of posts on DeepMind-style Atari playing. It is about an idea I wanted to share for building massive reinforcement learners.
While working on my CHTM implementation, I started to think of ways it could be simplified, perhaps to the point where it is no longer even biologically plausible, while still keeping the functionality intact.
The closest thing I can think of in the standard deep learning literature that does similar things to HTM is the sparse recurrent autoencoder. HTM makes predictions about its own input, similar to a recurrent autoencoder. The sparsity comes in handy for minimizing forgetting.
So I started experimenting with some autoencoders, one of which made its way into the spatial pooler of CHTM. Not many papers exist on sparse autoencoders with explicit lateral inhibition, so I played around with some ideas and came up with the following code (Python):
```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class SparseAutoencoder(object):
    """Sparse autoencoder with explicit lateral inhibition."""

    def __init__(self, numInputs, sdrSize, sparsity, minWeight, maxWeight):
        # Uniform random initialization in [minWeight, maxWeight)
        self.weights = np.random.rand(sdrSize, numInputs) * (maxWeight - minWeight) + minWeight
        self.visibleBiases = np.random.rand(numInputs) * (maxWeight - minWeight) + minWeight
        self.hiddenBiases = np.random.rand(sdrSize) * (maxWeight - minWeight) + minWeight
        self.activations = np.zeros(sdrSize)
        # Running estimate of how often each hidden unit fires
        self.dutyCycles = np.full(sdrSize, sparsity)

    def generate(self, inputs, sparsity, dutyCycleDecay):
        # Number of hidden units allowed to be active at once
        localActivity = int(round(len(self.hiddenBiases) * sparsity))

        # Activation
        sums = np.dot(self.weights, inputs) + self.hiddenBiases
        self.activations = sigmoid(sums)

        # Explicit lateral inhibition: a unit only enters the SDR if fewer
        # than localActivity units have a higher activation sum
        sdr = np.zeros(len(sums))
        for i in range(len(sums)):
            numHigher = np.sum(sums > sums[i])
            if numHigher < localActivity:
                sdr[i] = 1.0

        # Decay the duty cycles toward the current SDR
        self.dutyCycles = (1.0 - dutyCycleDecay) * self.dutyCycles + dutyCycleDecay * sdr

        return sdr

    def learn(self, inputs, sdr, sparsity, alpha, beta):
        # Reconstruct the input from the SDR and compute the errors
        recon = self.reconstruct(sdr)
        errors = inputs - recon
        hiddenErrors = (np.dot(self.weights, errors) / len(errors)
                        * self.activations * (1.0 - self.activations))

        # Average of the tied-weights reconstruction gradient and the
        # backpropagated hidden error gradient
        self.weights += alpha * 0.5 * (np.outer(sdr, errors) + np.outer(hiddenErrors, inputs))
        self.visibleBiases += alpha * errors

        # Nudge hidden biases so each unit's duty cycle approaches the target sparsity
        self.hiddenBiases += beta * (sparsity - self.dutyCycles)

        return recon

    def reconstruct(self, sdr):
        return np.dot(self.weights.T, sdr) + self.visibleBiases
```
This autoencoder has been tested on some image reconstruction tasks, where it functions properly.
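For anyone who wants to try it, here is a minimal sketch of how it can be driven; the hyperparameters below are illustrative placeholders, not tuned values:

```python
import numpy as np

# Illustrative hyperparameters only -- not tuned values
numInputs, sdrSize, sparsity = 64, 128, 0.05

ae = SparseAutoencoder(numInputs, sdrSize, sparsity, minWeight=-0.1, maxWeight=0.1)

for step in range(1000):
    inputs = np.random.rand(numInputs)  # stand-in for e.g. a flattened image patch
    sdr = ae.generate(inputs, sparsity, dutyCycleDecay=0.01)
    recon = ae.learn(inputs, sdr, sparsity, alpha=0.1, beta=0.01)

print("final reconstruction error:", np.mean((inputs - recon) ** 2))
```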
Now we essentially have a simplified spatial pooler. Make it not-fully-connected, and you can make large layers of it for local SDRs. But how can we encode temporal information as well, such that instead of reconstructing the current input, we reconstruct the input at _t+1_?
One possible solution, which is mostly still just an idea at the moment, is to make recurrent connections at the SDR level. That is, connections between hidden nodes of the autoencoder. Hopefully this will allow us to predict one step ahead of time.
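As a rough sketch of what I mean, reusing the SparseAutoencoder from above: instead of literal hidden-to-hidden weights, the previous SDR is simply appended to the visible layer, which gives the hidden units access to the previous timestep through the ordinary weight matrix. The class name and this simplification are mine for illustration only.

```python
import numpy as np

class RecurrentSparseAutoencoder(object):
    def __init__(self, numInputs, sdrSize, sparsity, minWeight, maxWeight):
        # The visible layer is the current input plus the previous SDR
        self.ae = SparseAutoencoder(numInputs + sdrSize, sdrSize,
                                    sparsity, minWeight, maxWeight)
        self.prevSdr = np.zeros(sdrSize)

    def step(self, inputs, sparsity, dutyCycleDecay, alpha, beta):
        # Encode (current input, previous SDR) and learn to reconstruct both
        combined = np.concatenate((inputs, self.prevSdr))
        sdr = self.ae.generate(combined, sparsity, dutyCycleDecay)
        self.ae.learn(combined, sdr, sparsity, alpha, beta)
        self.prevSdr = sdr
        return sdr
```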
So, assuming that works as intended, how can we scale this up? Naturally, we would want some sort of layered architecture. We can stack these autoencoders, as is standard in deep learning. But how do we make the autoencoders' predictions take information from the layers above into account? In the end, we want to predict the input at the lowest layer only.
My proposal is to add recurrent connections to the next layer as well. So first we do an upwards pass on the input so that the hidden representations (SDRs) are formed, and then we go back down and start reconstructing the SDRs using information from previous layers.
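To make the two-pass idea concrete, here is a rough sketch reusing the SparseAutoencoder from above. Each layer sees the SDR from below, its own SDR from the last timestep, and the SDR of the layer above; the function names and exact wiring are just illustrative, not settled code:

```python
import numpy as np

def build(sizes, sparsity, minW, maxW):
    # sizes = [inputSize, layer1SdrSize, layer2SdrSize, ...]
    layers = []
    for l in range(1, len(sizes)):
        aboveSize = sizes[l + 1] if l + 1 < len(sizes) else 0
        # visible = input from below + own previous SDR + SDR from above
        layers.append(SparseAutoencoder(sizes[l - 1] + sizes[l] + aboveSize,
                                        sizes[l], sparsity, minW, maxW))
    return layers

def step(layers, prevSdrs, inputs, sparsity, decay, alpha, beta):
    # prevSdrs starts as [np.zeros(s) for s in sizes[1:]]
    # Upward pass: form SDRs bottom-to-top, using last timestep's SDRs as feedback
    sdrs = []
    below = inputs
    for l in range(len(layers)):
        above = prevSdrs[l + 1] if l + 1 < len(layers) else np.zeros(0)
        vis = np.concatenate((below, prevSdrs[l], above))
        sdrs.append(layers[l].generate(vis, sparsity, decay))
        below = sdrs[l]

    # Downward pass: re-encode and learn top-to-bottom with fresh feedback
    for l in reversed(range(len(layers))):
        below = inputs if l == 0 else sdrs[l - 1]
        above = sdrs[l + 1] if l + 1 < len(layers) else np.zeros(0)
        vis = np.concatenate((below, prevSdrs[l], above))
        sdrs[l] = layers[l].generate(vis, sparsity, decay)
        layers[l].learn(vis, sdrs[l], sparsity, alpha, beta)

    return sdrs  # use as prevSdrs on the next timestep
```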
If this plan works out, we would have a highly scalable way of predicting a sequence one step ahead of time. From that point, we can apply reinforcement learning in a way similar to CHTM, by only learning to predict when the temporal difference error is positive. Q values can be stored as a sort of free energy as well, similar to the way CHTM does it.
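Schematically, the gating might look like the following; this is a sketch only, and the quantities `q`, `qNext`, `reward` and the learning call are placeholders:

```python
def tdGatedLearn(ae, vis, sdr, reward, q, qNext, gamma, sparsity, alpha, beta):
    # Gate learning on the temporal difference error: only reinforce the
    # prediction when the transition turned out better than expected.
    # All parameter names here are placeholders for illustration.
    tdError = reward + gamma * qNext - q
    if tdError > 0.0:
        ae.learn(vis, sdr, sparsity, alpha, beta)
    return tdError
```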
If all this comes to pass, the result would have temporal pooling, since the SDRs for spatial and temporal data are shared. This should allow very long sequences to be predicted efficiently.
I am excited to code this concept; here is the repository I have started working on: HTFERL
HTFERL stands for hierarchical temporal free-energy reinforcement learner.
If you are interested in joining this project, let me know! I could use a hand 🙂
Until next time!
Hi, it is very interesting to read your thoughts about viewing HTM as a recurrent deep learning network. I had this idea just recently, when I realized that SDRs could be constructed by autoencoders using regularization (probably L1), the active and predictive states can be reformulated using recurrent connections between consecutive time steps, the distal dendrite connections can be thought of as the recurrent connections in the hidden layers, and backpropagation could be minimizing the difference between the current spatial representation and the predictive state (i.e. the recurrent prediction of the last time step…).

The question of how you stack these layers is not so clear to me. Do you have Python code for how the activation, learning, and reconstruction work with layers as well? Or could you explain in a little more detail how exactly this works? (I'm not so good at reading C++/OpenCL code 😦 )

Also, one other idea that I like is slow feature analysis, especially the incremental version of it. I think SFA could be a very good candidate for solving the real temporal pooling task (i.e. having stable temporal representations going up in the hierarchy). One could combine the SFA optimization constraints with the autoencoder reconstruction constraints fairly easily, although the details could be more complicated… What do you think?
Hello!
I unfortunately do not have Python code for layering; it's all in C++. I have a Python interface to it (it's called pyhtfe), but all the important bits are in C++. However, the idea behind the layering isn't very complicated: you just include the state of the next layer as part of the input to the current layer's autoencoder. It then attempts to reconstruct its own input, that of the previous timestep, and that of the next layer. This way it learns a compressed representation of its input, its past self, and the next layer.
This is how it works in the latest version of HTFERL: First, an upwards pass is performed on the layers, in a way similar to a regular stack of autoencoders. This extracts spatial features. Then, a downwards pass is performed, where a second autoencoder (2 per layer) forms spatio-temporal features.
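Since I don't have real Python code for this, here is only a bare-bones sketch of what one layer might look like; the names and exact wiring are mine for illustration, and the actual HTFERL code differs:

```python
import numpy as np

class Layer(object):
    def __init__(self, inputSize, sdrSize, aboveSize, sparsity, minW, maxW):
        # Two autoencoders per layer: spatial (upward pass) and
        # spatio-temporal (downward pass). Pass aboveSize=0 for the top layer.
        self.spatial = SparseAutoencoder(inputSize, sdrSize, sparsity, minW, maxW)
        self.temporal = SparseAutoencoder(sdrSize + sdrSize + aboveSize,
                                          sdrSize, sparsity, minW, maxW)
        self.prevTemporalSdr = np.zeros(sdrSize)

    def up(self, inputs, sparsity, decay, alpha, beta):
        # Upward pass: extract spatial features only
        sdr = self.spatial.generate(inputs, sparsity, decay)
        self.spatial.learn(inputs, sdr, sparsity, alpha, beta)
        return sdr

    def down(self, spatialSdr, aboveSdr, sparsity, decay, alpha, beta):
        # Downward pass: spatio-temporal features from the current spatial
        # SDR, this layer's own past SDR, and feedback from the layer above
        # (use np.zeros(0) as aboveSdr at the top layer)
        vis = np.concatenate((spatialSdr, self.prevTemporalSdr, aboveSdr))
        sdr = self.temporal.generate(vis, sparsity, decay)
        self.temporal.learn(vis, sdr, sparsity, alpha, beta)
        self.prevTemporalSdr = sdr
        return sdr
```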
As for SFA, I don’t know much about it, I will look into it though!
I hope this made some sense 🙂
Yes, this makes sense. Still, there are a couple of options for how to combine the states of different layers and time steps. The bottom-up forward pass is obvious: at each layer I reconstruct the input (x) and update the weights accordingly (basically minimizing WW^T*x - x), so we will have activations (a) of the hidden layers at each layer. We assume that at each layer we have another AE that tries to learn temporal patterns. On each layer this AE has connections between the hidden layers in consecutive time steps. Let's denote the weights of these recurrent connections by R; then at each layer I reconstruct the spatial activation a by minimizing RR^T*a - a and thus update R accordingly? And the activation R*a will be the prediction for the next time step? Is that correct? Or do I take the prediction of the last time step into the game as well? So if I take the prediction of the last time step (b_{t-1}) and the spatial activation a_{t} at the current time step, how exactly do I use them to make the prediction b_{t+1} for the next time step?
The temporal autoencoder, or rather the spatio-temporal autoencoder (separate from the upwards-pass, spatial-only autoencoder), learns to reconstruct the spatial autoencoder's hidden layer, its own past hidden layer, and the hidden layer of the next higher layer's spatio-temporal autoencoder. So there are 3 things it learns compressed forms of. Then, to make a prediction, another separate set of weights is used, which learns to predict the next timestep's spatial autoencoder hidden states from the current timestep's spatio-temporal autoencoder hidden states. It is simply updated using the normal perceptron learning rule: as soon as the next state arrives, take the error between the prediction and the actual result, and update the weights with it.
So now we can predict the spatial hidden states one step ahead of time. From there, all one must do is reconstruct the input from those predicted hidden states to get the predicted input (of the first layer).
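A bare-bones sketch of those prediction weights, with illustrative names: `stSdr` is the current spatio-temporal SDR, and `spSdrNext` is the spatial SDR that actually appears at the next timestep.

```python
import numpy as np

class Predictor(object):
    def __init__(self, stSize, spSize):
        # Linear readout from spatio-temporal SDR to next spatial SDR
        self.weights = np.zeros((spSize, stSize))

    def predict(self, stSdr):
        return np.dot(self.weights, stSdr)

    def learn(self, stSdr, spSdrNext, prediction, alpha):
        # Perceptron/delta rule: move the weights along the prediction error
        # once the actual next state has arrived
        self.weights += alpha * np.outer(spSdrNext - prediction, stSdr)
```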
Hello, it's a very interesting post. I've written a Python version of your encoder, which has a similar structure; feel free to check it out: https://gitlab.com/kaihoankalina/RecurrentSparseAutoencoder