Text2SDR

Hello!

While working on reinforcement learning with HTFERL, I tried applying the algorithm to some other things as well, including some natural language processing, with which I have no experience. So here is what happened…

I decided to start out with a single layer of HTFERL for predicting words ahead of time. The algorithm my brother (also interested in NLP!) and I came up with works like this (a sketch in code follows the list):

  • If the word has never been seen before (not part of some dictionary D), create a new entry in D for this word, and assign it the current predicted word vector as the feature.
  • If the word has been seen before (it already exists in D), train the layer so that its prediction matches the feature vector stored for this word.
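
Here is a minimal sketch of that loop in Python. The DummyLayer class, its predict()/learn()/update() methods, and the vector size are stand-ins of my own for illustration; they are not the actual HTFERL interface:

```python
import numpy as np

VEC_SIZE = 64  # length of each feature vector; my choice, not from the post

class DummyLayer:
    """Stand-in for an HTFERL layer: a crude recurrent predictor.

    The real HTFERL layer is far more sophisticated; this class exists
    only so the sketch runs end to end.
    """
    def __init__(self, size, seed=0):
        rng = np.random.default_rng(seed)
        self.state = rng.standard_normal(size) * 0.1
        self.weights = rng.standard_normal((size, size)) * 0.01

    def predict(self):
        # The layer's current guess at the next word's feature vector.
        return np.tanh(self.weights @ self.state)

    def learn(self, target, rate=0.05):
        # Nudge future predictions toward the target feature vector.
        error = target - self.predict()
        self.weights += rate * np.outer(error, self.state)

    def update(self, feature):
        # Feed a word's feature vector in as the next input.
        self.state = feature

def build_dictionary(tokens, layer):
    """The word-vector assignment loop from the bullet list above."""
    D = {}  # dictionary: word -> feature vector
    for word in tokens:
        predicted = layer.predict()
        if word not in D:
            # Unseen word: assign the current prediction as its feature.
            D[word] = predicted.copy()
        else:
            # Seen word: train the layer to predict this word's feature.
            layer.learn(D[word])
        layer.update(D[word])
    return D

tokens = "the cat sat on the mat because the cat was tired".split()
D = build_dictionary(tokens, DummyLayer(VEC_SIZE))
print(len(D), "unique words assigned vectors")
```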

So the layer of HTFERL goes through the sentence, word by word (or using some other tokenization method), automatically assigning word vectors (features) to words it doesn’t know while keeping its predictions up to date on words it does know.

This may seem very similar to word2vec; that’s because it is. The features generated by this process describe words by their grammatical properties, without the system actually knowing what the words mean. Just like with word2vec, words with similar vectors are similar in meaning, and just like with word2vec, it is possible to perform arithmetic on the word vectors.
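
For instance, once the dictionary D from the sketch above has been built, the usual analogy arithmetic can be tried with a nearest-neighbor lookup. The king/man/woman triple is the classic word2vec example, not a result I am claiming for this system:

```python
import numpy as np

def nearest(query, D, exclude=()):
    """Return the word in D whose vector has the highest cosine
    similarity to the query vector."""
    best_word, best_sim = None, -np.inf
    for word, vec in D.items():
        if word in exclude:
            continue
        sim = (vec @ query) / (np.linalg.norm(vec) * np.linalg.norm(query) + 1e-8)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Classic analogy: king - man + woman ~ queen (assuming the words are in D).
# result = nearest(D["king"] - D["man"] + D["woman"], D,
#                  exclude=("king", "man", "woman"))
```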

So what makes this special compared to word2vec? Well, the word vectors are really only a side-effect of the technique. The interesting part is when we start using the system to understand sentences.

As the HTFERL layer parses the text, it builds an internal sparse distributed representation (SDR) of the text as a whole. Since it learns whatever is necessary to predict the next word in the sentence, the SDR ends up containing fairly complete information about the meaning of the sentence.

From here we can use the SDRs HTFERL generates as input to a classifier or some other system. Alternatively, the text predictions themselves can be put to use stand-alone.
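
As a sketch of the first option, reusing the stand-in layer from above: run a sentence through the layer and use the resulting internal state as a fixed-length feature vector. In the real system this state would be a sparse binary SDR; the dense state of the dummy layer merely stands in for it.

```python
def sentence_to_features(tokens, layer, D):
    """Run a sentence through the layer and return its final internal
    state as a fixed-length feature vector for a downstream classifier."""
    for word in tokens:
        if word in D:  # skip words the dictionary has never seen
            layer.update(D[word])
    return layer.state.copy()

# e.g. build a feature matrix for any off-the-shelf classifier:
# X = [sentence_to_features(s.split(), DummyLayer(VEC_SIZE), D)
#      for s in sentences]
# then fit the classifier on (X, labels) as usual.
```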

I have developed and tested a system that predicts what you are about to type based on your current typing pattern and the history of what you have typed. I am currently developing a Visual Studio plugin that uses this system as a form of smart code completion.

Another interesting test I did was sentence generation. If you feed the predicted word back into the system as input, it will start generating a sentence. If you perturb the predictions a bit, it will start using different words with similar meanings, forming “random” but still grammatically valid sentences.
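
A sketch of that feedback loop, reusing nearest() and the stand-in layer from above: take the layer’s prediction, perturb it with a little noise, snap it to the closest known word, and feed that word’s vector back in as the next input.

```python
def generate(layer, D, length=10, noise=0.1, seed=0):
    """Generate text by feeding predictions back in as input.

    'noise' perturbs each prediction so the nearest-word lookup can land
    on different words with similar vectors, giving varied but still
    plausible output.
    """
    rng = np.random.default_rng(seed)
    words = []
    for _ in range(length):
        predicted = layer.predict() + noise * rng.standard_normal(layer.state.size)
        word = nearest(predicted, D)   # snap to the closest known word
        words.append(word)
        layer.update(D[word])          # feed it back in as the next input
    return " ".join(words)

# print(generate(DummyLayer(VEC_SIZE), D))
```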

The code for Word2SDR (although really it is “text2SDR” 😉) is available here: https://github.com/222464/AILib/blob/master/Source/text/Word2SDR.h

So there you have it: using an HTM derivative for NLP!

Until next time!

10 thoughts on “Text2SDR”

  1. hi cire! i’m using word2vec/doc2vec and would like to take text2sdr for a spin. any idea how i can make it call the code from python – windows initially? thanks in advance!

    • Hello!

      If you give me a day, I’ll make a Python version for you, using some newer algorithms I now have.
      It’s not too much work for me since I already have a lot of sparse coding code written in Python.

      If it is alright with you, I will send you an email when it is done!

      ~ CireNeikual

      • I won’t get around to testing it until the weekend or early next week anyway, so this will be perfect! I’ve been using the word2vec/doc2vec package from gensim a lot recently, but am really looking for word2vec/doc2vec vectors as sparse representations. Preferably, the length of the sparse vectors could be specified beforehand. I’m assuming that the text2sdr vectors are all of the same length. Thanks!

  2. Didn’t have much luck, mainly because of the length of the processing time. Can provide details via email – dinesh [dot] vadhia [@] outlook [dot] com
