Text2SDR

Hello!

While working on reinforcement learning with HTFERL, I tried applying the algorithm to some other things as well, including some natural language processing, with which I have no experience. So here is what happened…

I decided to start out with a single layer of HTFERL for predicting words ahead of time. The algorithm my brother (also interested in NLP!) and I came up with works like this (a sketch in code follows the list):

  • If the word has never been seen before (not part of some dictionary D), create a new entry in D for this word, and assign it the current predicted word vector as the feature.
  • If the word has been seen before (it already exists in D), train the layer so that its prediction matches the feature vector stored for this word.
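
Here is a minimal sketch of that loop in Python. The DummyLayer class, its predict()/learn()/update() methods, and the vector size are stand-ins of my own for illustration; they are not the actual HTFERL interface:

```python
import numpy as np

VEC_SIZE = 64  # length of each feature vector; my choice, not from the post

class DummyLayer:
    """Stand-in for an HTFERL layer: a crude recurrent predictor.

    The real HTFERL layer is far more sophisticated; this class exists
    only so the sketch runs end to end.
    """
    def __init__(self, size, seed=0):
        rng = np.random.default_rng(seed)
        self.state = rng.standard_normal(size) * 0.1
        self.weights = rng.standard_normal((size, size)) * 0.01

    def predict(self):
        # The layer's current guess at the next word's feature vector.
        return np.tanh(self.weights @ self.state)

    def learn(self, target, rate=0.05):
        # Nudge future predictions toward the target feature vector.
        error = target - self.predict()
        self.weights += rate * np.outer(error, self.state)

    def update(self, feature):
        # Feed a word's feature vector in as the next input.
        self.state = feature

def build_dictionary(tokens, layer):
    """The word-vector assignment loop from the bullet list above."""
    D = {}  # dictionary: word -> feature vector
    for word in tokens:
        predicted = layer.predict()
        if word not in D:
            # Unseen word: assign the current prediction as its feature.
            D[word] = predicted.copy()
        else:
            # Seen word: train the layer to predict this word's feature.
            layer.learn(D[word])
        layer.update(D[word])
    return D

tokens = "the cat sat on the mat because the cat was tired".split()
D = build_dictionary(tokens, DummyLayer(VEC_SIZE))
print(len(D), "unique words assigned vectors")
```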

So the layer of HTFERL goes through the sentence, word by word (or using some other tokenization method), automatically assigning word vectors (features) to words it doesn’t know while keeping its predictions up to date on words it does know.

This may seem very similar to word2vec; that’s because it is. The features generated by this process describe words by their grammatical properties, without the system actually knowing what the words mean. Just like with word2vec, words with similar vectors are similar in meaning, and just like with word2vec, it is possible to perform arithmetic on the word vectors.
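
For instance, once the dictionary D from the sketch above has been built, the usual analogy arithmetic can be tried with a nearest-neighbor lookup. The king/man/woman triple is the classic word2vec example, not a result I am claiming for this system:

```python
import numpy as np

def nearest(query, D, exclude=()):
    """Return the word in D whose vector has the highest cosine
    similarity to the query vector."""
    best_word, best_sim = None, -np.inf
    for word, vec in D.items():
        if word in exclude:
            continue
        sim = (vec @ query) / (np.linalg.norm(vec) * np.linalg.norm(query) + 1e-8)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Classic analogy: king - man + woman ~ queen (assuming the words are in D).
# result = nearest(D["king"] - D["man"] + D["woman"], D,
#                  exclude=("king", "man", "woman"))
```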

So what makes this special compared to word2vec? Well, the word vectors are really only a side-effect of the technique. The interesting part is when we start using the system to understand sentences.

As the HTFERL layer parses the text, it builds an internal sparse distributed representation (SDR) of the text as a whole. Since it learns whatever is necessary to predict the next word in the sentence, the SDR ends up containing fairly complete information about the meaning of the sentence.

From here we can use the SDRs HTFERL generates as input to a classifier or some other system. Alternatively, the text predictions themselves can be put to use stand-alone.
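
As a sketch of the first option, reusing the stand-in layer from above: run a sentence through the layer and use the resulting internal state as a fixed-length feature vector. In the real system this state would be a sparse binary SDR; the dense state of the dummy layer merely stands in for it.

```python
def sentence_to_features(tokens, layer, D):
    """Run a sentence through the layer and return its final internal
    state as a fixed-length feature vector for a downstream classifier."""
    for word in tokens:
        if word in D:  # skip words the dictionary has never seen
            layer.update(D[word])
    return layer.state.copy()

# e.g. build a feature matrix for any off-the-shelf classifier:
# X = [sentence_to_features(s.split(), DummyLayer(VEC_SIZE), D)
#      for s in sentences]
# then fit the classifier on (X, labels) as usual.
```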

I have developed and tested a system that predicts what you are about to type based on your current typing pattern and the history of what you have typed. I am currently developing a Visual Studio plugin that uses this system as a form of smart code completion.

Another interesting test I did was sentence generation. If you feed the predicted word back into the system as input, it will start generating a sentence. If you perturb the predictions a bit, it will start using different words with similar meanings, forming “random” but still grammatically valid sentences.
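
A sketch of that feedback loop, reusing nearest() and the stand-in layer from above: take the layer’s prediction, perturb it with a little noise, snap it to the closest known word, and feed that word’s vector back in as the next input.

```python
def generate(layer, D, length=10, noise=0.1, seed=0):
    """Generate text by feeding predictions back in as input.

    'noise' perturbs each prediction so the nearest-word lookup can land
    on different words with similar vectors, giving varied but still
    plausible output.
    """
    rng = np.random.default_rng(seed)
    words = []
    for _ in range(length):
        predicted = layer.predict() + noise * rng.standard_normal(layer.state.size)
        word = nearest(predicted, D)   # snap to the closest known word
        words.append(word)
        layer.update(D[word])          # feed it back in as the next input
    return " ".join(words)

# print(generate(DummyLayer(VEC_SIZE), D))
```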

The code for Word2SDR (although really it is “text2SDR” 😉) is available here: https://github.com/222464/AILib/blob/master/Source/text/Word2SDR.h

So there you have it: using an HTM derivative for NLP!

Until next time!

10 thoughts on “Text2SDR”

  1. hi cire! i’m using word2vec/doc2vec and would like to take text2sdr for a spin. any idea how i can make it call the code from python – windows initially? thanks in advance!

    • Hello!

      If you give me a day, I’ll make a Python version for you, using some newer algorithms I now have.
      It’s not too much work for me since I already have a lot of sparse coding code written in Python.

      If it is alright with you, I will send you an email when it is done!

      ~ CireNeikual

      • I won’t get around to testing it until the weekend or early next week anyway, so this will be perfect! I’ve been using the word2vec/doc2vec package from gensim a lot recently, but am really looking for word2vec/doc2vec vectors as sparse representations. Preferably, the length of the sparse vectors could be specified beforehand. I’m assuming that the text2sdr vectors are all of the same length. Thanks!

  2. Didn’t have much luck, mainly because of the length of the processing time. Can provide details via email – dinesh [dot] vadhia [@] outlook [dot] com
