I am now (almost) satisfied with the performance of the system on pole balancing. The main issue I am having right now is with the function approximators for the actor/critic: they are slow and lack sufficient precision. Fortunately, the switch to advantage(lambda) learning helped remedy the inaccuracy of the function approximation to some extent. Still, I need massive tau values to get decent performance, so increasing the accuracy of the function approximators is a top priority.
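For the curious, here is roughly what an advantage-learning update looks like in the tabular case. This is a minimal Python sketch for illustration only; the exact scaling convention for tau varies between formulations (some divide by it, some multiply), so I show the variant where a larger tau expands the gap between action values:

```python
import numpy as np

def advantage_update(A, s, a, r, s_next, alpha=0.1, gamma=0.99, tau=10.0):
    """One tabular advantage-learning step (illustrative sketch).

    The Bellman residual is scaled by tau before being folded into the
    target, which widens the spread between the best action's value and
    the rest -- a spread that is easier for a coarse function
    approximator to resolve than plain Q-values.
    """
    v_s = np.max(A[s])           # state-value estimate of current state
    v_next = np.max(A[s_next])   # state-value estimate of next state
    target = v_s + tau * (r + gamma * v_next - v_s)
    A[s, a] += alpha * (target - A[s, a])
    return A
```

With tau = 1 this reduces to an ordinary Q-learning target; pushing tau up magnifies the differences the approximator has to represent, which is exactly the trade-off I am fighting above.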
Aside from that, I worked on improving the generalization capabilities of the function approximators by adding regularization and early stopping. I also started using a combination of epsilon-greedy and softmax action selection.
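One natural way to combine the two selection schemes (and the one I have in mind here, as a hedged sketch rather than my exact code) is to explore via a softmax over the action values with probability epsilon, and act greedily otherwise:

```python
import numpy as np

def select_action(q_values, epsilon=0.1, temperature=1.0, rng=None):
    """Hybrid epsilon-greedy / softmax action selection (sketch).

    With probability epsilon, sample an action from a Boltzmann
    (softmax) distribution over the values, so exploration still
    favors promising actions; otherwise pick the greedy action.
    """
    rng = rng if rng is not None else np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    if rng.random() < epsilon:
        z = q / temperature
        z -= z.max()                     # subtract max for numerical stability
        p = np.exp(z) / np.exp(z).sum()  # softmax probabilities
        return int(rng.choice(len(q), p=p))
    return int(np.argmax(q))
```

Compared with plain epsilon-greedy, the exploratory moves are no longer uniform, so the agent wastes less time on actions it already knows are bad.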
I also experimented with using a separate function approximator for each action (in the discrete-action version). The intuition is that separate approximators reduce error interference between actions and make it easier to learn the larger differences in output values that advantage learning produces.
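The idea can be sketched with one independent linear model per action (my actual approximators are more elaborate; this minimal Python version just shows why updates to one action cannot disturb the others):

```python
import numpy as np

class PerActionCritic:
    """One independent linear approximator per discrete action.

    Because each action owns its own weight vector, a gradient step
    for one action leaves the other actions' value estimates
    untouched -- no error interference across actions.
    """
    def __init__(self, n_actions, n_features, lr=0.01):
        self.w = np.zeros((n_actions, n_features))
        self.lr = lr

    def predict(self, features):
        # Returns one value estimate per action.
        return self.w @ np.asarray(features, dtype=float)

    def update(self, features, action, target):
        # TD-style update applied only to the chosen action's weights.
        features = np.asarray(features, dtype=float)
        error = target - self.w[action] @ features
        self.w[action] += self.lr * error * features
        return error
```

With a single shared network, the same update would drag the other actions' outputs along through the shared weights; here that coupling is gone by construction.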
I added plotting capabilities using sfml-plot, allowing me to better observe the effects of my changes.
Here is an image of the pole balancing experiment with plotting:
Until next time!