One of my favorite papers I’ve read is “The Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction.”
TLDR – Like the name suggests, it presents a reinforcement learning agent architecture that is able to learn, and make tens of thousands of predictions at any time step. These predictions are based in low level sensorimotor observations. Each of these predictions are made via general value functions, some of which can be off policy predictions. In other words (for those not as familiar with RL terminology), these predictions are predictions IF the agent were to behave in a manner, not necessarily the same as the way it is currently behaving. For example, while driving home from work, I could start to learn (and in the future predict) how long it would take me to get to my local grocery store. I don’t intend to go into the implementation details, but the high level intuition is that the agent making this prediction (how long to drive to the grocery store) while doing something different (driving home) is able to do so, since there is behavioral overlap between these two behaviors. Where there is overlap, the agent can learn.
This off policy learning seems to be powerful. An agent is able to learn something about behavior not necessarily the same as the current target behavior. This is especially powerful when the agent does not directly control the behavior. However, there is some downside. The algorithm is algorithmically more complicated. With GTD lambda for example, a second weight vector needs to be stored and updated. Importance sampling via rho needs to also be calculated. The extra weight vector also increases storage / memory cost. So the off policy advantage comes at a cost.
I wonder, in cases like Horde, where the agent controls the behavior policy directly, and the purpose is to learn (rather than maximize some reward), how another approach would work. Perhaps each GVF could be implemented as on policy. Rather than GTD Lambda, a simple SARSA algorithm could be used. And that policy would be followed for a certain amount of time steps. Because all of the experience would be relevant, learning would happen faster in theory. Furthermore, there would be less computational overhead, so the slack that was generated could be used for planning or some other task. Another approach, rather than just enumerating through GVFs to determine the behavior policy, perhaps you could choose a GVF that needed learning the most, follow that policy for a while, and then move on to the next. “Needing learning the most” … what would that mean? I suppose it could be based on the current average TD error, combined with the number of times it’s been followed. This approach would be similar to some curiosity inspired exploration.
This idea of comparing behavioral approaches in Horde (what would the performance measure be for comparing the different approaches) seems like it could be interesting .