I’ve worked with a few reinforcement learning environments that are partially observable. The observations seen in these environments lack information to do a good job of identifying the current state within the environment. In these situations, different states are aliased – since many of them look the exact same.
What is an environment that is partially observable? Imagine standing in an extreme snow storm in the middle of an empty field. No matter where you look, no matter how you stand, all you see is white. All you feel is cold. Each state you are in, is hard to differentiate from the next. In this environment however, consider that there is a bench in the middle of the field. When you are right next to the bench, you can see it, and you can touch it. But as soon as you take a few steps away from it, it is no longer visible. There’s no way to sense it.
In this world, if you are to rely directly on your immediate observations (what you see and what you feel), all states look the exact same, with the exception of the state where you’re directly beside the bench. But as soon as you step one step away from the bench, you’re back to a state that is aliased with almost all the others.
There are two approaches that come to mind when creating a feature representation, that may get over this state aliasing problem for such environments. Both involve some element of using “memory,” so that the feature representation is not only comprised of what the agent currently sees.
- A recurrent memory based approach. In this approach the feature representation of each state consists of the observation of the current time step PLUS the observations from the previous n time steps. For example, if all I see is white snow, but I also know that I saw a bench last time step, I know quite precisely where I am (one step away from a bench).
- A predictive based approach. In this approach, the feature representation of each state consists of the observation of the current time step PLUS the predictions from the previous time step. For example, if all I see is white snow, but at the previous timestep I predicted that I was two steps away from the bench if I kept moving forward, I would now know where I was (again, one step away from a bench).
Both approaches seem to incorporate some form of memory (apologies for using “memory” quite loosely). The latter approach has a forward view of memory. Instead of looking back at the previous n time steps and summarizes what did happen, it looks ahead into the next n time steps and summarizes what will happen. I wonder how these two approaches would compare.
One thing that comes to mind would be that a forward view of memory might generalize better. In other words, regardless of how I got to a given state, if the predictions of the future are the same, wouldn’t you want each of these states to generalize the same?