Are we an online only learning agent? Is the best agent one that only learns online?

Do humans learn fully online? I'll define "online" as learning that only leverages the data arriving from the current observation – the agent learns from this new data alone. In such a world, batch processing and experience replay are not "online." Is that how humans work?

I suspect not. I think there’s a balance in humans between online learning and planning. Furthermore, I think we shift between the two depending on the situation. When a human is faced with an intense situation – for example, a professional hockey player in the middle of their shift – all of the data processing is in the foreground. They take the immediate observation and act upon it. There is no background planning taking place. At the other extreme is the same athlete back home daydreaming just before falling asleep (or while actually sleeping). At this time, the human is doing little, if any, foreground processing. The observations they are receiving are minimal – they’re lying in a dark room. And their action decisions are even more minimal. But in such a state of dreaming, their mind is free to plan (with the model of the world that has been learned).

This brings me to an ideal agent. I think an ideal agent will make full use of its computational power. There is no idle time for such an agent; it learns as much as it can at every moment. There are no sleep(s)! When such an agent is in a situation where immediate action is required (like that of an athlete), this computational power must be dedicated completely to foreground processing. But later, when the same agent is in a safe place, it has the opportunity to plan using the model of the world it has learned. I usually hate suggesting that humans are the default inspiration for designing the best agent (i.e. just accepting that neural nets are the de facto best function approximator because brains implement them) … but in this case, I think we have it right.
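To make that concrete, here’s a rough sketch of what such a schedule could look like, in the spirit of Dyna-style architectures. Everything in it – the value function q, the model, the per-step time budget, and all of their methods – is a hypothetical placeholder rather than a reference design; the only point is that foreground processing happens first, and whatever compute is left over goes to planning.

```python
import time

def agent_step(obs, reward, q, model, budget_s):
    """One time step: foreground processing first, planning with whatever time is left.

    `q`, `model`, and their methods are hypothetical stand-ins, not real APIs.
    """
    deadline = time.monotonic() + budget_s

    # Foreground: learn from the immediate experience and act on it.
    q.update(obs, reward)                 # online update from the newest data only
    action = q.select_action(obs)
    model.record(obs, action, reward)     # remember the experience for later planning

    # Background: if the situation leaves spare time (the "safe place"),
    # spend every remaining cycle planning with the learned model. No sleep()s.
    while time.monotonic() < deadline:
        sim = model.sample_transition()   # imagined experience from the world model
        q.update_from_simulated(*sim)

    return action
```

In an intense situation the budget is effectively consumed by the foreground work and the planning loop never runs; in a quiet one, almost all of the budget is spent planning.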

History of state indexing

The trajectory of an MDP is given as:

S0, A0, R1, S1, A1, R2, S2, A2, R3, …

An agent is given a state, takes an action, receives a reward and the next state … and the cycle continues. Why is the indexing of states, actions, and rewards done this way by convention?

I think it’s important to note that this is convention, and convention only. R1 above could just as easily have been denoted R0. I think the choice of indexing comes down to which best fits our mental model of how things work. You could argue both sides.

One mental model is that, since a given state, action, and resulting reward are all required together at each learning step in most algorithms, it would make sense for these states, actions, and rewards to share the same index. Under this mental model, the sequence above would be indexed as S0, A0, R0. This seems reasonable.

But I think of the stream of states, actions, and rewards from a more temporally consistent, data-stream perspective. From this view, the state and reward enter the agent at the exact same temporal step. The indexes have temporal significance, so the state and reward should share the same index. Therefore, the indexing sequence above makes the most sense to me.
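A tiny, purely illustrative loop makes this view concrete. The environment and policy below are throwaway stand-ins invented for the example; the point is only that the step call hands back the reward and the next state together, at the same temporal step, which is why they share the index t+1.

```python
import random

class TinyEnv:
    """A toy three-step environment, just to make the indexing concrete."""
    def reset(self):
        self.t = 0
        return 0                                  # S_0
    def step(self, action):
        self.t += 1
        done = self.t >= 3
        return self.t, 1.0, done                  # S_{t+1}, R_{t+1}, done

def policy(state):
    return random.choice([0, 1])

env = TinyEnv()
S, A, R = [env.reset()], [], [None]               # there is no R_0 under this convention
done, t = False, 0
while not done:
    A.append(policy(S[t]))                        # A_t is chosen in S_t
    next_state, reward, done = env.step(A[t])
    S.append(next_state)                          # S_{t+1} and R_{t+1} arrive at the
    R.append(reward)                              # same moment, so they share index t+1
    t += 1
```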

Driving behavior to learn abstract predictive state and knowledge

As humans born into the world, we are exposed to nothing but a stream of raw sensorimotor data – visual, audio, touch, smell. Using this data alone, we begin to form increasingly abstract knowledge.

One approach for artificially intelligent agents to represent such knowledge is through predictions. "Perception as prediction" is something I’ve written about before. One such approach would be to use General Value Functions (GVFs). A GVF could be created which predicts some raw value in the observation stream. The value of that prediction could then be used in the question of a higher level prediction. On and on it goes, using GVFs at each layer, until more abstract knowledge is captured (a sketch of this layered setup follows the list below). But for a system attempting to learn this abstract knowledge, how should the agent drive its behavior? There are several challenges which I’m encountering in my current research in this area:

  • The GVFs at higher levels require the GVFs at the lower levels to already have been learned. Therefore, it’s most efficient to follow behavior policies that optimize learning the early layers. Perhaps the agent should, in sequence, follow behaviors that foster learning these low level GVFs – in other words, follow the behavior policies from the GVFs directly, with some minimal exploration. Then and only then would the agent be able to learn the higher level GVFs. In a way, this would be sequential on-policy learning, rather than parallel off-policy learning.
  • The later layers depend on input from the earlier layers. So until these earlier layers are learned, the later layers are being handed incorrect data to learn from. This ties up computation unnecessarily, and initializes the weight vectors of these later GVFs to incorrect values – often very incorrect values.
  • If this were the approach, however, the agent cannot completely stop following the policies that learn the lower level GVFs. In other words, the policy for a first layer GVF must still be followed on occasion. Because we are using function approximation rather than a tabular representation, if the policy is never followed, the predictive value for that GVF will gradually be eroded. It’s like a human who learns that if it stands at the edge of a lake and extends its foot, it will feel the sensation of “wetness.” Once it has learned that, it must, on occasion, experience the feeling of wetness by extending its foot. This is because there are many other similar states in which the agent will NOT experience this sensation. Because of function approximation, each time the agent does not experience the sensation, the value of the GVF predicting wetness will move closer to NOT experiencing it. So the agent needs to be reminded occasionally how it feels to experience this sensation, and which states lead to it. This raises the question: when should the agent revisit and relearn the values and behavior policies of earlier GVFs?
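Here is the layered GVF sketch promised above, using linear TD(0) and ignoring off-policy corrections entirely for brevity. The GVF class, featurize, the feature sizes, and the cumulants are all assumptions made up for illustration; the only point is that layer 1’s cumulant is layer 0’s prediction, so layer 1 is being handed meaningless targets until layer 0’s weights have roughly settled.

```python
import numpy as np

def featurize(obs):
    """Hypothetical feature constructor; here it just passes the observation through."""
    return np.asarray(obs, dtype=float)

class GVF:
    """A linear GVF learned with TD(0): predicts the discounted sum of a cumulant."""
    def __init__(self, n_features, gamma, cumulant_fn, alpha=0.1):
        self.w = np.zeros(n_features)
        self.gamma = gamma
        self.cumulant_fn = cumulant_fn      # the "question": which signal is being predicted
        self.alpha = alpha

    def predict(self, x):
        return self.w @ x

    def update(self, x, x_next, obs_next):
        # TD(0): move the prediction toward cumulant + gamma * next prediction.
        c = self.cumulant_fn(obs_next)
        delta = c + self.gamma * self.predict(x_next) - self.predict(x)
        self.w += self.alpha * delta * x

# Layer 0 predicts a raw sensor value; layer 1 asks a question about layer 0's answer.
layer0 = GVF(n_features=8, gamma=0.9, cumulant_fn=lambda obs: obs[0])
layer1 = GVF(n_features=8, gamma=0.9,
             cumulant_fn=lambda obs: layer0.predict(featurize(obs)))
```

A real agent would also need importance-sampling corrections for each layer’s policy, or the kind of sequential, layer-by-layer behavior schedule described in the first bullet above.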

 

Additionally, is there a place for an agent that is motivated purely to learn this knowledge? Shouldn’t it instead be following behavior which accomplishes some other goal? Perhaps the goal of learning such knowledge belongs to planning, since this knowledge allows for accomplishing goals across multiple tasks.

 

Predictive vs. Historical features

Several of my experiments in reinforcement learning have involved generating predictive state representations (where part, or some, of the features are predictive in nature). These learned predictive features look at the current observation, make predictions about some future cumulant, and incorporate those predictions into the current state representation. In the Cycle World setting, I’ve demonstrated how this can be useful. In a way, predictive state representations look at an aspect of the future data stream and add this information to the current state representation.

In contrast, another way to construct a state representation would be from a historical perspective: the agent would encapsulate previous observations and incorporate them into the current state representation.
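As a rough illustration of the contrast, here are two toy feature constructors. The names, the stack length, and the reuse of the GVF sketch from earlier are hypothetical choices made up for this example, not a proposal: one fills the representation with the recent past, the other with predictions about the future.

```python
import numpy as np
from collections import deque

def historical_features(obs, history, k=4):
    """Concatenate the current observation with the k most recent past observations."""
    stacked = list(history)[-k:]
    while len(stacked) < k:
        stacked.insert(0, np.zeros_like(obs))   # zero-pad early in an episode
    history.append(obs)                         # remember this observation for next time
    return np.concatenate([obs] + stacked)

def predictive_features(obs, gvfs, featurize):
    """Concatenate the current observation with each GVF's prediction about the future."""
    x = featurize(obs)
    preds = np.array([gvf.predict(x) for gvf in gvfs])
    return np.concatenate([obs, preds])

# Usage sketch: history = deque(maxlen=4); state = historical_features(obs, history)
```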

Given these two perspectives, it got me thinking: is there REALLY a place for predictive state representations? What is that place? Can they just be replaced by historical ones? What follows raises more questions than answers … but it is my way of fleshing out some thoughts on this topic.

First off, with respect to predictive features, I’m going to define “useful” as being able to provide information that allows the agent to identify the state. A useful feature would allow the agent to generalize across similar states and, likewise, differentiate from unlike states.

For both predictive and historical features, I think they’re really “only” useful when the observation stream is sparse in information. In other words, it lacks information at each timestep which would allow the agent to uniquely identify the current state. The tabular grid world, where the current state is uniquely identified, is the opposite of such an information-lacking observation stream. In contrast to the tabular world, in information-sparse / partially observable settings, historical or predictive features can fill the gaps where the current observation is lacking information. If informative data exists in the past data stream, then historical features become useful. Likewise, if informative data exists in the future data stream, then predictive features become useful.

The B MDP, where there is unique information at the very beginning of each episode which directly informs the agent which action to take in the future, is somewhat of a special case where all of the useful information in the stream exists only in the past (the first state). Therefore, predictive features are of no use there. But in continuing (real life) tasks, I don’t think this would ever be the case. Informative data “should” always be observable in the future, if the agent takes the correct steps. So if that’s the case (as long as the agent can navigate back to those states), then predictive features can do whatever the historical ones can. It’s only when the agent can’t get back that historical features are required. But those cases seem somewhat contrived.
In either case though … maybe this is stating the obvious … but historical and predictive features seem to share something in common: they are useful when the current observation is starved for more informative data. Historical features help when there is informative data in the rear-view mirror; predictive features help when there’s informative data ahead, as in the following MDP.

[Figure: the Cycle World MDP]
So which to use (predictive or historical) really depends on what the data stream looks like … and where you are within it.