As humans born into the world, we are exposed to nothing but a stream of raw sensorimotor data – visual, audio, touch, smell. Using this data alone, we begin to form increasingly abstract knowledge.
One approach for artificially intelligent agents to represent such knowledge is through predictions. "Perception as prediction" is an idea I've written about before. One way to implement it is with General Value Functions (GVFs). Under this approach, a GVF could be created to predict some raw value in the observation stream. The value of that prediction could then be used in the question of a higher-level prediction. On and on it goes, with GVFs at each layer, until more abstract knowledge is captured. But for a system attempting to learn this layered abstract knowledge, how should the agent drive its behavior? There are several challenges I'm encountering in my current research in this area:
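To make the layering concrete, here is a minimal toy sketch of my own (not a full GVF framework): two linear TD(0) learners, where the second GVF's cumulant is the first GVF's prediction. The binary features, step size, and shared fixed discount are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 8   # size of the (assumed) binary feature vector
GAMMA = 0.9      # fixed termination for both GVFs, for simplicity
ALPHA = 0.1      # step size

# One weight vector per GVF; predictions are linear in the features.
w_low = np.zeros(N_FEATURES)    # layer 1: predicts a raw observation signal
w_high = np.zeros(N_FEATURES)   # layer 2: its cumulant is layer 1's prediction

def td0_update(w, x, cumulant, x_next):
    """One TD(0) step toward the discounted sum of the cumulant."""
    delta = cumulant + GAMMA * (w @ x_next) - (w @ x)
    w += ALPHA * delta * x   # in-place semi-gradient update

x = rng.integers(0, 2, N_FEATURES).astype(float)
for _ in range(1000):
    x_next = rng.integers(0, 2, N_FEATURES).astype(float)
    raw_signal = x_next[0]                      # toy "raw observation" cumulant
    td0_update(w_low, x, raw_signal, x_next)
    # Layer 2 asks a question about layer 1's answer: its cumulant
    # is the lower GVF's current prediction.
    td0_update(w_high, x, w_low @ x_next, x_next)
    x = x_next
```

Note how the second update illustrates the problem below: while `w_low` is still wrong, `w_high` is being trained on a wrong cumulant.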
- The GVFs at higher levels require the GVFs at lower levels to already have been learned. It is therefore most efficient to follow behavior policies that optimize learning the early layers. Perhaps the agent should, in sequence, follow behaviors that foster learning these low-level GVFs – in other words, follow the behavior policies of the GVFs directly, with some minimal exploration. Then, and only then, would the agent be able to learn the higher-level GVFs. In a way, this would be sequential on-policy learning rather than parallel off-policy learning.
- The later layers depend on input from the earlier layers. So until those earlier layers are learned, the later layers are being handed incorrect data to learn from. This wastes computation and initializes the weight vectors of the later GVFs to incorrect values – often very incorrect values.
- Even with this approach, however, the agent cannot completely stop following the policies that learned the lower-level GVFs. In other words, the policy for a first-layer GVF must still be followed on occasion. Because we are using function approximation rather than a tabular representation, a GVF's predictive value will gradually erode if its policy is never followed. It's like a human who learns that standing at the edge of a lake and extending a foot produces the sensation of "wetness." Once this is learned, the human must occasionally re-experience that wetness by extending a foot, because there are many similar states in which the agent will NOT experience the sensation. Under function approximation, each time the agent fails to experience the sensation, the value of the GVF predicting wetness moves closer to NOT experiencing it. The agent needs to be reminded occasionally how the sensation feels and which states lead to it. This raises the question: when should the agent revisit and relearn the values and behavior policies of earlier GVFs?
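The erosion effect can be sketched numerically. In this toy example of mine, the "lake edge" state and a similar-looking dry ledge share a feature under linear function approximation; training only on the dry state drags the edge prediction down, while occasional revisits restore it. The features, outcomes, and 1-in-10 revisit schedule are all invented for illustration.

```python
import numpy as np

ALPHA = 0.1

# Two states that share a feature under function approximation:
# the lake edge (where extending a foot yields "wetness") and a
# similar-looking dry ledge. The shared feature causes the erosion.
x_edge = np.array([1.0, 1.0, 0.0])   # shared feature + edge-specific feature
x_ledge = np.array([1.0, 0.0, 1.0])  # shared feature + ledge-specific feature

w = np.zeros(3)

def update(w, x, outcome):
    # One-step update of the linear prediction toward the observed outcome.
    w += ALPHA * (outcome - w @ x) * x

# Phase 1: the agent follows the "extend foot at the edge" policy and learns.
for _ in range(200):
    update(w, x_edge, 1.0)   # wetness experienced
learned = w @ x_edge         # prediction is near 1.0

# Phase 2: the agent only visits the dry ledge; the shared weight erodes.
for _ in range(200):
    update(w, x_ledge, 0.0)  # no wetness
eroded = w @ x_edge          # edge prediction has decayed

# Phase 3: occasional revisits (1 step in 10) keep the prediction alive.
for t in range(200):
    if t % 10 == 0:
        update(w, x_edge, 1.0)
    else:
        update(w, x_ledge, 0.0)
maintained = w @ x_edge      # back near the learned value
```

Even a small fraction of revisiting is enough here, which suggests the scheduling question is about *how often*, not whether, to return to earlier policies.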
Additionally, is there a place for an agent that is intrinsically motivated to learn this knowledge? Or should it be following behavior that accomplishes some other goal? Perhaps the goal of learning such knowledge is more appropriate during planning, since this knowledge helps accomplish goals across multiple tasks.