In natural and artificial intelligence, “perception” is the ability to become aware of something through the senses. Agents, human or artificial, may perceive that a fire is burning, or that a baseball is traveling toward a glove. In either case, the intelligent system has become knowledgeable through its senses.
Traditional artificial intelligence systems have demonstrated perception by detecting objects. A robot may learn about chairs, and later perceive that a chair is in front of it. Such a system could learn to perceive chairs through supervised learning: the system is shown thousands or millions of examples of chairs, so that when encountering a new one, the agent perceives its presence.
A supervised learning approach such as this is demonstrably powerful, but a new reinforcement learning approach to achieving “perception” through experience may also be possible.
Consider a newborn baby. In a matter of hours, the baby is able to “perceive” its mother's breast. It does so without a teacher showing it thousands of flashcards containing pictures of breasts. It does so because of its interactions with its mother. Through these interactions, in certain states, the baby is able to perceive that it is in the presence of its mother's breast because it can predict drawing milk if it were to perform a sequence of actions. Perception, and this predictive approach to knowledge, can be achieved systematically in a similar way using reinforcement learning.
Sutton’s “Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction” and White’s “Developing a Predictive Approach to Knowledge” form the foundation of a computational way to learn to perceive. The high-level takeaways from both papers are included below.
Key ideas and notes:
- “General value functions” (GVFs) are the computational units that answer predictive questions about the environment. They provide the semantics for asking, and answering, these predictive questions.
- These GVFs are, not surprisingly, a generalization of the value functions familiar throughout the reinforcement learning literature.
- GVFs have “targets” rather than total reward, and “cumulants” rather than rewards: the target is the expected discounted sum of the cumulant.
- The goal when using GVFs is not to maximize the cumulant; it is to be able to predict the cumulant by passively observing the sensorimotor stream of experience.
- There are several “levers” at the designer’s disposal for designing the questions.
- The designer can specify the cumulant, which is effectively the signal one is interested in predicting. For example, one might be interested in predicting the amount of energy that will flow into the robot’s battery, similar to the analogy with the baby.
- The designer can specify the time scale of interest, via the value used for gamma (a constant gamma corresponds to a horizon of roughly 1/(1 − gamma) time steps). Using the baby example yet again, a predictive question about how much energy could be derived in the next minute may be asked by choosing gamma appropriately.
- The designer can control the policy, e.g., “How much energy will I draw if I drive straight forward continually?”
- Unique to GVFs, the value of gamma is state-dependent. This is how a question can effectively be turned off: setting gamma to zero in a state terminates the prediction there.
- Learning GVFs is similar to learning regular value functions: it uses function approximation (for example, tile coding) for state estimation, TD errors, and eligibility traces.
- Once learned, the approximate GVF value (the answer) is obtained by taking the inner product of the learned weight vector with the feature vector.
- Multiple GVFs (“demons”) can be learned at the same time. Each can learn off-policy, using snippets of relevant experience generated by the behavior policy of the agent.
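The learning mechanics in the notes above can be sketched as a single demon: a linear TD(λ) learner whose reward is replaced by a cumulant and whose discount is state-dependent. This is a minimal illustrative sketch, not the actual Horde implementation; the class name, method names, and the particular off-policy trace update are assumptions made here for clarity.

```python
import numpy as np

class GVFDemon:
    """A single GVF "demon": linear TD(lambda) with a cumulant in place
    of reward, a state-dependent gamma, and an importance-sampling ratio
    for off-policy learning. Illustrative only."""

    def __init__(self, n_features, lam=0.9, alpha=0.1):
        self.w = np.zeros(n_features)   # learned weight vector
        self.e = np.zeros(n_features)   # eligibility trace
        self.lam = lam                  # trace-decay parameter lambda
        self.alpha = alpha              # step size

    def predict(self, phi):
        # The "answer" to the predictive question: inner product of the
        # learned weights with the current feature vector.
        return float(self.w @ phi)

    def update(self, phi, phi_next, cumulant, gamma_next, rho=1.0):
        # One TD(lambda) step. `cumulant` plays the role of the reward,
        # and `gamma_next` is state-dependent: setting it to 0 in a state
        # terminates (turns off) the question there. `rho` is the
        # importance-sampling ratio pi(a|s)/b(a|s); one common form of
        # the off-policy trace update is e <- rho * (gamma * lam * e + phi).
        delta = cumulant + gamma_next * (self.w @ phi_next) - self.w @ phi
        self.e = rho * (gamma_next * self.lam * self.e + phi)
        self.w += self.alpha * delta * self.e
```

For instance, with a single state (one-hot feature), a constant cumulant of 1, and gamma = 0.5, the prediction converges toward 1 / (1 − 0.5) = 2, the discounted sum of the cumulant. Running many demons at once, each with its own cumulant, gamma, and policy, is what makes the architecture "Horde"-like.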
Horde presents demonstrations of robots that are able to learn multiple different GVFs. It seems, however, that a missing step is actually leveraging these predictions, either to optimize control or to build other predictions. The latter is a major topic of Mark Ring’s work, in particular “Better Generalization with Forecasts.” http://www.ijcai.org/Proceedings/13/Papers/246.pdf
- [Sutton et al., 2011] Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White, and Doina Precup. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems – Volume 2, AAMAS ’11, pages 761–768, 2011.
- [White, 2015] Adam White. Developing a Predictive Approach to Knowledge. PhD thesis, University of Alberta, 2015.