Dynamic Horde of General Value functions

I just finished documenting my work on an architecture for a dynamic horde of general value functions.

General value functions (GVFs) have proven to be effective in answering predictive questions about the future. However, simply answering a single predictive question has limited utility. Others have demonstrated further utility by using these GVFs to dynamically compose more abstract questions, or to optimize control. In other words, to feed the prediction back into the system. But these demonstrations have relied on a static set of GVFs, handcrafted by a human designer …

https://github.com/dquail/RLGenerateAndTest

Dynamic Horde

I’m just finishing up a research project about an architecture for a dynamic set of general value functions. I’ll look to share the documentation and source on http://github.com/dquail as per usual. But in the meantime, I wanted to share the abstract.

General value functions (GVFs) have proven to be effective in answering predictive questions about the future. However, simply answering a single predictive question has limited utility. Others have demonstrated further utility by using these GVFs to dynamically compose more abstract questions (Ring 2017), or to optimize control (Modayil & Sutton 2014). In other words, to feed the prediction back into the system. But these demonstrations have relied on a static set of GVFs, handcrafted by a human designer.

In this paper, we look to extend the Horde architecture (Sutton et al. 2011) to not only feed the GVFs back into the system, but to do so dynamically. In doing so, we explore ways to control the lifecycle of GVFs contained in a Horde – mainly to create, test, cull, and recreate GVFs, in an attempt to maximize some objective.
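To make the create, test, cull, and recreate lifecycle a little more concrete, here is a minimal sketch of one generate-and-test step. It is not the architecture from the paper: the `Candidate` class, its `utility` score, the `generate` function, and the `cull_fraction` parameter are all hypothetical names invented for illustration, and how a GVF's usefulness is actually measured is left open.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    """Stand-in for one GVF in the horde (hypothetical, for illustration)."""
    gamma: float          # the question's time scale
    utility: float = 0.0  # assumed running score of how useful the GVF is


def generate(rng):
    """Create a new candidate GVF with a randomly chosen time scale."""
    return Candidate(gamma=rng.choice([0.0, 0.5, 0.9, 0.99]))


def cull_and_regenerate(horde, rng, cull_fraction=0.25):
    """Drop the lowest-utility GVFs and replace them with fresh candidates."""
    n_cull = int(len(horde) * cull_fraction)
    ranked = sorted(horde, key=lambda g: g.utility, reverse=True)
    survivors = ranked[:len(horde) - n_cull]
    return survivors + [generate(rng) for _ in range(n_cull)]


rng = random.Random(0)
horde = [Candidate(gamma=0.9, utility=float(u)) for u in range(8)]
horde = cull_and_regenerate(horde, rng)  # same size, two weakest replaced
```

In a real system the utility would have to be estimated while the GVFs are learning, and new candidates would need some grace period before being judged; both are outside this sketch.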

Real time Reinforcement learning examples

Over the last 4 months, I’ve completed several different projects using reinforcement learning on a Dynamixel servo that provides real-time continuous sensorimotor data to the learning algorithms. In particular, I’ve experimented with creating, running, and measuring thousands of GVF demons making predictions in parallel, policy gradient actor-critic methods, and Pavlovian control.

I’ve attempted to document what I learned as best I could. All code and experimental writeups can be found on my GitHub page at:

www.github.com/dquail/RobotPerception

Taking a break from my “smart” phone

I think we’re underestimating the importance of continued serendipitous daydreaming. Creative thoughts and problem-solving insights have often come to me while lying in bed, walking to the coffee store, or sitting at a red light. But so many of these moments are being interrupted by the crack cocaine that is my iPhone. I wish I had the willpower to not give in; but short of that, I’m going to experiment with giving it up and going to a dumb old flip phone. I did this about a year ago and lasted a month – until I took a water ski trip into Sacramento. The last second economist in me couldn’t take it.

Wish me luck!

How would you feel?

  • If you were beaten up. And 3 years later told by the bully that they ought to have taken your lunch money while they were at it … and that maybe next time they will.
  • If your rich neighbor erected a giant fence to keep you out of their yard. And told you that you must pay for it.
  • If someone a few blocks away said you couldn’t visit their block. Even though you have nowhere else to go. And that you may have been living there lawfully in the past. Or have family members currently living there.
  • If you were told you wouldn’t be prioritized because of your religion.

http://www.cnn.com/2017/01/29/politics/donald-trump-reality-check-first-week/

Perception as Prediction – A predictive approach to knowledge

Summary:

In natural and artificial intelligence, “perception” is thought to be the ability to become aware of something through our senses. Agents, human or artificial, may perceive that a fire is burning, or that there is a baseball traveling towards a glove. In either case, the intelligent system has become knowledgeable through its senses.

Traditional artificial intelligence systems have demonstrated perception by detecting objects. A robot may learn about chairs, and then later perceive that a chair is in front of it. Such a system could learn to perceive chairs through supervised learning. The system is shown thousands or millions of examples of chairs, so that when encountering a new one, the agent perceives its presence.

A supervised learning approach such as this is demonstrably powerful, but a new reinforcement learning approach to achieving “perception” through experience may also be possible.

Consider a newborn baby. In a matter of hours, the baby is able to “perceive” its mother’s breast. It does so without a teacher showing thousands of flashcards containing pictures of breasts. It does so because of its interactions with its mother. Through these interactions, in certain states, the baby is able to perceive that it’s in the presence of its mother’s breast because it can predict drawing milk if it were to perform a sequence of actions. Perception, and this predictive approach to knowledge, can be systematically achieved in a similar way using reinforcement learning.

Sutton’s “Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction” and White’s “Developing a Predictive Approach to Knowledge” form the foundation of a computational way to learn to perceive. The high-level takeaways from both papers are included below.

Key ideas and notes:

  • “General value functions” (GVFs) are the computational unit that answers predictive questions about the environment. They provide the semantics for asking, and answering, these predictive questions.
  • These GVFs are, not surprisingly, a generalization of the value functions familiar throughout the reinforcement learning literature.
  • GVFs have “targets” rather than total reward. They have “cumulants” rather than rewards.
  • The goal when using GVFs is not to maximize the cumulant. It’s to be able to predict the cumulant by passively observing the sensorimotor stream of experience.
  • There are several “levers” at the designer’s disposal to design the questions.
    • The designer can specify the cumulant, which is effectively the signal of interest to predict. For example, one might be interested in predicting the amount of energy that will flow into the robot’s battery, similar to the analogy with the baby.
    • The designer can specify the time scale of interest. This is set by the value used for gamma. Using the baby example yet again, a predictive question about how much energy could be derived in the next minute may be asked by choosing gamma accordingly.
    • The designer can control the policy, i.e., how much energy will I draw if I drive straight forward continually?
  • Unique to GVFs, the value for gamma is state dependent. This is how you can effectively turn off the question.
  • Learning GVFs is similar to learning regular value functions. It uses function approximation (for example tile coding) for state estimation, TD errors, and eligibility traces.
  • Once learned, the approximate GVF value (the answer) is obtained by taking the dot product of the learned weight vector and the feature vector.
  • Multiple GVFs (demons) can be learning at the same time. They can each learn off policy, using snippets of relevant experience generated by the behavioral policy of the agent.
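The ideas above can be sketched in a few lines of Python. This is a simplified illustration, not the Horde implementation: it learns on policy with plain linear TD(lambda) rather than the off-policy methods used in the paper, and the names `GVF`, `cumulant_fn`, `gamma_fn`, `horizon_to_gamma`, and `run_cycle_demo` are my own. The toy environment (a four-state cycle with one-hot features) is also an assumption chosen so that the correct answer is easy to check by hand.

```python
import numpy as np


def horizon_to_gamma(steps):
    """Rule of thumb: a constant gamma looks roughly 1 / (1 - gamma) steps ahead."""
    return 1.0 - 1.0 / steps


class GVF:
    """One linear GVF 'demon' learned with TD(lambda), on policy for simplicity."""

    def __init__(self, n_features, cumulant_fn, gamma_fn, alpha=0.1, lam=0.9):
        self.w = np.zeros(n_features)   # learned weight vector
        self.z = np.zeros(n_features)   # eligibility trace
        self.cumulant_fn = cumulant_fn  # observation -> signal of interest
        self.gamma_fn = gamma_fn        # observation -> state-dependent gamma
        self.alpha = alpha
        self.lam = lam
        self.gamma = 0.0                # gamma of the current state

    def predict(self, x):
        """The answer: dot product of the weight and feature vectors."""
        return float(self.w @ x)

    def update(self, x, obs_next, x_next):
        """One TD(lambda) step from features x to features x_next."""
        cumulant = self.cumulant_fn(obs_next)
        gamma_next = self.gamma_fn(obs_next)
        delta = cumulant + gamma_next * self.predict(x_next) - self.predict(x)
        self.z = self.gamma * self.lam * self.z + x
        self.w += self.alpha * delta * self.z
        self.gamma = gamma_next
        return delta


def run_cycle_demo(n_states=4, steps=5000):
    """Toy check: states 0 -> 1 -> ... -> 0, cumulant 1 on arriving at state 0."""
    gvf = GVF(
        n_states,
        cumulant_fn=lambda s: 1.0 if s == 0 else 0.0,
        gamma_fn=lambda s: 0.5,
    )
    features = np.eye(n_states)  # one-hot features, one per state
    s = 0
    for _ in range(steps):
        s_next = (s + 1) % n_states
        gvf.update(features[s], s_next, features[s_next])
        s = s_next
    return gvf
```

With gamma fixed at 0.5, the prediction for the last state of the cycle should settle near 1 / (1 - 0.5^4) ≈ 1.067, since the cumulant of 1 recurs every four steps. Returning 0 from `gamma_fn` in some state would cut the prediction off there, which is the “turn off the question” lever from the list.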

Horde presents some demonstrations of robots that are able to learn multiple different GVFs. It seems, however, that a missing step is actually leveraging these predictions, either to optimize control or to build other predictions. The latter is a major topic of Mark Ring’s work, in particular “Better generalization with forecasts”: http://www.ijcai.org/Proceedings/13/Papers/246.pdf

References:

  • [Sutton et al., 2011] Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White, and Doina Precup. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems – Volume 2, AAMAS ’11, pages 761–768, 2011.
  • White, Adam. Developing a predictive approach to knowledge. Diss. University of Alberta, 2015.

Perspectives and Problems in Motor Learning – notes

These are my notes after reading “Perspectives and Problems in Motor Learning” – Daniel Wolpert, 2001.

Summary

The paper is intended to explore the mechanics that humans and computers use to “learn” motor skills. This includes comparing how humans learn motor skills with how different fields of artificial intelligence, including supervised, unsupervised, and reinforcement learning, may accomplish the same thing. It also suggests some of the problems and barriers that existing technology faces in truly scaling this type of learning.

 

Notes (representing what I surmise is suggested in the paper – not necessarily what I agree with):

  • The brain’s whole purpose in life is to produce movement, since movement is the only way to truly interact with the world (even speech is a result of movement).
  • We are born with some innate motor controls, and this is demonstrated via sensory-deprived babies (blind and deaf) that don’t need to learn to smile. They just do it.
  • However, despite the innate motor skills, learning is still clearly required, as the world is non-stationary. The world clearly changes, as do our bodies (not just kids turning into adults, and people gaining weight, but people losing teeth, growing fingernails, etc).
  • Motor learning could deal with the brain learning how to send control signals, or the body (muscles) learning how to evolve. Most of motor learning deals with the former.
  • Motor learning is really the mapping from motor commands to sensory consequences, and the inverse.
    • The forward view (motor command => sensory consequences) creates a model predicting what senses will be felt if the agent behaves in a certain way. This is where supervised learning is most obvious: the agent’s own experience provides self-generated training data mapping an input (the motor command) to an output (the sensory consequence).
    • The inverse view (sensory consequences => motor command) is what allows the agent to decide which action to perform to achieve a desired goal. This clearly uses reinforcement learning, since there are many different paths to a goal but the optimal one is difficult to find.
    • Unsupervised learning seems to be what allows motor primitives to be found. With these primitives, other more sophisticated actions can be learned to generate the desired consequence.
  • There appears to be evidence that this model (supervised + unsupervised + reinforcement learning) is similar to how the brain works using dopamine and the cerebellum.
  • But there are significant problems with this model. For example, visual data doesn’t seem to be processed fast enough to make these foreground decisions. Furthermore, representations are needed, and the state-action space is massive: if each of 600 muscles can be either flexed or not flexed, there are 2^600 possible combinations.
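To get a feel for the size of that last number, here is a quick back-of-the-envelope check. It assumes the simplification from the notes above, that each of 600 muscles is either flexed or not; the variable names are just for illustration.

```python
# Each of 600 muscles is treated as a binary choice (flexed or not flexed).
n_muscles = 600
n_configurations = 2 ** n_muscles

# Python integers are arbitrary precision, so the digits can be counted directly.
n_digits = len(str(n_configurations))
print(n_digits)  # 181
```

So even this crude binary model gives a number with 181 decimal digits, far too many combinations to enumerate or to learn about one at a time.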

 

Further reading and thoughts