Real time Reinforcement learning examples

Over the last 4 months, I’ve completed several projects using reinforcement learning on a Dynamixel servo that provides real-time, continuous sensorimotor data to the learning algorithms. In particular, I’ve experimented with creating, running, and measuring thousands of GVF demons making predictions in parallel, policy-gradient actor-critic methods, and Pavlovian control.

I’ve attempted to document what I learned as best I could. All code and experimental writeups can be found on my GitHub page at:

Taking a break from my “smart” phone

I think we’re underestimating the importance of continued serendipitous daydreaming. Creative thoughts and problem-solving insights have often come to me while lying in bed, walking to the coffee store, or sitting at a red light. But so many of these moments are being interrupted by the crack cocaine that is my iPhone. I wish I had the willpower to not give in; but short of that, I’m going to experiment with giving it up and going to a dumb old flip phone. I did this about a year ago and lasted a month – until I took a water ski trip into Sacramento. The last-second economist in me couldn’t take it.

Wish me luck!

How would you feel?

  • If you were beaten up, and 3 years later were told by the bully that they ought to have taken your lunch money while they were at it … and that maybe next time they will.
  • If your rich neighbor erected a giant fence to keep you out of their yard. And told you that you must pay for it.
  • If someone a few blocks away said you couldn’t visit their block. Even though you have nowhere else to go. And that you may have been living there lawfully in the past. Or have family members currently living there.
  • If you were told you wouldn’t be prioritized because of your religion.

Perception as Prediction – A predictive approach to knowledge


In natural and artificial intelligence, “perception” is thought to be the ability to become aware of something through our senses. Agents, human or artificial, may perceive that a fire is burning, or that there is a baseball traveling towards a glove. In either case, the intelligent system has become knowledgeable through its senses.

Traditional artificial intelligence systems have demonstrated perception by detecting objects. A robot may learn about chairs, and then later perceive that a chair is in front of it. Such a system could learn to perceive chairs through supervised learning. The system is shown thousands or millions of examples of chairs, so that when it encounters a new one, it perceives its presence.

A supervised learning approach such as this is demonstrably powerful, but a new reinforcement learning approach to achieving “perception” through experience may also be possible.

Consider a newborn baby. In a matter of hours, the baby is able to “perceive” its mother’s breast. It does so without a teacher showing it thousands of flashcards containing pictures of breasts. It does so through its interactions with its mother. Through these interactions, in certain states, the baby is able to perceive that it’s in the presence of its mother’s breast because it can predict drawing milk if it were to perform a sequence of actions. Perception, and this predictive approach to knowledge, can be systematically achieved in a similar way using reinforcement learning.

Sutton’s “Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction” and White’s “Developing a predictive approach to knowledge” form the foundation of a computational way to learn to perceive. The high-level takeaways from both papers are included below.

Key ideas and notes:

  • “General value functions” (GVFs) are the computational units that answer predictive questions about the environment. They provide the semantics for asking and answering these predictive questions.
  • These GVFs are, not surprisingly, a generalization of the value functions familiar throughout the Reinforcement Learning literature.
  • GVFs have “targets” rather than total reward, and “cumulants” rather than rewards.
  • The goal when using GVFs is not to maximize the cumulant. It’s to be able to predict the cumulant by passively observing the sensorimotor stream of experience.
  • There are several “levers” at the designer’s disposal to design the questions.
    • The designer can specify the cumulant, which is effectively the signal of interest to predict. For example, one might be interested in predicting the amount of energy that will flow into the robot’s battery, similar to the analogy with the baby.
    • The designer can specify the time scale of interest through the value used for gamma. Using the baby example yet again, a predictive question about how much energy could be derived in the next minute may be asked by choosing gamma so that the effective horizon, roughly 1/(1 − gamma) time steps, covers about a minute.
    • The designer can control the policy. For example: how much energy will I draw if I drive straight forward continually?
  • Unique to GVFs, the value for gamma is state dependent. This is how you can effectively turn off the question.
  • Learning GVFs is similar to learning regular value functions. It uses function approximation (for example tile coding) for state representation, TD errors, and eligibility traces.
  • Once learned, the approximate GVF value (the answer) is obtained by taking the inner product of the learned weight vector and the feature vector.
  • Multiple GVFs (demons) can be learning at the same time. They can each learn off-policy, using snippets of relevant experience generated by the behavioral policy of the agent.
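
As a rough illustration of the notes above, here is a minimal sketch of a single demon learning one GVF. It is not the Horde implementation: Horde learns off-policy with GQ(λ), while this sketch uses plain on-policy TD(λ) with linear function approximation to keep things short. The class name `GVFDemon`, the four-state loop, and the question being asked ("how many steps until I am back at state 0?") are all made up for the example. The cumulant is 1 on every step, and gamma drops to 0 on arrival at state 0, which is the state-dependent gamma "turning off" the question.

```python
class GVFDemon:
    """One GVF learner: on-policy TD(lambda) with linear function approximation."""

    def __init__(self, num_features, step_size=0.1, trace_decay=0.9):
        self.weights = [0.0] * num_features
        self.traces = [0.0] * num_features
        self.step_size = step_size
        self.trace_decay = trace_decay  # lambda
        self.gamma = 0.0  # gamma of the current state; 0 means no traces carry over

    def predict(self, features):
        # The "answer" is the inner product of the weights and the features.
        return sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, cumulant, next_gamma, next_features):
        # Decay the old traces by gamma * lambda, then add the current features.
        decay = self.gamma * self.trace_decay
        self.traces = [decay * e + x for e, x in zip(self.traces, features)]
        # TD error: cumulant plus the discounted next prediction, minus this one.
        td_error = (cumulant
                    + next_gamma * self.predict(next_features)
                    - self.predict(features))
        self.weights = [w + self.step_size * td_error * e
                        for w, e in zip(self.weights, self.traces)]
        self.gamma = next_gamma


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def run_demo(num_steps=4000):
    """Hypothetical loop 0 -> 1 -> 2 -> 3 -> 0; returns one prediction per state."""
    num_states = 4
    demon = GVFDemon(num_states)
    state = 0
    for _ in range(num_steps):
        next_state = (state + 1) % num_states
        cumulant = 1.0                                 # count one step
        next_gamma = 0.0 if next_state == 0 else 1.0   # question ends at state 0
        demon.update(one_hot(state, num_states), cumulant,
                     next_gamma, one_hot(next_state, num_states))
        state = next_state
    return [demon.predict(one_hot(s, num_states)) for s in range(num_states)]


if __name__ == "__main__":
    print(run_demo())
```

In this toy loop the predictions settle near 4, 3, 2, and 1 steps for states 0 through 3. Running many demons in parallel would amount to keeping a list of such learners, each with its own cumulant and gamma, and feeding them all the same stream of features.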

Horde presents some demonstrations of robots that are able to learn multiple different GVFs. It seems, however, that a missing step is actually leveraging these predictions, either to optimize control or to build other predictions. The latter is a major topic of Mark Ring’s work, in particular “Better generalization with forecasts.”


  • [Sutton et al., 2011] Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White, and Doina Precup. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems – Volume 2, AAMAS ’11, pages 761–768, 2011.
  • White, Adam. Developing a predictive approach to knowledge. Diss. University of Alberta, 2015.

Perspectives and Problems in Motor Learning – notes

These are my notes after reading “Perspectives and Problems in Motor Learning” – Daniel Wolpert, 2001.


The paper is intended to explore the mechanics that humans and computers use to “learn” motor skills. This includes comparing how humans learn motor skills with how different fields of artificial intelligence, including supervised, unsupervised, and reinforcement learning, may accomplish the same thing. It also suggests some of the problems and barriers that existing technology faces in truly scaling this type of learning.


Notes (representing what I surmise is suggested in the paper – not necessarily what I agree with):

  • The brain’s whole purpose in life is to produce movement, since movement is the only way to truly interact with the world (even speech is a result of movement).
  • We are born with some innate motor controls, and this is demonstrated by sensory-deprived babies (blind and deaf) who don’t need to learn to smile. They just do it.
  • However, despite the innate motor skills, learning is still clearly required, as the world is non-stationary. The world clearly changes, as do our bodies (not just kids turning into adults and people gaining weight, but people losing teeth, growing fingernails, etc).
  • Motor learning could deal with the brain learning how to send control signals, or the body (muscles) learning how to evolve. Most of motor learning deals with the former.
  • Motor learning is really the mapping from motor commands to sensory consequences, and the inverse.
    • The forward view (motor command => sensory consequences) creates a model predicting what will be sensed if the agent behaves in a certain way. This is where supervised learning is most obvious: an input is mapped to an output, with the agent’s own experience serving as self-generated training data.
    • The inverse view (sensory consequences => motor command) is what allows the agent to decide which action to perform to achieve a desired goal. This clearly uses reinforcement learning, since there are many different paths to a goal but the optimal one is difficult to find.
    • Unsupervised learning seems to be what allows motor primitives to be found. With these primitives, other more sophisticated actions can be learned to generate the desired consequence.
  • There appears to be evidence that this model (supervised + unsupervised + reinforcement learning) is similar to how the brain works using dopamine and the cerebellum.
  • But there are significant problems with this model. One example is that the rate at which visual data is processed doesn’t seem to be fast enough to make these foreground decisions. Furthermore, representations are needed, and the state-action space is massive: if each of 600 muscles could be either flexed or not flexed, there are 2^600 possible combinations.
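
To get a feel for how large that last number is, here is a quick check in Python. The flexed or not flexed treatment of each muscle is the simplification from the notes above, and the variable names are just for this example.

```python
# Each of the 600 muscles is treated as simply flexed or not flexed.
num_muscles = 600
num_configurations = 2 ** num_muscles  # Python integers have arbitrary precision
num_digits = len(str(num_configurations))
print(f"2^{num_muscles} has {num_digits} decimal digits")
# prints "2^600 has 181 decimal digits"
```

In other words, the count of flexed or not flexed combinations is on the order of 10^180, far too many to enumerate one by one.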


Further reading and thoughts



Scaling Horizontal in Reinforcement Learning

The ability to learn on its own, through experimentation with the world rather than through human-labeled training sets, is a major strength of Reinforcement Learning. Combined with deep learning, this is a fantastic strategy for problems such as learning to play Backgammon and Go. Learning these games requires millions of training episodes, all of which can be simulated and planned by the agent using approximated models.

However, the temporal nature of Reinforcement Learning can be a limiting factor in its ability to scale to real-world problems. Learning non-stationary problems, such as irrigation control on crops, seems like a perfect application of Reinforcement Learning. And in many ways it is. The problem is that, unlike training on Backgammon, you simply can’t run millions of simulations of the game of “growing plants.” You have to wait on nature to learn from your experience and the consequences of applying more or less water.

One way to solve this would be through better models. Observing the health of the plant at a finer level, one that changes by the minute or even the second, could provide enough signal to learn from. But those types of sensors may not exist before a second tool becomes prevalent: what I’d call horizontal reinforcement learning (I’m sure someone has coined a much better term). It is somewhat similar to multiple agents collaborating to solve a problem, but subtly different: the agents are iid (independent and identically distributed) and pool their experience to learn faster. Imagine a crop that is sectioned off into 1 million tiles, each with its own irrigation system. In one night, you could learn from 1 million different experiments rather than just one. In such distributed systems, the temporal difference algorithms all hold up when pooling experience, and the computational challenges (latency and concurrency) are those of any distributed system, for which computer science has relatively strong solutions.
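
To make the pooling idea concrete, here is a small sketch under heavy assumptions: a made-up three-state "soil" chain whose next state is uniformly random, a made-up growth reward, and tabular TD(0) prediction rather than full control. Every tile is an independent copy of the same chain, and every tile’s nightly transition updates one shared value table. The names `pooled_td`, `GROWTH`, and `TRUE_VALUES` are hypothetical.

```python
import random

GAMMA = 0.9
STEP_SIZE = 0.05
NUM_STATES = 3              # hypothetical soil states: 0 = dry, 1 = ideal, 2 = soggy
GROWTH = [0.0, 1.0, 0.0]    # hypothetical reward for a night spent in each state
# With a uniformly random next state, v(s) = GROWTH[s] + GAMMA * mean(v),
# which works out to these values for the numbers above.
TRUE_VALUES = [3.0, 4.0, 3.0]


def pooled_td(num_tiles, num_nights, seed=0):
    """Tabular TD(0) where every tile updates the same shared value table."""
    rng = random.Random(seed)
    values = [0.0] * NUM_STATES  # shared across all tiles
    states = [rng.randrange(NUM_STATES) for _ in range(num_tiles)]
    for _ in range(num_nights):
        for tile in range(num_tiles):
            state = states[tile]
            next_state = rng.randrange(NUM_STATES)  # made-up dynamics
            target = GROWTH[state] + GAMMA * values[next_state]
            values[state] += STEP_SIZE * (target - values[state])
            states[tile] = next_state
    return values


def max_error(values):
    return max(abs(v - t) for v, t in zip(values, TRUE_VALUES))


if __name__ == "__main__":
    print("1 tile, 30 nights:    ", max_error(pooled_td(1, 30)))
    print("1000 tiles, 30 nights:", max_error(pooled_td(1000, 30)))
```

After the same 30 nights, the shared table fed by 1000 tiles sits much closer to the true values than the table fed by a single tile, simply because it has seen 1000 times as many transitions. A real deployment would also have to deal with tiles that are not truly identical, which this sketch ignores.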


“Goals” in reinforcement learning

I want to create a mind that, through its own learning, makes real-time decisions. These decisions would likely be optimized to accomplish a goal – which can be represented by maximizing reward. So in this sense, goals help shape behavior and decision making.

Constraining this intelligence around goals, however, seems limiting. Imagine instead that we had an intelligent mind that was able to take an input stream and project all possible futures based on its actions. The ability to make these predictions would produce the ultimate mind.

However, projecting an unconstrained future in this way is impractical, at least for now, for 2 reasons. The first is performance. The second is learning. For performance, without goals, the agent is essentially performing a random walk. It’s akin to a water skier with no goal, performing random actions. Without a goal, the skier will fall almost immediately. Learning without a goal is equally problematic. With a goal, an agent is able to try different actions and assess how successful they were towards an end goal. It’s a pruning mechanism to determine which temporally extended actions to continue to learn about. A water skier learning to slalom ski is able to keep trying different levels of aggression in the edge change. If the goal, however, was to learn how to trick ski, the actions tried would be much different.

Markov Decision Process in Reinforcement Learning

I’ve been reading “Reinforcement Learning: An Introduction” by Sutton and Barto, taking notes on each chapter as I go.

Chapter 4 – Markov Decision Process took more time than I’d like to admit to understand. I’ll blame it on the larger than usual amount of statistics/math combined with the 10+ years away from University.

Nonetheless, I took notes as I slogged my way to a bit of a better understanding. Hopefully someone finds the notes attached below useful.