Behavior policy for multi-GVF (Horde) learning

One of my favorite papers I’ve read is “The Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction.”

TLDR – As the name suggests, it presents a reinforcement learning agent architecture that is able to learn, and make, tens of thousands of predictions at every time step. These predictions are grounded in low-level sensorimotor observations. Each prediction is made by a general value function, and some of them can be off-policy predictions. In other words (for those not as familiar with RL terminology), these are predictions about what would happen IF the agent were to behave in a certain manner, not necessarily the way it is currently behaving. For example, while driving home from work, I could start to learn (and in the future predict) how long it would take me to get to my local grocery store. I don’t intend to go into the implementation details, but the high-level intuition is that the agent can make this prediction (how long to drive to the grocery store) while doing something different (driving home) because there is behavioral overlap between the two behaviors. Where there is overlap, the agent can learn.

This off-policy learning is powerful. An agent is able to learn about a target policy that is not necessarily the same as the policy it is currently following. This is especially valuable when the agent does not directly control its behavior. However, there is a downside: the algorithms are more complicated. With GTD(λ), for example, a second weight vector needs to be stored and updated, and the importance-sampling ratio rho needs to be calculated at every step. The extra weight vector also increases storage / memory cost. So the off-policy advantage comes at a cost.
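
To make that overhead concrete, here is a rough sketch of a per-step GTD(λ) update with linear function approximation, written from memory and simplified – the exact correction terms should be checked against the published pseudocode. The point is simply that, compared to an ordinary TD/SARSA update, there is a second weight vector, a second step size, and a rho scaling the trace:

```python
import numpy as np

class GTDLambda:
    """Sketch of an off-policy GTD(lambda) prediction learner.

    Note the two weight vectors (theta and w) and the importance-sampling
    ratio rho scaling the eligibility trace on every step.
    """

    def __init__(self, num_features, alpha=0.01, beta=0.001, lam=0.9):
        self.theta = np.zeros(num_features)  # primary (prediction) weights
        self.w = np.zeros(num_features)      # secondary (correction) weights
        self.e = np.zeros(num_features)      # eligibility trace
        self.alpha, self.beta, self.lam = alpha, beta, lam

    def update(self, x, x_next, cumulant, gamma, rho):
        # TD error for this GVF's question (the cumulant and gamma define the question)
        delta = cumulant + gamma * self.theta.dot(x_next) - self.theta.dot(x)
        # Trace is decayed and scaled by rho = pi(a|s) / mu(a|s)
        self.e = rho * (self.lam * gamma * self.e + x)
        # Primary weights: TD update plus a gradient-correction term using w
        self.theta += self.alpha * (delta * self.e
                                    - gamma * (1 - self.lam) * self.e.dot(self.w) * x_next)
        # Secondary weights: track the expected TD error per feature
        self.w += self.beta * (delta * self.e - x.dot(self.w) * x)
        return delta
```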

I wonder, in cases like Horde – where the agent controls the behavior policy directly, and the purpose is to learn (rather than to maximize some reward) – how another approach would work. Perhaps each GVF could be learned on-policy. Rather than GTD(λ), a simple SARSA algorithm could be used, and each GVF's target policy would be followed for a certain number of time steps. Because all of the experience would be relevant to the GVF being followed, learning should happen faster, in theory. Furthermore, there would be less computational overhead, so the slack could be used for planning or some other task. Going further, rather than just cycling through the GVFs to determine the behavior policy, perhaps you could choose the GVF that needs learning the most, follow its policy for a while, and then move on to the next. What would "needing learning the most" mean? I suppose it could be based on the current average TD error, combined with the number of times that GVF's policy has been followed. This approach would be similar to curiosity-inspired exploration.
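
Here is a very rough sketch of what that selection rule might look like. The priority formula, class names, and environment interface are all mine and purely illustrative: each GVF keeps a running average of its absolute TD error and a count of how often its policy has been followed, and the behavior policy repeatedly picks the GVF with the highest "error per experience" score and follows it on-policy for a window of steps.

```python
import numpy as np

class GVFSlot:
    """One on-policy GVF: its target policy, a SARSA-style learner for its
    prediction, and running statistics used to decide when it needs experience."""

    def __init__(self, policy, learner):
        self.policy = policy          # callable: state -> action
        self.learner = learner        # e.g. a linear SARSA learner for this GVF
        self.avg_abs_td_error = 1.0   # optimistic start so new GVFs get tried
        self.steps_followed = 0

    def priority(self):
        # High TD error and little experience -> high priority (curiosity-like)
        return self.avg_abs_td_error / np.sqrt(1 + self.steps_followed)

def run_horde(gvfs, env, window=1000, total_steps=100_000):
    """Behavior policy by need: pick the GVF whose prediction seems least
    learned, follow its policy for `window` steps, then re-evaluate."""
    state = env.reset()
    for _ in range(total_steps // window):
        gvf = max(gvfs, key=lambda g: g.priority())
        for _ in range(window):
            action = gvf.policy(state)
            next_state, cumulant, done = env.step(action)
            delta = gvf.learner.update(state, action, cumulant, next_state)
            # Exponential moving average of |TD error| as a crude "needs learning" signal
            gvf.avg_abs_td_error += 0.01 * (abs(delta) - gvf.avg_abs_td_error)
            gvf.steps_followed += 1
            state = env.reset() if done else next_state
```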

This idea of comparing behavioral approaches in Horde (what would the performance measure be for comparing the different approaches?) seems like it could be interesting.

The maths

I must admit that at times I’ve been frustrated by a lack of mathematical competency during my graduate studies in reinforcement learning. Other times, though, I feel like my current level of mathematical competency is perfectly fine. I generally have enough intuition to understand the papers I’m reading. Part of me feels that when I allow myself to be frustrated, I’m letting a certain type of author win – the type of researcher who dresses up fairly underwhelming results in overly complex mathematics to make them look more impressive than they really are. Often, complex math is needed … but other times it feels like I’m being shown dancing girls by a biz dev expert.

That said, the frustration is real.

I finished my undergraduate degree in 2002 (currently 15 years ago) with a major in Computer Science. I took the required mathematics courses (2 calculus courses, a linear algebra course, statistics, and a numerical analysis course). I thrived in these courses (I’d have to look, but I’d guess my average was around 95% where the course average was 60%) and generally enjoyed them. I’m not exactly sure why I didn’t pursue more. That said, the knowledge I gained from those courses is both distant and lacking compared to what would be required to easily understand the papers I read. For example, I was reading a paper the other day about Fourier transforms. A simple enough concept that I’ve studied before, but distant enough that I couldn’t quite remember the details I needed to parse the paper. The week before, it was “LSI systems.” In both cases I googled each term, which unearthed 10 more terms and mathematical concepts I was unfamiliar with. Each term may require a week’s dedication, if not more, to truly understand. It felt like peeling an onion.

I’ve thought about possible solutions. In my first year (last year) I audited a few classes – a linear algebra course and a statistics course. I should probably do more of that during my grad studies. But as always, 3 hours of lectures a week, plus time getting to and from class, adds up to a pretty expensive investment. Not to mention that to truly get the most out of these courses, one must not just attend the lectures, but do the assignments, prepare for the exams, etc. So this bottom-up approach (reading analysis and probability textbooks, taking MOOCs) is problematic because of its time cost when you have conferences to attend, pressure to publish, other classes to take for credit, etc. Furthermore, the signal-to-noise ratio of this approach is pretty low. That statement isn’t meant to minimize the value of understanding the entire content of a mathematics textbook. But for the purpose of applying it directly to research, a lot of pages simply aren’t relevant. The only way to unearth the valuable bits, however, might be to read the entire thing. In a perfect world, there’d be a system that would give me just enough content to understand the concept I wanted. This system would understand my current level of knowledge. I could query “Langevin flow” and be returned a list of pointers to articles, MOOCs, etc., tailored to what I already know. Google can’t do this.

I’ve thought about a more top-down approach – googling every term that’s unfamiliar – but as I stated before, that seems like exponential tree expansion.

Another, more extreme solution would be to spend a semester or two doing nothing but studying the appropriate math courses – actually taking the courses (maybe for credit, or at least doing the assignments and exams). But obviously, as stated before, grad students have deadlines and should really be publishing and advancing research … not just taking background courses. This would have to be framed as a pseudo-sabbatical.

All this said, maybe it’s much ado about nothing. I have a general intuitive understanding of most concepts I’ve encountered, and for those that required a more intimate understanding, I’ve been able to pick it up. Perhaps I could even argue that a beginner’s understanding of mathematical concepts can be an advantage over more mathematically minded researchers – those who treat their mastery of equations and proofs as a hammer seeking a nail, solving problems that don’t really matter. My lack of mathematical know-how hasn’t narrowed my research interests into purely application-based areas (although I suppose that wouldn’t be a problem if it did; there’s nothing wrong with application research as opposed to more theoretical work).

Nonetheless, the frustration is real. I want to understand a bit more, and I don’t think I’m merely being tricked by dancing girls and flashy math equations. A better mathematical foundation would help. I’ll continue to use a more top-down approach (researching terms and ideas on demand, ad hoc) while sprinkling in a bit more bottom-up – listening to math lectures when I have the chance. The latter is easier said than done, though, as “down time” is something that doesn’t occur that often. Perhaps I’ll dedicate myself to a MOOC to force the issue (the question of which content / MOOC notwithstanding).

I’d love to hear other solutions to this frustration though ….

Perception vs. Knowledge in artificial intelligence

“Perception” and “knowledge.” I believe I’ve been guilty of using these words in the past to communicate research ideas without being overly clear about what they mean to me. When borrowing words such as these, or ones such as “memory,” from other disciplines, one should use them as consistently with their original meaning as possible. But I don’t believe there is anything wrong with choosing words outside their original context, so long as you clearly articulate the meaning of the words you’re choosing.

“Perception” and “knowledge” were two such words I was using today while talking with a fellow graduate student at the University of Alberta. Partway through our conversation, it became clear that I wasn’t being overly clear about how I was defining either word. We were discussing the idea of “perception as prediction,” an idea I have blogged about briefly in the past. The basic premise is that what we “perceive” to be true are actually predictions about what we believe would happen if we were to take certain actions. A baby perceives that it’s in the presence of its mother’s breast because it predicts that if it were to take certain actions, it would receive sensorimotor stimulation (breast milk). A golfer perceives that she is standing on the green because she predicts that if she were to strike the ball, it would roll evenly over the ground. This theory of perception could be taken further to suggest that I perceive that Donald Trump is the current President of the United States because I predict that if I were to enter his name into Google, I would discover that he is indeed the president.

The first two examples (the baby with breast milk, and the golfer) seem to fit nicely into a “perception” framework. To me, perception seems to be grounded in making sense of the immediate sensorimotor stream of data. “Making sense” perhaps means predicting what the implications for the sensorimotor stream would be if certain actions were taken now. However, the example of perceiving that Donald Trump is the president seems to state something beyond my immediate sensorimotor stream of data. It’s less about “perceiving” what these observations mean, and more like “knowledge” that can be drawn upon to make other predictions. Knowledge seems to be based on making sense of what WOULD happen to my sensorimotor stream if I took certain actions now. Is this differentiation arbitrary? Is perception just a special case of knowledge – knowledge grounded in the immediate senses rather than future senses? I believe most people separate knowledge from perception in this way. But how do we “draw upon” knowledge when it’s needed? Is this knowledge hidden somehow in the sensorimotor data? Is it part of a recurrent layer in a network? Or perhaps it really is no different from “perception,” in that this knowledge regarding Donald Trump could be one of millions of general value functions in a network of general value functions. If that is the case, then couldn’t you, in addition to saying “perception as prediction,” also say “knowledge as prediction”? I believe this is, in fact, exactly what researchers have argued in papers such as “Representing knowledge as Predictions.”

To me, perception refers to an ability to make further, more abstract sense of my immediate senses. To make “further, more abstract sense” means being able to know that I’m “in a car” if I see street signs and hear the hum of a motor. This more abstract sense of being “in a car” could be modeled as a set of predictions that hold true given the current sensorimotor input. Knowledge, on the other hand, can be modeled the exact same way: through a set of predictions grounded in immediate senses. The difference, perhaps, is that perception deals with more subjective information – it’s about making more abstract sense of my immediate senses. But the mechanism for computing perception and knowledge could be the same. If that is the case – if the two computations take the same input and their output is used in the same way – then what is the difference? Perhaps nothing.
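
To make the “set of predictions” idea a little more concrete, here is a minimal sketch of how such a prediction might be specified as a general value function question. The field names and both examples are mine and purely illustrative; the point is only that the same structure can hold a “perception”-like question grounded in the immediate stream and a “knowledge”-like question about a longer counterfactual.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GVFQuestion:
    """A predictive question: what would the (discounted) accumulation of the
    cumulant be if I followed this policy from the current situation?"""
    policy: Callable        # observation -> action: the "if I were to act this way" part
    cumulant: Callable      # observation -> float: the signal being predicted
    continuation: Callable  # observation -> float in [0, 1]: the prediction's horizon

# A "perception"-like question, grounded in the immediate sensorimotor stream:
# if I kept listening, how much motor hum would I accumulate over the near future?
in_a_car = GVFQuestion(
    policy=lambda obs: "keep_listening",
    cumulant=lambda obs: obs["motor_hum_intensity"],
    continuation=lambda obs: 0.9,
)

# A "knowledge"-like question, about a longer counterfactual:
# if I searched his name on Google, would the result say "president"?
trump_is_president = GVFQuestion(
    policy=lambda obs: "search_google_for_trump",
    cumulant=lambda obs: 1.0 if obs["page_says_president"] else 0.0,
    continuation=lambda obs: 0.0 if obs["search_complete"] else 1.0,
)
```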

 

Startup ideas are products of environment

I used to fashion myself as an “idea” guy … in terms of ideas of the startup variety. I felt like I could, at any point, rattle off 4 or 5 startup ideas … and not ALL of them would be laughed at by an investor. Somehow I believed there was something innate in this ability. However, since coming back for grad school at the University of Alberta, my perspective on this “ability” has changed.

It’s been over a year since I started my research in reinforcement learning. During this time, I can’t say I’ve had a single startup idea (not quite true, but close). You’d think I could say it was because I didn’t give it any thought … but that’s not entirely true. There have been moments where I’ve tried to think of ideas, and each time I’ve grasped at straws. Maybe the lack of ideas stems from the fact that I approach the search armed with my reinforcement learning hammer, seeking a nail. This approach (trying to find a problem your technology expertise can solve) could be argued to be an anti-pattern for coming up with good ideas. Furthermore, it seems that reinforcement learning is in the infancy of being applied to real-world problems, and still struggles to find traction because of the lack of data (or, to be more precise, the time it takes to acquire real-life data). I blogged about this challenge before in a post called “Scaling horizontally in reinforcement learning.” But I think the root of it might be environmental. My time at the water cooler is spent talking about how to optimize algorithms, or how to leverage GVFs to form predictive state representations. It’s not spent talking about creating an app for that, as it was when I spent my days in the Bay Area startup scene. For the record … I’m not complaining about this change.

The effect seems intuitive. Obviously, someone who is immersed in an environment where everyone and their dog is talking about startup ideas is going to have a few ideas of their own. But until I was immersed in the academic environment, immediately after spending several years working at startups in SOMA, I didn’t realize how much of an effect the environment has on the type of problems a person is attempting to solve.

 

Perhaps time really does move faster as you age …

When I was a kid, summer holidays (those two magical months of July and August) seemed like an eternity. Now, however, as a 37-year-old, it seems like it was just yesterday that I was launching fireworks to celebrate Labor Day weekend (we really did, and it was amazing!). That was over 2 months ago. I’m sure most “adults” can relate to this feeling of time moving faster the older you get. But perhaps there’s a reasonable explanation for this effect.

Time is conventionally thought of as continuous. I’m sure quantum physicists much more intelligent than I am have postulated a discrete time domain, but for now, I perceive time as continuous. However, I have recently been implementing reinforcement learning algorithms in robotic domains, where time is discretized.

In such domains, the robot agent “wakes up” at a certain frequency to compute. Each time the agent “wakes up,” it must choose an action, take the action, observe the environment, and finally learn.

[Figure: RL.png – the agent-environment interaction loop]

In the robot environments I have worked with, I, the designer, have defined the rate of these learning cycles. This rate determines how often the agent “wakes up” – observing its most recent environment and taking an action. To such an agent, it is easy to imagine that the only definition of “elapsed time” is the number of learning cycles it has processed. It has no concept of what happened between these learning cycles, let alone how long they took.
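
A minimal sketch of the kind of loop I have in mind (the agent and environment interfaces here are just illustrative): the designer fixes the cycle time, and the agent’s only clock is the cycle counter itself.

```python
import time

def run_agent(agent, env, cycle_time=0.1, total_cycles=10_000):
    """Discretized agent-environment loop: the agent wakes up every
    cycle_time seconds; its only notion of elapsed time is the cycle count."""
    observation = env.observe()
    for cycle in range(total_cycles):
        action = agent.choose_action(observation)           # choose an action
        env.take_action(action)                              # take the action
        time.sleep(cycle_time)                               # sleep until the next wake-up
        next_observation = env.observe()                     # observe the environment
        agent.learn(observation, action, next_observation)   # finally, learn
        observation = next_observation
```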

It is natural to believe that a young child has a much more active brain, processing at a higher frequency than an older senior. Imagine if a young child “learns” 1000 times per second, while a senior learns only 100 times per second. To the child, a year represents 10X more learning cycles, so it quite literally feels 10 times as long. There is a similar intuition about the perception of elapsed time (or lack thereof) for people waking up from a night’s sleep, or from a comatose state.

I am making huge generalizations when comparing the human brain to this simple agent / environment framework. I barely have a basic understanding of neurology, and I suspect that the brain doesn’t just operate on a single discrete observation set at a fixed frequency, so the comparison to the simple RL setting is somewhat naive. However, at some level, one could imagine that the computational frequency of the human brain slows down with age. If that is the case, and if you believe that frequency is the only metric we have for perceiving the passage of time, it only seems natural that time does indeed speed up as we age.

Dynamic Horde of General Value Functions

Just finished documenting my work on creating an architecture for a dynamic horde of general value functions.

General value functions (GVFs) have proven to be effective in answering predictive questions about the future. However, simply answering a single predictive question has limited utility. Others have demonstrated further utility by using these GVFs to dynamically compose more abstract questions, or to optimize control. In other words, to feed the prediction back into the system. But these demonstrations have relied on a static set of GVFs, handcrafted by a human designer …

https://github.com/dquail/RLGenerateAndTest

Dynamic Horde

I’m just finishing up a research project on an architecture for a dynamic set of general value functions. I’ll share the documentation and source on http://github.com/dquail as per usual, but in the meantime I wanted to share the abstract.

General value functions (GVFs) have proven to be effective in answering predictive questions about the future. However, simply answering a single predictive question has limited utility. Others have demonstrated further utility by using these GVFs to dynamically compose more abstract questions (Ring 2017), or to optimize control (Modayil & Sutton 2014). In other words, to feed the prediction back into the system. But these demonstrations have relied on a static set of GVFs, handcrafted by a human designer.

In this paper, we look to extend the Horde architecture (Sutton et al. 2011) to not only feed the GVFs back into the system, but to do so dynamically. In doing so, we explore ways to control the lifecycle of GVFs contained in a Horde – mainly to create, test, cull, and recreate GVFs, in an attempt to maximize some objective.
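
As a rough illustration of the create / test / cull / recreate lifecycle described in the abstract (this is a sketch of my own making, not the actual implementation in the repository above – the function names, culling rule, and environment interface are all assumptions): new GVFs are proposed, every GVF in the horde learns off-policy from the shared stream of experience for a while, the poorest performers are culled, and the freed slots are refilled.

```python
def dynamic_horde(create_gvf, env, horde_size=100, cull_fraction=0.2,
                  test_steps=10_000, generations=50):
    """Generate-and-test loop over a horde of GVFs: create, test, cull, recreate."""
    horde = [create_gvf() for _ in range(horde_size)]
    state = env.reset()
    for _ in range(generations):
        # Test: every GVF learns (off-policy) from the same stream of experience
        for _ in range(test_steps):
            action = env.behavior_action(state)
            next_state = env.step(action)
            for gvf in horde:
                gvf.learn(state, action, next_state)
            state = next_state
        # Cull: drop the GVFs judged least useful (here, the highest average TD error)
        horde.sort(key=lambda g: g.average_td_error())
        keep = int(horde_size * (1 - cull_fraction))
        # Recreate: fill the freed slots with newly proposed GVFs
        horde = horde[:keep] + [create_gvf() for _ in range(horde_size - keep)]
    return horde
```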