Reinforcement learning startups?

I spent some time mingling with some CTOs of iNovia portfolio companies today. It was fun getting a bit back into the startup world, especially with the obvious AI slant to the meeting today. One subject that I discussed a lot with some of them was the apparent lack of startups leveraging Reinforcement Learning.

I’m assuming that this perception is true, but admit that I have no data on this other than the personal and anecdotal. Startups leveraging deep learning significantly outnumber those using RL, at least based on the press clippings I read and the startup founders I know first hand. So I believe there is something to this. That raises the natural question: why?

I’ve blogged about this before, but the discussions today went a bit beyond this. No one, obviously including myself, claimed to have the answer. That said, I believe there are several reasons for the apparent lack of RL startups, most of them pivoting around the availability of data (or lack thereof).

Data in RL is temporally based. Unless you’re simulating games on the Atari, it is difficult to come by, especially because it is often generated by taking actions in the “real world.” You can’t just use Amazon Mechanical Turk to label 500,000 images of cats to train your network. You have to generate a real stream of experience. That’s more difficult to do.

More specific challenges of the acquisition of data might be:

  • The cost of experimentation – Exploration is key in creating the data used for reinforcement learning. When simulating games of Go, one can experiment and make a move that is thought to be poor. The worst that can happen is that your agent’s self-esteem takes a hit because it loses another game! The stakes are low. But a real-time stock-trading agent suffers a significant cost when experimenting. This exploration cost isn’t suffered in supervised classification type approaches.
  • Delayed value – Any value derived from a reinforcement learning agent doesn’t occur until after the agent has been trained. It’s hard to convince an organization to adopt your product if they have to wait a month for it to learn before providing value. Because of the lack of simulators, these agents must learn on the fly. So when they’re deployed they don’t provide immediate value.
  • Temporal nature of data – I’ve blogged about this before, but reinforcement learning data is temporal, and often rooted in “real time.” Take agriculture data, for example. It takes a full calendar year to acquire results of how a crop may perform, rather than 0.001 seconds to simulate a game of Go. This is similar to a degree for acquiring data for a manufacturing plant. We’re at the early stages of RL adoption, so this data just isn’t there yet, and beginning to acquire it is difficult because of the reason above: why acquire the data in the first place if there’s no immediate payoff?
  • Infancy of the technology – I don’t buy this one, but it was mentioned that RL is a more novel approach than supervised learning. RL, supervised, unsupervised: they’ve all been “around” a long time. Computational power and the availability of data are what have given rise to supervised learning. I’d conclude by saying that for RL we now have the computational power, but we still lack the data.

“Hard” questions

Not to be confused with “The Hard Thing About Hard Things!” … which … incidentally is a fantastic read. The TL;DR is that nothing in business (and in life?) follows any sort of recipe. You’ll be faced with countless difficult situations in leadership, for which there are no silver bullets. So Ben Horowitz shares a whole bunch of stories about hard situations, and how they were handled (not always well).

Ok … so enough about the book plug. Back to “hard questions,” this time with respect to the “questions” that a general value function (GVF) is looking to answer.

I’ve thought about this a bit recently. In a family of GVFs (questions) that an agent could ask about its data stream, some questions seem intuitively easy to answer, while other types of questions seem hopeless. I hadn’t given it much thought until I heard another grad student make a similar reference during a really interesting talk about general value functions today. Her presentation compared different algorithms used by a GVF in a robot setting. The comment in particular was that the algorithms all performed similarly if the question “was easy,” but when the question was “hard,” the different algorithms performed differently.

This obviously reminded me again about what a “hard question” really is. What do we mean when a prediction, or GVF is “hard”?

In a robot setting, a GVF answers a question about its sensorimotor stream. Given a policy, “how much light will I see in the near future?” and “how many steps will I take before I hit a wall?” are examples of GVF questions. Some of these seem difficult to answer.
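
To make the idea of a GVF “question” a bit more concrete, here is a minimal Python sketch. It assumes the usual GVF framing, where a question is specified by a signal to accumulate (the cumulant) and a per-step continuation factor (gamma). The function name `gvf_return` and the toy numbers are mine, purely for illustration. For a single trajectory, the answer to the question is the discounted sum computed below; the prediction a GVF learns is the expected value of that sum under the given policy.

```python
from typing import Sequence


def gvf_return(cumulants: Sequence[float], gammas: Sequence[float]) -> float:
    """Discounted sum c[0] + g[0]*c[1] + g[0]*g[1]*c[2] + ... for one trajectory.

    cumulants[i] is the signal observed on step i, and gammas[i] is the
    continuation factor applied to everything that follows step i.
    """
    total = 0.0
    # Work backwards so each step only needs the return that follows it.
    for c, g in zip(reversed(cumulants), reversed(gammas)):
        total = c + g * total
    return total


# "How many steps before I hit a wall?": count 1 per step, stop (gamma = 0) at the wall.
steps_to_wall = gvf_return([1.0, 1.0, 1.0], [1.0, 1.0, 0.0])

# "How much light will I see in the near future?": the cumulant is the light
# reading, and a constant gamma below 1 defines what "near" means.
near_future_light = gvf_return([0.0, 1.0], [0.5, 0.5])
```

Framed this way, the two example questions differ only in which signal is summed and when the sum is cut off.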

I think it’s fair to say that when we say that a GVF is hard to learn, it means that it is difficult to come up with an accurate prediction.

So the next question then becomes, what makes it difficult to form an accurate prediction? I can see two reasons that a prediction may be difficult to approximate.

  • There isn’t enough relevant experience to learn from. Learning how long it would take to get from Edmonton to Calgary would clearly be “hard” to learn if you lived in Regina, and rarely travelled to Alberta.
  • Your feature representation is insufficient to answer this type of question. I’m less clear about what I mean by this. But it seems as though the features required to answer one question would be quite different than the features required to answer another type of question. “What color will I see in the near future?” quite clearly requires a different feature representation than “What sound will I hear in the near future?” If the robot agent only represents its state with visual sensors, the latter type of question will clearly be incredibly hard to answer. This question, to be fair, might be illogical. How can you ask a question about what you’d hear if you weren’t given audio senses? So perhaps a better example might be a question like “What is the chance I will see an ambulance pass me on the highway?” If my representation includes audio signals, this question may be easy to answer, because as soon as I “hear” a siren, I can quite accurately predict that I’d be able to see the ambulance. Without this audio signal, however, this question suddenly becomes much more difficult, if not impossible (if you remove my rear view mirrors).

Clearly there’s more to a “hard” question than this. But it seems these two attributes are a good starting place.

Is there a universal “knowledge”?

The “knowledge” is in the data.

In other words, predictions are made directly from this data stream, and these predictions are all that you need to represent “knowledge.”

That said, each agent (person) is exposed to a different subset of this data.

Therefore, everyone’s “knowledge” is personal. No?

What does this say about any sort of ground truth about knowledge? One’s “knowledge” may be in direct conflict with another’s. And both may be correct?

So, knowledge can conflict. Well, actually, since knowledge is completely subjective, there is really no universal ground truth to it. It’s personal.

What then about “facts”? Aren’t facts thought to be bits of knowledge that are “true”? Wait … Oh no … ALTERNATIVE FACTS!!!

Mobile applications as RL test environments

Atari is a popular environment for testing reinforcement learning algorithms. This got me thinking about the possibility of using mobile applications for such purposes. There might be some practical advantages to doing so, from both a researcher’s and a commercializer’s perspective.

For researchers, the ability to be exposed to multiple different sensors might provide value. Not only would you have access to pixel and audio data from the game, but you may also have microphone data from the environment. Also, accelerometer data can be included.

One of the biggest technical challenges to instrumenting an application for testing within this environment might be accessing the pixel stream in an efficient manner. From what I remember, there’s no way to efficiently access the current UIView pixel data. Creating the pixel data from the UIView is expensive computationally (TBD by how much), which may be problematic since this would need to be done at each time step. This is especially problematic if one were to provide RL tools / services to application developers (more on this later), since your solution would introduce latency – something to be avoided like the plague in mobile applications.

There would be obvious commercial appeal to an RL SDK + service that was easily integrated into an application. Such a service might, using senses common to any app (pixel data, sound) rather than needing to access any non-observable application state, be able to make predictions about user behavior. Such predictions would be valuable for an application developer. For example, the SDK could trigger a notification that there was a good chance the user would quit the app within 5 seconds. (Yes, enabling application developers to suck users into spending even more time staring at their phones may be a particular type of evil).

Another challenge technically is that a couple test applications would need to be used to develop such an SDK / service. Perhaps a larger company with a popular app would allow you to integrate with their application to perform research. Or perhaps better yet, in an Android environment, this pixel data may be available, so you could experiment with such a solution even without the cooperation of the app developer.

What does “memory” mean to me?

This sounds like a bizarre question. Memory is a pretty fundamental word we all use in our daily lives. But it’s a word that carries a lot of significance in the field of computer science – in particular the field of artificial intelligence. I use it. Often. And usually pretty haphazardly. At the encouragement of others within my department, I’ve been convinced to read a bit more about the psychology / neurology roots of the word. Actually … I should restate that. I have no idea what the etymology of “memory” is. I’m guessing it’s rooted in psychology. And then neurology. So I include those. But I’m sure there are more fields to look at when studying the definition of the word. I digress ….

I thought before I dove into looking up how these fields view “memory” I should document what I refer to when using the word.

I admit that my usage is likely fairly naive. When I refer to “memory” it’s generally in the context of a reinforcement learning agent. And it usually loosely means information from the past that is available to the agent at the current time step. So previous observations are memory. Traces are memory. Anything that encapsulates past information is “memory.” Hmm. That’s about it. But I realize that is a poor definition. By that definition, a value function is “memory.” It encapsulates information from the past, and makes it available to the agent at the current time step. But that’s clearly not what I mean. So I need to go further. “Memory” to me is more raw than a general value function. It’s close to an original observation from the past, available in the current time. But that’s not quite right either. I leave a little bit of wiggle room to massage the original observation into a representation available for the current step (but not massaged into a value function).

So there you have it. An incredibly lackluster and flawed definition. I’ll intentionally leave it at that for this post. (I think the point of studying the etymology of “memory” is not only to not annoy psychologists/neurologists by my usage of the term – but to inspire and motivate work within AI.)

Incremental things

In the startup world, calling an idea “incremental” is somewhat of an insult. Startup founders are constantly fed cliches about “going big or going home,” “disrupting industries,” “building monopolies” and “creating unicorns.” According to the script, incremental is boring. It’s unimaginative. It goes against the spirit of innovation. And it’s certainly not something you want to be accused of if you’re a tech entrepreneur, especially one raising venture capital.

This anti-incremental world, right or wrong, has been the world I’ve lived in for the last 10 years. But today, in a supervisor meeting talking about my thesis work, it was suggested that my current plans are potentially too complicated for a Masters degree. I should save that idea for my PhD, and focus on something incremental instead. I was taken a bit aback by this. I’ve become instinctively dismissive of anything but completely novel approaches in the past (Note – I’m not trying to pretend I’m Elon Musk here creating missions to Mars … I’ve started my fair share of fart apps in the past decade. But I do so always with a bit of shame). I need to think about it a bit more, but I think an incremental approach – at least to a Masters thesis – makes good sense.

Again – I need to do a bit more navel gazing – but the problem with a completely novel approach within scientific research (in my case, I am considering a new algorithm for discovering a cluster of beneficial general value functions within a reinforcement learning agent) is a matter of scope. Coming up with the algorithm, running a few experiments, and demonstrating results is perhaps the easy part. Explaining and justifying each decision point of the algorithm, and comparing it against competing algorithms, is the hard part. Not to mention that in a wide open research topic such as discovery (of features), the related work is immense. Each related paper and idea should be thoughtfully considered. For all these reasons, the scope explodes and perhaps exceeds that of a Masters thesis. I shouldn’t make the blanket statement that one can’t invent a new algorithm / architecture within the scope of a Masters thesis. But I do believe it is likely more appropriate for a PhD thesis.

Again, this idea that creating something new is too grand in scope is foreign in startups. Sure, there is the lean startup manifesto, which guides its followers to build things in small increments. But these increments are all intended to add up to something truly disruptive and novel. In the startup world, you could invent a new algorithm / service. It either works or it doesn’t (based on engagement). But in the scientific world, whether something “works” or not isn’t measured by user engagement. More thought needs to be dedicated to addressing each decision point, and to comparisons with other approaches. Note that achieving either (user engagement vs. a comprehensive description of the thought process and scientific steps taken to achieve a result) can be difficult. In the case of the former, it’s more of a dice roll. Like catching lightning in a bottle. You can get lucky and create something delightful for users in a couple months and “be done.” The idea that you need to go beyond creating something, and define and defend each decision point, is something I’m still getting acclimatized to.

The nice thing about doing something more incremental during a Masters thesis is that much of the groundwork has been laid for you. For example, the DeepMind UNREAL paper has drawn a lot of attention from us at the University of Alberta (because of our interest in GVFs and our many ties with DeepMind). It’s a fascinating body of work. It challenges the idea of a predictive feature representation – instead using the auxiliary tasks to sculpt the feature representation directly (by finding generally useful features across tasks). But many scientific questions arise because of this work. How would auxiliary tasks compare with a predictive representation in an environment like the compass world or cycle world? What is the sensitivity to the parameters in the auxiliary tasks? What types of environments do auxiliary tasks work best in? These are just a few of the questions that could be thought of as incremental. They’re not creating anything new. But they contribute many meaningful insights and knowledge towards the field. And they could form the basis of a good Masters thesis.

That said, thinking about this has only increased my desire to contribute something completely novel to the field. Perhaps the appropriate path to do so is in a PhD, once some of the foundation has been laid within a Masters thesis. The work from a Masters lends credibility to a PhD author creating something truly novel, not to mention that it directly informs the work done towards a PhD.


Forward vs. Backward view of memory

I’ve worked with a few reinforcement learning environments that are partially observable. The observations seen in these environments lack information to do a good job of identifying the current state within the environment. In these situations, different states are aliased – since many of them look the exact same.

What is an environment that is partially observable? Imagine standing in an extreme snow storm in the middle of an empty field. No matter where you look, no matter how you stand, all you see is white. All you feel is cold. Each state you are in, is hard to differentiate from the next. In this environment however, consider that there is a bench in the middle of the field. When you are right next to the bench, you can see it, and you can touch it. But as soon as you take a few steps away from it, it is no longer visible. There’s no way to sense it.

In this world, if you are to rely directly on your immediate observations (what you see and what you feel), all states look the exact same, with the exception of the state where you’re directly beside the bench. But as soon as you step one step away from the bench, you’re back to a state that is aliased with almost all the others.

There are two approaches that come to mind when creating a feature representation, that may get over this state aliasing problem for such environments. Both involve some element of using “memory,” so that the feature representation is not only comprised of what the agent currently sees.

  1. A recurrent memory based approach. In this approach the feature representation of each state consists of the observation of the current time step PLUS the observations from the previous n time steps. For example, if all I see is white snow, but I also know that I saw a bench last time step, I know quite precisely where I am (one step away from a bench).
  2. A predictive based approach. In this approach, the feature representation of each state consists of the observation of the current time step PLUS the predictions from the previous time step. For example, if all I see is white snow, but at the previous timestep I predicted that I was two steps away from the bench if I kept moving forward, I would now know where I was (again, one step away from a bench).

Both approaches seem to incorporate some form of memory (apologies for using “memory” quite loosely). The latter approach has a forward view of memory. Instead of looking back at the previous n time steps and summarizing what did happen, it looks ahead into the next n time steps and summarizes what will happen. I wonder how these two approaches would compare.
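
Here is a rough Python sketch of how the two feature representations might be built for the snow storm example. Everything in it is an assumption made for illustration: observations are reduced to a single bit (1 if the bench is visible, 0 otherwise), the helper names `history_features` and `predictive_features` are made up, and the prediction values are hard-coded rather than learned.

```python
from typing import List, Sequence


def history_features(observations: Sequence[int], t: int, n: int) -> List[int]:
    """Backward view: observation at step t plus the previous n observations.

    Steps before the start of the stream are padded with 0. Most recent first.
    """
    window = [observations[i] if i >= 0 else 0 for i in range(t - n, t + 1)]
    return window[::-1]


def predictive_features(observation: int, previous_predictions: Sequence[float]) -> List[float]:
    """Forward view: current observation plus the predictions made one step ago."""
    return [float(observation), *previous_predictions]


# 1 = "I can see the bench", 0 = "all I see is white".
walk_a = [0, 0, 1, 0]  # wandered in the snow, passed the bench, stepped away
walk_b = [1, 1, 1, 0]  # stood at the bench for a while, then stepped away

# The backward view distinguishes the two walks (with n = 3)...
back_a = history_features(walk_a, t=3, n=3)
back_b = history_features(walk_b, t=3, n=3)

# ...while the forward view gives them the same features, as long as the
# (here made-up) prediction "steps back to the bench" was the same for both.
fwd_a = predictive_features(walk_a[3], [1.0])
fwd_b = predictive_features(walk_b[3], [1.0])
```

In this toy case both walks end one step away from the bench, yet only the forward view maps them to the same feature vector.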

One thing that comes to mind would be that a forward view of memory might generalize better. In other words, regardless of how I got to a given state, if the predictions of the future are the same, wouldn’t you want each of these states to generalize the same?

Behavior policy for multi GVF (Horde) learning

One of my favorite papers I’ve read is “Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction.”

TL;DR – Like the name suggests, it presents a reinforcement learning agent architecture that is able to learn, and make, tens of thousands of predictions at any time step. These predictions are based on low level sensorimotor observations. Each of these predictions is made via a general value function, and some of them can be off-policy predictions. In other words (for those not as familiar with RL terminology), these are predictions of what would happen IF the agent were to behave in a manner that is not necessarily the same as the way it is currently behaving. For example, while driving home from work, I could start to learn (and in the future predict) how long it would take me to get to my local grocery store. I don’t intend to go into the implementation details, but the high level intuition is that the agent can learn this prediction (how long to drive to the grocery store) while doing something different (driving home), since there is overlap between these two behaviors. Where there is overlap, the agent can learn.

This off-policy learning seems to be powerful. An agent is able to learn about a target behavior that is not necessarily the same as its current behavior. This is especially powerful when the agent does not directly control the behavior. However, there is some downside. Off-policy algorithms are more complicated. With GTD(λ), for example, a second weight vector needs to be stored and updated. An importance sampling ratio (rho) also needs to be calculated. The extra weight vector also increases storage / memory cost. So the off-policy advantage comes at a cost.
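
To show where that extra cost comes from, here is a sketch of a single GTD(λ) update with linear function approximation, written in plain Python. Treat it as one common formulation under my own assumptions, not as the exact update used in the Horde paper: presentations differ in details such as where rho enters the trace, and here a single constant gamma is used for both the trace decay and the bootstrapped target. The class name `GTDLambda` and its parameter names are made up for this post.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


@dataclass
class GTDLambda:
    n: int                # number of features
    alpha: float = 0.1    # step size for the primary weights
    beta: float = 0.01    # step size for the secondary weights
    lam: float = 0.0      # trace decay (lambda)
    theta: Vector = field(init=False)  # primary weights (the prediction)
    w: Vector = field(init=False)      # secondary weights (the extra vector)
    e: Vector = field(init=False)      # eligibility trace

    def __post_init__(self) -> None:
        self.theta = [0.0] * self.n
        self.w = [0.0] * self.n
        self.e = [0.0] * self.n

    def update(self, phi: Vector, reward: float, phi_next: Vector,
               gamma: float, rho: float) -> float:
        """Apply one update and return the TD error.

        rho is the importance sampling ratio pi(a|s) / b(a|s) for the action
        actually taken; it is 1 when target and behavior policies agree.
        """
        delta = reward + gamma * dot(self.theta, phi_next) - dot(self.theta, phi)
        # Trace is scaled by rho, so off-target actions cut off credit.
        self.e = [rho * (p + gamma * self.lam * e) for p, e in zip(phi, self.e)]
        correction = gamma * (1.0 - self.lam) * dot(self.e, self.w)
        phi_dot_w = dot(phi, self.w)
        self.theta = [t + self.alpha * (delta * e - correction * pn)
                      for t, e, pn in zip(self.theta, self.e, phi_next)]
        self.w = [w + self.beta * (delta * e - phi_dot_w * p)
                  for w, e, p in zip(self.w, self.e, phi)]
        return delta


agent = GTDLambda(n=2)
td_error = agent.update([1.0, 0.0], reward=1.0, phi_next=[0.0, 1.0], gamma=0.5, rho=1.0)
```

Compared with a plain TD(λ) learner, the additions are the `w` vector, its update, the correction term, and the `rho` scaling of the trace.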

I wonder, in cases like Horde, where the agent controls the behavior policy directly, and the purpose is to learn (rather than maximize some reward), how another approach would work. Perhaps each GVF could be implemented as on-policy. Rather than GTD(λ), a simple SARSA algorithm could be used, and that GVF’s policy would be followed for a certain number of time steps. Because all of the experience would be relevant, learning would happen faster in theory. Furthermore, there would be less computational overhead, so the slack that was generated could be used for planning or some other task. As another approach, rather than just enumerating through GVFs to determine the behavior policy, perhaps you could choose the GVF that needed learning the most, follow that policy for a while, and then move on to the next. “Needing learning the most” … what would that mean? I suppose it could be based on the current average TD error, combined with the number of times it’s been followed. This approach would be similar to some curiosity inspired exploration.
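
As a sketch of what “needing learning the most” could look like in code, here is one hypothetical scoring rule in Python. The particular formula (average absolute TD error plus a bonus that shrinks with the number of times the GVF’s policy has been followed) is my own guess at a reasonable combination, not something taken from the Horde paper, and the names `GVFStats`, `learning_need` and `choose_gvf` are invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GVFStats:
    name: str
    avg_abs_td_error: float  # running average of |TD error| for this GVF
    times_followed: int      # how often its policy has been the behavior policy


def learning_need(stats: GVFStats, bonus: float = 1.0) -> float:
    """Higher means the GVF 'needs learning' more.

    The bonus term favors GVFs whose policy has rarely been followed.
    """
    return stats.avg_abs_td_error + bonus / math.sqrt(1 + stats.times_followed)


def choose_gvf(all_stats: Sequence[GVFStats], bonus: float = 1.0) -> GVFStats:
    """Pick the GVF whose policy should be followed next."""
    return max(all_stats, key=lambda s: learning_need(s, bonus))


candidates = [
    GVFStats("light", avg_abs_td_error=0.2, times_followed=99),
    GVFStats("wall", avg_abs_td_error=0.5, times_followed=99),
    GVFStats("sound", avg_abs_td_error=0.1, times_followed=0),
]
next_gvf = choose_gvf(candidates)               # never-followed GVF wins
by_error_only = choose_gvf(candidates, bonus=0.0)  # highest TD error wins
```

The `bonus` parameter controls how much weight novelty gets relative to raw error, which is where the resemblance to curiosity-style exploration comes in.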

This idea of comparing behavioral approaches in Horde (what would the performance measure be for comparing the different approaches?) seems like it could be interesting.

The maths

I must admit that at times, I’ve been frustrated by a lack of mathematical competency during my graduate studies in Reinforcement Learning. Other times though, I feel like my current level of mathematical competency is perfectly fine. I generally have enough intuition to understand the papers I’m reading. Part of me feels that when I allow myself to be frustrated, I’m allowing a certain type of author to win – the type of researcher dressing up fairly underwhelming results in overly complex mathematics to make them look more impressive than they really are. Often, complex math is needed … but other times it feels like I’m being shown dancing girls by a biz dev expert.

That said, the frustration is real.

I finished my undergraduate degree in 2002 (15 years ago now) with a major in Computer Science. I took the required mathematics courses (2 calculus courses, a linear algebra course, statistics, and a numerical analysis course). I thrived in these courses (I’d have to look but I’d guess my average was around 95% where the course average was 60%) and generally enjoyed them. I’m not exactly sure why I didn’t pursue more. That said, the knowledge I gained from these courses is both distant and short of what would be required to easily understand the papers I read. For example, I was reading a paper the other day talking about Fourier Transforms. A simple enough concept that I’ve studied before, but distant enough that I couldn’t quite remember the details I needed to parse the paper. The week before it was “LSI Systems.” In both cases I googled each term, which unearthed 10 more terms and mathematical concepts I was unfamiliar with. Each term may require a week’s dedication, if not more, to truly understand. It felt like peeling an onion.

I’ve thought about possible solutions. In my first year (last year) I audited a few classes – linear algebra and a statistics course. I should probably do more of that during my grad studies. But as always, 3 hours of lectures a week + time getting to and from adds up and is a pretty expensive investment. Not to mention the fact that to truly get the most out of these courses, one must not just attend classes, but do the assignments, prepare for exams etc. So this bottom up approach (reading analysis and probability text books, taking MOOCs) is problematic because of its time cost when you have conferences to attend, pressure to publish, other classes to take for credit etc. Furthermore, the signal to noise ratio in this approach is pretty low. This statement isn’t meant to minimize the value in understanding the entire content of a mathematics textbook. But for the utility of applying it to research directly, a lot of pages simply aren’t relevant. However, the only way to unearth the valuable bits might be to read the entire content. In a perfect world, there’d be a system in place that would give me just enough content to understand the concept I wished. This system would understand my current level of knowledge. I could query “Langevin flow” and be returned a list of pointers to articles / MOOCs, etc. given my current level of knowledge. Google can’t do this.

I’ve thought about a more top down approach – googling every term that’s unfamiliar – but as I stated before – that seems like exponential tree expansion.

Another more extreme solution would be spending a semester or two doing nothing but studying the appropriate math courses. This would mean taking the actual courses (maybe for credit, or at least doing assignments etc). But obviously, as stated before, grad students have deadlines and should really publish / advance research … not just take background courses. This would have to be framed as a pseudo-sabbatical.

All this said, maybe it’s much ado about nothing. I have a general intuitive understanding of most concepts I’ve encountered. And for those that required a more intimate understanding, I’ve been able to pick this up. Perhaps I could even argue that a beginner’s understanding of mathematical concepts could be an advantage over some more mathematically minded researchers, who treat their mastery of equations and proofs as a hammer seeking a nail and end up solving problems that don’t really matter. My lack of mathematical know-how hasn’t narrowed my research interests into purely application based areas (although I suppose that wouldn’t be a problem if it did. There’s nothing wrong with application research … as opposed to more theoretical).

Nonetheless, the frustration is real. I desire to understand a bit more. And I don’t think I’m merely being tricked by dancing girls and flashy math equations. A better mathematical foundation would help. I’ll continue to use a more top down approach (researching terms and ideas on demand, ad hoc) while sprinkling in a bit more bottom up: listening to math lectures when I have the chance. Doing the latter is easier said than done though, as “down time” is something that doesn’t occur that often. Perhaps I’ll dedicate myself to a MOOC to force the issue (the question of which content / MOOC notwithstanding).

I’d love to hear other solutions to this frustration though ….

Perception vs. Knowledge in artificial intelligence

“Perception” and “knowledge.” I believe I’ve been guilty of using these words in the past to communicate ideas of research, without being overly clear about what they mean to me. When borrowing words such as these, and ones such as “memory,” from other disciplines, one should use them as consistently with their original meaning as possible. But I don’t believe there is anything wrong with choosing words that are out of their original context, so long as you clearly articulate the meaning of the words you’re choosing.

“Perception” and “knowledge” were two such words I was using today when talking with a fellow graduate student at the University of Alberta. Part way through our conversation, it became clear that I wasn’t being overly clear with how I was defining either word. We were discussing the idea of “perception as prediction,” an idea I have blogged about briefly in the past. The basic premise of the idea is that what we “perceive” to be true are actually predictions that we believe would come true if we were to take certain actions. A baby perceives that it’s in the presence of its mother’s breast because it predicts that if it were to take certain actions, it would receive sensorimotor stimulation (breast milk). A golfer perceives that she is standing on the green because she predicts that if she were to strike the ball, it would roll evenly over the ground. This theory of perception could be taken further to suggest that I perceive that Donald Trump is the current President of the United States because I predict that if I were to enter his name into Google, I would discover that he was indeed the president.

The first two examples (the baby with breast milk, and the golfer) seem to fit nicely into a “perception” framework. To me, perception seems to be grounded in making sense of the immediate sensorimotor stream of data. “Making sense” perhaps means making predictions of what the future implications to the sensorimotor stream would be if certain actions were taken now. However, the example of perceiving that Donald Trump is the president seems to state something beyond my immediate sensorimotor stream of data. It’s less about “perceiving” what these observations mean, and more like “knowledge” that can be drawn upon to make other predictions. Knowledge seems to be based on making sense of what WOULD happen to my sensorimotor stream if I took certain actions now. Is this differentiation arbitrary? Is perception just a special case of knowledge – knowledge grounded in the immediate senses vs. future senses? I believe that most people separate knowledge from perception in this way. But how do we “draw upon” knowledge when it’s needed? Is this knowledge hidden somehow in the sensorimotor data? Or is it part of a recurrent layer in a network? Or perhaps it really is no different than “perception,” in that this knowledge regarding Donald Trump could be one of millions of general value functions in a network of general value functions. If this is the case, then couldn’t you, in addition to saying “perception as prediction,” also say “knowledge as prediction”? I believe this is actually exactly what researchers have argued in papers such as “Representing knowledge as Predictions.”

To me, perception refers to an ability to make further, more abstract sense of my immediate senses. To make “further, more abstract sense” means to be able to know that I’m “in a car” if I see street signs and hear the hum of a motor. This more abstract sense of being “in a car” could be modeled as a set of predictions that are true, given the current sensorimotor input. Knowledge, on the other hand, can be modeled the exact same way: through a set of predictions grounded in immediate senses. The difference, perhaps, is that perception deals with more subjective information. It’s about making more abstract sense of my immediate senses. But the mechanism to compute perception and knowledge could be the same. If that is the case, and these computations take the same input, and their outputs are used the same way, then what is the difference? Perhaps nothing?