- “The mind is its own place, and in itself can make a heaven of hell, a hell of heaven”
- “If we study only what is average, we will remain average”
- “You can eliminate depression without making someone happy. You can cure anxiety without teaching someone optimism. You can return someone to work without improving their performance. If we strive for a diminished bad, you’ll attain the average and you’ll miss out entirely on the opportunity to exceed average”
- Brilliant people sometimes do the most unintelligent things – i.e. overwork or work on the wrong things.
- Negative stress is like second hand smoke. It’s unspoken. The second you enter a room with it, you know.
- Neuroplasticity – It’s been shown that our brains aren’t static. We change.
- Just through thinking, you can literally rewire the brain – ie cause physical change. This says a lot.
- Pleasure, engagement, meaning. All very core to happiness. Very akin to autonomy, mastery, purpose
- Happiness = joy we get for striving for our potential
- Negative emotions narrow our thoughts.
- Meditate. Spend money – but not on stuff
- Fitness is key to happiness
- Happiness isn’t about lying to ourselves and being ignorant. But changing our brains to think that we can rise up over circumstances
- World is inherently subjective. You can shape so much
- Your experience literally changes the result. The same experience with 2 people can = 2 different results
- “Creation” of happiness. Not “pursuit.” We need to create it. It’s not something we hunt down. It’s already in us.
- Tetris effect. Once you get a negative mindset, that is all you see. You’re locked in.
- Need gratitude. Add 3 positives to a daily log
- Rose colored glasses are good … within limits
- It’s not just about getting up off the mat. It’s about understanding why you were on the mat in the first place and then bouncing back up.
- Things don’t always happen for the best. But you can make the best out of things that happen
- Pessimism = giving up. Seeing a dead end and no way out. It happens
- Sometimes people not only think they could do better (which is often a good growth mindset) but they don’t acknowledge the good they’ve done
- When a person gets too emotional, their logic is hacked and they literally can’t reason
- Fight until it is a habit. Until it’s done without thinking about it. It takes 21 days
- Willpower alone will not get you there. Willpower is finite
- Follow the path of least resistance. Make it easy. Put the guitar in the middle of the room. Not stored in a case in the closet
- Friends. The number one indicator of happiness. To get them to really like you, it’s all the Dale Carnegie stuff. Show genuine interest. Be present
- Happiness is contagious. Just as the opposite is true
- OKR (Objective and Key Result)
- Effective KRs are specific, time bound, aggressive but realistic. They are measurable. It’s not a result if it doesn’t have a number
- You either meet a result … or you don’t
- Review OKRs often
- Goals are a fact of life for high achievers
- Goals linked to broader company mission
- Adherents to OKRs: AOL, Dropbox, LinkedIn, Oracle, Slack, Spotify, Twitter
- Small startups love OKRs to make sure people rowing in same direction
- “OKRs are a shared language for execution. They clarify expectations: What do we need to get done (and fast) and who’s working on it? They keep employees aligned, vertically and horizontally”
- OKRs help flatten management.
- Superpower 1: Focus and commit to priorities – Also focus on what DOESN’T matter. Precision tools for departments to communicate.
- Superpower 2: Align and connect team – Everyone’s goals are transparent
- Superpower 3: Track for accountability – Periodic checkins and grade ins. Constant feedback and iteration.
- Superpower 4: Stretch for Amazing: “Shoot for stars. Make it there sometimes. Otherwise arrive on moon”
- Andy Grove: “There are so many people working so hard and achieving so little”
- KPIs are ok. But they’re numbers without soul or purpose. OKR has the soul (objective).
- OKR is a tool not a weapon. Not to be used as a performance review punishment or linked to compensation.
- Tell people go to central Europe, some will go to Germany, some to France. But if they’re transparent with their direction at least they’ll know they’re pointing in different directions. ie. Transparency
- “Bad companies are destroyed by crisis, good companies survive them, great companies are improved by them”
- OKR needs an owner
- 3-5 KR per objective
- Annual objectives, measured quarterly (at least)
- Pair quality with quantity in OKRs. Not just quantity
- Less is more with goal setting. Better focus. More realistic
- If just starting out with OKRs in a startup, less is definitely more. Start small. Or you’ll get crushed. Start with high level team goals.
- By definition, startups wrestle with ambiguity. The closer you are to the top, the more ambiguous and abstract your role is. OKRs can help crystallize objectives.
- Research shows public goals are more often achieved (transparency for the win)
- Healthy OKR environment balances alignment and autonomy, common purpose and creative latitude
- “More meetings” is often compensating for lack of transparency. This should be more adhoc than structured constant meetings.
- When new projects start, ask “where does this fit in with OKRs?” If it doesn’t … ask “why are we doing this?”
- Perhaps review OKRs at each sprint review
- Missions are even a higher level abstraction than objectives. Missions are directional. They are the guiding light that generate the objectives
- Stretch goals. You need these too. Shoot for the stars. It’s ok (necessary actually) to only achieve 30% of these. If you hit them all … you’re not setting bar high enough.
- 10x goal. Gotta be 10x better than anything else. Anything else is incremental and just noise.
- Traditional performance reviews suck because of distorted recency bias (among other reasons)
- CFRs – Conversations, feedback, recognition
- “Not everything that can be counted counts, and not everything that counts can be counted.”
- Whole chapter about radical candour management (feedback in real time)
- 1:1s are for the employee. Their agenda. Their time
- Feedback can be awesome. But needs to be specific
- OKRs teach you how to be an executive … before you’re actually an executive
- Need the right culture for OKRs to work. “You need the wrong people off the bus, the right people on the bus, everyone seated in the right seat … before you can step on the gas”
- Need to have the customer as part of the conversation and focus: “If you want to give someone a haircut, it’s best to have them in the same room”
- There’s no dogma in the OKR. They are flexible.
- If you can achieve an objective without providing direct user or economic benefit … then why is it an objective? Maybe it’s ill defined.
- KR should be comprehensive of objective. ie. if you complete the KR, you should achieve the objective
Do humans learn fully online? I’ll define “online” as when learning only leverages the data incoming from the current observation. It learns only from this new data. In such a world, batch processing and experience replay are not “online.” Is that how humans work?
I suspect not. I think there’s a balance between online learning and planning in humans. Furthermore, I think that we shift between the two depending on the situation. When a human is faced with an intense situation – for example a professional hockey player in the middle of their shift – all of the data processing is in the foreground. They take the immediate observation, and act upon it. There is no background planning taking place. At the other extreme is when the athlete is back home daydreaming just before they fall asleep (or while they are actually sleeping). At this time, the human is not doing much, if any, foreground processing. The observations they are receiving are minimal – given they’re lying in a dark room. And their action decisions are even more minimal. But in such a state of dreaming, their mind is free to plan (with the model of the world that has been learned).
This brings me to an ideal agent. I think an ideal agent will maximize its use of computational power. There is no idle time for such an agent. It learns as much as it can at every moment. There are no sleep(s)! Therefore, when such an agent is in a situation where immediate action is required (like that of an athlete), its computational power must be dedicated completely to foreground processing. But later, the same agent, when in a safe place, has the opportunity to plan using the model of the world it has learned. I usually hate suggesting that humans are the default inspiration for designing the best agent (i.e. just accepting that neural nets are the de facto best function approximator because humans implement them) … but in this case, I think we have it right.
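One way to picture this foreground/background split is a Dyna-style loop. This is my choice of illustration, not something the post prescribes. The class name, the `spare_updates` parameter, and the deterministic tabular model are all assumptions made for the sketch: every real transition gets a foreground update, and whatever compute is left over is spent replaying remembered transitions.

```python
import random
from collections import defaultdict


class ForegroundBackgroundAgent:
    """Tabular Q-learner that plans from a learned model when compute is spare."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9):
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.model = {}               # (state, action) -> (reward, next_state), assumed deterministic

    def _update(self, s, a, r, s_next):
        # Standard one-step Q-learning update.
        best_next = max(self.q[(s_next, b)] for b in range(self.n_actions))
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

    def step(self, s, a, r, s_next, spare_updates):
        # Foreground: learn from the transition that just happened.
        self._update(s, a, r, s_next)
        self.model[(s, a)] = (r, s_next)
        # Background: replay remembered transitions with whatever compute is left.
        for _ in range(spare_updates):
            (ps, pa), (pr, ps_next) = random.choice(list(self.model.items()))
            self._update(ps, pa, pr, ps_next)


agent = ForegroundBackgroundAgent(n_actions=2)
agent.step(s=0, a=0, r=1.0, s_next=1, spare_updates=0)   # "mid-shift": no time to plan
agent.step(s=0, a=0, r=1.0, s_next=1, spare_updates=3)   # "safe place": plan as well
print(agent.q[(0, 0)])
```

With `spare_updates=0` the agent is purely online in the sense defined above. Any positive value adds replay from the model, which by that same definition is no longer online.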
The trajectory of an MDP is given as:
S0, A0, R1, S1, A1, R2, S2, A2, R3, …
An agent is given a state, takes an action, receives a state, a reward … and the cycle continues. Why is, by convention, the indexing of states, actions, and rewards done in this way?
I think it’s important to note that this is convention, and convention only. R1 above could just as easily have been denoted as R0. I think the decision of indexes comes down to which fits our mental model of how things work. You could argue both sides.
One mental model is that, since a given state, action and resulting reward are each required at each learning step in most algorithms, it would make sense for these states, actions, and rewards to share the same index. In such a mental model, the above would be organized as S0, A0, R0. This would seem reasonable.
But I think of the stream of states, actions, and rewards from a more temporally consistent data stream perspective. From this view, states and rewards enter the agent at the exact same temporal step. The indexes have temporal significance, so the state and reward should share the same index. Therefore the above indexing sequence makes most sense to me.
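To see that the two conventions differ only in labelling, here is a small sketch that prints both. The function name and the `shared_index` flag are made up for illustration.

```python
from typing import List


def label_trajectory(n_steps: int, shared_index: bool = False) -> List[str]:
    """Return symbol names for n_steps of agent-environment interaction.

    shared_index=False: the reward carries the index of the state it arrives with.
    shared_index=True:  the reward carries the index of the state-action pair that produced it.
    """
    labels = []
    for t in range(n_steps):
        reward_index = t if shared_index else t + 1
        labels += [f"S{t}", f"A{t}", f"R{reward_index}"]
    return labels


print(", ".join(label_trajectory(2)))                     # temporal view
print(", ".join(label_trajectory(2, shared_index=True)))  # learning-step view
```

The underlying data stream is identical in both cases. Only the subscript attached to each reward changes.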
As humans born into the world, we are exposed to nothing but a stream of raw sensorimotor data – visual, audio, touch, smell. Using this data alone, we begin to form increasingly abstract knowledge.
One approach for artificial intelligent agents to represent such knowledge is to do so through predictions. “Perception as prediction” is something I’ve written about before. One way to do this would be to use the General Value Function (GVF). Given such an approach, one or more GVFs could be created which would predict some raw value in the observation stream. The value of such a prediction could then be used in the question of a higher level prediction. On and on it goes, using the GVF at each layer, until more abstract knowledge is captured. But for a system attempting to learn this abstract knowledge, how would such an agent drive behavior? There are several challenges which I’m encountering with my current research in this field:
- The GVFs at higher levels require the GVFs at the lower level to already have been learned. Therefore, it’s most efficient to follow behavior policies that optimize learning the early layers. Perhaps the agent should, in sequence, follow behaviors that foster learning these low level GVFs. In other words, follow the behavior policies from the GVFs directly – with some minimal exploration. Then, and only then, would the agent be able to learn other higher level GVFs. In a way, this would be sequential on-policy learning, rather than parallel off-policy learning.
- The later layers depend on input from the earlier layers. So until these earlier layers are learned, the later layers are being handed incorrect data to learn from. This ties up unnecessary computation time, and initializes the weight vectors of these later GVFs to incorrect values. Often very incorrect values.
- If this was the approach however, you cannot completely forget to follow policies that learn lower level GVFs. In other words, the policy for the first layer GVF must on occasion be followed. Because we are using function approximation, rather than tabular, if the policy is never followed, the predictive value for that GVF will gradually be eroded. It’s like a human that learns that if it is at the edge of a lake and extends its foot, it will feel the sensation of “wetness.” Once it has learned that, it must, on occasion, experience the feeling of wetness by extending its foot. This is because there are many other similar states for which the agent will NOT experience this sensation. Because of function approximation, each time the agent does not experience this sensation, the value of the GVF predicting wetness will move closer to NOT experiencing it. So the agent needs to be reminded occasionally how it feels to experience this sensation, and which states lead to this. So this raises the question: when should the agent relearn and revisit these values and behavior policies of earlier GVFs?
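The erosion described in the last point can be reproduced in a toy setting. The sketch below is an illustration under assumptions, not my actual experimental setup: the two-feature vectors are hypothetical, and the discount is fixed at 0 so that the TD update reduces to a one-step prediction of “wetness.” The lake-edge state and a similar dry state share one feature, so updates on the dry state pull down the shared weight.

```python
import numpy as np


def td0_update(w, x, cumulant, alpha=0.1):
    """One-step (gamma = 0) linear update: move the prediction w.x toward the cumulant."""
    return w + alpha * (cumulant - w @ x) * x


# Hypothetical features: [near_water, foot_extended]
lake_edge = np.array([1.0, 1.0])   # foot extended at the lake: feels wet (cumulant 1)
dry_shore = np.array([1.0, 0.0])   # similar state, shares near_water: not wet (cumulant 0)


def run(revisit_every=None, steps=500):
    w = np.zeros(2)
    for _ in range(200):                      # phase 1: learn wetness at the lake edge
        w = td0_update(w, lake_edge, 1.0)
    for t in range(steps):                    # phase 2: mostly the similar, dry state
        if revisit_every and t % revisit_every == 0:
            w = td0_update(w, lake_edge, 1.0)
        else:
            w = td0_update(w, dry_shore, 0.0)
    return float(w @ lake_edge)               # final wetness prediction at the lake edge


never_revisited = run()
revisited = run(revisit_every=10)
print(never_revisited, revisited)
```

In this toy, the prediction that is never refreshed drifts well below its learned value, while an occasional revisit keeps it close to 1. How often to revisit is exactly the open question above.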
Additionally, is there a place for an agent that is motivated to learn this knowledge? Shouldn’t it be following behavior which accomplishes some other goal? Perhaps while doing planning, the goal of learning such knowledge would be more appropriate, since this allows for accomplishing goals across multiple tasks.
Several of my experiments in reinforcement learning have involved generating predictive state representations (whereby some of the features are predictive in nature). These learned predictive features look at the current observation, make predictions about some future cumulant, and incorporate this prediction into the current state representation. In the Cycle world setting, I’ve demonstrated how this can be useful. In a way, predictive state representations look at an aspect of the future data stream and add this information to the current state representation.
In contrast, another way to construct a state representation would be from a historical perspective. The agent would encapsulate previous observations and incorporate them into the current state observation.
Given these two perspectives, it got me thinking: is there REALLY a place for predictive state representations? What is that place? Can they just be replaced by historical ones? What follows raises more questions than answers … but is my way of fleshing out some thoughts regarding this topic.
First off, with respect to predictive features, I’m going to define useful as being able to provide information that allows the agent to identify the state. A useful feature would allow you to generalize across similar states and likewise, allow the agent to differentiate from unlike states.
For both predictive and historical features, I think they’re really “only” useful when the data stream of observation is sparse in information. In other words, it lacks information at each timestep which would allow the agent to uniquely identify the current state. The tabular grid world where the current state is uniquely identified is the opposite of such an information-lacking observation stream. In contrast to the tabular world, in information-sparse / partially observable settings, historical or predictive features can fill the gaps where the current observation is lacking information. If informative data exists in the past data, then historical features become useful. Likewise, if informative data exists in the future data stream, then predictive features become more useful.
The B MDP, where there is unique information at the very beginning of each episode which directly informs the agent which action to take in the future, is somewhat of a special case where all of the useful information in the stream exists only in the past (the first state). Therefore, predictive features are of no use here. But in continuous (real life) tasks, I don’t think this would ever be the case. Informative data “should”(?) always be observable in the future, if the agent takes the correct steps. So if that’s the case (as long as the agent can navigate back to these states), then predictive features can do whatever the historical ones can. It’s only when the agent can’t get back that historical features are required. But these cases seem somewhat contrived.
In either case though … maybe this is stating the obvious … but historical and predictive features seem to share something in common: They are useful when the current observation is starving for more informative data. Historical features help when there is informative data in the rear view mirror. Predictive when there’s informative data ahead such as the following MDP.
So which (predictive or historical) really depends on what the data stream looks like … and where you are within it.
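A tiny example of “where you are within it.” This is a sketch under assumptions: the stream is a made-up cycle of five 0s followed by a 1, not necessarily the Cycle world I’ve used elsewhere. The predictive feature is represented by its ideal target (the discounted sum of future observations, computed in hindsight) rather than by a learned estimate, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def historical_features(stream: Sequence[int], t: int, k: int) -> Tuple[int, ...]:
    """The k observations immediately before time t (zero-padded at the start)."""
    padded = [0] * k + list(stream)
    return tuple(padded[t:t + k])


def ideal_prediction(stream: Sequence[int], t: int, gamma: float, horizon: int) -> float:
    """Discounted sum of the next `horizon` observations after time t."""
    future = stream[t + 1:t + 1 + horizon]
    return sum((gamma ** i) * obs for i, obs in enumerate(future))


stream = [0, 0, 0, 0, 0, 1] * 4   # assumed cycle: the informative "1" appears every 6 steps

for t in (8, 9):
    print(
        t,
        stream[t],                              # current observation: identical at both steps
        historical_features(stream, t, k=2),    # short history: identical at both steps
        ideal_prediction(stream, t, gamma=0.5, horizon=6),  # differs between the steps
    )
```

At these two timesteps the informative observation is ahead rather than just behind, so a two-step history cannot tell them apart while the prediction can. Immediately after the 1 has passed, the situation reverses and the short history becomes the informative one.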
I spent some time mingling with some CTOs of iNovia portfolio companies today. It was fun getting a bit back into the startup world, especially with the obvious AI slant to the meeting today. One subject that I discussed a lot with some of them was the apparent lack of startups leveraging Reinforcement Learning.
I’m assuming that this perception is true, but admit that I have no data on this other than personal and anecdotal. The number of startups leveraging deep learning outnumbers those using RL significantly – based on the press clippings I read, and the startup founders I know first hand. So I believe there is something to this. This raised the natural question: why?
I’ve blogged about this before, but the discussions today went a bit beyond this. No one, obviously including myself, claimed to have the answer. That said, I believe there are several reasons for the apparent lack of RL startups, most of them pivoting around the availability of data (or lack thereof).
Data in RL is temporally based. Unless you’re simulating games on the Atari, it is difficult to come by, especially because of its often “real world,” action-based nature. You can’t just use Amazon Mechanical Turk to label 500,000 images of cats to train your network. You have to generate a real stream of experience. That’s more difficult to do.
More specific challenges of the acquisition of data might be:
- The cost of experimentation – Exploration is key in creating data used for reinforcement learning. When simulating games of Go, one can experiment and make a move that is thought to be poor. The worst that can happen is that your agent’s self-esteem takes a hit because it loses another game! The stakes are low. But in a real time stock trader, the reinforcement learning agent suffers a significant cost when experimenting. This exploration cost isn’t suffered in supervised classification type approaches.
- Delayed value – Any value derived from a reinforcement agent solution doesn’t occur until after the agent has been trained. It’s hard to convince an organization to adopt your product if they have to wait a month for it to learn before providing value. Because of the lack of simulators, these agents must learn on the fly. So when they’re deployed they don’t provide immediate value.
- Temporal nature of data – I’ve blogged about this before, but reinforcement learning data is temporal in nature, often rooted in “real time.” Take agriculture data for example. It takes a full calendar year to acquire results of how a crop may perform, rather than 0.001 seconds to simulate a game of Go. This is similar to a degree for acquiring data for a manufacturing plant. We’re at the early stages of RL adoption, so this data just isn’t there yet, and beginning to acquire it is difficult because of the reason above – why acquire the data in the first place if there’s no immediate payoff?
- Infancy of the technology – I don’t buy this one, but it was mentioned that RL is a more novel approach than supervised learning. RL, Supervised, Unsupervised. They’ve all been “around” a long time. The computational power and availability of data has been what’s given rise to supervised learning. I’d conclude by saying that for RL we now have the computational power, but we still lack the data.
Not to be confused with “the hard thing about hard things!” … which … incidentally is a fantastic read. The TLDR of which is that nothing in business (and in life?) follows any sort of recipe. You’ll be faced with countless difficult situations in leadership, for which there are no silver bullets. So Ben Horowitz shares a whole bunch of stories about hard situations, and how they were handled (not always well).
Ok … so enough about the book plug. Back to “hard questions,” this time, with respect to “questions” that a General value function is looking to answer.
I’ve thought about this a bit recently. In a family of GVFs (questions) that an agent could ask about its data stream, some questions seem intuitively easy to answer, while other types of questions seem hopeless. I hadn’t given it much thought until I heard another grad student make a similar reference during a really interesting talk about general value functions today. Her presentation was comparing different algorithms used by a GVF in a robot setting. The comment in particular was that the algorithms all performed similarly if the question “was easy”, but when the question was “hard,” the different algorithms performed differently.
This obviously reminded me again about what a “hard question” really is. What do we mean when a prediction, or GVF is “hard”?
In a robot setting, a GVF answers a question about its sensorimotor stream. Given a policy, “how much light will I see in the near future?” and “how many steps will I take before I hit a wall?” are examples of GVF questions. Some of these seem difficult to answer.
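To make these two example questions concrete, here is a rough sketch of how each could be written as a cumulant plus a continuation (discount) function, with the policy left implicit in whatever behavior generated the recorded stream. The `GVFQuestion` class, the `light` and `bump` sensor names, and the 0.5 discount are all assumptions for illustration. The function computes the answer in hindsight from an observed future, which is the target a learned GVF would try to predict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Observation = Dict[str, float]


@dataclass
class GVFQuestion:
    cumulant: Callable[[Observation], float]       # signal being accumulated
    continuation: Callable[[Observation], float]   # discount applied after each step


def answer_in_hindsight(question: GVFQuestion, future: Sequence[Observation]) -> float:
    """Discounted sum of the cumulant over an observed future (the prediction target)."""
    total, discount = 0.0, 1.0
    for obs in future:
        total += discount * question.cumulant(obs)
        discount *= question.continuation(obs)
        if discount == 0.0:   # the question has terminated
            break
    return total


# "How much light will I see in the near future?"
light_soon = GVFQuestion(cumulant=lambda o: o["light"], continuation=lambda o: 0.5)

# "How many steps will I take before I hit a wall?" (counts the step that bumps)
steps_to_wall = GVFQuestion(
    cumulant=lambda o: 1.0,
    continuation=lambda o: 0.0 if o["bump"] else 1.0,
)

future = [
    {"light": 1.0, "bump": 0.0},
    {"light": 0.0, "bump": 0.0},
    {"light": 1.0, "bump": 1.0},
]
print(answer_in_hindsight(light_soon, future))
print(answer_in_hindsight(steps_to_wall, future))
```

Framed this way, a “hard” question is one where this hindsight answer is difficult to approximate from the features available at the moment the question is asked.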
I think it’s fair to say that when we say that a GVF is hard to learn, it means that it is difficult to come up with an accurate prediction.
So the next question then becomes, what makes it difficult to form an accurate prediction? I can see two reasons that a prediction may be difficult to approximate.
- There isn’t enough relevant experience to learn from. Learning how long it would take to get from Edmonton to Calgary would clearly be “hard” to learn if you lived in Regina, and rarely travelled to Alberta.
- Your feature representation is insufficient to answer this type of question. I’m less clear about what I mean by this. But it seems as though the features required to answer one question would be quite different than the features required to answer another type of question. “What color will I see in the near future?” quite clearly requires a different feature representation than “What sound will I hear in the near future?” If the robot agent only represents its state with visual sensors, the latter type of question will clearly be incredibly hard to answer. This question, to be fair, might be illogical. How can you ask a question about what you’d hear, if you weren’t given audio senses? So perhaps a better example might be a question like “What is the chance I will see an ambulance pass me on the highway?” If my representation includes audio signals, this question may be easy to answer, because as soon as I “hear” a siren, I can quite accurately predict that I’d be able to see the ambulance. Without this audio signal however, suddenly this question becomes much more difficult, if not impossible (if you remove my rear view mirrors).
Clearly there’s more to a “hard” question than this. But it seems these two attributes are a good starting place.
The “knowledge” is in the data.
In other words, predictions are made directly from this data stream, and these predictions are all that you need to represent “knowledge.”
That said, each agent (person) is exposed to a different subset of this data.
Therefore, everyone’s “knowledge” is personal. No?
What does this say about any sort of ground truth about knowledge? One’s “knowledge” may be in direct conflict with another’s. And both may be correct?
So, knowledge can conflict. Well, actually, since knowledge is completely subjective, there is really no universal ground truth to it. It’s personal.
What then about “facts?” Aren’t facts thought to be bits of knowledge that are “true?” Wait … Oh no … ALTERNATIVE FACTS!!!
Atari is a popular environment for testing reinforcement learning algorithms. This got me thinking about the possibility of using mobile applications for such purposes. There might be some practical advantages for doing so – both from a researcher’s and a commercializer’s perspective.
For researchers, the ability to be exposed to multiple different sensors might provide value. Not only would you have access to pixel and audio data from the game, but you may also have microphone data from the environment. Also, accelerometer data can be included.
One of the biggest technical challenges to instrumenting an application for testing within this environment might be accessing the pixel stream in an efficient manner. From what I remember, there’s no way to efficiently access the current UIView pixel data. Creating the pixel data from the UIView is expensive computationally (TBD by how much), which may be problematic since this would need to be done at each time step. This is especially problematic if one were to provide RL tools / services to application developers (more on this later), since your solution would introduce latency – something to be avoided like the plague in mobile applications.
There would be obvious commercial appeal to an RL SDK + service that was easily integrated into an application. Such a service might, using senses common to any app (pixel data, sound) rather than needing to access any non-observable application state, be able to make predictions about user behavior. Such predictions would be valuable for an application developer. For example, the SDK could trigger a notification that there was a good chance the user would quit the app within 5 seconds. (Yes, enabling application developers to suck users into spending even more time staring at their phones may be a particular type of evil).
Another technical challenge is that a couple of test applications would need to be used to develop such an SDK / service. Perhaps a larger company with a popular app would allow you to integrate with their application to perform research. Or perhaps better yet, in an Android environment, this pixel data may be available, so you could experiment with such a solution even without the cooperation of the app developer.