History of state indexing

The trajectory of an MDP is given as:

S0, A0, R1, S1, A1, R2, S2, A2, R3, …

An agent observes a state, takes an action, then receives a reward and the next state, and the cycle continues. Why, by convention, are states, actions, and rewards indexed this way?

I think it’s important to note that this is convention, and convention only. R1 above could just as easily have been denoted R0. The choice of indexing comes down to which scheme fits our mental model of how things work, and you could argue both sides.

One mental model: since most learning algorithms consume a state, the action taken in it, and the resulting reward together at each update step, it would make sense for those three to share the same index. Under that model, the first triple above would be written S0, A0, R0, which seems reasonable.

But I think of the stream of states, actions, and rewards from a more temporally consistent, data-stream perspective. From this view, the reward and the next state enter the agent at exactly the same time step. If indexes carry temporal significance, then the state and the reward that arrive together should share the same index. That is why the indexing sequence above makes the most sense to me.
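The data-stream view can be made concrete with a small sketch. The environment and policy below are hypothetical toys, purely for illustration; the point is how the lists are indexed: the reward produced by (S_t, A_t) is appended at the same position as S_{t+1}, matching the convention above.

```python
def step(state, action):
    # Toy deterministic environment (an assumption for illustration):
    # move by `action`, reward 1.0 on reaching state 3 or beyond.
    next_state = state + action
    reward = 1.0 if next_state >= 3 else 0.0
    return next_state, reward

states = [0]        # S_0
actions = []        # A_0, A_1, ...
rewards = [None]    # no R_0 under this convention; pad so indexes align

for t in range(3):
    a = 1                       # A_t: a fixed "always move right" policy
    s_next, r = step(states[t], a)
    actions.append(a)           # A_t
    rewards.append(r)           # R_{t+1}: arrives with S_{t+1}, shares its index
    states.append(s_next)       # S_{t+1}

# states:  [S_0, S_1, S_2, S_3] -> [0, 1, 2, 3]
# rewards: [None, R_1, R_2, R_3] -> [None, 0.0, 0.0, 1.0]
```

Note that `rewards[t]` and `states[t]` are exactly the two things the agent receives at time step t, which is the temporal alignment argued for above.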
