The trajectory of an MDP is given as:

S0, A0, R1, S1, A1, R2, S2, A2, R3, …

An agent observes a state, takes an action, then receives a reward and a new state, and the cycle continues. Why, by convention, are states, actions, and rewards indexed in this way?
I think it’s important to note that this is convention, and convention only. R1 above could just as easily have been denoted R0. I think the choice of indexing comes down to which fits our mental model of how things work, and you could argue both sides.
One mental model is that, since a state, an action, and the resulting reward are all required at each learning step in most algorithms, it would make sense for those three quantities to share the same index. Under that view, the trajectory above would be organized as S0, A0, R0, S1, A1, R1, …, which seems reasonable.
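As a rough sketch of that bookkeeping (not from the original post; the Gym-style `env.reset()`/`env.step()` interface, the `policy` callable, and the `Transition` container are my own assumptions), each learning step would produce one tuple whose members all carry the same index:

```python
from collections import namedtuple

# Hypothetical "shared index" bookkeeping: the reward is stored under
# the same index t as the state and action that produced it.
Transition = namedtuple("Transition", ["s", "a", "r"])

def collect_shared_index(env, policy, horizon):
    """Return a list where trajectory[t] holds (S_t, A_t, R_t)."""
    trajectory = []
    s = env.reset()                      # assumed Gym-style interface
    for t in range(horizon):
        a = policy(s)                    # A_t, chosen in S_t
        s_next, r, done = env.step(a)    # reward caused by (S_t, A_t)
        trajectory.append(Transition(s, a, r))  # all three share index t
        if done:
            break
        s = s_next
    return trajectory
```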
But I think of states, actions, and rewards as a temporally consistent data stream. From that view, a reward and a state enter the agent at the exact same temporal step: taking A0 in S0 produces the next reward and the next state together, so they are labelled R1 and S1. The indexes have temporal significance, so a state and the reward arriving with it should share the same index. That is why the indexing in the trajectory above makes the most sense to me.
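To make the data-stream view concrete, here is a minimal sketch (again assuming a hypothetical Gym-style `env` and `policy`, neither of which appears in the original post) of what the agent perceives at each temporal step: a reward and a state arrive in the same tick and so share an index, and the very first tick delivers S0 with no reward at all.

```python
def observation_stream(env, policy, horizon):
    """Yield (t, R_t, S_t): what the agent perceives at temporal step t.

    At t = 0 only S_0 arrives (there is no reward yet).  At every later
    step t, the reward R_t and the state S_t arrive together, which is
    why they share an index and the trajectory reads
    S_0, A_0, R_1, S_1, A_1, R_2, ...
    """
    s = env.reset()
    yield 0, None, s                 # t = 0: just S_0, no reward yet
    for t in range(1, horizon + 1):
        a = policy(s)                # A_{t-1}, chosen from S_{t-1}
        s, r, done = env.step(a)     # R_t and S_t enter at the same tick
        yield t, r, s
        if done:
            break
```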