A Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
An example of a Markov Process is shown below.
A Practical Approach:
In the context of reinforcement learning, an MDP consists of a 5-element tuple (S, A, P, R, D):
S – Set of states
A – Set of actions
P – State transition distributions
R – Reward function
D – Discount factor
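As a sketch, the five elements can be collected in one small container. The field names and type choices below are illustrative, not from the article:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]   # (row, col) cell in a grid
Action = str              # e.g. one of "N", "E", "S", "W"

@dataclass
class MDP:
    states: List[State]                                        # S - set of states
    actions: List[Action]                                      # A - set of actions
    transition: Callable[[State, Action], Dict[State, float]]  # P - maps (s, a) to {s': prob}
    reward: Callable[[State], float]                           # R - reward for entering a state
    discount: float                                            # D - discount factor in [0, 1)
```

Keeping the transition and reward as callables (rather than dense tables) keeps the sketch small and works for grids of any size.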
We are going to discuss this theory with the help of an example:
Consider a 6×7 grid where we start from a certain cell and want to reach a target cell while avoiding some cells along the way.
The grey square represents the starting point. The agent must try to avoid the orange squares, and its target is to reach the green square. There are 6 × 7 = 42 states in total.
Reward for entering:
Orange square = -1
Blue square = -0.01
Green square = +2
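The reward scheme above can be written as a small function. The positions of the orange and green squares come from the article's figure, which is not reproduced here, so the coordinates below are placeholders:

```python
# Hypothetical square positions (placeholders for the article's figure).
ORANGE = {(2, 3), (4, 5)}   # penalty squares
GREEN = (1, 6)              # goal square

def reward(state):
    """Reward received for entering `state`."""
    if state in ORANGE:
        return -1.0
    if state == GREEN:
        return 2.0
    return -0.01            # every ordinary (blue) square costs a small step penalty
```

The small negative reward on blue squares encourages the agent to reach the goal in as few steps as possible.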
Actions are {North: N, East: E, West: W, South: S}
Probability distribution:
Movement is stochastic: when the agent chooses a direction, it does not always move that way. The figure above gives the probability of actually moving in each direction for a chosen action.
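The exact probabilities from the figure are not reproduced here, so the sketch below assumes a common choice: the agent moves in the intended direction with probability 0.8 and slips to each perpendicular direction with probability 0.1, staying in place if it would leave the grid:

```python
ROWS, COLS = 6, 7
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
PERPENDICULAR = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def step(state, direction):
    """Move one cell in `direction`; bounce off the grid boundary by staying put."""
    r, c = state
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    if 0 <= nr < ROWS and 0 <= nc < COLS:
        return (nr, nc)
    return state

def transition(state, action):
    """Return {next_state: probability} for taking `action` in `state` (assumed 0.8/0.1/0.1 slip model)."""
    probs = {}
    outcomes = [(action, 0.8)] + [(side, 0.1) for side in PERPENDICULAR[action]]
    for direction, p in outcomes:
        s2 = step(state, direction)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs
```

Note that probabilities for distinct directions can land on the same cell (e.g. in a corner), so they are accumulated rather than overwritten.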
D takes a value between 0 and 1.
Hence we have defined all five elements of the MDP.
Let the reward function, or the total payoff, be R(total).
R(total) = R(6,1) + D·R(2nd state) + D²·R(3rd state) + …
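Numerically, this discounted sum over a trajectory of entry rewards can be computed as:

```python
def discounted_return(rewards, discount):
    """Total payoff R(total) = r0 + D*r1 + D^2*r2 + ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (discount ** t) * r
    return total

# Example: entering three blue squares (-0.01 each) and then the
# green goal (+2), with an assumed discount D = 0.9.
payoff = discounted_return([-0.01, -0.01, -0.01, 2.0], 0.9)
```

Because D < 1, rewards received later are worth less, which is exactly why the agent prefers short paths to the goal.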
Now we will maximize this reward function and find the optimal policy π* that reaches the target green square.
The above example shows the working behind a basic reinforcement learning model. The optimal policy gives us the action to take in each state in order to reach the target with the maximum value of the reward function.
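One standard way to carry out this maximization is value iteration. The self-contained sketch below uses illustrative assumptions throughout: placeholder positions for the colored squares, a 0.8/0.1/0.1 slip model for movement, and a discount of 0.9:

```python
# Value iteration for the 6x7 grid. Square positions, slip probabilities,
# and the discount factor are illustrative assumptions, not from the article.
ROWS, COLS, DISCOUNT = 6, 7, 0.9
ORANGE, GREEN = {(2, 3), (4, 5)}, (1, 6)
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
SIDES = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def reward(s):
    return -1.0 if s in ORANGE else 2.0 if s == GREEN else -0.01

def step(s, d):
    r, c = s[0] + MOVES[d][0], s[1] + MOVES[d][1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else s

def transition(s, a):
    probs = {}
    for d, p in [(a, 0.8), (SIDES[a][0], 0.1), (SIDES[a][1], 0.1)]:
        s2 = step(s, d)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs

def value_iteration(tol=1e-6):
    """Iterate the Bellman optimality update until values converge."""
    states = [(r, c) for r in range(ROWS) for c in range(COLS)]
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s == GREEN:          # goal is terminal: no further reward
                continue
            best = max(
                sum(p * (reward(s2) + DISCOUNT * V[s2])
                    for s2, p in transition(s, a).items())
                for a in MOVES)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # The optimal policy is greedy with respect to the converged values.
    policy = {s: max(MOVES, key=lambda a: sum(
                  p * (reward(s2) + DISCOUNT * V[s2])
                  for s2, p in transition(s, a).items()))
              for s in states if s != GREEN}
    return V, policy
```

Running `value_iteration()` returns the optimal value of every state and, from those values, the greedy action to take in each state, which is the optimal policy π*.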