
createMDP

Create Markov decision process object

Description

A Markov decision process (MDP) is a discrete-time stochastic control process in which the state and action spaces are finite and stochastic rules govern state transitions. MDPs are useful for studying optimization problems solved using reinforcement learning. Use the createMDP function to create a GenericMDP object with the states and transitions you specify. You can then modify some of the object properties and pass the object to rlMDPEnv to create an environment that agents can interact with.

mdp = createMDP(states,actions) creates a Markov decision process object with the specified states and actions.


Examples


Create a GenericMDP object with eight states and two possible actions.

mdp = createMDP(8,["up";"down"])
mdp = 
  GenericMDP with properties:

            CurrentState: "s1"
                  States: [8×1 string]
                 Actions: [2×1 string]
                       T: [8×8×2 double]
                       R: [8×8×2 double]
          TerminalStates: [0×1 string]
    ProbabilityTolerance: 8.8818e-16

Specify the state transitions and their associated rewards.

% State 1 transition and reward
mdp.T(1,2,1) = 1;
mdp.R(1,2,1) = 3;
mdp.T(1,3,2) = 1;
mdp.R(1,3,2) = 1;

% State 2 transition and reward
mdp.T(2,4,1) = 1;
mdp.R(2,4,1) = 2;
mdp.T(2,5,2) = 1;
mdp.R(2,5,2) = 1;

% State 3 transition and reward
mdp.T(3,5,1) = 1;
mdp.R(3,5,1) = 2;
mdp.T(3,6,2) = 1;
mdp.R(3,6,2) = 4;

% State 4 transition and reward
mdp.T(4,7,1) = 1;
mdp.R(4,7,1) = 3;
mdp.T(4,8,2) = 1;
mdp.R(4,8,2) = 2;

% State 5 transition and reward
mdp.T(5,7,1) = 1;
mdp.R(5,7,1) = 1;
mdp.T(5,8,2) = 1;
mdp.R(5,8,2) = 9;

% State 6 transition and reward
mdp.T(6,7,1) = 1;
mdp.R(6,7,1) = 5;
mdp.T(6,8,2) = 1;
mdp.R(6,8,2) = 1;

% State 7 transition and reward
mdp.T(7,7,1) = 1;
mdp.R(7,7,1) = 0;
mdp.T(7,7,2) = 1;
mdp.R(7,7,2) = 0;

% State 8 transition and reward
mdp.T(8,8,1) = 1;
mdp.R(8,8,1) = 0;
mdp.T(8,8,2) = 1;
mdp.R(8,8,2) = 0;

Specify the terminal states of the model.

mdp.TerminalStates = ["s7";"s8"];

Use state2idx to obtain the index associated with "s7".

state2idx(mdp,"s7")
ans = 7

You can now use rlMDPEnv to convert mdp into an environment object for which you can train and simulate your agents.
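The conversion can be sketched as follows. This sketch assumes the usual Reinforcement Learning Toolbox environment interface (reset and step), with step taking the action index:

```matlab
% Create an environment from the MDP object.
env = rlMDPEnv(mdp);

% Reset the environment to its initial state, then take one step
% using the first action ("up"). The observation is the index of
% the resulting state.
initialObs = reset(env);
[obs,reward,isDone] = step(env,1);
```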

Input Arguments


states — Model states, specified as one of the following:

  • Positive integer — Specify the number of model states. In this case, each state has a default name, such as "s1" for the first state.

  • String vector — Specify the state names. In this case, the total number of states is equal to the length of the vector.

Example: ["America";"Europe";"China"];

actions — Model actions, specified as one of the following:

  • Positive integer — Specify the number of model actions. In this case, each action has a default name, such as "a1" for the first action.

  • String vector — Specify the action names. In this case, the total number of actions is equal to the length of the vector.

Example: ["GoWest";"GoEast"];
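Both input arguments can also be given as string vectors of names rather than counts; a minimal sketch:

```matlab
% Create an MDP with three named states and two named actions.
mdp = createMDP(["America";"Europe";"China"],["GoWest";"GoEast"]);

% The States and Actions properties now hold the supplied names.
mdp.States
mdp.Actions
```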

Output Arguments


MDP model, returned as a GenericMDP object with these properties.

CurrentState — Name of the current state, specified as a string.

Example: "Europe";

States — State names, specified as a string vector with length equal to the number of states.

Example: ["America";"Europe";"China"];

Actions — Action names, specified as a string vector with length equal to the number of actions.

Example: ["GoWest";"GoEast"];

T — State transition matrix, specified as a 3-D array, which determines the possible movements of the agent in an environment. The state transition matrix T is a probability matrix that indicates the likelihood of the agent moving from the current state s to any possible next state s' when performing action a. T is an S-by-S-by-A array, where S is the number of states and A is the number of actions. It is given by:

T(s,s',a) = probability(s'|s,a)

The transition probabilities out of a nonterminal state s for a given action must sum to either one or zero. Therefore, you must specify all stochastic transitions out of a given state at the same time.

For example, to indicate that in state 1 following action 4 there is an equal probability of moving to states 2 or 3, use this command:

mdp.T(1,[2 3],4) = [0.5 0.5];

You can also specify that, following an action, there is some probability of remaining in the same state.

mdp.T(1,[1 2 3 4],1) = [0.25 0.25 0.25 0.25];

Example: mdp.T(1,[1 2 3],1) = [0.25 0.5 0.25] assigns the first three elements of the first row of T in the object mdp.
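Because a full stochastic row must be assigned at once, a quick consistency check is to confirm that the row sums to one within ProbabilityTolerance. A minimal sketch, using a fresh five-state, four-action MDP:

```matlab
% In state 1, action 4 moves to state 2 or state 3 with equal probability.
m = createMDP(5,4);
m.T(1,[2 3],4) = [0.5 0.5];

% Verify that the probabilities out of state 1 under action 4 sum to one.
assert(abs(sum(m.T(1,:,4)) - 1) <= m.ProbabilityTolerance)
```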

R — Reward transition matrix, specified as a 3-D array, which determines how much reward the agent receives after performing an action in the environment. R has the same shape and size as the state transition matrix T. The reward for moving from state s to state s' by performing action a is given by:

r = R(s,s',a).

Example: mdp.R(1,[1 2 3],1) = [-1 0.5 2] assigns the first three elements of the first row of R in the object mdp.

TerminalStates — Terminal state names, specified as a string vector of state names.

Example: mdp.TerminalStates = "s3" assigns the name "s3" to the terminal state in the object mdp.

ProbabilityTolerance — Tolerance for the sum of probabilities along a row of the transition matrix.

Because the sum of the entries along a row of the transition matrix represents the total probability of moving out of the state indexed by the row number, each row must sum to either one or zero, within the tolerance specified in ProbabilityTolerance. If this condition is not met, an error is thrown.

To set transition probabilities, first set an entire row to zero, then set all the nonzero probabilities at once. For an example, see createGridWorld. Alternatively, copy the transition matrix into a variable, modify the variable, and then assign it back as the transition matrix of your MDP object.

Example: mdp.ProbabilityTolerance = 1e-15; sets to 1e-15 the probability tolerance of the object mdp.
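The copy-modify-assign workflow described above can be sketched as:

```matlab
% Start from a deterministic transition out of state 1.
m = createMDP(3,1);
m.T(1,2,1) = 1;

% Copy the transition matrix, clear the old row, set the new
% transition, and assign the matrix back in a single step.
T = m.T;
T(1,:,1) = 0;
T(1,3,1) = 1;
m.T = T;
```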

Version History

Introduced in R2019a