
createMDP

Create Markov decision process object

Description

A Markov decision process (MDP) is a discrete-time stochastic control process in which the state and action spaces are finite and stochastic rules govern state transitions. MDPs are useful for studying optimization problems solved using reinforcement learning. Use the createMDP function to create a GenericMDP object with the states and transitions you specify. You can then modify some of the object properties and pass the object to rlMDPEnv to create an environment that agents can interact with.

mdp = createMDP(states,actions) creates a Markov decision process object with the specified states and actions.


Examples


Create a GenericMDP object with eight states and two possible actions.

mdp = createMDP(8,["up";"down"])
mdp = 
  GenericMDP with properties:

            CurrentState: "s1"
                  States: [8×1 string]
                 Actions: [2×1 string]
                       T: [8×8×2 double]
                       R: [8×8×2 double]
          TerminalStates: [0×1 string]
    ProbabilityTolerance: 8.8818e-16

Specify the state transitions and their associated rewards.

% State 1 transition and reward
mdp.T(1,2,1) = 1;
mdp.R(1,2,1) = 3;
mdp.T(1,3,2) = 1;
mdp.R(1,3,2) = 1;

% State 2 transition and reward
mdp.T(2,4,1) = 1;
mdp.R(2,4,1) = 2;
mdp.T(2,5,2) = 1;
mdp.R(2,5,2) = 1;

% State 3 transition and reward
mdp.T(3,5,1) = 1;
mdp.R(3,5,1) = 2;
mdp.T(3,6,2) = 1;
mdp.R(3,6,2) = 4;

% State 4 transition and reward
mdp.T(4,7,1) = 1;
mdp.R(4,7,1) = 3;
mdp.T(4,8,2) = 1;
mdp.R(4,8,2) = 2;

% State 5 transition and reward
mdp.T(5,7,1) = 1;
mdp.R(5,7,1) = 1;
mdp.T(5,8,2) = 1;
mdp.R(5,8,2) = 9;

% State 6 transition and reward
mdp.T(6,7,1) = 1;
mdp.R(6,7,1) = 5;
mdp.T(6,8,2) = 1;
mdp.R(6,8,2) = 1;

% State 7 transition and reward
mdp.T(7,7,1) = 1;
mdp.R(7,7,1) = 0;
mdp.T(7,7,2) = 1;
mdp.R(7,7,2) = 0;

% State 8 transition and reward
mdp.T(8,8,1) = 1;
mdp.R(8,8,1) = 0;
mdp.T(8,8,2) = 1;
mdp.R(8,8,2) = 0;

Specify the terminal states of the model.

mdp.TerminalStates = ["s7";"s8"];

Use state2idx to obtain the index associated with "s7".

state2idx(mdp,"s7")
ans = 7

You can now use rlMDPEnv to convert mdp into an environment object for which you can train and simulate your agents.
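The conversion can be sketched as follows. This sketch assumes the usual Reinforcement Learning Toolbox environment interface (reset and step), with step taking the action index:

```matlab
% Create an environment from the MDP object.
env = rlMDPEnv(mdp);

% Reset the environment to its initial state, then take one step
% using the first action ("up"). The observation is the index of
% the resulting state.
initialObs = reset(env);
[obs,reward,isDone] = step(env,1);
```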

Input Arguments


states — Model states, specified as one of the following:

  • Positive integer — Specify the number of model states. In this case, each state has a default name, such as "s1" for the first state.

  • String vector — Specify the state names. In this case, the total number of states is equal to the length of the vector.

Example: ["America";"Europe";"China"];

actions — Model actions, specified as one of the following:

  • Positive integer — Specify the number of model actions. In this case, each action has a default name, such as "a1" for the first action.

  • String vector — Specify the action names. In this case, the total number of actions is equal to the length of the vector.

Example: ["GoWest";"GoEast"];
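Both input arguments can also be given as string vectors of names rather than counts; a minimal sketch:

```matlab
% Create an MDP with three named states and two named actions.
mdp = createMDP(["America";"Europe";"China"],["GoWest";"GoEast"]);

% The States and Actions properties now hold the supplied names.
mdp.States
mdp.Actions
```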

Output Arguments


MDP model, returned as a GenericMDP object with these properties.

CurrentState — Name of the current state, specified as a string.

Example: "Europe";

States — State names, specified as a string vector with length equal to the number of states.

Example: ["America";"Europe";"China"];

Actions — Action names, specified as a string vector with length equal to the number of actions.

Example: ["GoWest";"GoEast"];

T — State transition matrix, specified as a 3-D array, which determines the possible movements of the agent in an environment. The state transition matrix T is a probability matrix that indicates the likelihood of the agent moving from the current state s to any possible next state s' when performing action a. T is an S-by-S-by-A array, where S is the number of states and A is the number of actions. It is given by:

T(s,s',a) = probability(s'|s,a)

The transition probabilities out of a nonterminal state s for a given action must sum to either one or zero. Therefore, you must specify all stochastic transitions out of a given state at the same time.

For example, to indicate that in state 1 following action 4 there is an equal probability of moving to states 2 or 3, use this command:

mdp.T(1,[2 3],4) = [0.5 0.5];

You can also specify that, following an action, there is some probability of remaining in the same state.

mdp.T(1,[1 2 3 4],1) = [0.25 0.25 0.25 0.25];

Example: mdp.T(1,[1 2 3],1) = [0.25 0.5 0.25] assigns the first three elements of the first row of T in the object mdp.
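Because a full stochastic row must be assigned at once, a quick consistency check is to confirm that the row sums to one within ProbabilityTolerance. A minimal sketch, using a fresh five-state, four-action MDP:

```matlab
% In state 1, action 4 moves to state 2 or state 3 with equal probability.
m = createMDP(5,4);
m.T(1,[2 3],4) = [0.5 0.5];

% Verify that the probabilities out of state 1 under action 4 sum to one.
assert(abs(sum(m.T(1,:,4)) - 1) <= m.ProbabilityTolerance)
```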

R — Reward transition matrix, specified as a 3-D array, which determines how much reward the agent receives after performing an action in the environment. R has the same shape and size as the state transition matrix T. The reward for moving from state s to state s' by performing action a is given by:

r = R(s,s',a).

Example: mdp.R(1,[1 2 3],1) = [-1 0.5 2] assigns the first three elements of the first row of R in the object mdp.

TerminalStates — Terminal state names, specified as a string vector of state names.

Example: mdp.TerminalStates = "s3" assigns the name "s3" to the terminal state in the object mdp.

ProbabilityTolerance — Tolerance for the sum of probabilities along a row of the transition matrix.

Because the sum of the entries along a row of the transition matrix represents the total probability of moving out of the state indexed by the row number, each row must sum to either one or zero, within the tolerance specified in ProbabilityTolerance. If this condition is not met, an error is thrown.

To set transition probabilities, first set an entire row to zero, then set all the nonzero probabilities at once. For an example, see createGridWorld. Alternatively, copy the transition matrix into a variable, modify the variable, and then assign it back as the transition matrix of your MDP object.

Example: mdp.ProbabilityTolerance = 1e-15; sets to 1e-15 the probability tolerance of the object mdp.
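The copy-modify-assign workflow described above can be sketched as:

```matlab
% Start from a deterministic transition out of state 1.
m = createMDP(3,1);
m.T(1,2,1) = 1;

% Copy the transition matrix, clear the old row, set the new
% transition, and assign the matrix back in a single step.
T = m.T;
T(1,:,1) = 0;
T(1,3,1) = 1;
m.T = T;
```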

Version History

Introduced in R2019a