Train DQN Agent with LSTM Network to Control House Heating System

This example shows how to train a deep Q-learning network (DQN) agent with a Long Short-Term Memory (LSTM) network to control a house heating system modeled in Simscape®. For more information on DQN agents, see Deep Q-Network (DQN) Agents.

House Heating Model

The reinforcement learning (RL) environment for this example uses a model from the House Heating System (Simscape) example. The model in this RL example contains a heater, a thermostat controlled by an RL agent, a house, outside temperatures, and a reward function. Heat is transferred between the outside environment and the interior of the home through the walls, windows, and roof. Weather station data from the MathWorks® campus in Natick, MA is used to simulate the outside temperature between March 21st and April 15th, 2022. The data was obtained using ThingSpeak™ and is stored in "temperatureMar21toApr15_2022.mat", located in this example folder. For more information about the data acquisition, see Compare Temperature Data from Three Different Days (ThingSpeak).

The training goal for the agent is to minimize the energy cost and maximize the comfort of the room by turning the heater on and off. The house is comfortable when the room temperature $T_{\mathrm{room}}$ is between $T_{\mathrm{comfortMin}}$ and $T_{\mathrm{comfortMax}}$.

• The observation is a 6-dimensional column vector consisting of the room temperature (°C), outside temperature (°C), maximum comfort temperature (°C), minimum comfort temperature (°C), last action, and price per kWh (USD). In this example, the maximum comfort temperature, the minimum comfort temperature, and the price per kWh do not change over time, so they are not strictly necessary for training the agent. However, you can extend this example by varying these values over time.

• The action is discrete: turn the heater on or off, $A = \{0, 1\}$, where `0` is off and `1` is on.

• The reward consists of three parts: an energy cost, a comfort reward, and a switching penalty. These terms have different units, so you must balance them, especially the energy cost against the comfort reward, by adjusting their coefficients. The reward function is inspired by [1].

$$\mathrm{reward} = \mathrm{comfortReward} + \mathrm{switchPenalty} - \mathrm{energyCost}$$

$$\mathrm{comfortReward} = \begin{cases} 0.1 & \text{if } T_{\mathrm{comfortMin}} \le T_{\mathrm{room}} \le T_{\mathrm{comfortMax}} \\ -w\,|T_{\mathrm{room}} - T_{\mathrm{comfortMin}}| & \text{if } T_{\mathrm{room}} < T_{\mathrm{comfortMin}} \\ -w\,|T_{\mathrm{room}} - T_{\mathrm{comfortMax}}| & \text{if } T_{\mathrm{room}} > T_{\mathrm{comfortMax}} \end{cases}$$

where $w = 0.1$, $T_{\mathrm{comfortMin}} = 18$, and $T_{\mathrm{comfortMax}} = 23$.

$$\mathrm{switchPenalty} = \begin{cases} -0.01 & \text{if } a_t \ne a_{t-1} \\ 0 & \text{otherwise} \end{cases}$$

where $a_t$ is the current action and $a_{t-1}$ is the previous action.

$$\mathrm{energyCost} = \mathrm{costPerStep} = \mathrm{pricePerKwh} \times \mathrm{electricityUsed}$$

• The IsDone signal is always `0`, meaning there is no early termination condition.
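The reward terms above can be sketched as follows. This is a Python translation for illustration only; in the example itself the reward is computed inside the Simulink model, and the `price_per_kwh` and `electricity_used` values here are placeholders.

```python
def heating_reward(t_room, action, prev_action, electricity_used,
                   t_min=18.0, t_max=23.0, w=0.1, price_per_kwh=0.15):
    """Sketch of reward = comfortReward + switchPenalty - energyCost."""
    # Comfort reward: small bonus inside the comfort band, penalty
    # proportional to the distance outside it
    if t_min <= t_room <= t_max:
        comfort = 0.1
    elif t_room < t_min:
        comfort = -w * abs(t_room - t_min)
    else:
        comfort = -w * abs(t_room - t_max)
    # Switching penalty discourages rapid on/off cycling
    switch = -0.01 if action != prev_action else 0.0
    # Energy cost accrued during this step
    energy_cost = price_per_kwh * electricity_used
    return comfort + switch - energy_cost
```

For example, a room at 20 °C with no switching and no energy use earns the full comfort bonus of 0.1, while a room at 16 °C is penalized 0.1 × 2 = 0.2.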

Open the model and set a sample time.

```
% For reproducibility
rng(0);

% Open the model
mdl = 'rlHouseHeatingSystem';
open_system(mdl)

% Assign the agent block path information.
agentBlk = [mdl '/Smart Thermostat/RL Agent'];

sampleTime = 120; % seconds
maxStepsPerEpisode = 1000;
```

Load the outside temperature data to simulate the environment temperature.

```
data = load('temperatureMar21toApr15_2022.mat');
temperatureData = data.temperatureData;
temperatureMarch21 = temperatureData(1:60*24,:);         % For validation
temperatureApril15 = temperatureData(end-60*24+1:end,:); % For validation
temperatureData = temperatureData(60*24+1:end-60*24,:);  % For training
```
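The slicing above holds out the first and last full day of data (60 samples per hour × 24 hours, one sample per minute) for validation and trains on the days in between. A hypothetical Python equivalent of that split, assuming one row per minute:

```python
SAMPLES_PER_DAY = 60 * 24  # one temperature sample per minute

def split_days(data):
    """Hold out the first and last day for validation; train on the rest.
    data is any sequence with one entry per minute."""
    first_day = data[:SAMPLES_PER_DAY]            # e.g. March 21, validation
    last_day = data[-SAMPLES_PER_DAY:]            # e.g. April 15, validation
    training = data[SAMPLES_PER_DAY:-SAMPLES_PER_DAY]
    return first_day, last_day, training
```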

The Simulink® model loads the following variables as a part of observations.

```
outsideTemperature = temperatureData;
comfortMax = 23;
comfortMin = 18;
```

Define observation and action specifications.

```
% Define observation spec
obsInfo = rlNumericSpec([6,1]);

% Define action spec: 0 --- off, 1 --- on
actInfo = rlFiniteSetSpec([0,1]);
```

Create DQN Agent with LSTM Network

A DQN agent approximates the discounted cumulative long-term reward using a vector Q-value function critic. To approximate the Q-value function within the critic, the DQN agent in this example uses an LSTM network, which can capture the effect of previous observations. To create a default DQN agent with an LSTM network, set the `UseRNN` option in `rlAgentInitializationOptions`. Alternatively, you can configure the LSTM network manually; for an example, see Water Distribution System Scheduling Using Reinforcement Learning. For more information about LSTM layers, see Long Short-Term Memory Networks. Note that you must set `SequenceLength` greater than `1` in `rlDQNAgentOptions`. During training, this option determines the length of the sequences in each minibatch used to compute the gradient.
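To see why `SequenceLength` matters for a recurrent critic, the sketch below (Python, illustrative only; the toolbox handles this internally) chunks a stored trajectory into fixed-length subsequences, which is the general shape of the minibatch data an RNN trains on so that temporal order is preserved within each sequence:

```python
def to_sequences(trajectory, seq_len=20):
    """Split a trajectory (list of transitions) into consecutive
    non-overlapping subsequences of length seq_len, dropping any
    short remainder at the end."""
    return [trajectory[i:i + seq_len]
            for i in range(0, len(trajectory) - seq_len + 1, seq_len)]
```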

```
criticOpts = rlOptimizerOptions( ...
    LearnRate=0.001, ...
    GradientThreshold=1);

agentOpts = rlDQNAgentOptions(...
    UseDoubleDQN = false, ...
    TargetSmoothFactor = 1, ...
    TargetUpdateFrequency = 4, ...
    ExperienceBufferLength = 1e6, ...
    CriticOptimizerOptions = criticOpts, ...
    MiniBatchSize = 64);
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.0001;

useRNN = true;
initOpts = rlAgentInitializationOptions( ...
    UseRNN=useRNN, ...
    NumHiddenUnit=64);
if useRNN
    agentOpts.SequenceLength = 20;
end

agent = rlDQNAgent(obsInfo, actInfo, initOpts, agentOpts);
agent.SampleTime = sampleTime;
```
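The `EpsilonDecay` option controls how quickly exploration shrinks: at each agent step, epsilon is reduced multiplicatively toward its minimum value. A Python sketch of that schedule, assuming illustrative starting and minimum values of 1.0 and 0.01 (the toolbox's own defaults and update details may differ):

```python
def epsilon_schedule(n_steps, epsilon=1.0, epsilon_min=0.01, decay=1e-4):
    """Multiplicative epsilon-greedy decay: at each step,
    epsilon <- epsilon * (1 - decay), floored at epsilon_min."""
    out = []
    for _ in range(n_steps):
        out.append(epsilon)
        if epsilon > epsilon_min:
            epsilon = max(epsilon_min, epsilon * (1 - decay))
    return out
```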

Create an environment interface for the house heating environment.

```
% Define Simulink environment
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
```

Use `hRLHeatingSystemResetFcn` to reset the environment at the beginning of each episode. `hRLHeatingSystemResetFcn` randomly selects an initial time between March 22nd and April 14th, and the environment uses this time as the starting point for the outside temperature data.

```
env.ResetFcn = @(in) hRLHeatingSystemResetFcn(in);
validateEnvironment(env)
```

Train Agent

To train the agent, first, specify the training options. For this example, use the following options.

• Run training for at most `150` episodes, with each episode lasting `1000` time steps.

• Set the `Plots` option to `"training-progress"` to display training progress in the Reinforcement Learning Episode Manager.

• Set the `Verbose` option to `false` to disable the command-line display.

• Stop training when the agent receives an average cumulative reward greater than `85` over `5` consecutive episodes.

For more information, see `rlTrainingOptions`.

```
maxEpisodes = 150;
trainOpts = rlTrainingOptions(...
    MaxEpisodes = maxEpisodes, ...
    MaxStepsPerEpisode = maxStepsPerEpisode, ...
    ScoreAveragingWindowLength = 5,...
    Verbose = false, ...
    Plots = "training-progress",...
    StopTrainingCriteria = "AverageReward",...
    StopTrainingValue = 85);
```

Train the agent using the `train` function. Training this agent is a computationally intensive process that takes several hours to complete. To save time while running this example, load a pretrained agent by setting `doTraining` to `false`. To train the agent yourself, set `doTraining` to `true`.

```
doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load("HeatControlDQNAgent.mat","agent")
end
```

Simulate DQN Agent

To validate the performance of the trained agent, simulate it within the house heating system. For more information on agent simulation, see `rlSimulationOptions` and `sim`.

First, evaluate the agent's performance using the temperature data from March 21st, 2022. The agent did not see this temperature data during training.

```
% Validate agent using the data from March 21
maxSteps = 720;
validationTemperature = temperatureMarch21;
env.ResetFcn = @(in) hRLHeatingSystemValidateResetFcn(in);
simOptions = rlSimulationOptions(MaxSteps = maxSteps);
experience1 = sim(env,agent,simOptions);
```

Use the `localPlotResults` function provided at the end of the script to analyze the performance.

`localPlotResults(experience1, maxSteps, comfortMax, comfortMin, sampleTime,1)`

```Comfort Temperature violation: 0/1440 minutes, cost: 8.038489 dollars ```

Next, evaluate the agent's performance using the temperature data from April 15th, 2022, which was also held out from training.

```
% Validate agent using the data from April 15
validationTemperature = temperatureApril15;
experience2 = sim(env,agent,simOptions);
localPlotResults( ...
    experience2, ...
    maxSteps, ...
    comfortMax, ...
    comfortMin, ...
    sampleTime,2)
```

```Comfort Temperature violation: 0/1440 minutes, cost: 8.088640 dollars ```

Finally, evaluate the agent's performance when the outside temperature is mild. To create mild-temperature data, add eight degrees to the temperature data from April 15th.

```
% Validate agent using the data from April 15 + 8 degrees
validationTemperature = temperatureApril15;
validationTemperature(:,2) = validationTemperature(:,2) + 8;
experience3 = sim(env,agent,simOptions);
localPlotResults(experience3, ...
    maxSteps, ...
    comfortMax, ...
    comfortMin, ...
    sampleTime, ...
    3)
```

```Comfort Temperature violation: 0/1440 minutes, cost: 1.340312 dollars ```

Local Function

```
function localPlotResults(experience, maxSteps, comfortMax, comfortMin, sampleTime, figNum)
% localPlotResults plots the results of validation

% Compute comfort temperature violation
minutesViolateComfort = ...
    sum(experience.Observation.obs1.Data(1,:,1:maxSteps) < comfortMin) ...
    + sum(experience.Observation.obs1.Data(1,:,1:maxSteps) > comfortMax);

% Cost of energy
totalCosts = experience.SimulationInfo(1).househeat_output{1}.Values;
totalCosts.Time = totalCosts.Time/60;
totalCosts.TimeInfo.Units = 'minutes';
totalCosts.Name = "Total Energy Cost";
finalCost = experience.SimulationInfo(1).househeat_output{1}.Values.Data(end);

% Cost of energy per step
costPerStep = experience.SimulationInfo(1).househeat_output{2}.Values;
costPerStep.Time = costPerStep.Time/60;
costPerStep.TimeInfo.Units = 'minutes';
costPerStep.Name = "Energy Cost per Step";

minutes = (sampleTime/60)*(0:maxSteps);

% Plot results
fig = figure(figNum);
% Change the size of the figure
fig.Position = fig.Position + [0, 0, 0, 200];

% Temperatures
layoutResult = tiledlayout(3,1);
nexttile
plot(minutes, ...
    reshape(experience.Observation.obs1.Data(1,:,:), ...
    [1,length(experience.Observation.obs1.Data)]),'k')
hold on
plot(minutes, ...
    reshape(experience.Observation.obs1.Data(2,:,:), ...
    [1,length(experience.Observation.obs1.Data)]),'g')
yline(comfortMin,'b')
yline(comfortMax,'r')
lgd = legend("T_{room}", "T_{outside}", "T_{comfortMin}", ...
    "T_{comfortMax}", "location", "northoutside");
lgd.NumColumns = 4;
title('Temperatures')
ylabel("Temperature")
xlabel('Time (minutes)')
hold off

% Total cost
nexttile
plot(totalCosts)
title('Total Cost')
ylabel("Energy cost")

% Cost per step
nexttile
plot(costPerStep)
title('Cost per step')
ylabel("Energy cost")

fprintf("Comfort Temperature violation:" + ...
    " %d/1440 minutes, cost: %f dollars\n", ...
    minutesViolateComfort, finalCost);
end
```

Reference

[1]. Y. Du, F. Li, K. Kurte, J. Munk and H. Zandi, "Demonstration of Intelligent HVAC Load Management With Deep Reinforcement Learning: Real-World Experience of Machine Learning in Demand Control," in IEEE Power and Energy Magazine, vol. 20, no. 3, pp. 42-53, May-June 2022, doi: 10.1109/MPE.2022.3150825.