Train DQN Agent with LSTM Network to Control House Heating System

This example shows how to train a deep Q-learning network (DQN) agent with a Long Short-Term Memory (LSTM) network to control a house heating system modeled in Simscape®. For more information on DQN agents, see Deep Q-Network (DQN) Agents.

House Heating Model

The reinforcement learning (RL) environment for this example uses a model based on the House Heating System (Simscape) example. The model in this RL example contains a heater, a thermostat controlled by an RL agent, a house, outside temperatures, and a reward function. Heat is transferred between the outside environment and the interior of the home through the walls, windows, and roof. Weather station data from the MathWorks® campus in Natick, MA is used to simulate the outside temperature between March 21st and April 15th, 2022. ThingSpeak™ was used to obtain the data. The data file, "temperatureMar21toApr15_2022.mat", is located in this example folder. For more information about the data acquisition, see Compare Temperature Data from Three Different Days (ThingSpeak).

The training goal for the agent is to minimize the energy cost and maximize the comfort of the room by turning the heater on and off. The house is comfortable when the room temperature $T_{room}$ is between $T_{comfortMin}$ and $T_{comfortMax}$.

  • The observation is a 6-dimensional column vector that consists of the room temperature (C), the outside temperature (C), the maximum comfort temperature (C), the minimum comfort temperature (C), the last action, and the price per kWh (USD). In this example, the maximum comfort temperature, the minimum comfort temperature, and the price per kWh do not change over time, so they are not strictly necessary for training the agent. However, you can extend this example by varying these values over time.

  • The action is discrete: the agent either turns the heater on or off, A = {0, 1}, where 0 is off and 1 is on.

  • The reward consists of three parts: an energy cost, a comfort level reward, and a switching penalty. These three terms have different units, so you are expected to balance them, especially the energy cost against the comfort level, by adjusting the coefficients of the terms. The reward function, shown below and sketched in code after this list, is inspired by [1].

$$\textrm{reward} = \textrm{comfortReward} + \textrm{switchPenalty} - \textrm{energyCost}$$

$$\textrm{comfortReward} = \begin{cases} 0.1 & \textrm{if } T_{comfortMin} \le T_{room} \le T_{comfortMax} \\ -w\,|T_{room}-T_{comfortMin}| & \textrm{if } T_{room} < T_{comfortMin} \\ -w\,|T_{room}-T_{comfortMax}| & \textrm{if } T_{room} > T_{comfortMax} \end{cases}$$

where $w = 0.1$, $T_{comfortMin} = 18$, and $T_{comfortMax} = 23$.

$$\textrm{switchPenalty} = \begin{cases} -0.01 & \textrm{if } a_t \ne a_{t-1} \\ 0 & \textrm{otherwise} \end{cases}$$

where $a_t$ is the current action and $a_{t-1}$ is the previous action.

$$\textrm{energyCost} = \textrm{CostPerStep} = \textrm{PricePerKwh} \times \textrm{ElectricityUsed}$$

  • The IsDone signal is always 0, which means there is no early termination condition.
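
The following MATLAB function is a minimal sketch of the reward computation, included only to clarify the formulas above. The function name and signature are illustrative assumptions; in this example the reward is actually computed by the reward function inside the Simulink model.

function r = hSketchComputeReward(Troom,action,prevAction,pricePerKwh,electricityUsed)
% hSketchComputeReward  Illustrative (hypothetical) reward computation.
    w = 0.1; TcomfortMin = 18; TcomfortMax = 23;

    % Comfort reward: small bonus inside the comfort band, penalty outside.
    if Troom < TcomfortMin
        comfortReward = -w*abs(Troom - TcomfortMin);
    elseif Troom > TcomfortMax
        comfortReward = -w*abs(Troom - TcomfortMax);
    else
        comfortReward = 0.1;
    end

    % Switching penalty: discourage toggling the heater on and off.
    if action ~= prevAction
        switchPenalty = -0.01;
    else
        switchPenalty = 0;
    end

    % Energy cost incurred during this step.
    energyCost = pricePerKwh*electricityUsed;

    r = comfortReward + switchPenalty - energyCost;
end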

Open the model, specify the agent block path, and set the sample time and the maximum number of steps per episode.

% For reproducibility
rng(0); 

% Open the model
mdl = 'rlHouseHeatingSystem';
open_system(mdl)

% Assign the agent block path information.
agentBlk = [mdl '/Smart Thermostat/RL Agent'];
sampleTime = 120; % seconds
maxStepsPerEpisode = 1000;

Load the outside temperature data used to simulate the environment temperature. Reserve the first day (March 21st) and the last day (April 15th) for validation, and use the remaining data for training.

data = load('temperatureMar21toApr15_2022.mat');
temperatureData = data.temperatureData;
temperatureMarch21 = temperatureData(1:60*24,:);         % For validation
temperatureApril15 = temperatureData(end-60*24+1:end,:); % For validation
temperatureData = temperatureData(60*24+1:end-60*24,:);  % For training 
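
Optionally, plot the training portion of the data to inspect the temperature profile. This is a quick sanity check only; it assumes the second column of temperatureData holds the temperature in degrees Celsius (consistent with the offset applied to that column later in this example) and that the data is sampled once per minute.

% Quick look at the training temperature data (column 2 assumed to hold
% the temperature in degrees C, sampled once per minute).
figure
plot(temperatureData(:,2))
xlabel("Time (minutes)")
ylabel("Outside temperature (C)")
title("Training temperature data")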

The Simulink® model loads the following variables as part of the observations.

outsideTemperature = temperatureData;
comfortMax = 23;
comfortMin = 18;

Define observation and action specifications.

% Define observation spec
obsInfo = rlNumericSpec([6,1]);

% Define action spec:  0 --- off,  1 --- on
actInfo = rlFiniteSetSpec([0,1]);

Create DQN Agent with LSTM Network

A DQN agent approximates the discounted cumulative long-term reward using a vector Q-value function critic. To approximate the Q-value function within the critic, the DQN agent in this example uses an LSTM network, which can capture the effect of previous observations. By setting the UseRNN option in rlAgentInitializationOptions, you can create a default DQN agent with an LSTM network. Alternatively, you can configure the LSTM network manually; see the Water Distribution System Scheduling Using Reinforcement Learning example for how to create an LSTM network for a DQN agent manually. For more information about LSTM layers, see Long Short-Term Memory Networks. Note that you must set SequenceLength to a value greater than 1 in rlDQNAgentOptions. During training, this option determines the length of the experience sequences in each mini-batch used to compute the gradient.
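
For reference, the following is a minimal sketch of how you might define an LSTM-based critic manually instead of relying on the default agent. The layer sizes are illustrative assumptions and do not necessarily match the architecture that the default agent creates; the rest of this example uses the default agent created by the code below.

% Sketch of a manually defined LSTM critic (illustrative layer sizes).
% Not used in the rest of this example.
criticNet = [
    sequenceInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(64)
    reluLayer
    lstmLayer(64)
    fullyConnectedLayer(numel(actInfo.Elements))
    ];
criticNet = dlnetwork(criticNet);
critic = rlVectorQValueFunction(criticNet,obsInfo,actInfo);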

% Specify optimizer options for the critic.
criticOpts = rlOptimizerOptions( ...
    LearnRate=0.001, ...
    GradientThreshold=1);

% Specify the DQN agent options.
agentOpts = rlDQNAgentOptions(...
    UseDoubleDQN = false, ...
    TargetSmoothFactor = 1, ...
    TargetUpdateFrequency = 4, ...
    ExperienceBufferLength = 1e6, ...
    CriticOptimizerOptions = criticOpts, ...
    MiniBatchSize = 64);

% Use a slow epsilon decay so the agent continues to explore during training.
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.0001;

% Create a default DQN agent that uses an LSTM-based critic.
useRNN = true;
initOpts = rlAgentInitializationOptions( ...
    UseRNN=useRNN, ...
    NumHiddenUnit=64);
if useRNN
    % Train on sequences of 20 consecutive experiences.
    agentOpts.SequenceLength = 20;
end
agent = rlDQNAgent(obsInfo, actInfo, initOpts, agentOpts);
agent.SampleTime = sampleTime;
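
Optionally, inspect the critic network that the default agent created. getCritic and getModel are Reinforcement Learning Toolbox functions; displaying the Layers property is just a quick check that the critic contains an LSTM layer.

% Display the layers of the LSTM-based critic created by the default agent.
critic = getCritic(agent);
criticNet = getModel(critic);
criticNet.Layers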

Define Simulink Environment

Create an environment interface for the house heating environment.

% Define simulink environment
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);

Use hRLHeatingSystemResetFcn to reset the environment at the beginning of each episode. hRLHeatingSystemResetFcn randomly selects a start time between March 22nd and April 14th, which the environment uses as the initial time for the outside temperature data.

env.ResetFcn = @(in) hRLHeatingSystemResetFcn(in);
validateEnvironment(env)
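
For reference, the following is a minimal sketch of what such a reset function could look like. The variable name passed to setVariable is a hypothetical placeholder; the actual hRLHeatingSystemResetFcn is provided with the example files.

% Sketch of a reset function that randomizes the start time within the
% training temperature data (March 22nd through April 14th).
function in = hSketchResetFcn(in)
    minutesPerDay = 60*24;
    numTrainingDays = 24;                            % March 22nd through April 14th
    startMinute = randi((numTrainingDays-1)*minutesPerDay);
    in = setVariable(in,"startMinute",startMinute);  % hypothetical model variable
end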

Train Agent

To train the agent, first specify the training options. For this example, use the following options.

  • Run training for at most 150 episodes, with each episode lasting at most 1000 time steps.

  • Set the Plots option to "training-progress", which displays training progress in the Reinforcement Learning Episode Manager.

  • Set the Verbose option to false to disable the command-line display.

  • Stop training when the agent receives an average cumulative reward greater than 85 over 5 consecutive episodes.

For more information, see rlTrainingOptions.

maxEpisodes = 150;

trainOpts = rlTrainingOptions(...
    MaxEpisodes = maxEpisodes, ...
    MaxStepsPerEpisode = maxStepsPerEpisode, ...
    ScoreAveragingWindowLength = 5,...    
    Verbose = false, ...
    Plots = "training-progress",...
    StopTrainingCriteria = "AverageReward",...
    StopTrainingValue = 85);

Train the agent using the train function. Training this agent is a computationally intensive process that takes several hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load("HeatControlDQNAgent.mat","agent")
end

Simulate DQN Agent

To validate the performance of the trained agent, simulate it within the house heating system. For more information on agent simulation, see rlSimulationOptions and sim.

First, evaluate the agent's performance using the temperature data from March 21st, 2022. The agent did not see this temperature data during training.

% Validate agent using the data from March 21
maxSteps = 720;
validationTemperature = temperatureMarch21;
env.ResetFcn = @(in) hRLHeatingSystemValidateResetFcn(in);
simOptions = rlSimulationOptions(MaxSteps = maxSteps);
experience1 = sim(env,agent,simOptions);

Use the localPlotResults function provided at the end of the script to analyze the performance.

localPlotResults(experience1, maxSteps, comfortMax, comfortMin, sampleTime,1)

[Figure: Temperatures (T_{room}, T_{outside}, T_{comfortMin}, T_{comfortMax}), Total Cost, and Cost per step for the March 21st validation run]

Comfort Temperature violation: 0/1440 minutes, cost: 8.038489 dollars

Next, evaluate the agent's performance using the temperature data from April 15th, 2022. The agent did not see this temperature data during training either.

% Validate agent using the data from April 15
validationTemperature = temperatureApril15;
experience2 = sim(env,agent,simOptions);
localPlotResults( ...
    experience2, ...
    maxSteps, ...
    comfortMax, ...
    comfortMin, ...
    sampleTime,2)

[Figure: Temperatures (T_{room}, T_{outside}, T_{comfortMin}, T_{comfortMax}), Total Cost, and Cost per step for the April 15th validation run]

Comfort Temperature violation: 0/1440 minutes, cost: 8.088640 dollars

Finally, evaluate the agent's performance when the temperature is mild. To create the mild temperature data, add eight degrees to the April 15th temperatures.

% Validate agent using the data from April 15 + 8 degrees
validationTemperature = temperatureApril15;
validationTemperature(:,2) = validationTemperature(:,2) + 8;
experience3 = sim(env,agent,simOptions);
localPlotResults(experience3, ...
    maxSteps, ...
    comfortMax, ...
    comfortMin, ...
    sampleTime, ...
    3)

[Figure: Temperatures (T_{room}, T_{outside}, T_{comfortMin}, T_{comfortMax}), Total Cost, and Cost per step for the mild-temperature validation run]

Comfort Temperature violation: 0/1440 minutes, cost: 1.340312 dollars

Local Function

function localPlotResults(experience, maxSteps, comfortMax, comfortMin, sampleTime, figNum)
    % localPlotResults plots results of validation

    % Compute comfort temperature violation in minutes
    % (each agent step corresponds to sampleTime/60 minutes).
    stepsViolateComfort = ...
        sum(experience.Observation.obs1.Data(1,:,1:maxSteps) < comfortMin) ...
        + sum(experience.Observation.obs1.Data(1,:,1:maxSteps) > comfortMax);
    minutesViolateComfort = stepsViolateComfort*sampleTime/60;
    
    % Cost of energy
    totalCosts = experience.SimulationInfo(1).househeat_output{1}.Values;
    totalCosts.Time = totalCosts.Time/60;
    totalCosts.TimeInfo.Units='minutes';
    totalCosts.Name = "Total Energy Cost";
    finalCost = experience.SimulationInfo(1).househeat_output{1}.Values.Data(end);

    % Cost of energy per step
    costPerStep = experience.SimulationInfo(1).househeat_output{2}.Values;
    costPerStep.Time = costPerStep.Time/60;
    costPerStep.TimeInfo.Units='minutes';    
    costPerStep.Name = "Energy Cost per Step";
    minutes = (sampleTime/60)*(0:maxSteps);

    % Plot results   
    fig = figure(figNum);
    % Increase the height of the figure.
    fig.Position = fig.Position + [0, 0, 0, 200];
    layoutResult = tiledlayout(3,1);

    % Temperatures
    nexttile
    plot(minutes, ...
        reshape(experience.Observation.obs1.Data(1,:,:), ...
        [1,length(experience.Observation.obs1.Data)]),'k')
    hold on
    plot(minutes, ...
        reshape(experience.Observation.obs1.Data(2,:,:), ...
        [1,length(experience.Observation.obs1.Data)]),'g')
    yline(comfortMin,'b')
    yline(comfortMax,'r')
    lgd = legend("T_{room}", "T_{outside}","T_{comfortMin}", ...
        "T_{comfortMax}","location","northoutside");
    lgd.NumColumns = 4;
    title('Temperatures')
    ylabel("Temperature")
    xlabel('Time (minutes)')
    hold off

    % Total cost
    nexttile
    plot(totalCosts)    
    title('Total Cost')
    ylabel("Energy cost")

    % Cost per step
    nexttile
    plot(costPerStep)  
    title('Cost per step')
    ylabel("Energy cost")    
    fprintf("Comfort Temperature violation:" + ...
        " %d/1440 minutes, cost: %f dollars\n", ...
        minutesViolateComfort, finalCost);
end

Reference

[1] Y. Du, F. Li, K. Kurte, J. Munk, and H. Zandi, "Demonstration of Intelligent HVAC Load Management With Deep Reinforcement Learning: Real-World Experience of Machine Learning in Demand Control," IEEE Power and Energy Magazine, vol. 20, no. 3, pp. 42-53, May-June 2022, doi: 10.1109/MPE.2022.3150825.