Abstract
This paper develops deep reinforcement learning (DRL) algorithms for optimizing the operation of a home energy system consisting of photovoltaic (PV) panels, a battery energy storage system, and household appliances. Model-free DRL algorithms can efficiently handle the difficulty of energy system modeling and the uncertainty of PV generation. However, the discrete-continuous hybrid action space of the considered home energy system challenges existing DRL algorithms designed for either discrete actions or continuous actions. Thus, a mixed deep reinforcement learning (MDRL) algorithm is proposed, which integrates the deep Q-learning (DQL) algorithm and the deep deterministic policy gradient (DDPG) algorithm: the DQL algorithm deals with discrete actions, while the DDPG algorithm handles continuous actions. The MDRL algorithm learns the optimal strategy through trial-and-error interactions with the environment. However, unsafe actions, which violate system constraints, can incur high costs. To handle this problem, a safe-MDRL algorithm is further proposed. Simulation studies demonstrate that the proposed MDRL algorithm efficiently handles the challenge of the discrete-continuous hybrid action space for home energy management. Compared with benchmark algorithms on the test dataset, the MDRL algorithm reduces the operation cost while maintaining human thermal comfort. Moreover, the safe-MDRL algorithm greatly reduces the loss of thermal comfort in the learning stage compared with the MDRL algorithm.
DEMAND response (DR), which offers consumers the opportunity to change their consumption patterns in response to incentives or electricity prices to balance power demand and power supply, is considered an integral part of the smart grid [
In the residential sector, price-based DR programs, including the time-of-use (TOU) pricing program and the real-time (RT) pricing program, are the most frequently studied [
The objectives of HEMSs are usually to minimize electricity cost and maximize human comfort [
With model-based methods, a numerical model is required to characterize the home energy system, and an optimization problem is formulated considering the objective and system constraints [
With the advancement of artificial intelligence, model-free methods based on reinforcement learning (RL) have been developed for home energy management [
The RL-based methods learn the optimal decision-making strategy by iteratively interacting with the energy system and thus do not require prior knowledge of the energy system [
In many practical engineering problems, however, unsafe actions, which violate system constraints, can lead to system damage and high costs, especially during the learning stage [
This paper investigates deep reinforcement learning (DRL) based optimization algorithm for HEMS. The main contributions of the paper are outlined below.
1) The operation cost optimization problem of a grid-connected home energy system including various household appliances, e.g., HVAC system, washing machine, dish washer, etc., renewable generation, and a battery energy storage system (BESS) is formulated as a Markov decision process (MDP) without the prediction of unknown variables or a thermal dynamic model. The operation modes of household appliances and the BESS constitute a discrete-continuous hybrid action space for the MDP, which challenges existing RL algorithms designed for either a discrete action space or a continuous action space.
2) A mixed deep reinforcement learning (MDRL) algorithm that integrates DQL and DDPG is developed to solve the MDP. The proposed MDRL algorithm inherits the merits of DQL in handling a discrete action space and takes advantage of DDPG in dealing with a continuous action space. More precisely, the MDRL algorithm leverages the actor-critic framework as in the DDPG algorithm. The actor network of the proposed MDRL algorithm, however, receives the state and a discrete action as input and outputs the continuous action. The critic network evaluates the combination of discrete action and continuous action for the given state. Similar to DQL, the optimal combination of discrete action and continuous action is determined by selecting the one that maximizes the Q-value. Meanwhile, to facilitate the training of the proposed MDRL algorithm, a special exploration policy is designed for the discrete-continuous hybrid action space.
3) To avoid high loss of human thermal comfort with the HVAC system in the learning stage, a prediction model guided safe-MDRL algorithm is further proposed. In the safe-MDRL algorithm, an online prediction model is developed and applied to evaluate actions associated with the HVAC system to avoid severe violation of thermal constraints.
4) Simulation studies based on real data illustrate that the proposed MDRL algorithm can efficiently reduce the operation cost while maintaining human thermal comfort compared with benchmark algorithms on the test dataset. Moreover, the safe-MDRL algorithm greatly reduces the loss of human thermal comfort in the learning stage compared with the MDRL algorithm.
The remainder of the paper is organized as follows. In Section II, the HEMS is introduced with mathematical formulations. In Section III, the optimization problem of HEMS is firstly formulated as an MDP, which is followed by the development of the proposed MDRL algorithm and its safe version. Simulation results are provided in Section IV, and conclusions are given in Section V.
The HEMS considered in this paper is illustrated in Fig. 1.

Fig. 1 Considered HEMS.
The home is equipped with PV panels, BESS, and household appliances. The household appliances can be generally classified into non-shiftable loads, shiftable and non-interruptible loads, and controllable loads in terms of their characteristics [
Consider a set of N shiftable and non-interruptible loads. Each load $n \in \{1, 2, \ldots, N\}$ is characterized by a tuple $(T_{n,\mathrm{ini}}, T_{n,\mathrm{end}}, T_{n,d}, P_n)$, where $T_{n,\mathrm{ini}}$ and $T_{n,\mathrm{end}}$ are the initial time and end time of the working period, respectively; $T_{n,d}$ is the number of time slots required to complete the task; and $P_n$ is the power demand. Shiftable and non-interruptible loads have two operation modes, i.e., "on" and "off". The total power demand of these appliances in time slot t is obtained by:
$P_{\mathrm{SN},t} = \sum_{n=1}^{N} x_{n,t} P_n$ (1)
where $x_{n,t}$ is a binary decision variable for appliance n, with 1/0 corresponding to "on"/"off", respectively. The operation of shiftable and non-interruptible loads should satisfy the following constraints:
$x_{n,t} = 0 \quad \forall t < T_{n,\mathrm{ini}}$ (2)
$x_{n,t} = 1 \quad \text{if } T_{n,\mathrm{end}} - t + 1 \le T_{n,t-1}^{\mathrm{rem}}$ (3)
$x_{n,t} \ge x_{n,t-1} \quad \text{if } T_{n,t-1}^{\mathrm{rem}} > 0$ (4)
$x_{n,t} = 0 \quad \text{if } T_{n,t-1}^{\mathrm{rem}} = 0$ (5)
where $T_{n,t}^{\mathrm{rem}}$ is the remaining number of time slots required to complete the task of appliance n at the end of time slot t, satisfying $T_{n,T_{n,\mathrm{ini}}-1}^{\mathrm{rem}} = T_{n,d}$ and $T_{n,t}^{\mathrm{rem}} = T_{n,t-1}^{\mathrm{rem}} - x_{n,t}$. The constraint (2) ensures that the appliance is "off" before the initial time of the working period; the constraint (3) enforces the starting of the task to ensure its completion within the working period; the constraint (4) ensures non-interruption of the task; and the constraint (5) enforces the appliance to be "off" once the task has been completed.
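As an illustration of how constraints (2)-(5) restrict the on/off decisions, the following minimal Python sketch (the class layout, function names, and bookkeeping are illustrative assumptions, not the paper's implementation) projects a requested action onto the feasible set and tracks the remaining task duration.

```python
from dataclasses import dataclass

@dataclass
class ShiftableLoad:
    t_ini: int        # initial time slot of the working period
    t_end: int        # end time slot of the working period
    t_d: int          # number of time slots needed to finish the task
    power: float      # rated power demand (kW)
    remaining: int    # remaining time slots, initialized to t_d
    started: bool = False

def apply_constraints(load: ShiftableLoad, t: int, x_requested: int) -> int:
    """Project a requested on/off action onto the feasible set of (2)-(5)."""
    if t < load.t_ini or load.remaining == 0:
        return 0                        # (2) before working period, (5) task finished
    if load.started and load.remaining > 0:
        return 1                        # (4) non-interruptible once started
    if load.t_end - t + 1 <= load.remaining:
        return 1                        # (3) must start now to finish within the period
    return x_requested                  # otherwise follow the agent's decision

def step(load: ShiftableLoad, t: int, x_requested: int) -> float:
    """Return the load's power demand in slot t and update its internal state."""
    x = apply_constraints(load, t, x_requested)
    if x == 1:
        load.started = True
        load.remaining -= 1
    return x * load.power               # contribution to P_SN,t in (1)
```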
This paper considers an HVAC system that can adjust its input power continuously to maintain human thermal comfort. The HVAC input power is constrained by:
$0 \le P_{\mathrm{HVAC},t} \le P_{\mathrm{HVAC}}^{\max}$ (6)
where $P_{\mathrm{HVAC},t}$ and $P_{\mathrm{HVAC}}^{\max}$ are the input power of the HVAC system at t and its maximum power, respectively.
Indoor air conditions such as air temperature, air speed, and relative humidity are essential for the determination of the human thermal comfort level. To simplify the representation of human thermal comfort, a human comfort temperature zone is considered as in [
$K_{\min} \le K_{\mathrm{in},t} \le K_{\max}$ (7)
where $K_{\mathrm{in},t}$ is the indoor temperature at t; and $[K_{\min}, K_{\max}]$ is the human comfort temperature zone. The indoor temperature depends on many factors, including the HVAC input power, the outdoor temperature, and the home thermal dynamics, which are difficult to model. However, a thermal dynamic model for the HVAC system is not required by the proposed MDRL/safe-MDRL algorithm, because the algorithm can learn such dependence from experience by trial-and-error. This demonstrates the advantage of model-free RL algorithms for HVAC system control.
Consider a BESS with a maximum capacity of $B_{\max}$. The dynamics of the BESS in terms of state of charge (SoC) is given by:
$\mathrm{SoC}_t = \mathrm{SoC}_{t-1} + \dfrac{\eta_B P_{B,t} \Delta t}{B_{\max}}$ (8)
where $\mathrm{SoC}_t = B_t / B_{\max}$ is the level of available energy $B_t$ with respect to the BESS capacity; $P_{B,t}$ is the charging (if $P_{B,t} > 0$) or discharging (if $P_{B,t} < 0$) power; $\Delta t$ is the duration of a time slot; and $\eta_B$ is the charging/discharging efficiency, with $\eta_B = \eta_c$ for the charging process and $\eta_B = 1/\eta_d$ for the discharging process.
To sustain the lifespan of the BESS, the following operation constraints are considered:
$-P_{B,d}^{\max} \le P_{B,t} \le P_{B,c}^{\max}$ (9)
$\mathrm{SoC}_{\min} \le \mathrm{SoC}_t \le \mathrm{SoC}_{\max}$ (10)
where $P_{B,c}^{\max}$ and $P_{B,d}^{\max}$ are the limits of the charging and discharging power, respectively; and $\mathrm{SoC}_{\min}$ and $\mathrm{SoC}_{\max}$ are the minimum and maximum levels of SoC, respectively.
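The SoC transition (8) and the limits (9) and (10) can be implemented compactly. The sketch below is a minimal Python illustration under the sign and efficiency conventions stated above; the helper names and the default efficiency values are placeholders, not parameters from the paper.

```python
def soc_update(soc, p_b, dt_h, b_max, eta_c=0.95, eta_d=0.95):
    """SoC transition of (8): p_b > 0 charges, p_b < 0 discharges (kW, kWh, hours)."""
    if p_b >= 0:
        energy = eta_c * p_b * dt_h            # charging loses energy before storage
    else:
        energy = p_b * dt_h / eta_d            # discharging draws extra energy from storage
    return soc + energy / b_max

def clip_battery_power(p_b, soc, dt_h, b_max, p_c_max, p_d_max,
                       soc_min=0.1, soc_max=0.9, eta_c=0.95, eta_d=0.95):
    """Bound p_b so that the power limits (9) and the SoC limits (10) both hold."""
    upper = min(p_c_max, (soc_max - soc) * b_max / (eta_c * dt_h))    # charging headroom
    lower = max(-p_d_max, -(soc - soc_min) * b_max * eta_d / dt_h)    # discharging headroom
    return min(max(p_b, lower), upper)
```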
The home energy system exchanges energy with the utility grid to balance supply and demand:
$P_{g,t} = P_{\mathrm{NS},t} + P_{\mathrm{SN},t} + P_{\mathrm{HVAC},t} + P_{B,t} - P_{\mathrm{PV},t}$ (11)
where $P_{\mathrm{NS},t}$, $P_{\mathrm{PV},t}$, and $P_{g,t}$ are the power demand from non-shiftable loads, the PV generation power, and the power exchanged with the utility grid, respectively. $P_{g,t} > 0$ represents electricity purchased from the utility grid at the TOU electricity price, while $P_{g,t} < 0$ represents surplus energy sold to the utility grid at the fixed feed-in tariff (FT).
The operation cost of the home energy system for each time slot t is given by:
$C_t = u_t P_{g,t} \Delta t + v_B \left| P_{B,t} \right| \Delta t$ (12)
where $u_t$ is the electricity price; and $v_B$ is the degradation cost coefficient of the BESS. In (12), the first term represents the electricity cost, while the second term represents the BESS degradation cost, which is proportional to the charging/discharging power [
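For concreteness, a minimal Python sketch of the per-slot cost (12) is given below, assuming the price convention stated above (TOU price for imports, feed-in tariff for exports); the function and parameter names are illustrative.

```python
def operation_cost(p_grid_kw, p_batt_kw, dt_h, tou_price, feed_in_tariff, v_b):
    """Per-slot cost of (12): energy cost plus BESS degradation cost."""
    if p_grid_kw >= 0:                                       # importing at the TOU price
        energy_cost = tou_price * p_grid_kw * dt_h
    else:                                                    # exporting surplus at the FT
        energy_cost = feed_in_tariff * p_grid_kw * dt_h      # negative, i.e., revenue
    degradation = v_b * abs(p_batt_kw) * dt_h
    return energy_cost + degradation
```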
The objective of the scheduling problem is to minimize the operation cost of the home energy system while maintaining human thermal comfort and satisfying the constraints over the scheduling horizon. This optimization problem is summarized as:
$\min\limits_{x_{n,t},\, P_{\mathrm{HVAC},t},\, P_{B,t}} \sum_{t=1}^{T} C_t \quad \mathrm{s.t.}\ (1)\text{-}(11)$ (13)
Decision variables in (13) include $x_{n,t}$, $P_{\mathrm{HVAC},t}$, and $P_{B,t}$ for $t = 1, 2, \ldots, T$. It is a great challenge to solve this mixed-integer optimization problem due to the following difficulties. Firstly, due to the randomness of PV generation, power demand from non-shiftable loads, and outdoor temperature, it is difficult to make decisions in advance. Secondly, the indoor temperature is affected not only by the input power of the HVAC system but also by the outdoor temperature and the thermal properties of the home, and it is not easy to develop a proper model to describe such dependence. In this paper, DRL algorithms will be developed to solve the optimization problem without a thermal dynamic model for the HVAC system or prediction of unknown variables.
RL is an area of machine learning concerned with how artificial agents take actions in an environment in order to maximize cumulative future rewards. The fundamental principle underlying RL is the MDP. In this section, the formulation of the household sequential scheduling problem as an MDP is first investigated, followed by the development of the MDRL algorithm and its safe version to solve the problem.
An MDP is usually defined by a 4-tuple $(S, A, P, R)$, where S is the state space consisting of a set of environment states; A is a set of actions called the action space; $P: S \times A \times S \to [0, 1]$ is a function which determines the state transition probability considering environment uncertainty; and $R: S \times A \to \mathbb{R}$ is the reward function which returns the immediate reward after a state transition [
Considering the framework of the MDP in Fig. 2, the key elements of the MDP for the HEMS are defined as follows.

Fig. 2 Framework of MDP.
1) State: the state $s_t$ is composed of the information available at the end of time slot t, which reflects the status of the components in the home energy system. It is defined by a high-dimensional vector that includes the hour of day h for time slot t together with the measured status of the system components (e.g., the SoC of the BESS and the indoor temperature). Lagged values of PV generation, non-shiftable loads, and outdoor temperature are also included to capture their patterns of variation.
2) Action: the agent receives the state $s_t$ at the end of time slot t and takes control actions following a policy. The action vector, which consists of the operation modes $x_{n,t+1}$ of the shiftable and non-interruptible loads, the HVAC input power $P_{\mathrm{HVAC},t+1}$, and the BESS power $P_{B,t+1}$, determines the operation of the home energy system for time slot $t+1$. It is noticeable that the action vector consists of both discrete actions and continuous actions. To ensure non-violation of the SoC constraints, $P_{B,t+1}$ should be bounded such that charging cannot drive the SoC above $\mathrm{SoC}_{\max}$ and discharging cannot drive it below $\mathrm{SoC}_{\min}$.
3) State transition: the transitions of $\mathrm{SoC}_t$ and the remaining task time have been discussed in Section II. The transitions of state features including PV generation, non-shiftable loads, and outdoor temperature are random, while the indoor temperature depends not only on the actions but also on the outdoor temperature and home thermal properties. The values of these features indexed at $t+1$ will be taken from observations. The developed DRL algorithms will learn their correlations from the training data to make optimal decisions.
4) Reward: the objective of the HEMS is to minimize the operation cost while maintaining human thermal comfort subject to the constraints. Hence, the reward, which consists of the operation cost and a penalty for temperature deviation from the comfort zone, is given by:
$r_t = -\left( C_{t+1} + \beta \left( \left[ K_{\min} - K_{\mathrm{in},t+1} \right]^{+} + \left[ K_{\mathrm{in},t+1} - K_{\max} \right]^{+} \right) \right)$ (14)
where $[\cdot]^{+} = \max(\cdot, 0)$; and $\beta$ is a parameter which balances the operation cost and the penalty for temperature deviation. A minimal sketch of this reward computation, together with the action bounding in item 2), is given after this list.
5) State-action value function: the goal of the agent in RL is to construct an optimal policy that maximizes the accumulated discounted rewards in the future, i.e., $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, where $\gamma \in [0, 1]$ is the discount factor [
$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \right]$ (15)
where $Q^{\pi}(s_t, a_t)$ is the state-action value; $\mathbb{E}[\cdot]$ is a function that returns the expected value; and $a_{t+1}$ is the action to be taken at the following time step.
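To make items 2) and 4) concrete, the following minimal Python sketch (the function names, the parameter dictionary, and the reuse of the hypothetical clip_battery_power helper from the BESS sketch in Section II are illustrative assumptions, not the paper's implementation) maps a raw hybrid action onto the feasible set and evaluates the reward of (14).

```python
def reward(operation_cost, k_in_next, k_min, k_max, beta):
    """Reward of (14): negative operation cost minus a weighted comfort penalty."""
    deviation = max(k_min - k_in_next, 0.0) + max(k_in_next - k_max, 0.0)
    return -(operation_cost + beta * deviation)

def feasible_action(raw_action, soc, params):
    """Map the agent's raw hybrid action to a feasible one (item 2))."""
    x_d, p_hvac_raw, p_batt_raw = raw_action
    p_hvac = min(max(p_hvac_raw, 0.0), params["p_hvac_max"])        # constraint (6)
    p_batt = clip_battery_power(p_batt_raw, soc, params["dt_h"],    # constraints (9), (10)
                                params["b_max"], params["p_c_max"], params["p_d_max"])
    return x_d, p_hvac, p_batt
```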
From the above analysis, this paper develops a DRL-based algorithm for one-step-ahead control of the home energy system based on currently available information. The underlying principle of using the currently available measurements of PV generation, outdoor temperature, and non-shiftable loads instead of their predictions is that these values are highly temporally correlated and their temporal evolution can be learned by the proposed MDRL algorithm. Moreover, the dependence of the indoor temperature variation on the controlled HVAC power, the outdoor temperature, and the building thermal properties is also learned from experience by trial-and-error in the learning stage. Hence, the proposed MDRL algorithm does not need a thermal dynamic model for the HVAC system or prediction of unknown variables.
Most existing DRL algorithms require the action space to be either discrete or continuous. For instance, DQL and its variants are applicable to discrete action spaces, while DDPG is widely used for continuous action spaces. To handle the discrete-continuous hybrid action space of the HEMS, an MDRL algorithm that integrates DQL and DDPG is developed.
Let $a^d \in A^d$ and $a^c \in A^c$ denote the discrete action and the continuous action, respectively, where $A^d$ and $A^c$ denote the discrete action space and the continuous action space, respectively. The discrete-continuous hybrid action is represented by $a = (a^d, a^c)$. Then the Bellman equation becomes:
$Q(s_t, a_t^d, a_t^c) = \mathbb{E}\left[ r_t + \gamma \max\limits_{a_{t+1}^d \in A^d} \sup\limits_{a_{t+1}^c \in A^c} Q\left(s_{t+1}, a_{t+1}^d, a_{t+1}^c\right) \right]$ (16)
where $a_t^d$ and $a_t^c$ are the discrete and continuous actions at time slot t, respectively; and $a_{t+1}^d$ and $a_{t+1}^c$ are the discrete and continuous actions to be taken at the following time slot, respectively.
If a deterministic policy $\mu$ that returns the optimal continuous action for each discrete action, i.e., $\mu(s, a^d) = \arg\sup_{a^c \in A^c} Q(s, a^d, a^c)$, exists, (16) can be re-written as:
$Q(s_t, a_t^d, a_t^c) = \mathbb{E}\left[ r_t + \gamma \max\limits_{a_{t+1}^d \in A^d} Q\left(s_{t+1}, a_{t+1}^d, \mu(s_{t+1}, a_{t+1}^d)\right) \right]$ (17)
It is noticeable that, since the discrete action on the right side of (17) can be enumerated, the remaining difficulty lies in the continuous action only, which can be efficiently handled by the actor-critic framework. Similar to DDPG, a deep critic network $Q(s, a^d, a^c | \theta^Q)$ is deployed to approximate the state-action value function, while a deterministic deep policy (actor) network $\mu(s, a^d | \theta^\mu)$ is used to generate the continuous action, where $\theta^Q$ and $\theta^\mu$ are the corresponding network parameters including weights and biases.
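A minimal PyTorch sketch of the two networks is given below; the layer sizes and names are illustrative assumptions, while the activation choices follow the settings reported later in the simulation section (ReLU hidden layers, tanh actor output, linear critic output).

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s, a_d | theta_mu): maps the state and a one-hot discrete action to a
    continuous action in [-1, 1], rescaled to the physical range elsewhere."""
    def __init__(self, state_dim, n_discrete, cont_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_discrete, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, cont_dim), nn.Tanh(),
        )
    def forward(self, state, a_d_onehot):
        return self.net(torch.cat([state, a_d_onehot], dim=-1))

class Critic(nn.Module):
    """Q(s, a_d, a_c | theta_Q): scores a state and a hybrid action."""
    def __init__(self, state_dim, n_discrete, cont_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_discrete + cont_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, state, a_d_onehot, a_c):
        return self.net(torch.cat([state, a_d_onehot, a_c], dim=-1))
```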
The networks of the MDRL algorithm are illustrated in Fig. 3.

Fig. 3 Illustration of networks of MDRL algorithm.
Similar to DDPG, the critic network parameter $\theta^Q$ is optimized by minimizing the squared loss in (18) with gradient descent methods [
$L(\theta^Q) = \mathbb{E}\left[ \left( y_t - Q(s_t, a_t^d, a_t^c | \theta^Q) \right)^2 \right]$ (18)
where $y_t$ is the target Q-value. To optimize the actor network parameter $\theta^\mu$, the basic idea is to adjust $\theta^\mu$ in the direction of the performance gradient that increases the Q-value. With the application of the chain rule, the performance gradient can be decomposed into the gradient of the state-action value function with respect to the continuous action and the gradient of the policy with respect to the policy parameters, which results in the following policy gradient for the update of the policy parameters considering the state distribution [
$\nabla_{\theta^\mu} J \approx \mathbb{E}\left[ \nabla_{a^c} Q(s, a^d, a^c | \theta^Q) \big|_{a^c = \mu(s, a^d | \theta^\mu)} \nabla_{\theta^\mu} \mu(s, a^d | \theta^\mu) \right]$ (19)
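The following PyTorch sketch (hypothetical names, continuing the network definitions above, with rewards stored as column tensors) performs one update of the critic by the squared loss (18) and of the actor by the policy gradient (19), with the target Q-value formed by maximizing over the enumerable discrete actions as in (16) and (17).

```python
import torch
import torch.nn.functional as F

def update_networks(batch, actor, critic, target_actor, target_critic,
                    actor_opt, critic_opt, n_discrete, gamma=0.995):
    """One gradient step for the critic loss (18) and the policy gradient (19).
    `batch` holds tensors (s, a_d_onehot, a_c, r, s_next), with r of shape (B, 1)."""
    s, a_d, a_c, r, s_next = batch

    # Target Q-value: enumerate the discrete actions, let the target actor supply
    # the continuous action for each choice, and take the maximum (cf. (16)-(17)).
    with torch.no_grad():
        q_next = []
        for k in range(n_discrete):
            a_d_next = F.one_hot(torch.full((s_next.shape[0],), k, dtype=torch.long),
                                 n_discrete).float()
            a_c_next = target_actor(s_next, a_d_next)
            q_next.append(target_critic(s_next, a_d_next, a_c_next))
        y = r + gamma * torch.cat(q_next, dim=1).max(dim=1, keepdim=True).values

    # Critic update: squared TD error of (18).
    critic_loss = F.mse_loss(critic(s, a_d, a_c), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend the Q-value with respect to the continuous action (19).
    actor_loss = -critic(s, a_d, actor(s, a_d)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```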
In DRL, the balance between exploration and exploitation is critical to train an efficient agent for decision-making. To facilitate the training of the deep networks over the discrete-continuous hybrid action space, a special exploration policy in (20), which integrates the $\epsilon$-greedy policy of DQL and the policy of adding Gaussian noise to the actor network outputs as in DDPG, is developed.
$a_t^d = \begin{cases} \text{a random action sampled from } A^d & \text{with probability } \epsilon \\ \arg\max\limits_{a^d \in A^d} Q\left(s_t, a^d, \mu(s_t, a^d | \theta^\mu) | \theta^Q\right) & \text{otherwise} \end{cases}, \quad a_t^c = \mu(s_t, a_t^d | \theta^\mu) + \mathcal{N}(0, \sigma^2)$ (20)
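A possible implementation of the exploration policy (20) is sketched below, again with hypothetical names and assuming the actor/critic interfaces defined earlier; the continuous action is kept in the tanh range and rescaled elsewhere.

```python
import torch

def select_action(state, actor, critic, n_discrete, epsilon, sigma):
    """Exploration of (20): epsilon-greedy on the discrete action and Gaussian
    noise on the continuous action produced by the actor network."""
    s = state.unsqueeze(0)                                    # shape (1, state_dim)
    with torch.no_grad():
        onehots = torch.eye(n_discrete)                       # all candidate discrete actions
        a_c_all = actor(s.repeat(n_discrete, 1), onehots)     # mu(s, k) for every k
        q_all = critic(s.repeat(n_discrete, 1), onehots, a_c_all).squeeze(1)
    if torch.rand(1).item() < epsilon:
        a_d = torch.randint(n_discrete, (1,)).item()          # random exploration
    else:
        a_d = int(q_all.argmax())                             # greedy: argmax_k Q(s, k, mu(s, k))
    a_c = a_c_all[a_d] + sigma * torch.randn_like(a_c_all[a_d])
    return a_d, a_c.clamp(-1.0, 1.0)                          # stays in the tanh range
```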
To handle the challenges caused by the temporal correlation of samples for network optimization in DRL, experience replay is considered [
To stabilize the learning process, target networks $\mu'(s, a^d | \theta^{\mu'})$ and $Q'(s, a^d, a^c | \theta^{Q'})$ are introduced for the actor network and the critic network, respectively, to evaluate the target Q-value [
$\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}, \quad \theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}$ (21)
where $\tau \ll 1$ ensures slow change of the target network parameters and consequently improves the stability of the learning process. Procedures for the training of networks are summarized in
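The soft update (21) can be written as a short helper; this sketch assumes the PyTorch networks introduced above, and the default value of the update rate is illustrative.

```python
import torch

def soft_update(target_net, net, tau=0.005):
    """Soft target-network update of (21): theta' <- tau * theta + (1 - tau) * theta'."""
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```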
The fundamental idea of safe-RL is to develop a prediction model for action evaluation, where safe actions are executed by the system while unsafe actions are modified to satisfy the safety constraints. In this paper, the indoor temperature is expected to stay in the comfort zone with well-controlled HVAC input power. Therefore, unsafe actions refer to those that will lead to violation of the constraints on indoor temperature. To ensure thermal comfort, an indoor temperature prediction model based on a multilayer perceptron (MLP) is developed for HVAC input power evaluation.
$\hat{K}_{\mathrm{in},t+1} = f_{\mathrm{MLP}}\left( K_{\mathrm{in},t}, K_{\mathrm{out},t+1}, P_{\mathrm{HVAC},t+1} \right) + e$ (22)
The model in (22) predicts the indoor temperature from its most influential factors, including the lagged indoor temperature, the outdoor temperature, and the HVAC input power. The term e captures the modeling error due to unconsidered weather conditions such as wind speed and humidity, as well as the uncertainty associated with the thermal dynamic process.
Since the leading outdoor temperature $K_{\mathrm{out},t+1}$ is usually unknown at time slot t, a probabilistic outdoor temperature prediction model based on Gaussian process regression [
$\left( \hat{\mu}_{\mathrm{out},t+1}, \hat{\sigma}_{\mathrm{out},t+1} \right) = f_{\mathrm{GPR}}\left( K_{\mathrm{out},t}, K_{\mathrm{out},t-1}, \ldots, \sin\dfrac{2\pi h}{24}, \cos\dfrac{2\pi h}{24} \right)$ (23)
The model in (23) predicts the mean value $\hat{\mu}_{\mathrm{out},t+1}$ and the standard deviation $\hat{\sigma}_{\mathrm{out},t+1}$ of the outdoor temperature from its lagged values and the temporal information h. The outdoor temperature exhibits a diurnal cycle, so sine and cosine functions are used to capture the temporal periodicity. Since the input features are contained in the state $s_t$, the outdoor temperature prediction model is simplified as:
$\left( \hat{\mu}_{\mathrm{out},t+1}, \hat{\sigma}_{\mathrm{out},t+1} \right) = f_{\mathrm{GPR}}(s_t)$ (24)
With (22) and (24), it is easy to construct the outdoor temperature prediction interval $[K_{\mathrm{out},t+1}^{\mathrm{low}}, K_{\mathrm{out},t+1}^{\mathrm{up}}]$ and the indoor temperature prediction interval $[\hat{K}_{\mathrm{in},t+1}^{\mathrm{low}}, \hat{K}_{\mathrm{in},t+1}^{\mathrm{up}}]$:
$K_{\mathrm{out},t+1}^{\mathrm{low}} = \hat{\mu}_{\mathrm{out},t+1} - \lambda \hat{\sigma}_{\mathrm{out},t+1}$ (25)
$K_{\mathrm{out},t+1}^{\mathrm{up}} = \hat{\mu}_{\mathrm{out},t+1} + \lambda \hat{\sigma}_{\mathrm{out},t+1}$ (26)
$\hat{K}_{\mathrm{in},t+1}^{\mathrm{low}} = f_{\mathrm{MLP}}\left( K_{\mathrm{in},t}, K_{\mathrm{out},t+1}^{\mathrm{low}}, P_{\mathrm{HVAC},t+1} \right)$ (27)
$\hat{K}_{\mathrm{in},t+1}^{\mathrm{up}} = f_{\mathrm{MLP}}\left( K_{\mathrm{in},t}, K_{\mathrm{out},t+1}^{\mathrm{up}}, P_{\mathrm{HVAC},t+1} \right)$ (28)
where $\lambda$ is a parameter which controls the confidence level that the actual outdoor temperature falls in the constructed interval.
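The interval construction (25)-(28) reduces to a few lines. The sketch below assumes the GPR mean and standard deviation of (24), a callable stand-in predict_indoor for the MLP of (22), and a monotone dependence of the indoor temperature on the outdoor temperature; all names are illustrative.

```python
def outdoor_interval(mu_out, sigma_out, lam):
    """Outdoor temperature prediction interval of (25)-(26)."""
    return mu_out - lam * sigma_out, mu_out + lam * sigma_out

def indoor_interval(k_in_t, k_out_low, k_out_up, p_hvac, predict_indoor):
    """Indoor temperature prediction interval of (27)-(28): propagate both ends of
    the outdoor interval through the indoor-temperature model (22)."""
    k_a = predict_indoor(k_in_t, k_out_low, p_hvac)
    k_b = predict_indoor(k_in_t, k_out_up, p_hvac)
    return min(k_a, k_b), max(k_a, k_b)
```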
The safety-checking function $f_{\mathrm{sc}}$ evaluates whether an HVAC action will keep the indoor temperature within the comfort zone based on the prediction intervals in (25)-(28). The parameter $\Delta P$ ($\Delta P > 0$ for a heating system and $\Delta P < 0$ for a cooling system) denotes the moving step of the HVAC input power, and the parameter $\delta$ compensates for modeling errors.
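A possible form of the safety-checking logic is sketched below for a heating system, reusing the hypothetical indoor_interval helper from the previous sketch; the step size, margin, and iteration limit are illustrative assumptions rather than the paper's settings.

```python
def safety_check(p_hvac, k_in_t, k_out_low, k_out_up, k_min, k_max,
                 predict_indoor, p_hvac_max, delta_p=0.2, delta=0.5, max_iter=20):
    """Adjust the HVAC action until the predicted indoor-temperature interval,
    widened by the margin delta, stays inside the comfort zone [k_min, k_max]."""
    for _ in range(max_iter):
        k_low, k_up = indoor_interval(k_in_t, k_out_low, k_out_up, p_hvac, predict_indoor)
        if k_low - delta >= k_min and k_up + delta <= k_max:
            return p_hvac                               # action is deemed safe
        if k_low - delta < k_min:                       # too cold: increase heating power
            p_hvac = min(p_hvac + delta_p, p_hvac_max)
        else:                                           # too warm: reduce heating power
            p_hvac = max(p_hvac - delta_p, 0.0)
    return p_hvac                                       # best effort after max_iter steps
```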
1) Home energy system: normalized PV generation and outdoor temperature obtained from National Renewable Energy Laboratory (NREL), USA [
(29)
where [
Outdoor temperature prediction model is trained on the data from December 2011 to February 2012. The MDRL algorithm and safe-MDRL algorithm are trained on the data from December 2012 to January 2013 and tested on data in February 2013. The parameters for the home energy system are listed in
The profiles of PV generation and outdoor temperature in February 2013 are illustrated in Fig. 4.

Fig. 4 Profiles of PV generation and outdoor temperature. (a) PV generation. (b) Outdoor temperature.
2) DRL algorithms: deep neural networks consisting of an input layer, hidden layers, and an output layer are considered. The rectified linear unit (ReLU) activation function is used for the hidden layers of both the actor network and the critic network, while the hyperbolic tangent and linear activation functions are used for the output layers of the actor network and the critic network, respectively. The Adam optimizer [
To facilitate the training of the deep neural networks, the states are normalized. The outputs of the actor network lie in $[-1, 1]$ due to the hyperbolic tangent activation and are mapped into the range of the continuous action space. For the exploration policy in (20), the parameters $\epsilon$ and $\sigma$ decay with the training episode. The indoor temperature prediction model is represented by an MLP with one hidden layer of three neurons. The hyperbolic tangent and linear activation functions are used for the hidden layer and the output layer, respectively. Parameters associated with safe-MDRL in
This paper considers the following benchmark algorithms to illustrate the effectiveness of the proposed MDRL/safe-MDRL algorithm for home energy management with discrete-continuous hybrid action space.
1) B1: the "on/off" operation modes are considered by this benchmark algorithm. With this benchmark algorithm, the shiftable and non-interruptible load is switched "on" at its initial working time and remains "on" until the completion of the task. The HVAC system is turned "on" with the maximum power if the indoor temperature falls below the comfort zone (i.e., $K_{\mathrm{in},t} < K_{\min}$) and turned "off" if it rises above the comfort zone (i.e., $K_{\mathrm{in},t} > K_{\max}$); otherwise, it maintains its operation mode. This benchmark algorithm does not consider the BESS.
2) B2: an algorithm based on MILP is developed for the scheduling of the home energy system, supposing that all the information including PV generation, outdoor temperature, non-shiftable loads, and home thermal dynamics is known. This is an ideal case that sets the lower bound of the operation cost while maintaining thermal comfort.
3) DDPG algorithm: the classical DDPG algorithm is applied to home energy system control, where its continuous outputs are discretized to derive the decisions for shiftable and non-interruptible loads. The studies in [
The objective of the simulation study is twofold: ① through the comparison between the proposed MDRL algorithm and its safe version, to illustrate the effectiveness of the safe-MDRL algorithm in reducing the loss of human thermal comfort in the learning stage; and ② through the comparison among all the applied algorithms, to illustrate the merits of the proposed MDRL algorithm and its safe version in home energy management in terms of operation cost and satisfaction of human comfort on the test dataset. To verify their robustness, the DDPG algorithm, the MDRL algorithm, and the safe-MDRL algorithm are each executed for 5 independent runs.
1) To illustrate the effectiveness of the safe-MDRL algorithm in reducing the loss of human thermal comfort and thereby improving rewards, the average episode rewards over 5 runs obtained by the proposed MDRL algorithm and the safe-MDRL algorithm during the training process are depicted in Fig. 5.

Fig. 5 Average episode rewards during training process.
To further illustrate the effectiveness of the safe-MDRL algorithm in maintaining thermal comfort and thereby improving rewards, the average episode operation cost (including electricity cost and battery degradation cost) and the temperature deviation from the comfort zone for the first 2500, 5000, 7500, and 10000 episodes over 5 runs are reported in
It can be observed that both the MDRL algorithm and the safe-MDRL algorithm improve decision quality in terms of operation cost and thermal comfort with an increasing number of training episodes. The difference in operation cost between the MDRL algorithm and its safe version is minor. The safe-MDRL algorithm reduces the temperature deviation from the comfort zone by almost 80% compared with the proposed MDRL algorithm and thereby greatly improves the rewards.
2) The statistics (mean value and standard deviation) over 5 runs on average daily operation cost and temperature deviation from comfort zone by the proposed algorithms and benchmark algorithms on the test dataset are presented in
From
Temperature deviations from the comfort zone are observed with the DDPG algorithm, the MDRL algorithm, and the safe-MDRL algorithm. This is because the indoor temperature dynamic model in (29) considers the impact of the uncertainty of outdoor temperature on indoor temperature. At the end of time slot t, when the decision on $P_{\mathrm{HVAC},t+1}$ is issued, the outdoor temperature $K_{\mathrm{out},t+1}$ is actually unknown. The proposed MDRL/safe-MDRL algorithm learns to handle this challenge; however, it cannot be fully addressed in extreme cases where large variations of outdoor temperature occur.

Fig. 6 Illustration of simulation results. (a) Indoor temperature. (b) SoC of BESS. (c) HVAC input power. (d) Grid power.
From
In this paper, a novel DRL-based algorithm is developed for home energy management under a TOU pricing program. The operation modes of various household appliances constitute a discrete-continuous hybrid action space, which challenges existing RL frameworks designed for either a discrete action space or a continuous action space. The proposed MDRL algorithm integrates DQL and DDPG, where the DQL deals with the discrete action space and the DDPG handles the continuous action space. To reduce the loss of human thermal comfort during the learning stage with the MDRL algorithm, a safe version (safe-MDRL algorithm), which deploys a prediction model to guide the exploration of the MDRL algorithm, is further developed.
To verify the effectiveness of the MDRL algorithm in cost saving for home energy management and of the safe-MDRL algorithm in reducing the loss of human thermal comfort in the learning stage, simulation studies based on real data are conducted. The results illustrate that the MDRL algorithm can efficiently handle the challenges that the discrete-continuous hybrid action space poses to existing RL frameworks. Meanwhile, compared with benchmark algorithms including the classical DDPG on the test dataset, the MDRL algorithm reduces the operation cost while maintaining human thermal comfort. Simulation results also illustrate that the safe-MDRL algorithm can greatly reduce the loss of human thermal comfort in the learning stage.
References
F. Zeng, Z. Bie, S. Liu et al., “Trading model combining electricity, heating, and cooling under multi-energy demand response,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 1, pp. 133-141, Jan. 2020.
National Energy Administration. (2020, Jan.). The electricity consumption by the whole society. [Online]. Available: http://www.nea.gov.cn/2020-01/20/c_138720877.htm
S. Xu, X. Chen, J. Xie et al., “Agent-based modeling and simulation for the electricity market with residential demand response,” CSEE Journal of Power and Energy Systems, vol. 7, no. 2, pp. 368-380, Mar. 2021.
F. Luo, W. Kong, G. Ranzi et al., “Optimal home energy management system with demand charge tariff and appliance operational dependencies,” IEEE Transactions on Smart Grid, vol. 11, no. 1, pp. 4-14, Jan. 2020.
X. Wang, Y. Liu, J. Zhao et al., “A hybrid agent-based model predictive control scheme for smart community energy system with uncertain DGs and loads,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 3, pp. 573-584, May 2021.
S. Althaher, P. Mancarella, and J. Mutale, “Automated demand response from home energy management system under dynamic pricing and power and comfort constraints,” IEEE Transactions on Smart Grid, vol. 6, no. 4, pp. 1874-1883, Jul. 2015.
T. Yoshihisa, N. Fujita, and M. Tsukamoto, “A rule generation method for electrical appliances management systems with home EoD,” in Proceedings of the 1st IEEE Global Conference on Consumer Electronics 2012, Tokyo, Japan, Oct. 2012, pp. 248-250.
A. Keshtkar, S. Arzanpour, and F. Keshtkar, “Adaptive residential demand-side management using rule-based techniques in smart grid environments,” Energy and Buildings, vol. 133, pp. 281-294, Dec. 2016.
M. J. Sanjari, H. Karami, and H. B. Gooi, “Analytical rule-based approach to online optimal control of smart residential energy system,” IEEE Transactions on Industrial Informatics, vol. 13, no. 4, pp. 1586-1597, Aug. 2017.
Y. Huang, L. Wang, W. Guo et al., “Chance constrained optimization in a home energy management system,” IEEE Transactions on Smart Grid, vol. 9, no. 1, pp. 252-260, Jan. 2018.
T. Molla, B. Khan, B. Moges et al., “Integrated optimization of smart home appliances with cost-effective energy management system,” CSEE Journal of Power and Energy Systems, vol. 5, no. 2, pp. 249-258, Jun. 2019.
N. G. Paterakis, O. Erdinc, A. G. Bakirtzis et al., “Optimal household appliances scheduling under day-ahead pricing and load-shaping demand response strategies,” IEEE Transactions on Industrial Informatics, vol. 11, no. 6, pp. 1509-1519, Dec. 2015.
M. Shafie-Khah and P. Siano, “A stochastic home energy management system considering satisfaction cost and response fatigue,” IEEE Transactions on Industrial Informatics, vol. 14, no. 2, pp. 629-638, Feb. 2018.
M. Yousefi, A. Hajizadeh, M. N. Soltani et al., “Predictive home energy management system with photovoltaic array, heat pump, and plug-in electric vehicle,” IEEE Transactions on Industrial Informatics, vol. 17, no. 1, pp. 430-440, Jan. 2021.
A. Mondal, S. Misra, and M. S. Obaidat, “Distributed home energy management system with storage in smart grid using game theory,” IEEE Systems Journal, vol. 11, no. 3, pp. 1857-1866, Sept. 2017.
Q. Wei, D. Liu, and G. Shi, “A novel dual iterative Q-learning method for optimal battery management in smart residential environments,” IEEE Transactions on Industrial Electronics, vol. 62, no. 4, pp. 2509-2518, Apr. 2015.
M. N. Faqiry, L. Wang, and H. Wu, “HEMS-enabled transactive flexibility in real-time operation of three-phase unbalanced distribution systems,” Journal of Modern Power Systems and Clean Energy, vol. 7, no. 6, pp. 1434-1449, Nov. 2019.
R. Lu, S. Hong, and M. Yu, “Demand response for home energy management using reinforcement learning and artificial neural network,” IEEE Transactions on Smart Grid, vol. 10, no. 6, pp. 6629-6639, Nov. 2019.
S. Bahraini, V. Wong, and J. Huang, “An online learning algorithm for demand response in smart grid,” IEEE Transactions on Smart Grid, vol. 9, no. 5, pp. 4712-4725, Sept. 2018.
Q. Wei, Z. Liao, and G. Shi, “Generalized actor-critic learning optimal control in smart home energy management,” IEEE Transactions on Industrial Informatics, vol. 17, no. 10, pp. 6614-6623, Oct. 2021.
L. Yu, W. Xie, D. Xie et al., “Deep reinforcement learning for smart home energy management,” IEEE Internet of Things Journal, vol. 7, no. 4, pp. 2751-2762, Apr. 2020.
C. Qiu, Y. Hu, Y. Chen et al., “Deep deterministic policy gradient (DDPG)-based energy harvesting wireless communications,” IEEE Internet of Things Journal, vol. 6, no. 5, pp. 8577-8588, Oct. 2019.
D. Cao, W. Hu, J. Zhao et al., “Reinforcement learning and its applications in modern power and energy systems: a review,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1029-1042, Nov. 2020.
E. Mocanu, D. Mocanu, P. Nguyen et al., “On-line building energy optimization using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 10, no. 4, pp. 3698-3708, Jul. 2019.
Y. Ye, D. Qiu, X. Wu et al., “Model-free real-time autonomous control for a residential multi-energy system using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 3068-3082, Jul. 2020.
M. Sun, I. Konstantelos, and G. Strbac, “A deep learning-based feature extraction framework for system security assessment,” IEEE Transactions on Smart Grid, vol. 10, no. 5, pp. 5007-5020, Sept. 2019.
H. Zhao, J. Zhao, J. Qiu et al., “Cooperative wind farm control with deep reinforcement learning and knowledge-assisted learning,” IEEE Transactions on Industrial Informatics, vol. 16, no. 11, pp. 6912-6921, Nov. 2020.
J. Garcia and F. Fernandez, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, pp. 1437-1480, Aug. 2015.
M. Wen and T. Ufuk, “Constrained cross-entropy method for safe reinforcement learning,” IEEE Transactions on Automatic Control, vol. 66, no. 7, pp. 3123-3137, Jul. 2021.
L. Yu, Y. Sun, Z. Xu et al., “Multi-agent deep reinforcement learning for HVAC control in commercial buildings,” IEEE Transactions on Smart Grid, vol. 12, no. 1, pp. 407-419, Jan. 2021.
Y. Gao, W. Wang, J. Shi et al., “Batch-constrained reinforcement learning for dynamic distribution network reconfiguration,” IEEE Transactions on Smart Grid, vol. 11, no. 6, pp. 5357-5369, Nov. 2020.
X. Xu, Y. Jia, Y. Xu et al., “A multi-agent reinforcement learning-based data-driven method for home energy management,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 3201-3211, Jul. 2020.
D. Zhang, S. Li, M. Sun et al., “An optimal and learning-based demand response and home energy management system,” IEEE Transactions on Smart Grid, vol. 7, no. 4, pp. 1790-1801, Jul. 2016.
H. Li, A. T. Eseye, J. Zhang et al., “Optimal energy management for industrial microgrids with high-penetration renewables,” Protection and Control of Modern Power Systems, vol. 2, no. 1, p. 12, Apr. 2017.
K. Arulkumaran, M. P. Deisenroth, M. Brundage et al. (2017, Aug.). A brief survey of deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1708.05866v2
V. Mnih, K. Kavukcuoglu, D. Silver et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015.
T. Lillicrap, J. Hunt, A. Pritzel et al. (2015, Sept.). Continuous control with deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1509.02971
D. Silver, G. Lever, N. Heess et al., “Deterministic policy gradient algorithms,” in Proceedings of the 31st International Conference on Machine Learning, Beijing, China, Jun. 2014, pp. 387-395.
C. K. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning. Cambridge: MIT Press, 2006.
National Renewable Energy Laboratory. (2021, Mar.). PVDAQ. [Online]. Available: http://maps.nrel.gov/pvdaq
E. Wilson. (2014, Nov.). Commercial and residential hourly load profiles for all TMY3 locations in the United States. [Online]. Available: https://data.openei.org/submissions/153
N. Lu, “An evaluation of the HVAC load potential for providing load balancing service,” IEEE Transactions on Smart Grid, vol. 3, no. 3, pp. 1263-1270, Sept. 2012.
Y. Hong, J. Lin, C. Wu et al., “Multi-objective air-conditioning control considering fuzzy parameters using immune clonal selection programming,” IEEE Transactions on Smart Grid, vol. 3, no. 4, pp. 1603-1610, Dec. 2012.
D. P. Kingma and J. Ba. (2014, Dec.). Adam: a method for stochastic optimization. [Online]. Available: https://arxiv.org/abs/1412.6980